An Analysis of Unsafe Responses with Magic Expressions across Large Language Models

Research output: Contribution to journalConference articlepeer-review

Abstract

With advancements in AI models' ability to understand input, the risk of generating harmful outputs in response to malicious prompts has grown. This has led to increased awareness among researchers regarding AI safety, resulting in diverse ongoing investigations. However, existing studies have primarily relied on coarse-grained datasets to define risk factors, with limited availability of high-quality datasets that address fine-grained risks. To address this gap, we developed a high-quality, Korean AI safety evaluation dataset with overlooked risk factors in current LLMs. This dataset was curated with the involvement of human annotators to ensure its quality and relevance. We found that adding simple modifications, such as magic expressions, increases the likelihood of bypassing model guardrails. The analysis further revealed that this effect varies across different categories of risk factors. Additionally, we evaluated the harmfulness of LLM-generated outputs by measuring the frequency of risk-related keywords in the responses. This approach used prompt-based evaluation methods to quantify the degree of risk, providing a structured framework for assessing the potential dangers of model outputs.

Original languageEnglish
Pages (from-to)406-409
Number of pages4
JournalProceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP
Issue number2025
DOIs
StatePublished - 2025
Event2025 IEEE International Conference on Big Data and Smart Computing, BigComp 2025 - Kota Kinabalu, Malaysia
Duration: 9 Feb 202512 Feb 2025

Keywords

  • AI Safety
  • Clustering
  • Keyword Extraction
  • Large Language Model
  • LLM Jailbreaking

Fingerprint

Dive into the research topics of 'An Analysis of Unsafe Responses with Magic Expressions across Large Language Models'. Together they form a unique fingerprint.

Cite this