Abstract
With advancements in AI models' ability to understand input, the risk of generating harmful outputs in response to malicious prompts has grown. This has led to increased awareness among researchers regarding AI safety, resulting in diverse ongoing investigations. However, existing studies have primarily relied on coarse-grained datasets to define risk factors, with limited availability of high-quality datasets that address fine-grained risks. To address this gap, we developed a high-quality, Korean AI safety evaluation dataset with overlooked risk factors in current LLMs. This dataset was curated with the involvement of human annotators to ensure its quality and relevance. We found that adding simple modifications, such as magic expressions, increases the likelihood of bypassing model guardrails. The analysis further revealed that this effect varies across different categories of risk factors. Additionally, we evaluated the harmfulness of LLM-generated outputs by measuring the frequency of risk-related keywords in the responses. This approach used prompt-based evaluation methods to quantify the degree of risk, providing a structured framework for assessing the potential dangers of model outputs.
| Original language | English |
|---|---|
| Pages (from-to) | 406-409 |
| Number of pages | 4 |
| Journal | Proceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP |
| Issue number | 2025 |
| DOIs | |
| State | Published - 2025 |
| Event | 2025 IEEE International Conference on Big Data and Smart Computing, BigComp 2025 - Kota Kinabalu, Malaysia Duration: 9 Feb 2025 → 12 Feb 2025 |
Keywords
- AI Safety
- Clustering
- Keyword Extraction
- Large Language Model
- LLM Jailbreaking