Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan, Dmitry Namiot
20m
Large Language Models (LLMs) are increasingly used as automated judges for evaluating text quality, code correctness, and argument strength. However, these LLM-as-a-judge systems are vulnerable to adversarial attacks that can manipulate their assessments. This paper investigates the vulnerability of LLM-as-a-judge systems to prompt injection attacks, drawing insights from both the academic literature and practical solutions from the "LLMs: You Can't Please Them All" Kaggle competition. We present a comprehensive framework for developing and evaluating adversarial attacks against LLM judges, distinguishing between content-author attacks and system-prompt attacks. Our experimental evaluation spans five models (Gemma-3-27B-Instruct, Gemma-3-4B-Instruct, Llama-3.2-3B-Instruct, and the frontier models GPT-4 and Claude-3-Opus), four distinct evaluation tasks, and multiple defense mechanisms with precisely specified implementations. Through rigorous statistical analysis (n=50 prompts per condition, bootstrap confidence intervals), we demonstrate that sophisticated attacks can achieve success rates of up to 73.8\% against popular LLM judges, with Contextual Misdirection being the most effective method against Gemma models at 67.7\%. We find that smaller models such as Gemma-3-4B-Instruct are more vulnerable (65.9\% average success rate) than their larger counterparts, and that attacks show high transferability (50.5--62.6\%) across different architectures. We compare our approach with recent work, including Universal-Prompt-Injection \cite{liu2024automatic} and AdvPrompter \cite{paulus2024advprompter}, demonstrating both complementary insights and novel contributions. Our findings highlight critical vulnerabilities in current LLM-as-a-judge systems and provide recommendations for building more robust evaluation frameworks, including the use of multi-model committees with diverse architectures and a preference for comparative assessment over absolute scoring. To ensure reproducibility, we release our code, evaluation harness, and processed datasets.
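To illustrate the statistical protocol named in the abstract (per-prompt outcomes with bootstrap confidence intervals), here is a minimal sketch of how a success rate and a 95% bootstrap CI could be computed. It is not the authors' released evaluation harness; the `outcomes` data and the `bootstrap_ci` helper are hypothetical, and the example numbers are chosen only to mirror the n=50 prompts-per-condition setup.

```python
import random

# Sketch only: estimate an attack success rate and a 95% bootstrap CI
# from binary per-prompt outcomes (1 = judge manipulated, 0 = not).
# The data below are illustrative, not results from the paper.

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample outcomes with replacement and record the mean.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, (lo, hi)

# Hypothetical condition: 37 successful manipulations out of 50 prompts.
outcomes = [1] * 37 + [0] * 13
rate, (lo, hi) = bootstrap_ci(outcomes)
print(f"success rate = {rate:.1%}, 95% CI = [{lo:.1%}, {hi:.1%}]")
```

The same loop can be run per attack type and per judge model to produce the kind of per-condition rates and intervals reported in the abstract.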