Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan, Dmitry Namiot
20m
Large Language Models (LLMs) are increasingly used as automated judges for evaluating text quality, code correctness, and argument strength. However, these LLM-as-a-judge systems are vulnerable to adversarial attacks that can manipulate their assessments. This paper investigates the vulnerability of LLM-as-a-judge systems to prompt injection attacks, drawing insights from both the academic literature and practical solutions from the "LLMs: You Can't Please Them All" Kaggle competition. We present a comprehensive framework for developing and evaluating adversarial attacks against LLM judges, distinguishing between content-author attacks and system-prompt attacks. Our experimental evaluation spans five models (Gemma-3-27B-Instruct, Gemma-3-4B-Instruct, Llama-3.2-3B-Instruct, and the frontier models GPT-4 and Claude-3-Opus), four distinct evaluation tasks, and multiple defense mechanisms with precisely specified implementations. Through rigorous statistical analysis (n=50 prompts per condition, bootstrap confidence intervals), we demonstrate that sophisticated attacks can achieve success rates of up to 73.8\% against popular LLM judges, with Contextual Misdirection being the most effective method against Gemma models at 67.7\%. We find that smaller models such as Gemma-3-4B-Instruct are more vulnerable (65.9\% average success rate) than their larger counterparts, and that attacks show high transferability (50.5--62.6\%) across different architectures. We compare our approach with recent work, including Universal-Prompt-Injection \cite{liu2024automatic} and AdvPrompter \cite{paulus2024advprompter}, demonstrating both complementary insights and novel contributions. Our findings highlight critical vulnerabilities in current LLM-as-a-judge systems and provide recommendations for building more robust evaluation frameworks, including the use of multi-model committees with diverse architectures and a preference for comparative assessment over absolute scoring. To ensure reproducibility, we release our code, evaluation harness, and processed datasets.
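To illustrate the statistical protocol named in the abstract (per-prompt outcomes with bootstrap confidence intervals), here is a minimal sketch of how a success rate and a 95% bootstrap CI could be computed. It is not the authors' released evaluation harness; the `outcomes` data and the `bootstrap_ci` helper are hypothetical, and the example numbers are chosen only to mirror the n=50 prompts-per-condition setup.

```python
import random

# Sketch only: estimate an attack success rate and a 95% bootstrap CI
# from binary per-prompt outcomes (1 = judge manipulated, 0 = not).
# The data below are illustrative, not results from the paper.

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample outcomes with replacement and record the mean.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, (lo, hi)

# Hypothetical condition: 37 successful manipulations out of 50 prompts.
outcomes = [1] * 37 + [0] * 13
rate, (lo, hi) = bootstrap_ci(outcomes)
print(f"success rate = {rate:.1%}, 95% CI = [{lo:.1%}, {hi:.1%}]")
```

The same loop can be run per attack type and per judge model to produce the kind of per-condition rates and intervals reported in the abstract.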