Understanding jailbreaking and prompt-based injections

Understanding jailbreaking and prompt-based injections

Through the world's Llama Guards and Guardrails AI, one problem persists in modern LLMs: the threat of being jailbroken. Jailbreaking in LLMs has become a growing concern as these models continue to be adapted as AI Assistants and Agents in every single pipeline imaginable. AI adoption has slowly become the norm. Still, such threats are crucial to mitigate as they pose significant security risks among pipelines that handle the most sensitive of our data.

This blog explores the mechanisms and case studies surrounding Jailbreaking and Prompt Injection, so let us first try to understand what jailbreaking is in LLMs.

What is jailbreaking in LLMs?

Jailbreaking in LLMs refers to instances where a user knowingly or unknowingly manipulates these models to circumvent the developers' ethical instructions. When models holding knowledge of hundreds of billions of parameters like GPT4o, Llama3.1, or Claude Sonet are “jailbroken,” they can generate content otherwise restricted, such as harmful advice or divulging sensitive information.

This process often involves the attacker cleverly constructing inputs that guide the model into producing unauthorized outputs, frequently using language or instructions that exploit its natural language processing capabilities. As the sophistication of LLMs increases, the possibility of them being jailbroken poses a severe risk to applications that rely on their safety mechanisms to ensure compliance with ethical standards.

Here are some key elements of jailbreaking as discussed in “Smooth LLM: Defending Large Language Models Against Jailbreaking Attacks”:

  1. Direct prompts: Due to ChatGPT's safety measures, which were induced through RLHF, initial attempts using straightforward prompts (e.g., asking for personal details) are often unsuccessful. ChatGPT is fine-tuned to avoid directly revealing private information or engaging in unethical behavior through such queries.
  2. Jailbreaking prompts: Jailbreaking involves role-playing prompts, like "Do Anything Now" (which we shall discuss later), which change the system’s "persona" and convince it to respond without restrictions. These prompts exploit the model’s interpretative flexibility to trick it into bypassing safeguards.
  3. Multi-step jailbreaking: Building on the chain-of-thought prompting technique, it systematically dismantles the model’s ethical barriers by breaking down tasks into manageable steps and slowly coaxing it into providing sensitive information. By simulating conversations with multiple turns and gradual moral erosion, this approach can lead to unintended model outputs.
  4. Response verification: The paper also touches on validating extracted data, employing methods like sampling-based decoding or generating multiple answers and comparing them for accuracy.

Now, Jailbreaking and Prompt Injection are two sides of the coin. Both aim to manipulate the model to produce responses that otherwise may not have flown. Let us take a look at Prompt Injection.

Understanding prompt-based injections

Prompt injection attacks exploit the interaction between user inputs and the language models embedded within applications. Attackers craft inputs that manipulate the model’s prompt interpreter, resulting in unintended, sometimes harmful behaviors. These attacks resemble traditional injection exploits (e.g., SQL injections), where untrusted inputs are processed without adequate sanitization.

In prompt injections, the attacker hijacks the model by embedding harmful instructions directly into prompts. These can instruct the model to reveal private data, perform tasks beyond its original design, or circumvent ethical boundaries. The paper "Prompt Injection Attacks Against LLM-Integrated Applications” presents real-world examples, highlighting how this method poses risks even to models with robust safety protocols.

Prompt-based

  1. Direct prompt injection: In this scenario, an attacker explicitly includes harmful instructions in the model's input. For instance, jailbreaking attempts where users input commands designed to bypass restrictions. This requires proactive exploitation by the attacker.
  2. Indirect prompt injection: This method incorporates malicious instructions through mechanisms like retrieval systems, where harmful prompts are retrieved from external sources. While this type still necessitates explicit input, it allows for more stealthy attacks than direct methods.

A significant challenge arises from LLMs' open-ended nature. They attempt to complete prompts by predicting the following words, making them particularly vulnerable to cleverly worded or concealed malicious instructions. For instance, an attacker could start with a seemingly innocuous input, followed by embedded code or phrasing that causes the LLM to generate unauthorized outputs.

Case studies

ChatGPT and DAN ("Do Anything Now") jailbreak

The DAN jailbreak became famous for its ability to bypass ChatGPT's safety measures. By using a carefully crafted prompt, users could get ChatGPT to act as "DAN" (Do Anything Now), a persona that disregarded all ethical and safety constraints imposed by OpenAI. This jailbreak relied on adversarial prompting, instructing ChatGPT to "pretend" it was something unrestricted, and this behavior tricked the model into bypassing its content moderation system. Users could elicit harmful or unethical responses, including generating explicit content, advice on illegal activities, or disallowed opinions.

In technical terms, the DAN jailbreak operates much like a prompt injection, focusing on exploiting the model's internal logic that responds to role-playing scenarios. The model's instruction-following nature was its weakness, as DAN prompts effectively created a dual instruction system—one within the constraints and one, DAN, free from them. The DAN jailbreak highlights a key challenge in language models: their tendency to follow complex instructions, even when contradicting safety features faithfully.​

GPT-4 jailbreaking and red-teaming

GPT-4, released with more robust safeguards, was subjected to extensive red-teaming to test its resilience against jailbreaks. Red-teaming involves security researchers deliberately attempting to exploit the model to identify vulnerabilities. Despite improvements, GPT-4 was still vulnerable to adversarial attacks and complex jailbreaks, where users could manipulate it into generating unsafe outputs, bypassing ethical boundaries set during development. For instance, prompts containing subtle adversarial phrases could trick GPT-4 into disclosing dangerous information, such as ways to exploit cybersecurity vulnerabilities or generate misinformation.

Red-teaming tests also revealed that while GPT-4's safety features were more sophisticated than those of earlier models, its complexity left it susceptible to creative and evolving jailbreak methods. Attackers found it could be led astray by exploiting prompt layering, which gives the model a series of nested tasks. This showcases the importance of continuous updates in safety systems and the need for models to handle layered or chained instructions better.

Anthropic’s Claude AI jailbreak

Anthropic’s Claude AI was designed with explicit safety features, using a principle-based approach for ethical alignment. However, despite these efforts, Claude AI was not immune to jailbreaking attempts (which the company revealed in a later paper titled: “Many-shot Jailbreaking”). Users found ways to trick the model using adversarial language patterns, often disguised as innocuous requests. In a notable jailbreak, adversaries manipulated Claude into generating offensive and dangerous outputs by asking it to "discuss hypothetical scenarios," effectively bypassing the ethical constraints under the guise of role-playing or purely theoretical discussion.

Claude's vulnerability was similar to that of other models: its reliance on recognizing intent from the prompt. By masking malicious intent in complex hypothetical scenarios, attackers could guide Claude into generating outputs that violated its ethical guidelines. This attack demonstrated that sophisticated adversarial prompts could still find loopholes even with solid safety architectures, like those used by Anthropic.

Defending against jailbreaking and prompt injection

Defending against jailbreaking and prompt injection in large language models (LLMs) requires layered protection strategies. One approach is the Gatekeeper Layer, which filters and rewrites suspicious prompts before they reach the model. This prevents jailbreaks like the "Do Anything Now" (DAN) attacks, where manipulated prompts bypass restrictions. Real-time threat detection helps mitigate prompt injections by evaluating inputs and ensuring models adhere to ethical guidelines. By intercepting prompts that attempt to override system instructions, the model stays within its designed boundaries.

In addition, specialized evaluation models monitor and analyze incoming inputs to prevent attacks. These models neutralize harmful prompts and correct hallucinations or bias, making them particularly effective in tools like GPT-4, where red-teaming exercises uncover vulnerabilities. By simulating potential attacks, these defenses help build more robust safeguards. They also incorporate feedback loops that refine the model’s resilience against new jailbreaking methods, improving defense over time.

Task-specific fine-tuning and self-consistency checks reduce the risks of prompt injection and jailbreaking. These techniques ensure outputs align with ethical standards, even under ambiguous or risky conditions. As models become more explainable and adaptive, LLMs are better equipped to detect and resist sophisticated attacks, including those that rely on poisoned datasets or subtle prompt manipulations. These advancements make the models less prone to exploitation through complex jailbreaking attempts.

Conclusion

In conclusion, defending against prompt injection and jailbreaking attacks on large language models (LLMs) requires a multi-layered approach. Techniques like the Gatekeeper Layer, which filters and rewrites suspicious prompts, and specialized evaluation models that monitor input for vulnerabilities have significantly improved the ability to prevent these attacks. These safeguards ensure that LLMs stay within ethical boundaries and reduce risks associated with manipulative prompts.

Further Reading

Backdooring Instruction-Tuned Large Language Models

with Virtual Prompt Injection

Many Shot Jailbreaking

OpenAI Red Teaming

Maxim is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with Maxim.