Skip to main content

Adaptive Jailbreak Attacks

Overview

The Adaptive Jailbreak attack is a key component of DynamoEval's comprehensive model security evaluation suite. It complements other compliance tests such as Static Jailbreaking, Multilingual Jailbreaking, and System Prompt Extraction evaluations, all accessible through DynamoEval's platform.

The Adaptive Jailbreak attack is inspired by cutting-edge Machine Learning research in AI red-teaming and jailbreaking. The primary goal of this attack is to prompt the model to respond positively to a harmful or unsafe input. This is achieved through an iterative process of prompt refinement, designed to either disguise malicious intent or bypass existing safety measures. By using this advanced technique, the Adaptive Jailbreak attack provides a thorough assessment of a model's resilience against sophisticated attempts to circumvent its ethical safeguards. This evaluation method offers valuable insights into potential vulnerabilities, contributing to the development of more secure and responsible AI systems.

As of now, two state-of-the-art attack strategies are implemented in DynamoEval:

  • Tree of Attacks with Pruning (TAP)
  • Iterative Refinement Induced Self-Jailbreak (IRIS)

Tree of Attacks with Pruning (TAP)

The TAP attack utilizes a tree-shaped optimization loop, which is an approach used to systematically search for and refine adversarial prompts that can potentially bypass safety filters of a target LLM. In this context, a "tree-shaped" optimization loop refers to a method where different branches of potential prompts are explored in a structured, hierarchical manner. Each "branch" represents a variation of the prompt, and the process iterates by branching out further from prompts that show potential in challenging the model's safety measures. This method helps in efficiently narrowing down to the most effective adversarial prompts by exploring a wide variety of possibilities and honing in on the variations that are most successful in obfuscating the intended behavior or bypassing the safety filters.

The attack follows three step process, mimicking the layers of a tree, as previously described:

  • Attack: The attacker LLM takes a set of prompts to generate a set of improved, augmented prompts.
  • Evaluate: The evaluator LLM determines the compliance status and topic relevance of the Target LLM’s response. If any of the scores are 10, we keep the response and leave the loop.
  • Pruning: We remove any off-topic prompt generations and we keep the top 5 highest performing prompts and run the loop again until either the maximum iterations are met or a prompt receives a score of 10.

Ultimately, the attack concludes either if it finds a jailbreak (at which point it can stop the optimization loop early) or until the maximum depth of the tree is met (max depth is set to 5 in our case). DynamoEval uses an optimized setting of this attack based on empirical experiments to maximize attack success rate while saving compute resources and minimizing run-time.

TAP

Iterative Refinement Induced Self-Jailbreak (IRIS)

The IRIS attack employs a sophisticated approach to bypass an LLM's safety measures by iteratively refining adversarial prompts and maximizing their harmful impact. It leverages a two-step process:

  • From an initial harmful prompt, iteratively refine it using self-explanations. This loop solicits an explanation from the attacker LLM on why the attempt failed to jailbreak the target LLM. The iterative prompt refinement process continues until the target LLM doesnʼt refuse to answer, or until the maximum number of trials is exceeded (6 in our case).
  • From the refined prompt and associated completion, ask the target LLM to rate and enhance it for maximizing harmfulness. This method aims to efficiently generate prompts that can induce the model to produce content fulfilling initially rejected adversarial requests, potentially exposing vulnerabilities in the LLM's safety protocols across different linguistic and cultural contexts.
IRIS

Metrics

Attack success rate (ASR) refers to the percentage of attempted jailbreaks that are successfully executed by an attacker. The higher the ASR, the more successful the attack and the more vulnerable the model is to adaptive jailbreaking attacks.

References

[1] TAP paper: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically - https://arxiv.org/abs/2312.02119
[2] IRIS paper: Iterative Refinement Induced Self-Jailbreak - https://arxiv.org/abs/2405.13077