
[SDK] Jailbreaking Quickstart

Run a Jailbreaking Test with DynamoEval SDK

Last updated: October 6th, 2024


This Quickstart provides an end-to-end walkthrough of how to use Dynamo AI’s SDK and platform to run jailbreaking attacks against a GPT-3.5 model. The goal is to assess whether an adversary can coerce the LLM into answering malicious and harmful requests.

Prerequisites:

  • Dynamo AI API token
  • OpenAI API key

Environment Setup

Begin by installing the public Dynamo AI SDK and importing the libraries required for the quickstart.

DYNAMO_API_KEY = ""
DYNAMO_HOST="https://api.dynamo.ai"

# Set your OpenAI API token here
OPENAI_API_KEY=""
!pip install dynamofl==0.0.86
import time
from dynamofl import DynamoFL, VRAMConfig, GPUConfig, GPUType

Now, create a Dynamo AI instance using your API token and host.

If you do not have an API token, generate a token by logging into apps.dynamo.ai with your provided credentials. Navigate to apps.dynamo.ai/profile to generate your Dynamo AI API token. This API token will enable you to programmatically connect to the Dynamo AI server, create projects, and train models. If you generate multiple API tokens, only your most recent one will work.

dfl = DynamoFL(DYNAMO_API_KEY, host=DYNAMO_HOST)
print(f"Connected as {dfl.get_user()['email']}")

Create a Model

First, let's create a remote model object. The model object specifies the target model. Dynamo AI currently supports two types of model objects — local models and remote model API endpoints.

In this quickstart, we demonstrate running tests on remote models. A remote model object is used to access a model provided or hosted by a third party. Below, we show how to create an OpenAI remote model.

# Create a model object referring to OpenAI's GPT-3.5 Turbo
model = dfl.create_openai_model(
    name="GPT 3.5 Jailbreaking Quickstart",  # name displayed in the model list on the UI
    api_instance="gpt-3.5-turbo",            # swap in any OpenAI model identifier, e.g. gpt-4o, gpt-4o-mini
    api_key=OPENAI_API_KEY,
)

print(model.key)

Run tests on GPT-3.5 Turbo

To run the jailbreak tests, call the create_adaptive_jailbreak_test or create_static_jailbreak_test method. Test creation submits the test to the DynamoEval platform, where it will be run.

The grid parameter specifies the grid of test cases to run. This enables you to run multiple tests in parallel at different temperatures for the target model.

For an Adaptive Jailbreak Test, run the following:

adaptive_test_info = dfl.create_adaptive_jailbreak_test(
    name="adaptive_jailbreak_test",
    model_key=model.key,
    grid=[{"temperature": [0.5]}],
)

If you want to run a Static Jailbreak Test instead, use the following snippet:

static_test_info = dfl.create_static_jailbreak_test(
    name="static_jailbreak_test",
    model_key=model.key,
    grid=[{"temperature": [0.1, 0.5, 1.0]}],
)

Please note that DynamoEval's trial environment is rate limited to one test at a time. Wait until the previous test has completed before submitting another, for example by polling the attack status as shown in the sketch below.
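If you want to block until the adaptive test has finished before submitting the static one, a minimal polling sketch is shown below. It relies only on the dfl.get_attack_info call used in the next section; the terminal status strings ("COMPLETED", "FAILED") are assumptions and may differ in your environment.

# Minimal polling sketch: wait for the adaptive test's attacks to finish
# before submitting the static test.
# NOTE: the terminal status values below are assumptions -- check the
# status strings actually reported by your environment.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}

def wait_for_attacks(attack_ids, poll_interval=30):
    pending = set(attack_ids)
    while pending:
        for attack_id in list(pending):
            status = dfl.get_attack_info(attack_id)["status"]
            print(f"Attack {attack_id}: {status}")
            if status in TERMINAL_STATUSES:
                pending.discard(attack_id)
        if pending:
            time.sleep(poll_interval)

wait_for_attacks([attack["id"] for attack in adaptive_test_info.attacks])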

Checking the attack status

# Confirm that the attacks have been queued
def query_attack_status(attack_id):
    attack_status = dfl.get_attack_info(attack_id)["status"]
    print("Attack status: {}.".format(attack_status))

# Return the status of each adaptive test attack
adaptive_attack_ids = [attack["id"] for attack in adaptive_test_info.attacks]
for attack_id in adaptive_attack_ids:
    query_attack_status(attack_id)

# Return the status of each static test attack
static_attack_ids = [attack["id"] for attack in static_test_info.attacks]
for attack_id in static_attack_ids:
    query_attack_status(attack_id)

Viewing test results

After your test has been created, navigate to the model dashboard page in the Dynamo AI UI. There, you should see that your model has been created and that your test is running.
Once the test has completed, a test report is generated and you can dive into the results!