[SDK] Sequence Extraction Quickstart (GPT-3.5-Turbo)
Last updated: Nov 18th, 2024
This quickstart walks through how to use DynamoFL's SDK and platform to run privacy tests that assess whether a model has been pre-trained or fine-tuned on a given text dataset or set of documents. For demonstration purposes, we investigate whether OpenAI's GPT-3.5-Turbo model has been pre-trained on Form 10-K annual reports.
Prerequisites:
- DynamoFL API token
- OpenAI API token
Background on Form 10-K and the dataset
The Form 10-K is a comprehensive report filed annually by publicly traded companies about their financial performance. The U.S. Securities and Exchange Commission (SEC) introduced the Form 10-K in 1934 to provide transparency to shareholders.
DynamoFL has preprocessed a public repository of historical 10-K reports from the HuggingFace dataset JanosAudran/financial-reports-sec. By merging the per-sentence rows into per-section rows, DynamoFL compiled the small_full subset of this repository into a dataset well suited to this test. The subset covers 188 10-K reports from 10 publicly traded companies filed between 1994 and 2021. For reference, the stock ticker symbols of the companies analyzed here are: ABT, ACU, AE, AIR, AMD, APD, BKTI, CECE, MATX, WDDD.
Environment Setup
Begin by installing the public Dynamo AI SDK and importing the libraries required for this quickstart.
# Set your DynamoFL and OpenAI API token here
DYNAMOFL_API_KEY=""
DYNAMOFL_HOST="https://api.dynamo.ai"
OPENAI_API_KEY=""
!pip install dynamofl==0.0.86
import time
from dynamofl import DynamoFL, VRAMConfig, GPUConfig, GPUType
Now, create a Dynamo AI instance using your API token and host.
If you do not have an API token, generate a token by logging into apps.dynamo.ai with your provided credentials. Navigate to apps.dynamo.ai/profile to generate your Dynamo AI API token. This API token will enable you to programmatically connect to the Dynamo AI server, create projects, and train models. If you generate multiple API tokens, only your most recent one will work.
dfl = DynamoFL(DYNAMOFL_API_KEY, host=DYNAMOFL_HOST)
print(f"Connected as {dfl.get_user()['email']}")
Create a Model
First, let's create a remote model object. The model object specifies the target model. Dynamo AI currently supports two types of model objects — local models and remote model API endpoints.
In this quickstart, we demonstrate running tests on remote models. A remote model object can be used to access a model provided or hosted by a third-party. Below, we show how to create an OpenAI remote model.
SLUG = int(time.time())  # timestamp used to build unique identifiers
model_key = "GPT_3.5_Turbo_{}".format(SLUG)  # unique model identifier key
api_instance = "gpt-3.5-turbo-0613"  # identifier of the target model on OpenAI
# Creating a model referring to OpenAI's GPT-3.5-Turbo
model = dfl.create_openai_model(
    key=model_key,
    name="GPT 3.5 Model",  # display name
    api_instance=api_instance,
    api_key=OPENAI_API_KEY,
)
Create a dataset object
To run a privacy evaluation test, we also need to specify the dataset to analyze for memorization. A dataset can be created by specifying the dataset file path. Here, we also give the dataset a unique key and an identifying name. At this time, Dynamo AI only accepts CSV datasets.
For this quickstart, we will use the 10-K reports dataset. Run the command below to download it.
# Download 10-K Reports dataset:
!curl "https://www.dropbox.com/scl/fi/99lb7k9rsptgbkcknd00y/10k_dataset.csv?rlkey=82yl7oz3ze86jfng7f5cj2dz8&dl=1" -o 10k_dataset.csv -J -L -k
dataset_file_path = "10k_dataset.csv"
# Creating a dataset object
dataset = dfl.create_dataset(
    key="dataset_{}".format(SLUG),
    file_path=dataset_file_path,
)
print(dataset._id)
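Before uploading, it can be worth confirming the CSV has the columns the test will reference later (title and text). This helper is not part of the DynamoFL SDK, just a local sanity check; the sample frame stands in for pd.read_csv("10k_dataset.csv"):

```python
import pandas as pd

def validate_dataset(df, required=("title", "text")):
    """Raise if the dataset lacks the columns the test will reference."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError("dataset is missing columns: {}".format(missing))
    return len(df)

# Tiny stand-in; in practice: df = pd.read_csv("10k_dataset.csv")
sample = pd.DataFrame({"title": ["ABT 1994 10-K"],
                       "text": ["Item 1. Business ..."]})
print(validate_dataset(sample))  # → 1
```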
Run tests on GPT-3.5-Turbo
To run a Sequence Extraction privacy evaluation test, call the create_sequence_extraction_test method. Test creation submits a test to the DynamoEval platform, where it will be run.
test_info_sequence_extraction = dfl.create_sequence_extraction_test(
    name="sequence_extraction_test_{}".format(SLUG),
    model_key=model.key,
    dataset_id=dataset._id,
    gpu=GPUConfig(gpu_type=GPUType.V100, gpu_count=1),  # default GPU parameters
    memorization_granularity="paragraph",
    sampling_rate=100,
    source="10-K SEC Reports",
    title_column="title",
    text_column="text",
    is_finetuned=False,
    grid=[
        {"prompt_length": [100, 200],
         "seq_len": [256],
         "temperature": [0, 0.5]}
    ],
)
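How the platform expands the grid is not documented here; a common convention (used by, e.g., scikit-learn parameter grids) is the cross product of each entry's value lists, which for this grid would yield 2 × 1 × 2 = 4 attack configurations:

```python
from itertools import product

# Mirrors the grid entry passed to create_sequence_extraction_test above.
grid = {"prompt_length": [100, 200], "seq_len": [256], "temperature": [0, 0.5]}

# Expand into one configuration per combination of values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
for cfg in configs:
    print(cfg)
print(len(configs))  # → 4
```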
Checking the attack status
# Confirm that each attack has been queued
def query_attack_status(attack_id):
    status = dfl.get_attack_info(attack_id)["status"]
    print("Attack status: {}.".format(status))

all_attacks = test_info_sequence_extraction.attacks
attack_ids = [attack["id"] for attack in all_attacks]
for attack_id in attack_ids:
    query_attack_status(attack_id)
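The snippet above checks each attack once. To block until attacks finish, the check can be extended into a polling loop. The terminal status names below ("COMPLETED", "FAILED") are assumptions, not confirmed SDK values, and the fake status sequence stands in for real dfl.get_attack_info(attack_id)["status"] lookups:

```python
import time

def wait_until_done(get_status, terminal=("COMPLETED", "FAILED"),
                    poll_seconds=30, max_polls=100):
    """Poll a status callable until it reports a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("attack did not reach a terminal status")

# Demo with a fake status sequence in place of real SDK calls.
fake_statuses = iter(["QUEUED", "RUNNING", "COMPLETED"])
print(wait_until_done(lambda: next(fake_statuses), poll_seconds=0))
# → COMPLETED
```

In real use you would pass `lambda: dfl.get_attack_info(attack_id)["status"]` and keep a nonzero `poll_seconds` to avoid hammering the API.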
Viewing test results
After your test has been created, navigate to the model dashboard page in the Dynamo AI UI. Here, you should observe that your model has been created and that your test is running.
After the test has completed, a test report will be created and you can dive into the test results!