[SDK] Summarization Performance Test Quickstart
Summarization Performance Evaluations on GPT-3.5-Turbo with DynamoEval SDK
Last updated: October 10th, 2024
This Quickstart showcases an end-to-end walkthrough of how to utilize DynamoAI’s SDK and platform solutions to run a summarization performance evaluation. For demonstration purposes, we use a subset of the XSum dataset and GPT-3.5-Turbo as the base model.
Prerequisites:
- DynamoAI API token
- OpenAI API Token
Credentials
If you do not have a DynamoAI API token, generate a token by logging into apps.dynamo.ai with your provided credentials.
Navigate to apps.dynamo.ai/profile to generate your DynamoAI API token. This API token will enable you to programmatically connect to the DynamoAI server, create projects, and train models. If you generate multiple API tokens, only your most recent one will work.
You will need the following credentials throughout this quickstart.
# Set your DynamoAI API token here
DYNAMOFL_API_KEY="<dynamofl-api-key>"
# Set the URL of your locally deployed DynamoAI platform or use "https://api.dynamo.ai"
# if you are using the DynamoAI Sandbox environment
DYNAMOFL_HOST="<dynamofl-platform-host-url>"
# Set your OpenAI API key here
OPENAI_API_KEY="<your-openai-api-key>"
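Alternatively, if you would rather not hard-code secrets in a notebook, you can read them from environment variables. The sketch below assumes you have already exported DYNAMOFL_API_KEY, DYNAMOFL_HOST, and OPENAI_API_KEY in your shell; these variable names are just the convention used in this quickstart.
import os

# Read credentials from the environment instead of hard-coding them
DYNAMOFL_API_KEY = os.environ["DYNAMOFL_API_KEY"]
DYNAMOFL_HOST = os.environ.get("DYNAMOFL_HOST", "https://api.dynamo.ai")  # falls back to the DynamoAI Sandbox URL
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]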
Environment Setup
Begin by installing the public DynamoAI SDK, importing the required libraries, and downloading the required dataset and model files for the quickstart.
!pip install dynamofl
import os
import time
from dynamofl import DynamoFL
from dynamofl import VRAMConfig, GPUConfig, GPUType
# download dataset
!gdown https://drive.google.com/uc?id=1e-vPx9dGGDYD_Fjeqz5jl4YAnlsekfoF
Initialization
Create a DynamoFL instance using your API token and host.
dfl = DynamoFL(DYNAMOFL_API_KEY, host=DYNAMOFL_HOST)
Create a model object
First, create a remote model object. The model object specifies the model that tests will be run on by the create_performance_test method. DynamoAI currently supports two types of model objects: local models and remote model API endpoints.
In this quickstart, we demonstrate running tests on remote models. A remote model object can be used to access a model provided by a third party; currently, DynamoAI supports both OpenAI and Azure OpenAI models.
SLUG = time.time()
model_key="GPT_3.5_Turbo_{}".format(SLUG) # unique model identifier key
model_provider="openai" # name of model provider
api_instance="gpt-3.5-turbo" # identifier of gpt-3.5-turbo model on OpenAI
# Creating a model referring to OpenAI's GPT-3.5-Turbo
model = dfl.create_openai_model(
    key=model_key,
    name="openai-gpt-3.5-turbo-summarization", # name this as you want
    api_key=OPENAI_API_KEY,
    api_instance=api_instance
)
# model key
print(model.key)
Create a dataset object
We also need to specify the dataset used for evaluation. A dataset can be created by specifying the dataset file path.
At this time, DynamoAI only accepts CSV datasets or datasets hosted on the Hugging Face Hub. We recommend using datasets with fewer than 100 datapoints for testing. The dataset must contain a column of text that will be summarized.
For this quickstart, you can use this sample dataset, a subset of 50 news articles and human-annotated summaries taken from the open-source XSum dataset.
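Before uploading, it can be helpful to sanity-check the CSV locally. Below is a minimal sketch using pandas; it assumes the file uses the document and summary columns referenced later in this quickstart.
import pandas as pd

# Quick local check of the evaluation dataset (column names are the ones used below)
df = pd.read_csv("xsum-50.csv")
print(len(df))                        # expect 50 rows
print(df.columns.tolist())            # expect "document" and "summary"
print(df["document"].iloc[0][:200])   # peek at the first article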
# dataset upload
dataset_file_path = "xsum-50.csv" # file path to dataset
dataset = dfl.create_dataset(key="dataset_{}".format(SLUG), file_path=dataset_file_path, name="xsum")
# dataset id
print(dataset._id)
Run Performance Tests on GPT-3.5-Turbo
To run the test, we can call the create_performance_test method. Test creation will submit a test to your cloud machine-learning platform, where the test will be run.
Setting test type and selecting metrics
DynamoAI currently supports two metrics for the Summarization Performance Test:
- rouge: measures syntactic similarity between the generated summary and the target text.
- bertscore: measures semantic similarity between the generated summary and the target text.
Specify your chosen metrics in the performance_metrics parameter of the create_performance_test function. For more details on metric descriptions and calculations, please see the appendix.
metrics = ["rouge", "bertscore"]
Setting Dataset Input Column Name
It’s important to specify the dataset column that will be used for the input queries -- this column name can be provided to the input_column parameter of create_performance_test. For summarization, it is the column that contains the instruction and the text to be summarized.
Optionally, you can also set reference_column to the column that contains the target text against which the generated summary should be compared. This column may contain human-annotated summaries. If it is not set, the generated summary will be compared against the text in input_column by default.
For the XSum dataset, set up the appropriate columns as follows.
input_column = "document"
reference_column = "summary"
Setting topic list
DynamoAI also supports an optional, user-provided list of topics used to cluster the input texts for better performance analysis. This list can be passed as the topic_list parameter of the create_performance_test method. If no list is provided, the platform will automatically cluster the input texts and extract a set of keywords representative of each cluster.
topic_list = [
"news article about sports",
"news article about politics and government",
"news article about science",
"news article about accidents and crime",
]
Setting Model Hyperparameters
The following model hyperparameters can also be modified.
- temperature: controls the amount of "randomness" in your model's generations (0 being deterministic, 1 being random).
- seq_len: length (number of tokens) of the generated sequence.
DynamoAI provides grid search over these hyperparameter values, where the grid can be specified as a list of dictionaries as follows.
grid = [
    {
        "temperature": [1.0, 0.5, 0.1],  # run tests over three temperature values
        "seq_len": [128],
    }
]
Running the test
Use the set of parameters specified above to run the test.
performance_test = dfl.create_performance_test(
    name=f"performance_test_{SLUG}",
    model_key=model.key,
    dataset_id=dataset._id,
    gpu=VRAMConfig(vramGB=16),
    performance_metrics=metrics,
    topic_list=topic_list,
    input_column=input_column,
    reference_column=reference_column,
    grid=grid,
)
# test id
print(performance_test.id)
# Helper function to check if the attack has been queued
def query_attack_status(attack_id):
    attack_info = dfl.get_attack_info(attack_id)
    print("Attack status: {}.".format(attack_info))
# Check Attack Status. Rerun this cell to see what the status of each test looks like.
# Once all of them show COMPLETED, you can head to the model dashboard page in the UI to check out the report.
all_attacks = performance_test.attacks
attack_ids = [attack["id"] for attack in all_attacks]
for attack_id in attack_ids:
    query_attack_status(attack_id)
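If you would rather wait in the notebook than rerun the cell by hand, a simple polling loop like the sketch below also works. Matching on the string "COMPLETED" is an assumption about how the returned attack info renders; adjust it to the status field exposed by your SDK version.
# Poll until every attack reports COMPLETED.
# Note: checking for the substring "COMPLETED" is an assumption about how
# the object returned by dfl.get_attack_info renders as a string.
while True:
    statuses = [str(dfl.get_attack_info(attack_id)) for attack_id in attack_ids]
    if all("COMPLETED" in status for status in statuses):
        print("All attacks completed.")
        break
    print("Waiting for attacks to complete...")
    time.sleep(30)  # wait before polling again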
Viewing Test Results
After your test has been created, navigate to the model dashboard page in the DynamoAI UI. Here, you should see that your model and dataset have been created and that your test is running. After the test has completed, a test report file will be generated and can be downloaded for a deep dive into the test results!
Appendix
ROUGE
ROUGE is a metric designed to evaluate summary quality against a reference text by measuring the token-level overlap between the model-generated text and the reference. DynamoAI evaluates three types of ROUGE scores: ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-1 measures the overlap of unigrams (i.e., single tokens) between the model-generated outputs and reference texts, while ROUGE-2 measures the overlap of bigrams (i.e., sequences of two tokens). ROUGE-L, on the other hand, measures the longest common subsequence between the generated text and the reference.
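To get a feel for what these scores capture outside of the DynamoAI platform, here is a small standalone sketch using the Hugging Face evaluate library (requires pip install evaluate rouge_score); the example strings are made up.
import evaluate

# Compute ROUGE for a single made-up prediction/reference pair
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The council approved the new transport budget on Monday."],
    references=["Councillors signed off on the updated transport budget at Monday's meeting."],
)
print(scores)  # rouge1, rouge2, rougeL (and rougeLsum) F-measures in [0, 1]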
BERTScore
Rather than relying on exact token-level matches to compute summarization quality, BERTScore computes semantic similarity between a reference text and model response. For this, BERTScore leverages the pre-trained embeddings from the BERT model.
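As with the ROUGE sketch above, the snippet below uses the Hugging Face evaluate wrapper around BERTScore (requires pip install evaluate bert_score) and is independent of the DynamoAI SDK.
import evaluate

# Compute BERTScore for the same made-up prediction/reference pair
bertscore = evaluate.load("bertscore")
scores = bertscore.compute(
    predictions=["The council approved the new transport budget on Monday."],
    references=["Councillors signed off on the updated transport budget at Monday's meeting."],
    lang="en",  # selects a default English model for scoring
)
print(scores["f1"])  # per-example F1 scores; precision and recall are also returned
Because the two example sentences share meaning but relatively little exact wording, you would typically expect BERTScore to rate them more favorably than ROUGE, which illustrates the difference between the two metrics.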