
[SDK] Summarization Hallucination Test Quickstart

Summarization Hallucination Evaluations on GPT-3.5-Turbo with DynamoEval SDK

Last updated: October 10th, 2024


This quickstart provides an end-to-end walkthrough of how to use DynamoAI's SDK and platform to run a summarization hallucination evaluation. For demonstration purposes, we use a subset of the XSum dataset and GPT-3.5-Turbo as the base model.

Prerequisites:

  • DynamoAI API token
  • OpenAI API Token

API Token and Credentials

If you do not have a DynamoAI API token, generate a token by logging into apps.dynamo.ai with your provided credentials.

Navigate to apps.dynamo.ai/profile to generate your DynamoAI API token. This API token will enable you to programmatically connect to the DynamoAI server, create projects, and train models. If you generate multiple API tokens, only your most recent one will work.

You will also need the following credentials throughout the quickstart.

# Set your DynamoAI API token here
DYNAMOFL_API_KEY = "<dynamofl-api-key>"
# Set the URL of your locally deployed DynamoAI platform, or use "https://api.dynamo.ai"
# if you are using the DynamoAI Sandbox environment
DYNAMOFL_HOST = "<dynamofl-platform-host-url>"
# Set your OpenAI API key here
OPENAI_API_KEY = "<your-openai-api-key>"

Environment Setup

Begin by installing the public DynamoAI SDK, importing the required libraries, and downloading the dataset used in this quickstart.

!pip install dynamofl
import os
import time
from dynamofl import DynamoFL
from dynamofl import VRAMConfig, GPUConfig, GPUType
# download dataset
!gdown https://drive.google.com/uc?id=1L02luLua0YgMnLoHCAHjdy70PVk6LDzL

Initialization

Create a DynamoFL instance using your API token and host.

dfl = DynamoFL(DYNAMOFL_API_KEY, host=DYNAMOFL_HOST)

Create a model object

First, create a remote model object. The model object specifies the model that the hallucination test will be run on when you call the create_hallucination_test method. DynamoAI currently supports two types of model objects: local models and remote model API endpoints.

In this quickstart, we demonstrate running tests on remote models. A remote model object can be used to access a model provided by a third party; DynamoAI currently supports both OpenAI and Azure OpenAI models.

SLUG = time.time()
model_key = "GPT_3.5_Turbo_{}".format(SLUG)  # unique model identifier key
model_provider = "openai"  # name of model provider
api_instance = "gpt-3.5-turbo"  # identifier of the gpt-3.5-turbo model on OpenAI

# Create a model object referring to OpenAI's GPT-3.5-Turbo
model = dfl.create_openai_model(
    key=model_key,
    name="openai-gpt-3.5-turbo-summarization",  # name this as you like
    api_key=OPENAI_API_KEY,
    api_instance=api_instance,
)

# model key
print(model.key)

Create a dataset object

We also need to specify the dataset used for evaluation. A dataset object can be created by specifying the path to the dataset file.

At this time, DynamoAI only accepts CSV datasets or datasets hosted on the HuggingFace Hub. We recommend using datasets with fewer than 100 datapoints for testing. The dataset must contain a column of text that will be summarized.
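If you want to confirm that your file meets these constraints before uploading, a quick check with pandas is enough. The sketch below assumes the xsum-small.csv file downloaded earlier and its document column; adjust both for your own dataset.

import pandas as pd

# Sanity-check the CSV before uploading (assumes the quickstart's xsum-small.csv)
df = pd.read_csv("xsum-small.csv")
assert len(df) < 100, "keep evaluation datasets under 100 datapoints for testing"
assert "document" in df.columns, "the dataset needs a column of text to summarize"
print(df.head(3))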

# dataset upload
dataset_file_path = "xsum-small.csv"  # file path to dataset
dataset = dfl.create_dataset(
    key="dataset_{}".format(SLUG),
    file_path=dataset_file_path,
    name="xsum",
)

# dataset id
print(dataset._id)

Run Hallucination Tests on GPT-3.5-Turbo

To run the test, call the create_hallucination_test method. Test creation submits the test to your cloud machine learning platform, where it will be executed.

Setting test type and selecting metrics

DynamoAI currently supports two metrics for the summarization hallucination test:

  • nli-consistency measures the logical consistency between the document being summarized and the model-generated summary.
  • unieval-factuality measures the factuality of the model-generated summary compared to the document being summarized.

Specify your chosen metrics in the hallucination_metrics parameter in the create_hallucination_test function. For more details on metric descriptions and calculations, please see the appendix.

metrics = ["nli-consistency", "unieval-factuality"]

Setting Dataset Input Column Name

It's important to specify the dataset column that will be used for the input queries; this column name can be provided to the input_column parameter of create_hallucination_test. For summarization, this is the column that contains the instruction and the text to be summarized.

Optionally, you can also set reference_column to the column that contains the target text against which the generated summary should be compared. If this is not set, the generated summary is compared against the text in input_column by default.

For the XSum dataset, set the appropriate columns as follows.

input_column = "document"
reference_column = "document" # we need to fact-check the generated summary against the original document

Setting topic list

DynamoAI also supports an optional, user-provided list of topics used to cluster the input texts for better performance analysis. This list can be provided via the topic_list parameter of the create_hallucination_test method. When no list is provided, DynamoAI will automatically cluster the input texts and extract a set of keywords representative of each cluster.

topic_list = [
    "news article about sports",
    "news article about politics and government",
    "news article about science",
    "news article about accidents and crime",
]

Setting Model Hyperparameters

The following model hyperparameters can also be modified.

  • temperature: Controls the amount of "randomness" in your model's generations (0 being deterministic, 1 being random).
  • seq_len: Length (number of tokens) of the generated sequence.

DynamoAI provides grid search over these hyperparameter values, where the grid can be specified by a dictionary as follows.

grid = [
    {
        "temperature": [1.0, 0.5, 0.1],  # run tests over three temperature values
        "seq_len": [128],
    }
]

Running the test

Use the set of parameters specified above to run the test.

hallucination_test = dfl.create_hallucination_test(
    name=f"hallucination_test_{SLUG}",
    model_key=model.key,
    dataset_id=dataset._id,
    gpu=VRAMConfig(vramGB=16),
    hallucination_metrics=metrics,
    topic_list=topic_list,
    input_column=input_column,
    reference_column=reference_column,
    grid=grid,
)

# test id
print(hallucination_test.id)

# Helper function to check the status of a queued attack
def query_attack_status(attack_id):
    attack_info = dfl.get_attack_info(attack_id)
    print("Attack status: {}.".format(attack_info))

# Check attack status. Rerun this cell to see the current status of each test.
# Once all of them show COMPLETED, head to the model dashboard page in the UI to view the report.
all_attacks = hallucination_test.attacks
attack_ids = [attack["id"] for attack in all_attacks]
for attack_id in attack_ids:
    query_attack_status(attack_id)
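If you would rather wait in the notebook than rerun the status cell manually, a simple polling loop over the same attack IDs also works. This is a minimal sketch; it assumes that the string representation of the object returned by get_attack_info contains the status (e.g. COMPLETED), which you can confirm by printing attack_info once.

# Optional: poll until every attack reports COMPLETED instead of rerunning the cell.
# Assumption: the string form of get_attack_info(...) includes the attack status.
while True:
    statuses = [str(dfl.get_attack_info(attack_id)) for attack_id in attack_ids]
    if all("COMPLETED" in status for status in statuses):
        print("All attacks completed.")
        break
    print("Still running; checking again in 60 seconds.")
    time.sleep(60)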

Viewing Test Results

After your test has been created, navigate to the model dashboard page in the DynamoAI UI. Here, you should see that your model and dataset have been created and that your test is running. After the test has completed, a test report file will be generated and can be downloaded for a deep dive into the test results.

Appendix

NLI Consistency

The NLI consistency test measures the logical consistency between an input text (or document) and a model-generated summary. The NLI consistency evaluation is conducted by providing a summarization model with a set of input documents and scoring each (document, summary) pair as being entailing, contradicting, or neutral using a Natural Language Inference (NLI) model. Entailing indicates that the summary logically implies the content in the document while contradicting indicates otherwise, and neutral indicates that no logical relationship can be drawn. The NLI consistency test reports an entailment score.

NLI consistency score: In this evaluation, the NLI consistency score is a value between 0 and 1 representing the average degree to which the summaries logically imply the contents of the input texts. 0 indicates a low degree of entailment and a high degree of hallucination, while 1 indicates a high degree of entailment and a low degree of hallucination.
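For intuition, here is a minimal sketch of how a single (document, summary) pair can be scored with an off-the-shelf NLI model from HuggingFace. The model choice (roberta-large-mnli) and the example texts are assumptions for illustration only; DynamoAI's evaluation may use a different NLI model and additional aggregation, such as sentence-level scoring averaged over the dataset.

# Conceptual illustration of NLI-based consistency scoring, not DynamoAI's exact implementation.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

document = "The city council approved the new transit budget on Tuesday."
summary = "The council passed the transit budget."

# Score the (premise, hypothesis) pair; here the document is treated as the premise
# and the generated summary as the hypothesis.
result = nli({"text": document, "text_pair": summary}, top_k=None)
scores = {r["label"]: r["score"] for r in result}
print(scores)  # probabilities for ENTAILMENT / NEUTRAL / CONTRADICTION

# Averaging the entailment probability over all pairs in a dataset yields the entailment score.
print("Entailment probability:", scores["ENTAILMENT"])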

UniEval Factuality

The UniEval factuality test measures the factual support between an input text (or document) and a model-generated summary. The factuality score is calculated by providing a summarization model with a set of input documents and scoring each (document, summary) pair using an LLM as an evaluator. The evaluator LLM has been specifically trained on a set of boolean question-answer prompts related to evaluating factual support, and has been found to significantly outperform various state-of-the-art evaluators.

UniEval Factuality Score: In this evaluation, factual support is represented as a score between 0 and 1, with 0 implying a low degree of factuality and a high degree of hallucination, and 1 indicating a high degree of factual support and a low degree of hallucination.
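As a rough illustration of the boolean question-answer idea, the sketch below scores a (document, summary) pair by asking an instruction-tuned seq2seq model a yes/no question and normalizing the probabilities of the "yes" and "no" answers. The model name, prompt wording, and scoring details are all assumptions made for this sketch; they are not UniEval's trained evaluator or DynamoAI's implementation.

# Conceptual illustration of boolean-QA factuality scoring, not UniEval's or DynamoAI's
# exact prompt format, model, or aggregation. Model and prompt are assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # assumption: any instruction-tuned seq2seq model works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

document = "The city council approved the new transit budget on Tuesday."
summary = "The council rejected the transit budget."

prompt = (
    "Is the summary factually supported by the document? Answer yes or no.\n"
    f"Document: {document}\nSummary: {summary}"
)
inputs = tokenizer(prompt, return_tensors="pt")

# Compare the model's probability of answering "yes" versus "no" on the first decoded token.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=-1)
print("Factuality score (P(yes) normalized over yes/no):", p_yes.item())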