[UI] Summarization Hallucination Test Quickstart
Summarization Hallucination Evaluations on GPT-3.5-Turbo with DynamoEval UI
Last updated: October 10th, 2024
This Quickstart provides an end-to-end walkthrough of how to use DynamoAI’s platform solution to run a summarization hallucination test. It also covers general guidelines and specific examples for setting up test configurations.
If you are a developer and wish to follow the same quickstart with the Dynamo AI SDK, please refer to the associated SDK Quickstart.
Create Model
Begin by navigating to the DynamoAI home page. This page contains the model registry – a collection of all the models you have uploaded for evaluation or guardrailing. The model registry contains information to help you identify your model, such as the model source, use case, and date updated.
To upload a new model to the registry, click the "Upload new model" button in the upper right corner. A popup will appear, requesting more information about the model.
This includes information such as the model name and use case. The popup will also request information about the model source. Remote inference can be used to connect to any model that is provided by a third party or is already hosted and accessible through an API endpoint. Local inference, on the other hand, can be used to upload a custom model file.
Example. For this quickstart, we recommend setting the following:
- Model name: “GPT-3.5 Summarization”
- Model Source: Remote Inference
The next page of the popup will ask for more detailed information to set up the API connection.
This includes information about the model provider, API access key, model identifier, as well as an optional model endpoint (if required by your API provider).
Example. We recommend setting the following:
- API Provider: OpenAI
- API Key: (your OpenAI API key)
- Model: gpt-3.5-turbo
- Endpoint: (leave blank)
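Under the hood, this remote connection lets DynamoEval query the model through the provider’s API. Purely as an illustration (this is not part of the UI setup), the equivalent direct call with the official openai Python client would look roughly like this:

```python
# Illustration only: the kind of request the remote connection enables.
# This is not part of the DynamoAI setup flow; it just shows how the
# provider, API key, and model fields configured above fit together.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # the API key entered in the popup

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the model identifier entered in the popup
    messages=[{"role": "user", "content": "Summarize the following article:\n..."}],
)
print(response.choices[0].message.content)
```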
At this point, your model named "GPT-3.5 Summarization" should be created and displayed in the model registry.
Create Test
To create a test for the created model, click "DynamoEval" under the Links column on the Model Registry page > click the "Testing" tab > click "New test" in the upper right corner.
Next, fill in the test information:
- Fill in a test title that is indicative of the test you are running.
- Select "Hallucination tests".
- Select the two test types: “NLI Consistency” and “UniEval Factuality”.
Dataset Configuration
After selecting the test type, you’ll then be asked to select a dataset.
1. Select an Existing Dataset
Here, you can select an existing dataset that has been previously uploaded. Click on the checkbox next to the dataset name. Skip to the next section if you are using the platform for the first time or would like to upload a new dataset.
2. OR Upload a New Dataset
Alternatively, you can upload a new dataset by clicking “Upload custom dataset” (upper right corner).
On the pop-up sidebar, you’ll be asked to provide a dataset name and description. We recommend being specific so you can clearly identify the dataset in the future. Next, set an access level. Finally, you’ll be asked to upload a dataset. Currently, DynamoAI supports running attacks and evaluations on CSV datasets (Local Dataset) or datasets residing in the HuggingFace Hub (HuggingFace Dataset).
- For a HuggingFace Dataset, you will be asked to fill in the Dataset ID and an access token, which is required if the dataset is private (see the sketch below).
- For a Local Dataset, you will be asked to drag and drop the CSV file.
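The Dataset ID and access token are the same identifiers you would use with the HuggingFace datasets library. As a hedged sketch (the dataset name and split below are placeholders, not requirements of the platform):

```python
# Sketch only: what the HuggingFace Dataset fields correspond to.
# "your-org/your-dataset" is a placeholder Dataset ID.
from datasets import load_dataset

ds = load_dataset(
    "your-org/your-dataset",   # Dataset ID field in the UI
    split="test",
    token="hf_...",            # access token, needed only for private datasets
)
print(ds.column_names)
```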
Note: The uploaded local dataset should satisfy the following conditions:
- All rows and columns should contain strings of length at least 5.
- There must be at least one column populated with the data points of documents that will be summarized by the model.
- The number of data points should be more than 10.
- The number of data points should be no more than 100.
- The uploaded CSV file must be parseable with the pandas.read_csv() method without any errors (a sanity-check sketch follows this list).
- The first row of the CSV file should contain the column names (i.e., it should not be a data point).
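If you would like to sanity-check a local CSV against these conditions before uploading, a minimal pre-check sketch (the file name and the "document" column name are placeholders) could look like this:

```python
# Sketch of a local pre-check for the CSV conditions listed above.
# "my_dataset.csv" and the "document" column name are placeholders.
import pandas as pd

df = pd.read_csv("my_dataset.csv")  # must parse without errors; first row is the header

assert 10 < len(df) <= 100, "expected more than 10 and at most 100 data points"
assert "document" in df.columns, "need a column with the documents to summarize"
assert all(isinstance(v, str) and len(v) >= 5 for v in df.to_numpy().ravel()), \
    "every cell should be a string of length at least 5"
```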
Example. For this tutorial, select "Local Dataset". We recommend using this sample dataset, a subset of 50 news articles and human-annotated summaries taken from the open-source XSum dataset.
3. Configure Dataset
This step ensures that DynamoEval tests can refer to the correct column(s) from the dataset.
Input text column: specify this as the column name (or feature name) of the dataset containing the documents that will be summarized by the model.
Target text column: specify this as the column name of the dataset containing the content against which the generated summaries should be fact-checked (illustrated in the sketch after the example below).
Example. We recommend using the following configurations:
- Input text column: document
- Target text column: document
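As an illustration of what this configuration means (the file name is a placeholder), both fields point at the source articles, so each generated summary is fact-checked against the document it summarizes:

```python
# Sketch only: how the configured columns map onto the sample dataset.
# "xsum_subset.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("xsum_subset.csv")
input_text = df["document"]    # Input text column: the articles the model summarizes
target_text = df["document"]   # Target text column: the reference the summaries are checked against
```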
Test Parameters Setup
This page allows you to vary different test parameters to observe performance across different settings. You can set up a range of values with min, max, and step size so that the tests run with different combinations of parameters in a sweeping manner.
- Temperature: this controls the randomness of the generation (larger values yield more random generation)
- Sequence length: this controls the maximum length of the generated sequence. If the response needs to be longer, increase the size. If this value is too small, the response may get cut off.
Example. We recommend the following configurations:
- Temperature: Min - 0, Max - 1, Step size - 0.5 (the default values). This sweeps through three temperature settings (0.01, 0.5, and 1), representing deterministic, slightly random, and more random generation, respectively.
- Sequence length: Min - 128, Max - 128, Step size - 64. This sweeps through a single setting: 128. (A sketch of how these ranges expand follows.)
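Conceptually, the sweep runs the test once for every combination of the values generated from each min/max/step range. The sketch below is not DynamoEval’s implementation, just an illustration of how a range expands (the example above lists 0.01 rather than 0 as the lowest swept temperature; this sketch does not model that detail):

```python
# Sketch only: how min/max/step ranges expand into a parameter sweep.
# This is an illustration, not DynamoEval's actual implementation.
from itertools import product

def expand(minimum, maximum, step):
    values, current = [], minimum
    while current <= maximum:
        values.append(current)
        current += step
    return values

temperatures = expand(0.0, 1.0, 0.5)      # -> [0.0, 0.5, 1.0]
sequence_lengths = expand(128, 128, 64)   # -> [128]

for temperature, max_length in product(temperatures, sequence_lengths):
    print(f"run test with temperature={temperature}, max sequence length={max_length}")
```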
Compute Setup
Set up compute resources to run the tests with.
- If the generation model is remote, we recommend using 1 x A10G GPU. For this tutorial, please select this option, since the OpenAI model runs remotely.
- If the generation model is local, make sure to allocate sufficient GPU compute so that the local model can be fully loaded to the available VRAM.
After this step you will see a summary page. If everything looks good, click “Create Test” to finish the test setup and queue the test.
Checking Results
After queueing the test, its status on the model’s Testing tab will show one of three indicators: Complete, In Progress, or Awaiting Resources.
Once the test is marked Complete, you can look through the test results in three different ways:
- Dashboard: In the Dashboard tab, examine the key metrics.
- Deep-dive: Under the Testing tab, click on “View Test Details” for the test that was run.
- See report: Under the Testing tab, click the drop-down arrow on the right for the test that was run, and click “Download report” to view the generated RAG Hallucination Test report.