[SDK] RAG Hallucination Test Quickstart
RAG Hallucination Evaluations with DynamoEval SDK (GPT-3.5-Turbo)
Last updated: November 1st, 2024
This Quickstart showcases an end-to-end walkthrough of:
- how to utilize DynamoAI's SDK and platform solutions to run RAG hallucination evaluations assessing retrieval relevance, response relevance, and response faithfulness.
For demonstration purposes, we will use two datasets as the source documents and queries, each with a different use case:
- (
multidoc2dial-conv-test-quickstart.csv
) Processed version of MultiDoc2Dial dataset for call center application - (
fintabnetqa-sujet-mixed-49.csv
) Processed version of mixture of TableVQABench dataset and Sujet Financial Dataset for Q&A on financial documents and tables.
We will use OpenAI's GPT-3.5-Turbo as the base generation model and a ChromaDB vector database for retrieval.
Prerequisites:
- DynamoAI API token
- OpenAI API Token
Colab notebook using multidoc2dial-conv-test-quickstart.csv data:
Colab notebook using fintabnetqa-sujet-mixed-49.csv data:
Getting an API Token
If you do not have a DynamoAI API token, generate a token by logging into apps.dynamo.ai with your provided credentials.
Navigate to apps.dynamo.ai/profile to generate your DynamoAI API token. This API token will enable you to programmatically connect to the DynamoAI server, create projects, and train models. If you generate multiple API tokens, only your most recent one will work.
You will also need the following credentials throughout the quickstart.
# Set your DynamoAI API token here
DYNAMOFL_API_KEY="<dynamofl-api-key>"
# Set the URL of your locally deployed DynamoAI platform or use "https://api.dynamo.ai"
# if you are using the DynamoAI Sandbox environment
DYNAMOFL_HOST="<dynamofl-platform-host-url>"
# Set your OpenAI API key here
OPENAI_API_KEY = "<your-openai-api-key>"
Environment Setup
Begin by installing the public DynamoAI SDK, importing the required libraries, and downloading the required dataset and model files for the quickstart.
!pip install dynamofl
import os
import time
from dynamofl import DynamoFL
from dynamofl import VRAMConfig, GPUConfig, GPUType
from dynamofl import ChromaDB
Initialization
Create a DynamoFL instance using your API token and host.
dfl = DynamoFL(DYNAMOFL_API_KEY, host=DYNAMOFL_HOST)
Create a model object
First, create a model object. The model object specifies the model that will be evaluated when the create_rag_hallucination_test()
method is run. DynamoAI currently supports two types of model objects: local models and remote model API endpoints.
In this quickstart, we demonstrate running tests on remote models. A remote model object can be used to access a model provided by a third party; DynamoAI currently supports both OpenAI and Azure OpenAI models, as well as Databricks-hosted models.
SLUG = int(time.time())
model_key = "GPT_3.5_Turbo_{}".format(SLUG) # unique model identifier key
# Creating a model referring to OpenAI's GPT-3.5-Turbo
model = dfl.create_openai_model(
key=model_key,
name="openai-gpt-3.5-turbo-rag",
api_key=OPENAI_API_KEY,
api_instance="gpt-3.5-turbo"
)
# model key
print(model.key)
Create a dataset object
To run a RAG hallucination test, we also need to specify the dataset used for evaluation. A dataset can be created by specifying the dataset file path.
At this time, DynamoAI accepts CSV datasets or datasets hosted on the HuggingFace Hub. We recommend using datasets with no more than 100 data points for testing.
The uploaded dataset should satisfy the following conditions (a quick pre-upload check is sketched after this list):
- All rows and columns should contain strings of length at least 5.
- There must be at least one column populated with the queries that will be used as input to the RAG system.
- The number of data points should be more than 10.
- The uploaded CSV file must be parseable with the pandas.read_csv() method without any errors.
- The first row of the CSV file should contain the column names (i.e., it should not be a data point).
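The following is a minimal sketch of such a check using pandas; it is not part of the DynamoAI SDK and assumes the query column is named queries.
import pandas as pd

# Minimal pre-upload check against the conditions listed above (illustrative only)
df = pd.read_csv("multidoc2dial-conv-test-quickstart.csv")  # must parse without errors
assert "queries" in df.columns, "expected a column holding the input queries"
assert len(df) > 10, "the dataset should have more than 10 data points"
assert all(
    df[col].map(lambda x: isinstance(x, str) and len(x) >= 5).all()
    for col in df.columns
), "every cell should be a string of length at least 5"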
For this quickstart, you can use either of the following datasets:
"multidoc2dial-conv-test-quickstart"
: dataset with queries where queries are directly extracted from the original multidoc2dial dataset. The queries will take a form of a paragraph that describes a conversation between an agent and a user, ending with a user's question.
import pandas as pd
test_file = pd.read_csv("multidoc2dial-conv-test-quickstart.csv")
test_file.head().iloc[0]
"fintabnetqa-sujet-mixed-49"
: dataset with queries related to financial documents and tables.
import pandas as pd
test_file = pd.read_csv("fintabnetqa-sujet-mixed-49.csv")
test_file.head().iloc[0]
After checking the dataset, upload the dataset to the DynamoAI platform:
# dataset upload (multidoc2dial-conv-test-quickstart)
dataset_file_path = "multidoc2dial-conv-test-quickstart.csv"
dataset = dfl.create_dataset(key="dataset_{}".format(SLUG), file_path=dataset_file_path, name="multidoc2dial test")
# dataset upload (fintabnetqa-sujet-mixed-49); run this instead if you are using the financial dataset
dataset_file_path = "fintabnetqa-sujet-mixed-49.csv"
dataset = dfl.create_dataset(key="dataset_{}".format(SLUG), file_path=dataset_file_path, name="financial document test")
# dataset id
print(dataset._id)
Specify a Vector DB Instance
To run a RAG hallucination test, a key step is providing the configuration details of the vector DB used for document retrieval.
At this time, the DynamoAI SDK supports several vector database options for connecting to your RAG system:
- ChromaDB
- LlamaIndex VectorStore
- Databricks VectorSearch
- Postgres VectorDB
- Custom RAG applications (via REST API endpoints)
Custom vector databases can be integrated using the DynamoAI CustomRagDB wrapper, which enables connections through REST APIs with flexible request/response transformations. Additional details are available in the Custom RAG Application section.
(see Connecting Vector Databases for more information). If you need help integrating your existing database, please reach out to our team.
For this quickstart, you will be using an existing Chroma vector DB populated with documents from the multidoc2dial-conv-test-quickstart and fintabnetqa-sujet-mixed-49 datasets, using the arguments below.
- host: specifies where the DB is hosted
- port: port number for the connection
- collection: collection name where the contents are stored
- ef_inputs: specifies what embedding model was used to create the vector index. ef_type specifies the function provider's name (currently supports SentenceTransformer (sentence_transformer), HuggingFace (hf), and OpenAI (openai)), and model_name specifies the specific model being used. An appropriate api_key should be specified when using models that are not public (e.g., a huggingface-hub access token for HuggingFace, an API key for OpenAI); see the hypothetical example after this list.
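In this quickstart, the collections were embedded with a public SentenceTransformer model, so no api_key is needed. For reference, a hypothetical ef_inputs for a collection embedded with a non-public OpenAI embedding model might look like the following (the model name is only an example and is not used in this quickstart):
# Hypothetical ef_inputs for a collection embedded with a non-public model; api_key is required
ef_inputs_openai = {
    "ef_type": "openai",
    "model_name": "text-embedding-3-small",  # example OpenAI embedding model
    "api_key": OPENAI_API_KEY,
}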
For the multidoc2dial-conv-test-quickstart dataset, use the following:
# chroma db set up in DynamoAI's trial environment
chroma_args = {
"host": "https://chromadb.internal.dynamo.ai",
"port": 8000,
"collection": "multidoc2dial",
"ef_inputs": {
"ef_type": "sentence_transformer",
"model_name": "all-MiniLM-L6-v2", # public model
},
}
For the fintabnetqa-sujet-mixed-49 dataset, use the following:
# chroma db set up in DynamoAI's trial environment
chroma_args = {
"host": "https://chromadb.internal.dynamo.ai",
"port": 8000,
"collection": "fintabnetqa_10k_mixed", # note the collection name difference
"ef_inputs": {
"ef_type": "sentence_transformer",
"model_name": "all-MiniLM-L6-v2", # public model
},
}
Then create a specification object that can be readily used by the SDK method later.
# Create a ChromaDB specification object
chromadb_setup = ChromaDB(**chroma_args)
Run Tests on OpenAI Model
To run a RAG hallucination test, we can call the create_rag_hallucination_test method. Test creation will submit a test to your cloud machine learning platform, where the test will be run.
First, we need to specify the different test parameters for the method. For further details on best practices for configuring each test parameter, please see the appendix.
Setting Dataset Input Column Name
When configuring a RAG hallucination test, it's important to specify the dataset column that contains the input queries. For both datasets in this quickstart, the queries column contains the queries we want to use.
input_column = "queries"
Setting Model Hyperparameters
The following model hyperparameters can also be modified.
- temperature: controls the amount of "randomness" in your model's generations (0 being deterministic, 1 being random)
- seq_len: length (number of tokens) of the generated sequence
- retrieve_top_k: number of documents retrieved from the vector DB and provided as context to the generation model
DynamoAI provides grid search over these hyperparameter values, where the grid can be specified by a dictionary as follows. Every combination in the grid is evaluated, so the example below corresponds to 2 × 1 × 2 = 4 configurations.
grid=[
{
"temperature": [1.0, 0.1], # run tests over two temperature values
"seq_len": [128],
"retrieve_top_k": [1, 2],
}
]
Setting Test Type and Metrics
DynamoAI currently supports three types of RAG hallucination evaluations:
- retrieval-relevance: measures the relevance of documents retrieved from the vector database to the input queries
- response-relevance: measures the relevance of model-generated responses to the input queries
- faithfulness: measures the faithfulness of model-generated responses to the retrieved document context
To select RAG hallucination evaluation metrics, specify your chosen metrics in the rag_hallucination_metrics parameter of the create_rag_hallucination_test function. For more details on metric descriptions and calculations, please see the appendix.
rag_hallucination_metrics = ["retrieval-relevance", "response-relevance", "faithfulness"]
Specifying Prompt Template
When running RAG hallucination tests, DynamoAI also provides the option to modify the prompt template used for providing context to the model. If you modify the prompt template, please ensure that it includes {context} and {question} placeholders to identify where the retrieved document context and the input query will be inserted. The dataset-specific templates below are used in this quickstart (a filled-prompt example follows them).
- For the multidoc2dial-conv-test-quickstart dataset, because the queries take the form of a conversation, we want a more specific instruction on how to process them, such as the following prompt template:
prompt_template = """Write how the Agent should respond to the User in the conversation, solely based on the context provided.\nContext: {context}\nConversation: {question}"""
- For the fintabnetqa-sujet-mixed-49 dataset, the queries are simple questions, so we can use a more generic instruction for the template:
prompt_template = """Answer the following question based on the context provided.\n\nQuestion: {question}\n\nContext:\n{context}\n\nAnswer:"""
Specifying List of Topics for Clustering
DynamoAI also supports an optional user-provided list of topics to cluster the input texts for better performance analysis. This list can be provided via the topic_list parameter of the create_rag_hallucination_test() method.
When nothing is provided, DynamoAI will automatically cluster the input texts and extract a set of keywords that are representative of each cluster.
- For multidoc2dial-conv-test-quickstart, here is an example topic list:
topic_list=[
"department of motor vehicles",
"student scholarship and financial support",
"veteran benefits and healthcare",
"social security services",
]
- For fintabnetqa-sujet-mixed-49, try leaving the topic list blank:
topic_list = []
Putting It All Together
Now that all test parameters are set up, run the create_rag_hallucination_test() method with those parameters. Running the cell below will create the test and queue it on the machine.
- Set vector_db to a different vector DB setup if you are using one of the other options.
test_info = dfl.create_rag_hallucination_test(
name=f"rag_hallucination_test_{SLUG}", # optional test name
model_key=model.key, # previously created model identifier key
dataset_id=dataset._id, # previously created dataset id
input_column=input_column, # input column name for the queries
prompt_template=prompt_template, # prompt_template
vector_db=chromadb_setup, # chromadb db args
rag_hallucination_metrics=rag_hallucination_metrics, # metrics for the tests
topic_list=topic_list, # list of topics used to cluster
grid=grid, # grid of hyperparameters to repeat the evaluations with
gpu=GPUConfig(gpu_type=GPUType.A10G, gpu_count=1), # default GPU parameters
)
# test id
print(test_info.id)
# Helper function confirming the Attack has been queued
def query_attack_status(attack_id):
    attack_info = dfl.get_attack_info(attack_id)
    print("Attack status: {}.".format(attack_info))

# Check Attack Status. Rerun this cell to see what the status of each test looks like.
# Once all of them show COMPLETED, you can head to the model dashboard page in the DynamoAI UI to check out the report.
all_attacks = test_info.attacks
attack_ids = [attack["id"] for attack in all_attacks]
for attack_id in attack_ids:
    query_attack_status(attack_id)
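If you prefer to wait in the notebook rather than rerun the cell manually, a simple polling loop like the sketch below can be used. It assumes the value returned by get_attack_info() includes the status text (e.g., COMPLETED) in its string representation; adjust the check to your SDK version.
# Optional: poll every 30 seconds until all attacks report COMPLETED (illustrative sketch)
while True:
    statuses = [str(dfl.get_attack_info(attack_id)) for attack_id in attack_ids]
    print(statuses)
    if all("COMPLETED" in status for status in statuses):
        break
    time.sleep(30)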
Viewing Test Results
After your test has been created, navigate to the model dashboard page in the DynamoAI UI (under the Models tab). Here, you should observe that your model and dataset have been created and that your test is running. After the test has completed, a test report file will be created and can be downloaded for a deep-dive into the test results!
Appendix
Metric Definitions
Retrieval Relevance
Retrieval relevance represents the degree of relevance of the documents retrieved from the vector database using the embedding model for each query.
To measure the retrieval relevance, DynamoAI generates an LLM Relevance Score. Various studies have shown the effectiveness of LLMs as reference-free evaluators for tasks such as content relevance.
The LLM Relevance Score is computed by prompting an LLM to evaluate the sufficiency of the content in the retrieved documents to answer a given user query. For each query, DynamoAI computes the relevance score against the top document retrieved from the vector database. Based on this score, each (query, retrieved document) pair is classified as either positive (relevant) or negative (not relevant). A negative classification indicates that the retrieved document may not contain the key information required to answer the given query.
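For intuition only, an LLM-as-judge prompt for this kind of sufficiency check could look like the sketch below. This is not DynamoAI's internal evaluation prompt; it simply illustrates the idea of asking an LLM to grade whether a retrieved document can answer a query.
# Illustrative relevance-judging prompt (not DynamoAI's actual prompt)
relevance_judge_prompt = (
    "You are grading a retrieval system.\n"
    "Query: {query}\n"
    "Retrieved document: {document}\n"
    "Does the document contain the key information needed to answer the query? "
    "Answer 'relevant' or 'not relevant'."
)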
Response Faithfulness
Response faithfulness represents how faithful model-generated responses are to the retrieved documents.
To measure response faithfulness, DynamoAI relies on an NLI consistency score. Here, DynamoAI uses a natural language inference (NLI) model, which labels a (retrieved document, generated response) pair as entailing, contradicting, or neutral, with corresponding scores.
DynamoAI specifically runs the NLI model over each pair of sentences from a retrieved document and the generated response. This enables DynamoAI to retain the highest entailment score for each sentence of the generated response. Finally, DynamoAI takes the mean of these maximum entailment scores across all sentences in the response to produce an aggregate score for the full response. Based on this score, each (set of retrieved documents, response) pair is classified as either positive (no issue) or negative (the response potentially misses or contradicts some key information in the retrieved documents).
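A small worked example of this max-then-mean aggregation, using made-up entailment scores (the real scores come from the NLI model):
# Illustrative aggregation with fabricated scores; rows are response sentences,
# columns are document sentences.
entailment_scores = [
    [0.91, 0.30, 0.12],  # response sentence 1 vs each document sentence
    [0.05, 0.22, 0.18],  # response sentence 2 vs each document sentence
]
per_sentence_max = [max(row) for row in entailment_scores]          # [0.91, 0.22]
faithfulness_score = sum(per_sentence_max) / len(per_sentence_max)  # 0.565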
Response Relevance
Response relevance represents the relevance of model-generated responses to the query.
To measure response relevance, DynamoAI generates a Response Relevance Score. Similar to the Retrieval Relevance Score, the Response Relevance Score is computed by prompting an LLM to generate a score between 0 and 1 that indicates how relevant the generated response is to the question. Based on this score, each (query, response) pair is classified as either positive (relevant) or negative (not relevant). A negative classification indicates that the response may not contain the information that answers the query.
Connecting Vector Databases
For RAG workflows, a vector database is generally used to retrieve the documents most relevant to a particular input query. This is done by:
- Passing the input query to the same embedding model used to create the vector database to generate a vector embedding representation of the query
- Querying the vector database with the embedding representation of the query to retrieve the most semantically similar vectors in the database (representing documents)
- Using these retrieved vectors to fetch the documents they represent and passing these as context to the language model
To support evaluation on this workflow, DynamoAI requires providing a VectorDB connection to RAG Hallucination tests. If you need help setting up a vector database connection, please reach out to our team.
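As a rough illustration of this retrieval step using the chromadb Python client (DynamoAI performs this internally during a test; the host, collection name, and query below are placeholders):
# Illustrative retrieval step with the chromadb client (not part of the DynamoAI SDK)
import chromadb

client = chromadb.HttpClient(host="<chroma-host>", port=8000)
# The embedding function should match the one used to build the collection
collection = client.get_collection("<collection-name>")
results = collection.query(
    query_texts=["How do I renew my driver's license?"],  # placeholder query
    n_results=2,                                          # analogous to retrieve_top_k
)
retrieved_context = "\n".join(results["documents"][0])    # passed to the LLM as {context}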
ChromaDB
DynamoAI supports connection to Chroma vector databases. To create a ChromaDB object, DynamoAI requires the host, port, and collection name of a persistent database instance. The database instance must be hosted in the same VPC as your DynamoAI deployment.
LlamaIndex
DynamoAI supports vector indices created using llama-index. To use LlamaIndex vector indices, DynamoAI requires access to a remote S3 bucket that acts as a persistent directory, along with the necessary credentials and configuration details (e.g., the type and name of the embedding model used to create the vector index). These files do not need to be hosted in the same VPC as your DynamoAI deployment.
To set up a connection to a LlamaIndex VectorStore through the DynamoAI SDK, the following fields are required:
- aws_key and aws_secret: AWS S3 credentials
- s3_bucket_name: bucket name to be used as a remote persistent directory for the vector index
- ef_inputs: specifies what embedding model was used to create the vector index. ef_type specifies the function provider's name (currently supports SentenceTransformer (sentence_transformer), HuggingFace (hf), and OpenAI (openai)), and model_name specifies the specific model being used. An appropriate api_key should be specified when using models that are not public (e.g., a huggingface-hub access token for HuggingFace, an API key for OpenAI).
# Set your AWS access key here (only when using LlamaIndex vectorstore)
AWS_KEY = "<s3-aws-key>" # AWS credentials to connect to remote S3 with persistent directory
AWS_SECRET_KEY = "<s3-aws-secret-key>" # AWS credentials to connect to remote S3 with persistent directory
# llamaindex vectorstore set up in DynamoAI's trial environment
llamaindex_arg = {
"aws_key": AWS_KEY,
"aws_secret": AWS_SECRET_KEY,
"s3_bucket_name": "<s3_bucket_name>",
"ef_inputs": {
"ef_type": "sentence_transformer", # embedding function provider
"model_name": "all-MiniLM-L6-v2", # embedding function model name
},
}
from dynamofl import LlamaIndexDB
llamaindex_setup = LlamaIndexDB(**llamaindex_arg)
Then pass this llamaindex_setup as the vector_db argument to the create_rag_hallucination_test() function.
Databricks VectorSearch
DynamoAI supports vector search using Databricks VectorSearch. DynamoAI requires API access to the Databricks workspace hosting the vector index endpoint, along with the following parameters:
- host: Databricks workspace URL
- index_name: vector index name
- token: Databricks workspace API access token
- id_column: column name of the database where the data point id is stored
- content_column: column name of the database where the content is stored
# Set your Databricks credentials here
DBRX_HOST = "<databricks-workspace-url>"
DBRX_TOKEN = "<databricks-access-token>"
DBRX_INDEX = "<databricks-vector-index-name>"
dbrx_args = {
"host": DBRX_HOST,
"index_name": DBRX_INDEX,
"token": DBRX_TOKEN,
"id_column": "id",
"content_column": "content",
}
from dynamofl import DatabricksVectorSearch
databricks_setup = DatabricksVectorSearch(**dbrx_args)
Then pass this databricks_setup as the vector_db argument to the create_rag_hallucination_test() function.
Postgres VectorDB (pgvector)
DynamoAI supports vector DB setups via Postgres with the pgvector extension. DynamoAI requires API access to the server hosting the Postgres database with the vector extension enabled, along with the following parameters:
- user: user name for the server
- password: password for the server
- host: host address for the server
- port: port number
- db_name: database name
- table_name: table name where the vector index is created
- content_column: name of the column of the table where data content is stored
- id_column: name of the column of the table where the data point id is stored
- ef_inputs: specifies what embedding model was used to create the vector index. ef_type specifies the function provider's name (currently supports SentenceTransformer (sentence_transformer), HuggingFace (hf), and OpenAI (openai)), and model_name specifies the specific model being used. An appropriate api_key should be specified when using models that are not public (e.g., a huggingface-hub access token for HuggingFace, an API key for OpenAI).
pgvector_args = {
"user": "<your-username>",
"password": "<your-password>",
"host": "<your-hostname>",
"port": port_num,
"db_name": "<your-dbname>",
"table_name": "<your-table-name>",
"content_column": "<content-column>",
"id_column": "<id-column>",
"ef_inputs": {
"ef_type": "sentence_transformer",
"model_name": "all-MiniLM-L6-v2",
},
}
from dynamofl import PostgresVectorDB
pgvector_setup = PostgresVectorDB(**pgvector_args)
Then pass this pgvector_setup as the vector_db argument to the create_rag_hallucination_test() function.
CustomRagDB
DynamoAI supports connection to custom RAG applications through the CustomRagDB wrapper. For more details on how to set up a custom RAG application, please see the Custom RAG Application section. DynamoAI requires API access to the custom RAG application's REST API endpoint.
custom_rag_arg = {
"custom_rag_application_id": 12 # id of custom-rag-application
}
from dynamofl import CustomRagDB
custom_rag_setup = CustomRagDB(**custom_rag_arg)
Then pass this custom_rag_setup as the vector_db argument to the create_rag_hallucination_test() function.
Embedding Functions
To connect a vector database, DynamoAI also requires the embedding function provider and model name that were used for embedding the documents; these will be used for document retrieval. DynamoAI currently supports embedding functions from the following providers (a hypothetical ef_inputs example follows the list):
- Hugging Face ('hf')
- OpenAI ('openai')
- Azure OpenAI ('openai_azure')
- Sentence Transformers ('sentence_transformer')
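As a rough sketch, ef_inputs follows the same shape across providers; the model name and access token below are placeholders rather than values verified against the SDK.
# Hypothetical ef_inputs for a gated HuggingFace embedding model (placeholders only)
ef_inputs_hf = {
    "ef_type": "hf",
    "model_name": "<hf-embedding-model-name>",
    "api_key": "<huggingface-hub-access-token>",  # only needed for gated or private models
}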