[SDK] RAG Hallucination Test Quickstart
RAG Hallucination Evaluations with DynamoEval SDK (GPT-3.5-Turbo)
Last updated: November 1st, 2024
This Quickstart showcases an end-to-end walkthrough of:
- how to utilize DynamoAI's SDK and platform solutions to run RAG hallucination evaluations assessing retrieval relevance, response relevance, and response faithfulness.
For demonstration purposes, we will use two datasets as the source documents and queries, each with a different use case:
- (
multidoc2dial-conv-test-quickstart.csv
) Processed version of MultiDoc2Dial dataset for call center application - (
fintabnetqa-sujet-mixed-49.csv
) Processed version of mixture of TableVQABench dataset and Sujet Financial Dataset for Q&A on financial documents and tables.
We will use OpenAI's GPT-3.5-Turbo as the base generation model and a ChromaDB vector database for retrieval.
Prerequisites:
- DynamoAI API token
- OpenAI API Token
Colab notebook using multidoc2dial-conv-test-quickstart.csv data:
Colab notebook using fintabnetqa-sujet-mixed-49.csv data:
Getting an API Token
If you do not have a DynamoAI API token, generate a token by logging into apps.dynamo.ai with your provided credentials.
Navigate to apps.dynamo.ai/profile to generate your DynamoAI API token. This API token will enable you to programmatically connect to the DynamoAI server, create projects, and train models. If you generate multiple API tokens, only your most recent one will work.
You will also need the following credentials throughout the quickstart.
# Set your DynamoAI API token here
DYNAMOFL_API_KEY="<dynamofl-api-key>"
# Set the URL of your locally deployed DynamoAI platform or use "https://api.dynamo.ai"
# if you are using the DynamoAI Sandbox environment
DYNAMOFL_HOST="<dynamofl-platform-host-url>"
# Set your OpenAI API key here
OPENAI_API_KEY = "<your-openai-api-key>"
Environment Setup
Begin by installing the public DynamoAI SDK, importing the required libraries, and downloading the required dataset and model files for the quickstart.
!pip install dynamofl
import os
import time
from dynamofl import DynamoFL
from dynamofl import VRAMConfig, GPUConfig, GPUType
from dynamofl import ChromaDB
Initialization
Create a DynamoFL instance using your API token and host.
dfl = DynamoFL(DYNAMOFL_API_KEY, host=DYNAMOFL_HOST)
Create a model object
First, create a model object. The model object specifies the model that will be evaluated when the create_rag_hallucination_test()
method is run. DynamoAI currently supports two types of model objects: local models and remote model API endpoints.
In this quickstart, we demonstrate running tests on remote models. A remote model object can be used to access a model provided by a third party; DynamoAI currently supports both OpenAI and Azure OpenAI models, as well as Databricks-hosted models.
SLUG = int(time.time())
model_key = "GPT_3.5_Turbo_{}".format(SLUG) # unique model identifier key
# Creating a model referring to OpenAI's GPT-3.5-Turbo
model = dfl.create_openai_model(
key=model_key,
name="openai-gpt-3.5-turbo-rag",
api_key=OPENAI_API_KEY,
api_instance="gpt-3.5-turbo"
)
# model key
print(model.key)
Create a dataset object
To run a RAG hallucination test, we also need to specify the dataset used for evaluation. A dataset can be created by specifying the dataset file path.
At this time, DynamoAI accepts CSV datasets or datasets hosted on the HuggingFace Hub. We recommend using datasets with no more than 100 data points for testing.
The uploaded dataset should satisfy the following conditions (a quick pre-upload check is sketched after this list):
- All rows and columns should contain strings of length at least 5.
- There must be at least one column populated with the queries that will be used as input to the RAG system.
- The number of data points should be more than 10.
- The uploaded CSV file must be parseable with the pandas.read_csv() method without any errors.
- The first row of the CSV file should contain the column names (i.e., it should not be a data point).
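The following is a minimal sketch of such a check using pandas; it is not part of the DynamoAI SDK and assumes the query column is named queries.
import pandas as pd

# Minimal pre-upload check against the conditions listed above (illustrative only)
df = pd.read_csv("multidoc2dial-conv-test-quickstart.csv")  # must parse without errors
assert "queries" in df.columns, "expected a column holding the input queries"
assert len(df) > 10, "the dataset should have more than 10 data points"
assert all(
    df[col].map(lambda x: isinstance(x, str) and len(x) >= 5).all()
    for col in df.columns
), "every cell should be a string of length at least 5"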
For this quickstart, you can use either of the following datasets:
"multidoc2dial-conv-test-quickstart"
: dataset with queries where queries are directly extracted from the original multidoc2dial dataset. The queries will take a form of a paragraph that describes a conversation between an agent and a user, ending with a user's question.
import pandas as pd
test_file = pd.read_csv("multidoc2dial-conv-test-quickstart.csv")
test_file.head().iloc[0]
"fintabnetqa-sujet-mixed-49"
: dataset with queries related to financial documents and tables.
import pandas as pd
test_file = pd.read_csv("fintabnetqa-sujet-mixed-49.csv")
test_file.head().iloc[0]
After checking the dataset, upload the dataset to the DynamoAI platform:
# dataset upload (multidoc2dial-conv-test-quickstart)
dataset_file_path = "multidoc2dial-conv-test-quickstart.csv"
dataset = dfl.create_dataset(key="dataset_{}".format(SLUG), file_path=dataset_file_path, name="multidoc2dial test")
# dataset upload (fintabnetqa-sujet-mixed-49); run this instead if you are using the financial dataset
dataset_file_path = "fintabnetqa-sujet-mixed-49.csv"
dataset = dfl.create_dataset(key="dataset_{}".format(SLUG), file_path=dataset_file_path, name="financial document test")
# dataset id
print(dataset._id)
Specify a Vector DB Instance
To run a RAG hallucination test, a key step is providing the configuration details of the vector DB used for document retrieval.
At this time, the DynamoAI SDK supports several vector database options for connecting to your RAG system:
- ChromaDB
- LlamaIndex VectorStore
- Databricks VectorSearch
- Postgres VectorDB
- Custom RAG applications (via REST API endpoints)
Custom vector databases can be integrated using the DynamoAI CustomRagDB wrapper, which enables connections through REST APIs with flexible request/response transformations. Additional details are available in the Custom RAG Application section.
(see Connecting Vector Databases for more information). If you need help integrating your existing database, please reach out to our team.
For this quickstart, you will be using an existing Chroma vector DB populated with documents from the multidoc2dial-conv-test-quickstart and fintabnetqa-sujet-mixed-49 datasets, using the arguments below.
- host: specifies where the DB is hosted
- port: port number for the connection
- collection: collection name where the contents are stored
- ef_inputs: specifies what embedding model was used to create the vector index. ef_type specifies the function provider's name (currently supports SentenceTransformer (sentence_transformer), HuggingFace (hf), and OpenAI (openai)), and model_name specifies the specific model being used. An appropriate api_key should be specified when using models that are not public (e.g., a huggingface-hub access token for HuggingFace, an API key for OpenAI); see the hypothetical example after this list.
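In this quickstart, the collections were embedded with a public SentenceTransformer model, so no api_key is needed. For reference, a hypothetical ef_inputs for a collection embedded with a non-public OpenAI embedding model might look like the following (the model name is only an example and is not used in this quickstart):
# Hypothetical ef_inputs for a collection embedded with a non-public model; api_key is required
ef_inputs_openai = {
    "ef_type": "openai",
    "model_name": "text-embedding-3-small",  # example OpenAI embedding model
    "api_key": OPENAI_API_KEY,
}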
For the multidoc2dial-conv-test-quickstart dataset, use the following:
# chroma db set up in DynamoAI's trial environment
chroma_args = {
"host": "https://chromadb.internal.dynamo.ai",
"port": 8000,
"collection": "multidoc2dial",
"ef_inputs": {
"ef_type": "sentence_transformer",
"model_name": "all-MiniLM-L6-v2", # public model
},
}
For the fintabnetqa-sujet-mixed-49 dataset, use the following:
# chroma db set up in DynamoAI's trial environment
chroma_args = {
"host": "https://chromadb.internal.dynamo.ai",
"port": 8000,
"collection": "fintabnetqa_10k_mixed", # note the collection name difference
"ef_inputs": {
"ef_type": "sentence_transformer",
"model_name": "all-MiniLM-L6-v2", # public model
},
}
Then create a specification object that can be readily used by the SDK method later.
# Create a ChromaDB specification object
chromadb_setup = ChromaDB(**chroma_args)
Run Tests on OpenAI Model
To run a RAG hallucination test, we can call the create_rag_hallucination_test method. Test creation will submit a test to your cloud machine learning platform, where the test will be run.
First, we need to specify the different test parameters for the method. For further details on best practices for configuring each test parameter, please see the appendix.
Setting Dataset Input Column Name
When configuring a RAG hallucination test, it's important to specify the dataset column that contains the input queries. For both datasets in this quickstart, the queries column contains the queries we want to use.
input_column = "queries"
Setting Model Hyperparameters
The following model hyperparameters can also be modified.
- temperature: controls the amount of "randomness" in your model's generations (0 being deterministic, 1 being random)
- seq_len: length (number of tokens) of the generated sequence
- retrieve_top_k: number of documents retrieved from the vector DB and provided as context to the generation model
DynamoAI provides grid search over these hyperparameter values, where the grid can be specified by a dictionary as follows. Every combination in the grid is evaluated, so the example below corresponds to 2 × 1 × 2 = 4 configurations.
grid=[
{
"temperature": [1.0, 0.1], # run tests over two temperature values
"seq_len": [128],
"retrieve_top_k": [1, 2],
}
]
Setting Test Type and Metrics
DynamoAI currently supports three types of RAG hallucination evaluations:
- retrieval-relevance: measures the relevance of documents retrieved from the vector database to the input queries
- response-relevance: measures the relevance of model-generated responses to the input queries
- faithfulness: measures the faithfulness of model-generated responses to the retrieved document context
To select RAG hallucination evaluation metrics, specify your chosen metrics in the rag_hallucination_metrics parameter of the create_rag_hallucination_test function. For more details on metric descriptions and calculations, please see the appendix.
rag_hallucination_metrics = ["retrieval-relevance", "response-relevance", "faithfulness"]
Specifying Prompt Template
When running RAG hallucination tests, DynamoAI also provides the option to modify the prompt template used for providing context to the model. If you modify the prompt template, please ensure that it includes {context} and {question} placeholders to identify where the retrieved document context and the input query will be inserted. The dataset-specific templates below are used in this quickstart (a filled-prompt example follows them).
- For the multidoc2dial-conv-test-quickstart dataset, because the queries take the form of a conversation, we want a more specific instruction on how to process them, such as the following prompt template:
prompt_template = """Write how the Agent should respond to the User in the conversation, solely based on the context provided.\nContext: {context}\nConversation: {question}"""
- For the fintabnetqa-sujet-mixed-49 dataset, the queries are simple questions, so we can use a more generic instruction for the template:
prompt_template = """Answer the following question based on the context provided.\n\nQuestion: {question}\n\nContext:\n{context}\n\nAnswer:"""
Specifying List of Topics for Clustering
DynamoAI also supports an optional user-provided list of topics to cluster the input texts for better performance analysis. This list can be provided via the topic_list parameter of the create_rag_hallucination_test() method.
When nothing is provided, DynamoAI will automatically cluster the input texts and extract a set of keywords that are representative of each cluster.
- For multidoc2dial-conv-test-quickstart, here is an example topic list:
topic_list=[
"department of motor vehicles",
"student scholarship and financial support",
"veteran benefits and healthcare",
"social security services",
]
- For fintabnetqa-sujet-mixed-49, try leaving the topic list blank:
topic_list = []
Putting It All Together
Now that all test parameters are set up, run the create_rag_hallucination_test() method with those parameters. Running the cell below will create the test and queue it on the machine.
- Set vector_db to a different vector DB setup if you are using one of the other options.
test_info = dfl.create_rag_hallucination_test(
name=f"rag_hallucination_test_{SLUG}", # optional test name
model_key=model.key, # previously created model identifier key
dataset_id=dataset._id, # previously created dataset id
input_column=input_column, # input column name for the queries
prompt_template=prompt_template, # prompt_template
vector_db=chromadb_setup, # chromadb db args
rag_hallucination_metrics=rag_hallucination_metrics, # metrics for the tests
topic_list=topic_list, # list of topics used to cluster
grid=grid, # grid of hyperparameters to repeat the evaluations with
gpu=GPUConfig(gpu_type=GPUType.A10G, gpu_count=1), # default GPU parameters
)
# test id
print(test_info.id)
# Helper function confirming the Attack has been queued
def query_attack_status(attack_id):
    attack_info = dfl.get_attack_info(attack_id)
    print("Attack status: {}.".format(attack_info))

# Check Attack Status. Rerun this cell to see what the status of each test looks like.
# Once all of them show COMPLETED, you can head to the model dashboard page in the DynamoAI UI to check out the report.
all_attacks = test_info.attacks
attack_ids = [attack["id"] for attack in all_attacks]
for attack_id in attack_ids:
    query_attack_status(attack_id)
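If you prefer to wait in the notebook rather than rerun the cell manually, a simple polling loop like the sketch below can be used. It assumes the value returned by get_attack_info() includes the status text (e.g., COMPLETED) in its string representation; adjust the check to your SDK version.
# Optional: poll every 30 seconds until all attacks report COMPLETED (illustrative sketch)
while True:
    statuses = [str(dfl.get_attack_info(attack_id)) for attack_id in attack_ids]
    print(statuses)
    if all("COMPLETED" in status for status in statuses):
        break
    time.sleep(30)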
Viewing Test Results
After your test has been created, navigate to the model dashboard page in the DynamoAI UI (under the Models tab). Here, you should observe that your model and dataset have been created and that your test is running. After the test has completed, a test report file will be created and can be downloaded for a deep-dive into the test results!
Appendix
Metric Definitions
Retrieval Relevance
Retrieval relevance represents the degree of relevance of the documents retrieved from the vector database using the embedding model for each query.
To measure the retrieval relevance, DynamoAI generates an LLM Relevance Score. Various studies have shown the effectiveness of LLMs as reference-free evaluators for tasks such as content relevance.
The LLM Relevance Score is computed by prompting an LLM to evaluate the sufficiency of the content in the retrieved documents to answer a given user query. For each query, DynamoAI computes the relevance score against the top document retrieved from the vector database. Based on this score, each (query, retrieved document) pair is classified as either positive (relevant) or negative (not relevant). A negative classification indicates that the retrieved document may not contain the key information required to answer the given query.
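For intuition only, an LLM-as-judge prompt for this kind of sufficiency check could look like the sketch below. This is not DynamoAI's internal evaluation prompt; it simply illustrates the idea of asking an LLM to grade whether a retrieved document can answer a query.
# Illustrative relevance-judging prompt (not DynamoAI's actual prompt)
relevance_judge_prompt = (
    "You are grading a retrieval system.\n"
    "Query: {query}\n"
    "Retrieved document: {document}\n"
    "Does the document contain the key information needed to answer the query? "
    "Answer 'relevant' or 'not relevant'."
)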
Response Faithfulness
Response faithfulness represents how faithful model-generated responses are to the retrieved documents.
To measure response faithfulness, DynamoAI relies on an NLI consistency score. Here, DynamoAI uses a natural language inference (NLI) model, which labels a (retrieved document, generated response) pair as entailing, contradicting, or neutral, with corresponding scores.
DynamoAI specifically runs the NLI model over each pair of sentences from a retrieved document and the generated response. This enables DynamoAI to retain the highest entailment score for each sentence of the generated response. Finally, DynamoAI takes the mean of these maximum entailment scores across all sentences in the response to produce an aggregate score for the full response. Based on this score, each (set of retrieved documents, response) pair is classified as either positive (no issue) or negative (the response potentially misses or contradicts some key information in the retrieved documents).
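A small worked example of this max-then-mean aggregation, using made-up entailment scores (the real scores come from the NLI model):
# Illustrative aggregation with fabricated scores; rows are response sentences,
# columns are document sentences.
entailment_scores = [
    [0.91, 0.30, 0.12],  # response sentence 1 vs each document sentence
    [0.05, 0.22, 0.18],  # response sentence 2 vs each document sentence
]
per_sentence_max = [max(row) for row in entailment_scores]          # [0.91, 0.22]
faithfulness_score = sum(per_sentence_max) / len(per_sentence_max)  # 0.565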
Response Relevance
Response relevance represents the relevance of model-generated responses to the query.
To measure response relevance, DynamoAI generates a Response Relevance Score. Similar to the Retrieval Relevance Score, the Response Relevance Score is computed by prompting an LLM to generate a score between 0 and 1 that indicates how relevant the generated response is to the question. Based on this score, each (query, response) pair is classified as either positive (relevant) or negative (not relevant). A negative classification indicates that the response may not contain the information that answers the query.
Connecting Vector Databases
For RAG workflows, a vector database is generally used to retrieve the documents most relevant to a particular input query. This is done by:
- Passing the input query to the same embedding model used to create the vector database to generate a vector embedding representation of the query
- Querying the vector database with the embedding representation of the query to retrieve the most semantically similar vectors in the database (representing documents)
- Using these retrieved vectors to fetch the documents they represent and passing these as context to the language model
To support evaluation on this workflow, DynamoAI requires providing a VectorDB connection to RAG Hallucination tests. If you need help setting up a vector database connection, please reach out to our team.
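As a rough illustration of this retrieval step using the chromadb Python client (DynamoAI performs this internally during a test; the host, collection name, and query below are placeholders):
# Illustrative retrieval step with the chromadb client (not part of the DynamoAI SDK)
import chromadb

client = chromadb.HttpClient(host="<chroma-host>", port=8000)
# The embedding function should match the one used to build the collection
collection = client.get_collection("<collection-name>")
results = collection.query(
    query_texts=["How do I renew my driver's license?"],  # placeholder query
    n_results=2,                                          # analogous to retrieve_top_k
)
retrieved_context = "\n".join(results["documents"][0])    # passed to the LLM as {context}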
ChromaDB
DynamoAI supports connection to Chroma vector databases. To create a ChromaDB object, DynamoAI requires the host, port, and collection name of a persistent database instance. The database instance must be hosted in the same VPC as your DynamoAI deployment.
LlamaIndex
DynamoAI supports vector indices created using llama-index. To use LlamaIndex vector indices, DynamoAI requires access to a remote S3 bucket that acts as a persistent directory, along with the necessary credentials and configuration details (e.g., the type and name of the embedding model used to create the vector index). These files do not need to be hosted in the same VPC as your DynamoAI deployment.
To set up a connection to a LlamaIndex VectorStore through the DynamoAI SDK, the following fields are required:
- aws_key and aws_secret: AWS S3 credentials
- s3_bucket_name: bucket name to be used as a remote persistent directory for the vector index
- ef_inputs: specifies what embedding model was used to create the vector index. ef_type specifies the function provider's name (currently supports SentenceTransformer (sentence_transformer), HuggingFace (hf), and OpenAI (openai)), and model_name specifies the specific model being used. An appropriate api_key should be specified when using models that are not public (e.g., a huggingface-hub access token for HuggingFace, an API key for OpenAI).
# Set your AWS access key here (only when using LlamaIndex vectorstore)
AWS_KEY = "<s3-aws-key>" # AWS credentials to connect to remote S3 with persistent directory
AWS_SECRET_KEY = "<s3-aws-secret-key>" # AWS credentials to connect to remote S3 with persistent directory
# llamaindex vectorstore set up in DynamoAI's trial environment
llamaindex_arg = {
"aws_key": AWS_KEY,
"aws_secret": AWS_SECRET_KEY,
"s3_bucket_name": "<s3_bucket_name>",
"ef_inputs": {
"ef_type": "sentence_transformer", # embedding function provider
"model_name": "all-MiniLM-L6-v2", # embedding function model name
},
}
from dynamofl import LlamaIndexDB
llamaindex_setup = LlamaIndexDB(**llamaindex_arg)
Then pass this llamaindex_setup as the vector_db argument to the create_rag_hallucination_test() function.
Databricks VectorSearch
DynamoAI supports vector search using Databricks VectorSearch. DynamoAI requires API access to the Databricks workspace hosting the vector index endpoint, along with the following parameters:
- host: Databricks workspace URL
- index_name: vector index name
- token: Databricks workspace API access token
- id_column: column name of the database where the data point id is stored
- content_column: column name of the database where the content is stored
# Set your Databricks credentials here
DBRX_HOST = "<databricks-workspace-url>"
DBRX_TOKEN = "<databricks-access-token>"
DBRX_INDEX = "<databricks-vector-index-name>"
dbrx_args = {
"host": DBRX_HOST,
"index_name": DBRX_INDEX,
"token": DBRX_TOKEN,
"id_column": "id",
"content_column": "content",
}
from dynamofl import DatabricksVectorSearch
databricks_setup = DatabricksVectorSearch(**dbrx_args)
Then pass this databricks_setup as the vector_db argument to the create_rag_hallucination_test() function.
Postgres VectorDB (pgvector)
DynamoAI supports vector DB setups via Postgres with the pgvector extension. DynamoAI requires API access to the server hosting the Postgres database with the vector extension enabled, along with the following parameters:
- user: user name for the server
- password: password for the server
- host: host address for the server
- port: port number
- db_name: database name
- table_name: table name where the vector index is created
- content_column: name of the column of the table where data content is stored
- id_column: name of the column of the table where the data point id is stored
- ef_inputs: specifies what embedding model was used to create the vector index. ef_type specifies the function provider's name (currently supports SentenceTransformer (sentence_transformer), HuggingFace (hf), and OpenAI (openai)), and model_name specifies the specific model being used. An appropriate api_key should be specified when using models that are not public (e.g., a huggingface-hub access token for HuggingFace, an API key for OpenAI).
pgvector_args = {
"user": "<your-username>",
"password": "<your-password>",
"host": "<your-hostname>",
"port": port_num,
"db_name": "<your-dbname>",
"table_name": "<your-table-name>",
"content_column": "<content-column>",
"id_column": "<id-column>",
"ef_inputs": {
"ef_type": "sentence_transformer",
"model_name": "all-MiniLM-L6-v2",
},
}
from dynamofl import PostgresVectorDB
pgvector_setup = PostgresVectorDB(**pgvector_args)
Then pass this pgvector_setup as the vector_db argument to the create_rag_hallucination_test() function.
CustomRagDB
DynamoAI supports connection to custom RAG applications through the CustomRagDB wrapper. For more details on how to set up a custom RAG application, please see the Custom RAG Application section. DynamoAI requires API access to the custom RAG application's REST API endpoint.
custom_rag_arg = {
"custom_rag_application_id": 12 # id of custom-rag-application
}
from dynamofl import CustomRagDB
custom_rag_setup = CustomRagDB(**custom_rag_arg)
Then pass this custom_rag_setup as the vector_db argument to the create_rag_hallucination_test() function.
Embedding Functions
To connect a vector database, DynamoAI also requires the embedding function provider and model name that were used for embedding the documents; these will be used for document retrieval. DynamoAI currently supports embedding functions from the following providers (a hypothetical ef_inputs example follows the list):
- Hugging Face ('hf')
- OpenAI ('openai')
- Azure OpenAI ('openai_azure')
- Sentence Transformers ('sentence_transformer')
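As a rough sketch, ef_inputs follows the same shape across providers; the model name and access token below are placeholders rather than values verified against the SDK.
# Hypothetical ef_inputs for a gated HuggingFace embedding model (placeholders only)
ef_inputs_hf = {
    "ef_type": "hf",
    "model_name": "<hf-embedding-model-name>",
    "api_key": "<huggingface-hub-access-token>",  # only needed for gated or private models
}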