Evaluate SDK Library
dynamofl.evaluate SDK reference
These DynamoFL SDK methods provide tools for quantitatively assessing LLM performance.
from dynamofl.evaluate import *
Example Script
from dynamofl.evaluate import (
calculate_compression_ratio,
calculate_cosine_similarity_bert,
calculate_cosine_similarity_tfidf,
calculate_ngram_overlap,
calculate_coverage,
compute_metrics,
)
sample_text = {
"reference_text": "The Japanese Zen term shoshin translates as ‘beginner’s mind’ and refers to a paradox: the more you know about a subject, the more likely you are to close your mind to further learning. As the Zen monk Shunryu Suzuki put it in his book Zen Mind, Beginner’s Mind (1970): ‘In the beginner’s mind there are many possibilities, but in the expert’s there are few.’ Many historical examples demonstrate how the expert mind (or feeling like an expert) can lead to closed-mindedness and the obstruction of scientific progress. In 1912, for instance, when the German geophysicist and explorer Alfred Wegener proposed – counter to the received wisdom of the day – that the Earth is made up of shifting continental plates, he was ridiculed by expert geologists around the world. His German compatriots referred to his ‘delirious ravings’ while experts in the United States accused him of peddling pseudoscience. It would take decades before the orthodoxy was overturned and the accuracy of his theory was acknowledged. Similar stories abound. In my own field of neuroscience, for example, belief in the legendary Spanish neuroscientist Santiago Ramón y Cajal’s ‘harsh decree’ that adult humans are unable to grow new neurons persisted for decades in the face of mounting contradictory evidence. Intellectual hubris doesn’t afflict only established scientific experts. Merely having a university degree in a subject can lead people to grossly overestimate their knowledge. In one pertinent study in 2015, researchers at Yale University asked graduates to estimate their knowledge of various topics relevant to their degrees, and then tested their actual ability to explain those topics. The participants frequently overestimated their level of understanding, apparently mistaking the ‘peak knowledge’ they had at the time they studied at university for their considerably more modest current knowledge. Unfortunately, just as Suzuki wrote and as historical anecdotes demonstrate, there is research evidence that even feeling like an expert also breeds closed-mindedness. Another study involved giving people the impression that they were relatively expert on a topic (for example, by providing them with inflated scores on a test of political knowledge), which led them to be less willing to consider other political viewpoints – a phenomenon the researchers called ‘the earned dogmatism effect’.",
"summary_text": "The Japanese Zen term shoshin translates as beginners mind and refers to a paradox the more you know about a subject, the more likely you are to close your mind to further learning. As the Zen monk Shunryu Suzuki put it in his book Zen Mind, Beginners Mind 1970 In the beginners mind there are many possibilities, but in the experts there are few. Many historical examples demonstrate how the expert mind or feeling like an expert can lead to closed-mindedness and the obstruction of scientific progress. In 1912, for instance, when the German geophysicist and explorer Alfred Wegener proposed counter to the received wisdom of the day that the Earth is made up of shifting continental plates, he was ridiculed by expert geologists around the world. His German compatriots referred to his delirious ravings while experts in the United States accused him of peddling pseudoscience. It would take decades before the orthodoxy was overturned and the accuracy of his theory was acknowledged. Similar stories abound. In my own field of neuroscience, for example, belief in the legendary Spanish neuroscientist Santiago Ramn y Cajals harsh decree that adult humans are unable to grow new neurons persisted for decades in the face of mounting contradictory evidence."
}
'''
Use the compute_metrics() wrapper method for calculating text comparison metrics
between a reference text and a summary text.
'''
text_comparison_results = compute_metrics(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
# Add custom arguments for 'cosine_similarity_bert' and 'ngram_overlap' methods
kwargs={
'calculate_cosine_similarity_bert': {'max_chunk_tokens': 256},
'calculate_ngram_overlap': {'n': 5}
}
)
'''
Alternatively, call one or more text comparison methods individually
'''
compression_ratio_results = calculate_compression_ratio(
text=sample_text["summary_text"],
)
cosine_similarity_bert_results = calculate_cosine_similarity_bert(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)
cosine_similarity_tfidf_results = calculate_cosine_similarity_tfidf(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)
ngram_overlap_results = calculate_ngram_overlap(
document=sample_text["reference_text"],
generated_summary=sample_text["summary_text"],
)
coverage_results = calculate_coverage(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)
Method calculate_compression_ratio()
This method measures how repetitive a model's output is. The higher the compression ratio, the more repetitive the text.
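The SDK computes this ratio internally; as a rough intuition for why compressibility measures repetition, the sketch below (not the SDK's implementation) compares the zlib-compressed size of repetitive and varied text. The helper name rough_compression_ratio is purely illustrative.
import zlib
def rough_compression_ratio(text: str) -> float:
    # Redundant (repetitive) text compresses further, so the ratio rises.
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))
repetitive = "The model repeats itself. " * 40
varied = (
    "Alfred Wegener proposed in 1912 that the Earth is made up of "
    "shifting continental plates, a view ridiculed before plate "
    "tectonics was eventually accepted."
)
print(rough_compression_ratio(repetitive))  # much greater than 1
print(rough_compression_ratio(varied))      # closer to 1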
Release Notes
Added support for calculating the compression ratio for a summarized text.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
text | string | ✅ | The input text that you want to calculate the compression ratio for. |
return_dict | boolean | ❌ | Indicates whether to return the compression ratio as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the compression_ratio key and its corresponding value if return_dict is True. |
float | The compression ratio as a floating-point number if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_compression_ratio
results = calculate_compression_ratio(
text="This is a sample text.",
)
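Per the Returns table above, the default is a dictionary; pass return_dict=False to receive the raw float instead.
print(results["compression_ratio"])
ratio = calculate_compression_ratio(
    text="This is a sample text.",
    return_dict=False,
)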
Method calculate_cosine_similarity_bert()
This method calculates the cosine similarity between a reference text and a summary text using BERT [CLS] (classification token) embeddings. It tokenizes the text, splits it into smaller chunks, computes a cosine similarity for each chunk, and returns the results as a dictionary or tuple.
Release Notes
Added support for calculating cosine similarity using BERT [CLS] embeddings.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to compare against the reference text. |
max_chunk_tokens | int | ❌ | The maximum number of tokens allowed in each text chunk. Default is 512. |
round_to | int | ❌ | The number of decimal places to round the similarity scores to. Default is 2. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of chunks ("num_chunks_bert") and the similarity scores ("similarity_bert") if return_dict is True. |
tuple | A tuple containing the number of chunks and the similarity scores as a list if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_cosine_similarity_bert
results = calculate_cosine_similarity_bert(
reference_text="Reference text",
summary_text="Summary text",
)
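For references longer than max_chunk_tokens, the text is scored chunk by chunk; the documented keys expose both the chunk count and the per-chunk scores.
print(results["num_chunks_bert"])   # number of chunks scored
print(results["similarity_bert"])   # similarity score(s)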
Method calculate_cosine_similarity_tfidf()
This method calculates the cosine similarity between a reference text and a summary text using TF-IDF (Term Frequency-Inverse Document Frequency) vectors.
Release Notes
Added support for calculating cosine similarity using the TfidfVectorizer.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to compare against the reference text. |
max_chunk_words | int | ❌ | The maximum number of words allowed in each text chunk. Default is 512. |
round_to | int | ❌ | The number of decimal places to round the similarity scores to. Default is 2. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of chunks ("num_chunks_tfidf") and the similarity scores ("similarity_tfidf") if return_dict is True. |
tuple | A tuple containing the number of chunks and the similarity scores as a list if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_cosine_similarity_tfidf
results = calculate_cosine_similarity_tfidf(
reference_text="Reference text",
summary_text="Summary text",
)
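With return_dict=False the method returns a tuple instead, ordered as in the Returns table: the chunk count followed by the similarity scores.
num_chunks, similarities = calculate_cosine_similarity_tfidf(
    reference_text="Reference text",
    summary_text="Summary text",
    return_dict=False,
)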
Method calculate_ngram_overlap()
This method calculates the n-gram overlap between a document and a generated summary.
Release Notes
Added support for calculating n-gram overlap between a document and a generated summary.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
document | string | ✅ | The document content used for comparison. |
generated_summary | string | ✅ | The generated summary text to compare against the document. |
n | int | ❌ | The value of n for n-grams. Default is 10. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of document n-grams ("document_ngrams"), the number of generated n-grams ("generated_ngrams"), common n-grams ("common_ngrams"), and the overlap ratio ("overlap") if return_dict is True. |
tuple | A tuple containing the number of document n-grams, the number of generated n-grams, common n-grams, and the overlap ratio as individual elements if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_ngram_overlap
results = calculate_ngram_overlap(
document="Document content",
generated_summary="Generated summary",
)
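The documented keys expose the raw n-gram counts alongside the overlap ratio; lowering n makes the measure more permissive, since shorter n-grams are easier to match.
print(results["common_ngrams"], results["overlap"])
# Trigram overlap instead of the default n=10
trigram_results = calculate_ngram_overlap(
    document="Document content",
    generated_summary="Generated summary",
    n=3,
)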
Method calculate_coverage()
This method calculates coverage using n-grams between a reference text and a summary text by splitting the reference text into chunks. Summaries with good coverage should overlap evenly with all chunks of the reference text. Conversely, a summary covering only the first few lines will have uneven overlap scores across chunks.
Release Notes
Added support for calculating coverage using n-grams between a reference text and a summary text.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to calculate coverage against. |
max_chunk_words | int | ❌ | The maximum number of words allowed in each text chunk. Default is 512. |
n | int | ❌ | The value of n for n-grams. Default is 10. |
round_to | int | ❌ | The number of decimal places to round the coverage scores to. Default is 2. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of chunks ("num_chunks_ngram") and the coverage scores ("coverage") if return_dict is True. |
tuple | A tuple containing the number of chunks and the coverage scores as a list if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_coverage
results = calculate_coverage(
reference_text="Reference text",
summary_text="Summary text",
)
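As described above, an even spread of per-chunk scores indicates the summary draws on the whole reference rather than only its opening lines. Assuming the documented "coverage" value is a per-chunk list, a quick evenness check might look like this:
scores = results["coverage"]
spread = max(scores) - min(scores)  # a small spread suggests even coverage
print(results["num_chunks_ngram"], spread)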
Method compute_metrics()
This method is a convenient wrapper for calculating text comparison metrics between a reference text and a summary text. It combines compression ratio, cosine similarity using BERT embeddings, cosine similarity using TF-IDF vectors, n-gram overlap, and coverage, so that multiple metrics can be obtained with a single call.
Release Notes
Added a wrapper for computing all of the above text comparison metrics in a single call.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to calculate metrics against. |
return_all | boolean | ❌ | Indicates whether to return all values of all metrics (see the values returned by the methods above). Default is False. |
kwargs | dict | ❌ | A dictionary of metric-specific parameters as key-value pairs (see the parameters accepted by the methods above). |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the computed metrics. |
Example Usage
from dynamofl.evaluate import compute_metrics
results = compute_metrics(
reference_text="Reference text",
summary_text="Summary text",
# Add custom arguments for 'cosine_similarity_bert' and 'ngram_overlap' methods
kwargs={
'calculate_cosine_similarity_bert': {'max_chunk_tokens': 256},
'calculate_ngram_overlap': {'n': 5}
}
)
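Because the combined dictionary's keys depend on the underlying metrics (and on return_all), it is safest to iterate over the result rather than assume specific keys:
for metric, value in results.items():
    print(metric, value)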
Method retrieval_relevance_judge_text()
This method judges whether each retrieved context is relevant to its question by prompting a Mistral model.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
questions | string or list | ✅ | A single question or a list of questions to check. |
retrieved | string or list | ✅ | A retrieved context or a list of retrieved contexts. Must be the same length as questions. |
api_key | string | ❌ | Mistral API key for accessing the Mistral LLM judge. Defaults to None; if not provided, the method looks for the MISTRAL_API_KEY environment variable. If that variable is not set, a valid API key must be passed here. |
Returns
Return Type | Description |
---|---|
dict of list | A dictionary of lists containing the computed retrieval relevance labels and explanations. |
Example Usage
from dynamofl.evaluate import retrieval_relevance_judge_text
# set the Mistral API key
api_key = "<your-mistral-api-key>"
# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]
# run retrieval relevance check
output = retrieval_relevance_judge_text(question_lst, context_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]
# Expected output: labels[0] == 1 and labels[1] == 0
# because the first question is relevant to the first context, but the second question is not relevant to the second context.
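Since the parameters accept a string as well as a list, a single pair can also be checked directly; the output is assumed to keep the same dict-of-lists shape:
single = retrieval_relevance_judge_text(
    question_lst[0],
    context_lst[0],
    api_key=api_key,
)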
Method faithfulness_judge_text()
This method judges whether each answer is faithful to its retrieved context, using NLI models.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
questions | string or list | ✅ | A single question or a list of questions to check. |
retrieved | string or list | ✅ | A retrieved context or a list of retrieved contexts. Must be the same length as answers. |
answers | string or list | ✅ | An answer or a list of answers. |
api_key | string | ❌ | Mistral API key for accessing the Mistral LLM judge. Defaults to None; if not provided, the method looks for the MISTRAL_API_KEY environment variable. If that variable is not set, a valid API key must be passed here. |
Returns
Return Type | Description |
---|---|
dict of list | A dictionary of lists containing the computed faithfulness labels and explanations. |
Example Usage
from dynamofl.evaluate import faithfulness_judge_text
# set the Mistral API key
api_key = "<your-mistral-api-key>"
# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]
answer_lst = [
"You may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors.",
"A Veteran who s the qualifying CHAMPVA sponsor for their family may also qualify for the VA health care program based on their own Veteran status.",
]
output = faithfulness_judge_text(question_lst, context_lst, answer_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]
# Expected output: labels[0] == 1 and labels[1] == 0
# because the first answer is faithful to the first context, but the second answer is not faithful to the second context.
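The parallel labels and explanations lists make it straightforward to surface the answers the judge rejected, together with its reasoning:
for question, answer, label, why in zip(question_lst, answer_lst, labels, explanations):
    if label == 0:
        print(question, answer, why)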
Method response_relevance_judge_text()
This method judges whether each answer is relevant to its question by prompting a Mistral model.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
questions | string or list | ✅ | A single question or a list of questions to check. |
answers | string or list | ✅ | An answer or a list of answers. Must be the same length as questions. |
api_key | string | ❌ | Mistral API key for accessing the Mistral LLM judge. Defaults to None; if not provided, the method looks for the MISTRAL_API_KEY environment variable. If that variable is not set, a valid API key must be passed here. |
Returns
Return Type | Description |
---|---|
dict of list | A dictionary of lists containing the computed response relevance labels and explanations. |
Example Usage
from dynamofl.evaluate import response_relevance_judge_text
# set the Mistral API key
api_key = "<your-mistral-api-key>"
# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]
answer_lst = [
"You may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors.",
"Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795.",
]
# run response relevance check
output = response_relevance_judge_text(question_lst, answer_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]
# Expected output: labels[0] == 1 and labels[1] == 0
# because the first response is relevant to the first question, but the second response is not relevant to the second question.
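Taken together, the three judges form a simple RAG triage pass over (question, context, answer) triplets. A minimal sketch, reusing question_lst, context_lst, and answer_lst from the examples above and assuming MISTRAL_API_KEY is set in the environment:
from dynamofl.evaluate import (
    retrieval_relevance_judge_text,
    faithfulness_judge_text,
    response_relevance_judge_text,
)
retrieval = retrieval_relevance_judge_text(question_lst, context_lst)
faithfulness = faithfulness_judge_text(question_lst, context_lst, answer_lst)
relevance = response_relevance_judge_text(question_lst, answer_lst)
# A triplet passes only if all three judges label it 1
passes = [
    r == f == a == 1
    for r, f, a in zip(retrieval["labels"], faithfulness["labels"], relevance["labels"])
]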
Method generate_qa_from_context()
This method generates question/answer pairs from the given list of contexts and saves a .csv file containing (context, questions, answers) under the file name provided as the csv_file_name parameter. When the list of contexts is very large, the method supports sampling a subset of contexts to generate question/answer pairs from: sample_method="random" draws random contexts from the list, while sample_method="idds" selects contexts via in-domain diversity sampling (IDDS) to keep the sampled contexts' semantic representations diverse.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
context_lst | list of str | ✅ | The list of context used to generate queries. |
api_key | string | ✅ | API key for the text generation model. Can alternatively be supplied via an environment variable. |
prompt_template | string | ❌ | Prompt template for the language model. Must contain both {context} and {num_questions} as placeholders. |
system_prompt | string | ❌ | System prompt for the language model. |
csv_file_name | string | ❌ | File name for the generated CSV file. Defaults to generated_queries.csv. |
model_type | string | ❌ | The type of text generation model used (only "openai" is supported). Defaults to "openai". |
model_name | string | ❌ | The model name within the specified model_type. Defaults to "gpt-3.5-turbo". |
num_questions_per_doc | integer | ❌ | Number of questions to generate per context. Defaults to 3. |
do_sample | boolean | ❌ | Whether to sample contexts to generate questions from. Defaults to False. |
sample_num | integer | ❌ | Number of sampled contexts to generate questions from. Defaults to 5. |
sample_method | string | ❌ | Sampling strategy: "random" or "idds". Defaults to "random". |
batch | boolean | ❌ | Whether to use batching. Defaults to True. |
batch_size | integer | ❌ | Batch size. Defaults to 5. |
Returns
None
Example Usage
from dynamofl.evaluate import generate_qa_from_context
# set the openai API key
api_key = "<your-openai-api-key>"
# or set the openai API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-openai-api-key>"
# list of context from which the question and answer should be generated
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]
# run generation
generate_qa_from_context(context_lst, api_key=api_key)
# will save a csv file named `generated_queries.csv` with context, question, and answer columns.
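For a large context list, the sampling parameters documented above restrict generation to a diverse subset; reading the output back with pandas is a convenient way to inspect it. The file name below is illustrative, and the exact column headers follow the method description (context, question, and answer columns).
import pandas as pd
# Sample a diverse subset of contexts before generating
generate_qa_from_context(
    context_lst,
    api_key=api_key,
    do_sample=True,
    sample_num=2,
    sample_method="idds",
    csv_file_name="sampled_queries.csv",
)
df = pd.read_csv("sampled_queries.csv")
print(df.head())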