Evaluate SDK Library

dynamofl.evaluate SDK reference

These DynamoFL SDK methods provide tools for the quantitative assessment of LLM performance.

from dynamofl.evaluate import *

Example Script

from dynamofl.evaluate import (
calculate_compression_ratio,
calculate_cosine_similarity_bert,
calculate_cosine_similarity_tfidf,
calculate_ngram_overlap,
calculate_coverage,
compute_metrics,
)

sample_text = {
"reference_text": "The Japanese Zen term shoshin translates as ‘beginner’s mind’ and refers to a paradox: the more you know about a subject, the more likely you are to close your mind to further learning. As the Zen monk Shunryu Suzuki put it in his book Zen Mind, Beginner’s Mind (1970): ‘In the beginner’s mind there are many possibilities, but in the expert’s there are few.’ Many historical examples demonstrate how the expert mind (or feeling like an expert) can lead to closed-mindedness and the obstruction of scientific progress. In 1912, for instance, when the German geophysicist and explorer Alfred Wegener proposed – counter to the received wisdom of the day – that the Earth is made up of shifting continental plates, he was ridiculed by expert geologists around the world. His German compatriots referred to his ‘delirious ravings’ while experts in the United States accused him of peddling pseudoscience. It would take decades before the orthodoxy was overturned and the accuracy of his theory was acknowledged. Similar stories abound. In my own field of neuroscience, for example, belief in the legendary Spanish neuroscientist Santiago Ramón y Cajal’s ‘harsh decree’ that adult humans are unable to grow new neurons persisted for decades in the face of mounting contradictory evidence. Intellectual hubris doesn’t afflict only established scientific experts. Merely having a university degree in a subject can lead people to grossly overestimate their knowledge. In one pertinent study in 2015, researchers at Yale University asked graduates to estimate their knowledge of various topics relevant to their degrees, and then tested their actual ability to explain those topics. The participants frequently overestimated their level of understanding, apparently mistaking the ‘peak knowledge’ they had at the time they studied at university for their considerably more modest current knowledge. 
Unfortunately, just as Suzuki wrote and as historical anecdotes demonstrate, there is research evidence that even feeling like an expert also breeds closed-mindedness. Another study involved giving people the impression that they were relatively expert on a topic (for example, by providing them with inflated scores on a test of political knowledge), which led them to be less willing to consider other political viewpoints – a phenomenon the researchers called ‘the earned dogmatism effect’.",
"summary_text": "The Japanese Zen term shoshin translates as beginners mind and refers to a paradox the more you know about a subject, the more likely you are to close your mind to further learning. As the Zen monk Shunryu Suzuki put it in his book Zen Mind, Beginners Mind 1970 In the beginners mind there are many possibilities, but in the experts there are few. Many historical examples demonstrate how the expert mind or feeling like an expert can lead to closed-mindedness and the obstruction of scientific progress. In 1912, for instance, when the German geophysicist and explorer Alfred Wegener proposed counter to the received wisdom of the day that the Earth is made up of shifting continental plates, he was ridiculed by expert geologists around the world. His German compatriots referred to his delirious ravings while experts in the United States accused him of peddling pseudoscience. It would take decades before the orthodoxy was overturned and the accuracy of his theory was acknowledged. Similar stories abound. In my own field of neuroscience, for example, belief in the legendary Spanish neuroscientist Santiago Ramn y Cajals harsh decree that adult humans are unable to grow new neurons persisted for decades in the face of mounting contradictory evidence."
}

'''
Use the compute_metrics() wrapper method for calculating text comparison metrics
between a reference text and a summary text.
'''
text_comparison_results = compute_metrics(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
# Add custom arguments for 'cosine_similarity_bert' and 'ngram_overlap' methods
kwargs={
'calculate_cosine_similarity_bert': {'max_chunk_tokens': 256},
'calculate_ngram_overlap': {'n': 5}
}
)

'''
Alternatively, call one or more text comparison methods individually
'''
compression_ratio_results = calculate_compression_ratio(
text=sample_text["summary_text"],
)

cosine_similarity_bert_results = calculate_cosine_similarity_bert(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)

cosine_similarity_tfidf_results = calculate_cosine_similarity_tfidf(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)

ngram_overlap_results = calculate_ngram_overlap(
document=sample_text["reference_text"],
generated_summary=sample_text["summary_text"],
)

coverage_results = calculate_coverage(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)

Method calculate_compression_ratio()

This method measures how repetitive a model's output is. The higher the compression ratio, the more repetitive the text.
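
The exact formula used by the SDK is not documented on this page. As a rough sketch of the idea, one common definition divides the raw byte length of a text by its compressed length. The `compression_ratio` helper below is hypothetical, uses `zlib` as an assumed compressor, and is not the DynamoFL implementation.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw byte length to zlib-compressed byte length (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


repetitive = "the cat sat on the mat. " * 40
varied = "Alfred Wegener proposed in 1912 that continents drift across the Earth's surface."

# Compare a looping sentence against a single varied sentence.
print(compression_ratio(repetitive), compression_ratio(varied))
```

Repeated phrases give the compressor more to reuse, so the looping sentence scores well above the single varied sentence.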

Release Notes

Added support for calculating the compression ratio for a summarized text.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| text | string | | The input text that you want to calculate the compression ratio for. |
| return_dict | boolean | | Indicates whether to return the compression ratio as a dictionary. Default is True. |

Returns

| Return Type | Description |
| --- | --- |
| dict | A dictionary containing the compression_ratio key and its corresponding value if return_dict is True. |
| float | The compression ratio as a floating-point number if return_dict is False. |

Example Usage

from dynamofl.evaluate import calculate_compression_ratio

results = calculate_compression_ratio(
text="This is a sample text.",
)

Method calculate_cosine_similarity_bert()

This method calculates the cosine similarity between a reference text and a summary text using BERT CLS (Classification) embeddings. It tokenizes the text, chunks it into smaller pieces, computes cosine similarities for each chunk, and returns the results as a dictionary or tuple.

Release Notes

Added support for calculating cosine similarity using BERT [CLS] embeddings.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| reference_text | string | | The reference text used for comparison. |
| summary_text | string | | The summary text to compare against the reference text. |
| max_chunk_tokens | int | | The maximum number of tokens allowed in each text chunk. Default is 512. |
| round_to | int | | The number of decimal places to round the similarity scores to. Default is 2. |
| return_dict | boolean | | Indicates whether to return the results as a dictionary. Default is True. |

Returns

| Return Type | Description |
| --- | --- |
| dict | A dictionary containing the number of chunks ("num_chunks_bert") and the similarity scores ("similarity_bert") if return_dict is True. |
| tuple | A tuple containing the number of chunks and the similarity scores as a list if return_dict is False. |
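
Because the scores are reported per chunk, it can help to collapse them before comparing summaries. The sketch below assumes that "similarity_bert" holds one score per chunk. The `results` dictionary is a made-up stand-in shaped like the documented return value, and `summarize_chunk_scores` is a hypothetical helper, not part of the SDK.

```python
from statistics import mean

# Stand-in for a result shaped like the documented return value;
# these scores are invented for illustration, not real SDK output.
results = {"num_chunks_bert": 3, "similarity_bert": [0.91, 0.88, 0.42]}


def summarize_chunk_scores(scores: list) -> dict:
    """Collapse per-chunk scores into a mean, a minimum, and the weakest chunk index."""
    if not scores:
        raise ValueError("scores must not be empty")
    weakest = min(range(len(scores)), key=scores.__getitem__)
    return {
        "mean": round(mean(scores), 2),
        "min": scores[weakest],
        "weakest_chunk": weakest,
    }


summary = summarize_chunk_scores(results["similarity_bert"])
print(summary)
```

A low minimum alongside a high mean points at a single chunk where the summary drifts from the reference.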

Example Usage

from dynamofl.evaluate import calculate_cosine_similarity_bert

results = calculate_cosine_similarity_bert(
reference_text="Reference text",
summary_text="Summary text",
)

Method calculate_cosine_similarity_tfidf()

This method calculates the cosine similarity between a reference text and a summary text using TF-IDF (Term Frequency-Inverse Document Frequency) vectors.
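
To make the technique concrete, here is a minimal pure-Python sketch of TF-IDF cosine similarity between two texts. It is not the SDK's code: the whitespace tokenizer, the smoothed IDF formula, and the `tfidf_cosine` name are all assumptions chosen for illustration, and chunking is left out.

```python
import math
from collections import Counter


def tfidf_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts under a two-document TF-IDF weighting."""
    docs = [Counter(text_a.lower().split()), Counter(text_b.lower().split())]
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed inverse document frequency over the two-document corpus.
    idf = {
        term: math.log((1 + len(docs)) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocab
    }
    vecs = [{term: doc[term] * idf[term] for term in doc} for doc in docs]
    dot = sum(weight * vecs[1].get(term, 0.0) for term, weight in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in vec.values())) for vec in vecs]
    if 0.0 in norms:
        return 0.0  # an empty text has no direction to compare
    return dot / (norms[0] * norms[1])


print(tfidf_cosine("the earth has shifting plates", "the earth has plates"))
```

Terms that appear in only one of the two texts receive a higher IDF weight, so they pull the two vectors apart more strongly than shared terms pull them together.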

Release Notes

Added support for calculating cosine similarity using the TfIdfVectorizer.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| reference_text | string | | The reference text used for comparison. |
| summary_text | string | | The summary text to compare against the reference text. |
| max_chunk_words | int | | The maximum number of words allowed in each text chunk. Default is 512. |
| round_to | int | | The number of decimal places to round the similarity scores to. Default is 2. |
| return_dict | boolean | | Indicates whether to return the results as a dictionary. Default is True. |

Returns

| Return Type | Description |
| --- | --- |
| dict | A dictionary containing the number of chunks ("num_chunks_tfidf") and the similarity scores ("similarity_tfidf") if return_dict is True. |
| tuple | A tuple containing the number of chunks and the similarity scores as a list if return_dict is False. |

Example Usage

from dynamofl.evaluate import calculate_cosine_similarity_tfidf

results = calculate_cosine_similarity_tfidf(
reference_text="Reference text",
summary_text="Summary text",
)

Method calculate_ngram_overlap()

This method calculates the n-gram overlap between a document and a generated summary.
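
As an illustration of what n-gram overlap measures, the sketch below counts word n-grams shared by a document and a summary. It is an assumption-laden stand-in rather than the SDK's implementation: the whitespace tokenizer, the choice to report counts, and the definition of the ratio (shared n-grams divided by n-grams in the summary) are all guesses, and `ngram_overlap` is a hypothetical name.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in a text, lowercased and split on whitespace."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(document: str, generated_summary: str, n: int = 10) -> dict:
    doc_ngrams = ngrams(document, n)
    gen_ngrams = ngrams(generated_summary, n)
    common = doc_ngrams & gen_ngrams
    # Assumption: overlap = shared n-grams / n-grams in the summary.
    overlap = len(common) / len(gen_ngrams) if gen_ngrams else 0.0
    return {
        "document_ngrams": len(doc_ngrams),
        "generated_ngrams": len(gen_ngrams),
        "common_ngrams": len(common),
        "overlap": overlap,
    }


result = ngram_overlap("a b c d e", "a b c x", n=2)
print(result)
```

With a large `n`, a high ratio indicates that the summary copies long spans verbatim; a smaller `n` also rewards looser phrasing matches.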

Release Notes

Added support for calculating n-gram overlap between a document and a generated summary.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| document | string | | The document content used for comparison. |
| generated_summary | string | | The generated summary text to compare against the document. |
| n | int | | The value of 'n' for n-grams. Default is 10. |
| return_dict | boolean | | Indicates whether to return the results as a dictionary. Default is True. |

Returns

| Return Type | Description |
| --- | --- |
| dict | A dictionary containing the number of document n-grams ("document_ngrams"), the number of generated n-grams ("generated_ngrams"), common n-grams ("common_ngrams"), and the overlap ratio ("overlap") if return_dict is True. |
| tuple | A tuple containing the number of document n-grams, the number of generated n-grams, common n-grams, and the overlap ratio as individual elements if return_dict is False. |

Example Usage

from dynamofl.evaluate import calculate_ngram_overlap

results = calculate_ngram_overlap(
document="Document content",
generated_summary="Generated summary",
)

Method calculate_coverage()

This method calculates coverage using n-grams between a reference text and a summary text by splitting the reference text into chunks. Summaries with good coverage should have even overlap across all chunks of the reference text. Conversely, a summary covering only the first few lines will have uneven overlap scores across chunks.
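
The chunk-by-chunk idea can be sketched as follows. This is an illustrative approximation, not the SDK's code: it assumes whitespace tokenization and scores each chunk as the fraction of its n-grams that also appear in the summary. The `coverage` function here is a hypothetical stand-in that only borrows the documented parameter names.

```python
def word_ngrams(words: list, n: int) -> set:
    """Return the set of n-grams over a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def coverage(reference_text, summary_text, max_chunk_words=512, n=10, round_to=2):
    ref_words = reference_text.lower().split()
    summary_ngrams = word_ngrams(summary_text.lower().split(), n)
    scores = []
    # Walk the reference in fixed-size word chunks and score each one separately.
    for start in range(0, len(ref_words), max_chunk_words):
        chunk_ngrams = word_ngrams(ref_words[start:start + max_chunk_words], n)
        shared = chunk_ngrams & summary_ngrams
        score = len(shared) / len(chunk_ngrams) if chunk_ngrams else 0.0
        scores.append(round(score, round_to))
    return {"num_chunks_ngram": len(scores), "coverage": scores}


# The summary repeats only the first half of the reference.
result = coverage("a b c d e f g h", "a b c d", max_chunk_words=4, n=2)
print(result)
```

A summary that draws only on the opening of the reference scores high on the first chunk and low on the rest, which is the uneven pattern described above.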

Release Notes

Added support for calculating coverage using n-grams between a reference text and a summary text.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| reference_text | string | | The reference text used for comparison. |
| summary_text | string | | The summary text to calculate coverage against. |
| max_chunk_words | int | | The maximum number of words allowed in each text chunk. Default is 512. |
| n | int | | The value of 'n' for n-grams. Default is 10. |
| round_to | int | | The number of decimal places to round the coverage scores to. Default is 2. |
| return_dict | boolean | | Indicates whether to return the results as a dictionary. Default is True. |

Returns

| Return Type | Description |
| --- | --- |
| dict | A dictionary containing the number of chunks ("num_chunks_ngram") and the coverage scores ("coverage") if return_dict is True. |
| tuple | A tuple containing the number of chunks and the coverage scores as a list if return_dict is False. |

Example Usage

from dynamofl.evaluate import calculate_coverage

results = calculate_coverage(
reference_text="Reference text",
summary_text="Summary text",
)

Method compute_metrics()

This method is a convenience wrapper for calculating text comparison metrics between a reference text and a summary text. It combines compression ratio, cosine similarity using BERT embeddings, cosine similarity using TF-IDF vectors, n-gram overlap, and coverage, so that multiple metrics can be obtained with a single call.

Release Notes

Added support for calculating the compression ratio for a summarized text.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| reference_text | string | | The reference text used for comparison. |
| summary_text | string | | The summary text to calculate metrics against. |
| return_all | boolean | | Indicates whether to return all values of all metrics (see the values returned by the methods listed above). Default is False. |
| kwargs | dict | | A dictionary containing metric-specific parameters as key-value pairs (see the parameters accepted by the methods listed above). |
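
The kwargs dictionary is keyed by method name, with each value holding the options for that method. One plausible way such routing can work is sketched below. The two metrics and the `run_metrics` wrapper are hypothetical stand-ins written for this example; how the SDK dispatches its own options internally is not documented here.

```python
from typing import Optional


def word_count_ratio(reference_text: str, summary_text: str, round_to: int = 2) -> float:
    """Stand-in metric: summary length relative to the reference, in words."""
    ratio = len(summary_text.split()) / max(len(reference_text.split()), 1)
    return round(ratio, round_to)


def shared_word_count(reference_text: str, summary_text: str, min_length: int = 1) -> int:
    """Stand-in metric: distinct shared words of at least min_length characters."""
    shared = set(reference_text.lower().split()) & set(summary_text.lower().split())
    return sum(len(word) >= min_length for word in shared)


METRICS = [word_count_ratio, shared_word_count]


def run_metrics(reference_text: str, summary_text: str, kwargs: Optional[dict] = None) -> dict:
    """Call every metric, forwarding the options stored under its function name."""
    kwargs = kwargs or {}
    unknown = set(kwargs) - {metric.__name__ for metric in METRICS}
    if unknown:
        raise KeyError(f"no metric named: {sorted(unknown)}")
    return {
        metric.__name__: metric(reference_text, summary_text, **kwargs.get(metric.__name__, {}))
        for metric in METRICS
    }


results = run_metrics(
    "the quick brown fox jumps",
    "the fox jumps",
    kwargs={"shared_word_count": {"min_length": 4}},
)
print(results)
```

Metrics without an entry in kwargs run with their defaults, and rejecting unknown keys surfaces a misspelled method name early instead of silently ignoring it.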

Returns

| Return Type | Description |
| --- | --- |
| dict | A dictionary containing the computed metrics. |

Example Usage

from dynamofl.evaluate import compute_metrics

results = compute_metrics(
reference_text="Reference text",
summary_text="Summary text",
# Add custom arguments for 'cosine_similarity_bert' and 'ngram_overlap' methods
kwargs={
'calculate_cosine_similarity_bert': {'max_chunk_tokens': 256},
'calculate_ngram_overlap': {'n': 5}
}
)

Method retrieval_relevance_judge_text()

This method judges if the retrieved contexts are relevant to each question, by prompting a Mistral model.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| questions | string or list | | A single question or a list of questions to check. |
| retrieved | string or list | | A retrieved context or a list of retrieved contexts. Must be of the same length as questions. |
| api_key | string | | Mistral API key for accessing the Mistral LLM as the judge. Defaults to None. If not provided, the method looks for the key in the environment variable MISTRAL_API_KEY. If the environment variable is not set, provide a valid API key for this parameter. |

Returns

| Return Type | Description |
| --- | --- |
| dict of list | A dictionary of lists containing the computed retrieval relevance labels and explanations. |
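
Since the labels and explanations come back as parallel lists, a common follow-up step is to pull out the pairs that were judged irrelevant. The sketch below assumes the "labels" and "explanations" keys and the 1/0 label convention used in the example for this method. The `output` dictionary holds invented values so that the snippet runs without calling the judge, and `irrelevant_retrievals` is a hypothetical helper.

```python
# Stand-in for the dictionary returned by retrieval_relevance_judge_text();
# the explanations are invented for illustration, not real judge output.
output = {
    "labels": [1, 0],
    "explanations": [
        "The context states the consequences of not updating an address.",
        "The context covers Direct Deposit, not veteran benefits.",
    ],
}

questions = [
    "What happens if I don't update my address registered to each vehicle?",
    "What benefit can you get as a veteran?",
]


def irrelevant_retrievals(questions: list, output: dict) -> list:
    """Return (question, explanation) pairs whose retrieved context was labeled 0."""
    rows = zip(questions, output["labels"], output["explanations"])
    return [(question, why) for question, label, why in rows if label == 0]


flagged = irrelevant_retrievals(questions, output)
for question, why in flagged:
    print(f"{question} -> {why}")
```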

Example Usage

from dynamofl.evaluate import retrieval_relevance_judge_text

# set the Mistral API key
api_key = "<your-mistral-api-key>"

# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"

# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]

context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]

# run retrieval relevance check
output = retrieval_relevance_judge_text(question_lst, context_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]

# Expected output: labels[0] == 1 and labels[1] == 0
# because the first context is relevant to the first question, but the second context is not relevant to the second question.

Method faithfulness_judge_text()

This method judges whether each answer is faithful to its retrieved context, using NLI models.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| questions | string or list | | A single question or a list of questions to check. |
| retrieved | string or list | | A retrieved context or a list of retrieved contexts. Must be of the same length as answers. |
| answers | string or list | | An answer or a list of answers. |
| api_key | string | | Mistral API key for accessing the Mistral LLM as the judge. Defaults to None. If not provided, the method looks for the key in the environment variable MISTRAL_API_KEY. If the environment variable is not set, provide a valid API key for this parameter. |
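
Each input may be a single string or a list, and the lists must line up pair by pair. Whether and how the SDK validates this is not documented here, so checking before the call can save a wasted judge request. The `as_list` and `check_aligned` helpers below are hypothetical and only illustrate one way to normalize and verify the inputs.

```python
def as_list(value) -> list:
    """Wrap a single string in a list; copy any other iterable into a list."""
    return [value] if isinstance(value, str) else list(value)


def check_aligned(**named_inputs) -> dict:
    """Normalize every input to a list and require that all lengths match."""
    lists = {name: as_list(value) for name, value in named_inputs.items()}
    lengths = {name: len(items) for name, items in lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"inputs differ in length: {lengths}")
    return lists


aligned = check_aligned(
    questions="What benefit can you get as a veteran?",
    retrieved=["Context about Direct Deposit."],
    answers=["Call the Direct Express hotline."],
)
print(aligned["questions"])
```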

Returns

| Return Type | Description |
| --- | --- |
| dict of list | A dictionary of lists containing the computed faithfulness labels and explanations. |

Example Usage

from dynamofl.evaluate import faithfulness_judge_text

# set the Mistral API key
api_key = "<your-mistral-api-key>"

# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"

# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]

context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]

answer_lst = [
"You may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors.",
"A Veteran who s the qualifying CHAMPVA sponsor for their family may also qualify for the VA health care program based on their own Veteran status.",
]

output = faithfulness_judge_text(question_lst, context_lst, answer_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]

# Expected output: labels[0] == 1 and labels[1] == 0
# because the first answer is faithful to the first context, but the second answer is not faithful to the second context.

Method response_relevance_judge_text()

This method judges if each answer is relevant to each question, by prompting a Mistral model.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| questions | string or list | | A single question or a list of questions to check. |
| answers | string or list | | An answer or a list of answers. Must be of the same length as questions. |
| api_key | string | | Mistral API key for accessing the Mistral LLM as the judge. Defaults to None. If not provided, the method looks for the key in the environment variable MISTRAL_API_KEY. If the environment variable is not set, provide a valid API key for this parameter. |
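
The api_key lookup order described for this parameter (explicit argument first, then the MISTRAL_API_KEY environment variable) can be sketched as below. `resolve_api_key` is a hypothetical helper that mirrors the documented behavior; the error raised when no key is found is an assumption, not the SDK's actual message.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None, env_var: str = "MISTRAL_API_KEY") -> str:
    """Prefer the explicit argument, then fall back to the environment variable."""
    key = api_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"pass api_key or set the {env_var} environment variable")
    return key


# An explicit key is used as-is, whatever the environment contains.
key = resolve_api_key(api_key="<your-mistral-api-key>")
print(key)
```

Resolving the key once up front makes a missing credential fail immediately rather than partway through a batch of judge calls.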

Returns

| Return Type | Description |
| --- | --- |
| dict of list | A dictionary of lists containing the computed response relevance labels and explanations. |

Example Usage

from dynamofl.evaluate import response_relevance_judge_text

# set the Mistral API key
api_key = "<your-mistral-api-key>"

# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"

# question and answer lists to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]

answer_lst = [
"You may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors.",
"Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795.",
]

# run response relevance check
output = response_relevance_judge_text(question_lst, answer_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]

# Expected output: labels[0] == 1 and labels[1] == 0
# because the first response is relevant to the first question, but the second response is not relevant to the second question.

Method generate_qa_from_context()

This method generates question/answer pairs from the given list of contexts and saves a .csv file containing (context, questions, answers) under the file name provided as the parameter csv_file_name. If the list of contexts is very large, the method supports sampling from it to generate question/answer pairs for a subset of contexts. sample_method="random" samples random contexts from the list, while sample_method="idds" samples contexts based on in-domain diversity sampling (idds) to ensure diversity of the sampled contexts' semantic representations.

Parameters

| Param | Type | Required? | Description |
| --- | --- | --- | --- |
| context_lst | list of str | | The list of contexts used to generate queries. |
| api_key | string | | API key to use the model. Set up an environment variable |
| prompt_template | string | | Prompt template for the language model. Must contain both {context} and {num_questions} as placeholders. |
| system_prompt | string | | System prompt for the language model. |
| csv_file_name | string | | File name for the generated csv file. Defaults to generated_queries.csv |
| model_type | string | | The type of text generation model used (only supports "openai"). Defaults to "openai" |
| model_name | string | | The model name within the specified model_type. Defaults to "gpt-3.5-turbo" |
| num_questions_per_doc | integer | | Number of questions to generate per context. Defaults to 3. |
| do_sample | boolean | | Whether to sample contexts to generate questions from. Defaults to False. |
| sample_num | integer | | Number of sample contexts to generate questions from. Defaults to 5. |
| sample_method | string | | Type of sampling strategy ("random" or "idds"). Defaults to "random". |
| batch | boolean | | Whether to use batching. Defaults to True. |
| batch_size | integer | | Batch size. Defaults to 5. |
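
Because prompt_template must contain both {context} and {num_questions}, it can be worth checking a custom template before passing it in. The sketch below uses Python's `string.Formatter` to list the placeholders. The `missing_placeholders` helper and the sample template are hypothetical, and the check assumes the SDK fills the template with `str.format`-style substitution.

```python
from string import Formatter

REQUIRED_FIELDS = {"context", "num_questions"}


def missing_placeholders(prompt_template: str) -> set:
    """Return the required placeholder names that the template does not contain."""
    found = {field for _, field, _, _ in Formatter().parse(prompt_template) if field}
    return REQUIRED_FIELDS - found


template = "Write {num_questions} question/answer pairs about this passage:\n{context}"

if missing_placeholders(template):
    raise ValueError(f"template is missing: {sorted(missing_placeholders(template))}")

# Fill the template the way a str.format-based caller would.
prompt = template.format(context="By statute, report a change of address.", num_questions=3)
print(prompt)
```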

Returns

None

Example Usage

from dynamofl.evaluate import generate_qa_from_context

# set the openai API key
api_key = "<your-openai-api-key>"

# or set the openai API key as an environment variable
import os
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # assumes the SDK reads the standard OpenAI variable name

# list of context from which the question and answer should be generated
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]

# run generation
generate_qa_from_context(context_lst, api_key=api_key)
# will save a csv file named `generated_queries.csv` with context, question, and answer columns.
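
Since the method returns None and writes its results to disk, the next step is usually to read the file back. The sketch below groups question/answer pairs by context using the standard `csv` module. It assumes the column headers are exactly "context", "question", and "answer", which should be confirmed against a real output file. `group_qa_by_context` is a hypothetical helper, and the sample file is written locally so that the snippet runs without calling the SDK.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def group_qa_by_context(csv_path) -> dict:
    """Map each context to its list of (question, answer) pairs."""
    grouped = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            grouped[row["context"]].append((row["question"], row["answer"]))
    return dict(grouped)


# Write a tiny stand-in file with the assumed column names.
sample_path = Path(tempfile.mkdtemp()) / "generated_queries.csv"
context = "You must report a change of address to DMV within ten days of moving."
with open(sample_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["context", "question", "answer"])
    writer.writerow([context, "How soon must I report a move?", "Within ten days."])
    writer.writerow([context, "Who must be told about the move?", "The DMV."])

grouped = group_qa_by_context(sample_path)
for ctx, pairs in grouped.items():
    print(ctx, len(pairs))
```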