Evaluate SDK Library
dynamofl.evaluate SDK reference
These DynamoFL SDK methods provide tools for quantitatively assessing LLM performance.
from dynamofl.evaluate import *
Example Script
from dynamofl.evaluate import (
calculate_compression_ratio,
calculate_cosine_similarity_bert,
calculate_cosine_similarity_tfidf,
calculate_ngram_overlap,
calculate_coverage,
compute_metrics,
)
sample_text = {
"reference_text": "The Japanese Zen term shoshin translates as ‘beginner’s mind’ and refers to a paradox: the more you know about a subject, the more likely you are to close your mind to further learning. As the Zen monk Shunryu Suzuki put it in his book Zen Mind, Beginner’s Mind (1970): ‘In the beginner’s mind there are many possibilities, but in the expert’s there are few.’ Many historical examples demonstrate how the expert mind (or feeling like an expert) can lead to closed-mindedness and the obstruction of scientific progress. In 1912, for instance, when the German geophysicist and explorer Alfred Wegener proposed – counter to the received wisdom of the day – that the Earth is made up of shifting continental plates, he was ridiculed by expert geologists around the world. His German compatriots referred to his ‘delirious ravings’ while experts in the United States accused him of peddling pseudoscience. It would take decades before the orthodoxy was overturned and the accuracy of his theory was acknowledged. Similar stories abound. In my own field of neuroscience, for example, belief in the legendary Spanish neuroscientist Santiago Ramón y Cajal’s ‘harsh decree’ that adult humans are unable to grow new neurons persisted for decades in the face of mounting contradictory evidence. Intellectual hubris doesn’t afflict only established scientific experts. Merely having a university degree in a subject can lead people to grossly overestimate their knowledge. In one pertinent study in 2015, researchers at Yale University asked graduates to estimate their knowledge of various topics relevant to their degrees, and then tested their actual ability to explain those topics. The participants frequently overestimated their level of understanding, apparently mistaking the ‘peak knowledge’ they had at the time they studied at university for their considerably more modest current knowledge. Unfortunately, just as Suzuki wrote and as historical anecdotes demonstrate, there is research evidence that even feeling like an expert also breeds closed-mindedness. Another study involved giving people the impression that they were relatively expert on a topic (for example, by providing them with inflated scores on a test of political knowledge), which led them to be less willing to consider other political viewpoints – a phenomenon the researchers called ‘the earned dogmatism effect’.",
"summary_text": "The Japanese Zen term shoshin translates as beginners mind and refers to a paradox the more you know about a subject, the more likely you are to close your mind to further learning. As the Zen monk Shunryu Suzuki put it in his book Zen Mind, Beginners Mind 1970 In the beginners mind there are many possibilities, but in the experts there are few. Many historical examples demonstrate how the expert mind or feeling like an expert can lead to closed-mindedness and the obstruction of scientific progress. In 1912, for instance, when the German geophysicist and explorer Alfred Wegener proposed counter to the received wisdom of the day that the Earth is made up of shifting continental plates, he was ridiculed by expert geologists around the world. His German compatriots referred to his delirious ravings while experts in the United States accused him of peddling pseudoscience. It would take decades before the orthodoxy was overturned and the accuracy of his theory was acknowledged. Similar stories abound. In my own field of neuroscience, for example, belief in the legendary Spanish neuroscientist Santiago Ramn y Cajals harsh decree that adult humans are unable to grow new neurons persisted for decades in the face of mounting contradictory evidence."
}
'''
Use the compute_metrics() wrapper method for calculating text comparison metrics
between a reference text and a summary text.
'''
text_comparison_results = compute_metrics(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
# Add custom arguments for 'cosine_similarity_bert' and 'ngram_overlap' methods
kwargs={
'calculate_cosine_similarity_bert': {'max_chunk_tokens': 256},
'calculate_ngram_overlap': {'n': 5}
}
)
'''
Alternatively, call one or more text comparison methods individually
'''
compression_ratio_results = calculate_compression_ratio(
text=sample_text["summary_text"],
)
cosine_similarity_bert_results = calculate_cosine_similarity_bert(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)
cosine_similarity_tfidf_results = calculate_cosine_similarity_tfidf(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)
ngram_overlap_results = calculate_ngram_overlap(
document=sample_text["reference_text"],
generated_summary=sample_text["summary_text"],
)
coverage_results = calculate_coverage(
reference_text=sample_text["reference_text"],
summary_text=sample_text["summary_text"],
)
Method calculate_compression_ratio()
This method measures how repetitive a model's output is. The higher the compression ratio, the more repetitive the text.
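The SDK computes this ratio internally; as a rough intuition for why compressibility measures repetition, the sketch below (not the SDK's implementation) compares the zlib-compressed size of repetitive and varied text. The helper name rough_compression_ratio is purely illustrative.
import zlib
def rough_compression_ratio(text: str) -> float:
    # Redundant (repetitive) text compresses further, so the ratio rises.
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))
repetitive = "The model repeats itself. " * 40
varied = (
    "Alfred Wegener proposed in 1912 that the Earth is made up of "
    "shifting continental plates, a view ridiculed before plate "
    "tectonics was eventually accepted."
)
print(rough_compression_ratio(repetitive))  # much greater than 1
print(rough_compression_ratio(varied))      # closer to 1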
Release Notes
Added support for calculating the compression ratio for a summarized text.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
text | string | ✅ | The input text that you want to calculate the compression ratio for. |
return_dict | boolean | ❌ | Indicates whether to return the compression ratio as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the compression_ratio key and its corresponding value if return_dict is True. |
float | The compression ratio as a floating-point number if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_compression_ratio
results = calculate_compression_ratio(
text="This is a sample text.",
)
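Per the Returns table above, the default is a dictionary; pass return_dict=False to receive the raw float instead.
print(results["compression_ratio"])
ratio = calculate_compression_ratio(
    text="This is a sample text.",
    return_dict=False,
)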
Method calculate_cosine_similarity_bert()
This method calculates the cosine similarity between a reference text and a summary text using BERT [CLS] (classification token) embeddings. It tokenizes the text, splits it into smaller chunks, computes a cosine similarity for each chunk, and returns the results as a dictionary or tuple.
Release Notes
Added support for calculating cosine similarity using BERT [CLS] embeddings.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to compare against the reference text. |
max_chunk_tokens | int | ❌ | The maximum number of tokens allowed in each text chunk. Default is 512. |
round_to | int | ❌ | The number of decimal places to round the similarity scores to. Default is 2. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of chunks ("num_chunks_bert") and the similarity scores ("similarity_bert") if return_dict is True. |
tuple | A tuple containing the number of chunks and the similarity scores as a list if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_cosine_similarity_bert
results = calculate_cosine_similarity_bert(
reference_text="Reference text",
summary_text="Summary text",
)
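For references longer than max_chunk_tokens, the text is scored chunk by chunk; the documented keys expose both the chunk count and the per-chunk scores.
print(results["num_chunks_bert"])   # number of chunks scored
print(results["similarity_bert"])   # similarity score(s)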
Method calculate_cosine_similarity_tfidf()
This method calculates the cosine similarity between a reference text and a summary text using TF-IDF (Term Frequency-Inverse Document Frequency) vectors.
Release Notes
Added support for calculating cosine similarity using the TfidfVectorizer.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to compare against the reference text. |
max_chunk_words | int | ❌ | The maximum number of words allowed in each text chunk. Default is 512. |
round_to | int | ❌ | The number of decimal places to round the similarity scores to. Default is 2. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of chunks ("num_chunks_tfidf") and the similarity scores ("similarity_tfidf") if return_dict is True. |
tuple | A tuple containing the number of chunks and the similarity scores as a list if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_cosine_similarity_tfidf
results = calculate_cosine_similarity_tfidf(
reference_text="Reference text",
summary_text="Summary text",
)
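With return_dict=False the method returns a tuple instead, ordered as in the Returns table: the chunk count followed by the similarity scores.
num_chunks, similarities = calculate_cosine_similarity_tfidf(
    reference_text="Reference text",
    summary_text="Summary text",
    return_dict=False,
)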
Method calculate_ngram_overlap()
This method calculates the n-gram overlap between a document and a generated summary.
Release Notes
Added support for calculating n-gram overlap between a document and a generated summary.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
document | string | ✅ | The document content used for comparison. |
generated_summary | string | ✅ | The generated summary text to compare against the document. |
n | int | ❌ | The value of n for n-grams. Default is 10. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of document n-grams ("document_ngrams"), the number of generated n-grams ("generated_ngrams"), common n-grams ("common_ngrams"), and the overlap ratio ("overlap") if return_dict is True. |
tuple | A tuple containing the number of document n-grams, the number of generated n-grams, common n-grams, and the overlap ratio as individual elements if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_ngram_overlap
results = calculate_ngram_overlap(
document="Document content",
generated_summary="Generated summary",
)
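The documented keys expose the raw n-gram counts alongside the overlap ratio; lowering n makes the measure more permissive, since shorter n-grams are easier to match.
print(results["common_ngrams"], results["overlap"])
# Trigram overlap instead of the default n=10
trigram_results = calculate_ngram_overlap(
    document="Document content",
    generated_summary="Generated summary",
    n=3,
)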
Method calculate_coverage()
This method calculates coverage using n-grams between a reference text and a summary text by splitting the reference text into chunks. Summaries with good coverage should overlap evenly with all chunks of the reference text. Conversely, a summary covering only the first few lines will have uneven overlap scores across chunks.
Release Notes
Added support for calculating coverage using n-grams between a reference text and a summary text.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to calculate coverage against. |
max_chunk_words | int | ❌ | The maximum number of words allowed in each text chunk. Default is 512. |
n | int | ❌ | The value of n for n-grams. Default is 10. |
round_to | int | ❌ | The number of decimal places to round the coverage scores to. Default is 2. |
return_dict | boolean | ❌ | Indicates whether to return the results as a dictionary. Default is True. |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the number of chunks ("num_chunks_ngram") and the coverage scores ("coverage") if return_dict is True. |
tuple | A tuple containing the number of chunks and the coverage scores as a list if return_dict is False. |
Example Usage
from dynamofl.evaluate import calculate_coverage
results = calculate_coverage(
reference_text="Reference text",
summary_text="Summary text",
)
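As described above, an even spread of per-chunk scores indicates the summary draws on the whole reference rather than only its opening lines. Assuming the documented "coverage" value is a per-chunk list, a quick evenness check might look like this:
scores = results["coverage"]
spread = max(scores) - min(scores)  # a small spread suggests even coverage
print(results["num_chunks_ngram"], spread)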
Method compute_metrics()
This method is a convenient wrapper for calculating text comparison metrics between a reference text and a summary text. It combines compression ratio, cosine similarity using BERT embeddings, cosine similarity using TF-IDF vectors, n-gram overlap, and coverage, so that multiple metrics can be obtained with a single call.
Release Notes
Added a wrapper for computing all of the above text comparison metrics in a single call.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
reference_text | string | ✅ | The reference text used for comparison. |
summary_text | string | ✅ | The summary text to calculate metrics against. |
return_all | boolean | ❌ | Indicates whether to return all values of all metrics (see the values returned by the methods above). Default is False. |
kwargs | dict | ❌ | A dictionary of metric-specific parameters as key-value pairs (see the parameters accepted by the methods above). |
Returns
Return Type | Description |
---|---|
dict | A dictionary containing the computed metrics. |
Example Usage
from dynamofl.evaluate import compute_metrics
results = compute_metrics(
reference_text="Reference text",
summary_text="Summary text",
# Add custom arguments for 'cosine_similarity_bert' and 'ngram_overlap' methods
kwargs={
'calculate_cosine_similarity_bert': {'max_chunk_tokens': 256},
'calculate_ngram_overlap': {'n': 5}
}
)
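Because the combined dictionary's keys depend on the underlying metrics (and on return_all), it is safest to iterate over the result rather than assume specific keys:
for metric, value in results.items():
    print(metric, value)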
Method retrieval_relevance_judge_text()
This method judges whether each retrieved context is relevant to its question by prompting a Mistral model.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
questions | string or list | ✅ | A single question or a list of questions to check. |
retrieved | string or list | ✅ | A retrieved context or a list of retrieved contexts. Must be the same length as questions. |
api_key | string | ❌ | Mistral API key for accessing the Mistral LLM judge. Defaults to None; if not provided, the method looks for the MISTRAL_API_KEY environment variable. If that variable is not set, a valid API key must be passed here. |
Returns
Return Type | Description |
---|---|
dict of list | A dictionary of lists containing the computed retrieval relevance labels and explanations. |
Example Usage
from dynamofl.evaluate import retrieval_relevance_judge_text
# set the Mistral API key
api_key = "<your-mistral-api-key>"
# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]
# run retrieval relevance check
output = retrieval_relevance_judge_text(question_lst, context_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]
# Expected output: labels[0] == 1 and labels[1] == 0
# because the first question is relevant to the first context, but the second question is not relevant to the second context.
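Since the parameters accept a string as well as a list, a single pair can also be checked directly; the output is assumed to keep the same dict-of-lists shape:
single = retrieval_relevance_judge_text(
    question_lst[0],
    context_lst[0],
    api_key=api_key,
)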
Method faithfulness_judge_text()
This method judges whether each answer is faithful to its retrieved context, using NLI models.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
questions | string or list | ✅ | A single question or a list of questions to check. |
retrieved | string or list | ✅ | A retrieved context or a list of retrieved contexts. Must be the same length as answers. |
answers | string or list | ✅ | An answer or a list of answers. |
api_key | string | ❌ | Mistral API key for accessing the Mistral LLM judge. Defaults to None; if not provided, the method looks for the MISTRAL_API_KEY environment variable. If that variable is not set, a valid API key must be passed here. |
Returns
Return Type | Description |
---|---|
dict of list | A dictionary of lists containing the computed faithfulness labels and explanations. |
Example Usage
from dynamofl.evaluate import faithfulness_judge_text
# set the Mistral API key
api_key = "<your-mistral-api-key>"
# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]
answer_lst = [
"You may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors.",
"A Veteran who s the qualifying CHAMPVA sponsor for their family may also qualify for the VA health care program based on their own Veteran status.",
]
output = faithfulness_judge_text(question_lst, context_lst, answer_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]
# Expected output: labels[0] == 1 and labels[1] == 0
# because the first answer is faithful to the first context, but the second answer is not faithful to the second context.
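The parallel labels and explanations lists make it straightforward to surface the answers the judge rejected, together with its reasoning:
for question, answer, label, why in zip(question_lst, answer_lst, labels, explanations):
    if label == 0:
        print(question, answer, why)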
Method response_relevance_judge_text()
This method judges whether each answer is relevant to its question by prompting a Mistral model.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
questions | string or list | ✅ | A single question or a list of questions to check. |
answers | string or list | ✅ | An answer or a list of answers. Must be the same length as questions. |
api_key | string | ❌ | Mistral API key for accessing the Mistral LLM judge. Defaults to None; if not provided, the method looks for the MISTRAL_API_KEY environment variable. If that variable is not set, a valid API key must be passed here. |
Returns
Return Type | Description |
---|---|
dict of list | A dictionary of lists containing the computed response relevance labels and explanations. |
Example Usage
from dynamofl.evaluate import response_relevance_judge_text
# set the Mistral API key
api_key = "<your-mistral-api-key>"
# or set the Mistral API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
# question and context list to check pair by pair
question_lst = [
"What happens if I don't update my address registered to each vehicle?",
"What benefit can you get as a veteran?",
]
answer_lst = [
"You may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors.",
"Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795.",
]
# run response relevance check
output = response_relevance_judge_text(question_lst, answer_lst, api_key=api_key)
labels = output["labels"]
explanations = output["explanations"]
# Expected output: labels[0] == 1 and labels[1] == 0
# because the first response is relevant to the first question, but the second response is not relevant to the second question.
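Taken together, the three judges form a simple RAG triage pass over (question, context, answer) triplets. A minimal sketch, reusing question_lst, context_lst, and answer_lst from the examples above and assuming MISTRAL_API_KEY is set in the environment:
from dynamofl.evaluate import (
    retrieval_relevance_judge_text,
    faithfulness_judge_text,
    response_relevance_judge_text,
)
retrieval = retrieval_relevance_judge_text(question_lst, context_lst)
faithfulness = faithfulness_judge_text(question_lst, context_lst, answer_lst)
relevance = response_relevance_judge_text(question_lst, answer_lst)
# A triplet passes only if all three judges label it 1
passes = [
    r == f == a == 1
    for r, f, a in zip(retrieval["labels"], faithfulness["labels"], relevance["labels"])
]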
Method generate_qa_from_context()
This method generates question/answer pairs from the given list of contexts and saves a .csv file containing (context, questions, answers) under the file name provided as the csv_file_name parameter. When the list of contexts is very large, the method supports sampling a subset of contexts to generate question/answer pairs from: sample_method="random" draws random contexts from the list, while sample_method="idds" selects contexts via in-domain diversity sampling (IDDS) to keep the sampled contexts' semantic representations diverse.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
context_lst | list of str | ✅ | The list of context used to generate queries. |
api_key | string | ✅ | API key for the text generation model. Can alternatively be supplied via an environment variable. |
prompt_template | string | ❌ | Prompt template for the language model. Must contain both {context} and {num_questions} as placeholders. |
system_prompt | string | ❌ | System prompt for the language model. |
csv_file_name | string | ❌ | File name for the generated CSV file. Defaults to generated_queries.csv. |
model_type | string | ❌ | The type of text generation model used (only "openai" is supported). Defaults to "openai". |
model_name | string | ❌ | The model name within the specified model_type. Defaults to "gpt-3.5-turbo". |
num_questions_per_doc | integer | ❌ | Number of questions to generate per context. Defaults to 3. |
do_sample | boolean | ❌ | Whether to sample contexts to generate questions from. Defaults to False. |
sample_num | integer | ❌ | Number of sampled contexts to generate questions from. Defaults to 5. |
sample_method | string | ❌ | Sampling strategy: "random" or "idds". Defaults to "random". |
batch | boolean | ❌ | Whether to use batching. Defaults to True. |
batch_size | integer | ❌ | Batch size. Defaults to 5. |
Returns
None
Example Usage
from dynamofl.evaluate import generate_qa_from_context
# set the openai API key
api_key = "<your-openai-api-key>"
# or set the openai API key as an environment variable
import os
os.environ["MISTRAL_API_KEY"] = "<your-openai-api-key>"
# list of context from which the question and answer should be generated
context_lst = [
"Many DMV customers make easily avoidable mistakes that cause them significant problems, including encounters with law enforcement and impounded vehicles. Because we see customers make these mistakes over and over again , we are issuing this list of the top five DMV mistakes and how to avoid them. \n\n1. Forgetting to Update Address \nBy statute , you must report a change of address to DMV within ten days of moving. That is the case for the address associated with your license, as well as all the addresses associated with each registered vehicle, which may differ. It is not sufficient to only: write your new address on the back of your old license; tell the United States Postal Service; or inform the police officer writing you a ticket. If you fail to keep your address current , you will miss a suspension order and may be charged with operating an unregistered vehicle and/or aggravated unlicensed operation, both misdemeanors. This really happens , but the good news is this is a problem that is easily avoidable. Learn more about how to change the address on your license and registrations [1]",
"If you already receive Social Security or SSI benefits and you have a bank account , you can sign up for Direct Deposit by : starting or changing Direct Deposit online Social Security benefits only , or contacting your bank, credit union or savings and loan association , or calling Social Security toll - free at 1 - 800 - 772 - 1213 TTY 1 - 800 - 325 - 0778, or Consider the Direct Express debit card as another viable option. The Direct Express card is a debit card you can use to access your benefits. And you don't need a bank account. With the Direct Express card program , we deposit your federal benefit payment directly into your card account. Your monthly benefits will be available on your payment day on time, every time. You can use the card to make purchases, pay bills or get cash at thousands of locations. It s quick and easy to sign up for the card. Call the toll - free Direct Express hotline at 1 - 800 - 333 - 1795. Also , Social Security can help you sign up. If you don't have an account , you must open an account before you can sign up for Direct Deposit. You should shop around in your area to find an account that has the features you want at a price you can afford.",
]
# run generation
generate_qa_from_context(context_lst, api_key=api_key)
# will save a csv file named `generated_queries.csv` with context, question, and answer columns.
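For a large context list, the sampling parameters documented above restrict generation to a diverse subset; reading the output back with pandas is a convenient way to inspect it. The file name below is illustrative, and the exact column headers follow the method description (context, question, and answer columns).
import pandas as pd
# Sample a diverse subset of contexts before generating
generate_qa_from_context(
    context_lst,
    api_key=api_key,
    do_sample=True,
    sample_num=2,
    sample_method="idds",
    csv_file_name="sampled_queries.csv",
)
df = pd.read_csv("sampled_queries.csv")
print(df.head())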