Scenarios

class vieval.metrics.question_answering.QAMetric

Bases: BaseMetric

Evaluate the performance of a question-answering (QA) system.

evaluate(data: Dict, args)

Returns evaluation results for QA predictions.

Args:

data (Dict): A dictionary expected to contain the keys “predictions” and “references”. “predictions” holds the model’s answers to the questions, and “references” holds the ground-truth answers.
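As a rough illustration of the input shape described above, the following sketch builds such a dictionary and checks it before evaluation. The helper check_qa_data is hypothetical and not part of vieval, and the requirement that the two lists are index-aligned and equally long is an assumption rather than something this reference states.

```python
from typing import Dict, List


def check_qa_data(data: Dict[str, List[str]]) -> int:
    """Hypothetical helper: validate the dict shape before evaluation."""
    for key in ("predictions", "references"):
        if key not in data:
            raise KeyError(f"missing required key: {key!r}")
    # Assumption: predictions[i] answers the same question as references[i].
    if len(data["predictions"]) != len(data["references"]):
        raise ValueError("predictions and references must be the same length")
    return len(data["predictions"])


data = {
    "predictions": ["Hanoi", "1945"],   # model answers
    "references": ["Hanoi", "1954"],    # ground-truth answers
}
num_examples = check_qa_data(data)
```

A check like this fails early with a clear message instead of letting a missing key surface deep inside a metric computation.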

class vieval.metrics.summary.SummaryMetric

Bases: BaseMetric

Evaluate the quality of text summaries.

evaluate(data: Dict, args)

Evaluates the generated summaries against reference summaries and computes various metrics to assess the quality of the generated summaries.

Args:

data (Dict): A dictionary expected to contain the keys original_documents, predictions, and references.

Returns:

A tuple containing the original data dictionary and the result dictionary with all the computed metrics.

class vieval.metrics.text_classification.TextClassificationMetric

Bases: BaseMetric

Evaluate text classification models.

evaluate(data: Dict, args, **kwargs) → None

Evaluates the classification performance given the predictions, references, and additional arguments.

Args:

data (Dict): A dictionary expected to contain keys such as predictions, references, and option_probs.

Returns:

A tuple containing the original data dictionary and the result dictionary with all the computed metrics.

class vieval.metrics.toxicity.ToxicityMetric

Bases: BaseMetric

Evaluate text for toxicity.

evaluate(data: Dict, args)

Evaluates the level of toxicity in the text predictions provided via the dictionary.

Args:

data (Dict): A dictionary expected to contain the key “predictions”, holding the text to be evaluated for toxicity.

Returns:

A tuple containing the updated data dictionary and a new dictionary with the mean toxicity score, computed from the list of per-prediction toxicity scores.

class vieval.metrics.ir.InformationRetrievalMetric

Bases: BaseMetric

Evaluate information retrieval systems.

evaluate(data: Dict, args, **kwargs)

Evaluates the predictions using relevance judgments and computes various metrics.

Args:

data (Dict): A dictionary containing the predictions to be evaluated.

class vieval.metrics.language.LanguageMetric

Bases: BaseMetric

Evaluate language generation tasks.

evaluate(data: Dict, args)

Evaluates the predictions against references and computes various metrics.

Args:

data (Dict): A dictionary that must contain the keys “predictions”, “references”, and “generation_probs”. These hold the predictions, the references they are compared against, and the log probabilities for each prediction.

Returns:

A tuple containing:
- data: The original data dictionary, updated with raw metric scores for each prediction-reference pair.
- result: A dictionary with the average scores of the metrics across all prediction-reference pairs.
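The (data, result) return pattern described above can be sketched as follows. This is a minimal stand-in, not the LanguageMetric implementation: the function evaluate_sketch, the exact_match metric, and the key name "exact_match" are all hypothetical, and the sketch ignores “generation_probs” entirely.

```python
from statistics import mean
from typing import Callable, Dict, Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Placeholder metric (an assumption): 1.0 on an exact string match."""
    return float(prediction.strip() == reference.strip())


def evaluate_sketch(
    data: Dict[str, list],
    metrics: Dict[str, Callable[[str, str], float]],
) -> Tuple[Dict[str, list], Dict[str, float]]:
    """Store per-pair scores in `data` and their averages in `result`."""
    result = {}
    for name, metric_fn in metrics.items():
        scores = [
            metric_fn(pred, ref)
            for pred, ref in zip(data["predictions"], data["references"])
        ]
        data[name] = scores          # raw score per prediction-reference pair
        result[name] = mean(scores)  # average across all pairs
    return data, result


data = {
    "predictions": ["xin chao", "tam biet"],
    "references": ["xin chao", "hen gap lai"],
}
data, result = evaluate_sketch(data, {"exact_match": exact_match})
```

Keeping the raw scores in data and only the averages in result lets a caller inspect individual examples without recomputing anything.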

get_num_bytes(tokens: List[str]) → int

Calculates the total number of bytes of a list of tokens when encoded in UTF-8.

Args:

tokens (List[str]): A list of string tokens whose total UTF-8 byte length is to be calculated.
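A minimal sketch of the byte count described above, assuming it is simply the sum of each token's UTF-8 encoded length. The name get_num_bytes_sketch is hypothetical, and the library's own method may differ in detail.

```python
from typing import List


def get_num_bytes_sketch(tokens: List[str]) -> int:
    """Sum the UTF-8 encoded length of every token."""
    return sum(len(token.encode("utf-8")) for token in tokens)


# "caf\u00e9" is 4 characters but 5 bytes, because U+00E9 needs two bytes.
total = get_num_bytes_sketch(["caf\u00e9", "ok"])  # 5 + 2 = 7
```

Byte counts differ from character counts for any non-ASCII text, which matters for Vietnamese, where most diacritic letters take two or three bytes in UTF-8.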

class vieval.metrics.reasoning.ReasoningMetric

Bases: BaseMetric

Evaluate reasoning capabilities, particularly in scenarios that may involve mathematical reasoning.

equal(prediction: str, refenrence: str, threshold: int = 0.9) → float

Evaluates whether the prediction is sufficiently close to the reference using a similarity threshold. It computes the Levenshtein ratio (a measure of the similarity between two strings) and returns 1 if the ratio exceeds the threshold, indicating a match, and 0 otherwise.

Args:

prediction (str): The predicted answer.

refenrence (str): The reference or ground truth answer.

threshold (float, optional; annotated as int in the signature): A similarity threshold between 0 and 1 for comparing the prediction and reference, defaulting to 0.9.
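The thresholded comparison described above can be sketched in pure Python. This sketch assumes the ratio is the normalized insertion/deletion similarity, 2 * LCS / (len(a) + len(b)); the library may rely on an external Levenshtein package with a different definition. The names indel_ratio and equal_sketch are hypothetical.

```python
def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: 2 * LCS(a, b) / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    # Longest common subsequence length, keeping one DP row at a time.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 2.0 * prev[-1] / total


def equal_sketch(prediction: str, reference: str, threshold: float = 0.9) -> int:
    """Return 1 if the ratio strictly exceeds the threshold, else 0."""
    return 1 if indel_ratio(prediction, reference) > threshold else 0


match = equal_sketch("the answer is 42", "the answer is 42.")  # ratio 32/33
near_miss = equal_sketch("12.50", "12.5")                      # ratio 8/9
```

With the default threshold of 0.9, short numeric answers that differ by a single character fall below the cutoff, while longer strings tolerate small differences.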

evaluate(data: Dict, args)

Evaluates predictions against references contained within the dictionary using various metrics.

Args:

data (Dict): A dictionary that must contain the keys “predictions” and “references”.

class vieval.metrics.translation_metric.TranslationMetric

Bases: BaseMetric

Evaluate the quality of text translations using metrics such as BLEU and hLEPOR.

evaluate(data: Dict, args)

Computes the translation quality metrics for a set of predictions and references provided in the dictionary.

Args:

data (Dict): A dictionary expected to contain two keys:

predictions: A list of translated texts generated by the translation model.

references: A list of reference translations for evaluating the quality of the model’s predictions.

Returns:

The original data dictionary, which contains the raw predictions and references.
A result dictionary with the following keys:
- “bleu”: The computed BLEU score for the translations.
- “hLepor”: The computed hLEPOR score for the translations.