Postprocess¶

Process¶

This module provides functions to preprocess, evaluate, and filter outputs generated by information retrieval (IR) systems and large language models (LLMs), including confidence scoring, thresholding, and output filtering. It refines predictions by combining information from both IR and LLM systems so that the most relevant and confident predictions are retained.

Functions:

retriever_postprocessor: Prepares IR outputs for further processing by removing irrelevant data.
llm_postprocessor: Prepares LLM outputs for further processing by removing irrelevant data.
rag_heuristic_postprocessor: Processes and filters predictions using heuristic methods for confidence scoring.
rag_hybrid_postprocessor: Hybrid method for processing predictions by integrating IR and LLM results using matrix-based analysis.

ontoaligner.postprocess.process.graph_postprocessor(predicts, threshold)[source]¶

Post-processes raw alignment predictions to enforce one-to-one mappings between source and target entities based on confidence scores.

This function groups candidate matches by target entities and selects the highest-scoring unique source-target pairs, filtering out predictions below a specified similarity threshold.

Parameters:

predicts (List[Dict]) – A list of predicted alignments, where each prediction is a dictionary with keys: “source” (str), “target” (str), and “score” (float).
threshold (float) – Minimum similarity score required for a prediction to be retained.

Returns:

A filtered list of one-to-one alignments, each containing:

“source”: the source entity IRI
“target”: the target entity IRI
“score”: the similarity/confidence score

Return type:

List[Dict]

Example

>>> from ontoaligner.postprocess.process import graph_postprocessor
>>> raw_preds = [
...     {"source": "A", "target": "X", "score": 0.9},
...     {"source": "B", "target": "X", "score": 0.8},
...     {"source": "A", "target": "Y", "score": 0.7}
... ]
>>> graph_postprocessor(raw_preds, threshold=0.75)
[{'source': 'A', 'target': 'X', 'score': 0.9}]
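
The selection logic described above can be approximated in a few lines of plain Python. This is a minimal sketch, not the library's implementation: the function name `one_to_one_filter`, the greedy highest-score-first strategy, and the inclusive `>=` threshold comparison are all assumptions made for illustration.

```python
from typing import Dict, List


def one_to_one_filter(predicts: List[Dict], threshold: float) -> List[Dict]:
    """Hypothetical greedy one-to-one selection over scored alignments."""
    # Drop candidates below the threshold (inclusive comparison is an assumption).
    candidates = [p for p in predicts if p["score"] >= threshold]
    # Visit the strongest candidates first so they claim their entities.
    candidates.sort(key=lambda p: p["score"], reverse=True)

    used_sources, used_targets = set(), set()
    selected = []
    for pred in candidates:
        if pred["source"] in used_sources or pred["target"] in used_targets:
            continue  # one side of this pair is already aligned
        used_sources.add(pred["source"])
        used_targets.add(pred["target"])
        selected.append(pred)
    return selected


raw_preds = [
    {"source": "A", "target": "X", "score": 0.9},
    {"source": "B", "target": "X", "score": 0.8},
    {"source": "A", "target": "Y", "score": 0.7},
]
result = one_to_one_filter(raw_preds, threshold=0.75)
print(result)
```

With the sample input, only the A/X pair survives: A/Y falls below the threshold, and B/X loses target X to the higher-scoring A/X pair.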

ontoaligner.postprocess.process.llm_postprocessor(predicts: List, mapper: Any, dataset: Any, interested_class: str = 'yes') → List[source]¶

Prepares LLM outputs for further processing by removing irrelevant data.

ontoaligner.postprocess.process.rag_heuristic_postprocessor(predicts: List, topk_confidence_ratio: int = 3, topk_confidence_score: int = 1) → [List, Dict][source]¶

Processes the predictions using heuristic methods for filtering based on confidence ratio, IR score, and LLM confidence score.

Parameters:

predicts (List) – List of prediction outputs containing both IR and LLM results.
topk_confidence_ratio (int, optional) – Number of top predictions to retain based on confidence ratio. Default is 3.
topk_confidence_score (int, optional) – Number of top predictions to retain based on LLM confidence score. Default is 1.

Returns:

A list of filtered predictions after applying the heuristic method, and a dictionary of the configuration settings used for filtering.

Return type:

[List, Dict]

ontoaligner.postprocess.process.rag_hybrid_postprocessor(predicts: List, ir_score_threshold: float = 0.9, llm_confidence_th: float = 0.7) → [List, Dict][source]¶

A hybrid approach that integrates IR and LLM outputs using matrix analysis and confidence thresholds.

Parameters:

predicts (List) – List containing IR and LLM output predictions.
ir_score_threshold (float, optional) – Threshold for IR score filtering. Default is 0.9.
llm_confidence_th (float, optional) – Threshold for LLM confidence score filtering. Default is 0.7.

Returns:

A list of filtered predictions, and a dictionary of the configuration parameters used for filtering.

Return type:

[List, Dict]

ontoaligner.postprocess.process.retriever_postprocessor(predicts: List, threshold: float = 0.0) → List[source]¶

Prepares IR outputs by extracting source-target pairs and filtering based on score values.

Parameters:

predicts (List) – List of dictionaries containing source, target candidates, and score candidates.
threshold (float, optional) – Score threshold used for filtering. Default is 0.0.

Returns:

A list of dictionaries containing source-target pairs with positive scores.

Return type:

List
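
The flattening step described above might look like the following sketch. It is illustrative only: the function name `flatten_ir_outputs`, the input keys `"target-cands"` and `"score-cands"`, and the strict `>` comparison are assumptions, not a statement of what the library does.

```python
from typing import Dict, List


def flatten_ir_outputs(predicts: List[Dict], threshold: float = 0.0) -> List[Dict]:
    """Hypothetical flattening of per-source candidate lists into scored pairs."""
    pairs = []
    for item in predicts:
        # Pair each candidate target with the score at the same position.
        for target, score in zip(item["target-cands"], item["score-cands"]):
            if score > threshold:  # strict comparison is an assumption
                pairs.append(
                    {"source": item["source"], "target": target, "score": score}
                )
    return pairs


# Assumed input layout: one dictionary per source entity.
ir_outputs = [
    {"source": "A", "target-cands": ["X", "Y"], "score-cands": [0.8, 0.0]},
    {"source": "B", "target-cands": ["Z"], "score-cands": [0.4]},
]
pairs = flatten_ir_outputs(ir_outputs)
print(pairs)
```

With the default threshold of 0.0, the zero-scored A/Y candidate is dropped and the two positively scored pairs remain.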

Util¶

Set of helper functions for post-processing methods.

eval_preprocess_ir_outputs: Processes and filters IR outputs based on confidence score.
threshold_finder: Determines the threshold value for a given set of scores from a dictionary.
build_outputdict: Constructs a dictionary mapping sources to their respective predicted targets and scores.
confidence_score_ratio_based_filtering: Filters predictions based on confidence ratios and a given threshold.
confidence_score_based_filtering: Filters predictions based on LLM confidence scores and IR scores.

ontoaligner.postprocess.util.build_outputdict(llm_outputs: List, ir_outputs: List) → Dict[source]¶

Builds a dictionary mapping source IRIs to target predictions with their scores from both IR and LLM outputs.

Parameters:

llm_outputs (List) – List of LLM prediction outputs.
ir_outputs (List) – List of IR prediction outputs.

Returns:

A dictionary where each source is mapped to a list of target predictions and their associated scores.

Return type:

Dict

ontoaligner.postprocess.util.confidence_score_based_filtering(outputdict_confidence_ratios: Dict, topk_confidence_score: int, llm_confidence_threshold: float, ir_score_threshold: float) → List[source]¶

Filters the predictions based on LLM confidence score and IR score, selecting the top-k predictions that exceed the given thresholds.

Parameters:

outputdict_confidence_ratios (Dict) – Dictionary with source-target predictions filtered by confidence ratio.
topk_confidence_score (int) – Number of top predictions to keep based on LLM confidence score.
llm_confidence_threshold (float) – The threshold for LLM confidence score to filter predictions.
ir_score_threshold (float) – The threshold for IR score to filter predictions.

Returns:

Filtered predictions based on LLM confidence score and IR score thresholds.

Return type:

List
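
As a rough illustration of combining two thresholds with a top-k cut, consider the sketch below. The function name `filter_by_confidence`, the candidate keys `"target"`, `"ir_score"`, and `"llm_confidence"`, the inclusive comparisons, and the shape of the returned records are all hypothetical; the actual data layout used by the library may differ.

```python
from typing import Dict, List


def filter_by_confidence(
    candidates_by_source: Dict[str, List[Dict]],
    topk_confidence_score: int,
    llm_confidence_threshold: float,
    ir_score_threshold: float,
) -> List[Dict]:
    """Hypothetical dual-threshold, top-k filter (not the library code)."""
    kept = []
    for source, candidates in candidates_by_source.items():
        # A candidate must clear both thresholds (inclusive, by assumption).
        passing = [
            c
            for c in candidates
            if c["llm_confidence"] >= llm_confidence_threshold
            and c["ir_score"] >= ir_score_threshold
        ]
        # Rank the survivors by LLM confidence and keep the best k per source.
        passing.sort(key=lambda c: c["llm_confidence"], reverse=True)
        for c in passing[:topk_confidence_score]:
            kept.append(
                {"source": source, "target": c["target"], "score": c["llm_confidence"]}
            )
    return kept


candidates_by_source = {
    "A": [
        {"target": "X", "ir_score": 0.95, "llm_confidence": 0.9},
        {"target": "Y", "ir_score": 0.92, "llm_confidence": 0.8},
        {"target": "Z", "ir_score": 0.50, "llm_confidence": 0.99},
    ],
    "B": [{"target": "W", "ir_score": 0.91, "llm_confidence": 0.6}],
}
filtered = filter_by_confidence(candidates_by_source, 1, 0.7, 0.9)
print(filtered)
```

Here A/Z is rejected on its IR score despite a high LLM confidence, B/W is rejected on LLM confidence, and the top-1 cut keeps A/X over A/Y.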

ontoaligner.postprocess.util.confidence_score_ratio_based_filtering(outputdict: Dict, topk_confidence_ratio: int, cr_threshold: float) → Dict[source]¶

Filters the predictions based on confidence ratio values, selecting the top-k predictions that exceed the specified confidence ratio threshold.

Parameters:

outputdict (Dict) – Dictionary containing source-target predictions with scores and confidence ratios.
topk_confidence_ratio (int) – Number of top predictions to keep based on confidence ratio.
cr_threshold (float) – The threshold for confidence ratio to filter predictions.

Returns:

Filtered predictions with the top-k items exceeding the confidence ratio threshold.

Return type:

Dict
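
A per-source top-k selection over a threshold can be sketched with the standard library's `heapq.nlargest`. The function name `filter_by_ratio`, the `"confidence_ratio"` key, the strict `>` comparison, and the choice to omit sources with no surviving candidates are assumptions for illustration; how the confidence ratio itself is computed is not covered here.

```python
import heapq
from typing import Dict, List


def filter_by_ratio(
    outputdict: Dict[str, List[Dict]], topk_confidence_ratio: int, cr_threshold: float
) -> Dict[str, List[Dict]]:
    """Hypothetical top-k filter on a precomputed confidence ratio."""
    filtered = {}
    for source, candidates in outputdict.items():
        # Keep only candidates whose ratio exceeds the threshold.
        above = [c for c in candidates if c["confidence_ratio"] > cr_threshold]
        # nlargest returns at most k items, ordered from highest to lowest ratio.
        top = heapq.nlargest(
            topk_confidence_ratio, above, key=lambda c: c["confidence_ratio"]
        )
        if top:  # sources with no surviving candidates are omitted (assumption)
            filtered[source] = top
    return filtered


outputdict = {
    "A": [
        {"target": "X", "confidence_ratio": 1.4},
        {"target": "Y", "confidence_ratio": 0.6},
        {"target": "Z", "confidence_ratio": 1.1},
    ],
    "B": [{"target": "W", "confidence_ratio": 0.2}],
}
top_by_ratio = filter_by_ratio(outputdict, topk_confidence_ratio=2, cr_threshold=0.5)
print(top_by_ratio)
```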

ontoaligner.postprocess.util.eval_preprocess_ir_outputs(predicts: List) → List[source]¶

Filters out redundant IR predictions based on the source-target pair and their respective scores.

Parameters:

predicts (List) – List of dictionaries containing source, target candidates, and score candidates.

Returns:

A filtered list of predictions with unique source-target pairs and positive scores.

Return type:

List
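
One way to obtain unique source-target pairs with positive scores is to index candidates by pair, as in this sketch. The function name `unique_positive_pairs`, the input keys `"target-cands"` and `"score-cands"`, and the rule of keeping the highest score for a repeated pair are assumptions, not the library's documented behavior.

```python
from typing import Dict, List, Tuple


def unique_positive_pairs(predicts: List[Dict]) -> List[Dict]:
    """Hypothetical de-duplication of IR candidates by (source, target) pair."""
    best: Dict[Tuple[str, str], float] = {}
    for item in predicts:
        for target, score in zip(item["target-cands"], item["score-cands"]):
            if score <= 0:
                continue  # only positive scores are retained
            key = (item["source"], target)
            # For a repeated pair, keep the highest score seen (assumption).
            if key not in best or score > best[key]:
                best[key] = score
    return [
        {"source": source, "target": target, "score": score}
        for (source, target), score in best.items()
    ]


ir_predicts = [
    {"source": "A", "target-cands": ["X", "X", "Y"], "score-cands": [0.6, 0.9, 0.0]},
    {"source": "A", "target-cands": ["X"], "score-cands": [0.3]},
]
unique_pairs = unique_positive_pairs(ir_predicts)
print(unique_pairs)
```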

ontoaligner.postprocess.util.threshold_finder(dictionary: dict, index: int, use_lst: bool = False) → float[source]¶

Finds the threshold value based on the given index of a score in a dictionary.

Parameters:

dictionary (dict) – Dictionary containing predictions with scores.
index (int) – The index of the score in the prediction output to be thresholded.
use_lst (bool, optional) – Whether to use the list of values or the dictionary. Defaults to False.

Returns:

The computed threshold value.

Return type:

float

Label Mapper¶

This module implements label mapping using different machine learning approaches. It defines a base LabelMapper class and two subclasses:

TFIDFLabelMapper: Uses a TfidfVectorizer and a classifier for label prediction.
SBERTLabelMapper: Uses SentenceTransformer embeddings and a classifier for label prediction.

class ontoaligner.postprocess.label_mapper.LabelMapper(label_dict: Dict[str, List[str]] | None = None, iterator_no: int = 10)[source]¶

Bases: object

Base class for label mapping, providing common functionality for derived classes.

Initializes the label mapper with training data and labels.

Parameters:

label_dict (Dict[str, List[str]]) – Dictionary mapping each label to a list of candidate phrases.
iterator_no (int) – Number of iterations to replicate training data for better generalization.

fit()¶

Placeholder for model fitting logic, implemented by subclasses.

predict(X: List[str]) → List[str]¶

Predicts labels for the given input.

Parameters:

X (List[str]) – List of input texts to classify.

Returns:

Predicted labels.

Return type:

List[str]

validate_predicts(preds: List[str])¶

Validates if predictions are among valid labels.

Parameters:

preds (List[str]) – List of predicted labels.
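
The two ideas behind the base class, replicating the candidate phrases `iterator_no` times to form training data and checking predictions against the known labels, can be sketched as follows. The helper names `build_training_data` and `check_predictions` are hypothetical, and raising `ValueError` on an unknown label is an assumption; the library may build its data or report invalid labels differently.

```python
from typing import Dict, List, Tuple


def build_training_data(
    label_dict: Dict[str, List[str]], iterator_no: int
) -> Tuple[List[str], List[str]]:
    """Hypothetical replication of candidate phrases into (texts, labels)."""
    texts, labels = [], []
    for _ in range(iterator_no):
        for label, phrases in label_dict.items():
            texts.extend(phrases)
            # Every phrase in the list is tagged with the label it belongs to.
            labels.extend([label] * len(phrases))
    return texts, labels


def check_predictions(preds: List[str], label_dict: Dict[str, List[str]]) -> None:
    """Hypothetical validation: reject any prediction that is not a known label."""
    unknown = sorted(set(preds) - set(label_dict))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")


label_dict = {
    "yes": ["yes", "correct", "true"],
    "no": ["no", "incorrect", "false"],
}
texts, labels = build_training_data(label_dict, iterator_no=2)
check_predictions(["yes", "no", "yes"], label_dict)  # passes silently
print(len(texts), labels.count("yes"), labels.count("no"))
```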

class ontoaligner.postprocess.label_mapper.SBERTLabelMapper(model_id: str, label_dict: Dict[str, List[str]], classifier=None, iterator_no: int = 10)[source]¶

Bases: LabelMapper

LabelMapper subclass using SentenceTransformer embeddings and a classifier for label prediction.

Example usage:

>>> from ontoaligner.postprocess.label_mapper import SBERTLabelMapper
>>> label_dict = {
...     "yes": ["yes", "correct", "true"],
...     "no": ["no", "incorrect", "false"]
... }
>>> mapper = SBERTLabelMapper("all-MiniLM-L12-v2", label_dict)
>>> mapper.fit()
>>> mapper.predict(["yes", "correct", "false", "nice", "too bad", "very good"])
['yes', 'yes', 'no', 'yes', 'no', 'yes']

Initializes the SBERTLabelMapper.

Parameters:

model_id (str) – Name of the pretrained SentenceTransformer model.
label_dict (Dict[str, List[str]]) – Dictionary mapping each label to a list of candidate phrases.
classifier (optional) – Classifier fitted on the sentence embeddings. Defaults to None.
iterator_no (int) – Number of iterations to replicate training data.

fit()¶

Fits the classifier on the sentence embeddings.

class ontoaligner.postprocess.label_mapper.TFIDFLabelMapper(classifier: Any, ngram_range: Tuple, label_dict: Dict[str, List[str]] | None = None, analyzer: str = 'word', iterator_no: int = 10)[source]¶

Bases: LabelMapper

LabelMapper subclass using a TF-IDF vectorizer and a classifier for label prediction.

Initializes the TFIDFLabelMapper with a specified classifier and TF-IDF configuration.

Parameters:

classifier (Any) – Classifier object (e.g., LogisticRegression, SVC).
ngram_range (Tuple) – Range of n-grams for the TF-IDF vectorizer.
label_dict (Dict[str, List[str]]) – Dictionary mapping each label to a list of candidate phrases.
analyzer (str) – Specifies whether to analyze at the ‘word’ or ‘char’ level.
iterator_no (int) – Number of iterations to replicate training data.

fit()¶

Fits the TF-IDF pipeline on the training data.