Pipeline

OntoAlignerPipeline

class ontoaligner.pipeline.OntoAlignerPipeline(task_class: OMDataset, source_ontology_path: str, target_ontology_path: str, reference_matching_path: str, output_dir: str = 'results', output_format: str = 'xml')

Bases: object

A pipeline for performing ontology alignment tasks using various methods and models.

Initializes the OntoAlignerPipeline.

Parameters:

task_class (OMDataset) – Class responsible for handling ontology matching tasks.
source_ontology_path (str) – Path to the source ontology file.
target_ontology_path (str) – Path to the target ontology file.
reference_matching_path (str) – Path to the reference alignments.
output_dir (str, optional) – Directory to save results. Defaults to “results”.
output_format (str, optional) – Format of output files. Defaults to “xml”.

__call__(method: str, encoder_model: BaseEncoder | None = None, model_class: BaseOMModel | None = None, dataset_class: Dataset | None = None, postprocessor: Any | None = None, llm_path: str | None = None, retriever_path: str | None = None, device: str = 'cuda', batch_size: int = 2048, max_length: int = 300, max_new_tokens: int = 10, top_k: int = 10, fuzzy_sm_threshold: float = 0.2, evaluate: bool = False, return_matching: bool = True, output_file_name: str = 'matchings', save_matchings: bool = False, ir_threshold: float = 0.5, ir_rag_threshold: float = 0.7, llm_threshold: float = 0.5, llm_mapper: LabelMapper | None = None, llm_mapper_interested_class: str = 'yes', answer_set: Dict = {'no': ['no', 'false'], 'yes': ['yes', 'true']}, huggingface_access_token: str = '', openai_key: str = '', device_map: str = 'auto', positive_ratio: float = 0.7, n_shots: int = 5) → [Any, Any]

Executes the ontology alignment process using the specified method.

Parameters:

method (str) – The method to use, e.g., “lightweight”, “retriever”, or “llm”.
encoder_model (BaseEncoder, optional) – Encoder model to encode ontologies. Defaults to None.
model_class (BaseOMModel, optional) – Model class for matching. Defaults to None.
dataset_class (Dataset, optional) – Dataset class for LLM-based methods. Defaults to None.
postprocessor (Any, optional) – Post-processing function. Defaults to None.
llm_path (str, optional) – Path to the LLM model. Defaults to None.
retriever_path (str, optional) – Path to the retriever model. Defaults to None.
device (str, optional) – Device to use for computation. Defaults to “cuda”.
batch_size (int, optional) – Batch size for LLM-based methods. Defaults to 2048.
max_length (int, optional) – Maximum input length for LLM-based methods. Defaults to 300.
max_new_tokens (int, optional) – Maximum tokens to generate for LLM-based methods. Defaults to 10.
top_k (int, optional) – Number of top matches to retrieve in the retriever method. Defaults to 10.
fuzzy_sm_threshold (float, optional) – Threshold for fuzzy matching in lightweight methods. Defaults to 0.2.
evaluate (bool, optional) – Whether to evaluate the matching results. Defaults to False.
return_matching (bool, optional) – Whether to return the matching results. Defaults to True.
output_file_name (str, optional) – Output file name without file type. Defaults to “matchings”.
save_matchings (bool, optional) – Whether to save the matching results. Defaults to False.
ir_threshold (float, optional) – Threshold for the retrieval postprocessor. Defaults to 0.5.
ir_rag_threshold (float, optional) – Threshold for the retrieval postprocessor in the RAG module. Defaults to 0.7.
llm_threshold (float, optional) – Threshold for the LLM postprocessor. Defaults to 0.5.
llm_mapper (LabelMapper, optional) – Mapper for LLM outputs. Defaults to None.
llm_mapper_interested_class (str, optional) – Class used to filter output pairs in LLM postprocessing. Defaults to “yes”.
answer_set (dict, optional) – Mapping of yes/no answers. Defaults to {“yes”: [“yes”, “true”], “no”: [“no”, “false”]}.
huggingface_access_token (str, optional) – Access token for Hugging Face models. Defaults to “”.
openai_key (str, optional) – API key for OpenAI models. Defaults to “”.
device_map (str, optional) – Device map for model allocation. Defaults to “auto”.
positive_ratio (float, optional) – Ratio of positive examples in few-shot methods. Defaults to 0.7.
n_shots (int, optional) – Number of shots for few-shot learning. Defaults to 5.
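
Because __call__ accepts close to thirty keyword arguments, a misspelled name is easy to miss. The sketch below collects the settings for one run in a plain dictionary and checks each key against the parameter names listed above. DOCUMENTED_PARAMS and build_call_kwargs are hypothetical helpers written for this illustration and are not part of ontoaligner. The encoder, model class, and postprocessor objects are left out because their concrete names are not given on this page.

```python
# Parameter names copied from the __call__ signature documented above.
DOCUMENTED_PARAMS = frozenset({
    "method", "encoder_model", "model_class", "dataset_class", "postprocessor",
    "llm_path", "retriever_path", "device", "batch_size", "max_length",
    "max_new_tokens", "top_k", "fuzzy_sm_threshold", "evaluate",
    "return_matching", "output_file_name", "save_matchings", "ir_threshold",
    "ir_rag_threshold", "llm_threshold", "llm_mapper",
    "llm_mapper_interested_class", "answer_set", "huggingface_access_token",
    "openai_key", "device_map", "positive_ratio", "n_shots",
})


def build_call_kwargs(method, **overrides):
    """Hypothetical helper: reject keyword names that __call__ does not document."""
    unknown = sorted(set(overrides) - DOCUMENTED_PARAMS)
    if unknown:
        raise TypeError(f"unknown pipeline arguments: {unknown}")
    return {"method": method, **overrides}


# Settings for a retriever-style run. The values are illustrative choices,
# not recommendations from the library.
retriever_kwargs = build_call_kwargs(
    "retriever",
    top_k=5,
    ir_threshold=0.5,
    device="cpu",
    evaluate=True,
    return_matching=True,
)

# With a constructed pipeline object, the call would then be:
# result = pipeline(**retriever_kwargs)
```

Keeping the settings in a dictionary also makes it straightforward to log or reuse the exact configuration of a run.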

Returns:

Evaluation report if evaluate is True. Matching results if return_matching is True.

Return type:

dict or None
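
The answer_set and llm_mapper_interested_class parameters suggest a two-step idea: normalize a free-text LLM answer to a label, then keep only the pairs whose label is the class of interest. The sketch below is one plausible reading of that idea in plain Python. The names map_answer and filter_pairs are hypothetical, the normalization rules are assumptions, and ontoaligner's own LabelMapper may behave differently.

```python
from typing import Dict, List, Optional, Tuple

# Same shape as the documented default for answer_set.
ANSWER_SET: Dict[str, List[str]] = {"yes": ["yes", "true"], "no": ["no", "false"]}


def map_answer(raw: str, answer_set: Dict[str, List[str]] = ANSWER_SET) -> Optional[str]:
    """Return the label whose surface forms include the normalized answer, else None."""
    # Assumed normalization: trim whitespace, lowercase, drop trailing "." or "!".
    token = raw.strip().lower().rstrip(".!")
    for label, surface_forms in answer_set.items():
        if token in surface_forms:
            return label
    return None


def filter_pairs(
    pairs: List[Tuple[str, str, str]], interested_class: str = "yes"
) -> List[Tuple[str, str]]:
    """Keep (source, target) pairs whose mapped answer equals interested_class."""
    return [(s, t) for s, t, answer in pairs if map_answer(answer) == interested_class]


# Illustrative (source, target, raw LLM answer) triples; the data is made up.
llm_outputs = [
    ("Heart", "heart", "Yes."),
    ("Lung", "Liver", "false"),
    ("Kidney", "Renal organ", "maybe"),
]
kept = filter_pairs(llm_outputs)
```

Answers that match no surface form map to None here and are dropped, which is a design choice of this sketch rather than documented library behavior.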

AlignerPipeline

class ontoaligner.pipeline.AlignerPipeline(encoder: BaseEncoder, aligner: BaseOMModel, om_dataset: Dict | None = None, load_params: Dict | None = None, llm_dataset_class: Dataset | None = None, batch_size: int = 1, shuffle: bool = False, postprocessor: Any | None = None, postprocessor_params: Dict | None = None, include_reference: bool = False, **kwargs)

Bases: BaseOMModel

An aligner pipeline that runs one encoder and one ontology matching aligner.

This class follows the standard OntoAligner flow for one aligner pipeline: encode the ontology matching dataset, load the aligner if needed, generate predictions, and optionally apply a postprocessor.

Initializes the aligner pipeline.

Parameters:

encoder (BaseEncoder) – Encoder model used to encode the ontology matching dataset.
aligner (BaseOMModel) – Ontology matching aligner used to generate predictions.
om_dataset (Dict, optional) – Pre-collected ontology matching dataset. Defaults to None.
load_params (Dict, optional) – Parameters forwarded to the aligner load method. Defaults to None.
llm_dataset_class (Dataset, optional) – Dataset class used to wrap LLM inputs. Defaults to None.
batch_size (int, optional) – Batch size used for LLM dataset generation. Defaults to 1.
shuffle (bool, optional) – Whether to shuffle LLM dataset batches. Defaults to False.
postprocessor (Any, optional) – Optional postprocessor applied to predictions. Defaults to None.
postprocessor_params (Dict, optional) – Optional parameters forwarded to the postprocessor. Defaults to None.
include_reference (bool, optional) – Whether to pass reference matchings to the encoder. Defaults to False.
**kwargs – Additional keyword arguments that may be used for model configuration.
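
The flow described above (encode, load the aligner if needed, generate predictions, optionally postprocess) can be sketched with small stand-in classes. Everything in this block is hypothetical: ToyEncoder, ToyAligner, ToyAlignerPipeline, and keep_above are not ontoaligner classes, and the "source"/"target" dataset keys are assumed for illustration. The sketch only mirrors the order of the steps and the fallback to a pre-collected dataset.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class ToyEncoder:
    """Stand-in encoder: pairs every source label with every target label."""

    def __call__(self, source: List[str], target: List[str]) -> List[Tuple[str, str]]:
        return [(s, t) for s in source for t in target]


class ToyAligner:
    """Stand-in aligner: scores 1.0 on case-insensitive equality, else 0.0."""

    def __init__(self) -> None:
        self.loaded = False

    def load(self, **params: Any) -> None:
        self.loaded = True

    def generate(self, pairs: List[Tuple[str, str]]) -> List[Dict]:
        return [
            {"source": s, "target": t, "score": float(s.lower() == t.lower())}
            for s, t in pairs
        ]


class ToyAlignerPipeline:
    """Hypothetical pipeline mirroring the documented four-step flow."""

    def __init__(
        self,
        encoder: ToyEncoder,
        aligner: ToyAligner,
        om_dataset: Optional[Dict] = None,
        load_params: Optional[Dict] = None,
        postprocessor: Optional[Callable] = None,
        postprocessor_params: Optional[Dict] = None,
    ) -> None:
        self.encoder = encoder
        self.aligner = aligner
        self.om_dataset = om_dataset
        self.load_params = load_params or {}
        self.postprocessor = postprocessor
        self.postprocessor_params = postprocessor_params or {}

    def generate(self, input_data: Optional[Dict] = None) -> List:
        # Fall back to the pre-collected dataset when no input is given.
        data = input_data if input_data is not None else self.om_dataset
        if data is None:
            raise ValueError("no dataset provided")
        encoded = self.encoder(data["source"], data["target"])  # 1. encode
        if not self.aligner.loaded:                              # 2. load if needed
            self.aligner.load(**self.load_params)
        predictions = self.aligner.generate(encoded)             # 3. predict
        if self.postprocessor is not None:                       # 4. postprocess
            predictions = self.postprocessor(predictions, **self.postprocessor_params)
        return predictions


def keep_above(predictions: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Toy postprocessor: keep predictions scoring at or above the threshold."""
    return [p for p in predictions if p["score"] >= threshold]


pipeline = ToyAlignerPipeline(
    ToyEncoder(),
    ToyAligner(),
    om_dataset={"source": ["Heart", "Lung"], "target": ["heart", "Liver"]},
    postprocessor=keep_above,
    postprocessor_params={"threshold": 0.5},
)
matches = pipeline.generate()
```

Passing a dataset to generate overrides the pre-collected one for that call only, which matches the input_data description for the real generate method.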

generate(input_data: Dict | None = None) → List

Generates predictions for one aligner pipeline.

Parameters:

input_data (Dict, optional) – Ontology matching dataset. If not provided, the pipeline uses its own pre-collected dataset.

Returns:

A list of raw or postprocessed alignment predictions.

Return type:

List