Pipeline

Ontology Alignment Pipeline. Various methods, such as lightweight matching, retriever-based matching, LLM-based matching, and RAG (Retrieval-Augmented Generation) techniques, have been applied.

class ontoaligner.pipeline.OntoAlignerPipeline(task_class: OMDataset, source_ontology_path: str, target_ontology_path: str, reference_matching_path: str, output_dir: str = 'results', output_format: str = 'xml')

Bases: object

A pipeline for performing ontology alignment tasks using various methods and models.

Initializes the OntoAlignerPipeline.

Parameters:
  • task_class (OMDataset) – Class responsible for handling ontology matching tasks.

  • source_ontology_path (str) – Path to the source ontology file.

  • target_ontology_path (str) – Path to the target ontology file.

  • reference_matching_path (str) – Path to the reference alignments.

  • output_dir (str, optional) – Directory to save results. Defaults to “results”.

  • output_format (str, optional) – Format of output files. Defaults to “xml”.
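
A minimal sketch of how the constructor arguments above fit together. To stay runnable without the library or any ontology files, it takes the pipeline class as a parameter instead of importing it. The helper name `make_pipeline` is hypothetical and not part of the library; only the keyword names and defaults come from the signature above.

```python
from typing import Any, Callable


def make_pipeline(
    pipeline_cls: Callable[..., Any],
    task_class: Any,
    source_ontology_path: str,
    target_ontology_path: str,
    reference_matching_path: str,
    output_dir: str = "results",
    output_format: str = "xml",
) -> Any:
    """Instantiate pipeline_cls with the documented constructor keywords.

    pipeline_cls is assumed to accept the same keywords as
    OntoAlignerPipeline; make_pipeline itself is a hypothetical helper.
    """
    return pipeline_cls(
        task_class=task_class,
        source_ontology_path=source_ontology_path,
        target_ontology_path=target_ontology_path,
        reference_matching_path=reference_matching_path,
        output_dir=output_dir,        # documented default: "results"
        output_format=output_format,  # documented default: "xml"
    )
```

With the library installed, `pipeline_cls` would be `ontoaligner.pipeline.OntoAlignerPipeline` and `task_class` an `OMDataset` class for the task at hand; the three paths are supplied by the caller.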

__call__(method: str, encoder_model: BaseEncoder | None = None, model_class: BaseOMModel | None = None, dataset_class: Dataset | None = None, postprocessor: Any | None = None, llm_path: str | None = None, retriever_path: str | None = None, device: str = 'cuda', batch_size: int = 2048, max_length: int = 300, max_new_tokens: int = 10, top_k: int = 10, fuzzy_sm_threshold: float = 0.2, evaluate: bool = False, return_matching: bool = True, output_file_name: str = 'matchings', save_matchings: bool = False, ir_threshold: float = 0.5, ir_rag_threshold: float = 0.7, llm_threshold: float = 0.5, llm_mapper: LabelMapper | None = None, llm_mapper_interested_class: str = 'yes', answer_set: Dict = {'no': ['no', 'false'], 'yes': ['yes', 'true']}, huggingface_access_token: str = '', openai_key: str = '', device_map: str = 'auto', positive_ratio: float = 0.7, n_shots: int = 5) → [Any, Any]

Executes the ontology alignment process using the specified method.

Parameters:
  • method (str) – The method to use, e.g., “lightweight”, “retriever”, or “llm”.

  • encoder_model (BaseEncoder, optional) – Encoder model to encode ontologies. Defaults to None.

  • model_class (BaseOMModel, optional) – Model class for matching. Defaults to None.

  • dataset_class (Dataset, optional) – Dataset class for LLM-based methods. Defaults to None.

  • postprocessor (Any, optional) – Post-processing function. Defaults to None.

  • llm_path (str, optional) – Path to the LLM model. Defaults to None.

  • retriever_path (str, optional) – Path to the retriever model. Defaults to None.

  • device (str, optional) – Device to use for computation. Defaults to “cuda”.

  • batch_size (int, optional) – Batch size for LLM-based methods. Defaults to 2048.

  • max_length (int, optional) – Maximum input length for LLM-based methods. Defaults to 300.

  • max_new_tokens (int, optional) – Maximum tokens to generate for LLM-based methods. Defaults to 10.

  • top_k (int, optional) – Number of top matches to retrieve in the retriever method. Defaults to 10.

  • fuzzy_sm_threshold (float, optional) – Threshold for fuzzy matching in lightweight methods. Defaults to 0.2.

  • evaluate (bool, optional) – Whether to evaluate the matching results. Defaults to False.

  • return_matching (bool, optional) – Whether to return the matching results. Defaults to True.

  • output_file_name (str, optional) – Output file name without file type. Defaults to “matchings”.

  • save_matchings (bool, optional) – Whether to save the matching results. Defaults to False.

  • ir_threshold (float, optional) – Retrieval postprocessor threshold. Defaults to 0.5.

  • ir_rag_threshold (float, optional) – Retrieval postprocessor threshold in the RAG module. Defaults to 0.7.

  • llm_threshold (float, optional) – LLM postprocessor threshold. Defaults to 0.5.

  • llm_mapper (LabelMapper, optional) – Mapper for LLM outputs. Defaults to None.

  • llm_mapper_interested_class (str, optional) – Class to filter output pairs in LLM postprocessing. Defaults to “yes”.

  • answer_set (dict, optional) – Mapping of yes/no answers. Defaults to {“yes”: [“yes”, “true”], “no”: [“no”, “false”]}.

  • huggingface_access_token (str, optional) – Access token for Hugging Face models. Defaults to “”.

  • openai_key (str, optional) – API key for OpenAI models. Defaults to “”.

  • device_map (str, optional) – Device map for model allocation. Defaults to “auto”.

  • positive_ratio (float, optional) – Ratio of positive examples in few-shot methods. Defaults to 0.7.

  • n_shots (int, optional) – Number of shots for few-shot learning. Defaults to 5.
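
The answer_set and llm_mapper_interested_class parameters above can be illustrated with a small sketch. It is not the library's LabelMapper implementation: the function names `map_answer` and `keep_interested`, the word-level matching rule, and the `"answer"` key on each pair are assumptions made for illustration. Only the default answer_set and the default interested class "yes" come from the signature.

```python
import re
from typing import Dict, List, Optional

# Default answer_set from the documented signature.
ANSWER_SET: Dict[str, List[str]] = {
    "yes": ["yes", "true"],
    "no": ["no", "false"],
}


def map_answer(
    generated: str, answer_set: Dict[str, List[str]] = ANSWER_SET
) -> Optional[str]:
    """Map generated text to a label, or None if no listed form occurs.

    Assumption: a form matches when it appears as a whole lowercase word.
    Labels are checked in dictionary order and the first match wins.
    """
    words = re.findall(r"[a-z]+", generated.lower())
    for label, forms in answer_set.items():
        if any(form in words for form in forms):
            return label
    return None


def keep_interested(
    pairs: List[dict], interested_class: str = "yes"
) -> List[dict]:
    """Keep pairs whose mapped answer equals interested_class.

    The "answer" key is a hypothetical field holding the raw LLM output.
    """
    return [p for p in pairs if map_answer(p["answer"]) == interested_class]
```

Under these assumptions, an output such as "True." maps to "yes", and an output containing none of the listed forms maps to None and is dropped by the filter.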

Returns:

Evaluation report if evaluate is True. Matching results if return_matching is True.

Return type:

dict or None
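
The call parameters described above can be grouped by the method their descriptions mention. The sketch below builds a keyword dictionary for one of the three method values named in this reference. The helper `call_kwargs` and the grouping in `METHOD_DEFAULTS` are assumptions for illustration; the default values are copied from the `__call__` signature.

```python
from typing import Any, Dict

# Defaults copied from the __call__ signature. Assigning each parameter to a
# method follows the wording of its description and is an assumption.
METHOD_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "lightweight": {"fuzzy_sm_threshold": 0.2},
    "retriever": {"top_k": 10, "ir_threshold": 0.5},
    "llm": {
        "batch_size": 2048,
        "max_length": 300,
        "max_new_tokens": 10,
        "llm_threshold": 0.5,
    },
}


def call_kwargs(method: str, **overrides: Any) -> Dict[str, Any]:
    """Return keyword arguments for a pipeline call (hypothetical helper)."""
    if method not in METHOD_DEFAULTS:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(METHOD_DEFAULTS)}"
        )
    kwargs: Dict[str, Any] = {"method": method, **METHOD_DEFAULTS[method]}
    kwargs.update(overrides)  # caller-supplied values take precedence
    return kwargs
```

Given an existing pipeline instance, a call might then look like `pipeline(**call_kwargs("retriever", top_k=5, evaluate=True))`, with encoder_model, model_class, and postprocessor passed as further overrides where the chosen method needs them.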