Pipeline¶

AlignerPipeline provides a reusable execution flow for running one user-provided encoder and one ontology matching aligner over a collected ontology matching dataset. It is useful when users want direct control over the encoder, aligner, model loading, LLM dataset batching, and optional postprocessing.

Unlike a full orchestration pipeline, AlignerPipeline does not collect datasets, choose methods, define model-specific configurations, evaluate predictions, or save outputs. It focuses only on running the configured encoder-aligner setup and returning predictions.

Given two ontologies \(O_1\) and \(O_2\), AlignerPipeline produces a list of correspondence predictions through four stages:

🔧 1. Component Setup: Provide the encoder, aligner, dataset, and optional pipeline settings such as load_params, llm_dataset_class, postprocessor, or postprocessor_params.

⚙️ 2. Encoding: Convert the collected ontology matching dataset into the format expected by the aligner.

🧠 3. Prediction Generation: Generate predictions from encoded ontology data, with optional LLM dataset batching when llm_dataset_class is provided.

🧹 4. Optional Postprocessing: Apply a user-provided postprocessor to convert, filter, or normalize predictions before returning the final pipeline output.

Usage¶

This module guides you through a step-by-step process for running a single ontology alignment model using AlignerPipeline. By the end, you’ll understand how to collect an ontology matching dataset, configure an encoder and aligner, generate predictions, evaluate results, and save the outputs in XML and JSON formats.

➡️ 1: Import

Import the dataset class, encoder, aligner, pipeline, and utility modules.

import json

from ontoaligner.ontology import MaterialInformationMatOntoOMDataset
from ontoaligner.utils import metrics, xmlify
from ontoaligner.encoder import ConceptParentLightweightEncoder
from ontoaligner.aligner import SimpleFuzzySMLightweight
from ontoaligner import AlignerPipeline

➡️ 2: Parse Ontologies

Load the source ontology, target ontology, and reference alignment using an OntoAligner dataset class.

task = MaterialInformationMatOntoOMDataset()
print("Test Task:", task)

dataset = task.collect(
    source_ontology_path="assets/MI-MatOnto/mi_ontology.xml",
    target_ontology_path="assets/MI-MatOnto/matonto_ontology.xml",
    reference_matching_path="assets/MI-MatOnto/matchings.xml",
)

The collected dataset contains the source ontology items, target ontology items, and optional reference matchings.

{
    "source": [...],
    "target": [...],
    "reference": [...]
}

➡️ 3: Configure AlignerPipeline

Configure AlignerPipeline with one encoder, one aligner, and the collected ontology matching dataset.

aligner_pipeline = AlignerPipeline(
    encoder=ConceptParentLightweightEncoder(),
    aligner=SimpleFuzzySMLightweight(fuzzy_sm_threshold=0.2),
    om_dataset=dataset,
)

The encoder prepares the ontology items for the aligner. The aligner then generates candidate correspondences from the encoded data.

➡️ 4: Generate

Call generate() to encode the dataset and generate predictions.

matchings = aligner_pipeline.generate()

The output is a list of flat source-target correspondences.

[
    {"source": "...", "target": "...", "score": 0.9},
    ...
]

generate() can also receive a dataset directly through input_data. If input_data is provided, it is used instead of the dataset stored in om_dataset.

matchings = aligner_pipeline.generate(input_data=dataset)

➡️ 5: Evaluate and Export

Compare predictions to a reference alignment and export results.

# Evaluate
evaluation = metrics.evaluation_report(
    predicts=matchings,
    references=dataset["reference"],
)

print("Aligner Pipeline Evaluation Report:")
print(json.dumps(evaluation, indent=4))

Example output:

{
    "intersection": 42,
    "precision": 7.706422018348624,
    "recall": 13.90728476821192,
    "f-score": 9.917355371900827,
    "predictions-len": 545,
    "reference-len": 302
}

Export the final alignment to XML (OAEI-compatible) or JSON:

📄 Export to XML

xml_str = xmlify.xml_alignment_generator(matchings=matchings)
with open("aligner_pipeline_matchings.xml", "w", encoding="utf-8") as f:
    f.write(xml_str)

🧾 Export to JSON

with open("aligner_pipeline_matchings.json", "w", encoding="utf-8") as f:
    json.dump(matchings, f, indent=4, ensure_ascii=False)

Note

A complete aligner pipeline example is available at examples/aligner_pipeline.py.

Configuration¶

🔧 AlignerPipeline

Parameter	Type	Default	Description
encoder	BaseEncoder	—	Encoder model used to encode the ontology matching dataset.
aligner	BaseOMModel	—	Ontology matching aligner used to generate predictions.
om_dataset	dict	`None`	Pre-collected ontology matching dataset.
load_params	dict	`None`	Parameters forwarded to the aligner `load` method.
llm_dataset_class	Dataset	`None`	Dataset class used to wrap LLM inputs.
batch_size	int	`1`	Batch size used for LLM dataset generation.
shuffle	bool	`False`	Whether to shuffle LLM dataset batches.
postprocessor	Any	`None`	Optional postprocessor applied to pipeline predictions.
postprocessor_params	dict	`None`	Parameters forwarded to the postprocessor.
include_reference	bool	`False`	Whether to pass reference matchings to the encoder.
**kwargs	dict	`{}`	Additional keyword arguments forwarded to the base ontology matching model.

Configuration Example:

#FewShotRAG
AlignerPipeline(
        encoder=ConceptParentFewShotEncoder(),
        aligner=MistralLLMBERTRetrieverFSRAG(
            positive_ratio=1.0,
            n_shots=1,
            retriever_config=retriever_config,
            llm_config=llm_config,
        ),
        om_dataset=dataset,
        load_params={
            "llm_path": llm_model_path,
            "ir_path": ir_model_path,
        },
        postprocessor=rag_heuristic_postprocessor,
        postprocessor_params={
            "topk_confidence_ratio": 3,
            "topk_confidence_score": 3,
        },
        include_reference=True,
    )