Pipeline¶
AlignerPipeline provides a reusable execution flow for running one user-provided
encoder and one ontology matching aligner over a collected ontology matching dataset.
It is useful when users want direct control over the encoder, aligner, model loading,
LLM dataset batching, and optional postprocessing.
Unlike a full orchestration pipeline, AlignerPipeline does not collect
datasets, choose methods, define model-specific configurations, evaluate predictions,
or save outputs. It focuses only on running the configured encoder-aligner setup and
returning predictions.
Given two ontologies \(O_1\) and \(O_2\), AlignerPipeline produces
a list of correspondence predictions through four stages:
🔧 1. Component Setup: Provide the encoder, aligner, dataset, and optional
pipeline settings such as load_params, llm_dataset_class, postprocessor,
or postprocessor_params.
⚙️ 2. Encoding: Convert the collected ontology matching dataset into the format expected by the aligner.
🧠 3. Prediction Generation: Generate predictions from encoded ontology data, with
optional LLM dataset batching when llm_dataset_class is provided.
🧹 4. Optional Postprocessing: Apply a user-provided postprocessor to convert, filter, or normalize predictions before returning the final pipeline output.
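The flow of stages 2 to 4 can be sketched with toy stand-ins. This is a minimal illustration, not the library's implementation: ToyEncoder, ToyAligner, and run_stages are hypothetical names, and the calling conventions (an encoder called with source and target, a postprocessor called with the predictions plus keyword parameters) are assumptions made for the sketch.

```python
class ToyEncoder:
    """Hypothetical stand-in for an encoder: lowercases each label."""

    def __call__(self, source, target):
        return [
            [{"iri": s["iri"], "text": s["label"].lower()} for s in source],
            [{"iri": t["iri"], "text": t["label"].lower()} for t in target],
        ]


class ToyAligner:
    """Hypothetical stand-in for an aligner: exact match on encoded text."""

    def generate(self, input_data):
        source, target = input_data
        return [
            {"source": s["iri"], "target": t["iri"], "score": 1.0}
            for s in source
            for t in target
            if s["text"] == t["text"]
        ]


def run_stages(encoder, aligner, om_dataset, postprocessor=None, postprocessor_params=None):
    # Stage 2: encode the collected dataset into the aligner's input format.
    encoded = encoder(source=om_dataset["source"], target=om_dataset["target"])
    # Stage 3: generate predictions from the encoded data.
    predictions = aligner.generate(input_data=encoded)
    # Stage 4: apply the optional postprocessor (assumed calling convention).
    if postprocessor is not None:
        predictions = postprocessor(predictions, **(postprocessor_params or {}))
    return predictions


toy_dataset = {
    "source": [{"iri": "s:1", "label": "Alloy"}, {"iri": "s:2", "label": "Polymer"}],
    "target": [{"iri": "t:1", "label": "alloy"}, {"iri": "t:2", "label": "Ceramic"}],
    "reference": [],
}
result = run_stages(ToyEncoder(), ToyAligner(), toy_dataset)
print(result)
```

Only the ordering of the stages is the point here; real encoders and aligners do considerably more work at each step.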
Usage¶
This section walks through running a single ontology alignment model with AlignerPipeline, step by step. By the end, you'll know how to collect an ontology matching dataset, configure an encoder and aligner, and generate predictions, and then how to evaluate the results and save them in XML and JSON formats using the utility modules.
Import the dataset class, encoder, aligner, pipeline, and utility modules.
import json
from ontoaligner.ontology import MaterialInformationMatOntoOMDataset
from ontoaligner.utils import metrics, xmlify
from ontoaligner.encoder import ConceptParentLightweightEncoder
from ontoaligner.aligner import SimpleFuzzySMLightweight
from ontoaligner import AlignerPipeline
Load the source ontology, target ontology, and reference alignment using an OntoAligner dataset class.
task = MaterialInformationMatOntoOMDataset()
print("Test Task:", task)
dataset = task.collect(
    source_ontology_path="assets/MI-MatOnto/mi_ontology.xml",
    target_ontology_path="assets/MI-MatOnto/matonto_ontology.xml",
    reference_matching_path="assets/MI-MatOnto/matchings.xml",
)
The collected dataset contains the source ontology items, target ontology items, and optional reference matchings.
{
    "source": [...],
    "target": [...],
    "reference": [...]
}
Configure AlignerPipeline with one encoder, one aligner, and the collected ontology matching dataset.
aligner_pipeline = AlignerPipeline(
    encoder=ConceptParentLightweightEncoder(),
    aligner=SimpleFuzzySMLightweight(fuzzy_sm_threshold=0.2),
    om_dataset=dataset,
)
The encoder prepares the ontology items for the aligner. The aligner then generates candidate correspondences from the encoded data.
Call generate() to encode the dataset and generate predictions.
matchings = aligner_pipeline.generate()
The output is a list of flat source-target correspondences.
[
    {"source": "...", "target": "...", "score": 0.9},
    ...
]
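Because each prediction is a flat dictionary, the list can be handled with ordinary Python. The sketch below keeps only correspondences at or above a score cutoff and sorts them from strongest to weakest. The helper name filter_by_score and the sample records are hypothetical, and the sketch assumes every prediction carries a numeric score key as in the output shown above.

```python
def filter_by_score(matchings, min_score):
    """Keep predictions with score >= min_score, highest score first."""
    kept = [m for m in matchings if m["score"] >= min_score]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


# Hypothetical sample predictions in the flat source-target format.
example = [
    {"source": "s:1", "target": "t:1", "score": 0.9},
    {"source": "s:2", "target": "t:7", "score": 0.35},
    {"source": "s:3", "target": "t:4", "score": 0.6},
]
strong = filter_by_score(example, min_score=0.5)
print(strong)
```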
generate() can also receive a dataset directly through input_data. If input_data is provided, it is used instead of the dataset stored in om_dataset.
matchings = aligner_pipeline.generate(input_data=dataset)
Compare predictions to a reference alignment and export results.
# Evaluate
evaluation = metrics.evaluation_report(
    predicts=matchings,
    references=dataset["reference"],
)
print("Aligner Pipeline Evaluation Report:")
print(json.dumps(evaluation, indent=4))
Example output:
{
    "intersection": 42,
    "precision": 7.706422018348624,
    "recall": 13.90728476821192,
    "f-score": 9.917355371900827,
    "predictions-len": 545,
    "reference-len": 302
}
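The figures in this report are consistent with precision, recall, and F-score expressed as percentages and derived from the three counts. The following sketch reproduces that arithmetic as a sanity check. It is not the library's implementation, and the function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(intersection, predictions_len, reference_len):
    """Return precision, recall, and F-score as percentages."""
    precision = 100.0 * intersection / predictions_len if predictions_len else 0.0
    recall = 100.0 * intersection / reference_len if reference_len else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Counts taken from the example report above.
precision, recall, f_score = precision_recall_f1(42, 545, 302)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

Read this way, about 7.7% of the 545 predictions are correct, and they cover about 13.9% of the 302 reference correspondences.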
Export the final alignment to XML (OAEI-compatible) or JSON:
xml_str = xmlify.xml_alignment_generator(matchings=matchings)
with open("aligner_pipeline_matchings.xml", "w", encoding="utf-8") as f:
    f.write(xml_str)
with open("aligner_pipeline_matchings.json", "w", encoding="utf-8") as f:
    json.dump(matchings, f, indent=4, ensure_ascii=False)
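A saved JSON file can be read back with the standard library to confirm that it round-trips. This sketch uses a temporary directory and a hypothetical one-item prediction list in place of real pipeline output.

```python
import json
import os
import tempfile

# Hypothetical prediction list standing in for real pipeline output.
matchings_example = [{"source": "s:1", "target": "t:1", "score": 0.9}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "aligner_pipeline_matchings.json")
    # Write with the same options as the export step.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(matchings_example, f, indent=4, ensure_ascii=False)
    # Read the file back and compare with the in-memory list.
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == matchings_example
print(round_trip_ok)
```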
Note
A complete aligner pipeline example is available at examples/aligner_pipeline.py.
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| encoder | BaseEncoder | — | Encoder model used to encode the ontology matching dataset. |
| aligner | BaseOMModel | — | Ontology matching aligner used to generate predictions. |
| om_dataset | dict | | Pre-collected ontology matching dataset. |
| load_params | dict | | Parameters forwarded to the aligner. |
| llm_dataset_class | Dataset | | Dataset class used to wrap LLM inputs. |
| batch_size | int | | Batch size used for LLM dataset generation. |
| shuffle | bool | | Whether to shuffle LLM dataset batches. |
| postprocessor | Any | | Optional postprocessor applied to pipeline predictions. |
| postprocessor_params | dict | | Parameters forwarded to the postprocessor. |
| include_reference | bool | | Whether to pass reference matchings to the encoder. |
| **kwargs | dict | | Additional keyword arguments forwarded to the base ontology matching model. |
Configuration Example:
# FewShotRAG
AlignerPipeline(
    encoder=ConceptParentFewShotEncoder(),
    aligner=MistralLLMBERTRetrieverFSRAG(
        positive_ratio=1.0,
        n_shots=1,
        retriever_config=retriever_config,
        llm_config=llm_config,
    ),
    om_dataset=dataset,
    load_params={
        "llm_path": llm_model_path,
        "ir_path": ir_model_path,
    },
    postprocessor=rag_heuristic_postprocessor,
    postprocessor_params={
        "topk_confidence_ratio": 3,
        "topk_confidence_score": 3,
    },
    include_reference=True,
)
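The postprocessor and postprocessor_params settings pair a callable with the keyword arguments it should receive. As a rough illustration of that pairing, the sketch below defines a hypothetical postprocessor, keep_best_per_source, that drops low-scoring predictions and keeps the best target for each source. The name, the min_score parameter, and the calling convention (predictions as the first argument, parameters unpacked as keywords) are assumptions for this example and are not taken from the library.

```python
def keep_best_per_source(predicts, min_score=0.0):
    """Hypothetical postprocessor: best-scoring target per source item."""
    best = {}
    for p in predicts:
        if p["score"] < min_score:
            continue  # Discard predictions below the cutoff.
        current = best.get(p["source"])
        if current is None or p["score"] > current["score"]:
            best[p["source"]] = p
    return list(best.values())


# Hypothetical raw predictions in the flat source-target format.
raw = [
    {"source": "s:1", "target": "t:1", "score": 0.4},
    {"source": "s:1", "target": "t:2", "score": 0.8},
    {"source": "s:2", "target": "t:3", "score": 0.2},
]
postprocessor_params = {"min_score": 0.3}
cleaned = keep_best_per_source(raw, **postprocessor_params)
print(cleaned)
```

Keeping the parameters in a separate dictionary means the same callable can be reused with different cutoffs without changing its code.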