Retrieval
Usage
This tutorial shows how to perform ontology alignment with a retriever-based matching model. The process includes loading ontology datasets, generating embeddings, aligning concepts with retrieval models, post-processing the matches, and evaluating the results.
Start by importing the necessary libraries and modules. These tools will help us process and align the ontologies.
import json
# Import modules from OntoAligner
from ontoaligner import ontology, encoder
from ontoaligner.utils import metrics, xmlify
from ontoaligner.aligner import SBERTRetrieval
from ontoaligner.postprocess import retriever_postprocessor
Here:
- SBERTRetrieval: Pre-trained retrieval model for semantic matching, where you can load any sentence-transformer model and use it for matching.
- retriever_postprocessor: Refines matchings for better accuracy.
Define the ontology alignment task using the provided datasets, then load the ontologies and references.
# The dataset class is accessed through the imported `ontology` module
task = ontology.MaterialInformationMatOntoOMDataset()
print("Test Task:", task)
dataset = task.collect(
    source_ontology_path="assets/MI-MatOnto/mi_ontology.xml",
    target_ontology_path="assets/MI-MatOnto/matonto_ontology.xml",
    reference_matching_path="assets/MI-MatOnto/matchings.xml"
)
# Initialize the encoder model and encode the dataset.
encoder_model = encoder.ConceptParentLightweightEncoder()
encoder_output = encoder_model(source=dataset['source'], target=dataset['target'])
Note
For retrieval models, the LightweightEncoder-style encoders, such as ConceptParentLightweightEncoder, are a good choice.
Configure the retrieval model to align the source and target ontologies using semantic similarity. The SBERTRetrieval model leverages a pre-trained transformer for this task.
# Initialize retrieval model
model = SBERTRetrieval(device='cpu', top_k=10)
model.load(path="all-MiniLM-L6-v2")
# Generate matchings
matchings = model.generate(input_data=encoder_output)
The retrieval model computes semantic similarities between source and target embeddings, predicting potential alignments.
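To make the similarity step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is not OntoAligner's implementation: the concept identifiers, the 3-dimensional vectors, and the helper names `cosine` and `top_k_candidates` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_candidates(source_vecs, target_vecs, top_k=2):
    """For each source concept, rank target concepts by cosine similarity."""
    results = {}
    for s_id, s_vec in source_vecs.items():
        scored = [(t_id, cosine(s_vec, t_vec)) for t_id, t_vec in target_vecs.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[s_id] = scored[:top_k]
    return results


# Hypothetical toy embeddings; real sentence-transformer vectors
# have hundreds of dimensions.
source_vecs = {"src:Alloy": [0.9, 0.1, 0.0], "src:Polymer": [0.0, 0.2, 0.9]}
target_vecs = {
    "tgt:MetalAlloy": [0.8, 0.2, 0.1],
    "tgt:Plastic": [0.1, 0.1, 0.95],
    "tgt:Process": [0.1, 0.9, 0.1],
}

candidates = top_k_candidates(source_vecs, target_vecs, top_k=2)
for source_id, ranked in candidates.items():
    print(source_id, "->", ranked)
```

The `top_k` argument here plays the same role as the `top_k` parameter passed to SBERTRetrieval above: it caps how many candidate targets are kept per source concept.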
Refine the predicted matchings using the retriever_postprocessor. Postprocessing improves alignment quality by filtering or adjusting the results.
# Post-process matchings
matchings = retriever_postprocessor(matchings)
# Evaluate matchings
evaluation = metrics.evaluation_report(
    predicts=matchings,
    references=dataset['reference']
)
# Print evaluation report
print("Evaluation Report:", json.dumps(evaluation, indent=4))
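The following sketch illustrates what filtering and evaluation can look like in principle. It assumes each matching is a dictionary with `source`, `target`, and `score` keys, which this page does not confirm, and it uses the standard definitions of precision, recall, and F1 rather than the exact fields returned by `metrics.evaluation_report`. The helper names and the toy data are hypothetical.

```python
def filter_by_score(matchings, threshold=0.5):
    """Keep only matchings whose score reaches the threshold."""
    return [m for m in matchings if m["score"] >= threshold]


def precision_recall_f1(predicted, reference):
    """Compare predicted (source, target) pairs against reference pairs."""
    pred_pairs = {(m["source"], m["target"]) for m in predicted}
    ref_pairs = {(m["source"], m["target"]) for m in reference}
    true_pos = len(pred_pairs & ref_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(ref_pairs) if ref_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data (assumed record layout, see the note above)
predicted = [
    {"source": "a", "target": "x", "score": 0.9},
    {"source": "b", "target": "y", "score": 0.8},
    {"source": "c", "target": "z", "score": 0.3},
    {"source": "d", "target": "w", "score": 0.6},
]
reference = [
    {"source": "a", "target": "x"},
    {"source": "b", "target": "y"},
    {"source": "c", "target": "q"},
]

kept = filter_by_score(predicted, threshold=0.5)
report = precision_recall_f1(kept, reference)
print(report)
```

Raising the threshold typically trades recall for precision, so it is worth checking the report at a few threshold values before settling on one.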
Save the matchings in both XML and JSON formats for further analysis or use. To convert matchings to XML format, we use the xmlify utility.
# Export matchings to XML
xml_str = xmlify.xml_alignment_generator(matchings=matchings)
xml_output_path = "matchings.xml"
with open(xml_output_path, "w", encoding="utf-8") as xml_file:
    xml_file.write(xml_str)
print(f"Matchings in XML format have been written to '{xml_output_path}'.")
# Export matchings to JSON
json_output_path = "matchings.json"
with open(json_output_path, "w", encoding="utf-8") as json_file:
    json.dump(matchings, json_file, indent=4, ensure_ascii=False)
print(f"Matchings in JSON format have been written to '{json_output_path}'.")
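For later analysis, the JSON file can be read back with the standard library. The sketch below writes a small, hypothetical list of matchings to a temporary directory and reloads it; the record layout (`source`, `target`, `score`) and the example IRIs are assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical matchings; the real list comes from the steps above.
toy_matchings = [
    {"source": "http://example.org/src#Alloy",
     "target": "http://example.org/tgt#MetalAlloy",
     "score": 0.91},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "matchings.json")
    # Write with the same settings used in the export step
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(toy_matchings, json_file, indent=4, ensure_ascii=False)
    # Read the file back into Python objects
    with open(path, "r", encoding="utf-8") as json_file:
        reloaded = json.load(json_file)

print(f"Reloaded {len(reloaded)} matching(s).")
```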
Transformer Aligner
Transformer-based aligners leverage pretrained models from the sentence-transformers library (e.g., BERT, T5, Flan-T5, Nomic-AI) to encode ontology concepts into dense vector embeddings. SBERTRetrieval performs similarity-based matching directly over these embeddings, while SVMBERTRetrieval extends this approach by training an SVM classifier on embedding pairs to make alignment decisions.
| Transformer Aligner | Description | Link |
|---|---|---|
| SBERTRetrieval | A transformer-based aligner that supports sentence-transformer models such as BERT, T5, Flan-T5, and Nomic-AI. | |
| SVMBERTRetrieval | Trains a Support Vector Machine (SVM) classifier on embeddings for probability-based ranking. | |
To use a transformer-based aligner:
from ontoaligner.aligner import SBERTRetrieval, SVMBERTRetrieval
aligner = SBERTRetrieval(device="cpu", top_k=5)
aligner.load(path="all-MiniLM-L6-v2")
matchings = aligner.generate(input_data=...)
Hint
Replace SBERTRetrieval with SVMBERTRetrieval if you want to use the SVM-based retriever model.
N-Gram Aligner
N-Gram aligners apply traditional information retrieval techniques, such as TF-IDF and BM25, to measure textual similarity between ontology concepts based on term frequency patterns. These methods are efficient, interpretable, and particularly effective when concept labels or definitions contain meaningful lexical cues, which makes them well suited to fast, scalable alignment in lexically rich ontologies.
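As a rough illustration of the TF-IDF idea, the sketch below weights word tokens by a smoothed inverse document frequency and compares labels by cosine similarity. It is a simplified stand-in, not the TFIDFRetrieval implementation; the tokenizer, the smoothing formula, the helper names, and the toy labels are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (word unigrams only)."""
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus get weight 0."""
    counts = Counter(tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical target labels and one source label
target_labels = ["metal alloy", "thermal conductivity", "crystal structure"]
idf = build_idf(target_labels)
query_vec = tfidf_vector("alloy composition", idf)

scores = {label: cosine_sparse(query_vec, tfidf_vector(label, idf))
          for label in target_labels}
best_label = max(scores, key=scores.get)
print(best_label, round(scores[best_label], 4))
```

Only the shared token `alloy` contributes to the score here, which shows both the strength of lexical methods (transparent, cheap) and their limit (no credit for synonyms).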
| N-Gram Aligner | Description | Link |
|---|---|---|
| TFIDFRetrieval | Represents each concept label using a TF-IDF vector. | |
| BM25Retrieval | The BM25 retrieval model (Okapi BM25) is a probabilistic information retrieval method. It estimates class (or document) relevance from term frequency and inverse class (or document) frequency. | |
To use an n-gram-based aligner:
from ontoaligner.aligner import TFIDFRetrieval, BM25Retrieval
aligner = TFIDFRetrieval(top_k=5)
matchings = aligner.generate(input_data=...)
Hint
These aligners do not need a .load() call. Replace TFIDFRetrieval with BM25Retrieval if you want to use the BM25-based retriever model.
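For intuition about BM25 scoring, here is a small sketch of the Okapi BM25 formula over word tokens. It is not the BM25Retrieval implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the toy labels are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Number of documents that contain each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Non-negative IDF variant (the "+ 1.0" inside the logarithm)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical target labels and one source label
target_labels = [
    "thermal conductivity of metals",
    "electrical conductivity",
    "crystal lattice structure",
]
bm25 = bm25_scores("thermal conductivity", target_labels)
for label, score in zip(target_labels, bm25):
    print(f"{score:.3f}  {label}")
```

Compared with plain TF-IDF, BM25 saturates the contribution of repeated terms (through `k1`) and normalizes for label length (through `b`).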
OpenAI Aligner
OpenAI aligners utilize state-of-the-art embedding models from OpenAI (e.g., text-embedding-ada-002) to represent ontology concepts as dense semantic vectors. These aligners are well-suited for capturing deep contextual meaning across diverse domains and are especially useful when high-quality alignment is needed but local model hosting is not feasible. The embeddings are generated via OpenAI’s API and require an API key and token usage awareness.
| OpenAI Aligner | Description | Link |
|---|---|---|
| AdaRetrieval | This model uses pre-trained embeddings from OpenAI. It is designed to use OpenAI embeddings, fit them, and transform input data into the corresponding embeddings. | |
To use an OpenAI-based aligner:
from ontoaligner.aligner import AdaRetrieval
aligner = AdaRetrieval(top_k=5, openai_key='...')
aligner.load(path='text-embedding-3-small')
matchings = aligner.generate(input_data=...)
Hint
More information on OpenAI embeddings can be found at OpenAI > Embedding models.
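Because the aligner needs an API key, one common practice is to read it from an environment variable instead of hard-coding it in a script. The sketch below assumes the variable name `OPENAI_API_KEY`, which is a widespread convention rather than something this page specifies, and the helper `read_openai_key` is hypothetical.

```python
import os


def read_openai_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or fail early."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the aligner."
        )
    return key


# The key could then be passed to the aligner shown above, for example:
# aligner = AdaRetrieval(top_k=5, openai_key=read_openai_key())
```

Failing early with a clear message avoids spending time on encoding only to hit an authentication error at the embedding step.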