OLaLa: OM with LLMs

OLaLa

OLaLa (Ontology Matching with Large Language Models) is a retrieval-augmented ontology alignment system that combines dense semantic retrieval with open-source decoder language models to verify candidate correspondences. Key properties of OLaLa are:

Zero-shot / few-shot — Requires no labelled training pairs; a small set of in-context examples is sufficient.
Open-source LLMs only — All results are reproducible; no paid API is involved.
Confidence-calibrated — Binary yes/no token probabilities are normalised into a [0, 1] confidence score.
High-precision safety net — An exact-match high-precision matcher supplements the LLM to recover trivial correspondences at full confidence.
Postprocessing pipeline — Bad-host filtering, maximum-weight bipartite extraction, and confidence thresholding produce a clean one-to-one alignment.

The following diagram (Figure 1 of the paper) illustrates the overall OLaLa pipeline:

Given two ontologies \(O_1\) and \(O_2\), OLaLa produces a set of correspondence pairs \(M = \{(c, c', s) \mid c \in O_1,\; c' \in O_2,\; s \in [0,1]\}\), where \(s\) is the LLM-derived confidence that concepts \(c\) and \(c'\) refer to the same real-world entity. The pipeline has four stages:

🔍 1. Candidate Generation (SBERT): Each ontology concept is verbalized into one or more text strings using the TextExtractorSet strategy, which extracts labels, descriptions, annotation-property texts, and the URI fragment (when fewer than 50% of its characters are digits). All texts per resource are embedded with a Sentence-BERT model, and a cosine-similarity search returns the top-k candidates per resource. The default model is multi-qa-mpnet-base-dot-v1, with k = 5. The search is run in both directions (source → target and target → source), and the union of candidates is kept.
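The bidirectional top-k search described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (`cosine`, `top_k`, `bidirectional_candidates`) are hypothetical, each resource is assumed to have a single embedding, and toy 2-d vectors stand in for Sentence-BERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(queries, corpus, k):
    """Return (query_id, corpus_id) pairs for the k most similar corpus items per query."""
    pairs = set()
    for q_id, q_vec in queries.items():
        ranked = sorted(corpus, key=lambda c_id: cosine(q_vec, corpus[c_id]), reverse=True)
        pairs.update((q_id, c_id) for c_id in ranked[:k])
    return pairs


def bidirectional_candidates(source, target, k=5):
    """Union of source->target and target->source hits, always as (source, target)."""
    forward = top_k(source, target, k)
    backward = {(s_id, t_id) for t_id, s_id in top_k(target, source, k)}
    return forward | backward


# Toy 2-d "embeddings" (assumption: real runs use Sentence-BERT vectors).
source = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}
target = {"t1": [0.9, 0.1], "t2": [0.1, 0.9], "t3": [0.8, 0.6]}
candidates = bidirectional_candidates(source, target, k=1)
```

With k = 1, the forward search alone yields only (s1, t1) and (s2, t2); the reverse search adds (s1, t3), which is why taking the union of both directions can raise recall.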

🤖 2. LLM Binary Verification: Each candidate pair is presented to a decoder LLM via a few-shot prompt (prompt 7 in the paper — see the Prompts section below):

Classify if two descriptions refer to the same real world entity (ontology matching).
### Concept one: endocrine pancreas secretion    ### Concept two: Pancreatic Endocrine Secretion ### Answer: yes
### Concept one: urinary bladder urothelium      ### Concept two: Transitional Epithelium         ### Answer: no
### Concept one: trigeminal V nerve ophthalmic division ### Concept two: Ophthalmic Nerve         ### Answer: yes
### Concept one: foot digit 1 phalanx            ### Concept two: Foot Digit 2 Phalanx           ### Answer: no
### Concept one: large intestine                 ### Concept two: Colon                          ### Answer: no
### Concept one: ocular refractive media         ### Concept two: Refractile Media               ### Answer: yes
### Concept one: {left}                          ### Concept two: {right}                        ### Answer:

Generation stops as soon as a yes / no (or true / false) token is produced. The softmax probability of the positive class is normalised by the sum of the positive and negative class probabilities to yield a confidence score \(s = p_{yes}/(p_{yes} + p_{no})\). Every correspondence with \(s \geq 0.5\) is treated as a positive match.
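As a small worked sketch of this normalisation, the function below computes the score from the raw logits of the two answer tokens. The name `yes_confidence` is hypothetical, and the sketch assumes exactly one positive and one negative token. Because both probabilities share the same softmax denominator, it cancels in the ratio, so only the two logits are needed.

```python
import math


def yes_confidence(logit_yes, logit_no):
    """Return s = p_yes / (p_yes + p_no) from the logits of the answer tokens."""
    # Shift by the larger logit before exponentiating for numerical stability.
    shift = max(logit_yes, logit_no)
    p_yes = math.exp(logit_yes - shift)
    p_no = math.exp(logit_no - shift)
    return p_yes / (p_yes + p_no)


def is_match(logit_yes, logit_no, threshold=0.5):
    """A pair counts as a positive match when its confidence reaches the threshold."""
    return yes_confidence(logit_yes, logit_no) >= threshold
```

Equal logits give a score of exactly 0.5, which sits on the decision boundary and is therefore kept as a positive match.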

🎯 3. High-Precision Matching: In parallel, an exact-match high-precision matcher independently finds concepts with identical normalized labels or URI fragments (lowercased, camel-case split, non-alphanumeric characters removed). Only unambiguous 1:1 pairs (no N:M conflicts) are kept, all at confidence 1.0. These are merged into the LLM output to ensure trivial correspondences are never missed.
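A rough sketch of such an exact matcher follows. The functions `normalize` and `exact_matches` are illustrative names, not the library's API, and the sketch makes two assumptions: non-alphanumeric characters are collapsed to single spaces, and ambiguity is decided per normalized string.

```python
import re
from collections import defaultdict


def normalize(text):
    """Split camel case, collapse non-alphanumerics to single spaces, lowercase."""
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # "urinaryBladder" -> "urinary Bladder"
    text = re.sub(r"[^0-9A-Za-z]+", " ", text)  # underscores, punctuation -> space
    return text.lower().strip()


def exact_matches(source, target, confidence=1.0):
    """Pair IRIs with identical normalized labels, keeping only unambiguous 1:1 hits."""
    src_index, tgt_index = defaultdict(set), defaultdict(set)
    for iri, label in source.items():
        src_index[normalize(label)].add(iri)
    for iri, label in target.items():
        tgt_index[normalize(label)].add(iri)

    matches = []
    for key, src_iris in src_index.items():
        tgt_iris = tgt_index.get(key, set())
        # Skip empty keys and any label shared by several concepts on either side.
        if key and len(src_iris) == 1 and len(tgt_iris) == 1:
            (src,), (tgt,) = src_iris, tgt_iris
            matches.append({"source": src, "target": tgt, "score": confidence})
    return matches


source = {"a:1": "Urinary Bladder", "a:2": "heart", "a:3": "Heart"}
target = {"b:1": "urinary_bladder", "b:2": "HEART"}
hp_matches = exact_matches(source, target)
```

Here "heart" is dropped because two source concepts normalize to the same string, which mirrors the rule that N:M conflicts are never emitted at full confidence.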

🧹 4. Postprocessing: The merged alignment is cleaned in three steps:

Bad-host filter — removes correspondences whose IRIs do not belong to the expected source or target ontology hosts.
Maximum-weight bipartite extraction — enforces a one-to-one mapping by solving the assignment problem with scipy.optimize.linear_sum_assignment.
Confidence filter — discards all correspondences below a configurable threshold (default 0.5).
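The three steps can be sketched on plain (source, target, score) triples as follows. This is a simplified illustration under stated assumptions: `postprocess` is a hypothetical helper, each ontology is assumed to have a single expected host, and missing pairs are given weight 0 in a dense score matrix. Only `scipy.optimize.linear_sum_assignment` is taken from the description above.

```python
from urllib.parse import urlparse

import numpy as np
from scipy.optimize import linear_sum_assignment


def postprocess(pairs, source_host, target_host, threshold=0.5):
    """Host filter, then one-to-one extraction, then confidence filter."""
    # 1. Bad-host filter: keep pairs whose IRIs live on the expected hosts.
    pairs = [
        (s, t, w) for s, t, w in pairs
        if urlparse(s).netloc == source_host and urlparse(t).netloc == target_host
    ]
    if not pairs:
        return []

    # 2. Maximum-weight bipartite extraction on a dense matrix (0 = no candidate).
    sources = sorted({s for s, _, _ in pairs})
    targets = sorted({t for _, t, _ in pairs})
    scores = np.zeros((len(sources), len(targets)))
    for s, t, w in pairs:
        i, j = sources.index(s), targets.index(t)
        scores[i, j] = max(scores[i, j], w)
    rows, cols = linear_sum_assignment(scores, maximize=True)

    # 3. Confidence filter; this also drops zero-weight filler assignments.
    return [
        (sources[i], targets[j], float(scores[i, j]))
        for i, j in zip(rows, cols)
        if scores[i, j] > 0 and scores[i, j] >= threshold
    ]


pairs = [
    ("http://a.org/x1", "http://b.org/y1", 0.90),
    ("http://a.org/x1", "http://b.org/y2", 0.80),
    ("http://a.org/x2", "http://b.org/y1", 0.85),
    ("http://a.org/x2", "http://b.org/y2", 0.30),
    ("http://a.org/x3", "http://evil.org/y9", 0.99),  # wrong target host
]
final = postprocess(pairs, "a.org", "b.org")
```

A greedy pick would keep x1/y1 (0.90) and leave x2 with only a 0.30 candidate; the assignment solver instead chooses x1/y2 and x2/y1, whose total weight of 1.65 is higher.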

Note

Reference: Sven Hertling and Heiko Paulheim. 2023. OLaLa: Ontology Matching with Large Language Models. In Proceedings of the 12th Knowledge Capture Conference 2023 (K-CAP ’23). Association for Computing Machinery, New York, NY, USA, 131–139. https://doi.org/10.1145/3587259.3627571

Usage

➡️ 1: Import

Import the OLaLa pipeline components and utility modules.

import json
from ontoaligner.ontology import OLaLaOMDataset
from ontoaligner.encoder import OLaLaEncoder
from ontoaligner.aligner.olala import (
    OLaLaSBERTRetrieval,
    OLaLaLLMAligner,
    OLaLaHighPrecisionMatcher,
    OLaLaAligner,
)
from ontoaligner.aligner.olala.postprocessor import olala_postprocessor
from ontoaligner.utils import metrics, xmlify

Note

OLaLaAligner is a thin orchestrator that wires together the retriever, the LLM aligner, and the high-precision matcher. Each component can also be used independently.

➡️ 2: Parse Ontologies

OLaLaOMDataset.collect() calls OLaLaOntology.parse() for each OWL file. The parser extracts standard fields and OLaLa-specific fields (TextExtractorSet, OnlyLabel, high-precision texts, host) into an "olala" sub-dictionary on every concept.

task = OLaLaOMDataset(language="en")
print("Task:", task)

dataset = task.collect(
    source_ontology_path="assets/source.owl",
    target_ontology_path="assets/target.owl",
    reference_matching_path="assets/reference.xml",  # optional, for evaluation
)

Each entry in dataset["source"] / dataset["target"] has the shape:

{
    "iri":     "http://purl.obolibrary.org/obo/MA_0000001",
    "label":   "mouse",
    "olala": {
        "text_extractor_set":            ["mouse", ...],
        "normalized_text_extractor_set": ["mouse", ...],
        "only_label":                    "mouse",
        "hp_texts":                      ["mouse"],
        "host":                          "purl.obolibrary.org",
        "normalized_label":              "mouse",
        "normalized_uri_fragment":       "ma 0000001",
        ...
    }
}
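To make the host and URI-fragment fields more concrete, the sketch below shows one way such values could be derived from an IRI with the standard library. The function names are hypothetical and the parser's actual logic may differ. The 50% digit rule is the one mentioned for the TextExtractorSet strategy.

```python
from urllib.parse import urlparse


def iri_host(iri):
    """Host part of an IRI, e.g. 'purl.obolibrary.org'."""
    return urlparse(iri).netloc


def uri_fragment(iri):
    """Text after '#', or otherwise after the last '/' of the path."""
    parsed = urlparse(iri)
    return parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]


def is_informative(fragment, max_digit_ratio=0.5):
    """True when fewer than half of the characters are digits."""
    if not fragment:
        return False
    digits = sum(ch.isdigit() for ch in fragment)
    return digits / len(fragment) < max_digit_ratio


iri = "http://purl.obolibrary.org/obo/MA_0000001"
host = iri_host(iri)
fragment = uri_fragment(iri)
```

For this IRI the fragment is mostly digits (7 of 10 characters), so under the 50% rule it would not be used as a text for retrieval, although it can still serve as an exact-match key.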

Warning

Only OWL/XML ontologies are supported out of the box. For other RDF serializations, supply a custom parser.

➡️ 3: Encode Ontologies

OLaLaEncoder converts the parsed ontology items into the flat lists expected by the retriever, LLM aligner, and high-precision matcher.

encoder_model = OLaLaEncoder()

encoded_ontology = encoder_model(
    source=dataset["source"],
    target=dataset["target"],
)
# encoded_ontology == [source_items, target_items]

Each item in source_items / target_items exposes the fields texts, only_label, hp_texts, keep_for_sbert, and expected_host used by the downstream components.

➡️ 4: Initialise Components

Instantiate the three OLaLa components — retriever, LLM aligner, and high-precision matcher — then wire them into OLaLaAligner.

# SBERT candidate retriever
retriever = OLaLaSBERTRetrieval(
    device="cuda",
    top_k=5,
    both_directions=True,
    topk_per_resource=True,
)

# LLM binary verifier
llm_aligner = OLaLaLLMAligner(
    device="cuda",
    max_new_tokens=10,
    temperature=0.0,
    truncation=True,
    max_length=2048,
    padding=True,
    loading_arguments={
        "device_map": "auto",
        "torch_dtype": "torch.float16",
    },
)

# High-precision exact matcher
hp_aligner = OLaLaHighPrecisionMatcher(confidence=1.0)

# Orchestrator
olala = OLaLaAligner(
    retriever=retriever,
    llm_aligner=llm_aligner,
    hp_aligner=hp_aligner,
)

See Configuration below for a complete parameter reference.

➡️ 5: Load Models and Generate Alignments

Load the SBERT and LLM weights, then call generate().

olala.load(
    llm_path="upstage/Llama-2-70b-instruct-v2",
    retriever_path="multi-qa-mpnet-base-dot-v1",
)

alignments = olala.generate(input_data=encoded_ontology)

The raw output is a flat list of grouped LLM predictions and high-precision correspondences, each tagged with an alignment_type field:

[
    {
        "alignment_type": "rag",
        "source": "http://example.org/A",
        "target-cands": ["http://example.org/B", ...],
        "score-cands":  [0.87, ...]
    },
    {
        "alignment_type": "hp",
        "source": "http://example.org/C",
        "target": "http://example.org/D",
        "score":  1.0
    },
    ...
]
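If you want to inspect the raw output before postprocessing, the two entry shapes can be flattened into uniform records. The helper below is a hypothetical convenience, not part of the library; it assumes the keys shown above and keeps the higher score when both matchers propose the same pair.

```python
def flatten_alignments(alignments):
    """Turn grouped 'rag' entries and single-pair 'hp' entries into flat records."""
    best = {}
    for entry in alignments:
        if entry["alignment_type"] == "rag":
            pairs = zip(entry["target-cands"], entry["score-cands"])
        else:  # "hp" entries already describe a single pair
            pairs = [(entry["target"], entry["score"])]
        for target, score in pairs:
            key = (entry["source"], target)
            # Keep the highest score seen for a duplicated pair.
            best[key] = max(score, best.get(key, 0.0))
    return [{"source": s, "target": t, "score": w} for (s, t), w in best.items()]


raw = [
    {"alignment_type": "rag", "source": "http://example.org/A",
     "target-cands": ["http://example.org/B", "http://example.org/C"],
     "score-cands": [0.87, 0.20]},
    {"alignment_type": "hp", "source": "http://example.org/A",
     "target": "http://example.org/B", "score": 1.0},
]
flat = flatten_alignments(raw)
```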

➡️ 6: Postprocess

olala_postprocessor merges LLM and high-precision predictions, applies host filtering, extracts a one-to-one alignment, and applies the confidence threshold.

final_matchings = olala_postprocessor(
    alignments,
    encoded_ontology,
    confidence_threshold=0.5,
    strict_bad_hosts=False,
)

The output is a clean list of flat correspondences:

[
    {"source": "http://example.org/A", "target": "http://example.org/B", "score": 0.87},
    ...
]

➡️ 7: Evaluate and Export

Compare predictions to a reference alignment and export results.

# Evaluate
evaluation = metrics.evaluation_report(
    predicts=final_matchings,
    references=dataset["reference"],
)
print("OLaLa Evaluation Report:")
print(json.dumps(evaluation, indent=4))

Example output:

{
    "intersection": 1317,
    "precision":     91.4,
    "recall":        89.1,
    "f-score":       90.2,
    "predictions-len": 1441,
    "reference-len": 1478
}
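The fields of such a report are related by simple set arithmetic over (source, target) pairs. The sketch below is an independent re-implementation for illustration only, not the body of metrics.evaluation_report. It assumes that predictions and references are lists of dictionaries with "source" and "target" keys.

```python
def evaluate(predicts, references):
    """Precision, recall and F-score (in percent) over (source, target) pairs."""
    predicted = {(p["source"], p["target"]) for p in predicts}
    reference = {(r["source"], r["target"]) for r in references}
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    # The F-score is the harmonic mean, so it always lies between the other two.
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {
        "intersection": hits,
        "precision": round(100 * precision, 1),
        "recall": round(100 * recall, 1),
        "f-score": round(100 * f_score, 1),
        "predictions-len": len(predicted),
        "reference-len": len(reference),
    }


predicts = [{"source": f"s{i}", "target": f"t{i}"} for i in range(4)]
references = [{"source": f"s{i}", "target": f"t{i}"} for i in (0, 1, 2, 8, 9)]
report = evaluate(predicts, references)
```

In this toy case 3 of 4 predictions are correct and 3 of 5 reference pairs are found, giving a precision of 75.0, a recall of 60.0, and an F-score of 66.7.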

Export the final alignment to XML (OAEI-compatible) or JSON:

📄 Export to XML

xml_str = xmlify.xml_alignment_generator(matchings=final_matchings)
with open("olala_matchings.xml", "w", encoding="utf-8") as f:
    f.write(xml_str)

🧾 Export to JSON

with open("olala_matchings.json", "w", encoding="utf-8") as f:
    json.dump(final_matchings, f, indent=4, ensure_ascii=False)

Configuration

🔍 OLaLaSBERTRetrieval

Parameter	Type	Default	Description
device	str	`"cpu"`	Device for the SentenceTransformers model (`"cpu"` or `"cuda"`).
top_k	int	`5`	Number of candidate targets retrieved per source resource. Higher values improve recall but also raise LLM inference cost.
both_directions	bool	`True`	If `True`, retrieval is run in both source→target and target→source directions and the union is taken.
topk_per_resource	bool	`True`	If `True`, top-k filtering is applied per resource after merging both directions, preventing any single resource from dominating.

🤖 OLaLaLLMAligner

Parameter	Type	Default	Description
device	str	`"cpu"`	Device for the language model (`"cpu"` or `"cuda"`).
max_new_tokens	int	`10`	Maximum number of tokens the model is allowed to generate per prompt. Generation usually stops early when a yes/no token is detected.
temperature	float	`0.0`	Sampling temperature. Set to `0.0` for fully greedy (deterministic) decoding.
word_stopper	bool	`True`	If `True`, generation stops immediately after the first yes/no token. Disable only for debugging or custom stopping strategies.
loading_arguments	dict	`{}`	Extra keyword arguments forwarded to `AutoModelForCausalLM.from_pretrained`. Common keys: `device_map`, `torch_dtype`, `load_in_8bit`.
system_prompt_template	str	`"{user_prompt}"`	Optional wrapper around the filled prompt. Use this to add a system message for chat-tuned models, e.g. `"[INST] {user_prompt} [/INST]"`.
dataset_class	type	`OLaLaLLMDataset`	Dataset class used to build prompts. Override to customise text verbalization.
truncation	bool	`True`	Whether to truncate inputs that exceed `max_length`.
max_length	int	`2048`	Maximum tokenized input length.
padding	bool	`True`	Whether to pad inputs to the same length within a batch.

🎯 OLaLaHighPrecisionMatcher

Parameter	Type	Default	Description
confidence	float	`1.0`	Confidence score assigned to every exact correspondence produced by this matcher. Should remain at `1.0` in most use cases.

🧹 olala_postprocessor

Parameter	Type	Default	Description
alignments	list	—	Raw output of `OLaLaAligner.generate()`.
encoded_ontology	list	—	`[source_items, target_items]` from `OLaLaEncoder`. Used to derive expected ontology hosts for bad-host filtering.
confidence_threshold	float	`0.5`	Correspondences with scores below this value are discarded. The default removes all pairs where the LLM preferred `no`.
strict_bad_hosts	bool	`False`	If `True`, correspondences whose source or target IRI host cannot be determined are also removed. Set to `True` when the ontologies have stable, well-known hosts.

Complete Configuration Example

retriever = OLaLaSBERTRetrieval(
    device="cuda",
    top_k=5,
    both_directions=True,
    topk_per_resource=True,
)

llm_aligner = OLaLaLLMAligner(
    device="cuda",
    max_new_tokens=10,
    temperature=0.0,
    truncation=True,
    max_length=2048,
    padding=True,
    system_prompt_template="[INST] {user_prompt} [/INST]",
    loading_arguments={
        "device_map": "auto",
        "torch_dtype": "torch.float16",
        "load_in_8bit": True,
    },
)

hp_aligner = OLaLaHighPrecisionMatcher(confidence=1.0)

olala = OLaLaAligner(
    retriever=retriever,
    llm_aligner=llm_aligner,
    hp_aligner=hp_aligner,
)

Prompts

OLaLa supports both zero-shot and few-shot prompting strategies. The table below summarises the prompts evaluated in the paper’s ablation study on the anatomy track. Prompt 7 (the default few-shot prompt) achieves the best balance between F-measure and runtime.

ID	Prompt template	Prec	Rec	F1	Time
0 (zero-shot)	`Classify if the following two concepts are the same.\n### First concept:\n{left}\n### Second concept:\n{right}\n### Answer:`	0.853	0.866	0.861	4h 19m
7 (default)	Few-shot with 3 positive + 3 negative examples and task description (see below)	0.914	0.891	0.902	2h 41m

The default prompt 7 used by OLaLaLLMDataset is:

Classify if two descriptions refer to the same real world entity (ontology matching).
### Concept one: endocrine pancreas secretion    ### Concept two: Pancreatic Endocrine Secretion ### Answer: yes
### Concept one: urinary bladder urothelium      ### Concept two: Transitional Epithelium         ### Answer: no
### Concept one: trigeminal V nerve ophthalmic division ### Concept two: Ophthalmic Nerve         ### Answer: yes
### Concept one: foot digit 1 phalanx            ### Concept two: Foot Digit 2 Phalanx           ### Answer: no
### Concept one: large intestine                 ### Concept two: Colon                          ### Answer: no
### Concept one: ocular refractive media         ### Concept two: Refractile Media               ### Answer: yes
### Concept one: {left}                          ### Concept two: {right}                        ### Answer:

Note

To use a custom prompt, subclass OLaLaLLMDataset, override the prompt class attribute, and pass your subclass via the dataset_class argument of OLaLaLLMAligner.
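When writing a custom prompt for another domain, it can help to generate the template from your own labelled examples rather than typing it by hand. The sketch below builds a string in the same layout as prompt 7, with the {left} and {right} placeholders left open. The helper `build_prompt` is hypothetical, and the exact format the `prompt` attribute must follow is an assumption, so check it against the OLaLaLLMDataset source before use.

```python
TASK = "Classify if two descriptions refer to the same real world entity (ontology matching)."
LINE = "### Concept one: {left} ### Concept two: {right} ### Answer:"


def build_prompt(examples, task=TASK):
    """Assemble a few-shot template that still contains {left}/{right} placeholders."""
    lines = [task]
    for left, right, is_match in examples:
        answer = "yes" if is_match else "no"
        lines.append(LINE.format(left=left, right=right) + " " + answer)
    lines.append(LINE)  # open slot for the pair to classify
    return "\n".join(lines)


examples = [
    ("large intestine", "Colon", False),
    ("ocular refractive media", "Refractile Media", True),
]
template = build_prompt(examples)
filled = template.format(left="heart", right="Heart")
```

Balancing positive and negative examples, as prompt 7 does with three of each, avoids nudging the model toward one answer.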

Advanced Usage

🔧 Custom System Prompt (Chat Models) Usage: Chat-tuned models such as Llama-2-70b-chat-hf expect a specific conversation template. Pass system_prompt_template to wrap the filled few-shot prompt:

llm_aligner = OLaLaLLMAligner(
    system_prompt_template="[INST] {user_prompt} [/INST]",
    # ... other arguments unchanged
)
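The effect of the wrapper can be previewed without loading a model. The following sketch assumes the template is applied by substituting the {user_prompt} placeholder; `wrap_prompt` is a hypothetical stand-in for whatever the aligner does internally.

```python
def wrap_prompt(user_prompt, system_prompt_template="{user_prompt}"):
    """Insert a filled prompt into the chat wrapper; the default leaves it unchanged."""
    # str.replace avoids surprises if the prompt itself contains braces.
    return system_prompt_template.replace("{user_prompt}", user_prompt)


plain = wrap_prompt("### Concept one: heart ### Concept two: Heart ### Answer:")
chat = wrap_prompt(
    "### Concept one: heart ### Concept two: Heart ### Answer:",
    "[INST] {user_prompt} [/INST]",
)
```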

⚡ Lightweight / CPU Mode Usage: For quick experiments without GPU access, reduce the model size and disable 8-bit loading:

retriever = OLaLaSBERTRetrieval(device="cpu", top_k=3)

llm_aligner = OLaLaLLMAligner(
    device="cpu",
    loading_arguments={"torch_dtype": "torch.float32"},
)

Consider using a smaller model such as meta-llama/Llama-2-7b-hf.

🔬 Components Standalone Usage: Each component can be used independently of OLaLaAligner:

# SBERT retrieval only
retriever = OLaLaSBERTRetrieval(device="cuda", top_k=5)
retriever.load(path="multi-qa-mpnet-base-dot-v1")
candidates = retriever.generate(input_data=encoded_ontology)

# LLM aligner only (accepts SBERT candidates)
source_items, target_items = encoded_ontology
llm_aligner = OLaLaLLMAligner(device="cuda")  # add further arguments as needed
llm_aligner.load(path="upstage/Llama-2-70b-instruct-v2")
llm_predictions = llm_aligner.generate(
    input_data=[source_items, target_items, candidates]
)

# High-precision matcher only
hp_aligner = OLaLaHighPrecisionMatcher(confidence=1.0)
hp_predictions = hp_aligner.generate(input_data=encoded_ontology)

When to use the standalone matcher

Scenario	Recommendation
Fast baseline / sanity check	Run standalone; takes seconds even on large ontologies.
Ontologies with highly consistent labelling conventions	Standalone HighPrecisionMatcher may already achieve acceptable recall.
Pre-filtering before a costly LLM run	Run HP first, remove matched concepts, then feed the remainder to `OLaLaAligner`.
Full production alignment	Use `OLaLaAligner` — HP is automatically included and its results are merged with LLM predictions at confidence 1.0.
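For the pre-filtering scenario, the removal step might look like the sketch below. It assumes that each encoded item is a dictionary carrying its IRI under an "iri" key and that high-precision predictions use the "source" / "target" keys shown earlier; `remove_matched` is a hypothetical helper.

```python
def remove_matched(source_items, target_items, hp_predictions):
    """Drop concepts that the high-precision matcher has already paired."""
    matched_sources = {p["source"] for p in hp_predictions}
    matched_targets = {p["target"] for p in hp_predictions}
    remaining_source = [c for c in source_items if c["iri"] not in matched_sources]
    remaining_target = [c for c in target_items if c["iri"] not in matched_targets]
    return [remaining_source, remaining_target]


source_items = [{"iri": "http://example.org/A"}, {"iri": "http://example.org/C"}]
target_items = [{"iri": "http://example.org/B"}, {"iri": "http://example.org/D"}]
hp_predictions = [
    {"source": "http://example.org/C", "target": "http://example.org/D", "score": 1.0}
]
reduced = remove_matched(source_items, target_items, hp_predictions)
```

The reduced pair of lists has the same [source_items, target_items] shape as the encoder output, so it could be passed on in its place. The high-precision matches would then need to be added back to the final alignment by hand.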

Hint