Aligners

Lightweight Aligners

This module defines several variants of the FuzzySMLightweight class, each implementing a different string similarity estimation method from the RapidFuzz library.

The SimpleFuzzySMLightweight, WeightedFuzzySMLightweight, and TokenSetFuzzySMLightweight classes each override the ratio_estimate method to use different string comparison techniques from RapidFuzz for fuzzy string matching.

Classes:
  • SimpleFuzzySMLightweight: Inherits from FuzzySMLightweight and uses the basic string ratio.

  • WeightedFuzzySMLightweight: Inherits from FuzzySMLightweight and uses weighted string ratio.

  • TokenSetFuzzySMLightweight: Inherits from FuzzySMLightweight and uses token set ratio for fuzzy matching.
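The override pattern described above can be sketched as follows. This is a minimal illustration, not OntoAligner's implementation: FuzzyMatcherSketch, SimpleMatcherSketch, and the best_match helper are hypothetical names, the standard library's difflib.SequenceMatcher stands in for a RapidFuzz scorer, and scores are assumed to lie in [0, 1] so that they can be compared with the threshold directly.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence, Tuple


def simple_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in for a RapidFuzz scorer)."""
    return SequenceMatcher(None, a, b).ratio()


class FuzzyMatcherSketch:
    """Hypothetical base class: subclasses choose the scorer via ratio_estimate()."""

    def __init__(self, fuzzy_sm_threshold: float = 0.5):
        self.fuzzy_sm_threshold = fuzzy_sm_threshold

    def ratio_estimate(self) -> Callable[[str, str], float]:
        raise NotImplementedError

    def best_match(self, source: str, targets: Sequence[str]) -> Optional[Tuple[str, float]]:
        """Return the best-scoring target, or None if it falls below the threshold."""
        scorer = self.ratio_estimate()
        best = max(targets, key=lambda t: scorer(source, t))
        score = scorer(source, best)
        return (best, score) if score >= self.fuzzy_sm_threshold else None


class SimpleMatcherSketch(FuzzyMatcherSketch):
    def ratio_estimate(self) -> Callable[[str, str], float]:
        # Return the scoring function itself, not the result of calling it.
        return simple_ratio
```

Swapping in a different comparison technique then only requires another subclass whose ratio_estimate returns a different callable; the matching loop stays unchanged.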

class ontoaligner.aligner.lightweight.models.SimpleFuzzySMLightweight(fuzzy_sm_threshold: float = 0.5, **kwargs)[source]

Bases: FuzzySMLightweight

A subclass of FuzzySMLightweight that uses the basic string similarity ratio from RapidFuzz.

Initializes the ontology matching model with optional keyword arguments.

Parameters:
  • fuzzy_sm_threshold (float) – Threshold value for fuzzy string matching. Defaults to 0.5.

  • **kwargs – Additional keyword arguments that may be used for model configuration or parameters.

ratio_estimate() → Any

Returns the string matching ratio function from RapidFuzz.

This method overrides the parent method to return the ratio function from RapidFuzz, which is used to calculate the basic fuzzy string matching score.

Returns:

The rapidfuzz.fuzz.ratio function used for basic string similarity.

Return type:

Any

class ontoaligner.aligner.lightweight.models.TokenSetFuzzySMLightweight(fuzzy_sm_threshold: float = 0.5, **kwargs)[source]

Bases: FuzzySMLightweight

A subclass of FuzzySMLightweight that uses the token set ratio for string similarity from RapidFuzz.

Initializes the ontology matching model with optional keyword arguments.

Parameters:
  • fuzzy_sm_threshold (float) – Threshold value for fuzzy string matching. Defaults to 0.5.

  • **kwargs – Additional keyword arguments that may be used for model configuration or parameters.

ratio_estimate() → Any

Returns the token set string matching ratio function from RapidFuzz.

This method overrides the parent method to return the token_set_ratio function from RapidFuzz, which calculates similarity by comparing sets of tokens rather than the full string.

Returns:

The rapidfuzz.fuzz.token_set_ratio function used for token set similarity.

Return type:

Any
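To illustrate why token-based comparison behaves differently from whole-string comparison, here is a simplified sketch. It is not RapidFuzz's token_set_ratio algorithm, which is more elaborate; the hypothetical token_set_similarity function below only deduplicates and sorts whitespace-separated tokens before comparing them, using difflib from the standard library.

```python
from difflib import SequenceMatcher


def normalized_tokens(text: str) -> str:
    """Lowercase, split on whitespace, drop duplicates, and sort the tokens."""
    return " ".join(sorted(set(text.lower().split())))


def token_set_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] that ignores token order and repeated tokens."""
    return SequenceMatcher(None, normalized_tokens(a), normalized_tokens(b)).ratio()


# The same words in a different order: whole-string comparison penalizes
# the reordering, while the token-based comparison does not.
plain = SequenceMatcher(None, "valve of heart", "heart of valve").ratio()
token_based = token_set_similarity("valve of heart", "heart of valve")
```

Ontology labels often differ only in word order, which is the case where a token-based scorer is most useful.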

class ontoaligner.aligner.lightweight.models.WeightedFuzzySMLightweight(fuzzy_sm_threshold: float = 0.5, **kwargs)[source]

Bases: FuzzySMLightweight

A subclass of FuzzySMLightweight that uses a weighted string similarity ratio from RapidFuzz.

Initializes the ontology matching model with optional keyword arguments.

Parameters:
  • fuzzy_sm_threshold (float) – Threshold value for fuzzy string matching. Defaults to 0.5.

  • **kwargs – Additional keyword arguments that may be used for model configuration or parameters.

ratio_estimate() → Any

Returns the weighted string matching ratio function from RapidFuzz.

This method overrides the parent method to return the WRatio function from RapidFuzz, which calculates a weighted fuzzy matching score between two strings.

Returns:

The rapidfuzz.fuzz.WRatio function used for weighted string similarity.

Return type:

Any

Retrieval Aligners

This module defines retrieval models for information retrieval tasks. It includes traditional methods (TF-IDF and BM25) as well as more modern approaches based on bi-encoder architectures and pre-trained models. Each model computes similarity scores between a query and a set of candidate documents.

Classes:
  • SBERTRetrieval: A retrieval class extending BiEncoderRetrieval using BERT-based encoding.

  • FlanT5Retrieval: A retrieval class extending BiEncoderRetrieval using Flan-T5 model encoding.

  • TFIDFRetrieval: A retrieval class using TF-IDF vectorization for document similarity estimation.

  • BM25Retrieval: A retrieval class using BM25 (Okapi BM25) model for document similarity estimation.

  • SVMBERTRetrieval: A retrieval class extending MLRetrieval using SVM-based BERT retrieval.

  • AdaRetrieval: A retrieval class using embeddings loaded from pre-trained OpenAI models.

class ontoaligner.aligner.retrieval.models.AdaRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)[source]

Bases: BiEncoderRetrieval

AdaRetrieval is a subclass of BiEncoderRetrieval that uses pre-trained embeddings from OpenAI. It is designed to load embeddings from files, fit them, and transform input data into corresponding embeddings.

Initializes the Retrieval model.

Parameters:

**kwargs – Additional keyword arguments passed to the superclass constructor.

fit(inputs: Any) → Any

Fits the model by transforming the input data into corresponding embeddings.

Parameters:

inputs (Any) – The input data to fit the model on.

Returns:

Transformed embeddings based on the input data.

Return type:

Any

load(path: str)

Loads the pre-trained OpenAI embeddings and label-to-index mappings from files.

Parameters:

path (str) – The directory path where the embeddings and labels are stored.

transform(inputs: Any) → Any

Transforms input data into embeddings based on pre-trained OpenAI model.

Parameters:

inputs (Any) – The input data (strings) to transform into embeddings.

Returns:

An array of embeddings for the input data.

Return type:

np.array
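Looking up precomputed embeddings by label could work roughly as follows. This is a sketch under stated assumptions: the in-memory label2index dictionary and embeddings list are hypothetical stand-ins for whatever load reads from disk, and plain Python lists replace NumPy arrays to keep the example dependency-free.

```python
from typing import Dict, List, Sequence

# Hypothetical precomputed data; in practice this would be read from files.
label2index: Dict[str, int] = {"heart": 0, "kidney": 1}
embeddings: List[List[float]] = [[0.1, 0.9], [0.8, 0.2]]


def transform(inputs: Sequence[str]) -> List[List[float]]:
    """Map each input label to its stored embedding row."""
    return [embeddings[label2index[label]] for label in inputs]
```

Because the embeddings are precomputed, no model call is needed at transform time; a label that is missing from the mapping raises a KeyError in this sketch.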

class ontoaligner.aligner.retrieval.models.BM25Retrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)[source]

Bases: Retrieval

BM25Retrieval implements the BM25 retrieval model (Okapi BM25), a probabilistic information retrieval method. It estimates document relevance from term frequency and inverse document frequency. For an introduction, see http://ethen8181.github.io/machine-learning/search/bm25_intro.html

Initializes the Retrieval model.

Parameters:

**kwargs – Additional keyword arguments passed to the superclass constructor.

estimate_similarity(query_embed: Any, candidate_embeds: Any) → Any

Estimates similarity scores between the query and candidate documents using BM25.

Parameters:
  • query_embed (Any) – The query embedding or tokens.

  • candidate_embeds (Any) – The candidate document embeddings or tokens.

Returns:

BM25 similarity scores between the query and candidate documents.

Return type:

Any
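For readers unfamiliar with BM25, the scoring can be sketched in pure Python as below. This is an illustrative version of the standard Okapi BM25 formula, not the code this class runs: the bm25_scores function name, the k1 and b defaults (commonly used values), and the particular smoothed IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


corpus = [["heart", "valve"], ["kidney", "stone"], ["heart", "muscle", "tissue"]]
scores = bm25_scores(["heart", "valve"], corpus)
```

In this toy corpus the first document matches both query terms and ranks highest, the third matches only one term, and the second matches none and scores zero.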

fit(inputs: Any) → Any

Tokenizes the input documents and fits the BM25 model.

Parameters:

inputs (Any) – The input data (documents) to fit the model on.

Returns:

None

load(path: str | None = None)

Loads the BM25 model. In this implementation, no additional loading is needed.

Parameters:

path (str, optional) – Path to load model from (default is None).

transform(inputs: Any) → Any

Tokenizes the input data.

Parameters:

inputs (Any) – The input data to tokenize.

Returns:

Tokenized input data.

Return type:

Any

class ontoaligner.aligner.retrieval.models.SBERTRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)[source]

Bases: BiEncoderRetrieval

SBERTRetrieval is a subclass of BiEncoderRetrieval that uses a BERT-based encoder for retrieval tasks. This class implements a method for returning the string representation of the retrieval model, appending the specific model’s name.

Initializes the Retrieval model.

Parameters:

**kwargs – Additional keyword arguments passed to the superclass constructor.

class ontoaligner.aligner.retrieval.models.SVMBERTRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)[source]

Bases: MLRetrieval

SVMBERTRetrieval is a subclass of MLRetrieval that uses a Support Vector Machine (SVM) combined with BERT-based embeddings for retrieval tasks.

Initializes the Retrieval model.

Parameters:

**kwargs – Additional keyword arguments passed to the superclass constructor.

class ontoaligner.aligner.retrieval.models.TFIDFRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)[source]

Bases: Retrieval

TFIDFRetrieval implements the TF-IDF vectorization method for document retrieval. It allows for fitting a TF-IDF model to input data, transforming input data into feature vectors, and estimating the similarity between query and candidate documents using cosine similarity.

Initializes the Retrieval model.

Parameters:

**kwargs – Additional keyword arguments passed to the superclass constructor.

estimate_similarity(query_embed: Any, candidate_embeds: Any) → Any

Estimates the cosine similarity between the query and candidate embeddings.

Parameters:
  • query_embed (Any) – The query embedding.

  • candidate_embeds (Any) – The candidate embeddings.

Returns:

Cosine similarity scores between the query and candidate embeddings.

Return type:

Any
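The fit, transform, and cosine-similarity steps can be illustrated with a small dependency-free sketch. It is not the vectorizer this class uses internally: the function names fit_idf, tfidf_vector, and cosine are hypothetical, inputs are assumed to be pre-tokenized, and the smoothed IDF formula is one common choice.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def fit_idf(docs: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(doc: Sequence[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen during fitting are ignored."""
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [["heart", "valve"], ["kidney", "stone"]]
idf = fit_idf(docs)
candidates = [tfidf_vector(d, idf) for d in docs]
query = tfidf_vector(["heart", "valve"], idf)
similarities = [cosine(query, c) for c in candidates]
```

A query identical to a candidate yields a similarity of 1, and a query sharing no terms with a candidate yields 0.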

fit(inputs: Any) → Any

Fits the TF-IDF model on the input data and transforms it into feature vectors.

Parameters:

inputs (Any) – The input data to fit the model on.

Returns:

Transformed feature vectors based on the input data.

Return type:

Any

load(path: str | None = None)

Loads the TF-IDF vectorizer model.

Parameters:

path (str, optional) – The path to load the model from (default is None).

transform(inputs: Any) → Any

Transforms the input data into TF-IDF feature vectors.

Parameters:

inputs (Any) – The input data to transform.

Returns:

Transformed TF-IDF feature vectors.

Return type:

Any

LLM Aligners

This module defines subclasses for different types of language models (LMs), including encoder-decoder models, decoder-only models, and models interfacing with OpenAI’s GPT. Each class inherits from an abstract base class for its LLM architecture and binds it to a specific model and tokenizer.

class ontoaligner.aligner.llm.models.AutoModelDecoderLLM(**kwargs)[source]

Bases: DecoderLLMArch

A subclass of DecoderLLMArch for decoder-only (causal) language models loaded through AutoModelForCausalLM.

Initializes DecoderLLMArch with specific LLM lists for special tokenization and Hugging Face token requirements.

model

alias of AutoModelForCausalLM

tokenizer

alias of AutoTokenizer

class ontoaligner.aligner.llm.models.FlanT5LEncoderDecoderLM(**kwargs)[source]

Bases: EncoderDecoderLLMArch

A subclass of EncoderDecoderLLMArch for the Flan-T5 encoder-decoder language model.

Initializes the ontology matching model with optional keyword arguments.

Parameters:

**kwargs – Additional keyword arguments that may be used for model configuration or parameters.

model

alias of T5ForConditionalGeneration

tokenizer

alias of Placeholder

class ontoaligner.aligner.llm.models.GPTOpenAILLM(**kwargs)[source]

Bases: OpenAILLMArch

A subclass of OpenAILLMArch specifically for interacting with OpenAI’s GPT models.

Initializes the ontology matching model with optional keyword arguments.

Parameters:

**kwargs – Additional keyword arguments that may be used for model configuration or parameters.

RAG Aligners

This module defines a series of Retrieval-Augmented Generation (RAG) classes that combine different retrieval models with different large language models (LLMs). Each class pairs a specific retrieval model (e.g., AdaRetrieval, SBERTRetrieval) with a specific language model (e.g., AutoModelDecoderRAGLLM, OpenAIRAGLLM).
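The pairing is expressed through the Retrieval and LLM class attributes listed for each class below. A minimal sketch of that pattern follows; every class here (RetrieverStub, LLMStub, OtherLLMStub, RAGSketch, OtherLLMRetrieverRAG) is a hypothetical stand-in, and the constructor is simplified relative to the signatures documented below.

```python
from typing import Any, Dict, Optional


class RetrieverStub:
    """Hypothetical retriever placeholder."""

    def __init__(self, **config: Any):
        self.config = config


class LLMStub:
    """Hypothetical language-model placeholder."""

    def __init__(self, **config: Any):
        self.config = config


class OtherLLMStub(LLMStub):
    """A second hypothetical LLM variant."""


class RAGSketch:
    """Base class: subclasses declare which components to pair."""

    Retrieval: type = RetrieverStub
    LLM: type = LLMStub

    def __init__(self, retriever_config: Optional[Dict[str, Any]] = None,
                 llm_config: Optional[Dict[str, Any]] = None):
        # Instantiate whichever classes the subclass declared.
        self.retriever = self.Retrieval(**(retriever_config or {}))
        self.llm = self.LLM(**(llm_config or {}))


class OtherLLMRetrieverRAG(RAGSketch):
    # The subclass body only rebinds a class attribute.
    LLM = OtherLLMStub


rag = OtherLLMRetrieverRAG(retriever_config={"top_k": 5}, llm_config={"device": "cpu"})
```

With this design, adding a new retriever and LLM combination takes a few lines, and all shared logic stays in the base class.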

class ontoaligner.aligner.rag.models.FalconLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

FalconLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.FalconLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

FalconLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.rag.models.GPTOpenAILLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

GPTOpenAILLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the OpenAIRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of OpenAIRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.GPTOpenAILLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

GPTOpenAILLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the OpenAIRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of OpenAIRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.rag.models.LLaMALLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

LLaMALLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.LLaMALLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

LLaMALLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.rag.models.MPTLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

MPTLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.MPTLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

MPTLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.rag.models.MambaLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

MambaLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the MambaSSMRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of MambaSSMRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.MambaLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

MambaLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the MambaSSMRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of MambaSSMRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.rag.models.MistralLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

MistralLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.MistralLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

MistralLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.rag.models.VicunaLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

VicunaLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.rag.models.VicunaLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: RAG

VicunaLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of SBERTRetrieval

FewShot-RAG Aligners

This script defines a collection of classes that extend the FewShotRAG model, each combining a specific retrieval model and language model (LLM) configuration. These specialized configurations are tailored for various retrieval and generation tasks using different retrieval backends (Ada and BERT) and LLMs (OpenAI, AutoModelDecoderRAG, MambaSSM, etc.). Each class also overrides the string representation to identify the model configuration.

class ontoaligner.aligner.fewshot.models.FalconLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of AdaRetrieval
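One way the positive_ratio and n_shots parameters could interact is sketched below. How the library actually selects its few-shot examples is not specified here, so sample_few_shots, its seed argument, and the pair representation are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def sample_few_shots(positives: Sequence[Pair], negatives: Sequence[Pair],
                     n_shots: int = 10, positive_ratio: float = 0.7,
                     seed: int = 0) -> List[Tuple[Pair, bool]]:
    """Draw n_shots labeled examples, about positive_ratio of them positive."""
    rng = random.Random(seed)
    n_pos = min(round(n_shots * positive_ratio), len(positives))
    n_neg = min(n_shots - n_pos, len(negatives))
    shots = [(p, True) for p in rng.sample(list(positives), n_pos)]
    shots += [(n, False) for n in rng.sample(list(negatives), n_neg)]
    # Shuffle so that positives and negatives are interleaved in the prompt.
    rng.shuffle(shots)
    return shots


positives = [(f"src{i}", f"tgt{i}") for i in range(20)]
negatives = [(f"src{i}", f"tgt{i + 1}") for i in range(20)]
shots = sample_few_shots(positives, negatives)
```

With the documented defaults (n_shots = 10, positive_ratio = 0.7), this sketch yields 7 positive and 3 negative examples.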

class ontoaligner.aligner.fewshot.models.FalconLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.fewshot.models.GPTOpenAILLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and OpenAIRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of OpenAIRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.fewshot.models.GPTOpenAILLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and OpenAIRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of OpenAIRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.fewshot.models.LLaMALLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.fewshot.models.LLaMALLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.fewshot.models.MPTLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.fewshot.models.MPTLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLMV2

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.fewshot.models.MambaLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and MambaSSMRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of MambaSSMRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.fewshot.models.MambaLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and MambaSSMRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of MambaSSMRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.fewshot.models.MistralLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.fewshot.models.MistralLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.fewshot.models.VicunaLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.fewshot.models.VicunaLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)[source]

Bases: FewShotRAG

FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLM as the language model (LLM).

Initializes the FewShotRAG class with specified parameters.

Parameters:
  • **kwargs – Arbitrary keyword arguments.

  • positive_ratio (float) – The ratio of positive examples in the few-shot samples.

  • n_shots (int) – Number of few-shot examples to use.

Returns:

None

LLM

alias of AutoModelDecoderRAGLLM

Retrieval

alias of SBERTRetrieval

ICV-RAG Aligners

Script for integrating ICV-based language models with various retrieval mechanisms.

This module defines classes that pair LLMs and retrieval models within ICV-based language modeling architectures. Each class pairs a specific retrieval model (e.g., AdaRetrieval, SBERTRetrieval) with an LLM variant (e.g., AutoModelDecoderICVLLM, AutoModelDecoderICVLLMV2) for ontology matching and retrieval-based NLP tasks.

class ontoaligner.aligner.icv.models.FalconLLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing Falcon-based LLM with AdaRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLMV2

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.icv.models.FalconLLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing a Falcon-based LLM with SBERTRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLMV2

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.icv.models.LLaMALLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing LLaMA-based LLM with AdaRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.icv.models.LLaMALLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing a LLaMA-based LLM with SBERTRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLM

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.icv.models.MPTLLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing MPT-based LLM with AdaRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLMV2

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.icv.models.MPTLLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing an MPT-based LLM with SBERTRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLMV2

Retrieval

alias of SBERTRetrieval

class ontoaligner.aligner.icv.models.VicunaLLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing Vicuna-based LLM with AdaRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLM

Retrieval

alias of AdaRetrieval

class ontoaligner.aligner.icv.models.VicunaLLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)[source]

Bases: ICV

Class for pairing a Vicuna-based LLM with SBERTRetrieval for ICV-based ontology matching.

Initializes the RAG model by loading the retriever and LLM components.

Parameters:

**kwargs – Arbitrary keyword arguments passed to the parent class.

LLM

alias of AutoModelDecoderICVLLM

Retrieval

alias of SBERTRetrieval

KGE Aligners

class ontoaligner.aligner.graph.models.BoxEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'BoxE'
class ontoaligner.aligner.graph.models.CompGCNAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'CompGCN'
class ontoaligner.aligner.graph.models.ComplExAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'ComplEx'
class ontoaligner.aligner.graph.models.ConvEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'ConvE'
class ontoaligner.aligner.graph.models.CrossEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'CrossE'
class ontoaligner.aligner.graph.models.DistMultAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'DistMult'
class ontoaligner.aligner.graph.models.HolEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'HolE'
class ontoaligner.aligner.graph.models.MuREAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'MuRE'
class ontoaligner.aligner.graph.models.QuatEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'QuatE'
quat_conj(q)
quat_mul(q, r)
quat_similarity(source, target)
quat_similarity_normalized(source, target)
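
The quaternion helpers listed above are not documented in this reference. As a rough illustration only, the sketch below shows the standard quaternion conjugate and Hamilton product on plain (w, x, y, z) tuples. The names quat_conj_sketch and quat_mul_sketch, the tuple layout, and the component order are assumptions made for this example; the library's own helpers may operate on tensors and use a different layout.

```python
from typing import Tuple

# Assumed layout for this sketch: (w, x, y, z) with w the real part.
Quaternion = Tuple[float, float, float, float]


def quat_conj_sketch(q: Quaternion) -> Quaternion:
    """Conjugate: keep the real part, negate the three imaginary parts."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def quat_mul_sketch(q: Quaternion, r: Quaternion) -> Quaternion:
    """Hamilton product q * r (quaternion multiplication is not commutative)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


i_unit = (0.0, 1.0, 0.0, 0.0)
j_unit = (0.0, 0.0, 1.0, 0.0)
k_unit = quat_mul_sketch(i_unit, j_unit)  # i * j = k

q = (1.0, 2.0, 3.0, 4.0)
# q times its conjugate is the squared norm on the real axis.
squared_norm = quat_mul_sketch(q, quat_conj_sketch(q))
```

Multiplying a quaternion by its conjugate leaves only the real component, which is why conjugation commonly appears in quaternion-based similarity scores.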
class ontoaligner.aligner.graph.models.RotatEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'RotatE'
class ontoaligner.aligner.graph.models.SEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'SE'
class ontoaligner.aligner.graph.models.SimplEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'SimplE'
class ontoaligner.aligner.graph.models.TransDAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'TransD'
class ontoaligner.aligner.graph.models.TransEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'TransE'
class ontoaligner.aligner.graph.models.TransFAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'TransF'
class ontoaligner.aligner.graph.models.TransHAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'TransH'
class ontoaligner.aligner.graph.models.TransRAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]

Bases: GraphEmbeddingAligner

Initializes the GraphEmbeddingAligner with training configuration.

Parameters:
  • device (str) – Device to run the model on (‘cpu’ or ‘cuda’).

  • embedding_dim (int) – Dimensionality of the entity embeddings.

  • num_epochs (int) – Number of training epochs.

  • train_batch_size (int) – Batch size for training.

  • eval_batch_size (int) – Batch size for evaluation.

  • num_negs_per_pos (int) – Number of negative samples per positive triple.

  • random_seed (int) – Random seed for reproducibility.

model: str = 'TransR'
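
All graph embedding aligners above share the num_negs_per_pos and random_seed parameters. The sketch below illustrates what these two settings typically mean in knowledge graph embedding training: each positive triple is paired with a fixed number of corrupted triples, and a seeded random generator makes the draw repeatable. The function corrupt_tails and its tail-only corruption strategy are assumptions for illustration; how the library samples negatives internally is not described in this reference.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def corrupt_tails(positives: List[Triple], entities: List[str],
                  num_negs_per_pos: int = 5,
                  random_seed: int = 42) -> List[Triple]:
    """Build negatives by replacing the tail of each positive triple.

    Hypothetical helper: candidates that are known positives are rejected,
    and a retry cap avoids looping forever on tiny entity sets.
    """
    rng = random.Random(random_seed)  # local generator keeps runs repeatable
    known = set(positives)
    negatives: List[Triple] = []
    for head, relation, _tail in positives:
        made, attempts = 0, 0
        while made < num_negs_per_pos and attempts < 100 * num_negs_per_pos:
            attempts += 1
            candidate = (head, relation, rng.choice(entities))
            if candidate not in known:
                negatives.append(candidate)
                made += 1
    return negatives


positives = [("a", "r", "b"), ("b", "r", "c")]
entities = ["a", "b", "c", "d"]
negatives = corrupt_tails(positives, entities, num_negs_per_pos=5, random_seed=42)
```

Calling the function twice with the same random_seed returns the same negatives, which is the reproducibility property the parameter description refers to.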

PropMatch Aligner

class ontoaligner.aligner.propmatch.propmatch.PropMatchAligner(fmt: str = 'word2vec', lowercase: bool = False, threshold: float = 0.65, steps: int = 2, sim_weight: List[int] | None = None, start_metrics: List[float] | None = None, device: str = 'cpu', disable_domain_range: bool = False)[source]

Bases: BaseOMModel

Initialize the PropMatchAligner.

Parameters:
  • fmt – Format for word embedding (e.g., “word2vec”)

  • lowercase – Whether to lowercase text

  • threshold – Minimum similarity threshold for matches

  • steps – Number of iterative refinement steps

  • sim_weight – Which similarity components to use [0:domain, 1:label, 2:range]

  • start_metrics – Additional threshold metrics for evaluation

  • device – Device for computation (“cpu” or “cuda”)

  • disable_domain_range – If True, only uses label similarity

__init__(fmt: str = 'word2vec', lowercase: bool = False, threshold: float = 0.65, steps: int = 2, sim_weight: List[int] | None = None, start_metrics: List[float] | None = None, device: str = 'cpu', disable_domain_range: bool = False) None

Initialize the PropMatchAligner.

Parameters:
  • fmt – Format for word embedding (e.g., “word2vec”)

  • lowercase – Whether to lowercase text

  • threshold – Minimum similarity threshold for matches

  • steps – Number of iterative refinement steps

  • sim_weight – Which similarity components to use [0:domain, 1:label, 2:range]

  • start_metrics – Additional threshold metrics for evaluation

  • device – Device for computation (“cpu” or “cuda”)

  • disable_domain_range – If True, only uses label similarity

build_tf_models(source_onto: List[Dict], target_onto: List[Dict]) Tuple

Build the TF-IDF models for soft TF-IDF and general TF-IDF.

Parameters:
  • source_onto – List of source property dictionaries

  • target_onto – List of target property dictionaries

Returns:

Tuple of (soft_metric, general_metric) models

cosine_similarity(vector1: ndarray, vector2: ndarray) float

Compute the cosine similarity between two vectors.

Parameters:
  • vector1 – First vector

  • vector2 – Second vector

Returns:

Cosine similarity score
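
As a minimal sketch of the computation described above, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The version below uses plain Python sequences rather than ndarray to stay dependency-free; the name cosine_similarity_sketch and the choice to return 0.0 for a zero vector are assumptions of this example, not documented behavior of the method.

```python
import math
from typing import Sequence


def cosine_similarity_sketch(vector1: Sequence[float],
                             vector2: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm1 * norm2)


orthogonal = cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity_sketch([1.0, 2.0], [2.0, 4.0])
```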

filter_adjectives(words: List[str]) List[str]

Filter adjectives from a list of words, keeping only nouns.

Parameters:

words – List of words

Returns:

List of words without adjectives (only nouns)

generate(input_data: List[Dict]) List

Generate alignments between source and target ontology properties.

Parameters:
  • source – List of source property dictionaries from encoder

  • target – List of target property dictionaries from encoder

Returns:

List of alignment dictionaries with ‘source’, ‘target’, and ‘score’

get_core_concept(entity: List[str]) List[str]

Get the core concept of an entity. The core concept is the first verb with length > 4 or the first noun with its adjectives.

Parameters:

entity – List of words from property label

Returns:

List of core concept words

get_document_similarity(label_a_items: List[str], label_b_items: List[str], general_metric_model) Tuple[float, float]

Compute the document similarity between two property descriptions.

Parameters:
  • label_a_items – List of words from property A

  • label_b_items – List of words from property B

  • general_metric_model – TF-IDF vectorizer model

Returns:

Tuple of (conf_a, conf_b) similarity scores

load(wordembedding_path: str, sentence_transformer_id: str) None

Loads the pre-trained models for word-embedding and sentence transformer.

Parameters:
  • wordembedding_path (str) – The path to the pre-trained word-embedding.

  • sentence_transformer_id (str) – The path to the pre-trained sentence transformer.

match_property(source: Dict, target: Dict, soft_metric_model, general_metric_model, confidence_map: Dict) float

Match two properties by comparing their labels, domains, and ranges.

Parameters:
  • source – Source property dictionary

  • target – Target property dictionary

  • soft_metric_model – Soft TF-IDF model for label matching

  • general_metric_model – TF-IDF model for domain/range matching

  • confidence_map – Map of previously aligned classes for confidence boosting

Returns:

Similarity confidence score

sentence_transformer_model: Any = None
wordembedding_model: Any = None
class ontoaligner.aligner.propmatch.propmatch.SoftTfIdf(corpus: list[list[str]], sim_func, threshold: float = 0.8)[source]

Bases: object

Soft TF-IDF similarity between two token lists. Uses a token-level sim_func and only counts tokens above threshold.

__init__(corpus: list[list[str]], sim_func, threshold: float = 0.8)
get_raw_score(tokens_a: list[str], tokens_b: list[str]) float
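
To make the description of SoftTfIdf more concrete, the sketch below implements one common formulation of soft TF-IDF: each token in the first list is paired with its most similar token in the second list, and the pair contributes the product of both TF-IDF weights and the token similarity only when that similarity exceeds the threshold. The class SoftTfIdfSketch, the smoothed IDF formula, and the L2 normalisation are assumptions for illustration and may differ from the library's implementation.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


class SoftTfIdfSketch:
    """Illustrative soft TF-IDF; not the library's SoftTfIdf class."""

    def __init__(self, corpus: List[List[str]],
                 sim_func: Callable[[str, str], float],
                 threshold: float = 0.8) -> None:
        self.sim_func = sim_func
        self.threshold = threshold
        self.n_docs = len(corpus)
        # Document frequency: number of token lists containing each token.
        self.doc_freq = Counter(tok for doc in corpus for tok in set(doc))

    def _weights(self, tokens: List[str]) -> Dict[str, float]:
        """L2-normalised TF-IDF weights with a smoothed IDF (an assumption)."""
        tf = Counter(tokens)
        raw = {
            tok: count * (math.log((1 + self.n_docs)
                                   / (1 + self.doc_freq.get(tok, 0))) + 1.0)
            for tok, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {tok: w / norm for tok, w in raw.items()}

    def get_raw_score(self, tokens_a: List[str], tokens_b: List[str]) -> float:
        if not tokens_a or not tokens_b:
            return 0.0
        weights_a = self._weights(tokens_a)
        weights_b = self._weights(tokens_b)
        score = 0.0
        for tok_a, w_a in weights_a.items():
            # Closest token in B according to the token-level similarity.
            tok_b = max(weights_b, key=lambda b: self.sim_func(tok_a, b))
            sim = self.sim_func(tok_a, tok_b)
            if sim > self.threshold:  # only pairs above the threshold count
                score += w_a * weights_b[tok_b] * sim
        return score


def exact_match(a: str, b: str) -> float:
    """Toy token similarity: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0


corpus = [["has", "author"], ["author", "name"], ["has", "title"]]
model = SoftTfIdfSketch(corpus, exact_match)
same = model.get_raw_score(["has", "author"], ["has", "author"])
disjoint = model.get_raw_score(["has", "author"], ["title"])
```

With an exact-match token similarity the score reduces to ordinary TF-IDF cosine similarity; a softer sim_func lets near-identical tokens contribute as well.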

FLORA Aligner

FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.

This module implements the FLORA aligner, an unsupervised knowledge graph alignment system that jointly aligns entities and relations using iterative fuzzy logic inference.

Algorithm Overview:

FLORA iteratively:

  1. Bootstraps entity alignments from literal similarity (strings, dates, numbers)

  2. Infers predicate subsumptions from aligned entity triples

  3. Uses fuzzy logic rules to align additional entities based on predicate evidence

  4. Repeats until convergence

Key Features:

  • Unsupervised: No training data required (optional seed alignments supported)

  • Holistic: Jointly aligns entities and relations iteratively

  • Interpretable: All scores grounded in fuzzy logic rules

  • Convergent: Monotone property ensures convergence

  • Robust: Handles dangling entities and incomplete mappings

References:

Peng, Yiwen, Bonald, Thomas, and Suchanek, Fabian. “FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.” International Semantic Web Conference (ISWC), 2025. https://suchanek.name/work/publications/iswc-2025.pdf

class ontoaligner.aligner.flora.flora.FLORAAligner(alpha: float = 2.0, init_threshold: float = 0.7, gramN: int = 100, epsilon: float = 0.01, max_iterations: int = 100, string_identity: bool = False, relinit: float = 0.1, ngrams: List[int] | None = None, model_id: str | None = None, emb_path: str | None = None, training_data: str | None = None, device: str | None = None, batch_size: int | None = 32, verbose: bool = False, workers: int | None = 4, **kwargs)[source]

Bases: BaseOMModel

FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.

A fully unsupervised system for aligning two knowledge graphs by jointly matching entities and relations through iterative fuzzy logic inference.

Pipeline Overview:

  1. Initialization – Load KGs and optional seed alignments

  2. Literal bootstrapping – Align string/date/numeric literals using embeddings

  3. First iteration – Infer predicate subsumptions from aligned literals

  4. Main loop – Iteratively align entities using fuzzy rules and update predicates

  5. Convergence – Stop when alignment scores stabilize
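
The main loop and convergence steps of the pipeline can be sketched as a fixed-point iteration controlled by the epsilon and max_iterations parameters documented for this class. The sketch assumes that the convergence test compares the sum of all alignment scores between consecutive iterations; iterate_until_stable, total_score, and the toy update rule are hypothetical names introduced for this example and stand in for the fuzzy-logic update.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, Dict[str, float]]


def total_score(scores: Scores) -> float:
    """Sum of every alignment score (assumed meaning of the Σ in the stop rule)."""
    return sum(v for targets in scores.values() for v in targets.values())


def iterate_until_stable(scores: Scores,
                         update_step: Callable[[Scores], Scores],
                         epsilon: float = 0.01,
                         max_iterations: int = 100) -> Tuple[Scores, int]:
    """Apply update_step until the total score changes by less than epsilon."""
    for iteration in range(1, max_iterations + 1):
        new_scores = update_step(scores)
        delta = abs(total_score(new_scores) - total_score(scores))
        scores = new_scores
        if delta < epsilon:
            return scores, iteration
    return scores, max_iterations  # forced termination


def toy_update(scores: Scores) -> Scores:
    """Stand-in update: move each score halfway towards 1.0 (monotone)."""
    return {s: {t: v + 0.5 * (1.0 - v) for t, v in targets.items()}
            for s, targets in scores.items()}


final_scores, n_iterations = iterate_until_stable({"a": {"x": 0.5}}, toy_update)
```

Because the toy update only ever increases scores and is bounded by 1.0, the change per iteration shrinks and the loop stops well before max_iterations.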

Parameters:

alpha (float):

Benefit-of-doubt factor for subrelation inference (higher = more lenient). Default: 2.0

init_threshold (float):

Minimum semantic similarity for bootstrapping literal alignment. Default: 0.7

gramN (int):

Maximum number of evidential triples per entity during alignment. Default: 100

epsilon (float):

Convergence threshold; stops when |Σ_new - Σ_old| < epsilon. Default: 0.01

max_iterations (int):

Maximum number of main-loop iterations. Default: 100

string_identity (bool):

If True, use only exact string matching for literals (no embeddings). Default: False

relinit (float):

Initial score for non-identical predicates. Default: 0.1

ngrams (List[int]):

N-gram sizes for functionality computation. Default: [1, 2]

model_id (str or None):

Hugging Face model ID for embedding model (e.g., ‘Lihuchen/pearl_small’). Default: None

training_data (str or None):

Path to seed alignment file (tab-separated, optional score column). Default: None

device (str or None):

Device for embeddings (‘cuda’ or ‘cpu’). Auto-detects if None. Default: None

batch_size (int or None):

Batch size for embedding computation. Default: 32

Example:

>>> from ontoaligner.aligner.flora import FLORAAligner
>>> aligner = FLORAAligner(alpha=3.0, init_threshold=0.7)
>>> matchings = aligner.generate(["kg1.ttl", "kg2.ttl"])
>>> for match in matchings[:3]:
...     print(f"{match['source']} -> {match['target']}: {match['score']:.2f}")

References:

Peng, Y., Bonald, T., & Suchanek, F. (2025). FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic. In Proc. ISWC 2025.

Initialize the FLORA aligner.

Parameters:
  • alpha – Benefit-of-doubt parameter for subrelation mapping.

  • init_threshold – Initial similarity threshold for literal bootstrapping.

  • gramN – Maximum evidences per entity in alignment rules.

  • epsilon – Convergence threshold for score changes.

  • max_iterations – Maximum iterations before forced termination.

  • string_identity – Use exact string matching only (no embeddings).

  • relinit – Initial score for non-identical predicates.

  • ngrams – N-gram sizes for functionality computation.

  • model_id – Transformer model for literal embeddings.

  • emb_path – Optional path to pretrained embeddings.

  • training_data – Path to seed alignment file.

  • device – Device for tensor operations.

  • batch_size – Batch size for embedding computations.

  • **kwargs – Additional arguments passed to BaseOMModel.

__init__(alpha: float = 2.0, init_threshold: float = 0.7, gramN: int = 100, epsilon: float = 0.01, max_iterations: int = 100, string_identity: bool = False, relinit: float = 0.1, ngrams: List[int] | None = None, model_id: str | None = None, emb_path: str | None = None, training_data: str | None = None, device: str | None = None, batch_size: int | None = 32, verbose: bool = False, workers: int | None = 4, **kwargs) None

Initialize the FLORA aligner.

Parameters:
  • alpha – Benefit-of-doubt parameter for subrelation mapping.

  • init_threshold – Initial similarity threshold for literal bootstrapping.

  • gramN – Maximum evidences per entity in alignment rules.

  • epsilon – Convergence threshold for score changes.

  • max_iterations – Maximum iterations before forced termination.

  • string_identity – Use exact string matching only (no embeddings).

  • relinit – Initial score for non-identical predicates.

  • ngrams – N-gram sizes for functionality computation.

  • model_id – Transformer model for literal embeddings.

  • emb_path – Optional path to pretrained embeddings.

  • training_data – Path to seed alignment file.

  • device – Device for tensor operations.

  • batch_size – Batch size for embedding computations.

  • **kwargs – Additional arguments passed to BaseOMModel.

bootstraping(kb1: Any, kb2: Any, same_as_scores: Dict[Any, Dict[Any, float]], predicate2super_predicate: Dict[Any, Dict[Any, float]], functionalities: Dict[Any, float], num_workers: int) Tuple[Dict[Any, Dict[Any, float]], Dict[Any, Dict[Any, float]], Dict[Any, Dict[Any, float]], Dict[Any, Dict[Any, float]]]

Perform the bootstrapping phase of entity and predicate alignment.

Runs the first iteration in parallel to align entities based on literal similarity, then infers predicate subsumptions from the aligned entity triples.

Parameters:
  • kb1 – First knowledge base.

  • kb2 – Second knowledge base.

  • same_as_scores – Initial entity alignment scores (from literal bootstrapping).

  • predicate2super_predicate – Initial predicate subsumption scores.

  • functionalities – Predicate functionality scores.

  • num_workers – Number of parallel worker processes.

Returns:

Tuple of (quasi_eqrel, predicate2super_predicate, same_as_scores, ent_max_assign):

  • quasi_eqrel: Predicate quasi-equivalence relations

  • predicate2super_predicate: Updated predicate subsumption scores

  • same_as_scores: Updated entity alignments

  • ent_max_assign: Bilateral max assignments for entities

functionalities(kb1: Any, kb2: Any) Dict[Any, float]

Compute predicate functionality scores across two knowledge bases.

Functionality is the inverse of “diversity”: a functional predicate (like birthDate) has one value per subject, while a non-functional predicate (like knows) can have many values per subject.

Computes functionality for each predicate using n-gram analysis and returns the minimum score across both KGs (conservative estimate).

Parameters:
  • kb1 – First knowledge base.

  • kb2 – Second knowledge base.

Returns:

Dictionary mapping predicates to functionality scores in [0, 1].
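
The idea of functionality can be illustrated on plain (subject, predicate, object) tuples. The sketch assumes a simple definition, the number of distinct subjects divided by the number of distinct subject-object pairs per predicate, and then takes the minimum across both knowledge bases as described above. It omits the n-gram analysis entirely, and functionality_sketch and combined_functionality are hypothetical names, not library functions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def functionality_sketch(triples: Iterable[Triple]) -> Dict[str, float]:
    """Distinct subjects divided by distinct (subject, object) pairs."""
    subjects: Dict[str, Set[str]] = defaultdict(set)
    pairs: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for s, p, o in triples:
        subjects[p].add(s)
        pairs[p].add((s, o))
    return {p: len(subjects[p]) / len(pairs[p]) for p in pairs}


def combined_functionality(kb1: Iterable[Triple],
                           kb2: Iterable[Triple]) -> Dict[str, float]:
    """Conservative estimate: minimum score over the KBs that use the predicate."""
    f1, f2 = functionality_sketch(kb1), functionality_sketch(kb2)
    return {p: min(f[p] for f in (f1, f2) if p in f) for p in set(f1) | set(f2)}


kb1 = [("alice", "birthDate", "1990"), ("bob", "birthDate", "1985"),
       ("alice", "knows", "bob"), ("alice", "knows", "carol")]
kb2 = [("alice", "knows", "bob"), ("alice", "knows", "carol"),
       ("alice", "knows", "dave"), ("bob", "birthDate", "1985")]
scores = combined_functionality(kb1, kb2)
```

In this toy data birthDate has exactly one value per subject and scores 1.0, while knows has several values per subject and receives the lower of its two per-KB scores.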

generate(input_data: List[Any]) List[Dict[str, Any]]

Run the complete FLORA alignment algorithm on two knowledge graphs.

This is the main entry point implementing the full FLORA pipeline:

  1. Load optional seed alignments

  2. Initialize predicate subsumption scores

  3. Compute predicate functionalities

  4. Bootstrap entity alignment using literal similarity

  5. Run main iterative alignment loop

  6. Return entity alignment predictions

Input Format:

input_data should be a two-element list of Graph objects, as returned by FLORAEncoder.

Standard Usage:

>>> from ontoaligner.ontology import FLORAOMDataset
>>> from ontoaligner.encoder import FLORAEncoder
>>> from ontoaligner.aligner.flora import FLORAAligner
>>>
>>> # Parse KGs
>>> dataset = FLORAOMDataset().collect("kg1.ttl", "kg2.ttl")
>>>
>>> # Encode for aligner
>>> encoder_output = FLORAEncoder()(
...     source=dataset["source"],
...     target=dataset["target"]
... )
>>>
>>> # Align
>>> aligner = FLORAAligner()
>>> matchings = aligner.generate(input_data=encoder_output)

Parameters:

input_data – List of two Graph objects [kg1, kg2].

Returns:

List of entity alignment predictions. Each prediction is a dictionary with the keys:

  • source (str): IRI of the source KG entity

  • target (str): IRI of the target KG entity

  • score (float): Alignment confidence in [0.0, 1.0]

  • type (str): 'instance' or 'predicate'

Raises:

ValueError – If input_data does not contain exactly 2 elements.
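
Since each prediction is a dictionary with source, target, score, and type keys, the returned list can be post-processed with ordinary Python. The sketch below filters by type and minimum score and sorts by confidence. The helper filter_matchings, the 0.5 cut-off, and the hand-written sample data are assumptions for illustration, not part of the library or real aligner output.

```python
from typing import Any, Dict, List


def filter_matchings(matchings: List[Dict[str, Any]],
                     min_score: float = 0.5,
                     match_type: str = "instance") -> List[Dict[str, Any]]:
    """Keep predictions of one type at or above min_score, best first."""
    kept = [m for m in matchings
            if m["type"] == match_type and m["score"] >= min_score]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


# Hand-written sample in the documented output shape (not real aligner output).
sample = [
    {"source": "http://example.org/Alice", "target": "http://example.org/A",
     "score": 0.62, "type": "instance"},
    {"source": "http://example.org/Bob", "target": "http://example.org/B",
     "score": 0.91, "type": "instance"},
    {"source": "http://example.org/knows", "target": "http://example.org/friendOf",
     "score": 0.80, "type": "predicate"},
    {"source": "http://example.org/Carol", "target": "http://example.org/C",
     "score": 0.30, "type": "instance"},
]

instances = filter_matchings(sample, min_score=0.5, match_type="instance")
predicates = filter_matchings(sample, min_score=0.5, match_type="predicate")
```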

get_predicate2super_predicate() Dict[Any, Dict[Any, float]]

Get the computed predicate subsumption scores.

Returns:

Dictionary mapping predicates to their subsumption relationships and scores.

get_same_as_scores() Dict[Any, Dict[Any, float]]

Get the computed entity alignment scores.

Returns:

Dictionary mapping source entities to target entities and their scores.

seed_alignments(training_data_path: str | None) Dict[str, Dict[str, float]]

Load optional seed alignments from a file.

Expected file format: tab-separated with entity1, entity2, and optional score. Example:

<http://example.org/Alice>  <http://example.org/A>  0.95
<http://example.org/Bob>    <http://example.org/B>

Parameters:

training_data_path – Path to the seed alignment file.

Returns:

Dictionary mapping entity pairs to alignment scores.

Raises:

FileNotFoundError – If training_data_path is specified but doesn’t exist.
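
A parser for the seed file format described above could look roughly like the following sketch. It works on an iterable of lines so that it can be tried without a file on disk. The function name parse_seed_lines, the default score of 1.0 for rows without a score column, keeping the angle brackets in the keys, and raising ValueError on malformed rows are all assumptions; the library's actual handling of these cases is not documented here.

```python
from typing import Dict, Iterable


def parse_seed_lines(lines: Iterable[str],
                     default_score: float = 1.0) -> Dict[str, Dict[str, float]]:
    """Parse tab-separated rows: entity1, entity2, optional score."""
    seeds: Dict[str, Dict[str, float]] = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) < 2:
            raise ValueError(f"line {line_number}: expected at least two columns")
        source, target = parts[0], parts[1]
        # Assumption: a missing score column means full confidence.
        score = float(parts[2]) if len(parts) > 2 else default_score
        seeds.setdefault(source, {})[target] = score
    return seeds


rows = [
    "<http://example.org/Alice>\t<http://example.org/A>\t0.95",
    "<http://example.org/Bob>\t<http://example.org/B>",
    "",
]
seeds = parse_seed_lines(rows)
```

To read from disk, the same function can be passed an open file object, since iterating over a text file yields its lines.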

class ontoaligner.aligner.flora.flora.FLORARDFWriter(prefixes: Dict[str, str])[source]

Bases: object

Writes knowledge graph alignments to RDF/Turtle format.

Converts alignment results (entity and predicate mappings with scores) to RDF triples with namespace declarations.

Initialize RDF writer with namespace prefixes.

Parameters:

prefixes – Dictionary mapping prefix names to namespace URIs. Example: {‘ex’: ‘http://example.org/’, ‘owl’: ‘http://www.w3.org/2002/07/owl#’}

__init__(prefixes: Dict[str, str]) None

Initialize RDF writer with namespace prefixes.

Parameters:

prefixes – Dictionary mapping prefix names to namespace URIs. Example: {‘ex’: ‘http://example.org/’, ‘owl’: ‘http://www.w3.org/2002/07/owl#’}

write(output_path: str, kb1: Any, kb2: Any, predicate2super_predicate: Dict[Any, Dict[Any, float]], same_as_scores: Dict[Any, Dict[Any, float]]) None

Write alignment results to RDF file.

Writes:

  • Namespace prefixes

  • Predicate subsumption (rdfs:subPropertyOf) relationships

  • Entity equivalence (owl:sameAs) mappings with confidence scores

Parameters:
  • output_path – File path for the output RDF/Turtle file.

  • kb1 – First knowledge base (used to filter predicates).

  • kb2 – Second knowledge base (used to filter predicates).

  • predicate2super_predicate – Predicate alignment scores.

  • same_as_scores – Entity alignment scores.

Core alignment algorithms for the FLORA (Fuzzy Logic KG Alignment) system.

This module implements the fuzzy-logic inference rules and iterative procedures that form the heart of the FLORA algorithm as described in:

Peng, Yiwen, Bonald, Thomas, and Suchanek, Fabian. “FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.” ISWC 2025. https://suchanek.name/work/publications/iswc-2025.pdf

ontoaligner.aligner.flora.fuzzy.bilateral_max_assign(same_as_score: Dict[Any, Dict[Any, float]]) Dict[Any, Dict[Any, float]][source]

Compute bilateral max assignment from similarity scores.

Computes the bilateral max assignment, as described in equation (3) of the FLORA paper.

Parameters:

same_as_score – Nested dictionary of entity alignment scores.

Returns:

The bilateral max assignment of entities.
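
Equation (3) of the FLORA paper is not reproduced in this reference. One plausible reading of a bilateral max assignment is a mutual-best-match filter: a pair is kept only when the target is the highest-scoring candidate for the source and the source is the highest-scoring candidate for the target. The sketch below implements that reading as an assumption; bilateral_max_assign_sketch is a hypothetical name, and its tie-breaking and exact semantics may differ from the library function.

```python
from typing import Dict, Tuple

Scores = Dict[str, Dict[str, float]]


def bilateral_max_assign_sketch(same_as_score: Scores) -> Scores:
    """Keep (source, target) pairs that are each other's best match."""
    # Best target for every source that has at least one candidate.
    best_for_source = {s: max(targets, key=targets.get)
                       for s, targets in same_as_score.items() if targets}
    # Best source for every target, found by scanning all pairs once.
    best_for_target: Dict[str, Tuple[str, float]] = {}
    for s, targets in same_as_score.items():
        for t, score in targets.items():
            if t not in best_for_target or score > best_for_target[t][1]:
                best_for_target[t] = (s, score)
    return {s: {t: same_as_score[s][t]}
            for s, t in best_for_source.items()
            if best_for_target[t][0] == s}


scores = {
    "a": {"x": 0.9, "y": 0.4},
    "b": {"x": 0.8, "y": 0.7},
    "c": {"y": 0.6},
}
assignment = bilateral_max_assign_sketch(scores)
```

In this toy input only the pair (a, x) survives: b prefers x, but x prefers a, and c prefers y, but y prefers b.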

ontoaligner.aligner.flora.fuzzy.bootstrap_algo(kb_src: Any, kb_dst: Any, same_as_score: Dict[Any, Dict[Any, float]], pred2super_pred: Dict[Any, Dict[Any, float]], functionalities: Dict[Any, float], num_workers: int) Dict[Any, Dict[Any, float]][source]

Bootstrap the algorithm using initial literal alignments.

This function runs the first iteration in parallel to initialize entity alignment scores based on the initial literal similarity alignments.

Parameters:
  • kb_src – The source knowledge base.

  • kb_dst – The target knowledge base.

  • same_as_score – Nested dictionary of entity alignment scores (includes initial literal alignments).

  • pred2super_pred – Nested dictionary of pairwise subsumption scores.

  • functionalities – Dictionary mapping predicates to their functionality scores.

  • num_workers – Number of parallel worker processes used for the first iteration.

Returns:

Updated same_as_score dictionary with bootstrapped entity alignments.

ontoaligner.aligner.flora.fuzzy.compute_functionalities(kb: Any, gram: List[int] | None = None) Dict[Any, float][source]

Compute functionality scores for predicates in a knowledge base.

Functionality is measured as the ratio of unique subjects per predicate, considering n-gram combinations for higher-order relationships.

Parameters:
  • kb – The input knowledge base graph object.

  • gram – List of integers indicating n-gram sizes to consider. Defaults to None, which is treated as an empty list.

Returns:

Dictionary mapping predicates to their functionality scores.
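
As a rough illustration of the ratio described above, the following sketch computes, for each predicate, the number of distinct subjects divided by the number of distinct (subject, object) facts. This definition, the function name functionality_scores, and the plain list-of-triples input are assumptions; the n-gram handling controlled by gram is not modelled.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def functionality_scores(triples: Iterable[Triple]) -> Dict[str, float]:
    """Assumed definition: distinct subjects / distinct facts, per predicate."""
    subjects: Dict[str, Set[str]] = defaultdict(set)
    facts: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for subj, pred, obj in triples:
        subjects[pred].add(subj)
        facts[pred].add((subj, obj))
    # A score of 1.0 means every subject has exactly one object for that predicate.
    return {pred: len(subjects[pred]) / len(facts[pred]) for pred in facts}
```

Under this definition a predicate such as a birthplace relation scores close to 1.0, while a many-valued predicate scores lower.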

ontoaligner.aligner.flora.fuzzy.compute_quasi_eqrel(kb_src: Any, kb_dst: Any, pred2super_pred: Dict[Any, Dict[Any, float]]) Dict[Any, Dict[Any, float]][source]

Compute quasi equivalence relations between two KGs’ predicates.

The quasi equivalence is represented as r≅r’ in the FLORA paper.

Parameters:
  • kb_src – The source knowledge base.

  • kb_dst – The target knowledge base.

  • pred2super_pred – Nested dictionary of pairwise subsumption scores.

Returns:

Nested dictionary of quasi equivalence relations.

ontoaligner.aligner.flora.fuzzy.first_iteration(kb_src: Any, kb_dst: Any, pred2super_pred: Dict[Any, Dict[Any, float]], functionalities: Dict[Any, float], queue: Queue, ent_match_tuple_queue: Queue, ent_max_assign: Dict[Any, Dict[Any, float]]) None[source]

First iteration used for bootstrapping the algorithm using initial literal alignments.

This is the main worker function for the bootstrapping phase, run in parallel across multiple processes. It processes entities from the queue and computes initial entity alignment scores based on predicate subsumption and functionality.

Results are put into ent_match_tuple_queue for collection by the parent process.

Parameters:
  • kb_src – The source knowledge base.

  • kb_dst – The target knowledge base.

  • pred2super_pred – Nested dictionary of pairwise subsumption scores.

  • functionalities – Dictionary mapping predicates to their functionality scores.

  • queue – Multiprocessing queue containing entities to be aligned.

  • ent_match_tuple_queue – Queue to store resulting entity alignment scores.

  • ent_max_assign – Bilateral max assignment from initial literal alignments.

ontoaligner.aligner.flora.fuzzy.initialize_predicate_subsumption(predicates1: Set[Any], predicates2: Set[Any], pred2super_pred12: Dict[Any, Dict[Any, float]] | None = None, pred2super_pred21: Dict[Any, Dict[Any, float]] | None = None, relinit: float = 0.1) Dict[Any, Dict[Any, float]][source]

Initialize predicate subsumption scores between two knowledge bases.

Sets identical relations to 1.0, and initializes others with provided scores or a default initial value.

Parameters:
  • predicates1 – Set of predicates in KB1.

  • predicates2 – Set of predicates in KB2.

  • pred2super_pred12 – Optional subsumption scores from KB1 predicates to KB2.

  • pred2super_pred21 – Optional subsumption scores from KB2 predicates to KB1.

  • relinit – Initial score for non-identical relations. Defaults to 0.1.

Returns:

Nested dictionary of pairwise subsumption scores across KGs in both directions.
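
The initialization rule can be sketched as follows. The helper name init_subsumption is hypothetical, and two details are assumptions: that every cross-KB predicate pair receives an entry, and that identical predicates take precedence over provided scores.

```python
from typing import Dict, Optional, Set

Scores = Dict[str, Dict[str, float]]


def init_subsumption(preds1: Set[str], preds2: Set[str],
                     given12: Optional[Scores] = None,
                     given21: Optional[Scores] = None,
                     relinit: float = 0.1) -> Scores:
    """Identical predicates get 1.0; others get a provided score or relinit."""
    given12, given21 = given12 or {}, given21 or {}
    result: Scores = {}
    # Fill both directions: KB1 -> KB2, then KB2 -> KB1.
    for left, right, given in ((preds1, preds2, given12), (preds2, preds1, given21)):
        for pred in left:
            row = result.setdefault(pred, {})
            for other in right:
                if pred == other:
                    row[other] = 1.0
                else:
                    row[other] = given.get(pred, {}).get(other, relinit)
    return result
```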

ontoaligner.aligner.flora.fuzzy.map_subrelations(alpha: float, kb_src: Any, kb_dst: Any, ent_max_assign: Dict[Any, Dict[Any, float]], previous_predicate2super_predicate: Dict[Any, Dict[Any, float]]) Dict[Any, Dict[Any, float]][source]

Map subrelations in both directions using current entity alignments.

Updates predicate subsumption scores based on aligned entity pairs, computing which predicates in one KB correspond to predicates in the other.

Parameters:
  • alpha – Benefit-of-doubt parameter for subrelation mapping.

  • kb_src – The source knowledge base.

  • kb_dst – The target knowledge base.

  • ent_max_assign – Bilateral max assignment from current entity alignments.

  • previous_predicate2super_predicate – Previous subsumption scores to be updated.

Returns:

Updated predicate subsumption scores dictionary.

ontoaligner.aligner.flora.fuzzy.update_max_score_min(mapping: Dict[Any, Tuple[Tuple, float]], pred: Any, fact: Tuple, *body: float) Dict[Any, Tuple[Tuple, float]][source]

Update mapping with the highest-scoring aligned fact for each predicate.

Used in subrelation rules to track the best matching facts. Returns the updated mapping dictionary.

Parameters:
  • mapping – Dictionary to be updated with (fact, score) tuples per predicate.

  • pred – The predicate from KB2.

  • fact – The fact tuple (subject, predicate, object).

  • *body – Values in the body of the rule.

Returns:

Updated mapping dictionary with maximum scoring facts.

ontoaligner.aligner.flora.fuzzy.update_predicate_subsumption(pred2super_pred12: Dict[Any, Dict[Any, float]], pred2super_pred21: Dict[Any, Dict[Any, float]], previous_predicate2super_predicate: Dict[Any, Dict[Any, float]] | None) Dict[Any, Dict[Any, float]][source]

Update predicate subsumption scores bidirectionally.

Updates subsumption scores from KB1→KB2 and KB2→KB1, maintaining monotonicity of relation subsumption. Returns a new dictionary rather than modifying in-place.

Parameters:
  • pred2super_pred12 – Current subsumption scores from KB1 to KB2 predicates.

  • pred2super_pred21 – Current subsumption scores from KB2 to KB1 predicates.

  • previous_predicate2super_predicate – Previous subsumption scores to be updated. If None, an empty dictionary is created.

Returns:

Updated predicate subsumption scores dictionary.

ontoaligner.aligner.flora.fuzzy.update_score_additive_min(mapping: Dict[Any, Dict[Any, float]], key1: Any, key2: Any, factor: float, *body) Dict[Any, Dict[Any, float]][source]

Update score using additive minimum operator.

Updates mapping[key1][key2] by adding the rule value. Used for subrelation rules, as shown in equation (2) in the FLORA paper. Returns the updated mapping.

Parameters:
  • mapping – Nested dictionary to be updated with subrelation scores.

  • key1 – The predicate from KB1.

  • key2 – The predicate from KB2.

  • factor – Normalization factor (already multiplied by benefit-of-doubt parameter).

  • *body – Values in the body of the rule.

Returns:

Updated mapping dictionary with new subrelation scores.

ontoaligner.aligner.flora.fuzzy.update_score_min(mapping: Dict[Any, Dict[Any, float]], key1: Any, key2: Any, *body) Dict[Any, Dict[Any, float]][source]

Update score using minimum operator (Gödel logic).

Updates mapping[key1][key2] so that the rule body ⇒ mapping[key1][key2] holds using the minimum operator, as shown in equation (1) in the FLORA paper. Returns the updated mapping dictionary.

Parameters:
  • mapping – Nested dictionary to be updated with entity alignment scores.

  • key1 – The entity from KB1.

  • key2 – The entity from KB2.

  • *body – Values in the body of the rule.

Returns:

Updated mapping dictionary with new alignment scores.
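
In Gödel logic a conjunction is evaluated with min, and an implication body ⇒ head is fully satisfied once the head is at least as large as the body. A minimal sketch of an update in that spirit is shown below. The name godel_update and the default of 0.0 for missing entries are assumptions, and the library function may differ in detail.

```python
from typing import Any, Dict


def godel_update(mapping: Dict[Any, Dict[Any, float]],
                 key1: Any, key2: Any, *body: float) -> Dict[Any, Dict[Any, float]]:
    """Raise mapping[key1][key2] to at least min(body); never lower it."""
    if not body:
        return mapping  # nothing to infer from an empty rule body
    rule_value = min(body)  # Goedel conjunction of the body atoms
    row = mapping.setdefault(key1, {})
    row[key2] = max(row.get(key2, 0.0), rule_value)
    return mapping
```

Because the update only ever takes a maximum, repeated applications leave scores non-decreasing.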

OLaLa Aligner

class ontoaligner.aligner.olala.olala.OLaLaAligner(retriever: Any, llm_aligner: Any, hp_aligner: Any, **kwargs)[source]

Bases: BaseOMModel

Initializes the ontology matching model with optional keyword arguments.

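Parameters:
  • retriever – Retrieval model that generates candidate correspondences.

  • llm_aligner – LLM-based aligner that verifies the retrieved candidates.

  • hp_aligner – High-precision matcher whose exact matches are combined with the LLM results.

  • **kwargs – Additional keyword arguments that may be used for model configuration or parameters.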

__init__(retriever: Any, llm_aligner: Any, hp_aligner: Any, **kwargs) None

Initializes the ontology matching model with optional keyword arguments.

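Parameters:
  • retriever – Retrieval model that generates candidate correspondences.

  • llm_aligner – LLM-based aligner that verifies the retrieved candidates.

  • hp_aligner – High-precision matcher whose exact matches are combined with the LLM results.

  • **kwargs – Additional keyword arguments that may be used for model configuration or parameters.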

generate(input_data: List) List

Generates ontology alignment results by chaining retrieval, LLM verification, and high-precision matching.

Parameters:

input_data (List) – A list containing the encoded source and target ontologies.

Returns:

The combined alignments from LLM and high-precision matching, each annotated with an alignment_type field.

Return type:

List
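
The chaining described for generate can be pictured with plain callables standing in for the three components. Everything in this sketch is hypothetical: the function run_pipeline, the callable signatures, and the alignment_type values "llm" and "high_precision" are assumptions made for illustration, and no deduplication between the two result sets is attempted.

```python
from typing import Any, Callable, Dict, List

Alignment = Dict[str, Any]


def run_pipeline(input_data: List[Any],
                 retrieve: Callable[[List[Any]], List[Alignment]],
                 verify: Callable[[List[Any], List[Alignment]], List[Alignment]],
                 exact_match: Callable[[List[Any]], List[Alignment]]) -> List[Alignment]:
    """Chain retrieval, LLM verification and exact matching, tagging each result."""
    candidates = retrieve(input_data)             # step 1: candidate generation
    llm_matches = verify(input_data, candidates)  # step 2: LLM verification
    hp_matches = exact_match(input_data)          # step 3: high-precision matches
    # Copy each alignment and annotate where it came from.
    combined = [dict(match, alignment_type="llm") for match in llm_matches]
    combined += [dict(match, alignment_type="high_precision") for match in hp_matches]
    return combined
```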

load(llm_path: str, retriever_path: str) None

Loads the LLM aligner and the retriever from the given paths.

class ontoaligner.aligner.olala.dataset.OLaLaLLMDataset(source_onto: Any, target_onto: Any, candidates: Any, system_prompt_template: str = '{user_prompt}')[source]

Bases: Dataset

A dataset for OLaLa LLM candidate verification.

Initializes the OLaLa LLM dataset from candidate correspondences.

Parameters:
  • source_onto (Any) – The encoded source ontology.

  • target_onto (Any) – The encoded target ontology.

  • candidates (Any) – The SBERT candidate predictions.

  • system_prompt_template (str) – Optional prompt wrapper.

collate_fn(batches)

Collates OLaLa LLM examples.

Parameters:

batches – The batch examples.

Returns:

The collated batch.

Return type:

Dict

fill_one_sample(input_data: Any) str

Builds one OLaLa LLM prompt.

Parameters:

input_data (Any) – One candidate pair.

Returns:

The filled prompt.

Return type:

str

preprocess(text: str) str

Preprocesses text for OLaLa LLM prompting.

Parameters:

text (str) – The text to preprocess.

Returns:

The preprocessed text.

Return type:

str

prompt = 'Classify if two descriptions refer to the same real world entity (ontology matching).\n### Concept one: endocrine pancreas secretion ### Concept two: Pancreatic Endocrine Secretion ### Answer: yes\n### Concept one: urinary bladder urothelium ### Concept two: Transitional Epithelium ### Answer: no\n### Concept one: trigeminal V nerve ophthalmic division ### Concept two: Ophthalmic Nerve ### Answer: yes\n### Concept one: foot digit 1 phalanx ### Concept two: Foot Digit 2 Phalanx ### Answer: no\n### Concept one: large intestine ### Concept two: Colon ### Answer: no\n### Concept one: ocular refractive media ### Concept two: Refractile Media ### Answer: yes\n### Concept one: {left} ### Concept two: {right} ### Answer: '
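
The prompt attribute is a few-shot template with {left} and {right} placeholders, and system_prompt_template wraps the result through a {user_prompt} placeholder. A minimal sketch of how such a template could be filled is given below, using a shortened copy of the template. The helper names and the normalization done in preprocess are assumptions, not the dataset's actual implementation.

```python
def preprocess(text: str) -> str:
    """Assumed normalization: underscores to spaces, whitespace collapsed."""
    return " ".join(text.replace("_", " ").split())


# Shortened copy of the few-shot template, for illustration only.
FEW_SHOT_TEMPLATE = (
    "Classify if two descriptions refer to the same real world entity (ontology matching).\n"
    "### Concept one: large intestine ### Concept two: Colon ### Answer: no\n"
    "### Concept one: {left} ### Concept two: {right} ### Answer: "
)


def fill_prompt(left: str, right: str, system_prompt_template: str = "{user_prompt}") -> str:
    """Fill the few-shot template, then wrap it in the system prompt template."""
    user_prompt = FEW_SHOT_TEMPLATE.format(left=preprocess(left), right=preprocess(right))
    return system_prompt_template.format(user_prompt=user_prompt)
```

With the default wrapper '{user_prompt}' the filled few-shot prompt is returned unchanged, so the wrapper only matters for models that expect a chat-style envelope.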

This script defines OLaLa lightweight matchers for ontology matching.

The high-precision matcher creates exact correspondences from normalized labels and URI fragments.

class ontoaligner.aligner.olala.highprecision_matcher.OLaLaHighPrecisionMatcher(confidence: float = 1.0, **kwargs)[source]

Bases: Lightweight

A high-precision exact matcher for OLaLa.

Initializes the OLaLa high-precision matcher.

Parameters:
  • confidence (float) – The confidence assigned to exact matches.

  • **kwargs – Additional keyword arguments.

build_text_index(ontology: List[Dict[str, Any]]) Dict[str, Set[str]]

Builds a text-to-IRI index for source ontology entities.

Parameters:

ontology (List[Dict[str, Any]]) – The encoded source ontology.

Returns:

The text-to-source-IRI index.

Return type:

Dict[str, Set[str]]

filter_n_to_m(pairs: Set[Tuple[str, str]]) Set[Tuple[str, str]]

Removes N:M correspondences.

Parameters:

pairs (Set[Tuple[str, str]]) – The candidate pairs.

Returns:

The remaining unambiguous pairs.

Return type:

Set[Tuple[str, str]]
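
A simple way to obtain unambiguous pairs is to drop every pair whose source or target occurs in more than one candidate. The sketch below does exactly that. Whether the library also discards 1:N and N:1 cases in this way is an assumption, and drop_ambiguous_pairs is a hypothetical name.

```python
from collections import Counter
from typing import Set, Tuple


def drop_ambiguous_pairs(pairs: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Keep only pairs whose source and target each appear exactly once."""
    source_counts = Counter(source for source, _ in pairs)
    target_counts = Counter(target for _, target in pairs)
    return {
        (source, target)
        for source, target in pairs
        if source_counts[source] == 1 and target_counts[target] == 1
    }
```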

generate(input_data: List) List

Generates high-precision exact correspondences.

Parameters:

input_data (List) – The encoded source and target ontologies.

Returns:

The high-precision correspondences.

Return type:

List

get_string_representations(item: Dict[str, Any]) Set[str]

Retrieves high-precision string representations for one entity.

Parameters:

item (Dict[str, Any]) – The encoded ontology item.

Returns:

The normalized high-precision texts.

Return type:

Set[str]

match_exact_texts(source_ontology: List[Dict[str, Any]], target_ontology: List[Dict[str, Any]]) Set[Tuple[str, str]]

Creates exact correspondences from normalized texts.

Parameters:
  • source_ontology (List[Dict[str, Any]]) – The encoded source ontology.

  • target_ontology (List[Dict[str, Any]]) – The encoded target ontology.

Returns:

The exact candidate pairs.

Return type:

Set[Tuple[str, str]]
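
Exact matching over normalized texts can be sketched with a text-to-IRI index on the source side and a lookup for every target text. The item fields "iri" and "label", the normalization rule, and all function names here are assumptions chosen for illustration; the encoded ontology items in the library may carry different keys.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, separators to spaces, whitespace collapsed."""
    return " ".join(text.replace("_", " ").replace("-", " ").lower().split())


def texts_of(item: Dict[str, Any]) -> Set[str]:
    """Normalized label and URI fragment of one item (assumed keys: iri, label)."""
    fragment = item["iri"].rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    candidates = (normalize(item.get("label", "")), normalize(fragment))
    return {text for text in candidates if text}


def build_index(ontology: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each normalized text to the IRIs that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for item in ontology:
        for text in texts_of(item):
            index[text].add(item["iri"])
    return dict(index)


def match_exact(source: List[Dict[str, Any]],
                target: List[Dict[str, Any]]) -> Set[Tuple[str, str]]:
    """Pair every target item with the source IRIs sharing a normalized text."""
    index = build_index(source)
    return {
        (source_iri, item["iri"])
        for item in target
        for text in texts_of(item)
        for source_iri in index.get(text, ())
    }
```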

class ontoaligner.aligner.olala.retrieval.OLaLaSBERTRetrieval(device: str = 'cpu', top_k: int = 5, both_directions: bool = True, topk_per_resource: bool = True, **kwargs)[source]

Bases: BiEncoderRetrieval

A SentenceTransformers retrieval model for OLaLa candidate generation.

Initializes the OLaLa SBERT retrieval model.

Parameters:
  • device (str) – The device used by SentenceTransformers.

  • top_k (int) – The number of candidates to retrieve per resource.

  • both_directions (bool) – Whether to search in both ontology directions.

  • topk_per_resource (bool) – Whether top-k filtering is applied per resource.

  • **kwargs – Additional keyword arguments.

add_prediction(predictions: Dict[Tuple[str, str], float], source_iri: str, target_iri: str, score: float) None

Adds or updates a predicted correspondence.

Parameters:
  • predictions (Dict[Tuple[str, str], float]) – The prediction dictionary.

  • source_iri (str) – The source entity IRI.

  • target_iri (str) – The target entity IRI.

  • score (float) – The confidence score.

Returns:

None

filter_topk_per_resource(predictions: Dict[Tuple[str, str], float]) Dict[Tuple[str, str], float]

Filters correspondences by top-k source and target resources.

Parameters:

predictions (Dict[Tuple[str, str], float]) – Candidate pairs and scores.

Returns:

The filtered candidate pairs and scores.

Return type:

Dict[Tuple[str, str], float]
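
One way to read top-k filtering per resource is sketched below: a pair is kept if it ranks among the top_k pairs of its source or among the top_k pairs of its target. This union semantics is an assumption, as are the function name keep_topk_per_resource and the explicit top_k argument; the library method may combine the two rankings differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def keep_topk_per_resource(predictions: Dict[Pair, float], top_k: int) -> Dict[Pair, float]:
    """Keep pairs ranked in the top_k of their source or of their target."""
    by_source: Dict[str, List[Tuple[float, Pair]]] = defaultdict(list)
    by_target: Dict[str, List[Tuple[float, Pair]]] = defaultdict(list)
    for pair, score in predictions.items():
        by_source[pair[0]].append((score, pair))
        by_target[pair[1]].append((score, pair))
    kept = set()
    # Rank each resource's candidates by score, highest first.
    for group in list(by_source.values()) + list(by_target.values()):
        kept.update(pair for _, pair in sorted(group, reverse=True)[:top_k])
    return {pair: predictions[pair] for pair in kept}
```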

generate(input_data: List) List

Generates OLaLa SBERT candidate correspondences.

Parameters:

input_data (List) – The encoded source and target ontologies.

Returns:

The generated candidate correspondences.

Return type:

List

get_text_examples(ontology: List[Dict[str, Any]]) List[Dict[str, str]]

Creates multiple text examples per ontology resource.

Parameters:

ontology (List[Dict[str, Any]]) – The encoded ontology items.

Returns:

The text examples.

Return type:

List[Dict[str, str]]

load(path: str = 'multi-qa-mpnet-base-dot-v1')

Loads the SentenceTransformers model.

Parameters:

path (str) – The model path or HuggingFace model name.

Returns:

None

merge_predictions(predictions: Dict[Tuple[str, str], float]) List[Dict[str, Any]]

Converts pair-level predictions to OntoAligner retrieval output.

Parameters:

predictions (Dict[Tuple[str, str], float]) – Candidate pairs and scores.

Returns:

The grouped retrieval predictions.

Return type:

List[Dict[str, Any]]

search_direction(query_examples: List[Dict[str, str]], corpus_examples: List[Dict[str, str]], reverse: bool = False) Dict[Tuple[str, str], float]

Searches one ontology direction and returns candidate correspondences.

Parameters:
  • query_examples (List[Dict[str, str]]) – The query text examples.

  • corpus_examples (List[Dict[str, str]]) – The corpus text examples.

  • reverse (bool) – Whether the search direction is target-to-source.

Returns:

Candidate pairs and confidence scores.

Return type:

Dict[Tuple[str, str], float]