Aligners¶
Lightweight Aligners¶
This module defines three variants of the FuzzySMLightweight class, each using a different string similarity ratio from the RapidFuzz library.
The SimpleFuzzySMLightweight, WeightedFuzzySMLightweight, and TokenSetFuzzySMLightweight classes each override the ratio_estimate method to return a different RapidFuzz comparison function for fuzzy string matching.
- Classes:
SimpleFuzzySMLightweight: Inherits from FuzzySMLightweight and uses the basic string ratio.
WeightedFuzzySMLightweight: Inherits from FuzzySMLightweight and uses weighted string ratio.
TokenSetFuzzySMLightweight: Inherits from FuzzySMLightweight and uses token set ratio for fuzzy matching.
- class ontoaligner.aligner.lightweight.models.SimpleFuzzySMLightweight(fuzzy_sm_threshold: float = 0.5, **kwargs)¶
Bases:
FuzzySMLightweight
A subclass of FuzzySMLightweight that uses the basic string similarity ratio from RapidFuzz.
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
fuzzy_sm_threshold (float) – Threshold value for fuzzy string matching. Defaults to 0.5.
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.
- ratio_estimate() Any¶
Returns the string matching ratio function from RapidFuzz.
This method overrides the parent method to return the ratio function from RapidFuzz, which is used to calculate the basic fuzzy string matching score.
- Returns:
The rapidfuzz.fuzz.ratio function used for basic string similarity.
- Return type:
Any
- class ontoaligner.aligner.lightweight.models.TokenSetFuzzySMLightweight(fuzzy_sm_threshold: float = 0.5, **kwargs)¶
Bases:
FuzzySMLightweight
A subclass of FuzzySMLightweight that uses the token set ratio for string similarity from RapidFuzz.
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
fuzzy_sm_threshold (float) – Threshold value for fuzzy string matching. Defaults to 0.5.
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.
- ratio_estimate() Any¶
Returns the token set string matching ratio function from RapidFuzz.
This method overrides the parent method to return the token_set_ratio function from RapidFuzz, which calculates similarity by comparing sets of tokens rather than the full string.
- Returns:
The rapidfuzz.fuzz.token_set_ratio function used for token set similarity.
- Return type:
Any
- class ontoaligner.aligner.lightweight.models.WeightedFuzzySMLightweight(fuzzy_sm_threshold: float = 0.5, **kwargs)¶
Bases:
FuzzySMLightweight
A subclass of FuzzySMLightweight that uses a weighted string similarity ratio from RapidFuzz.
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
fuzzy_sm_threshold (float) – Threshold value for fuzzy string matching. Defaults to 0.5.
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.
- ratio_estimate() Any¶
Returns the weighted string matching ratio function from RapidFuzz.
This method overrides the parent method to return the WRatio function from RapidFuzz, which calculates a weighted fuzzy matching score between two strings.
- Returns:
The rapidfuzz.fuzz.WRatio function used for weighted string similarity.
- Return type:
Any
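The override pattern used by the three lightweight classes can be sketched as follows. This is a minimal illustration under stated assumptions, not OntoAligner's implementation: FuzzyMatcherSketch, its match method, and both scorer functions are hypothetical names, and difflib.SequenceMatcher from the standard library stands in for the RapidFuzz scorers so that the example runs without extra dependencies. The stand-in scorers return values in [0, 1].

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


def simple_ratio(a: str, b: str) -> float:
    """Stand-in for a basic ratio scorer; returns a score in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a: str, b: str) -> float:
    """Simplified stand-in for a token-set scorer: ignores word order and duplicates."""
    norm_a = " ".join(sorted(set(a.lower().split())))
    norm_b = " ".join(sorted(set(b.lower().split())))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


class FuzzyMatcherSketch:
    """Hypothetical base class: subclasses pick the scorer via ratio_estimate()."""

    def __init__(self, fuzzy_sm_threshold: float = 0.5):
        self.fuzzy_sm_threshold = fuzzy_sm_threshold

    def ratio_estimate(self) -> Callable[[str, str], float]:
        raise NotImplementedError

    def match(self, source: str, targets: List[str]) -> Optional[Tuple[str, float]]:
        """Return the best-scoring target, or None if it falls below the threshold."""
        scorer = self.ratio_estimate()
        best = max(targets, key=lambda t: scorer(source, t))
        score = scorer(source, best)
        return (best, score) if score >= self.fuzzy_sm_threshold else None


class SimpleSketch(FuzzyMatcherSketch):
    def ratio_estimate(self) -> Callable[[str, str], float]:
        return simple_ratio


class TokenSetSketch(FuzzyMatcherSketch):
    def ratio_estimate(self) -> Callable[[str, str], float]:
        return token_set_ratio


targets = ["heart disease", "lung cancer"]
print(SimpleSketch().match("disease of the heart", targets))
print(TokenSetSketch().match("disease heart", targets))
```

Because each subclass only returns a scoring function, the matching loop and the threshold check stay in one place, and swapping the similarity measure is a one-line change.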
Retrieval Aligners¶
This script defines various retrieval models used for information retrieval tasks. It includes both traditional methods (such as TF-IDF and BM25) as well as more modern approaches using bi-encoder architectures and pre-trained models. The models are designed to compute similarity scores between a query and candidate documents.
- Classes:
SBERTRetrieval: A retrieval class extending BiEncoderRetrieval using BERT-based encoding.
FlanT5Retrieval: A retrieval class extending BiEncoderRetrieval using Flan-T5 model encoding.
TFIDFRetrieval: A retrieval class using TF-IDF vectorization for document similarity estimation.
BM25Retrieval: A retrieval class using BM25 (Okapi BM25) model for document similarity estimation.
SVMBERTRetrieval: A retrieval class extending MLRetrieval using SVM-based BERT retrieval.
AdaRetrieval: A retrieval class using embeddings loaded from pre-trained OpenAI models.
- class ontoaligner.aligner.retrieval.models.AdaRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)¶
Bases:
BiEncoderRetrieval
AdaRetrieval is a subclass of BiEncoderRetrieval that uses pre-trained embeddings from OpenAI. It is designed to load embeddings from files, fit them, and transform input data into corresponding embeddings.
Initializes the Retrieval model.
- Parameters:
**kwargs – Additional keyword arguments passed to the superclass constructor.
- fit(inputs: Any) Any¶
Fits the model by transforming the input data into corresponding embeddings.
- Parameters:
inputs (Any) – The input data to fit the model on.
- Returns:
Transformed embeddings based on the input data.
- Return type:
Any
- load(path: str)¶
Loads the pre-trained OpenAI embeddings and label-to-index mappings from files.
- Parameters:
path (str) – The directory path where the embeddings and labels are stored.
- transform(inputs: Any) Any¶
Transforms input data into embeddings based on pre-trained OpenAI model.
- Parameters:
inputs (Any) – The input data (strings) to transform into embeddings.
- Returns:
An array of embeddings for the input data.
- Return type:
np.array
- class ontoaligner.aligner.retrieval.models.BM25Retrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)¶
Bases:
Retrieval
BM25Retrieval implements the BM25 retrieval model (Okapi BM25), a probabilistic information retrieval method. This model is used to estimate document relevance based on term frequency and inverse document frequency. Reference: http://ethen8181.github.io/machine-learning/search/bm25_intro.html
Initializes the Retrieval model.
- Parameters:
**kwargs – Additional keyword arguments passed to the superclass constructor.
- estimate_similarity(query_embed: Any, candidate_embeds: Any) Any¶
Estimates similarity scores between the query and candidate documents using BM25.
- Parameters:
query_embed (Any) – The query embedding or tokens.
candidate_embeds (Any) – The candidate document embeddings or tokens.
- Returns:
BM25 similarity scores between the query and candidate documents.
- Return type:
Any
- fit(inputs: Any) Any¶
Tokenizes the input documents and fits the BM25 model.
- Parameters:
inputs (Any) – The input data (documents) to fit the model on.
- Returns:
None
- load(path: str | None = None)¶
Loads the BM25 model. In this implementation, no additional loading is needed.
- Parameters:
path (str, optional) – Path to load model from (default is None).
- transform(inputs: Any) Any¶
Tokenizes the input data.
- Parameters:
inputs (Any) – The input data to tokenize.
- Returns:
Tokenized input data.
- Return type:
Any
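The fit, transform, and estimate_similarity steps of a BM25 retriever can be illustrated with a small pure-Python scorer. This is a sketch under stated assumptions rather than the library's code: bm25_scores is a hypothetical helper, tokenization is plain lowercase whitespace splitting, and k1 = 1.5 and b = 0.75 are common default choices that do not come from this page.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is length-normalized with b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = ["heart disease", "lung cancer", "disease of the lung"]
tokenized = [doc.lower().split() for doc in corpus]  # the "fit"/"transform" step
scores = bm25_scores("heart disease".split(), tokenized)
print(scores)
```

Documents that share no term with the query score exactly zero, and rarer query terms contribute more than common ones.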
- class ontoaligner.aligner.retrieval.models.SBERTRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)¶
Bases:
BiEncoderRetrieval
SBERTRetrieval is a subclass of BiEncoderRetrieval that uses a BERT-based encoder for retrieval tasks. This class implements a method for returning the string representation of the retrieval model, appending the specific model’s name.
Initializes the Retrieval model.
- Parameters:
**kwargs – Additional keyword arguments passed to the superclass constructor.
- class ontoaligner.aligner.retrieval.models.SVMBERTRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)¶
Bases:
MLRetrieval
SVMBERTRetrieval is a subclass of MLRetrieval that uses a Support Vector Machine (SVM) combined with BERT-based embeddings for retrieval tasks.
Initializes the Retrieval model.
- Parameters:
**kwargs – Additional keyword arguments passed to the superclass constructor.
- class ontoaligner.aligner.retrieval.models.TFIDFRetrieval(device: str = 'cpu', top_k: int = 5, openai_key: str = 'None', **kwargs)¶
Bases:
Retrieval
TFIDFRetrieval implements the TF-IDF vectorization method for document retrieval. It allows for fitting a TF-IDF model to input data, transforming input data into feature vectors, and estimating the similarity between query and candidate documents using cosine similarity.
Initializes the Retrieval model.
- Parameters:
**kwargs – Additional keyword arguments passed to the superclass constructor.
- estimate_similarity(query_embed: Any, candidate_embeds: Any) Any¶
Estimates the cosine similarity between the query and candidate embeddings.
- Parameters:
query_embed (Any) – The query embedding.
candidate_embeds (Any) – The candidate embeddings.
- Returns:
Cosine similarity scores between the query and candidate embeddings.
- Return type:
Any
- fit(inputs: Any) Any¶
Fits the TF-IDF model on the input data and transforms it into feature vectors.
- Parameters:
inputs (Any) – The input data to fit the model on.
- Returns:
Transformed feature vectors based on the input data.
- Return type:
Any
- load(path: str | None = None)¶
Loads the TF-IDF vectorizer model.
- Parameters:
path (str, optional) – The path to load the model from (default is None).
- transform(inputs: Any) Any¶
Transforms the input data into TF-IDF feature vectors.
- Parameters:
inputs (Any) – The input data to transform.
- Returns:
Transformed TF-IDF feature vectors.
- Return type:
Any
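The fit, transform, and cosine-similarity flow described for TF-IDF retrieval can be sketched with the standard library alone. The function names (fit_idf, transform, cosine, top_k) and the smoothed IDF formula are assumptions chosen for illustration; the vectorizer backend that TFIDFRetrieval actually uses is not stated on this page.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def fit_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Compute smoothed inverse document frequencies from tokenized documents."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}


def transform(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Map a tokenized document to a sparse TF-IDF vector (unseen terms are dropped)."""
    tf = Counter(term for term in doc if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query: List[str], candidates: List[List[str]],
          idf: Dict[str, float], k: int = 5) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs for the k most similar candidates."""
    q = transform(query, idf)
    scored = [(i, cosine(q, transform(c, idf))) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


candidates = [s.split() for s in ["heart disease", "lung cancer", "disease of the lung"]]
idf = fit_idf(candidates)
ranking = top_k("heart disease".split(), candidates, idf, k=2)
print(ranking)
```

The k argument plays the role that the top_k constructor parameter suggests: only the highest-scoring candidates are kept for later stages.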
LLM Aligners¶
This script defines various subclasses for different types of language models (LMs), including encoder-decoder models, decoder-only models, and models interfacing with OpenAI’s GPT. These classes inherit from predefined abstract base classes for LLM architectures and customize them for specific architectures and models.
- class ontoaligner.aligner.llm.models.AutoModelDecoderLLM(**kwargs)¶
Bases:
DecoderLLMArch
A subclass of DecoderLLMArch for auto-decoder language models.
Initializes DecoderLLMArch with specific LLM lists for special tokenization and Hugging Face token requirements.
- model¶
alias of
AutoModelForCausalLM
- tokenizer¶
alias of
AutoTokenizer
- class ontoaligner.aligner.llm.models.FlanT5LEncoderDecoderLM(**kwargs)¶
Bases:
EncoderDecoderLLMArch
A subclass of EncoderDecoderLLMArch for the Flan-T5 encoder-decoder language model.
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.
- model¶
alias of
T5ForConditionalGeneration
- tokenizer¶
alias of
Placeholder
- class ontoaligner.aligner.llm.models.GPTOpenAILLM(**kwargs)¶
Bases:
OpenAILLMArch
A subclass of OpenAILLMArch specifically for interacting with OpenAI’s GPT models.
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.
RAG Aligners¶
This script defines a series of Retrieval-Augmented Generation (RAG) classes that combine different retrieval models and large language models (LLMs). Each class pairs a specific retrieval model (e.g., AdaRetrieval, SBERTRetrieval) with a specific language model (e.g., AutoModelDecoderRAGLLM, OpenAIRAGLLM), exposed through the Retrieval and LLM class attributes. These classes are designed to perform retrieval-augmented generation tasks for various configurations of models.
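The pairing pattern behind these classes, where a subclass only names its retriever class and its LLM class, can be sketched as follows. Every name in this block (RAGSketch, KeywordRetrieverSketch, ExactMatchLLMSketch, align, retrieve, generate) is hypothetical, the constructor is simplified to the two config arguments, and the toy components replace real embedding models and language models so the example runs on its own.

```python
from typing import Any, Dict, List, Optional


class KeywordRetrieverSketch:
    """Hypothetical retriever: ranks candidates by number of shared tokens."""

    def __init__(self, top_k: int = 5, **kwargs: Any):
        self.top_k = top_k

    def retrieve(self, query: str, candidates: List[str]) -> List[str]:
        q = set(query.lower().split())
        ranked = sorted(candidates, key=lambda c: len(q & set(c.lower().split())), reverse=True)
        return ranked[: self.top_k]


class ExactMatchLLMSketch:
    """Hypothetical stand-in for an LLM answering a yes/no matching question."""

    def __init__(self, **kwargs: Any):
        pass

    def generate(self, source: str, candidate: str) -> str:
        return "yes" if source.lower() == candidate.lower() else "no"


class RAGSketch:
    """Base class: subclasses set Retrieval and LLM; __init__ instantiates both."""

    Retrieval = None
    LLM = None

    def __init__(self, retriever_config: Optional[Dict] = None,
                 llm_config: Optional[Dict] = None):
        self.retriever = self.Retrieval(**(retriever_config or {}))
        self.llm = self.LLM(**(llm_config or {}))

    def align(self, source: str, candidates: List[str]) -> List[str]:
        # Stage 1: retrieve a shortlist. Stage 2: let the LLM confirm each pair.
        shortlist = self.retriever.retrieve(source, candidates)
        return [c for c in shortlist if self.llm.generate(source, c) == "yes"]


class KeywordExactRAGSketch(RAGSketch):
    # A concrete pairing is just two class attributes.
    Retrieval = KeywordRetrieverSketch
    LLM = ExactMatchLLMSketch


rag = KeywordExactRAGSketch(retriever_config={"top_k": 2})
matches = rag.align("Heart Disease", ["heart disease", "lung disease", "lung cancer"])
print(matches)
```

Keeping the pairing in class attributes means a new retriever and LLM combination needs no new logic, which is consistent with the many near-identical classes listed in this section.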
- class ontoaligner.aligner.rag.models.FalconLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
FalconLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.FalconLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
FalconLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.rag.models.GPTOpenAILLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
GPTOpenAILLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the OpenAIRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
OpenAIRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.GPTOpenAILLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
GPTOpenAILLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the OpenAIRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
OpenAIRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.rag.models.LLaMALLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
LLaMALLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.LLaMALLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
LLaMALLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.rag.models.MPTLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
MPTLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.MPTLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
MPTLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLMV2 language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.rag.models.MambaLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
MambaLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the MambaSSMRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
MambaSSMRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.MambaLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
MambaLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the MambaSSMRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
MambaSSMRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.rag.models.MistralLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
MistralLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.MistralLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
MistralLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.rag.models.VicunaLLMAdaRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
VicunaLLMAdaRetrieverRAG class combines the AdaRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.rag.models.VicunaLLMBERTRetrieverRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
RAG
VicunaLLMBERTRetrieverRAG class combines the SBERTRetrieval retrieval model with the AutoModelDecoderRAGLLM language model.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
FewShot-RAG Aligners¶
This script defines a collection of classes that extend the FewShotRAG model, each combining a specific retrieval model and language model (LLM) configuration. These specialized configurations are tailored for various retrieval and generation tasks using different retrieval backends (Ada and BERT) and LLMs (OpenAI, AutoModelDecoderRAG, MambaSSM, etc.). Each class also overrides the string representation to identify the model configuration.
- class ontoaligner.aligner.fewshot.models.FalconLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.FalconLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.fewshot.models.GPTOpenAILLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and OpenAIRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
OpenAIRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.GPTOpenAILLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and OpenAIRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
OpenAIRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.fewshot.models.LLaMALLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.LLaMALLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.fewshot.models.MPTLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.MPTLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLMV2 as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLMV2
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.fewshot.models.MambaLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and MambaSSMRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
MambaSSMRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.MambaLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and MambaSSMRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
MambaSSMRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.fewshot.models.MistralLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.MistralLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.fewshot.models.VicunaLLMAdaRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with Ada retrieval and AutoModelDecoderRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.fewshot.models.VicunaLLMBERTRetrieverFSRAG(positive_ratio: float = 0.7, n_shots: int = 10, retriever_config=None, llm_config=None)¶
Bases:
FewShotRAG
FewShotRAG model with BERT retrieval and AutoModelDecoderRAGLLM as the language model (LLM).
Initializes the FewShotRAG class with specified parameters.
- Parameters:
**kwargs – Arbitrary keyword arguments.
positive_ratio (float) – The ratio of positive examples in the few-shot samples.
n_shots (int) – Number of shots to be used for few-shot learning, derived from input arguments.
- Returns:
None
- LLM¶
alias of
AutoModelDecoderRAGLLM
- Retrieval¶
alias of
SBERTRetrieval
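One way the positive_ratio and n_shots parameters of the few-shot classes could interact is sketched below. This is an assumption about the sampling behaviour, not a description of FewShotRAG internals: sample_few_shots, the "yes"/"no" labels, and the seed argument are all hypothetical. With the documented defaults (n_shots = 10, positive_ratio = 0.7), the sketch draws seven positive and three negative example pairs.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]
Shot = Tuple[str, str, str]


def sample_few_shots(positives: List[Pair], negatives: List[Pair],
                     n_shots: int = 10, positive_ratio: float = 0.7,
                     seed: int = 0) -> List[Shot]:
    """Draw up to n_shots labeled examples, about positive_ratio of them positive."""
    # Cap each count by what is available so random.sample never over-draws.
    n_pos = min(round(n_shots * positive_ratio), len(positives))
    n_neg = min(n_shots - n_pos, len(negatives))
    rng = random.Random(seed)
    shots = [(src, tgt, "yes") for src, tgt in rng.sample(positives, n_pos)]
    shots += [(src, tgt, "no") for src, tgt in rng.sample(negatives, n_neg)]
    rng.shuffle(shots)  # avoid presenting all positives before all negatives
    return shots


# Toy data: matching pairs share an index, non-matching pairs do not.
positives = [(f"s{i}", f"t{i}") for i in range(20)]
negatives = [(f"s{i}", f"t{i + 1}") for i in range(20)]
shots = sample_few_shots(positives, negatives)
print(len(shots), sum(1 for _, _, label in shots if label == "yes"))
```

The sampled triples would then be formatted into the prompt ahead of the pair the LLM is asked to judge.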
ICV-RAG Aligners¶
Script for integrating ICV-based language models with various retrieval mechanisms.
This script defines classes that combine different LLM and retrieval model pairings with ICV-based language modeling architectures. Each class pairs a specific retrieval model (e.g., AdaRetrieval, SBERTRetrieval) with an LLM model variant (e.g., AutoModelDecoderICVLLM, AutoModelDecoderICVLLMV2) for enhanced ontology matching and retrieval-based NLP tasks.
- class ontoaligner.aligner.icv.models.FalconLLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing Falcon-based LLM with AdaRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLMV2
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.icv.models.FalconLLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing Falcon-based LLM with SBERTRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLMV2
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.icv.models.LLaMALLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing LLaMA-based LLM with AdaRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.icv.models.LLaMALLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing LLaMA-based LLM with SBERTRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLM
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.icv.models.MPTLLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing MPT-based LLM with AdaRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLMV2
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.icv.models.MPTLLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing MPT-based LLM with SBERTRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLMV2
- Retrieval¶
alias of
SBERTRetrieval
- class ontoaligner.aligner.icv.models.VicunaLLMAdaRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing Vicuna-based LLM with AdaRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLM
- Retrieval¶
alias of
AdaRetrieval
- class ontoaligner.aligner.icv.models.VicunaLLMBERTRetrieverICVRAG(retriever=None, llm=None, retriever_config=None, llm_config=None)¶
Bases:
ICV
Class for pairing Vicuna-based LLM with SBERTRetrieval for ICV-based ontology matching.
Initializes the RAG model by loading the retriever and LLM components.
- Parameters:
**kwargs – Arbitrary keyword arguments passed to the parent class.
- LLM¶
alias of
AutoModelDecoderICVLLM
- Retrieval¶
alias of
SBERTRetrieval
KGE Aligners¶
- class ontoaligner.aligner.graph.models.BoxEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'BoxE'¶
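The num_negs_per_pos and random_seed parameters can be illustrated with a minimal sketch of negative sampling by tail corruption. The helper name corrupt_tails and the corruption strategy are assumptions made for illustration only; the source does not say how the aligners draw negatives internally.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def corrupt_tails(
    triples: Sequence[Triple],
    entities: Sequence[str],
    num_negs_per_pos: int = 5,
    random_seed: int = 42,
) -> List[Triple]:
    """Hypothetical helper: build negatives by replacing the tail of each positive triple."""
    rng = random.Random(random_seed)  # fixed seed makes the sampling reproducible
    known = set(triples)
    negatives: List[Triple] = []
    for head, relation, tail in triples:
        # Candidate tails must differ from the true tail and must not form a known triple.
        candidates = [
            e for e in entities
            if e != tail and (head, relation, e) not in known
        ]
        k = min(num_negs_per_pos, len(candidates))
        for fake_tail in rng.sample(candidates, k):
            negatives.append((head, relation, fake_tail))
    return negatives


example_entities = ["a", "b", "c", "d", "e", "f"]
example_triples = [("a", "r", "b"), ("c", "r", "d")]
print(corrupt_tails(example_triples, example_entities, num_negs_per_pos=3))
```

With the same random_seed the function returns the same negatives on every call, which is the behaviour the parameter description suggests.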
- class ontoaligner.aligner.graph.models.CompGCNAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'CompGCN'¶
- class ontoaligner.aligner.graph.models.ComplExAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'ComplEx'¶
- class ontoaligner.aligner.graph.models.ConvEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'ConvE'¶
- class ontoaligner.aligner.graph.models.CrossEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'CrossE'¶
- class ontoaligner.aligner.graph.models.DistMultAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'DistMult'¶
- class ontoaligner.aligner.graph.models.HolEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'HolE'¶
- class ontoaligner.aligner.graph.models.MuREAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'MuRE'¶
- class ontoaligner.aligner.graph.models.QuatEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'QuatE'¶
- quat_conj(q)¶
- quat_mul(q, r)¶
- quat_similarity(source, target)¶
- quat_similarity_normalized(source, target)¶
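The quaternion helpers above carry no description. As a hedged illustration, the sketch below shows the standard quaternion conjugate and Hamilton product that the names quat_conj and quat_mul suggest. It works on plain (w, x, y, z) tuples, which is an assumption: the actual methods may operate on batched embedding tensors, and the definitions of quat_similarity and quat_similarity_normalized are not given in the source, so they are not sketched here.

```python
from typing import Tuple

# Assumed component order: (w, x, y, z), with w the real part.
Quat = Tuple[float, float, float, float]


def quat_conj(q: Quat) -> Quat:
    """Conjugate: negate the three imaginary components."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def quat_mul(q: Quat, r: Quat) -> Quat:
    """Hamilton product q * r (not commutative)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


# i * j = k for the unit quaternions
print(quat_mul((0, 1, 0, 0), (0, 0, 1, 0)))
```

Multiplying a quaternion by its conjugate yields a purely real quaternion whose real part is the squared norm, which is a quick sanity check for both functions.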
- class ontoaligner.aligner.graph.models.RotatEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'RotatE'¶
- class ontoaligner.aligner.graph.models.SEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'SE'¶
- class ontoaligner.aligner.graph.models.SimplEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'SimplE'¶
- class ontoaligner.aligner.graph.models.TransDAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'TransD'¶
- class ontoaligner.aligner.graph.models.TransEAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'TransE'¶
- class ontoaligner.aligner.graph.models.TransFAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'TransF'¶
- class ontoaligner.aligner.graph.models.TransHAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'TransH'¶
- class ontoaligner.aligner.graph.models.TransRAligner(model: str = '', device: str = 'cpu', retriever: bool = False, embedding_dim: int = 300, num_epochs: int = 50, train_batch_size: int = 128, eval_batch_size: int = 64, num_negs_per_pos: int = 5, top_k: int = 5, random_seed: int = 42)[source]¶
Bases:
GraphEmbeddingAligner
Initializes the GraphEmbeddingAligner with training configuration.
- Parameters:
device (str) – Device to run the model on (‘cpu’ or ‘cuda’).
embedding_dim (int) – Dimensionality of the entity embeddings.
num_epochs (int) – Number of training epochs.
train_batch_size (int) – Batch size for training.
eval_batch_size (int) – Batch size for evaluation.
num_negs_per_pos (int) – Number of negative samples per positive triple.
random_seed (int) – Random seed for reproducibility.
- model: str = 'TransR'¶
PropMatch Aligner¶
- class ontoaligner.aligner.propmatch.propmatch.PropMatchAligner(fmt: str = 'word2vec', lowercase: bool = False, threshold: float = 0.65, steps: int = 2, sim_weight: List[int] | None = None, start_metrics: List[float] | None = None, device: str = 'cpu', disable_domain_range: bool = False)[source]¶
Bases:
BaseOMModel
Initialize the PropMatchAligner.
- Parameters:
fmt – Format for word embedding (e.g., “word2vec”)
lowercase – Whether to lowercase text
threshold – Minimum similarity threshold for matches
steps – Number of iterative refinement steps
sim_weight – Which similarity components to use [0:domain, 1:label, 2:range]
start_metrics – Additional threshold metrics for evaluation
device – Device for computation (“cpu” or “cuda”)
disable_domain_range – If True, only uses label similarity
- __init__(fmt: str = 'word2vec', lowercase: bool = False, threshold: float = 0.65, steps: int = 2, sim_weight: List[int] | None = None, start_metrics: List[float] | None = None, device: str = 'cpu', disable_domain_range: bool = False) None¶
Initialize the PropMatchAligner.
- Parameters:
fmt – Format for word embedding (e.g., “word2vec”)
lowercase – Whether to lowercase text
threshold – Minimum similarity threshold for matches
steps – Number of iterative refinement steps
sim_weight – Which similarity components to use [0:domain, 1:label, 2:range]
start_metrics – Additional threshold metrics for evaluation
device – Device for computation (“cpu” or “cuda”)
disable_domain_range – If True, only uses label similarity
- build_tf_models(source_onto: List[Dict], target_onto: List[Dict]) Tuple¶
Build the TF-IDF models for soft TF-IDF and general TF-IDF.
- Parameters:
source_onto – List of source property dictionaries
target_onto – List of target property dictionaries
- Returns:
Tuple of (soft_metric, general_metric) models
- cosine_similarity(vector1: ndarray, vector2: ndarray) float¶
Compute the cosine similarity between two vectors.
- Parameters:
vector1 – First vector
vector2 – Second vector
- Returns:
Cosine similarity score
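As a minimal sketch of what such a method computes, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The version below uses plain Python sequences instead of ndarray so that it is self-contained, and returning 0.0 for a zero vector is an assumption rather than documented behaviour.

```python
import math
from typing import Sequence


def cosine_similarity(vector1: Sequence[float], vector2: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(vector1) != len(vector2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm1 * norm2)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```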
- filter_adjectives(words: List[str]) List[str]¶
Filter adjectives from a list of words, keeping only nouns.
- Parameters:
words – List of words
- Returns:
List of words without adjectives (only nouns)
- generate(input_data: List[Dict]) List¶
Generate alignments between source and target ontology properties.
- Parameters:
source – List of source property dictionaries from encoder
target – List of target property dictionaries from encoder
- Returns:
List of alignment dictionaries with ‘source’, ‘target’, and ‘score’
- get_core_concept(entity: List[str]) List[str]¶
Get the core concept of an entity. The core concept is the first verb with length > 4 or the first noun with its adjectives.
- Parameters:
entity – List of words from property label
- Returns:
List of core concept words
- get_document_similarity(label_a_items: List[str], label_b_items: List[str], general_metric_model) Tuple[float, float]¶
Compute the document similarity between two property descriptions.
- Parameters:
label_a_items – List of words from property A
label_b_items – List of words from property B
general_metric_model – TF-IDF vectorizer model
- Returns:
Tuple of (conf_a, conf_b) similarity scores
- load(wordembedding_path: str, sentence_transformer_id: str) None¶
Loads the pre-trained models for word-embedding and sentence transformer.
- Parameters:
wordembedding_path (str) – The path to the pre-trained word-embedding.
sentence_transformer_id (str) – The path to the pre-trained sentence transformer.
- match_property(source: Dict, target: Dict, soft_metric_model, general_metric_model, confidence_map: Dict) float¶
Match two properties by comparing their labels, domains, and ranges.
- Parameters:
source – Source property dictionary
target – Target property dictionary
soft_metric_model – Soft TF-IDF model for label matching
general_metric_model – TF-IDF model for domain/range matching
confidence_map – Map of previously aligned classes for confidence boosting
- Returns:
Similarity confidence score
- sentence_transformer_model: Any = None¶
- wordembedding_model: Any = None¶
- class ontoaligner.aligner.propmatch.propmatch.SoftTfIdf(corpus: list[list[str]], sim_func, threshold: float = 0.8)[source]¶
Bases:
object
Soft TF-IDF similarity between two token lists. Uses a token-level sim_func and only counts tokens above threshold.
- __init__(corpus: list[list[str]], sim_func, threshold: float = 0.8)¶
- get_raw_score(tokens_a: list[str], tokens_b: list[str]) float¶
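The idea behind soft TF-IDF can be sketched as follows: each token list is turned into a normalized TF-IDF vector, every token of the first list is paired with its most similar token in the second list, and the pair contributes only when that similarity is above the threshold. The class SoftTfIdfSketch, its smoothed IDF formula, and the tie handling are assumptions for illustration and may differ from the library's SoftTfIdf.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


class SoftTfIdfSketch:
    """Illustrative soft TF-IDF; not the library's SoftTfIdf implementation."""

    def __init__(
        self,
        corpus: List[List[str]],
        sim_func: Callable[[str, str], float],
        threshold: float = 0.8,
    ) -> None:
        self.sim_func = sim_func
        self.threshold = threshold
        self.n_docs = len(corpus)
        self.doc_freq: Counter = Counter()
        for doc in corpus:
            self.doc_freq.update(set(doc))  # count each token once per document

    def _weights(self, tokens: List[str]) -> Dict[str, float]:
        """L2-normalized TF-IDF weights (smoothed IDF is an assumption)."""
        counts = Counter(tokens)
        raw = {
            tok: tf * (math.log((self.n_docs + 1) / (self.doc_freq[tok] + 1)) + 1.0)
            for tok, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {tok: w / norm for tok, w in raw.items()} if norm else {}

    def get_raw_score(self, tokens_a: List[str], tokens_b: List[str]) -> float:
        weights_a = self._weights(tokens_a)
        weights_b = self._weights(tokens_b)
        score = 0.0
        for tok_a, w_a in weights_a.items():
            best_tok, best_sim = None, 0.0
            for tok_b in weights_b:
                sim = self.sim_func(tok_a, tok_b)
                if sim > best_sim:
                    best_tok, best_sim = tok_b, sim
            # Only sufficiently similar token pairs contribute to the score.
            if best_tok is not None and best_sim > self.threshold:
                score += w_a * weights_b[best_tok] * best_sim
        return score


corpus = [["has", "author"], ["has", "writer"], ["written", "by"]]
exact_match = lambda a, b: 1.0 if a == b else 0.0
model = SoftTfIdfSketch(corpus, exact_match)
print(model.get_raw_score(["has", "author"], ["has", "writer"]))
```

With an exact-match sim_func the score reduces to ordinary TF-IDF cosine similarity; a fuzzier sim_func lets near-identical tokens contribute as well.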
FLORA Aligner¶
FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.
This module implements the FLORA aligner, an unsupervised knowledge graph alignment system that jointly aligns entities and relations using iterative fuzzy logic inference.
Algorithm Overview:
FLORA iteratively:
1. Bootstraps entity alignments from literal similarity (strings, dates, numbers)
2. Infers predicate subsumptions from aligned entity triples
3. Uses fuzzy logic rules to align additional entities based on predicate evidence
4. Repeats until convergence
Key Features:
- Unsupervised: No training data required (optional seed alignments supported)
- Holistic: Jointly aligns entities and relations iteratively
- Interpretable: All scores grounded in fuzzy logic rules
- Convergent: Monotone property ensures convergence
- Robust: Handles dangling entities and incomplete mappings
References:
Peng, Yiwen, Bonald, Thomas, and Suchanek, Fabian. “FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.” International Semantic Web Conference (ISWC), 2025. https://suchanek.name/work/publications/iswc-2025.pdf
- class ontoaligner.aligner.flora.flora.FLORAAligner(alpha: float = 2.0, init_threshold: float = 0.7, gramN: int = 100, epsilon: float = 0.01, max_iterations: int = 100, string_identity: bool = False, relinit: float = 0.1, ngrams: List[int] | None = None, model_id: str | None = None, emb_path: str | None = None, training_data: str | None = None, device: str | None = None, batch_size: int | None = 32, verbose: bool = False, workers: int | None = 4, **kwargs)[source]¶
Bases:
BaseOMModel
FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.
A fully unsupervised system for aligning two knowledge graphs by jointly matching entities and relations through iterative fuzzy logic inference.
Pipeline Overview:
Initialization – Load KGs and optional seed alignments
Literal bootstrapping – Align string/date/numeric literals using embeddings
First iteration – Infer predicate subsumptions from aligned literals
Main loop – Iteratively align entities using fuzzy rules and update predicates
Convergence – Stop when alignment scores stabilize
Parameters:
- alpha (float):
Benefit-of-doubt factor for subrelation inference (higher = more lenient). Default: 2.0
- init_threshold (float):
Minimum semantic similarity for bootstrapping literal alignment. Default: 0.7
- gramN (int):
Maximum number of evidential triples per entity during alignment. Default: 100
- epsilon (float):
Convergence threshold; stops when |Σ_new - Σ_old| < epsilon. Default: 0.01
- max_iterations (int):
Maximum number of main-loop iterations. Default: 100
- string_identity (bool):
If True, use only exact string matching for literals (no embeddings). Default: False
- relinit (float):
Initial score for non-identical predicates. Default: 0.1
- ngrams (List[int]):
N-gram sizes for functionality computation. Default: [1, 2]
- model_id (str or None):
Hugging Face model ID for embedding model (e.g., ‘Lihuchen/pearl_small’). Default: None
- training_data (str or None):
Path to seed alignment file (tab-separated, optional score column). Default: None
- device (str or None):
Device for embeddings (‘cuda’ or ‘cpu’). Auto-detects if None. Default: None
- batch_size (int or None):
Batch size for embedding computation. Default: 32
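The interplay of epsilon and max_iterations can be sketched as a simple fixed-point loop. The names total_score, iterate_until_converged, and update_step are hypothetical, and reading Σ as the sum of all alignment scores is an assumption; the source only states that iteration stops when |Σ_new - Σ_old| < epsilon.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, Dict[str, float]]


def total_score(scores: Scores) -> float:
    """Assumed meaning of Σ: the sum of all pairwise alignment scores."""
    return sum(v for targets in scores.values() for v in targets.values())


def iterate_until_converged(
    scores: Scores,
    update_step: Callable[[Scores], Scores],
    epsilon: float = 0.01,
    max_iterations: int = 100,
) -> Tuple[Scores, int]:
    """Apply update_step until the total score changes by less than epsilon."""
    previous_total = total_score(scores)
    for iteration in range(1, max_iterations + 1):
        scores = update_step(scores)
        current_total = total_score(scores)
        if abs(current_total - previous_total) < epsilon:
            return scores, iteration  # converged
        previous_total = current_total
    return scores, max_iterations  # forced termination


def halfway_to_one(scores: Scores) -> Scores:
    """Toy update rule: move every score halfway towards 1.0."""
    return {s: {t: v + (1.0 - v) / 2 for t, v in ts.items()} for s, ts in scores.items()}


final_scores, n_iterations = iterate_until_converged({"a": {"b": 0.5}}, halfway_to_one)
print(n_iterations, final_scores)
```

A smaller epsilon makes the loop run longer before the scores are considered stable, while max_iterations bounds the running time regardless.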
Example:
>>> from ontoaligner.aligner.flora import FLORAAligner
>>> aligner = FLORAAligner(alpha=3.0, init_threshold=0.7)
>>> matchings = aligner.generate(["kg1.ttl", "kg2.ttl"])
>>> for match in matchings[:3]:
...     print(f"{match['source']} -> {match['target']}: {match['score']:.2f}")
References:
Peng, Y., Bonald, T., & Suchanek, F. (2025). FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic. In Proc. ISWC 2025.
Initialize the FLORA aligner.
- Parameters:
alpha – Benefit-of-doubt parameter for subrelation mapping.
init_threshold – Initial similarity threshold for literal bootstrapping.
gramN – Maximum evidences per entity in alignment rules.
epsilon – Convergence threshold for score changes.
max_iterations – Maximum iterations before forced termination.
string_identity – Use exact string matching only (no embeddings).
relinit – Initial score for non-identical predicates.
ngrams – N-gram sizes for functionality computation.
model_id – Transformer model for literal embeddings.
emb_path – Optional path to pretrained embeddings.
training_data – Path to seed alignment file.
device – Device for tensor operations.
batch_size – Batch size for embedding computations.
**kwargs – Additional arguments passed to BaseOMModel.
- __init__(alpha: float = 2.0, init_threshold: float = 0.7, gramN: int = 100, epsilon: float = 0.01, max_iterations: int = 100, string_identity: bool = False, relinit: float = 0.1, ngrams: List[int] | None = None, model_id: str | None = None, emb_path: str | None = None, training_data: str | None = None, device: str | None = None, batch_size: int | None = 32, verbose: bool = False, workers: int | None = 4, **kwargs) None¶
Initialize the FLORA aligner.
- Parameters:
alpha – Benefit-of-doubt parameter for subrelation mapping.
init_threshold – Initial similarity threshold for literal bootstrapping.
gramN – Maximum evidences per entity in alignment rules.
epsilon – Convergence threshold for score changes.
max_iterations – Maximum iterations before forced termination.
string_identity – Use exact string matching only (no embeddings).
relinit – Initial score for non-identical predicates.
ngrams – N-gram sizes for functionality computation.
model_id – Transformer model for literal embeddings.
emb_path – Optional path to pretrained embeddings.
training_data – Path to seed alignment file.
device – Device for tensor operations.
batch_size – Batch size for embedding computations.
**kwargs – Additional arguments passed to BaseOMModel.
- bootstraping(kb1: Any, kb2: Any, same_as_scores: Dict[Any, Dict[Any, float]], predicate2super_predicate: Dict[Any, Dict[Any, float]], functionalities: Dict[Any, float], num_workers: int) Tuple[Dict[Any, Dict[Any, float]], Dict[Any, Dict[Any, float]], Dict[Any, Dict[Any, float]], Dict[Any, Dict[Any, float]]]¶
Perform the bootstrapping phase of entity and predicate alignment.
Runs the first iteration in parallel to align entities based on literal similarity, then infers predicate subsumptions from the aligned entity triples.
- Parameters:
kb1 – First knowledge base.
kb2 – Second knowledge base.
same_as_scores – Initial entity alignment scores (from literal bootstrapping).
predicate2super_predicate – Initial predicate subsumption scores.
functionalities – Predicate functionality scores.
num_workers – Number of parallel worker processes.
- Returns:
quasi_eqrel: Predicate quasi-equivalence relations
predicate2super_predicate: Updated predicate subsumption scores
same_as_scores: Updated entity alignments
ent_max_assign: Bilateral max assignments for entities
- Return type:
Tuple of (quasi_eqrel, predicate2super_predicate, same_as_scores, ent_max_assign)
- functionalities(kb1: Any, kb2: Any) Dict[Any, float]¶
Compute predicate functionality scores across two knowledge bases.
Functionality is the inverse of “diversity”: a functional predicate (like birthDate) has one value per subject, while a non-functional predicate (like knows) can have many values per subject.
Computes functionality for each predicate using n-gram analysis and returns the minimum score across both KGs (conservative estimate).
- Parameters:
kb1 – First knowledge base.
kb2 – Second knowledge base.
- Returns:
Dictionary mapping predicates to functionality scores in [0, 1].
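A minimal sketch of the idea, under stated assumptions: functionality is computed here as the number of distinct subjects divided by the number of distinct (subject, object) pairs of a predicate, and the cross-KG score is the minimum over the KGs in which the predicate occurs. The n-gram analysis mentioned above is omitted, triples are plain tuples, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def functionality(triples: Iterable[Triple]) -> Dict[str, float]:
    """Per predicate: distinct subjects / distinct (subject, object) pairs."""
    subjects: Dict[str, Set[str]] = defaultdict(set)
    pairs: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for s, p, o in triples:
        subjects[p].add(s)
        pairs[p].add((s, o))
    return {p: len(subjects[p]) / len(pairs[p]) for p in pairs}


def combined_functionality(kg1: Iterable[Triple], kg2: Iterable[Triple]) -> Dict[str, float]:
    """Conservative estimate: minimum functionality over the KGs containing the predicate."""
    f1, f2 = functionality(kg1), functionality(kg2)
    return {
        p: min(f[p] for f in (f1, f2) if p in f)
        for p in set(f1) | set(f2)
    }


kg1 = [
    ("alice", "birthDate", "1990"),
    ("bob", "birthDate", "1985"),
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
]
kg2 = [
    ("a", "birthDate", "1990"),
    ("a", "knows", "b"),
    ("a", "knows", "c"),
    ("a", "knows", "d"),
]
print(combined_functionality(kg1, kg2))
```

In this toy data birthDate has one value per subject, so its score stays at the top of the range, while knows has several values per subject and scores lower.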
- generate(input_data: List[Any]) List[Dict[str, Any]]¶
Run the complete FLORA alignment algorithm on two knowledge graphs.
This is the main entry point implementing the full FLORA pipeline:
1. Load optional seed alignments
2. Initialize predicate subsumption scores
3. Compute predicate functionalities
4. Bootstrap entity alignment using literal similarity
5. Run main iterative alignment loop
6. Return entity alignment predictions
Input Format:
input_data should be a two-element list of Graph objects, as returned by FLORAEncoder.
Standard Usage:
>>> from ontoaligner.ontology import FLORAOMDataset
>>> from ontoaligner.encoder import FLORAEncoder
>>> from ontoaligner.aligner.flora import FLORAAligner
>>>
>>> # Parse KGs
>>> dataset = FLORAOMDataset().collect("kg1.ttl", "kg2.ttl")
>>>
>>> # Encode for aligner
>>> encoder_output = FLORAEncoder()(
...     source=dataset["source"],
...     target=dataset["target"]
... )
>>>
>>> # Align
>>> aligner = FLORAAligner()
>>> matchings = aligner.generate(input_data=encoder_output)
- Parameters:
input_data – List of two Graph objects [kg1, kg2].
- Returns:
List of entity alignment predictions. Each prediction is a dictionary with the following keys:
source (str): IRI of the source KG entity
target (str): IRI of the target KG entity
score (float): Alignment confidence in [0.0, 1.0]
type (str): 'instance' or 'predicate'
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If input_data does not contain exactly 2 elements.
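Downstream code often needs to separate the returned predictions by type and discard low-confidence ones. The sketch below assumes only the four dictionary keys described for the return value; the helper split_by_type and the sample IRIs are hypothetical.

```python
from typing import Any, Dict, List

Prediction = Dict[str, Any]


def split_by_type(matchings: List[Prediction], min_score: float = 0.5) -> Dict[str, List[Prediction]]:
    """Group predictions by their 'type' field, keeping those with score >= min_score."""
    grouped: Dict[str, List[Prediction]] = {"instance": [], "predicate": []}
    for match in matchings:
        if match["score"] >= min_score:
            grouped.setdefault(match["type"], []).append(match)
    return grouped


# Hand-written sample in the documented output shape (not real aligner output).
sample = [
    {"source": "http://example.org/Alice", "target": "http://example.org/A", "score": 0.95, "type": "instance"},
    {"source": "http://example.org/Bob", "target": "http://example.org/B", "score": 0.30, "type": "instance"},
    {"source": "http://example.org/knows", "target": "http://example.org/friendOf", "score": 0.80, "type": "predicate"},
]
grouped = split_by_type(sample, min_score=0.5)
print(len(grouped["instance"]), len(grouped["predicate"]))
```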
- get_predicate2super_predicate() Dict[Any, Dict[Any, float]]¶
Get the computed predicate subsumption scores.
- Returns:
Dictionary mapping predicates to their subsumption relationships and scores.
- get_same_as_scores() Dict[Any, Dict[Any, float]]¶
Get the computed entity alignment scores.
- Returns:
Dictionary mapping source entities to target entities and their scores.
- seed_alignments(training_data_path: str | None) Dict[str, Dict[str, float]]¶
Load optional seed alignments from a file.
Expected file format: tab-separated with entity1, entity2, and optional score. Example:
<http://example.org/Alice>	<http://example.org/A>	0.95
<http://example.org/Bob>	<http://example.org/B>
- Parameters:
training_data_path – Path to the seed alignment file.
- Returns:
Dictionary mapping entity pairs to alignment scores.
- Raises:
FileNotFoundError – If training_data_path is specified but doesn’t exist.
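The file format described above can be parsed with a few lines of Python. This is a sketch rather than the method's actual implementation: the helper parse_seed_alignments is hypothetical, it reads from an iterable of lines instead of a path, and using 1.0 when the score column is missing is an assumption.

```python
from typing import Dict, Iterable


def parse_seed_alignments(lines: Iterable[str], default_score: float = 1.0) -> Dict[str, Dict[str, float]]:
    """Parse tab-separated 'entity1<TAB>entity2[<TAB>score]' lines into a nested dict."""
    seeds: Dict[str, Dict[str, float]] = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) < 2:
            raise ValueError(f"expected at least two tab-separated columns: {raw_line!r}")
        source, target = parts[0], parts[1]
        # Assumption: a missing score column means full confidence.
        score = float(parts[2]) if len(parts) > 2 and parts[2] else default_score
        seeds.setdefault(source, {})[target] = score
    return seeds


seed_lines = [
    "<http://example.org/Alice>\t<http://example.org/A>\t0.95\n",
    "<http://example.org/Bob>\t<http://example.org/B>\n",
]
print(parse_seed_alignments(seed_lines))
```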
- class ontoaligner.aligner.flora.flora.FLORARDFWriter(prefixes: Dict[str, str])[source]¶
Bases:
object
Writes knowledge graph alignments to RDF/Turtle format.
Converts alignment results (entity and predicate mappings with scores) to RDF triples with namespace declarations.
Initialize RDF writer with namespace prefixes.
- Parameters:
prefixes – Dictionary mapping prefix names to namespace URIs. Example: {'ex': 'http://example.org/', 'owl': 'http://www.w3.org/2002/07/owl#'}
- __init__(prefixes: Dict[str, str]) None¶
Initialize RDF writer with namespace prefixes.
- Parameters:
prefixes – Dictionary mapping prefix names to namespace URIs. Example: {'ex': 'http://example.org/', 'owl': 'http://www.w3.org/2002/07/owl#'}
- write(output_path: str, kb1: Any, kb2: Any, predicate2super_predicate: Dict[Any, Dict[Any, float]], same_as_scores: Dict[Any, Dict[Any, float]]) None¶
Write alignment results to RDF file.
Writes:
- Namespace prefixes
- Predicate subsumption (rdfs:subPropertyOf) relationships
- Entity equivalence (owl:sameAs) mappings with confidence scores
- Parameters:
output_path – File path for the output RDF/Turtle file.
kb1 – First knowledge base (used to filter predicates).
kb2 – Second knowledge base (used to filter predicates).
predicate2super_predicate – Predicate alignment scores.
same_as_scores – Entity alignment scores.
Core alignment algorithms for the FLORA (Fuzzy Logic KG Alignment) system.
This module implements the fuzzy-logic inference rules and iterative procedures that form the heart of the FLORA algorithm as described in:
Peng, Yiwen, Bonald, Thomas, and Suchanek, Fabian. “FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.” ISWC 2025. https://suchanek.name/work/publications/iswc-2025.pdf
- ontoaligner.aligner.flora.fuzzy.bilateral_max_assign(same_as_score: Dict[Any, Dict[Any, float]]) Dict[Any, Dict[Any, float]][source]¶
Compute bilateral max assignment from similarity scores.
Computes the bilateral max assignment, as described in equation (3) of the FLORA paper.
- Parameters:
same_as_score – Nested dictionary of entity alignment scores.
- Returns:
The bilateral max assignment of entities.
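One plausible reading of a bilateral max assignment is a mutual-best-match filter: a pair is kept only when the target is the best-scoring candidate for the source and the source is the best-scoring candidate for the target. The sketch below implements that reading; it is an assumption and may not match equation (3) of the paper exactly, in particular regarding ties.

```python
from typing import Dict, Tuple

Scores = Dict[str, Dict[str, float]]


def bilateral_max_assign_sketch(same_as_score: Scores) -> Scores:
    """Keep only pairs that are each other's highest-scoring match (illustrative)."""
    # Best target for every source entity.
    best_for_source = {
        src: max(targets, key=targets.get)
        for src, targets in same_as_score.items()
        if targets
    }
    # Best source for every target entity (first seen wins on ties).
    best_for_target: Dict[str, Tuple[str, float]] = {}
    for src, targets in same_as_score.items():
        for tgt, score in targets.items():
            if tgt not in best_for_target or score > best_for_target[tgt][1]:
                best_for_target[tgt] = (src, score)
    # A pair survives only if the preference is mutual.
    return {
        src: {tgt: same_as_score[src][tgt]}
        for src, tgt in best_for_source.items()
        if best_for_target[tgt][0] == src
    }


scores = {
    "a": {"x": 0.9, "y": 0.4},
    "b": {"x": 0.8, "y": 0.7},
    "c": {"y": 0.6},
}
print(bilateral_max_assign_sketch(scores))
```

In the toy input, b prefers x but x prefers a, and c prefers y but y prefers b, so only the pair (a, x) is mutual.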
- ontoaligner.aligner.flora.fuzzy.bootstrap_algo(kb_src: Any, kb_dst: Any, same_as_score: Dict[Any, Dict[Any, float]], pred2super_pred: Dict[Any, Dict[Any, float]], functionalities: Dict[Any, float], num_workers: int) Dict[Any, Dict[Any, float]][source]¶
Bootstrap the algorithm using initial literal alignments.
This function runs the first iteration in parallel to initialize entity alignment scores based on the initial literal similarity alignments.
- Parameters:
kb_src – The source knowledge base.
kb_dst – The target knowledge base.
same_as_score – Nested dictionary of entity alignment scores (includes initial literal alignments).
pred2super_pred – Nested dictionary of pairwise subsumption scores.
functionalities – Dictionary mapping predicates to their functionality scores.
num_workers – Number of parallel worker processes.
- Returns:
Updated same_as_score dictionary with bootstrapped entity alignments.
- ontoaligner.aligner.flora.fuzzy.compute_functionalities(kb: Any, gram: List[int] | None = None) Dict[Any, float][source]¶
Compute functionality scores for predicates in a knowledge base.
Functionality is measured as the ratio of unique subjects per predicate, considering n-gram combinations for higher-order relationships.
- Parameters:
kb – The input knowledge base graph object.
gram – List of integers indicating n-gram sizes to consider. Defaults to None, in which case no n-gram sizes are considered (equivalent to []).
- Returns:
Dictionary mapping predicates to their functionality scores.
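As a simplified illustration of the ratio described above, the sketch below computes, for each predicate, the number of distinct subjects divided by the number of distinct (subject, object) pairs. It assumes the knowledge base is a plain iterable of triples and ignores the n-gram option entirely. `functionality_sketch` is a hypothetical name, not the library function.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def functionality_sketch(triples: Iterable[Triple]) -> Dict[str, float]:
    """Distinct subjects / distinct (subject, object) pairs, per predicate (assumed definition)."""
    subjects = defaultdict(set)
    pairs = defaultdict(set)
    for s, p, o in triples:
        subjects[p].add(s)
        pairs[p].add((s, o))
    return {p: len(subjects[p]) / len(pairs[p]) for p in pairs}


toy_kb = [
    ("alice", "bornIn", "paris"),
    ("bob", "bornIn", "rome"),
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
]
fun = functionality_sketch(toy_kb)
```

Here `bornIn` behaves like a function (one object per subject), while `knows` does not, so its score is lower.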
- ontoaligner.aligner.flora.fuzzy.compute_quasi_eqrel(kb_src: Any, kb_dst: Any, pred2super_pred: Dict[Any, Dict[Any, float]]) Dict[Any, Dict[Any, float]][source]¶
Compute quasi equivalence relations between two KGs’ predicates.
The quasi equivalence is represented as r≅r’ in the FLORA paper.
- Parameters:
kb_src – The source knowledge base.
kb_dst – The target knowledge base.
pred2super_pred – Nested dictionary of pairwise subsumption scores.
- Returns:
Nested dictionary of quasi equivalence relations.
- ontoaligner.aligner.flora.fuzzy.first_iteration(kb_src: Any, kb_dst: Any, pred2super_pred: Dict[Any, Dict[Any, float]], functionalities: Dict[Any, float], queue: Queue, ent_match_tuple_queue: Queue, ent_max_assign: Dict[Any, Dict[Any, float]]) None[source]¶
First iteration used for bootstrapping the algorithm using initial literal alignments.
This is the main worker function for the bootstrapping phase, run in parallel across multiple processes. It processes entities from the queue and computes initial entity alignment scores based on predicate subsumption and functionality.
Results are put into ent_match_tuple_queue for collection by the parent process.
- Parameters:
kb_src – The source knowledge base.
kb_dst – The target knowledge base.
pred2super_pred – Nested dictionary of pairwise subsumption scores.
functionalities – Dictionary mapping predicates to their functionality scores.
queue – Multiprocessing queue containing entities to be aligned.
ent_match_tuple_queue – Queue to store resulting entity alignment scores.
ent_max_assign – Bilateral max assignment from initial literal alignments.
- ontoaligner.aligner.flora.fuzzy.initialize_predicate_subsumption(predicates1: Set[Any], predicates2: Set[Any], pred2super_pred12: Dict[Any, Dict[Any, float]] | None = None, pred2super_pred21: Dict[Any, Dict[Any, float]] | None = None, relinit: float = 0.1) Dict[Any, Dict[Any, float]][source]¶
Initialize predicate subsumption scores between two knowledge bases.
Sets identical relations to 1.0, and initializes others with provided scores or a default initial value.
- Parameters:
predicates1 – Set of predicates in KB1.
predicates2 – Set of predicates in KB2.
pred2super_pred12 – Optional subsumption scores from KB1 predicates to KB2.
pred2super_pred21 – Optional subsumption scores from KB2 predicates to KB1.
relinit – Initial score for non-identical relations. Defaults to 0.1.
- Returns:
Nested dictionary of pairwise subsumption scores across KGs in both directions.
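The initialization rule can be sketched as follows. For brevity, this illustration takes a single optional `provided` dictionary instead of the two directional arguments, and the name `init_subsumption_sketch` is hypothetical.

```python
from typing import Dict, Optional, Set


def init_subsumption_sketch(
    predicates1: Set[str],
    predicates2: Set[str],
    provided: Optional[Dict[str, Dict[str, float]]] = None,
    relinit: float = 0.1,
) -> Dict[str, Dict[str, float]]:
    """Identical predicates get 1.0; others get a provided score or `relinit`."""
    provided = provided or {}
    scores: Dict[str, Dict[str, float]] = {}
    # Fill both directions: KB1 -> KB2 and KB2 -> KB1.
    for left, right in ((predicates1, predicates2), (predicates2, predicates1)):
        for p in left:
            for q in right:
                if p == q:
                    value = 1.0
                else:
                    value = provided.get(p, {}).get(q, relinit)
                scores.setdefault(p, {})[q] = value
    return scores


init = init_subsumption_sketch(
    {"name", "born"},
    {"name", "birthPlace"},
    provided={"born": {"birthPlace": 0.7}},
)
```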
- ontoaligner.aligner.flora.fuzzy.map_subrelations(alpha: float, kb_src: Any, kb_dst: Any, ent_max_assign: Dict[Any, Dict[Any, float]], previous_predicate2super_predicate: Dict[Any, Dict[Any, float]]) Dict[Any, Dict[Any, float]][source]¶
Map subrelations in both directions using current entity alignments.
Updates predicate subsumption scores based on aligned entity pairs, computing which predicates in one KB correspond to predicates in the other.
- Parameters:
alpha – Benefit-of-doubt parameter for subrelation mapping.
kb_src – The source knowledge base.
kb_dst – The target knowledge base.
ent_max_assign – Bilateral max assignment from current entity alignments.
previous_predicate2super_predicate – Previous subsumption scores to be updated.
- Returns:
Updated predicate subsumption scores dictionary.
- ontoaligner.aligner.flora.fuzzy.update_max_score_min(mapping: Dict[Any, Tuple[Tuple, float]], pred: Any, fact: Tuple, *body: float) Dict[Any, Tuple[Tuple, float]][source]¶
Update mapping with the maximum aligned scoring fact for each predicate.
Used in subrelation rules to track the best matching facts. Returns the updated mapping dictionary.
- Parameters:
mapping – Dictionary to be updated with (fact, score) tuples per predicate.
pred – The predicate from KB2.
fact – The fact tuple (subject, predicate, object).
*body – Values in the body of the rule.
- Returns:
Updated mapping dictionary with maximum scoring facts.
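A minimal sketch of this kind of update, assuming the score of a fact is the minimum of the body values and that a fact replaces the stored one only when its score is strictly higher. `keep_best_fact` is a hypothetical name.

```python
from typing import Any, Dict, Tuple


def keep_best_fact(
    mapping: Dict[Any, Tuple[Tuple, float]], pred: Any, fact: Tuple, *body: float
) -> Dict[Any, Tuple[Tuple, float]]:
    """Store, per predicate, the fact whose min(body) score is the largest seen so far."""
    score = min(body)
    if pred not in mapping or score > mapping[pred][1]:
        mapping[pred] = (fact, score)
    return mapping


best: Dict[Any, Tuple[Tuple, float]] = {}
keep_best_fact(best, "birthPlace", ("a", "birthPlace", "b"), 0.6, 0.9)  # score 0.6
keep_best_fact(best, "birthPlace", ("c", "birthPlace", "d"), 0.8, 0.7)  # score 0.7, replaces
keep_best_fact(best, "birthPlace", ("e", "birthPlace", "f"), 0.2, 1.0)  # score 0.2, ignored
```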
- ontoaligner.aligner.flora.fuzzy.update_predicate_subsumption(pred2super_pred12: Dict[Any, Dict[Any, float]], pred2super_pred21: Dict[Any, Dict[Any, float]], previous_predicate2super_predicate: Dict[Any, Dict[Any, float]] | None) Dict[Any, Dict[Any, float]][source]¶
Update predicate subsumption scores bidirectionally.
Updates subsumption scores from KB1→KB2 and KB2→KB1, maintaining monotonicity of relation subsumption. Returns a new dictionary rather than modifying in-place.
- Parameters:
pred2super_pred12 – Current subsumption scores from KB1 to KB2 predicates.
pred2super_pred21 – Current subsumption scores from KB2 to KB1 predicates.
previous_predicate2super_predicate – Previous subsumption scores to be updated. If None, an empty dictionary is created.
- Returns:
Updated predicate subsumption scores dictionary.
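If monotonicity is taken to mean that a subsumption score never decreases between iterations (an assumption here), the update can be sketched as an element-wise maximum over a copy of the previous scores. `merge_monotone` is a hypothetical name.

```python
from typing import Dict, Optional

Scores = Dict[str, Dict[str, float]]


def merge_monotone(current12: Scores, current21: Scores, previous: Optional[Scores] = None) -> Scores:
    """Merge both directions into a new dictionary, never lowering an existing score."""
    # Copy the previous scores so the caller's dictionary is left untouched.
    updated: Scores = {p: dict(row) for p, row in (previous or {}).items()}
    for current in (current12, current21):
        for p, supers in current.items():
            row = updated.setdefault(p, {})
            for q, score in supers.items():
                row[q] = max(row.get(q, 0.0), score)
    return updated


previous = {"born": {"birthPlace": 0.5}}
merged = merge_monotone(
    {"born": {"birthPlace": 0.3, "origin": 0.2}},
    {"birthPlace": {"born": 0.6}},
    previous,
)
```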
- ontoaligner.aligner.flora.fuzzy.update_score_additive_min(mapping: Dict[Any, Dict[Any, float]], key1: Any, key2: Any, factor: float, *body) Dict[Any, Dict[Any, float]][source]¶
Update score using additive minimum operator.
Updates mapping[key1][key2] by adding the rule value. Used for subrelation rules, as shown in equation (2) in the FLORA paper. Returns the updated mapping.
- Parameters:
mapping – Nested dictionary to be updated with subrelation scores.
key1 – The predicate from KB1.
key2 – The predicate from KB2.
factor – Normalization factor (already multiplied by benefit-of-doubt parameter).
*body – Values in the body of the rule.
- Returns:
Updated mapping dictionary with new subrelation scores.
- ontoaligner.aligner.flora.fuzzy.update_score_min(mapping: Dict[Any, Dict[Any, float]], key1: Any, key2: Any, *body) Dict[Any, Dict[Any, float]][source]¶
Update score using minimum operator (Gödel logic).
Updates mapping[key1][key2] so that the rule body=>mapping[key1][key2] holds using the minimum operator, as shown in equation (1) in the FLORA paper. Returns the updated mapping dictionary.
- Parameters:
mapping – Nested dictionary to be updated with entity alignment scores.
key1 – The entity from KB1.
key2 – The entity from KB2.
*body – Values in the body of the rule.
- Returns:
Updated mapping dictionary with new alignment scores.
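Under Gödel semantics, a rule body ⇒ head is fully satisfied once the head is at least the minimum of the body values. A minimal sketch of an update that enforces this, assuming missing entries start at 0.0 and using the hypothetical name `apply_rule_min`:

```python
from typing import Any, Dict


def apply_rule_min(
    mapping: Dict[Any, Dict[Any, float]], key1: Any, key2: Any, *body: float
) -> Dict[Any, Dict[Any, float]]:
    """Raise mapping[key1][key2] to min(body) if it is currently lower."""
    row = mapping.setdefault(key1, {})
    row[key2] = max(row.get(key2, 0.0), min(body))
    return mapping


alignment: Dict[Any, Dict[Any, float]] = {}
apply_rule_min(alignment, "e1", "f1", 0.8, 0.6)  # head becomes 0.6
apply_rule_min(alignment, "e1", "f1", 0.3, 0.9)  # min is 0.3, head stays 0.6
apply_rule_min(alignment, "e1", "f1", 0.9, 0.7)  # min is 0.7, head rises to 0.7
```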
OLaLa Aligner¶
- class ontoaligner.aligner.olala.olala.OLaLaAligner(retriever: Any, llm_aligner: Any, hp_aligner: Any, **kwargs)[source]¶
Bases: BaseOMModel
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.
- __init__(retriever: Any, llm_aligner: Any, hp_aligner: Any, **kwargs) None¶
Initializes the ontology matching model with optional keyword arguments.
- Parameters:
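retriever (Any) – The retrieval model used for candidate generation.
llm_aligner (Any) – The LLM aligner used to verify the retrieved candidates.
hp_aligner (Any) – The high-precision matcher.
**kwargs – Additional keyword arguments that may be used for model configuration or parameters.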
- generate(input_data: List) List¶
Generates ontology alignment results by chaining retrieval, LLM verification, and high-precision matching.
- Parameters:
input_data (List) – A list containing the encoded source and target ontologies.
- Returns:
The combined alignments from LLM and high-precision matching, each annotated with an alignment_type field.
- Return type:
List
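The chaining described for generate can be pictured with plain callables. Everything in this sketch is hypothetical: the three stage functions, the dictionary layout of an alignment, and the `alignment_type` values "llm" and "high_precision" are stand-ins chosen for illustration, not the values the library emits.

```python
from typing import Any, Callable, Dict, List


def chain_sketch(
    input_data: List,
    retrieve: Callable[[List], List],
    verify: Callable[[List, List], List[Dict[str, Any]]],
    match_exact: Callable[[List], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Retrieve candidates, verify them with an LLM stage, then append exact matches."""
    candidates = retrieve(input_data)
    llm_alignments = [dict(a, alignment_type="llm") for a in verify(input_data, candidates)]
    hp_alignments = [dict(a, alignment_type="high_precision") for a in match_exact(input_data)]
    return llm_alignments + hp_alignments


# Stub stages standing in for the retriever, LLM aligner, and high-precision matcher.
combined = chain_sketch(
    [[], []],
    retrieve=lambda data: [("s:1", "t:1"), ("s:2", "t:2")],
    verify=lambda data, cands: [{"source": s, "target": t} for s, t in cands[:1]],
    match_exact=lambda data: [{"source": "s:3", "target": "t:3"}],
)
```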
- load(llm_path: str, retriever_path: str) None¶
- class ontoaligner.aligner.olala.dataset.OLaLaLLMDataset(source_onto: Any, target_onto: Any, candidates: Any, system_prompt_template: str = '{user_prompt}')[source]¶
Bases: Dataset
A dataset for OLaLa LLM candidate verification.
Initializes the OLaLa LLM dataset from candidate correspondences.
- Parameters:
source_onto (Any) – The encoded source ontology.
target_onto (Any) – The encoded target ontology.
candidates (Any) – The SBERT candidate predictions.
system_prompt_template (str) – Optional prompt wrapper.
- collate_fn(batches)¶
Collates OLaLa LLM examples.
- Parameters:
batches – The batch examples.
- Returns:
The collated batch.
- Return type:
Dict
- fill_one_sample(input_data: Any) str¶
Builds one OLaLa LLM prompt.
- Parameters:
input_data (Any) – One candidate pair.
- Returns:
The filled prompt.
- Return type:
str
- preprocess(text: str) str¶
Preprocesses text for OLaLa LLM prompting.
- Parameters:
text (str) – The text to preprocess.
- Returns:
The preprocessed text.
- Return type:
str
- prompt = 'Classify if two descriptions refer to the same real world entity (ontology matching).\n### Concept one: endocrine pancreas secretion ### Concept two: Pancreatic Endocrine Secretion ### Answer: yes\n### Concept one: urinary bladder urothelium ### Concept two: Transitional Epithelium ### Answer: no\n### Concept one: trigeminal V nerve ophthalmic division ### Concept two: Ophthalmic Nerve ### Answer: yes\n### Concept one: foot digit 1 phalanx ### Concept two: Foot Digit 2 Phalanx ### Answer: no\n### Concept one: large intestine ### Concept two: Colon ### Answer: no\n### Concept one: ocular refractive media ### Concept two: Refractile Media ### Answer: yes\n### Concept one: {left} ### Concept two: {right} ### Answer: '¶
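The `{left}` and `{right}` placeholders in the prompt, and the `{user_prompt}` placeholder in the default system_prompt_template, suggest ordinary `str.format` substitution. The sketch below assumes that mechanism and uses a shortened copy of the few-shot template. `fill_prompt` is a hypothetical helper, not the fill_one_sample method.

```python
FEW_SHOT_TEMPLATE = (
    "Classify if two descriptions refer to the same real world entity (ontology matching).\n"
    "### Concept one: large intestine ### Concept two: Colon ### Answer: no\n"
    "### Concept one: {left} ### Concept two: {right} ### Answer: "
)


def fill_prompt(left: str, right: str, system_prompt_template: str = "{user_prompt}") -> str:
    """Fill the candidate pair into the template, then wrap it in the system template."""
    user_prompt = FEW_SHOT_TEMPLATE.format(left=left, right=right)
    return system_prompt_template.format(user_prompt=user_prompt)


plain = fill_prompt("heart", "Heart")
wrapped = fill_prompt("heart", "Heart", "<s>{user_prompt}</s>")
```

The prompt ends with an open "### Answer: " slot, so the model's continuation (yes or no) serves as the classification.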
This script defines OLaLa lightweight matchers for ontology matching.
The high-precision matcher creates exact correspondences from normalized labels and URI fragments.
- class ontoaligner.aligner.olala.highprecision_matcher.OLaLaHighPrecisionMatcher(confidence: float = 1.0, **kwargs)[source]¶
Bases: Lightweight
A high-precision exact matcher for OLaLa.
Initializes the OLaLa high-precision matcher.
- Parameters:
confidence (float) – The confidence assigned to exact matches.
**kwargs – Additional keyword arguments.
- build_text_index(ontology: List[Dict[str, Any]]) Dict[str, Set[str]]¶
Builds a text-to-IRI index for source ontology entities.
- Parameters:
ontology (List[Dict[str, Any]]) – The encoded source ontology.
- Returns:
The text-to-source-IRI index.
- Return type:
Dict[str, Set[str]]
- filter_n_to_m(pairs: Set[Tuple[str, str]]) Set[Tuple[str, str]]¶
Removes N:M correspondences.
- Parameters:
pairs (Set[Tuple[str, str]]) – The candidate pairs.
- Returns:
The remaining unambiguous pairs.
- Return type:
Set[Tuple[str, str]]
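One plausible reading of removing N:M correspondences is to drop every pair whose source or target also appears in another pair. The sketch below assumes that reading. `drop_ambiguous` is a hypothetical name.

```python
from collections import Counter
from typing import Set, Tuple


def drop_ambiguous(pairs: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Keep only pairs whose source and target each occur exactly once."""
    source_counts = Counter(s for s, _ in pairs)
    target_counts = Counter(t for _, t in pairs)
    return {(s, t) for s, t in pairs if source_counts[s] == 1 and target_counts[t] == 1}


candidates = {("a", "x"), ("b", "y"), ("b", "z"), ("c", "z")}
unambiguous = drop_ambiguous(candidates)
```

Only ("a", "x") survives: "b" maps to two targets and "z" is claimed by two sources.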
- generate(input_data: List) List¶
Generates high-precision exact correspondences.
- Parameters:
input_data (List) – The encoded source and target ontologies.
- Returns:
The high-precision correspondences.
- Return type:
List
- get_string_representations(item: Dict[str, Any]) Set[str]¶
Retrieves high-precision string representations for one entity.
- Parameters:
item (Dict[str, Any]) – The encoded ontology item.
- Returns:
The normalized high-precision texts.
- Return type:
Set[str]
- match_exact_texts(source_ontology: List[Dict[str, Any]], target_ontology: List[Dict[str, Any]]) Set[Tuple[str, str]]¶
Creates exact correspondences from normalized texts.
- Parameters:
source_ontology (List[Dict[str, Any]]) – The encoded source ontology.
target_ontology (List[Dict[str, Any]]) – The encoded target ontology.
- Returns:
The exact candidate pairs.
- Return type:
Set[Tuple[str, str]]
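Exact matching over normalized texts can be sketched with a text-to-IRI index built from the source side and probed with the target side. The item fields `iri` and `label`, the normalization rule, and the function names are assumptions for illustration only.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def build_index(ontology: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each normalized label to the set of IRIs that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for item in ontology:
        index[normalize(item["label"])].add(item["iri"])
    return index


def match_exact(source: List[Dict[str, Any]], target: List[Dict[str, Any]]) -> Set[Tuple[str, str]]:
    """Pair every target item with all source IRIs sharing its normalized label."""
    index = build_index(source)
    return {
        (source_iri, item["iri"])
        for item in target
        for source_iri in index.get(normalize(item["label"]), ())
    }


src = [{"iri": "s:1", "label": "Ophthalmic Nerve"}]
dst = [{"iri": "t:9", "label": "ophthalmic_nerve"}, {"iri": "t:2", "label": "Colon"}]
exact_pairs = match_exact(src, dst)
```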
- class ontoaligner.aligner.olala.retrieval.OLaLaSBERTRetrieval(device: str = 'cpu', top_k: int = 5, both_directions: bool = True, topk_per_resource: bool = True, **kwargs)[source]¶
Bases: BiEncoderRetrieval
A SentenceTransformers retrieval model for OLaLa candidate generation.
Initializes the OLaLa SBERT retrieval model.
- Parameters:
device (str) – The device used by SentenceTransformers.
top_k (int) – The number of candidates to retrieve per resource.
both_directions (bool) – Whether to search in both ontology directions.
topk_per_resource (bool) – Whether top-k filtering is applied per resource.
**kwargs – Additional keyword arguments.
- add_prediction(predictions: Dict[Tuple[str, str], float], source_iri: str, target_iri: str, score: float) None¶
Adds or updates a predicted correspondence.
- Parameters:
predictions (Dict[Tuple[str, str], float]) – The prediction dictionary.
source_iri (str) – The source entity IRI.
target_iri (str) – The target entity IRI.
score (float) – The confidence score.
- Returns:
None
- filter_topk_per_resource(predictions: Dict[Tuple[str, str], float]) Dict[Tuple[str, str], float]¶
Filters correspondences by top-k source and target resources.
- Parameters:
predictions (Dict[Tuple[str, str], float]) – Candidate pairs and scores.
- Returns:
The filtered candidate pairs and scores.
- Return type:
Dict[Tuple[str, str], float]
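The exact filtering rule is not spelled out here. The sketch below assumes a pair is kept when it ranks among the top-k scores for its source or among the top-k for its target (a union of both views). `topk_per_resource` and its `k` argument are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Tuple

Pair = Tuple[str, str]


def topk_per_resource(predictions: Dict[Pair, float], k: int) -> Dict[Pair, float]:
    """Keep pairs that are in the top-k for their source or for their target."""
    by_source = defaultdict(list)
    by_target = defaultdict(list)
    for (s, t), score in predictions.items():
        by_source[s].append((score, t))
        by_target[t].append((score, s))
    keep = set()
    for s, items in by_source.items():
        keep.update((s, t) for _, t in heapq.nlargest(k, items))
    for t, items in by_target.items():
        keep.update((s, t) for _, s in heapq.nlargest(k, items))
    return {pair: predictions[pair] for pair in keep}


scored = {("a", "x"): 0.9, ("a", "y"): 0.5, ("b", "x"): 0.7, ("b", "y"): 0.4}
filtered = topk_per_resource(scored, k=1)
```

With k=1, the pair ("b", "y") is dropped because it is neither the best match for "b" nor the best match for "y".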
- generate(input_data: List) List¶
Generates OLaLa SBERT candidate correspondences.
- Parameters:
input_data (List) – The encoded source and target ontologies.
- Returns:
The generated candidate correspondences.
- Return type:
List
- get_text_examples(ontology: List[Dict[str, Any]]) List[Dict[str, str]]¶
Creates multiple text examples per ontology resource.
- Parameters:
ontology (List[Dict[str, Any]]) – The encoded ontology items.
- Returns:
The text examples.
- Return type:
List[Dict[str, str]]
- load(path: str = 'multi-qa-mpnet-base-dot-v1')¶
Loads the SentenceTransformers model.
- Parameters:
path (str) – The model path or HuggingFace model name.
- Returns:
None
- merge_predictions(predictions: Dict[Tuple[str, str], float]) List[Dict[str, Any]]¶
Converts pair-level predictions to OntoAligner retrieval output.
- Parameters:
predictions (Dict[Tuple[str, str], float]) – Candidate pairs and scores.
- Returns:
The grouped retrieval predictions.
- Return type:
List[Dict[str, Any]]
- search_direction(query_examples: List[Dict[str, str]], corpus_examples: List[Dict[str, str]], reverse: bool = False) Dict[Tuple[str, str], float]¶
Searches one ontology direction and returns candidate correspondences.
- Parameters:
query_examples (List[Dict[str, str]]) – The query text examples.
corpus_examples (List[Dict[str, str]]) – The corpus text examples.
reverse (bool) – Whether the search direction is target-source.
- Returns:
Candidate pairs and confidence scores.
- Return type:
Dict[Tuple[str, str], float]
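The role of the reverse flag can be illustrated without an embedding model. The sketch below swaps in token-overlap (Jaccard) similarity in place of SentenceTransformers scores, and assumes that reverse=True flips each hit so keys are always (source, target). The example fields `iri` and `text` and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, used here only as a stand-in for embedding scores."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def search_direction_sketch(
    queries: List[Dict[str, str]],
    corpus: List[Dict[str, str]],
    top_k: int = 1,
    reverse: bool = False,
) -> Dict[Tuple[str, str], float]:
    """Return the top-k hits per query, keyed as (source IRI, target IRI)."""
    predictions: Dict[Tuple[str, str], float] = {}
    for query in queries:
        ranked = sorted(
            ((jaccard(query["text"], doc["text"]), doc["iri"]) for doc in corpus),
            reverse=True,
        )
        for score, hit_iri in ranked[:top_k]:
            # In the target-to-source direction the query is the target, so flip the key.
            pair = (hit_iri, query["iri"]) if reverse else (query["iri"], hit_iri)
            predictions[pair] = max(score, predictions.get(pair, 0.0))
    return predictions


source = [{"iri": "s:1", "text": "ophthalmic nerve"}]
target = [{"iri": "t:1", "text": "Ophthalmic Nerve"}, {"iri": "t:2", "text": "colon"}]
forward = search_direction_sketch(source, target)
backward = search_direction_sketch(target, source, reverse=True)
```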