Encoders

Base Encoders

Lightweight Encoders

This script defines three encoder classes that inherit from the LightweightEncoder class. These encoders are used to process and transform OWL (Web Ontology Language) items into a format suitable for downstream tasks. Each encoder is specialized for different types of OWL items: Concept, Concept with Children, and Concept with Parent.

Classes:
  • ConceptLightweightEncoder: Encodes OWL items representing concepts.

  • ConceptChildrenLightweightEncoder: Encodes OWL items representing concepts and their children.

  • ConceptParentLightweightEncoder: Encodes OWL items representing concepts and their parents.

class ontoaligner.encoder.lightweight.ConceptChildrenLightweightEncoder

Bases: LightweightEncoder

Encodes OWL items that represent concepts and their children.

This class inherits from the LightweightEncoder class and is designed to encode OWL items that consist of concepts and their children. The get_owl_items method retrieves the IRI, label of the concept, and the labels of its children.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept with Children.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept, along with the labels of its children, from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’, ‘label’, and ‘childrens’ keys where ‘childrens’ is a list of children with ‘label’ attributes.

Returns:

A dictionary containing the IRI, label of the concept, and the concatenated labels of its children.

Return type:

Dict
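
The extraction described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's implementation: the helper name get_concept_children_items, the output key names, and the ", " separator are all hypothetical, and each child is assumed to be a dictionary with a 'label' key.

```python
from typing import Any, Dict


def get_concept_children_items(owl: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical stand-in for get_owl_items on a (Concept, Children) item."""
    # Assumption: every child is a dict exposing a 'label' key.
    child_labels = [child["label"] for child in owl.get("childrens", [])]
    return {
        "iri": owl["iri"],
        "label": owl["label"],
        # Assumption: labels are joined with ", "; the real separator may differ.
        "childrens": ", ".join(child_labels),
    }


item = {
    "iri": "http://example.org/onto#Vehicle",
    "label": "Vehicle",
    "childrens": [{"label": "Car"}, {"label": "Bicycle"}],
}
encoded = get_concept_children_items(item)
```

The same shape applies to the parent variant, with the 'parents' key taking the place of 'childrens'.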

items_in_owl: str = '(Concept, Children)'

class ontoaligner.encoder.lightweight.ConceptLightweightEncoder

Bases: LightweightEncoder

Encodes OWL items that represent concepts.

This class inherits from the LightweightEncoder class and is designed to encode OWL items that consist of concepts. The get_owl_items method retrieves the IRI and label of the concept.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’ and ‘label’ keys.

Returns:

A dictionary containing the IRI and label of the concept.

Return type:

Dict

items_in_owl: str = '(Concept)'

class ontoaligner.encoder.lightweight.ConceptParentLightweightEncoder

Bases: LightweightEncoder

Encodes OWL items that represent concepts and their parents.

This class inherits from the LightweightEncoder class and is designed to encode OWL items that consist of concepts and their parents. The get_owl_items method retrieves the IRI, label of the concept, and the labels of its parents.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept with Parent.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept, along with the labels of its parents, from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’, ‘label’, and ‘parents’ keys where ‘parents’ is a list of parents with ‘label’ attributes.

Returns:

A dictionary containing the IRI, label of the concept, and the concatenated labels of its parents.

Return type:

Dict

items_in_owl: str = '(Concept, Parent)'

class ontoaligner.encoder.lightweight.DocConceptLightweightEncoder

Bases: LightweightEncoder

Encodes OWL items as a document that combines the IRI, label, synonyms, and comments.

This class inherits from the LightweightEncoder class and is designed to encode OWL items that consist of concepts. The get_owl_items method retrieves the IRI and label of the concept.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’ and ‘label’ keys.

Returns:

A dictionary containing the IRI and label of the concept.

Return type:

Dict

items_in_owl: str = '(Classes Only)'

preprocess(text: str) → str

Preprocesses input text by replacing underscores with spaces and converting the text to lowercase.

This method is used to standardize the format of input text before processing it further for encoding.

Parameters:

text (str) – The input text that needs preprocessing.

Returns:

The preprocessed text with underscores replaced by spaces and all characters in lowercase.

Return type:

str
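
The behavior described above amounts to two string operations. The standalone function below is a sketch of that description, not a copy of the library's method:

```python
def preprocess(text: str) -> str:
    """Replace underscores with spaces, then lowercase the result."""
    return text.replace("_", " ").lower()


cleaned = preprocess("Heart_Valve_Disease")
```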

class ontoaligner.encoder.lightweight.LightweightEncoder

Bases: BaseEncoder

A lightweight encoder for parsing ontology data and preprocessing it.

This class provides methods for parsing ontological data, applying text preprocessing, and formatting the data into a structure suitable for further processing.

get_encoder_info()

Provides information about the encoder.

Returns:

A description of the encoder’s function in the overall pipeline.

Return type:

str

get_owl_items(owl: Dict) → Any

Abstract method for extracting ontology data.

This method should be implemented by subclasses to extract specific ontology data (e.g., IRI and label) from the provided ontology item.

Parameters:

owl (Dict) – A dictionary representing an ontology item.

Returns:

The extracted ontology data.

Return type:

Any

parse(**kwargs) → Any

Parses the source and target ontologies, applying preprocessing.

This method extracts ontology items (IRI and label) from the source and target ontologies, applies text preprocessing to the labels, and returns the encoded data.

Parameters:

**kwargs – Contains the source and target ontologies as keyword arguments.

Returns:

A list containing two elements, the processed source and target ontologies.

Return type:

list
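
The flow described for parse can be sketched with a small stand-in class. This is an assumption-laden illustration rather than the library's code: the class name SketchLightweightEncoder is hypothetical, the keyword names source and target are assumed, and each ontology is assumed to be a list of dictionaries with 'iri' and 'label' keys.

```python
from typing import Any, Dict, List


class SketchLightweightEncoder:
    """Hypothetical stand-in that mirrors the documented parse() flow."""

    def preprocess(self, text: str) -> str:
        # Underscores become spaces and the text is lowercased.
        return text.replace("_", " ").lower()

    def get_owl_items(self, owl: Dict[str, Any]) -> Dict[str, Any]:
        # Subclasses decide which fields to extract; this one keeps IRI and label.
        return {"iri": owl["iri"], "label": owl["label"]}

    def parse(self, **kwargs: Any) -> List[List[Dict[str, Any]]]:
        encoded = []
        # Assumption: the ontologies arrive under the keys 'source' and 'target'.
        for key in ("source", "target"):
            items = []
            for owl in kwargs[key]:
                item = self.get_owl_items(owl)
                item["label"] = self.preprocess(item["label"])
                items.append(item)
            encoded.append(items)
        return encoded


source = [{"iri": "http://example.org/a#Heart_Valve", "label": "Heart_Valve"}]
target = [{"iri": "http://example.org/b#valve", "label": "Cardiac_Valve"}]
encoded_source, encoded_target = SketchLightweightEncoder().parse(source=source, target=target)
```

Only the labels are normalized in this sketch; the IRIs pass through unchanged so that matches can still be reported against the original identifiers.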

class ontoaligner.encoder.lightweight.MILAEncoder

Bases: ConceptLightweightEncoder

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’ and ‘label’ keys.

Returns:

A dictionary containing the IRI and label of the concept.

Return type:

Dict

items_in_owl: str = '(MILA Concept)'

parse(**kwargs) → Any

Parses the source and target ontologies, applying preprocessing.

This method extracts ontology items (IRI and label) from the source and target ontologies, applies text preprocessing to the labels, and returns the encoded data.

Parameters:

**kwargs – Contains the source and target ontologies as keyword arguments.

Returns:

A list containing two elements, the processed source and target ontologies.

Return type:

list

Large Language Models Encoders

class ontoaligner.encoder.llm.ConceptChildrenLLMEncoder

Bases: LLMEncoder

Encodes OWL items that represent concepts and their children.

This class inherits from the LLMEncoder class and is designed to encode OWL items that consist of concepts and their children. The get_owl_items method retrieves the IRI, label of the concept, and the labels of its children.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept with Children.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept, along with the labels of its children, from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’, ‘label’, and ‘childrens’ keys where ‘childrens’ is a list of children with ‘label’ attributes.

Returns:

A dictionary containing the IRI, label of the concept, and the concatenated labels of its children.

Return type:

Dict

items_in_owl: str = '(Concept, Children)'

class ontoaligner.encoder.llm.ConceptLLMEncoder

Bases: LLMEncoder

Encodes OWL items that represent concepts.

This class inherits from the LLMEncoder class and is designed to encode OWL items that consist of concepts. The get_owl_items method retrieves the IRI and label of the concept.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’ and ‘label’ keys.

Returns:

A dictionary containing the IRI and label of the concept.

Return type:

Dict

items_in_owl: str = '(Concept)'

class ontoaligner.encoder.llm.ConceptParentLLMEncoder

Bases: LLMEncoder

Encodes OWL items that represent concepts and their parents.

This class inherits from the LLMEncoder class and is designed to encode OWL items that consist of concepts and their parents. The get_owl_items method retrieves the IRI, label of the concept, and the labels of its parents.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept with Parent.

Type:

str

get_owl_items(owl: Dict) → Any

Extracts the IRI and label of a concept, along with the labels of its parents, from the given OWL item.

Parameters:

owl (Dict) – A dictionary representing an OWL item, expected to contain ‘iri’, ‘label’, and ‘parents’ keys where ‘parents’ is a list of parents with ‘label’ attributes.

Returns:

A dictionary containing the IRI, label of the concept, and the concatenated labels of its parents.

Return type:

Dict

items_in_owl: str = '(Concept, Parent)'

class ontoaligner.encoder.llm.LLMEncoder

Bases: BaseEncoder

A naive encoder for ontology alignment.

get_encoder_info() → str

Provides information about the encoder and its prompt template.

Returns:

A description of the encoder’s components.

Return type:

str

get_owl_items(owl: Dict) → str

Abstract method to extract ontology data as a string.

This method should be implemented by subclasses to extract specific ontology data (e.g., IRI and label) from the provided ontology item.

Parameters:

owl (Dict) – A dictionary representing an ontology item.

Returns:

The extracted ontology data as a string.

Return type:

str

parse(**kwargs) → Any

Processes the source and target ontologies into a prompt for ontology alignment.

This method formats the source and target ontologies into a string representation, filling in a pre-defined template that includes ontology items (IRI and label).

Parameters:

**kwargs – Contains the source and target ontologies as keyword arguments.

Returns:

A list containing the formatted prompt string for ontology matching.

Return type:

list
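
Template filling of this kind can be sketched as follows. Everything named here is an assumption for illustration: the template text, the helper names owl_item_to_text and build_prompt, the per-item text format, and the keyword names source and target. The library's actual prompt wording is not shown in this reference.

```python
from typing import Any, Dict, List

# Hypothetical template; the real prompt text is not given in this reference.
PROMPT_TEMPLATE = "Source ontology:\n{source}\n\nTarget ontology:\n{target}"


def owl_item_to_text(owl: Dict[str, Any]) -> str:
    """Render one ontology item as a single line of text (assumed format)."""
    return f"({owl['iri']}, {owl['label']})"


def build_prompt(**kwargs: Any) -> List[str]:
    # Assumption: the ontologies arrive under the keys 'source' and 'target'.
    source_text = "\n".join(owl_item_to_text(o) for o in kwargs["source"])
    target_text = "\n".join(owl_item_to_text(o) for o in kwargs["target"])
    # A one-element list, matching the documented return type.
    return [PROMPT_TEMPLATE.format(source=source_text, target=target_text)]


prompts = build_prompt(
    source=[{"iri": "s#1", "label": "heart"}],
    target=[{"iri": "t#1", "label": "cardiac muscle"}],
)
```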

Retrieval Augmented Generation Encoders

This script defines three encoder classes that extend the RAGEncoder class to specialize in encoding OWL items representing different ontology concepts. These encoders use a retrieval-based approach along with a language model encoder for efficient handling of ontology mapping tasks.

Classes:
  • ConceptRAGEncoder: Encodes OWL items representing a Concept, with a retrieval encoder and a language model encoder.

  • ConceptChildrenRAGEncoder: Encodes OWL items representing a Concept and its Children, with a retrieval encoder and a language model encoder.

  • ConceptParentRAGEncoder: Encodes OWL items representing a Concept and its Parent, with a retrieval encoder and a language model encoder.

class ontoaligner.encoder.rag.ConceptChildrenRAGEncoder

Bases: RAGEncoder

Encodes OWL items representing a Concept and its Children using retrieval-based and language model encoders.

This class extends the RAGEncoder class and is specialized in encoding OWL items that consist of a Concept and its Children. The retrieval encoder uses the ConceptLightweightEncoder class to fetch the necessary items, while the language model encoder is set to “ConceptChildrenRAGDataset”.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept and its Children.

Type:

str

retrieval_encoder

The retrieval encoder used for fetching OWL items, set to ConceptLightweightEncoder.

Type:

Any

llm_encoder

The language model encoder used, set to “ConceptChildrenRAGDataset”.

Type:

str

items_in_owl: str = '(Concept, Children)'

llm_encoder: str = 'ConceptChildrenRAGDataset'

retrieval_encoder

alias of ConceptLightweightEncoder

class ontoaligner.encoder.rag.ConceptParentRAGEncoder

Bases: RAGEncoder

Encodes OWL items representing a Concept and its Parent using retrieval-based and language model encoders.

This class extends the RAGEncoder class and is specialized in encoding OWL items that consist of a Concept and its Parent. The retrieval encoder uses the ConceptLightweightEncoder class to retrieve the necessary items, while the language model encoder is set to “ConceptParentRAGDataset”.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept and its Parent.

Type:

str

retrieval_encoder

The retrieval encoder used for fetching OWL items, set to ConceptLightweightEncoder.

Type:

Any

llm_encoder

The language model encoder used, set to “ConceptParentRAGDataset”.

Type:

str

items_in_owl: str = '(Concept, Parent)'

llm_encoder: str = 'ConceptParentRAGDataset'

retrieval_encoder

alias of ConceptLightweightEncoder

class ontoaligner.encoder.rag.ConceptRAGEncoder

Bases: RAGEncoder

Encodes OWL items representing a Concept using retrieval-based and language model encoders.

This class extends the RAGEncoder class and is specialized in encoding OWL items that consist of a Concept. The retrieval encoder uses the ConceptLightweightEncoder class to retrieve OWL items, while the language model encoder is set to “ConceptRAGDataset”.

items_in_owl

Specifies the type of OWL items being encoded, in this case, a Concept.

Type:

str

retrieval_encoder

The retrieval encoder used for fetching OWL items, set to ConceptLightweightEncoder.

Type:

Any

llm_encoder

The language model encoder used, set to “ConceptRAGDataset”.

Type:

str

items_in_owl: str = '(Concept)'

llm_encoder: str = 'ConceptRAGDataset'

retrieval_encoder

alias of ConceptLightweightEncoder

class ontoaligner.encoder.rag.OLaLaEncoder

Bases: BaseEncoder

An encoder for preparing OLaLa parser output.

extract_high_precision_texts(owl: Dict, normalized_uri_fragment: str, is_uri_fragment_normalization_valid: bool) → List[str]

Extracts normalized texts for the high-precision matcher.

Parameters:
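  • owl (Dict) – A parsed ontology item.

  • normalized_uri_fragment (str) – The normalized URI fragment.

  • is_uri_fragment_normalization_valid (bool) – Whether the URI fragment normalization is valid.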

Returns:

The normalized high-precision texts.

Return type:

List[str]

get_encoder_info() → str

Provides information about the encoder.

Returns:

A description of the encoder.

Return type:

str

get_host_uri_by_sampling(items: list, sample_size: int = 50) → str

Extracts the most common host by sampling ontology items.

Parameters:
  • items (list) – The parsed ontology items.

  • sample_size (int) – The number of items to sample.

Returns:

The most common host.

Return type:

str
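
One way to find the most common host by sampling is sketched below. It is not the library's implementation: the function name most_common_host, the extra seed parameter, the empty-string result for empty input, and the assumption that each item stores its identifier under an 'iri' key are all introduced here for illustration.

```python
import random
from collections import Counter
from typing import Dict, List
from urllib.parse import urlparse


def most_common_host(items: List[Dict[str, str]], sample_size: int = 50, seed: int = 0) -> str:
    """Return the host that occurs most often among a random sample of item IRIs."""
    if not items:
        return ""  # assumption: no items means no host
    rng = random.Random(seed)  # seeded only to keep this sketch reproducible
    sample = rng.sample(items, min(sample_size, len(items)))
    # Assumption: each item stores its identifier under the key 'iri'.
    hosts = Counter(urlparse(item["iri"]).netloc for item in sample)
    return hosts.most_common(1)[0][0]


items = [
    {"iri": "http://purl.obolibrary.org/obo/MA_0000001"},
    {"iri": "http://purl.obolibrary.org/obo/MA_0000002"},
    {"iri": "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"},
]
host = most_common_host(items)
```

Sampling keeps the cost bounded on large ontologies, at the price of a small chance of picking a minority host when the sample is unrepresentative.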

get_owl_items(owl: Dict, expected_host: str) → Dict

Extracts OLaLa-ready fields from a parsed ontology item.

Parameters:
  • owl (Dict) – A parsed ontology item.

  • expected_host (str) – The expected ontology host.

Returns:

The prepared OLaLa ontology item.

Return type:

Dict

is_resource_for_sbert(host, expected_host) → bool

Checks whether a resource should be kept for SBERT candidate generation.

items_in_owl: str = '(OLaLa TextExtractorSet)'

parse(**kwargs) → Any

Parses source and target ontologies into OLaLa-ready inputs.

Parameters:

**kwargs – Contains the source and target ontologies.

Returns:

A list containing prepared source and target ontology items.

Return type:

list

class ontoaligner.encoder.rag.RAGEncoder

Bases: BaseEncoder

A retrieval-augmented generation (RAG) encoder for ontology mapping.

This class leverages retrieval-augmented generation for encoding ontology data, allowing for both retrieval of relevant data and generation of encoded information.

get_encoder_info() → str

Provides information about the encoder and its usage.

Returns:

A description of the encoder’s components.

Return type:

str

llm_encoder: str = None

parse(**kwargs) → Any

Processes the source and target ontologies into indices for retrieval and encoding.

This method converts the source and target ontologies into mappings of IRI to index, preparing them for use in a retrieval-augmented generation model.

Parameters:

**kwargs – Contains the source and target ontologies as keyword arguments.

Returns:

A dictionary with the retrieval encoder, LLM encoder, task arguments, and the source and target ontology index mappings.

Return type:

dict
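
The IRI-to-index mapping and the returned dictionary can be sketched as below. The helper names build_iri_index and parse_for_rag and every dictionary key are hypothetical, chosen only to mirror the description above; the keyword names source and target are likewise assumed.

```python
from typing import Any, Dict, List


def build_iri_index(ontology: List[Dict[str, Any]]) -> Dict[str, int]:
    """Map each item's IRI to its position in the ontology list."""
    return {owl["iri"]: index for index, owl in enumerate(ontology)}


def parse_for_rag(retrieval_encoder: Any, llm_encoder: str, **kwargs: Any) -> Dict[str, Any]:
    # Assumption: all key names below are illustrative, not the library's.
    return {
        "retrieval_encoder": retrieval_encoder,
        "llm_encoder": llm_encoder,
        "task_args": kwargs,
        "source_iri_to_index": build_iri_index(kwargs["source"]),
        "target_iri_to_index": build_iri_index(kwargs["target"]),
    }


result = parse_for_rag(
    retrieval_encoder=None,  # placeholder for a retrieval encoder class
    llm_encoder="ConceptRAGDataset",
    source=[{"iri": "s#1"}, {"iri": "s#2"}],
    target=[{"iri": "t#1"}],
)
```

An index mapping like this lets later stages look up the full ontology item for a retrieved IRI in constant time.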

retrieval_encoder: Any = None

FewShot-RAG Encoders

This script defines three encoder classes for few-shot learning based on the RAG (retrieval-augmented generation) method. These classes extend the functionality of the RAG-based encoders for concept, concept children, and concept parent, specializing them for few-shot datasets related to concepts and their hierarchical relationships.

Classes:
  • ConceptFewShotEncoder: A few-shot learning encoder for concepts.

  • ConceptChildrenFewShotEncoder: A few-shot learning encoder for concept children.

  • ConceptParentFewShotEncoder: A few-shot learning encoder for concept parents.

class ontoaligner.encoder.fewshot.ConceptChildrenFewShotEncoder

Bases: ConceptChildrenRAGEncoder

A few-shot learning encoder for concept children using retrieval-augmented generation (RAG).

This class extends the ConceptChildrenRAGEncoder and is designed for few-shot learning tasks related to concept children. It uses a custom few-shot dataset for encoding concept children.

llm_encoder

The dataset used for few-shot learning. In this case, it uses “ConceptChildrenFewShotDataset”.

Type:

str

llm_encoder: str = 'ConceptChildrenFewShotDataset'

class ontoaligner.encoder.fewshot.ConceptFewShotEncoder

Bases: ConceptRAGEncoder

A few-shot learning encoder for concepts using retrieval-augmented generation (RAG).

This class extends the ConceptRAGEncoder and is designed specifically for few-shot learning tasks related to concepts. It uses a custom few-shot dataset for encoding concepts.

llm_encoder

The dataset used for few-shot learning. In this case, it uses “ConceptFewShotDataset”.

Type:

str

llm_encoder: str = 'ConceptFewShotDataset'

class ontoaligner.encoder.fewshot.ConceptParentFewShotEncoder

Bases: ConceptParentRAGEncoder

A few-shot learning encoder for concept parents using retrieval-augmented generation (RAG).

This class extends the ConceptParentRAGEncoder and is designed for few-shot learning tasks related to concept parents. It uses a custom few-shot dataset for encoding concept parents.

llm_encoder

The dataset used for few-shot learning. In this case, it uses “ConceptParentFewShotDataset”.

Type:

str

llm_encoder: str = 'ConceptParentFewShotDataset'
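
The few-shot encoders differ from their RAG parents only in the llm_encoder attribute. The sketch below reproduces that pattern with stand-in classes (the Sketch-prefixed names are hypothetical and carry none of the library's behavior); the attribute values are the ones listed in this reference.

```python
class SketchConceptLightweightEncoder:
    """Stand-in for the retrieval encoder class."""

    items_in_owl: str = "(Concept)"


class SketchConceptRAGEncoder:
    """Stand-in for the RAG encoder that the few-shot encoder extends."""

    items_in_owl: str = "(Concept)"
    retrieval_encoder = SketchConceptLightweightEncoder
    llm_encoder: str = "ConceptRAGDataset"


class SketchConceptFewShotEncoder(SketchConceptRAGEncoder):
    # Only the dataset name changes; everything else is inherited.
    llm_encoder: str = "ConceptFewShotDataset"


encoder = SketchConceptFewShotEncoder()
```

Because the override is a class attribute, the parent class keeps its own value and both encoders can be used side by side.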

Graph Encoders

class ontoaligner.encoder.graph.GraphTripleEncoder

Bases: BaseEncoder

encode_ontology(ontology)

get_encoder_info()

Provides information about the encoder.

parse(**kwargs) → Any

Parses the source and target ontologies, applying preprocessing.

This method extracts ontology items (IRI and label) from the source and target ontologies, applies text preprocessing to the labels, and returns the encoded data.

Parameters:

**kwargs – Contains the source and target ontologies as keyword arguments.

Returns:

A list containing two elements, the processed source and target ontologies.

Return type:

list

FLORA Encoder