Text Embedding Models

`gfmrag.text_emb_models`

`BaseTextEmbModel`

A base class for text embedding models using SentenceTransformer.

This class provides functionality to encode text into embeddings using various SentenceTransformer models with configurable parameters.

Parameters:

Name	Type	Description	Default
`text_emb_model_name`	`str`	Name or path of the SentenceTransformer model to use	required
`normalize`	`bool`	Whether to L2-normalize the embeddings. Defaults to False.	`False`
`batch_size`	`int`	Batch size for encoding. Defaults to 32.	`32`
`query_instruct`	`str \| None`	Instruction/prompt to prepend to queries. Defaults to None.	`None`
`passage_instruct`	`str \| None`	Instruction/prompt to prepend to passages. Defaults to None.	`None`
`model_kwargs`	`dict \| None`	Additional keyword arguments for the model. Defaults to None.	`None`
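To illustrate what `normalize=True` implies, here is a minimal sketch of L2 normalization in plain Python. It is not the library's implementation; the helper names `l2_normalize` and `dot` and the toy vectors are assumptions made for illustration. The point it shows is that once vectors have unit length, their dot product equals their cosine similarity.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-D "embeddings" (made-up values, not real model output).
query = l2_normalize([3.0, 4.0])
passage = l2_normalize([4.0, 3.0])

print(query)                # [0.6, 0.8]
print(dot(query, passage))  # cosine similarity of the two original vectors
```

With `normalize=False` the raw vectors keep their original magnitudes, so a similarity computed by dot product would also depend on vector length.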

Attributes:

Name	Type	Description
`text_emb_model`	`SentenceTransformer`	The underlying SentenceTransformer model
`text_emb_model_name`	`str`	Name of the model being used
`normalize`	`bool`	Whether embeddings are L2-normalized
`batch_size`	`int`	Batch size used for encoding
`query_instruct`	`str \| None`	Instruction text for queries
`passage_instruct`	`str \| None`	Instruction text for passages
`model_kwargs`	`dict \| None`	Additional model configuration parameters

Methods:

Name	Description
`encode`	`encode(text: list[str], is_query: bool = False, show_progress_bar: bool = True) -> torch.Tensor`: Encodes a list of texts into embeddings.

Source code in gfmrag/text_emb_models/base_model.py

Python

import torch
from sentence_transformers import SentenceTransformer


class BaseTextEmbModel:
    """A base class for text embedding models using SentenceTransformer.

    This class provides functionality to encode text into embeddings using various
    SentenceTransformer models with configurable parameters.

    Args:
        text_emb_model_name (str): Name or path of the SentenceTransformer model to use
        normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
        batch_size (int, optional): Batch size for encoding. Defaults to 32.
        query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
        passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
        model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.

    Attributes:
        text_emb_model (SentenceTransformer): The underlying SentenceTransformer model
        text_emb_model_name (str): Name of the model being used
        normalize (bool): Whether embeddings are L2-normalized
        batch_size (int): Batch size used for encoding
        query_instruct (str | None): Instruction text for queries
        passage_instruct (str | None): Instruction text for passages
        model_kwargs (dict | None): Additional model configuration parameters

    Methods:
        encode(text: list[str], is_query: bool = False, show_progress_bar: bool = True) -> torch.Tensor:
            Encodes a list of texts into embeddings.
    """

    def __init__(
        self,
        text_emb_model_name: str,
        normalize: bool = False,
        batch_size: int = 32,
        query_instruct: str | None = None,
        passage_instruct: str | None = None,
        model_kwargs: dict | None = None,
    ) -> None:
        """
        Initialize the BaseTextEmbModel.

        Args:
            text_emb_model_name (str): Name or path of the SentenceTransformer model to use
            normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
            batch_size (int, optional): Batch size for encoding. Defaults to 32.
            query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
            passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
            model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.
        """
        self.text_emb_model_name = text_emb_model_name
        self.normalize = normalize
        self.batch_size = batch_size
        self.query_instruct = query_instruct
        self.passage_instruct = passage_instruct
        self.model_kwargs = model_kwargs

        self.text_emb_model = SentenceTransformer(
            self.text_emb_model_name,
            trust_remote_code=True,
            model_kwargs=self.model_kwargs,
        )

    def encode(
        self, text: list[str], is_query: bool = False, show_progress_bar: bool = True
    ) -> torch.Tensor:
        """
        Encodes a list of text strings into embeddings using the text embedding model.

        Args:
            text (list[str]): List of text strings to encode
            is_query (bool, optional): Whether the text is a query (True) or passage (False).
                Determines which instruction prompt to use. Defaults to False.
            show_progress_bar (bool, optional): Whether to display progress bar during encoding.
                Defaults to True.

        Returns:
            torch.Tensor: Tensor containing the encoded embeddings for the input text

        Examples:
            >>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
            >>> text = ["Hello, world!", "This is a test."]
            >>> embeddings = text_emb_model.encode(text)
        """

        return self.text_emb_model.encode(
            text,
            device="cuda" if torch.cuda.is_available() else "cpu",
            normalize_embeddings=self.normalize,
            batch_size=self.batch_size,
            prompt=self.query_instruct if is_query else self.passage_instruct,
            show_progress_bar=show_progress_bar,
            convert_to_tensor=True,
        )

`__init__(text_emb_model_name, normalize=False, batch_size=32, query_instruct=None, passage_instruct=None, model_kwargs=None)`

Initialize the BaseTextEmbModel.

Parameters:

Name	Type	Description	Default
`text_emb_model_name`	`str`	Name or path of the SentenceTransformer model to use	required
`normalize`	`bool`	Whether to L2-normalize the embeddings. Defaults to False.	`False`
`batch_size`	`int`	Batch size for encoding. Defaults to 32.	`32`
`query_instruct`	`str \| None`	Instruction/prompt to prepend to queries. Defaults to None.	`None`
`passage_instruct`	`str \| None`	Instruction/prompt to prepend to passages. Defaults to None.	`None`
`model_kwargs`	`dict \| None`	Additional keyword arguments for the model. Defaults to None.	`None`
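As a rough picture of what `batch_size` controls, the sketch below splits a list of texts into consecutive chunks of at most `batch_size` items. The helper `iter_batches` is hypothetical and only illustrates the chunking idea; the underlying SentenceTransformer handles batching internally and may also reorder inputs by length.

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# 70 inputs with the default batch size of 32 give two full batches and a remainder.
texts = [f"passage {i}" for i in range(70)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=32)]
print(batch_lengths)  # [32, 32, 6]
```

A larger batch size generally means fewer forward passes at the cost of more memory per pass.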

Source code in gfmrag/text_emb_models/base_model.py

Python

def __init__(
    self,
    text_emb_model_name: str,
    normalize: bool = False,
    batch_size: int = 32,
    query_instruct: str | None = None,
    passage_instruct: str | None = None,
    model_kwargs: dict | None = None,
) -> None:
    """
    Initialize the BaseTextEmbModel.

    Args:
        text_emb_model_name (str): Name or path of the SentenceTransformer model to use
        normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
        batch_size (int, optional): Batch size for encoding. Defaults to 32.
        query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
        passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
        model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.
    """
    self.text_emb_model_name = text_emb_model_name
    self.normalize = normalize
    self.batch_size = batch_size
    self.query_instruct = query_instruct
    self.passage_instruct = passage_instruct
    self.model_kwargs = model_kwargs

    self.text_emb_model = SentenceTransformer(
        self.text_emb_model_name,
        trust_remote_code=True,
        model_kwargs=self.model_kwargs,
    )

`encode(text, is_query=False, show_progress_bar=True)`

Encodes a list of text strings into embeddings using the text embedding model.

Parameters:

Name	Type	Description	Default
`text`	`list[str]`	List of text strings to encode	required
`is_query`	`bool`	Whether the text is a query (True) or passage (False). Determines which instruction prompt to use. Defaults to False.	`False`
`show_progress_bar`	`bool`	Whether to display progress bar during encoding. Defaults to True.	`True`

Returns:

Type	Description
`torch.Tensor`	Tensor containing the encoded embeddings for the input text

Examples:

Python Console Session

>>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
>>> text = ["Hello, world!", "This is a test."]
>>> embeddings = text_emb_model.encode(text)
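The effect of `is_query` can be sketched without loading a model. The snippet below mirrors the conditional in `encode` that chooses between `query_instruct` and `passage_instruct`, then prepends the chosen prompt to each text. The helpers `select_prompt` and `apply_prompt` and the instruction strings are hypothetical, and the sketch assumes the model defines no default prompt of its own.

```python
from typing import Optional


def select_prompt(
    is_query: bool, query_instruct: Optional[str], passage_instruct: Optional[str]
) -> Optional[str]:
    """Pick the instruction the same way `encode` does: query prompt if `is_query`."""
    return query_instruct if is_query else passage_instruct


def apply_prompt(texts: list[str], prompt: Optional[str]) -> list[str]:
    """Prepend the prompt to every text; `None` leaves the texts unchanged."""
    if prompt is None:
        return list(texts)
    return [prompt + text for text in texts]


# Hypothetical instruction strings, chosen only for illustration.
QUERY_INSTRUCT = "query: "
PASSAGE_INSTRUCT = "passage: "

prompt = select_prompt(True, QUERY_INSTRUCT, PASSAGE_INSTRUCT)
print(apply_prompt(["who wrote Hamlet?"], prompt))
# ['query: who wrote Hamlet?']
```

Because `is_query` defaults to `False`, calling `encode` without it treats the inputs as passages.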

Source code in gfmrag/text_emb_models/base_model.py

Python

def encode(
    self, text: list[str], is_query: bool = False, show_progress_bar: bool = True
) -> torch.Tensor:
    """
    Encodes a list of text strings into embeddings using the text embedding model.

    Args:
        text (list[str]): List of text strings to encode
        is_query (bool, optional): Whether the text is a query (True) or passage (False).
            Determines which instruction prompt to use. Defaults to False.
        show_progress_bar (bool, optional): Whether to display progress bar during encoding.
            Defaults to True.

    Returns:
        torch.Tensor: Tensor containing the encoded embeddings for the input text

    Examples:
        >>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
        >>> text = ["Hello, world!", "This is a test."]
        >>> embeddings = text_emb_model.encode(text)
    """

    return self.text_emb_model.encode(
        text,
        device="cuda" if torch.cuda.is_available() else "cpu",
        normalize_embeddings=self.normalize,
        batch_size=self.batch_size,
        prompt=self.query_instruct if is_query else self.passage_instruct,
        show_progress_bar=show_progress_bar,
        convert_to_tensor=True,
    )

`NVEmbedV2`

Bases: BaseTextEmbModel

A text embedding model class that extends BaseTextEmbModel specifically for Nvidia models.

This class customizes the base embedding model by:

1. Setting a larger max sequence length of 32768
2. Setting right-side padding
3. Adding EOS tokens to input text

Parameters:

Name	Type	Description	Default
`text_emb_model_name`	`str`	Name or path of the text embedding model	required
`normalize`	`bool`	Whether to normalize the output embeddings	required
`batch_size`	`int`	Batch size for processing	required
`query_instruct`	`str`	Instruction prefix for query texts. Defaults to "".	`''`
`passage_instruct`	`str`	Instruction prefix for passage texts. Defaults to "".	`''`
`model_kwargs`	`dict \| None`	Additional keyword arguments for model initialization. Defaults to None.	`None`

Methods:

Name	Description
`add_eos`	Adds EOS token to each input example
`encode`	Encodes text by first adding EOS tokens then calling parent encode method

Attributes:

Name	Type	Description
`text_emb_model`	`SentenceTransformer`	The underlying text embedding model with customized max_seq_length and padding_side
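To make `max_seq_length` and right-side padding concrete, here is a small sketch of what they mean for a batch of token ID sequences: each sequence is truncated to the maximum length, then shorter sequences are filled with a padding ID at the end. The function `pad_right`, the token IDs, and the padding ID are all made up for illustration; in practice the tokenizer performs this step.

```python
def pad_right(
    sequences: list[list[int]], pad_id: int = 0, max_seq_length: int = 32768
) -> list[list[int]]:
    """Truncate each sequence to `max_seq_length`, then pad on the right to a common length."""
    truncated = [seq[:max_seq_length] for seq in sequences]
    width = max((len(seq) for seq in truncated), default=0)
    return [seq + [pad_id] * (width - len(seq)) for seq in truncated]


# Made-up token IDs for two inputs of different lengths.
batch = pad_right([[11, 12, 13], [21]], pad_id=0)
print(batch)  # [[11, 12, 13], [21, 0, 0]]
```

With left-side padding the fill values would instead precede the real tokens, which shifts where the final real token sits in each row.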

Source code in gfmrag/text_emb_models/nv_embed.py

Python

from typing import Any

import torch

from .base_model import BaseTextEmbModel


class NVEmbedV2(BaseTextEmbModel):
    """A text embedding model class that extends BaseTextEmbModel specifically for Nvidia models.

    This class customizes the base embedding model by:
    1. Setting a larger max sequence length of 32768
    2. Setting right-side padding
    3. Adding EOS tokens to input text

    Args:
        text_emb_model_name (str): Name or path of the text embedding model
        normalize (bool): Whether to normalize the output embeddings
        batch_size (int): Batch size for processing
        query_instruct (str, optional): Instruction prefix for query texts. Defaults to "".
        passage_instruct (str, optional): Instruction prefix for passage texts. Defaults to "".
        model_kwargs (dict | None, optional): Additional keyword arguments for model initialization. Defaults to None.

    Methods:
        add_eos: Adds EOS token to each input example
        encode: Encodes text by first adding EOS tokens then calling parent encode method

    Attributes:
        text_emb_model: The underlying text embedding model with customized max_seq_length and padding_side
    """

    def __init__(
        self,
        text_emb_model_name: str,
        normalize: bool,
        batch_size: int,
        query_instruct: str = "",
        passage_instruct: str = "",
        model_kwargs: dict | None = None,
    ) -> None:
        super().__init__(
            text_emb_model_name,
            normalize,
            batch_size,
            query_instruct,
            passage_instruct,
            model_kwargs,
        )
        self.text_emb_model.max_seq_length = 32768
        self.text_emb_model.tokenizer.padding_side = "right"

    def add_eos(self, input_examples: list[str]) -> list[str]:
        input_examples = [
            input_example + self.text_emb_model.tokenizer.eos_token
            for input_example in input_examples
        ]
        return input_examples

    def encode(self, text: list[str], *args: Any, **kwargs: Any) -> torch.Tensor:
        """
        Encode a list of text strings into embeddings with added EOS token.

        This method adds an EOS (end of sequence) token to each text string before encoding.

        Args:
            text (list[str]): List of text strings to encode
            *args (Any): Additional positional arguments passed to parent encode method
            **kwargs (Any): Additional keyword arguments passed to parent encode method

        Returns:
            torch.Tensor: Encoded text embeddings tensor

        Examples:
            >>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=8)
            >>> texts = ["Hello world", "Another text"]
            >>> embeddings = encoder.encode(texts)
        """
        return super().encode(self.add_eos(text), *args, **kwargs)

`encode(text, *args, **kwargs)`

Encode a list of text strings into embeddings with added EOS token.

This method adds an EOS (end of sequence) token to each text string before encoding.

Parameters:

Name	Type	Description	Default
`text`	`list[str]`	List of text strings to encode	required
`*args`	`Any`	Additional positional arguments passed to parent encode method	`()`
`**kwargs`	`Any`	Additional keyword arguments passed to parent encode method	`{}`

Returns:

Type	Description
`torch.Tensor`	Encoded text embeddings tensor

Examples:

Python Console Session

>>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=8)
>>> texts = ["Hello world", "Another text"]
>>> embeddings = encoder.encode(texts)
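The EOS step can be shown in isolation. The sketch below reproduces the list comprehension used by `add_eos`, but takes the token as an explicit argument so it runs without a tokenizer. The value `"</s>"` is an assumed placeholder; the real token comes from `text_emb_model.tokenizer.eos_token`.

```python
def add_eos(input_examples: list[str], eos_token: str) -> list[str]:
    """Return a new list with `eos_token` appended to every input string."""
    return [example + eos_token for example in input_examples]


EOS_TOKEN = "</s>"  # assumed placeholder; the real value comes from the tokenizer

texts = ["Hello world", "Another text"]
print(add_eos(texts, EOS_TOKEN))
# ['Hello world</s>', 'Another text</s>']
print(texts)  # the original list is left untouched
```

Since `NVEmbedV2.encode` forwards the modified list to the parent `encode`, any instruction prompt is applied to text that already ends with the EOS token.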

Source code in gfmrag/text_emb_models/nv_embed.py

Python

def encode(self, text: list[str], *args: Any, **kwargs: Any) -> torch.Tensor:
    """
    Encode a list of text strings into embeddings with added EOS token.

    This method adds an EOS (end of sequence) token to each text string before encoding.

    Args:
        text (list[str]): List of text strings to encode
        *args (Any): Additional positional arguments passed to parent encode method
        **kwargs (Any): Additional keyword arguments passed to parent encode method

    Returns:
        torch.Tensor: Encoded text embeddings tensor

    Examples:
        >>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=8)
        >>> texts = ["Hello world", "Another text"]
        >>> embeddings = encoder.encode(texts)
    """
    return super().encode(self.add_eos(text), *args, **kwargs)
