Text Embedding Models

gfmrag.text_emb_models

BaseTextEmbModel

A base class for text embedding models using SentenceTransformer.

This class provides functionality to encode text into embeddings using various SentenceTransformer models with configurable parameters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text_emb_model_name` | `str` | Name or path of the SentenceTransformer model to use. | required |
| `normalize` | `bool` | Whether to L2-normalize the embeddings. | `False` |
| `batch_size` | `int` | Batch size for encoding. | `32` |
| `query_instruct` | `str \| None` | Instruction/prompt to prepend to queries. | `None` |
| `passage_instruct` | `str \| None` | Instruction/prompt to prepend to passages. | `None` |
| `model_kwargs` | `dict \| None` | Additional keyword arguments for the model. | `None` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `text_emb_model` | `SentenceTransformer` | The underlying SentenceTransformer model. |
| `text_emb_model_name` | `str` | Name of the model being used. |
| `normalize` | `bool` | Whether embeddings are L2-normalized. |
| `batch_size` | `int` | Batch size used for encoding. |
| `query_instruct` | `str \| None` | Instruction text for queries. |
| `passage_instruct` | `str \| None` | Instruction text for passages. |
| `model_kwargs` | `dict \| None` | Additional model configuration parameters. |

Methods:

| Name | Description |
| --- | --- |
| `encode` | `encode(text: list[str], is_query: bool = False, show_progress_bar: bool = True) -> torch.Tensor`: Encodes a list of texts into embeddings. |
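
The effect of `normalize=True` can be illustrated without loading a model. The sketch below is a plain-Python stand-in, not gfmrag code: `l2_normalize` and `dot` are hypothetical helpers, and the vectors are made-up values. It shows that after L2 normalization each embedding has unit length, so the dot product of two embeddings equals their cosine similarity.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors (hypothetical helper)."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 2-D "embeddings" that point in the same direction
# but have different magnitudes.
query = l2_normalize([3.0, 4.0])
passage = l2_normalize([6.0, 8.0])

unit_length = math.sqrt(dot(query, query))  # length after normalization
similarity = dot(query, passage)            # cosine similarity of the pair
```

With the default `normalize=False`, dot products between embeddings also depend on vector magnitude, so cosine similarity has to be computed explicitly when it is needed.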

Source code in gfmrag/text_emb_models/base_model.py
Python
class BaseTextEmbModel:
    """A base class for text embedding models using SentenceTransformer.

    This class provides functionality to encode text into embeddings using various
    SentenceTransformer models with configurable parameters.

    Args:
        text_emb_model_name (str): Name or path of the SentenceTransformer model to use
        normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
        batch_size (int, optional): Batch size for encoding. Defaults to 32.
        query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
        passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
        model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.

    Attributes:
        text_emb_model (SentenceTransformer): The underlying SentenceTransformer model
        text_emb_model_name (str): Name of the model being used
        normalize (bool): Whether embeddings are L2-normalized
        batch_size (int): Batch size used for encoding
        query_instruct (str | None): Instruction text for queries
        passage_instruct (str | None): Instruction text for passages
        model_kwargs (dict | None): Additional model configuration parameters

    Methods:
        encode(text: list[str], is_query: bool = False, show_progress_bar: bool = True) -> torch.Tensor:
            Encodes a list of texts into embeddings.
    """

    def __init__(
        self,
        text_emb_model_name: str,
        normalize: bool = False,
        batch_size: int = 32,
        query_instruct: str | None = None,
        passage_instruct: str | None = None,
        model_kwargs: dict | None = None,
    ) -> None:
        """
        Initialize the BaseTextEmbModel.

        Args:
            text_emb_model_name (str): Name or path of the SentenceTransformer model to use
            normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
            batch_size (int, optional): Batch size for encoding. Defaults to 32.
            query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
            passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
            model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.
        """
        self.text_emb_model_name = text_emb_model_name
        self.normalize = normalize
        self.batch_size = batch_size
        self.query_instruct = query_instruct
        self.passage_instruct = passage_instruct
        self.model_kwargs = model_kwargs

        self.text_emb_model = SentenceTransformer(
            self.text_emb_model_name,
            trust_remote_code=True,
            model_kwargs=self.model_kwargs,
        )

    def encode(
        self, text: list[str], is_query: bool = False, show_progress_bar: bool = True
    ) -> torch.Tensor:
        """
        Encodes a list of text strings into embeddings using the text embedding model.

        Args:
            text (list[str]): List of text strings to encode
            is_query (bool, optional): Whether the text is a query (True) or passage (False).
                Determines which instruction prompt to use. Defaults to False.
            show_progress_bar (bool, optional): Whether to display progress bar during encoding.
                Defaults to True.

        Returns:
            torch.Tensor: Tensor containing the encoded embeddings for the input text

        Examples:
            >>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
            >>> text = ["Hello, world!", "This is a test."]
            >>> embeddings = text_emb_model.encode(text)
        """

        return self.text_emb_model.encode(
            text,
            device="cuda" if torch.cuda.is_available() else "cpu",
            normalize_embeddings=self.normalize,
            batch_size=self.batch_size,
            prompt=self.query_instruct if is_query else self.passage_instruct,
            show_progress_bar=show_progress_bar,
            convert_to_tensor=True,
        )

__init__(text_emb_model_name, normalize=False, batch_size=32, query_instruct=None, passage_instruct=None, model_kwargs=None)

Initialize the BaseTextEmbModel.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text_emb_model_name` | `str` | Name or path of the SentenceTransformer model to use. | required |
| `normalize` | `bool` | Whether to L2-normalize the embeddings. | `False` |
| `batch_size` | `int` | Batch size for encoding. | `32` |
| `query_instruct` | `str \| None` | Instruction/prompt to prepend to queries. | `None` |
| `passage_instruct` | `str \| None` | Instruction/prompt to prepend to passages. | `None` |
| `model_kwargs` | `dict \| None` | Additional keyword arguments for the model. | `None` |

Source code in gfmrag/text_emb_models/base_model.py
Python
def __init__(
    self,
    text_emb_model_name: str,
    normalize: bool = False,
    batch_size: int = 32,
    query_instruct: str | None = None,
    passage_instruct: str | None = None,
    model_kwargs: dict | None = None,
) -> None:
    """
    Initialize the BaseTextEmbModel.

    Args:
        text_emb_model_name (str): Name or path of the SentenceTransformer model to use
        normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
        batch_size (int, optional): Batch size for encoding. Defaults to 32.
        query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
        passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
        model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.
    """
    self.text_emb_model_name = text_emb_model_name
    self.normalize = normalize
    self.batch_size = batch_size
    self.query_instruct = query_instruct
    self.passage_instruct = passage_instruct
    self.model_kwargs = model_kwargs

    self.text_emb_model = SentenceTransformer(
        self.text_emb_model_name,
        trust_remote_code=True,
        model_kwargs=self.model_kwargs,
    )

encode(text, is_query=False, show_progress_bar=True)

Encodes a list of text strings into embeddings using the text embedding model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `list[str]` | List of text strings to encode. | required |
| `is_query` | `bool` | Whether the text is a query (`True`) or passage (`False`). Determines which instruction prompt to use. | `False` |
| `show_progress_bar` | `bool` | Whether to display a progress bar during encoding. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `torch.Tensor` | Tensor containing the encoded embeddings for the input text. |

Examples:

Python Console Session
>>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
>>> text = ["Hello, world!", "This is a test."]
>>> embeddings = text_emb_model.encode(text)
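
How `is_query` picks the instruction can be sketched in plain Python. `select_prompt` and `apply_prompt` below are hypothetical helpers, not part of gfmrag, and the instruction strings are illustrative. The selection mirrors the `prompt=` expression in `encode`; the prepending step assumes the SentenceTransformer behavior of placing the prompt in front of each input text.

```python
from __future__ import annotations


def select_prompt(
    is_query: bool, query_instruct: str | None, passage_instruct: str | None
) -> str | None:
    """Pick the instruction the same way encode() does (hypothetical helper)."""
    return query_instruct if is_query else passage_instruct


def apply_prompt(texts: list[str], prompt: str | None) -> list[str]:
    """Prepend the prompt to every text; None means no prompt (assumption)."""
    if prompt is None:
        return list(texts)
    return [prompt + text for text in texts]


# Illustrative instruction strings, not defaults of the class.
QUERY_INSTRUCT = "query: "
PASSAGE_INSTRUCT = "passage: "

query_inputs = apply_prompt(
    ["Who wrote Hamlet?"],
    select_prompt(True, QUERY_INSTRUCT, PASSAGE_INSTRUCT),
)
passage_inputs = apply_prompt(
    ["Hamlet is a tragedy by William Shakespeare."],
    select_prompt(False, QUERY_INSTRUCT, PASSAGE_INSTRUCT),
)
```

Because `is_query` defaults to `False`, a call that omits it treats every input as a passage and uses `passage_instruct`.
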
Source code in gfmrag/text_emb_models/base_model.py
Python
def encode(
    self, text: list[str], is_query: bool = False, show_progress_bar: bool = True
) -> torch.Tensor:
    """
    Encodes a list of text strings into embeddings using the text embedding model.

    Args:
        text (list[str]): List of text strings to encode
        is_query (bool, optional): Whether the text is a query (True) or passage (False).
            Determines which instruction prompt to use. Defaults to False.
        show_progress_bar (bool, optional): Whether to display progress bar during encoding.
            Defaults to True.

    Returns:
        torch.Tensor: Tensor containing the encoded embeddings for the input text

    Examples:
        >>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
        >>> text = ["Hello, world!", "This is a test."]
        >>> embeddings = text_emb_model.encode(text)
    """

    return self.text_emb_model.encode(
        text,
        device="cuda" if torch.cuda.is_available() else "cpu",
        normalize_embeddings=self.normalize,
        batch_size=self.batch_size,
        prompt=self.query_instruct if is_query else self.passage_instruct,
        show_progress_bar=show_progress_bar,
        convert_to_tensor=True,
    )

NVEmbedV2

Bases: BaseTextEmbModel

A text embedding model class that extends BaseTextEmbModel specifically for NVIDIA models.

This class customizes the base embedding model by:

1. Setting a larger max sequence length of 32768
2. Setting right-side padding
3. Adding EOS tokens to input text

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text_emb_model_name` | `str` | Name or path of the text embedding model. | required |
| `normalize` | `bool` | Whether to normalize the output embeddings. | required |
| `batch_size` | `int` | Batch size for processing. | required |
| `query_instruct` | `str` | Instruction prefix for query texts. | `''` |
| `passage_instruct` | `str` | Instruction prefix for passage texts. | `''` |
| `model_kwargs` | `dict \| None` | Additional keyword arguments for model initialization. | `None` |

Methods:

| Name | Description |
| --- | --- |
| `add_eos` | Adds EOS token to each input example. |
| `encode` | Encodes text by first adding EOS tokens, then calling the parent `encode` method. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `text_emb_model` | `SentenceTransformer` | The underlying text embedding model with customized `max_seq_length` and `padding_side`. |
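
Right-side padding, as set on the tokenizer by this class, places pad tokens after the real tokens of the shorter sequences in a batch. The sketch below is a plain-Python illustration with made-up token IDs; `pad_batch` and `pad_id` are hypothetical names, not the tokenizer's API.

```python
def pad_batch(
    sequences: list[list[int]], pad_id: int = 0, side: str = "right"
) -> list[list[int]]:
    """Pad every sequence to the length of the longest one (hypothetical helper)."""
    if not sequences:
        return []
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        # "right" appends the pad tokens, "left" puts them in front.
        padded.append(seq + padding if side == "right" else padding + seq)
    return padded


# Made-up token IDs for two inputs of different length.
batch = [[11, 12, 13], [21]]
right_padded = pad_batch(batch, side="right")
left_padded = pad_batch(batch, side="left")
```

With right-side padding the real tokens of every sequence start at position 0, which is the layout this class configures via `tokenizer.padding_side = "right"`.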

Source code in gfmrag/text_emb_models/nv_embed.py
Python
class NVEmbedV2(BaseTextEmbModel):
    """A text embedding model class that extends BaseTextEmbModel specifically for Nvidia models.

    This class customizes the base embedding model by:
    1. Setting a larger max sequence length of 32768
    2. Setting right-side padding
    3. Adding EOS tokens to input text

    Args:
        text_emb_model_name (str): Name or path of the text embedding model
        normalize (bool): Whether to normalize the output embeddings
        batch_size (int): Batch size for processing
        query_instruct (str, optional): Instruction prefix for query texts. Defaults to "".
        passage_instruct (str, optional): Instruction prefix for passage texts. Defaults to "".
        model_kwargs (dict | None, optional): Additional keyword arguments for model initialization. Defaults to None.

    Methods:
        add_eos: Adds EOS token to each input example
        encode: Encodes text by first adding EOS tokens then calling parent encode method

    Attributes:
        text_emb_model: The underlying text embedding model with customized max_seq_length and padding_side
    """

    def __init__(
        self,
        text_emb_model_name: str,
        normalize: bool,
        batch_size: int,
        query_instruct: str = "",
        passage_instruct: str = "",
        model_kwargs: dict | None = None,
    ) -> None:
        super().__init__(
            text_emb_model_name,
            normalize,
            batch_size,
            query_instruct,
            passage_instruct,
            model_kwargs,
        )
        self.text_emb_model.max_seq_length = 32768
        self.text_emb_model.tokenizer.padding_side = "right"

    def add_eos(self, input_examples: list[str]) -> list[str]:
        input_examples = [
            input_example + self.text_emb_model.tokenizer.eos_token
            for input_example in input_examples
        ]
        return input_examples

    def encode(self, text: list[str], *args: Any, **kwargs: Any) -> torch.Tensor:
        """
        Encode a list of text strings into embeddings with added EOS token.

        This method adds an EOS (end of sequence) token to each text string before encoding.

        Args:
            text (list[str]): List of text strings to encode
            *args (Any): Additional positional arguments passed to parent encode method
            **kwargs (Any): Additional keyword arguments passed to parent encode method

        Returns:
            torch.Tensor: Encoded text embeddings tensor

        Examples:
            >>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=32)
            >>> texts = ["Hello world", "Another text"]
            >>> embeddings = encoder.encode(texts)
        """
        return super().encode(self.add_eos(text), *args, **kwargs)

encode(text, *args, **kwargs)

Encode a list of text strings into embeddings with added EOS token.

This method adds an EOS (end of sequence) token to each text string before encoding.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `list[str]` | List of text strings to encode. | required |
| `*args` | `Any` | Additional positional arguments passed to the parent `encode` method. | `()` |
| `**kwargs` | `Any` | Additional keyword arguments passed to the parent `encode` method. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `torch.Tensor` | Encoded text embeddings tensor. |

Examples:

Python Console Session
>>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=32)
>>> texts = ["Hello world", "Another text"]
>>> embeddings = encoder.encode(texts)
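
The order of operations in this override can be sketched without loading the model. The `add_eos` helper below copies the list comprehension from the class but takes the token as an argument; the `"</s>"` token and the instruction string are placeholder assumptions (the real EOS token comes from the model's tokenizer), and `with_prompt` is a hypothetical stand-in for the prompt handling that happens in the parent `encode`.

```python
def add_eos(input_examples: list[str], eos_token: str) -> list[str]:
    """Append the EOS token to each text, returning a new list."""
    return [example + eos_token for example in input_examples]


def with_prompt(texts: list[str], prompt: str) -> list[str]:
    """Stand-in for the parent encode(): the prompt goes in front (assumption)."""
    return [prompt + text for text in texts]


EOS = "</s>"  # placeholder; the class reads tokenizer.eos_token instead
INSTRUCT = "Instruct: Given a question, retrieve passages that answer it\nQuery: "

texts = ["Hello world", "Another text"]

# EOS is appended first, then the instruction is prepended,
# so each final string is: instruction + text + EOS.
prepared = with_prompt(add_eos(texts, EOS), INSTRUCT)
```

Since `add_eos` builds a new list, the caller's `texts` list is left unmodified.
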
Source code in gfmrag/text_emb_models/nv_embed.py
Python
def encode(self, text: list[str], *args: Any, **kwargs: Any) -> torch.Tensor:
    """
    Encode a list of text strings into embeddings with added EOS token.

    This method adds an EOS (end of sequence) token to each text string before encoding.

    Args:
        text (list[str]): List of text strings to encode
        *args (Any): Additional positional arguments passed to parent encode method
        **kwargs (Any): Additional keyword arguments passed to parent encode method

    Returns:
        torch.Tensor: Encoded text embeddings tensor

    Examples:
        >>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=32)
        >>> texts = ["Hello world", "Another text"]
        >>> embeddings = encoder.encode(texts)
    """
    return super().encode(self.add_eos(text), *args, **kwargs)