Text Embedding Models
gfmrag.text_emb_models
BaseTextEmbModel
A base class for text embedding models using SentenceTransformer.
This class provides functionality to encode text into embeddings using various SentenceTransformer models with configurable parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text_emb_model_name | str | Name or path of the SentenceTransformer model to use | required |
normalize | bool | Whether to L2-normalize the embeddings. Defaults to False. | False |
batch_size | int | Batch size for encoding. Defaults to 32. | 32 |
query_instruct | str \| None | Instruction/prompt to prepend to queries. Defaults to None. | None |
passage_instruct | str \| None | Instruction/prompt to prepend to passages. Defaults to None. | None |
model_kwargs | dict \| None | Additional keyword arguments for the model. Defaults to None. | None |
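The effect of the `normalize` flag can be illustrated without loading a model. The following is a minimal sketch in plain Python, not the library's implementation; the helper names `l2_normalize` and `dot` are hypothetical and exist only for this illustration. It shows that once vectors are scaled to unit length, their dot product equals their cosine similarity.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length (hypothetical helper).

    A zero vector has no direction, so it is returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Two toy "embeddings" pointing in the same direction with different lengths.
query = l2_normalize([3.0, 4.0])
passage = l2_normalize([6.0, 8.0])

# After normalization the dot product is the cosine similarity.
similarity = dot(query, passage)
```

With `normalize=False`, the raw dot product would also depend on the vector lengths, so downstream code would need to compute cosine similarity explicitly if that is the intended metric.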
Attributes:
Name | Type | Description |
---|---|---|
text_emb_model | SentenceTransformer | The underlying SentenceTransformer model |
text_emb_model_name | str | Name of the model being used |
normalize | bool | Whether embeddings are L2-normalized |
batch_size | int | Batch size used for encoding |
query_instruct | str \| None | Instruction text for queries |
passage_instruct | str \| None | Instruction text for passages |
model_kwargs | dict \| None | Additional model configuration parameters |
Methods:
Name | Description |
---|---|
encode | encode(text: list[str], is_query: bool = False, show_progress_bar: bool = True) -> torch.Tensor: Encodes a list of texts into embeddings. |
Source code in gfmrag/text_emb_models/base_model.py
import torch
from sentence_transformers import SentenceTransformer


class BaseTextEmbModel:
    """A base class for text embedding models using SentenceTransformer.

    This class provides functionality to encode text into embeddings using various
    SentenceTransformer models with configurable parameters.

    Args:
        text_emb_model_name (str): Name or path of the SentenceTransformer model to use
        normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
        batch_size (int, optional): Batch size for encoding. Defaults to 32.
        query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
        passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
        model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.

    Attributes:
        text_emb_model (SentenceTransformer): The underlying SentenceTransformer model
        text_emb_model_name (str): Name of the model being used
        normalize (bool): Whether embeddings are L2-normalized
        batch_size (int): Batch size used for encoding
        query_instruct (str | None): Instruction text for queries
        passage_instruct (str | None): Instruction text for passages
        model_kwargs (dict | None): Additional model configuration parameters

    Methods:
        encode(text: list[str], is_query: bool = False, show_progress_bar: bool = True) -> torch.Tensor:
            Encodes a list of texts into embeddings.
    """

    def __init__(
        self,
        text_emb_model_name: str,
        normalize: bool = False,
        batch_size: int = 32,
        query_instruct: str | None = None,
        passage_instruct: str | None = None,
        model_kwargs: dict | None = None,
    ) -> None:
        """
        Initialize the BaseTextEmbModel.

        Args:
            text_emb_model_name (str): Name or path of the SentenceTransformer model to use
            normalize (bool, optional): Whether to L2-normalize the embeddings. Defaults to False.
            batch_size (int, optional): Batch size for encoding. Defaults to 32.
            query_instruct (str | None, optional): Instruction/prompt to prepend to queries. Defaults to None.
            passage_instruct (str | None, optional): Instruction/prompt to prepend to passages. Defaults to None.
            model_kwargs (dict | None, optional): Additional keyword arguments for the model. Defaults to None.
        """
        self.text_emb_model_name = text_emb_model_name
        self.normalize = normalize
        self.batch_size = batch_size
        self.query_instruct = query_instruct
        self.passage_instruct = passage_instruct
        self.model_kwargs = model_kwargs
        # Load the model; trust_remote_code allows models that ship custom code.
        self.text_emb_model = SentenceTransformer(
            self.text_emb_model_name,
            trust_remote_code=True,
            model_kwargs=self.model_kwargs,
        )

    def encode(
        self, text: list[str], is_query: bool = False, show_progress_bar: bool = True
    ) -> torch.Tensor:
        """
        Encodes a list of text strings into embeddings using the text embedding model.

        Args:
            text (list[str]): List of text strings to encode
            is_query (bool, optional): Whether the text is a query (True) or passage (False).
                Determines which instruction prompt to use. Defaults to False.
            show_progress_bar (bool, optional): Whether to display progress bar during encoding.
                Defaults to True.

        Returns:
            torch.Tensor: Tensor containing the encoded embeddings for the input text

        Examples:
            >>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
            >>> text = ["Hello, world!", "This is a test."]
            >>> embeddings = text_emb_model.encode(text)
        """
        return self.text_emb_model.encode(
            text,
            # Use the GPU when one is available, otherwise fall back to the CPU.
            device="cuda" if torch.cuda.is_available() else "cpu",
            normalize_embeddings=self.normalize,
            batch_size=self.batch_size,
            # Queries and passages can each carry their own instruction prompt.
            prompt=self.query_instruct if is_query else self.passage_instruct,
            show_progress_bar=show_progress_bar,
            convert_to_tensor=True,
        )
__init__(text_emb_model_name, normalize=False, batch_size=32, query_instruct=None, passage_instruct=None, model_kwargs=None)
Initialize the BaseTextEmbModel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text_emb_model_name | str | Name or path of the SentenceTransformer model to use | required |
normalize | bool | Whether to L2-normalize the embeddings. Defaults to False. | False |
batch_size | int | Batch size for encoding. Defaults to 32. | 32 |
query_instruct | str \| None | Instruction/prompt to prepend to queries. Defaults to None. | None |
passage_instruct | str \| None | Instruction/prompt to prepend to passages. Defaults to None. | None |
model_kwargs | dict \| None | Additional keyword arguments for the model. Defaults to None. | None |
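To make the role of `batch_size` concrete, here is a minimal sketch of splitting a list of texts into consecutive batches. It is an illustration only: `iter_batches` is a hypothetical helper, and the underlying library's own batching may differ in detail (for example, in how it orders inputs before batching).

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` texts (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# 70 toy passages with the default batch size of 32.
texts = [f"passage {i}" for i in range(70)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=32)]
# batch_lengths is [32, 32, 6]: two full batches and one partial batch.
```

A larger batch size generally means fewer forward passes at the cost of more memory per pass, so the right value depends on the model and the hardware.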
encode(text, is_query=False, show_progress_bar=True)
Encodes a list of text strings into embeddings using the text embedding model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | list[str] | List of text strings to encode | required |
is_query | bool | Whether the text is a query (True) or passage (False). Determines which instruction prompt to use. Defaults to False. | False |
show_progress_bar | bool | Whether to display progress bar during encoding. Defaults to True. | True |
Returns:
Type | Description |
---|---|
Tensor | Tensor containing the encoded embeddings for the input text |
Examples:
>>> text_emb_model = BaseTextEmbModel("sentence-transformers/all-mpnet-base-v2")
>>> text = ["Hello, world!", "This is a test."]
>>> embeddings = text_emb_model.encode(text)
NVEmbedV2
Bases: BaseTextEmbModel
A text embedding model class that extends BaseTextEmbModel specifically for Nvidia models.
This class customizes the base embedding model by:
1. Setting a larger max sequence length of 32768
2. Setting right-side padding
3. Adding EOS tokens to input text
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text_emb_model_name | str | Name or path of the text embedding model | required |
normalize | bool | Whether to normalize the output embeddings | required |
batch_size | int | Batch size for processing | required |
query_instruct | str | Instruction prefix for query texts. Defaults to "". | '' |
passage_instruct | str | Instruction prefix for passage texts. Defaults to "". | '' |
model_kwargs | dict \| None | Additional keyword arguments for model initialization. Defaults to None. | None |
Methods:
Name | Description |
---|---|
add_eos | Adds EOS token to each input example |
encode | Encodes text by first adding EOS tokens then calling parent encode method |
Attributes:
Name | Type | Description |
---|---|---|
text_emb_model | SentenceTransformer | The underlying text embedding model with customized max_seq_length and padding_side |
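The two settings named above, `max_seq_length` and right-side padding, act at tokenization time. The following toy sketch shows what they mean for a batch of token ID lists. It is not the tokenizer's implementation: `pad_right`, the token IDs, and the pad ID of 0 are all hypothetical.

```python
def pad_right(
    batch: list[list[int]], pad_id: int, max_seq_length: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Truncate to max_seq_length, then pad on the right to a common width.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding. Hypothetical helper for illustration only.
    """
    truncated = [ids[:max_seq_length] for ids in batch]
    width = max((len(ids) for ids in truncated), default=0)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (width - len(ids)) for ids in truncated
    ]
    return input_ids, attention_mask


# Toy token IDs; 32768 mirrors the max sequence length set by NVEmbedV2.
input_ids, attention_mask = pad_right([[5, 6, 7], [8]], pad_id=0, max_seq_length=32768)
# input_ids      is [[5, 6, 7], [8, 0, 0]]
# attention_mask is [[1, 1, 1], [1, 0, 0]]
```

With right-side padding the real tokens always start at position 0 and the padding follows them, whereas left-side padding would place the padding first.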
Source code in gfmrag/text_emb_models/nv_embed.py
from typing import Any

import torch


class NVEmbedV2(BaseTextEmbModel):
    """A text embedding model class that extends BaseTextEmbModel specifically for Nvidia models.

    This class customizes the base embedding model by:
    1. Setting a larger max sequence length of 32768
    2. Setting right-side padding
    3. Adding EOS tokens to input text

    Args:
        text_emb_model_name (str): Name or path of the text embedding model
        normalize (bool): Whether to normalize the output embeddings
        batch_size (int): Batch size for processing
        query_instruct (str, optional): Instruction prefix for query texts. Defaults to "".
        passage_instruct (str, optional): Instruction prefix for passage texts. Defaults to "".
        model_kwargs (dict | None, optional): Additional keyword arguments for model initialization. Defaults to None.

    Methods:
        add_eos: Adds EOS token to each input example
        encode: Encodes text by first adding EOS tokens then calling parent encode method

    Attributes:
        text_emb_model: The underlying text embedding model with customized max_seq_length and padding_side
    """

    def __init__(
        self,
        text_emb_model_name: str,
        normalize: bool,
        batch_size: int,
        query_instruct: str = "",
        passage_instruct: str = "",
        model_kwargs: dict | None = None,
    ) -> None:
        super().__init__(
            text_emb_model_name,
            normalize,
            batch_size,
            query_instruct,
            passage_instruct,
            model_kwargs,
        )
        # Allow long inputs and pad on the right side of each sequence.
        self.text_emb_model.max_seq_length = 32768
        self.text_emb_model.tokenizer.padding_side = "right"

    def add_eos(self, input_examples: list[str]) -> list[str]:
        # Append the tokenizer's EOS token to every input string.
        input_examples = [
            input_example + self.text_emb_model.tokenizer.eos_token
            for input_example in input_examples
        ]
        return input_examples

    def encode(self, text: list[str], *args: Any, **kwargs: Any) -> torch.Tensor:
        """
        Encode a list of text strings into embeddings with added EOS token.

        This method adds an EOS (end of sequence) token to each text string before encoding.

        Args:
            text (list[str]): List of text strings to encode
            *args (Any): Additional positional arguments passed to parent encode method
            **kwargs (Any): Additional keyword arguments passed to parent encode method

        Returns:
            torch.Tensor: Encoded text embeddings tensor

        Examples:
            >>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=8)
            >>> texts = ["Hello world", "Another text"]
            >>> embeddings = encoder.encode(texts)
        """
        return super().encode(self.add_eos(text), *args, **kwargs)
encode(text, *args, **kwargs)
Encode a list of text strings into embeddings with added EOS token.
This method adds an EOS (end of sequence) token to each text string before encoding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | list[str] | List of text strings to encode | required |
*args | Any | Additional positional arguments passed to parent encode method | () |
**kwargs | Any | Additional keyword arguments passed to parent encode method | {} |
Returns:
Type | Description |
---|---|
Tensor | Encoded text embeddings tensor |
Examples:
>>> encoder = NVEmbedV2("nvidia/NV-Embed-v2", normalize=True, batch_size=8)
>>> texts = ["Hello world", "Another text"]
>>> embeddings = encoder.encode(texts)
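The override pattern used here (append EOS, then forward all remaining arguments to the parent) can be exercised without loading a model. The sketch below uses stand-in classes: `StubBase`, `StubNVEmbed`, and the `"</s>"` EOS string are hypothetical, and the stub parent returns the arguments it received instead of embeddings.

```python
from typing import Any

EOS_TOKEN = "</s>"  # hypothetical; the real token comes from the model's tokenizer


class StubBase:
    """Stand-in for the parent class; echoes its arguments instead of embedding."""

    def encode(
        self, text: list[str], is_query: bool = False, show_progress_bar: bool = True
    ) -> dict[str, Any]:
        return {
            "text": text,
            "is_query": is_query,
            "show_progress_bar": show_progress_bar,
        }


class StubNVEmbed(StubBase):
    """Mirrors the add_eos/encode structure of the subclass documented above."""

    def add_eos(self, input_examples: list[str]) -> list[str]:
        return [example + EOS_TOKEN for example in input_examples]

    def encode(self, text: list[str], *args: Any, **kwargs: Any) -> dict[str, Any]:
        # Only the text is modified; every other argument is forwarded as given.
        return super().encode(self.add_eos(text), *args, **kwargs)


result = StubNVEmbed().encode(["Hello world"], is_query=True, show_progress_bar=False)
```

Because the subclass accepts `*args` and `**kwargs`, callers can keep passing `is_query` and `show_progress_bar` exactly as they would to the parent `encode`.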