🔠 Tokenizers#

class text_machina.src.tokenizers.base.Tokenizer(model_name)[source]#

Bases: ABC

Base class for tokenizers.

abstract decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

distributed_truncate(texts, max_tokens)[source]#

Truncates texts from different extractors to a maximum token length. It distributes max_tokens across all the extractor keys, so that when all of them are included in the prompt, their token lengths sum to at most max_tokens.

Example

    texts = {"summary": ["A", "B"], "headline": ["C", "D"]}
    max_tokens = 256
    output = {
        "summary": [truncated("A", 128), truncated("B", 128)],
        "headline": [truncated("C", 128), truncated("D", 128)],
    }

Parameters:
  • texts (Dict[str, List[str]]) – texts of each extractor.

  • max_tokens (int) – max length to be distributed across extractors.

Returns:

truncated texts of each extractor.

Return type:

Dict[str, List[str]]
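
A minimal usage sketch, assuming the concrete HuggingFaceLocalTokenizer documented below and the "gpt2" checkpoint (both illustrative choices):

    from text_machina.src.tokenizers.hf_local import HuggingFaceLocalTokenizer

    tokenizer = HuggingFaceLocalTokenizer(model_name="gpt2")

    texts = {
        "summary": ["a long summary ...", "another long summary ..."],
        "headline": ["a headline ...", "another headline ..."],
    }

    # The 256-token budget is split across the two extractor keys, so each
    # summary and each headline is truncated to at most 128 tokens.
    truncated = tokenizer.distributed_truncate(texts, max_tokens=256)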

abstract encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]

get_token_length(text)[source]#

Gets the token length of a text.

Parameters:

text (str) – a text.

Returns:

token length of the text.

Return type:

int

truncate_text(text, max_tokens)[source]#

Truncates a text to a maximum token length.

Parameters:
  • text (str) – a text.

  • max_tokens (int) – max token length of the text after truncating.

Returns:

truncated text.

Return type:

str

truncate_texts(texts, max_tokens)[source]#

Truncates a list of texts to a maximum token length.

Parameters:
  • texts (List[str]) – a list of texts.

  • max_tokens (int) – max token length of each text after truncating.

Returns:

list of truncated texts.

Return type:

List[str]
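
Concrete tokenizers only need to implement encode and decode; the truncation helpers above are inherited. A toy sketch under that assumption (the whitespace scheme and class name are illustrative, not part of the library):

    from typing import Dict, List

    from text_machina.src.tokenizers.base import Tokenizer


    class WhitespaceTokenizer(Tokenizer):
        """Toy tokenizer: one token id per whitespace-separated word."""

        def __init__(self, model_name: str):
            super().__init__(model_name)
            self.vocab: Dict[str, int] = {}
            self.inverse: Dict[int, str] = {}

        def encode(self, text: str) -> List[int]:
            ids = []
            for word in text.split():
                if word not in self.vocab:
                    self.inverse[len(self.vocab)] = word
                    self.vocab[word] = len(self.vocab)
                ids.append(self.vocab[word])
            return ids

        def decode(self, tokens: List[int]) -> str:
            return " ".join(self.inverse[token] for token in tokens)


    tok = WhitespaceTokenizer(model_name="whitespace")
    print(tok.get_token_length("a b c"))               # 3
    # Assuming truncate_text encodes, slices to max_tokens, and decodes:
    print(tok.truncate_text("a b c d", max_tokens=2))  # "a b"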

class text_machina.src.tokenizers.ai21.AI21Tokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for AI21 models.

Requires the definition of the AI21_API_KEY=<key> environment variable.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]
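
A minimal usage sketch (the model name is illustrative; any AI21 model with a tokenizer endpoint should work):

    import os

    from text_machina.src.tokenizers.ai21 import AI21Tokenizer

    # The tokenizer requires this environment variable.
    os.environ["AI21_API_KEY"] = "<your-key>"

    tokenizer = AI21Tokenizer(model_name="j2-mid")
    ids = tokenizer.encode("Token counting before prompting.")
    print(tokenizer.decode(ids))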

class text_machina.src.tokenizers.anthropic.AnthropicTokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for Anthropic models.

Requires the definition of the ANTHROPIC_API_KEY=<key> environment variable.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]

class text_machina.src.tokenizers.azure_openai.AzureOpenAITokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for AzureOpenAI models. The tokenizer can't be inferred from a model's deployment name, so the GPT-4 tokenizer is used instead.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]
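
A sketch of what the GPT-4 fallback means, assuming the implementation relies on tiktoken (an assumption; the docs above only name the GPT-4 tokenizer):

    import tiktoken

    # Deployment names are user-chosen, so the underlying model (and thus its
    # tokenizer) can't be derived from them; the GPT-4 encoding is used instead.
    encoding = tiktoken.encoding_for_model("gpt-4")
    ids = encoding.encode("Deployment names hide the underlying model.")
    print(encoding.decode(ids))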

class text_machina.src.tokenizers.bedrock.BedrockTokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for Bedrock models.

Bedrock does not offer tokenizers, so the GPT-4 tokenizer is used instead.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]

class text_machina.src.tokenizers.cohere.CohereTokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for Cohere models.

Requires the definition of the COHERE_API_KEY=<key> environment variable.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]

class text_machina.src.tokenizers.hf_local.HuggingFaceLocalTokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for HuggingFace models.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]
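
A minimal usage sketch ("gpt2" is an illustrative checkpoint; any HF hub model shipping a tokenizer should work):

    from text_machina.src.tokenizers.hf_local import HuggingFaceLocalTokenizer

    tokenizer = HuggingFaceLocalTokenizer(model_name="gpt2")

    ids = tokenizer.encode("Hello world")
    print(ids)                                        # e.g., [15496, 995]
    print(tokenizer.decode(ids))                      # "Hello world"
    print(tokenizer.get_token_length("Hello world"))  # 2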

class text_machina.src.tokenizers.hf_remote.HuggingFaceRemoteTokenizer(model_name)[source]#

Bases: HuggingFaceLocalTokenizer

Tokenizer for HuggingFace models served via the Inference API or Inference Endpoints.

class text_machina.src.tokenizers.inference_server.InferenceServerTokenizer(model_name)[source]#

Bases: HuggingFaceLocalTokenizer

Tokenizer for models deployed on inference servers like vLLM or TRT. This tokenizer assumes the tokenizers of the deployed models are available on the HF hub.

class text_machina.src.tokenizers.openai.OpenAITokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for OpenAI models.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]
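
A minimal usage sketch (the model name is illustrative, and the mapping from model names to tiktoken encodings is assumed from standard OpenAI tooling):

    from text_machina.src.tokenizers.openai import OpenAITokenizer

    tokenizer = OpenAITokenizer(model_name="gpt-3.5-turbo")

    text = "Counting tokens before calling the API."
    ids = tokenizer.encode(text)
    assert tokenizer.get_token_length(text) == len(ids)
    print(tokenizer.decode(ids))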

class text_machina.src.tokenizers.vertex.VertexTokenizer(model_name)[source]#

Bases: Tokenizer

Tokenizer for VertexAI models.

VertexAI does not offer tokenizers, so the GPT-4 tokenizer is used instead.

decode(tokens)[source]#

Decodes a list of token ids.

Parameters:

tokens (List[int]) – list of token ids.

Returns:

decoded text.

Return type:

str

encode(text)[source]#

Encodes a text into token ids.

Parameters:

text (str) – a text.

Returns:

list of token ids.

Return type:

List[int]