Tokenizers#
- class text_machina.src.tokenizers.base.Tokenizer(model_name)[source]#
Bases:
ABC
Base class for tokenizers.
- distributed_truncate(texts, max_tokens)[source]#
Truncates texts from different extractors to a maximum token length. The max_tokens budget is distributed across all the extractor keys so that, when one text from each key is included in a prompt, they sum to at most max_tokens.
Example
texts = {"summary": ["A", "B"], "headline": ["C", "D"]}
max_tokens = 256
output = {
    "summary": [truncated("A", 128), truncated("B", 128)],
    "headline": [truncated("C", 128), truncated("D", 128)],
}
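The distribution logic above can be sketched as follows. This is a minimal illustration only: it assumes whitespace tokenization as a stand-in for the provider-specific tokenizer, and the function name mirrors the method but is not the library's actual implementation.

```python
def distributed_truncate(texts, max_tokens):
    """Sketch of the budget-splitting idea behind distributed_truncate.

    Splits max_tokens evenly across the extractor keys, then truncates
    every text under each key to that per-key budget, so one text from
    each key fits in a prompt within max_tokens in total.
    """
    # Evenly divide the token budget across extractor keys.
    per_key = max_tokens // len(texts)
    truncated = {}
    for key, values in texts.items():
        # Whitespace tokenization stands in for the real tokenizer here.
        truncated[key] = [" ".join(text.split()[:per_key]) for text in values]
    return truncated


texts = {"summary": ["A B C D", "E F"], "headline": ["G H", "I"]}
result = distributed_truncate(texts, max_tokens=4)
# With two keys, each text is truncated to 4 // 2 = 2 whitespace tokens.
```

In the library itself, truncation is delegated to the concrete `Tokenizer` subclass for the target model, so token counts reflect that model's vocabulary rather than whitespace splitting.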
- class text_machina.src.tokenizers.ai21.AI21Tokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for AI21 models.
Requires the definition of the AI21_API_KEY=<key> environment variable.
- class text_machina.src.tokenizers.anthropic.AnthropicTokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for Anthropic models.
Requires the definition of the ANTHROPIC_API_KEY=<key> environment variable.
- class text_machina.src.tokenizers.azure_openai.AzureOpenAITokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for AzureOpenAI models. The tokenizer cannot be inferred from the deployment name of a model, so the GPT-4 tokenizer is used instead.
- class text_machina.src.tokenizers.bedrock.BedrockTokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for Bedrock models.
Bedrock does not offer tokenizers, so the GPT-4 tokenizer is used instead.
- class text_machina.src.tokenizers.cohere.CohereTokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for Cohere models.
Requires the definition of the COHERE_API_KEY=<key> environment variable.
- class text_machina.src.tokenizers.hf_local.HuggingFaceLocalTokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for HuggingFace models.
- class text_machina.src.tokenizers.hf_remote.HuggingFaceRemoteTokenizer(model_name)[source]#
Bases:
HuggingFaceLocalTokenizer
Tokenizer for HuggingFace models served via the Inference API or Inference Endpoints.
- class text_machina.src.tokenizers.inference_server.InferenceServerTokenizer(model_name)[source]#
Bases:
HuggingFaceLocalTokenizer
Tokenizer for models deployed on inference servers such as vLLM or TRT. This tokenizer assumes the tokenizers of the deployed models are available on the HuggingFace Hub.
- class text_machina.src.tokenizers.openai.OpenAITokenizer(model_name)[source]#
Bases:
Tokenizer
Tokenizer for OpenAI models.