Extractors#
- class text_machina.src.extractors.base.Extractor(input_config, task_type, workspace={}, args={})[source]#
Bases:
ABC
Base class for an extractor.
- check_valid_args()[source]#
Checks if the arguments passed to the extractor are valid.
- Raises:
ExtractorInvalidArgs – if the arguments are invalid.
- Return type:
- extract(dataset)[source]#
Calls _extract and cleans the extracted inputs.
- Parameters:
dataset (Dataset) – A dataset to extract inputs from.
- Returns:
- A dictionary mapping each template key to a list of prompt inputs (one input per template key and example).
- Return type:
- Raises:
ExtractorEmptyColumns – if any field of the prompt_inputs is empty.
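The return contract of extract can be illustrated with a minimal sketch; the dataset contents and the template key below are hypothetical, not part of the library:

```python
# Hypothetical sketch of extract()'s return contract: a dict mapping each
# template placeholder name to one prompt input per dataset example.
dataset = [
    {"summary": "A short summary."},
    {"summary": "Another summary."},
]

def extract(dataset):
    # One list entry per example, keyed by the template placeholder name.
    return {"summary": [example["summary"] for example in dataset]}

prompt_inputs = extract(dataset)
print(prompt_inputs["summary"])  # ['A short summary.', 'Another summary.']
```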
- class text_machina.src.extractors.dummy.Dummy(input_config, task_type)[source]#
Bases:
Extractor
Dummy extractor that fills the prompt template with empty texts.
This extractor needs one template placeholder named {dummy}.
This extractor does not need specific arguments.
- class text_machina.src.extractors.auxiliary.Auxiliary(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with fields from a dataset.
This extractor needs at least one template placeholder, named with the name of a field from the dataset, e.g., {summary}.
This extractor does not need specific arguments.
- class text_machina.src.extractors.entity_list.EntityList(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with entities extracted from a text column in the dataset.
This extractor needs a template placeholder named {entities}.
This extractor does not need specific arguments.
- text_machina.src.extractors.entity_list.extract_entities(processed_text)[source]#
Extracts entities from a Spacy doc.
- Parameters:
processed_text (Doc) – Spacy doc.
- Returns:
named entities in the doc.
- Return type:
Set[str]
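The logic of extract_entities can be sketched as follows. Tiny stub classes stand in for the spaCy Doc so the example runs without downloading a model; the stubs are illustrative only, not part of the library:

```python
from typing import Set

class FakeEnt:
    """Stand-in for a spaCy entity span, which exposes a .text attribute."""
    def __init__(self, text: str):
        self.text = text

class FakeDoc:
    """Stand-in for a spaCy Doc, which exposes its entities via .ents."""
    def __init__(self, ents):
        self.ents = ents

def extract_entities(processed_text) -> Set[str]:
    # Collect the unique surface forms of the named entities in the doc.
    return {ent.text for ent in processed_text.ents}

doc = FakeDoc([FakeEnt("Paris"), FakeEnt("UNESCO"), FakeEnt("Paris")])
print(sorted(extract_entities(doc)))  # ['Paris', 'UNESCO']
```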
- class text_machina.src.extractors.noun_list.NounList(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with noun-phrases extracted from a text column in the dataset.
This extractor needs a template placeholder named {nouns}.
This extractor does not need specific arguments.
- text_machina.src.extractors.noun_list.extract_nouns(processed_text)[source]#
Extracts noun chunks from a Spacy doc.
- Parameters:
processed_text (Doc) – Spacy doc.
- Returns:
noun chunks in the doc.
- Return type:
Set[str]
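The noun-chunk variant works the same way, reading the Doc's noun_chunks iterator instead of its entities; again, the stub classes below are only stand-ins so the sketch runs without a spaCy model:

```python
from typing import Set

class FakeChunk:
    """Stand-in for a spaCy noun-chunk span, exposing a .text attribute."""
    def __init__(self, text: str):
        self.text = text

class FakeDoc:
    """Stand-in for a spaCy Doc, exposing .noun_chunks as an iterator."""
    def __init__(self, chunks):
        self._chunks = chunks

    @property
    def noun_chunks(self):
        return iter(self._chunks)

def extract_nouns(processed_text) -> Set[str]:
    # Collect the unique noun phrases (noun chunks) in the doc.
    return {chunk.text for chunk in processed_text.noun_chunks}

doc = FakeDoc([FakeChunk("the red car"), FakeChunk("a tree"), FakeChunk("the red car")])
print(sorted(extract_nouns(doc)))  # ['a tree', 'the red car']
```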
- class text_machina.src.extractors.sentence_prefix.SentencePrefix(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a sentence prefix extracted from a text column of a dataset.
This extractor needs a template placeholder named {sentences}.
This extractor allows passing the following arguments in the extractor_args field of the config:
- k (int): number of sentences in the prefix. If not specified, k is random for each sample.
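A minimal sketch of the prefix logic, assuming the text has already been sentence-split; the helper name is hypothetical, not the library's implementation:

```python
import random

def sentence_prefix(sentences, k=None):
    # Keep the first k sentences; if k is not given, sample it per example.
    if k is None:
        k = random.randint(1, len(sentences))
    return " ".join(sentences[:k])

print(sentence_prefix(["First.", "Second.", "Third."], k=2))  # First. Second.
```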
- class text_machina.src.extractors.word_prefix.WordPrefix(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a word prefix extracted from a text column of a dataset.
This extractor needs a template placeholder named {words}.
This extractor allows passing the following arguments in the extractor_args field of the config:
- k (int): number of words in the prefix. If not specified, k is random for each sample.
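The word-prefix variant is analogous, splitting on whitespace instead of sentence boundaries; again a hypothetical sketch, not the library's code:

```python
import random

def word_prefix(text, k=None):
    # Keep the first k whitespace-separated words of the text.
    words = text.split()
    if k is None:
        k = random.randint(1, len(words))
    return " ".join(words[:k])

print(word_prefix("one two three four", k=3))  # one two three
```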
- class text_machina.src.extractors.combined.Combined(input_config, task_type)[source]#
Bases:
Extractor
Extractor that combines multiple extractors.
This extractor does not need specific template placeholders, just the placeholders of the extractors being combined.
This extractor does not need specific arguments, just the arguments for the extractors being combined.
- class text_machina.src.extractors.sentence_gap.SentenceGap(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a boundary of two sentences (left-side and right-side of a sampled sentence), and with the number of sentences the LLM has to generate in between the boundary sentences.
- This extractor needs two template placeholders:
- {n}: will be filled with the number of sentences to generate
between the boundary sentences.
- {boundaries}: will be filled with the boundary sentences separated by the gap token and newlines, e.g., "sentence 1.\n____\nsentence 2."
This extractor allows passing the following arguments in the extractor_args field of the config:
- gap_token (str): gap token, e.g., "____".
- max_percentage_boundaries (float): max percentage of boundaries to sample from a text. In a text of N sentences, there are N-1 possible boundaries of two sentences.
- max_sentence_span (int): max number of sentences to be generated between the boundary sentences.
- check_valid_args()[source]#
Checks if the arguments passed to the extractor are valid.
- Raises:
ExtractorInvalidArgs – if the arguments are invalid.
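The boundary construction can be sketched as below. For determinism, start and span are shown as explicit parameters, whereas the real extractor samples them under max_percentage_boundaries and max_sentence_span; all names here are illustrative:

```python
import random

def make_sentence_gap(sentences, gap_token="____", start=None, span=1):
    # Remove `span` sentences starting at `start`; the sentences on either
    # side of the removed span become the boundary.
    if start is None:
        start = random.randint(1, len(sentences) - 1 - span)
    left, right = sentences[start - 1], sentences[start + span]
    return {"n": span, "boundaries": f"{left}\n{gap_token}\n{right}"}

result = make_sentence_gap(["A.", "B.", "C.", "D."], start=1, span=2)
print(result["n"], repr(result["boundaries"]))  # 2 'A.\n____\nD.'
```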
- class text_machina.src.extractors.sentence_masking.SentenceMasking(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a text containing masked sentences; the LLM has to generate all the masked sentences.
- This extractor needs one template placeholder:
{masked_text}: will be filled with a text with masked sentences.
This extractor allows passing the following arguments in the extractor_args field of the config:
- mask_token (str): mask token, e.g., "MASK". Several masks in a text will be suffixed with an index, e.g., "MASK-0".
- percentage_range (List[float]): range delimiting the percentage of sentences to be masked. At least one sentence will always be masked.
- check_valid_args()[source]#
Checks if the arguments passed to the extractor are valid.
- Raises:
ExtractorInvalidArgs – if the arguments are invalid.
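A sketch of the sentence-masking logic. For determinism, masked_idx is exposed as a parameter, while the real extractor samples it from percentage_range; the names below are illustrative, not the library's API:

```python
import random

def mask_sentences(sentences, mask_token="MASK", percentage_range=(0.2, 0.5),
                   masked_idx=None):
    # Replace a sampled subset of sentences with indexed mask tokens.
    if masked_idx is None:
        pct = random.uniform(*percentage_range)
        n_masked = max(1, int(len(sentences) * pct))  # at least one mask
        masked_idx = sorted(random.sample(range(len(sentences)), n_masked))
    out, targets = [], []
    for i, sentence in enumerate(sentences):
        if i in masked_idx:
            out.append(f"{mask_token}-{len(targets)}")  # MASK-0, MASK-1, ...
            targets.append(sentence)
        else:
            out.append(sentence)
    return " ".join(out), targets

print(mask_sentences(["A.", "B.", "C."], masked_idx=[1]))
# ('A. MASK-0 C.', ['B.'])
```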
- class text_machina.src.extractors.sentence_rewriting.SentenceRewriting(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a sentence that has to be rewritten by an LLM.
- This extractor needs one template placeholder:
{sentence}: will be filled with the sentence to be rewritten.
This extractor allows passing the following arguments in the extractor_args field of the config:
- percentage_range (List[float]): range delimiting the percentage of sentences to be rewritten. At least one sentence will always be rewritten.
- check_valid_args()[source]#
Checks if the arguments passed to the extractor are valid.
- Raises:
ExtractorInvalidArgs – if the arguments are invalid.
- class text_machina.src.extractors.word_gap.WordGap(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a boundary of two word spans (left-side and right-side of a sampled word), and with the number of words the LLM has to generate in between the boundary word spans.
- This extractor needs two template placeholders:
- {n}: will be filled with the number of words to generate
between the boundary words.
- {boundaries}: will be filled with the boundary words separated by the gap token and newlines. E.g., "words1 ____ words2"
This extractor allows passing the following arguments in the extractor_args field of the config:
- gap_token (str): gap token, e.g., "____".
- max_percentage_boundaries (float): max percentage of boundaries to sample from a text. In a text of N words, there are N-1 possible boundaries of two word spans.
- max_word_span (int): max number of words to be generated between the boundary words.
- range_boundary_size (List[float, float]): range from which to sample the length of the word spans in the boundaries.
- check_valid_args()[source]#
Checks if the arguments passed to the extractor are valid.
- Raises:
ExtractorInvalidArgs – if the arguments are invalid.
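The word-gap construction adds a boundary size: spans of boundary_size words on each side of the removed words form the boundary. The sketch below takes explicit parameters for determinism, whereas the real extractor samples them from the configured ranges; all names are illustrative:

```python
import random

def make_word_gap(words, gap_token="____", start=None, span=1, boundary_size=2):
    # Remove `span` words starting at `start`; `boundary_size` words on each
    # side of the removed span become the boundary word spans.
    if start is None:
        start = random.randint(boundary_size, len(words) - span - boundary_size)
    left = " ".join(words[start - boundary_size:start])
    right = " ".join(words[start + span:start + span + boundary_size])
    return {"n": span, "boundaries": f"{left} {gap_token} {right}"}

print(make_word_gap("a b c d e f g".split(), start=2, span=2))
# {'n': 2, 'boundaries': 'a b ____ e f'}
```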
- class text_machina.src.extractors.word_masking.WordMasking(input_config, task_type)[source]#
Bases:
Extractor
Extractor that fills the prompt template with a text containing masked word spans; the LLM has to generate all the masked word spans.
- This extractor needs one template placeholder:
{masked_text}: will be filled with a text with masked word spans.
This extractor allows passing the following arguments in the extractor_args field of the config:
- mask_token (str): mask token, e.g., "MASK". Several masks in a text will be suffixed with an index, e.g., "MASK-0".
- percentage_range (List[float]): range delimiting the percentage of word spans to be masked. At least one word span will always be masked.
- span_length_range (List[int]): range from which to sample the length of each masked span.
- check_valid_args()[source]#
Checks if the arguments passed to the extractor are valid.
- Raises:
ExtractorInvalidArgs – if the arguments are invalid.
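Word masking operates on word spans rather than whole sentences. The sketch below takes explicit (start, length) spans for determinism, while the real extractor samples them from percentage_range and span_length_range; the function name is hypothetical:

```python
def mask_word_spans(words, mask_token="MASK", spans=()):
    # spans: iterable of (start, length) word spans to replace with masks.
    starts = {start: length for start, length in spans}
    out, targets, i = [], [], 0
    while i < len(words):
        if i in starts:
            out.append(f"{mask_token}-{len(targets)}")  # MASK-0, MASK-1, ...
            targets.append(" ".join(words[i:i + starts[i]]))
            i += starts[i]
        else:
            out.append(words[i])
            i += 1
    return " ".join(out), targets

print(mask_word_spans("a b c d e".split(), spans=[(1, 2)]))
# ('a MASK-0 d e', ['b c'])
```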
- text_machina.src.extractors.utils.clean_inputs(texts)[source]#
Removes special symbols from the texts used as prompt inputs, to avoid breaking the format of typical prompt templates.
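A minimal sketch of what such cleaning might do, assuming braces and stray whitespace are among the symbols that break prompt formatting; the actual symbol set handled by the library may differ:

```python
def clean_inputs(texts):
    # Drop curly braces (which collide with template placeholders) and
    # collapse runs of whitespace/newlines into single spaces.
    cleaned = []
    for text in texts:
        text = text.replace("{", "").replace("}", "")
        cleaned.append(" ".join(text.split()))
    return cleaned

print(clean_inputs(["a {b}\nc", "plain text"]))  # ['a b c', 'plain text']
```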
- text_machina.src.extractors.utils.get_spacy_model(language)[source]#
Gets or downloads a Spacy model.
- Parameters:
language (str) – language.
- Returns:
a Spacy model.
- Return type:
spacy.lang