⛏️ Extractors#

class text_machina.src.extractors.base.Extractor(input_config, task_type, workspace={}, args={})[source]#

Bases: ABC

Base class for an extractor.

check_valid_args()[source]#

Checks if the arguments passed to the extractor are valid.

Raises:

ExtractorInvalidArgs – if the arguments are invalid.

Return type:

None

extract(dataset)[source]#

Calls _extract and cleans the extracted inputs.

Parameters:

dataset (Dataset) – A dataset to extract inputs from.

Returns:

A dictionary mapping each template key to a list of prompt inputs (one input per template key and example).

Return type:

Dict[str, List[str]]

Raises:

ExtractorEmptyColumns – if any field of the prompt_inputs is empty.

prepare_human(human_texts)[source]#

Prepares the human texts. Some extractors may need to modify the human texts according to the extractions, e.g., removing prefixes from texts to ensure that generations and human texts are continuations of the same prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]
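
A minimal sketch of a custom extractor, assuming subclasses implement the _extract hook that extract() calls (the hook name comes from extract()'s docstring above; everything else here is illustrative and does not use text_machina's real base class):

    from typing import Dict, List

    class FirstLineExtractor:
        # Hypothetical extractor: returns a mapping from each template
        # placeholder to one prompt input per example, as described in
        # extract()'s docstring above.
        def _extract(self, dataset) -> Dict[str, List[str]]:
            return {"first_line": [text.split("\n")[0] for text in dataset["text"]]}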

class text_machina.src.extractors.dummy.Dummy(input_config, task_type)[source]#

Bases: Extractor

Dummy extractor that fills the prompt template with empty texts.

This extractor needs one template placeholder named {dummy}.

This extractor does not need specific arguments.

class text_machina.src.extractors.auxiliary.Auxiliary(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with fields from a dataset.

This extractor needs at least one template placeholder, named after a field of the dataset, e.g., {summary}.

This extractor does not need specific arguments.
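
As an illustration, with a hypothetical dataset field named summary, the {summary} placeholder would be filled per example roughly as follows (the template text is made up):

    # Auxiliary fills placeholders named after dataset fields.
    template = "Write a news article whose summary is: {summary}"
    prompt = template.format(summary="Local council approves new bike lanes.")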

class text_machina.src.extractors.entity_list.EntityList(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with entities extracted from a text column in the dataset.

This extractor needs a template placeholder named {entities}.

This extractor does not need specific arguments.

text_machina.src.extractors.entity_list.extract_entities(processed_text)[source]#

Extracts entities from a Spacy doc.

Parameters:

processed_text (Doc) – Spacy doc.

Returns:

named entities in the doc.

Return type:

Set[str]
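
A minimal sketch in the spirit of extract_entities, using Spacy's standard doc.ents attribute (the model name is an assumption):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model; any NER-capable model works
    doc = nlp("Apple opened a new office in Madrid.")
    entities = {ent.text for ent in doc.ents}  # e.g., {"Apple", "Madrid"}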

class text_machina.src.extractors.noun_list.NounList(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with noun-phrases extracted from a text column in the dataset.

This extractor needs a template placeholder named {nouns}.

This extractor does not need specific arguments.

text_machina.src.extractors.noun_list.extract_nouns(processed_text)[source]#

Extracts noun chunks from a Spacy doc.

Parameters:

processed_text (Doc) – Spacy doc.

Returns:

noun chunks in the doc.

Return type:

Set[str]
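
A minimal sketch in the spirit of extract_nouns; doc.noun_chunks is standard Spacy API and requires a model with a dependency parser:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model with a parser
    doc = nlp("The quick brown fox jumped over the lazy dog.")
    noun_chunks = {chunk.text for chunk in doc.noun_chunks}
    # e.g., {"The quick brown fox", "the lazy dog"}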

class text_machina.src.extractors.sentence_prefix.SentencePrefix(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a sentence prefix extracted from a text column of a dataset.

This extractor needs a template placeholder named {sentences}.

This extractor accepts the following arguments in the extractor_args field of the config:

  • k (int): number of sentences in the prefix. If not specified, k will be random for each sample (see the sketch below).
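
A sketch of the k-sentence prefix behavior described above (illustrative; the actual implementation may differ):

    import random
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model with sentence segmentation
    doc = nlp("First sentence. Second sentence. Third sentence.")
    sents = [sent.text for sent in doc.sents]
    k = random.randint(1, len(sents))  # random when k is not configured
    prefix = " ".join(sents[:k])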

prepare_human(human_texts)[source]#

For detection and attribution tasks, removes the extracted prefix from human texts to ensure both generations and human texts are continuations of sentence prefixes.

For boundary tasks (human followed by generated), returns the prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]
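
A sketch of the prefix-removal behavior described above (illustrative, not the actual implementation):

    def remove_prefix(human_text: str, prefix: str) -> str:
        # Drop the extracted prefix so the rest of the human text is a
        # continuation of the same prefix as the generations.
        if human_text.startswith(prefix):
            return human_text[len(prefix):].lstrip()
        return human_text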

class text_machina.src.extractors.word_prefix.WordPrefix(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a word prefix extracted from a text column of a dataset.

This extractor needs a template placeholder named {words}.

This extractor accepts the following arguments in the extractor_args field of the config:

  • k (int): number of words in the prefix. If not specified, k will be random for each sample (see the sketch below).
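
A sketch of the k-word prefix behavior described above (illustrative):

    import random

    words = "the quick brown fox jumped over the lazy dog".split()
    k = random.randint(1, len(words))  # random when k is not configured
    prefix = " ".join(words[:k])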

prepare_human(human_texts)[source]#

For detection and attribution tasks, removes the extracted prefix from human texts to ensure both generations and human texts are continuations of word prefixes.

For boundary tasks (human followed by generated), returns the prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]

class text_machina.src.extractors.combined.Combined(input_config, task_type)[source]#

Bases: Extractor

Extractor that combines multiple extractors.

This extractor does not need specific template placeholders, just the placeholders of the extractors being combined.

This extractor does not need specific arguments, just the arguments for the extractors being combined.

class text_machina.src.extractors.sentence_gap.SentenceGap(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a boundary of two sentences (the left and right sides of a sampled sentence), and with the number of sentences the LLM has to generate between the boundary sentences.

This extractor needs two template placeholders:
  • {n}: will be filled with the number of sentences to generate between the boundary sentences.

  • {boundaries}: will be filled with the boundary sentences separated by the gap token and newlines. E.g., "sentence 1.\n____\nsentence 2."

This extractor accepts the following arguments in the extractor_args field of the config (see the illustration after this list):

  • gap_token (str): gap token, e.g., "____"

  • max_percentage_boundaries (float): max percentage of boundaries to sample from a text. In a text of N sentences, there will be N-1 possible boundaries of two sentences.

  • max_sentence_span (int): max number of sentences to be generated between the boundary sentences.
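
An illustration of how the two placeholders might be filled, assuming the gap token is "____" (the exact formatting is an assumption):

    gap_token = "____"
    left, right = "The sky was clear.", "Everyone went home."
    n = 2  # sentences the LLM should generate inside the gap
    boundaries = f"{left}\n{gap_token}\n{right}"
    # {n} -> 2; {boundaries} -> "The sky was clear.\n____\nEveryone went home."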

check_valid_args()[source]#

Checks if the arguments passed to the extractor are valid.

Raises:

ExtractorInvalidArgs – if the arguments are invalid.

prepare_human(human_texts)[source]#

Prepares the human texts. Some extractors may need to modify the human texts according to the extractions, e.g., removing prefixes from texts to ensure that generations and human texts are continuations of the same prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]

class text_machina.src.extractors.sentence_masking.SentenceMasking(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a text whose sentences have been masked; the LLM has to generate all the masked sentences.

This extractor needs one template placeholder:
  • {masked_text}: will be filled with a text with masked sentences.

This extractor accepts the following arguments in the extractor_args field of the config (see the sketch after this list):

  • mask_token (str): mask token, e.g., "MASK". Several masks in a text will be suffixed with the index, e.g., "MASK-0".

  • percentage_range (List[float]): range delimiting the percentage of sentences to be masked. At least one sentence will always be masked.
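
A sketch of the masking behavior described above (illustrative; the actual implementation may differ):

    import random
    from typing import List, Tuple

    def mask_sentences(
        sentences: List[str],
        mask_token: str = "MASK",
        percentage_range: Tuple[float, float] = (0.1, 0.3),
    ) -> str:
        # Mask a sampled share of sentences (always at least one),
        # indexing each mask as "MASK-0", "MASK-1", ...
        pct = random.uniform(*percentage_range)
        n_masked = max(1, int(len(sentences) * pct))
        idxs = sorted(random.sample(range(len(sentences)), n_masked))
        masked = list(sentences)
        for i, idx in enumerate(idxs):
            masked[idx] = f"{mask_token}-{i}"
        return " ".join(masked)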

check_valid_args()[source]#

Checks if the arguments passed to the extractor are valid.

Raises:

ExtractorInvalidArgs – if the arguments are invalid.

prepare_human(human_texts)[source]#

Prepares the human texts. Some extractors may need to modify the human texts according to the extractions, e.g., removing prefixes from texts to ensure that generations and human texts are continuations of the same prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]

class text_machina.src.extractors.sentence_rewriting.SentenceRewriting(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a sentence that has to be rewritten by an LLM.

This extractor needs one template placeholder:
  • {sentence}: will be filled with the sentence to be rewritten.

This extractor accepts the following arguments in the extractor_args field of the config:

  • percentage_range (List[float]): range delimiting the percentage of sentences to be rewritten. At least one sentence will always be rewritten.

check_valid_args()[source]#

Checks if the arguments passed to the extractor are valid.

Raises:

ExtractorInvalidArgs – if the arguments are invalid.

prepare_human(human_texts)[source]#

Prepares the human texts. Some extractors may need to modify the human texts according to the extractions, e.g., removing prefixes from texts to ensure that generations and human texts are continuations of the same prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]

class text_machina.src.extractors.word_gap.WordGap(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a boundary of two word spans (the left and right sides of a sampled word), and with the number of words the LLM has to generate between the boundary word spans.

This extractor needs two template placeholders:
  • {n}: will be filled with the number of words to generate between the boundary words.

  • {boundaries}: will be filled with the boundary words separated by the gap token and newlines. E.g., "words1 ____ words2"

This extractor accepts the following arguments in the extractor_args field of the config:

  • gap_token (str): gap token, e.g., "____"

  • max_percentage_boundaries (float): max percentage of boundaries to sample from a text. In a text of N words, there will be N-1 possible boundaries of two word spans.

  • max_word_span (int): max number of words to be generated between the boundary words.

  • range_boundary_size (List[float, float]): range from which to sample the length of the word spans in the boundaries.

check_valid_args()[source]#

Checks if the arguments passed to the extractor are valid.

Raises:

ExtractorInvalidArgs – if the arguments are invalid.

prepare_human(human_texts)[source]#

Prepares the human texts. Some extractors may need to modify the human texts according to the extractions, e.g., removing prefixes from texts to ensure that generations and human texts are continuations of the same prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]

class text_machina.src.extractors.word_masking.WordMasking(input_config, task_type)[source]#

Bases: Extractor

Extractor that fills the prompt template with a text with masked word spans; the LLM has to generate all the masked word spans.

This extractor needs one template placeholder:
  • {masked_text}: will be filled with a text with masked word spans.

This extractor accepts the following arguments in the extractor_args field of the config:

  • mask_token (str): mask token, e.g., "MASK". Several masks in a text will be suffixed with the index, e.g., "MASK-0".

  • percentage_range (List[float]): range delimiting the percentage of word spans to be masked. At least one word span will always be masked.

  • span_length_range (List[int]): range from which to sample the length of each masked span.

check_valid_args()[source]#

Checks if the arguments passed to the extractor are valid.

Raises:

ExtractorInvalidArgs – if the arguments are invalid.

prepare_human(human_texts)[source]#

Prepares the human texts. Some extractors may need to modify the human texts according to the extractions, e.g., removing prefixes from texts to ensure that generations and human texts are continuations of the same prefix.

Parameters:

human_texts (List[str]) – list of human texts.

Returns:

prepared human texts.

Return type:

List[str]

text_machina.src.extractors.utils.clean_inputs(texts)[source]#

Removes special symbols from the texts used as prompt inputs, to avoid breaking the format of typical prompt templates.

Parameters:

texts (List[str]) – list of texts.

Returns:

cleaned texts.

Return type:

List[str]
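
A minimal sketch of such cleaning, assuming curly braces are among the symbols that would break str.format-style templates (the exact symbol set is an assumption):

    import re
    from typing import List

    def clean_prompt_inputs(texts: List[str]) -> List[str]:
        # Strip characters that would break format-style prompt templates.
        return [re.sub(r"[{}]", "", text) for text in texts]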

text_machina.src.extractors.utils.get_spacy_model(language)[source]#

Gets or downloads a Spacy model.

Parameters:

language (str) – language.

Returns:

a Spacy model.

Return type:

spacy.lang

text_machina.src.extractors.utils.spacy_pipeline(texts, language, disable_pipes=[], n_process=4)[source]#

Processes texts with a Spacy pipeline for entity extraction.

Parameters:
  • texts (List[str]) – list of texts.

  • language (str) – language of the text.

  • disable_pipes (List[str]) – Spacy pipes to be disabled.

  • n_process (int) – number of processes.

Returns:

list of Spacy docs.

Return type:

List[spacy.tokens.Doc]
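
A minimal sketch of batch processing in the spirit of spacy_pipeline (the model name and disabled pipes are assumptions; nlp.pipe with n_process and disable is standard Spacy API):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model
    texts = ["First document.", "Second document."]
    docs = list(nlp.pipe(texts, n_process=2, disable=["lemmatizer"]))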