πŸ— Postprocessing#

text_machina.src.postprocessing.batched_map(f)[source]#

Runs a function f on a dataset with batched mapping.

Parameters:

f (Callable[[List[str]], Dict[str, List[str]]]) – the function.

Returns:

the modified function.

Return type:

Callable[[Dataset], Dataset]
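
A minimal usage sketch under assumptions (the "text" column name and the example function are hypothetical, not part of the library):

    from datasets import Dataset
    from text_machina.src.postprocessing import batched_map

    # Hypothetical cleaner with the expected signature:
    # List[str] -> Dict[str, List[str]] ("text" is an assumed column name).
    def lowercase(texts):
        return {"text": [t.lower() for t in texts]}

    dataset = Dataset.from_dict({"text": ["Hello", "WORLD"]})
    # batched_map(lowercase) is a Callable[[Dataset], Dataset].
    lowercased = batched_map(lowercase)(dataset)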

text_machina.src.postprocessing.filter_by_language(dataset, language='en')[source]#

Applies a language id filter, removing texts in undesired languages.

Parameters:

  • dataset (Dataset) – the dataset to apply the language id filter on.

  • language (str) – the language to keep; texts identified as any other language are removed.

Returns:

the filtered dataset.

Return type:

Dataset
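
A minimal sketch of keeping only English texts (the "text" column name is an assumption):

    from datasets import Dataset
    from text_machina.src.postprocessing import filter_by_language

    dataset = Dataset.from_dict(
        {"text": ["This is an English sentence.", "Esto es una frase en espaΓ±ol."]}
    )
    # Keep only texts identified as English; the Spanish row should be dropped.
    english_only = filter_by_language(dataset, language="en")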

text_machina.src.postprocessing.fix_encoding(texts)[source]#

Fixes the encoding in a list of texts.

Parameters:

texts (List[str]) – the texts to apply encoding-fixing to.

Returns:

the cleaned texts in dict form. The result is returned this way so it can be applied with batched mapping from Hugging Face datasets.

Return type:

Dict[str, List[str]]
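
A minimal sketch (the key of the returned dict is determined by the library and not shown here):

    from text_machina.src.postprocessing import fix_encoding

    # Mojibake such as "CafÃ©" should be repaired to "CafΓ©".
    cleaned = fix_encoding(["The CafÃ© was crowded."])
    # `cleaned` is a Dict[str, List[str]], ready for batched mapping.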

text_machina.src.postprocessing.get_langid_model()[source]#

Returns the fastText model used for language identification.

Return type:

a loaded fasttext.FastText model
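
A hedged usage sketch relying on the standard fastText prediction API:

    from text_machina.src.postprocessing import get_langid_model

    model = get_langid_model()
    # fastText language-id models return labels such as ("__label__en",).
    labels, probs = model.predict("This is an English sentence.")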

text_machina.src.postprocessing.postprocess(dataset, task_type)[source]#

Postprocesses a dataset.

Parameters:

  • dataset (Dataset) – the dataset to postprocess.

  • task_type – the task type of the dataset.

Returns:

the postprocessed dataset.

Return type:

Dataset
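
A hedged sketch of running the full postprocessing pipeline (the column names and the task type value are assumptions; consult your TextMachina version for the supported task types):

    from datasets import Dataset
    from text_machina.src.postprocessing import postprocess

    dataset = Dataset.from_dict(
        {"text": ["  A generated text. ", ""], "label": ["generated", "human"]}
    )
    # "detection" is an assumed task type value; it may be an enum member
    # in your TextMachina version.
    clean = postprocess(dataset, task_type="detection")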

text_machina.src.postprocessing.remove_disclosure_phrases(texts)[source]#

Removes a set of disclosure phrases.

Parameters:

texts (List[str]) – the texts from which to remove disclosure phrases.

Returns:

the cleaned texts in dict form. The result is returned this way so it can be applied with batched mapping from Hugging Face datasets.

Return type:

Dict[str, List[str]]
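
A minimal sketch (the exact phrase list is internal to the library):

    from text_machina.src.postprocessing import remove_disclosure_phrases

    # Disclosure phrases typical of LLM outputs (e.g., "As an AI language
    # model, ...") are removed from each text.
    cleaned = remove_disclosure_phrases(
        ["As an AI language model, I cannot browse the web. Here is a summary."]
    )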

text_machina.src.postprocessing.remove_empty_texts(dataset)[source]#

Removes empty texts from a dataset.

Parameters:

dataset (Dataset) – the dataset to remove empty texts from.

Returns:

the dataset with removed empty texts.

Return type:

Dataset
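
A minimal sketch (the "text" column name is an assumption):

    from datasets import Dataset
    from text_machina.src.postprocessing import remove_empty_texts

    dataset = Dataset.from_dict({"text": ["A non-empty text.", ""]})
    # The row with the empty string should be dropped.
    non_empty = remove_empty_texts(dataset)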

text_machina.src.postprocessing.remove_generation_errors(dataset)[source]#

Removes generation errors, i.e. texts marked with GENERATION_ERROR.

Parameters:

dataset (Dataset) – the dataset to filter.

Returns:

a filtered dataset with no error annotations.

Return type:

Dataset
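
A minimal sketch (how the GENERATION_ERROR marker is stored is an assumption):

    from datasets import Dataset
    from text_machina.src.postprocessing import remove_generation_errors

    dataset = Dataset.from_dict(
        {"text": ["A successful generation.", "GENERATION_ERROR"]}
    )
    # Rows marked with the GENERATION_ERROR sentinel should be filtered out.
    successful = remove_generation_errors(dataset)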

text_machina.src.postprocessing.remove_label_duplicates(dataset)[source]#

Removes texts with more than one associated label.

Parameters:

dataset (Dataset) – the dataset to remove label duplicates from.

Returns:

the dataset with removed label duplicates.

Return type:

Dataset
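
A minimal sketch (the "text" and "label" column names are assumptions):

    from datasets import Dataset
    from text_machina.src.postprocessing import remove_label_duplicates

    dataset = Dataset.from_dict(
        {"text": ["The same text.", "The same text."], "label": ["human", "generated"]}
    )
    # The same text appears under two labels, so it is ambiguous and removed.
    deduplicated = remove_label_duplicates(dataset)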

text_machina.src.postprocessing.remove_special_tokens(texts)[source]#

Removes special text generation tokens from a list of texts.

Parameters:

texts (List[str]) – the texts to apply special-token removal to.

Returns:

the cleaned texts in dict form. The result is returned this way so it can be applied with batched mapping from Hugging Face datasets.

Return type:

Dict[str, List[str]]
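
A minimal sketch (the exact set of special tokens removed is internal to the library):

    from text_machina.src.postprocessing import remove_special_tokens

    # Tokens such as "<|endoftext|>" or "</s>" are examples of special
    # generation tokens that should be stripped from the texts.
    cleaned = remove_special_tokens(["A generated text.<|endoftext|>"])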

text_machina.src.postprocessing.remove_text_duplicates(dataset)[source]#

Removes all text duplicates.

Parameters:

dataset (Dataset) – the dataset to remove duplicates from.

Returns:

the dataset with removed duplicates.

Return type:

Dataset
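
A minimal sketch (the column names are assumptions):

    from datasets import Dataset
    from text_machina.src.postprocessing import remove_text_duplicates

    dataset = Dataset.from_dict(
        {"text": ["A text.", "A text.", "Another text."], "label": ["human"] * 3}
    )
    # Only one copy of "A text." should remain.
    deduplicated = remove_text_duplicates(dataset)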

text_machina.src.postprocessing.strip(texts)[source]#

Strips whitespace from a list of texts.

Parameters:

texts (List[str]) – the texts to apply stripping to.

Returns:

the cleaned texts in dict form. The result is returned this way so it can be applied with batched mapping from Hugging Face datasets.

Return type:

Dict[str, List[str]]
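
A minimal sketch:

    from text_machina.src.postprocessing import strip

    # Leading and trailing whitespace is removed from each text.
    cleaned = strip(["  padded text \n", "\tanother one "])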

text_machina.src.postprocessing.truncate(dataset, min_length=5, min_tokens_to_truncate=2, sampling_radius_size=2.0)[source]#

Truncates texts to remove token length bias per class in each domain.

This is done by:

  1. Sampling the same number of texts per label in each domain.

  2. Sorting them by token length.

  3. Grouping them such that each group has one text per label.

  4. Truncating the texts in each group to have the same length.

  5. Truncating the remainder of step 1 to lengths between mean ± 2*std.

  6. Dropping texts with lengths < min_length.

Note that all the texts are truncated; this can be changed by setting min_tokens_to_truncate = 0.

Parameters:
  • dataset (Dataset) – the dataset to truncate.

  • min_length (int) – the minimum (spacy) token length.

  • min_tokens_to_truncate (int) – the minimum (spacy) tokens to truncate.

  • sampling_radius_size (float) – the radius size for the sampling of token lengths for non-grouped texts.

Returns:

the truncated dataset.

Return type:

Dataset
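
A hedged sketch (the "text", "label", and "domain" column names are assumptions):

    from datasets import Dataset
    from text_machina.src.postprocessing import truncate

    dataset = Dataset.from_dict(
        {
            "text": ["A fairly long generated text. " * 5, "A short human text."],
            "label": ["generated", "human"],
            "domain": ["news", "news"],
        }
    )
    # Truncate texts so token-length distributions are comparable across labels.
    balanced = truncate(dataset, min_length=5, min_tokens_to_truncate=2)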