Postprocessing
- text_machina.src.postprocessing.batched_map(f)
Runs a function f on a dataset with batched mapping.
- text_machina.src.postprocessing.filter_by_language(dataset, language='en')
Applies a language id filter, removing texts in undesired languages.
- Parameters:
dataset (Dataset) – the dataset to apply the language id filter on.
- Returns:
the filtered dataset.
- Return type:
Dataset
- text_machina.src.postprocessing.get_langid_model()
- Return type:
<module 'fasttext.FastText' from '/home/docs/checkouts/readthedocs.org/user_builds/textmachina/envs/latest/lib/python3.8/site-packages/fasttext/FastText.py'>
- text_machina.src.postprocessing.postprocess(dataset, task_type)
Postprocesses a dataset.
- Parameters:
dataset (Dataset) – the dataset to postprocess.
- Returns:
the postprocessed dataset.
- Return type:
Dataset
- text_machina.src.postprocessing.remove_disclosure_phrases(texts)
Removes a set of disclosure phrases.
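To illustrate the idea, here is a minimal sketch of phrase removal over a list of texts. The phrase list and the helper name strip_disclosure_phrases are hypothetical; the phrases that text_machina actually removes are not listed on this page.

```python
import re
from typing import List

# Hypothetical examples of disclosure phrases; not the library's actual list.
DISCLOSURE_PHRASES = [
    "As an AI language model,",
    "As a language model,",
]


def strip_disclosure_phrases(texts: List[str]) -> List[str]:
    """Remove each known phrase (case-insensitive) and tidy whitespace."""
    pattern = re.compile(
        "|".join(re.escape(p) for p in DISCLOSURE_PHRASES), flags=re.IGNORECASE
    )
    cleaned = []
    for text in texts:
        text = pattern.sub("", text)
        # Collapse the whitespace left behind by the removal.
        cleaned.append(re.sub(r"\s+", " ", text).strip())
    return cleaned
```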
- text_machina.src.postprocessing.remove_empty_texts(dataset)
Removes empty texts from a dataset.
- Parameters:
dataset (Dataset) – the dataset to remove empty texts from.
- Returns:
the dataset with removed empty texts.
- Return type:
Dataset
- text_machina.src.postprocessing.remove_generation_errors(dataset)
Removes generation errors, i.e. texts marked with GENERATION_ERROR.
- Parameters:
dataset (Dataset) – the dataset to filter.
- Returns:
a filtered dataset with no error annotations.
- Return type:
Dataset
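A minimal sketch of this kind of filter, written over plain dictionaries rather than a Dataset. The "text" column name, the helper name drop_generation_errors, and the marker's string value are assumptions; only the constant name GENERATION_ERROR comes from the description above.

```python
from typing import Dict, List

# Assumed placeholder value; the real GENERATION_ERROR constant may differ.
GENERATION_ERROR = "<generation_error>"


def drop_generation_errors(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows whose text is not the error marker."""
    return [row for row in rows if row["text"] != GENERATION_ERROR]
```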
- text_machina.src.postprocessing.remove_label_duplicates(dataset)
Removes texts with more than one associated label.
- Parameters:
dataset (Dataset) – the dataset to remove label duplicates from.
- Returns:
the dataset with removed label duplicates.
- Return type:
Dataset
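A minimal sketch of the idea, assuming rows are dictionaries with hypothetical "text" and "label" keys: a text that appears under two different labels is ambiguous, so every row carrying it is dropped. The helper name drop_label_duplicates is illustrative, not part of the library.

```python
from collections import defaultdict
from typing import Dict, List, Set


def drop_label_duplicates(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop every row whose text occurs with more than one distinct label."""
    labels_per_text: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        labels_per_text[row["text"]].add(row["label"])
    return [row for row in rows if len(labels_per_text[row["text"]]) == 1]
```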
- text_machina.src.postprocessing.remove_special_tokens(texts)
Removes special text generation tokens from a list of texts.
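A minimal sketch of token stripping over a list of texts. The token list below contains common examples chosen for illustration; it is an assumption, not the set that text_machina removes, and strip_special_tokens is a hypothetical helper name.

```python
import re
from typing import List

# Illustrative special tokens; the library's actual list is not shown here.
SPECIAL_TOKENS = ["<s>", "</s>", "<pad>", "<|endoftext|>"]
_PATTERN = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))


def strip_special_tokens(texts: List[str]) -> List[str]:
    """Delete every occurrence of a special token and trim outer whitespace."""
    return [_PATTERN.sub("", text).strip() for text in texts]
```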
- text_machina.src.postprocessing.remove_text_duplicates(dataset)
Removes all text duplicates.
- Parameters:
dataset (Dataset) – the dataset to remove duplicates from.
- Returns:
the dataset with removed duplicates.
- Return type:
Dataset
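A minimal sketch of text deduplication, assuming rows are dictionaries with a hypothetical "text" key. This version keeps the first occurrence of each text; whether the library keeps one copy or drops every copy is not stated above, so treat that choice as an assumption.

```python
from typing import Dict, List


def drop_text_duplicates(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row for each distinct text, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)
    return unique
```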
- text_machina.src.postprocessing.truncate(dataset, min_length=5, min_tokens_to_truncate=2, sampling_radius_size=2.0)
Truncates texts to remove token length bias per class in each domain.
This is done by:
1. Sampling the same number of texts per label in each domain.
2. Sorting them by token length.
3. Grouping them such that each group has one text per label.
4. Truncating the texts in each group to the same length.
5. Truncating the texts left over from step 1 to lengths between mean ± 2*std.
6. Dropping texts with lengths < min_length.
Note that all texts are truncated; this behavior can be modified with min_tokens_to_truncate = 0.
- Parameters:
- Returns:
the truncated dataset
- Return type:
Dataset
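The grouping and truncation steps (2, 3, 4 and 6) can be sketched as follows for a single domain. This is an illustration under stated assumptions, not the library's implementation: tokens are whitespace-separated words, the input already holds the same number of texts per label (step 1), and step 5 and min_tokens_to_truncate are left out. The helper name truncate_groups is hypothetical.

```python
from typing import Dict, List, Tuple


def truncate_groups(
    texts_by_label: Dict[str, List[str]], min_length: int = 5
) -> List[Tuple[str, str]]:
    """Return (label, truncated_text) pairs with matched lengths per group."""
    # Step 2: sort each label's texts by whitespace-token count (assumption).
    sorted_tokens = {
        label: sorted((text.split() for text in texts), key=len)
        for label, texts in texts_by_label.items()
    }
    labels = list(sorted_tokens)
    result = []
    # Step 3: the i-th shortest text of every label forms one group.
    for group in zip(*(sorted_tokens[label] for label in labels)):
        # Step 4: cut every text in the group to the shortest member's length.
        target = min(len(tokens) for tokens in group)
        # Step 6: skip groups whose texts would be shorter than min_length.
        if target < min_length:
            continue
        for label, tokens in zip(labels, group):
            result.append((label, " ".join(tokens[:target])))
    return result
```

Because every text in a group ends up with the same token count, a classifier cannot separate the labels in that group by length alone, which is the bias the function description refers to.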