Data#
- class text_machina.src.data.PromptedDatasetBuilder(config)[source]#
Bases: object
Class to manage all the prompting steps required before generating MGT.
- build()[source]#
Prepares prefixes based on input formats for a particular domain, model and dataset.
- Returns:
a dataset with prompted and human texts.
- Return type:
- get_prompt()[source]#
Returns the prompt format to be used as input for the text generation models.
- Returns:
a prompt with template and extractor.
- Return type:
- sampling(dataset)[source]#
Samples human texts and the texts to be used for generating MGT. The same number of texts is sampled in both cases.
The human texts can either be sampled at random or be the same texts that will be used to generate MGT.
- Parameters:
dataset (Dataset) – a dataset.
- Returns:
a tuple with the human texts and the texts to be used to generate MGT.
- Return type:
Tuple[List[str], Dataset]
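The two sampling modes described above can be sketched in plain Python. This is only an illustration under stated assumptions: `sample_texts`, its `reuse_for_human` flag, and the `seed` argument are hypothetical names, and plain lists stand in for the `Dataset` that the real method works with.

```python
import random
from typing import List, Tuple


def sample_texts(
    texts: List[str], n: int, reuse_for_human: bool, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Hypothetical sketch: draw n human texts and n texts to generate MGT from."""
    rng = random.Random(seed)
    # Texts that would later be turned into prompts for the generation models.
    mgt_sources = rng.sample(texts, n)
    if reuse_for_human:
        # Mode 1: the human texts are the same ones used to generate MGT.
        human = list(mgt_sources)
    else:
        # Mode 2: the human texts are an independent random sample of the same size.
        human = rng.sample(texts, n)
    return human, mgt_sources
```

In both modes the two returned collections have the same length, matching the statement that the same amount is sampled in both cases.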
- text_machina.src.data.concatenate(paths, save_path)[source]#
Concatenates and saves a list of datasets.
- Parameters:
paths (List[Path]) – list of paths to the datasets to be concatenated.
save_path (Path) – path where the merged dataset is saved.
- Returns:
the merged dataset.
- Return type:
Dataset
- text_machina.src.data.domain_model_counts(dataset)[source]#
Computes counts for (domain, model) pairs, e.g.:
model      bloom-560m  gpt2  human  total
domain
reviews            10    10     20     40
tweets             10    10     20     40
total              20    20     40     80
- Parameters:
dataset (Dataset) – the dataset used to compute counts.
- Returns:
the (domain, model) counts.
- Return type:
pd.DataFrame
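The counting behind such a table can be sketched with the standard library alone. This is an assumption-laden illustration, not the library's implementation: `count_domain_model_pairs` is a hypothetical name, the input is a plain list of (domain, model) pairs rather than a `Dataset`, and a nested dict stands in for the `pd.DataFrame` that the documented function returns.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_domain_model_pairs(
    rows: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, int]]:
    """Hypothetical sketch: count (domain, model) pairs and add 'total' margins."""
    counts: Counter = Counter()
    for domain, model in rows:
        counts[(domain, model)] += 1       # cell count
        counts[(domain, "total")] += 1     # row margin
        counts[("total", model)] += 1      # column margin
        counts[("total", "total")] += 1    # grand total
    # Reshape the flat counter into {domain: {model: count}}.
    # Note: a real domain or model literally named "total" would collide here.
    table: Dict[str, Dict[str, int]] = {}
    for (domain, model), n in counts.items():
        table.setdefault(domain, {})[model] = n
    return table


# Rebuild the example from the table: 10 + 10 + 20 texts per domain.
rows = []
for domain in ("reviews", "tweets"):
    rows += [(domain, "bloom-560m")] * 10
    rows += [(domain, "gpt2")] * 10
    rows += [(domain, "human")] * 20

table = count_domain_model_pairs(rows)
```

With pandas available, `pd.crosstab` with `margins=True` and `margins_name="total"` yields a comparable DataFrame; whether the library computes it that way is not stated here.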
- text_machina.src.data.errors_per_model(dataset)[source]#
Computes error counts per model.
- Parameters:
dataset (Dataset) – the dataset used to compute counts.
- Returns:
the error counts per model.
- Return type:
pd.DataFrame
- text_machina.src.data.format_prompt(template, prompt_inputs)[source]#
Formats a prompt template with the prompt inputs.
- Example:
template: "Write a text using this entities: {entities}.\nText:"
prompt_inputs: {"entities": ["Jose, Areg", "Marc, Angelo"]}
output: ["Write a text using this entities: Jose, Areg.\nText:", "Write a text using this entities: Marc, Angelo.\nText:"]
- Parameters:
template (str) – the template to be formatted.
prompt_inputs (Dict[str, List[str]]) – prompt inputs from the extractors.
- Returns:
formatted templates.
- Return type:
List[str]
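The documented behaviour can be reproduced with a few lines of Python. This is a minimal sketch, assuming the template uses `str.format`-style placeholders and that every list in `prompt_inputs` has the same length; `format_prompt_sketch` is a hypothetical name, and the library's own implementation may differ.

```python
from typing import Dict, List


def format_prompt_sketch(
    template: str, prompt_inputs: Dict[str, List[str]]
) -> List[str]:
    """Hypothetical sketch: fill `template` once per position of the input lists."""
    if not prompt_inputs:
        return []
    keys = list(prompt_inputs)
    # zip walks the input lists in parallel, yielding one tuple of values per prompt.
    # If the lists differ in length, zip stops at the shortest one.
    rows = zip(*(prompt_inputs[key] for key in keys))
    return [template.format(**dict(zip(keys, row))) for row in rows]


template = "Write a text using this entities: {entities}.\nText:"
prompt_inputs = {"entities": ["Jose, Areg", "Marc, Angelo"]}
prompts = format_prompt_sketch(template, prompt_inputs)
```

Here `prompts` holds one formatted string per entry of the `entities` list, in the same order as the inputs.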
- text_machina.src.data.get_path_from_substring(path, substring)[source]#
Checks whether the name of any folder within `path` includes `substring`.
- Parameters:
path (Path) – path in which to search for folders.
substring (str) – substring to find in the folder names.
- Returns:
the path of a folder whose name includes `substring`, or None if no folder matches.
- Return type:
Optional[Path]
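A lookup of this kind can be sketched with `pathlib`. The details are assumptions: `find_folder_by_substring` is a hypothetical name, only the direct children of `path` are inspected, and the first match in sorted order is returned. The documentation above does not say how the library handles nested folders or multiple matches.

```python
from pathlib import Path
from typing import Optional


def find_folder_by_substring(path: Path, substring: str) -> Optional[Path]:
    """Hypothetical sketch: first direct subfolder whose name contains `substring`."""
    # Sort so the result is deterministic when several folders match.
    for child in sorted(path.iterdir()):
        # Regular files are ignored; only folder names are checked.
        if child.is_dir() and substring in child.name:
            return child
    return None
```

Returning `None` instead of raising lets a caller decide whether a missing folder is an error.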
- text_machina.src.data.get_save_path(config, save_dir, run_name, check_exists=False)[source]#
Constructs the path to save a dataset.
- text_machina.src.data.load_dataset_from_config(config)[source]#
Loads a dataset from disk or hub.
- Parameters:
config (InputConfig) – an input config.
- Returns:
a dataset.
- Return type:
Dataset