πŸ“ Data#

class text_machina.src.data.PromptedDatasetBuilder(config)[source]#

Bases: object

Class to manage all the prompting steps required before generating MGT.

build()[source]#

Prepares prefixes based on input formats for a particular domain, model and dataset.

Returns:

a dataset with prompted and human texts.

Return type:

PromptedDataset
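
Example (a minimal usage sketch; how the Config object is built is an assumption and not part of this entry):

    from text_machina.src.data import PromptedDatasetBuilder

    # `config` is assumed to be a Config describing the domain, model and dataset.
    builder = PromptedDatasetBuilder(config)
    prompted_dataset = builder.build()  # PromptedDataset with prompted and human texts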

get_prompt()[source]#

Returns the prompt (input format) to be used as input for the text generation models.

Returns:

a prompt with template and extractor.

Return type:

Prompt
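
Example (sketch based only on the signature above; `builder` is a PromptedDatasetBuilder instance):

    prompt = builder.get_prompt()  # Prompt holding the template and its extractor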

sampling(dataset)[source]#

Samples human texts and texts to be used for generating MGT. The same number of texts is sampled in both cases.

This method allows either sampling human texts at random, or reusing the same texts that will be used to generate MGT.

Parameters:

dataset (Dataset) – a dataset.

Returns:

a tuple with the human texts and the texts to be used to generate MGT.

Return type:

Tuple[List[str], Dataset]
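
Example (a hedged sketch; it assumes `dataset` has already been loaded, e.g. with load_dataset_from_config below):

    # The same number of human texts and generation texts is sampled.
    human_texts, generation_dataset = builder.sampling(dataset)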

truncate_inputs(prompt_inputs)[source]#

Truncates prompt inputs extracted with the extractors.

Parameters:

prompt_inputs (Dict[str, List[str]]) – prompt inputs.

Returns:

truncated prompt inputs.

Return type:

Dict[str, List[str]]
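
Example (illustrative values, not taken from the library):

    prompt_inputs = {"entities": ["Jose, Areg", "Marc, Angelo"]}
    truncated = builder.truncate_inputs(prompt_inputs)  # same keys, truncated values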

text_machina.src.data.concatenate(paths, save_path)[source]#

Concatenates and saves a list of datasets.

Parameters:
  • paths (List[Path]) – list of paths to the datasets to be concatenated.

  • save_path (Path) – path where to save the merged dataset.

Returns:

the merged dataset.

Return type:

Dataset
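
Example (a sketch; the dataset locations below are placeholders):

    from pathlib import Path

    from text_machina.src.data import concatenate

    merged = concatenate(
        paths=[Path("outputs/run_a"), Path("outputs/run_b")],
        save_path=Path("outputs/merged"),
    )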

text_machina.src.data.domain_model_counts(dataset)[source]#

Computes counts for (domain, model) pairs, e.g.:

    model     bloom-560m  gpt2  human  total
    domain
    reviews           10    10     20     40
    tweets            10    10     20     40
    total             20    20     40     80

Parameters:

dataset (Dataset) – the dataset used to compute counts.

Returns:

the (domain, model) counts.

Return type:

pd.DataFrame
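
Example (sketch; `dataset` is a previously generated dataset):

    from text_machina.src.data import domain_model_counts

    counts = domain_model_counts(dataset)  # pd.DataFrame: one row per domain, one column per model
    print(counts)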

text_machina.src.data.errors_per_model(dataset)[source]#

Computes error counts per model.

Parameters:

dataset (Dataset) – the dataset used to compute counts.

Returns:

the error counts per model.

Return type:

pd.DataFrame
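
Example (sketch; `dataset` as above):

    from text_machina.src.data import errors_per_model

    errors = errors_per_model(dataset)  # pd.DataFrame with error counts per model
    print(errors)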

text_machina.src.data.format_prompt(template, prompt_inputs)[source]#

Formats a prompt template with the prompt inputs.

Example:

    template: "Write a text using this entities: {entities}.

    Text:"

    prompt_inputs: {"entities": ["Jose, Areg", "Marc, Angelo"]}

    output: ["Write a text using this entities: Jose, Areg.

    Text:",

    "Write a text using this entities: Marc, Angelo.

    Text:"]

Parameters:
  • template (str) – the template to be formatted.

  • prompt_inputs (Dict[str, List[str]]) – prompt inputs from the extractors.

Returns:

formatted templates.

Return type:

List[str]
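
Example call mirroring the documentation above (the template here is a simplified, single-line variant):

    from text_machina.src.data import format_prompt

    template = "Write a text using this entities: {entities}. Text:"
    prompt_inputs = {"entities": ["Jose, Areg", "Marc, Angelo"]}
    prompts = format_prompt(template, prompt_inputs)
    # ["Write a text using this entities: Jose, Areg. Text:",
    #  "Write a text using this entities: Marc, Angelo. Text:"]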

text_machina.src.data.get_path_from_substring(path, substring)[source]#

Checks whether a folder name within path includes the given substring.

Parameters:
  • path (Path) – path under which to search for folders.

  • substring (str) – substring to find in the folder names.

Returns:

the path of a folder whose name includes the substring, or None.

Return type:

Optional[Path]
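
Example (sketch; the directory layout is hypothetical):

    from pathlib import Path

    from text_machina.src.data import get_path_from_substring

    # Path of a folder under "outputs" whose name includes "gpt2", or None.
    match = get_path_from_substring(Path("outputs"), "gpt2")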

text_machina.src.data.get_save_path(config, save_dir, run_name, check_exists=False)[source]#

Constructs the path to save a dataset.

Parameters:
  • config (Config) – config of this run.

  • save_dir (Path) – root of the save path.

  • run_name (str) – name of this run.

  • check_exists (bool) –

    …

Returns:

path to save a dataset.

Return type:

Path
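
Example (sketch; `config` and `run_name` come from the current run, and check_exists is left at its default since its behavior is not documented here):

    from pathlib import Path

    from text_machina.src.data import get_save_path

    save_path = get_save_path(config, save_dir=Path("outputs"), run_name=run_name)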

text_machina.src.data.load_dataset_from_config(config)[source]#

Loads a dataset from disk or hub.

Parameters:

config (InputConfig) – an input config.

Returns:

a dataset.

Return type:

Dataset
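
Example (sketch; `input_config` is an InputConfig whose construction is not covered in this entry):

    from text_machina.src.data import load_dataset_from_config

    dataset = load_dataset_from_config(input_config)  # Dataset loaded from disk or the hub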

text_machina.src.data.serialize_dataset(dataset, config, path, run_name)[source]#

Saves a dataset with its config as an additional column.

Parameters:
  • dataset (Dataset) – a dataset.

  • config (Config) – configuration used.

  • path (Path) – path where to save the generated dataset.

  • run_name (str) – name of this run.

Returns:

folder where the dataset was saved.

Return type:

Path
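
Example (sketch; `dataset`, `config` and `run_name` come from the current run):

    from pathlib import Path

    from text_machina.src.data import serialize_dataset

    # Stores the config as an extra column and returns the folder the dataset was saved to.
    saved_dir = serialize_dataset(dataset, config, Path("outputs"), run_name)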