TextMachina

Unifying strategies to build MGT datasets in a single framework

icon TextMachina is a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as:

🔎 Detection: detect whether a text has been generated by an LLM.
🕵️‍♂️ Attribution: identify what LLM has generated a text.
🚧 Boundary detection: find the boundary between human and generated text.
🎨 Mixcase: ascertain whether specific text spans are human-written or generated by LLMs.

icon TextMachina provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets:

🦜 LLM integrations: easily integrates any LLM provider. Currently, supports LLMs from Anthropic, Cohere, OpenAI, Google Vertex AI, Amazon Bedrock, AI21, Azure OpenAI, models deployed on VLLM and TRT inference servers, and any model from HuggingFace deployed either locally or remotely through Inference API or Inference Endpoints. See text_machina/src/models/ to implement your own LLM provider.
✍️ Prompt templating: just write your prompt template with placeholders and let extractors to fill the template and prepare a prompt for an LLM. See text_machina/src/extractors to implement your own extractors and learn more about the placeholders for each extractor.
🔒 Constrained decoding: automatically infer LLM decoding hyper-parameters from the human texts to improve the quality and reduce the biases of your MGT datasets. See text_machina/src/constrainers to implement your own constrainers.
🛠️ Post-processing: post-process functions aimed to improve the quality of any MGT dataset and prevent common biases and artifacts. See text_machina/src/postprocessing.py to add new postprocess functions.
🌈 Bias mitigation: is built with bias prevention in mind and helps you across all the pipeline to prevent introducing spurious correlations in your datasets.
📊 Dataset exploration: explore the generated datasets and quantify its quality with a set of metrics. See text_machina/metrics and text_machina/src/interactive.py to implement your own metrics and visualizations.

The following diagram depicts the icon ‘s pipeline.

TextMachina Pipeline

📖 Citation#

Please, cite icon if you are using it in your projects:

@misc{sarvazyan2024textmachina,
      title={TextMachina: Seamless Generation of Machine-Generated Text Datasets}, 
      author={Areg Mikael Sarvazyan and José Ángel González and Marc Franco-Salvador},
      year={2024},
      eprint={2401.03946},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

🏭 Commercial Purposes#

Contact stuart.winter-tear@genaios.ai and marc.franco@genaios.ai if you are interested in using TextMachina for commercial purposes.