Unifying strategies to build MGT datasets in a single framework
TextMachina is a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as:
π Detection: detect whether a text has been generated by an LLM.
π΅οΈββοΈ Attribution: identify what LLM has generated a text.
π§ Boundary detection: find the boundary between human and generated text.
π¨ Mixcase: ascertain whether specific text spans are human-written or generated by LLMs.
TextMachina provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets:
π¦ LLM integrations: easily integrates any LLM provider. Currently, supports LLMs from Anthropic, Cohere, OpenAI, Google Vertex AI, Amazon Bedrock, AI21, Azure OpenAI, models deployed on VLLM and TRT inference servers, and any model from HuggingFace deployed either locally or remotely through Inference API or Inference Endpoints. See
text_machina/src/models/
to implement your own LLM provider.βοΈ Prompt templating: just write your prompt template with placeholders and let extractors to fill the template and prepare a prompt for an LLM. See
text_machina/src/extractors
to implement your own extractors and learn more about the placeholders for each extractor.π Constrained decoding: automatically infer LLM decoding hyper-parameters from the human texts to improve the quality and reduce the biases of your MGT datasets. See
text_machina/src/constrainers
to implement your own constrainers.π οΈ Post-processing: post-process functions aimed to improve the quality of any MGT dataset and prevent common biases and artifacts. See
text_machina/src/postprocessing.py
to add new postprocess functions.π Bias mitigation: is built with bias prevention in mind and helps you across all the pipeline to prevent introducing spurious correlations in your datasets.
π Dataset exploration: explore the generated datasets and quantify its quality with a set of metrics. See
text_machina/metrics
andtext_machina/src/interactive.py
to implement your own metrics and visualizations.
The following diagram depicts the βs pipeline.
π Citation#
Please, cite if you are using it in your projects:
@misc{sarvazyan2024textmachina,
title={TextMachina: Seamless Generation of Machine-Generated Text Datasets},
author={Areg Mikael Sarvazyan and JosΓ© Γngel GonzΓ‘lez and Marc Franco-Salvador},
year={2024},
eprint={2401.03946},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
π Commercial Purposes#
Contact stuart.winter-tear@genaios.ai and marc.franco@genaios.ai if you are interested in using TextMachina for commercial purposes.