Added AI-powered synthetic dataset generator demo

2025-07-23 12:46:58 +01:00
parent 7e3ddf460d
commit 34661108aa
2 changed files with 4851 additions and 0 deletions
--- a/community-contributions/synthetic-dataset-generator/README.md
+++ b/community-contributions/synthetic-dataset-generator/README.md
@@ -0,0 +1,37 @@
 # LLM-Powered Dataset Synthesizer: LLaMA 3 + Gradio Demo
 This interactive demo showcases a synthetic dataset generation pipeline powered by Meta's LLaMA 3.1 8B-Instruct model, running in 4-bit quantized mode. Users can input natural language prompts describing the structure and logic of a desired dataset, and the model will generate tabular data accordingly.
 ## ✨ Description
 Modern LLMs are capable of reasoning over structured data formats and generating realistic, constrained datasets. This demo leverages the LLaMA 3.1 instruct model, combined with prompt engineering, to generate high-quality synthetic tabular data from plain-language descriptions.
 Key components:
 - **LLaMA 3.1 8B-Instruct** via Hugging Face Transformers
 - **4-bit quantized loading** with `bitsandbytes` for memory efficiency
 - **Custom prompt framework** for schema + value constraints
 - **Interactive interface** built with Gradio for user-friendly data generation
 ## 🚀 Functionality
 With this tool, you can:
 - Generate synthetic datasets by describing the column names, data types, value logic, and number of rows
 - Apply constraints based on age, gender, matching conditions, and more (e.g., “females over 40; males under 40”)
 - Preview the raw model output or extract structured JSON/tabular results
 - Interactively explore and copy generated datasets from the Gradio UI
 ## 🛠️ Under the Hood
 - The model prompt template includes both a **system message** and user instruction
 - Output is parsed to extract valid JSON objects
 - The generated data is displayed in the Gradio interface and downloadable as CSV
 ## 📦 Requirements
 - Python (Colab recommended)
 - `transformers`, `bitsandbytes`, `accelerate`, `gradio`, `torch`
 - Hugging Face access token with permission to load LLaMA 3.1
 ---
 Ready to generate smart synthetic datasets with just a sentence? Try it!
--- a/community-contributions/synthetic-dataset-generator/synthgen.ipynb
+++ b/community-contributions/synthetic-dataset-generator/synthgen.ipynb