Added AI-powered synthetic dataset generator demo
This commit is contained in:
@@ -0,0 +1,37 @@
|
|||||||
|
# LLM-Powered Dataset Synthesizer: LLaMA 3 + Gradio Demo
|
||||||
|
|
||||||
|
This interactive demo showcases a synthetic dataset generation pipeline powered by Meta's LLaMA 3.1 8B-Instruct model, running in 4-bit quantized mode. Users can input natural language prompts describing the structure and logic of a desired dataset, and the model will generate tabular data accordingly.
|
||||||
|
|
||||||
|
## ✨ Description
|
||||||
|
|
||||||
|
Modern LLMs are capable of reasoning over structured data formats and generating realistic, constrained datasets. This demo leverages the LLaMA 3.1 instruct model, combined with prompt engineering, to generate high-quality synthetic tabular data from plain-language descriptions.
|
||||||
|
|
||||||
|
Key components:
|
||||||
|
- **LLaMA 3.1 8B-Instruct** via Hugging Face Transformers
|
||||||
|
- **4-bit quantized loading** with `bitsandbytes` for memory efficiency
|
||||||
|
- **Custom prompt framework** for schema + value constraints
|
||||||
|
- **Interactive interface** built with Gradio for user-friendly data generation
|
||||||
|
|
||||||
|
## 🚀 Functionality
|
||||||
|
|
||||||
|
With this tool, you can:
|
||||||
|
- Generate synthetic datasets by describing the column names, data types, value logic, and number of rows
|
||||||
|
- Apply constraints based on age, gender, matching conditions, and more (e.g., “females over 40; males under 40”)
|
||||||
|
- Preview the raw model output or extract structured JSON/tabular results
|
||||||
|
- Interactively explore and copy generated datasets from the Gradio UI
|
||||||
|
|
||||||
|
## 🛠️ Under the Hood
|
||||||
|
|
||||||
|
- The model prompt template includes both a **system message** and user instruction
|
||||||
|
- Output is parsed to extract valid JSON objects
|
||||||
|
- The generated data is displayed in the Gradio interface and downloadable as CSV
|
||||||
|
|
||||||
|
## 📦 Requirements
|
||||||
|
|
||||||
|
- Python (Colab recommended)
|
||||||
|
- `transformers`, `bitsandbytes`, `accelerate`, `gradio`, `torch`
|
||||||
|
- Hugging Face access token with permission to load LLaMA 3.1
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Ready to generate smart synthetic datasets with just a sentence? Try it!
|
||||||
4814
community-contributions/synthetic-dataset-generator/synthgen.ipynb
Normal file
4814
community-contributions/synthetic-dataset-generator/synthgen.ipynb
Normal file
File diff suppressed because it is too large
Load Diff
Reference in New Issue
Block a user