🤖 Synthetic Dataset Generator
AI-Powered Synthetic Data Generation with Claude 3 Haiku
📥 Installation
1️⃣ Clone the Repository
git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
2️⃣ Create Virtual Environment (Recommended)
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
3️⃣ Install Dependencies
pip install -r requirements.txt
Requirements file (requirements.txt):
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2
4️⃣ Set Up API Key
Create a .env file in the project root:
# .env
ANTHROPIC_API_KEY=your_api_key_here
Note: Never commit your .env file to version control. Add it to .gitignore.
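As a rough illustration of how the key from the .env file above might be read at startup, here is a minimal sketch. The helper name load_api_key is hypothetical and not taken from the project; it relies on load_dotenv from python-dotenv (listed in requirements.txt) when available and otherwise falls back to variables already exported in the shell.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, loading .env first if possible.

    Hypothetical helper for illustration; the project may load the key differently.
    """
    try:
        # load_dotenv() looks for a .env file and copies its entries into
        # os.environ without overriding variables that are already set.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # python-dotenv not installed: rely on the shell environment

    key = os.getenv(env_var, "").strip()  # strip stray spaces around the value
    if not key:
        raise RuntimeError(f"{env_var} is not set; add it to .env or export it")
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the API client.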
🚀 Usage
Running the Application
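# app.ipynb is a notebook, so open it with Jupyter (pip install notebook) and run all cells
jupyter notebook app.ipynb
The Gradio interface will launch at http://localhost:7860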
Basic Workflow
- Enter API Key (if not in .env)
- Describe Your Schema in plain English
- Set Number of Records (1-200)
- Add Example Format (optional, but recommended)
- Click Generate 🎉
- Download CSV when ready
📝 Example Schemas
👥 Customer Data
Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)
👨‍💼 Employee Records
Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)
🛒 E-commerce Products
Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)
🎯 Advanced Usage
Batch Generation
For datasets larger than 50 records, the tool automatically:
- Splits generation into batches of 50
- Combines results into a single dataset
- Prevents API timeout issues
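The batching behavior described above can be sketched roughly as follows. The function names plan_batches and generate_in_batches are hypothetical, and generate_batch stands in for whatever function actually calls the API; only the batch size of 50 comes from this README.

```python
from typing import Callable, Dict, List


def plan_batches(num_records: int, batch_size: int = 50) -> List[int]:
    """Split a requested record count into per-request batch sizes."""
    if num_records < 1 or batch_size < 1:
        raise ValueError("num_records and batch_size must be positive")
    full, rest = divmod(num_records, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def generate_in_batches(
    num_records: int,
    generate_batch: Callable[[int], List[Dict]],  # hypothetical API wrapper
    batch_size: int = 50,
) -> List[Dict]:
    """Call generate_batch once per batch and combine the results."""
    records: List[Dict] = []
    for size in plan_batches(num_records, batch_size):
        records.extend(generate_batch(size))
    return records


# A request for 120 records becomes three calls of 50, 50, and 20.
print(plan_batches(120))
```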
Custom Formats
Provide example JSON to guide the output format:
{
"id": "USR-001",
"name": "Jane Smith",
"email": "jane.smith@example.com",
"created_at": "2024-01-15T10:30:00Z"
}
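One plausible way such an example could be folded into the request is shown below. This is only a sketch: build_prompt and the exact prompt wording are assumptions, not the project's actual prompt.

```python
import json
from typing import Optional


def build_prompt(schema: str, num_records: int, example: Optional[dict] = None) -> str:
    """Assemble a generation prompt, optionally including an example record.

    Hypothetical helper; the wording is illustrative only.
    """
    parts = [
        f"Generate {num_records} synthetic records as a JSON array.",
        "Schema description:",
        schema.strip(),
    ]
    if example is not None:
        # Showing one concrete record usually pins down key names and formats.
        parts.append("Each record should follow this example format:")
        parts.append(json.dumps(example, indent=2))
    parts.append("Return only the JSON array, with no extra text.")
    return "\n".join(parts)


example = {
    "id": "USR-001",
    "name": "Jane Smith",
    "email": "jane.smith@example.com",
    "created_at": "2024-01-15T10:30:00Z",
}
print(build_prompt("Users with id, name, email, created_at", 5, example))
```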
🔧 Troubleshooting
❌ Error: proxies keyword argument
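Solution: Downgrade httpx to a compatible version. httpx 0.28 removed the proxies argument, which older anthropic SDK releases still pass to the HTTP client.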
pip install "httpx==0.27.2"
Then restart your Python kernel/terminal.
❌ API Key Not Found
Solutions:
- Check that the .env file exists in the project root
- Verify ANTHROPIC_API_KEY is spelled correctly
- Ensure there are no extra spaces in the .env file
- Restart the application after creating .env
❌ JSON Parsing Error
Solutions:
- Make your schema description more specific
- Add an example format
- Reduce the number of records per batch
- Check your API key has sufficient credits
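Parsing errors often come from replies that wrap the JSON in explanatory text or code fences. The sketch below shows one tolerant way to pull the array out of such a reply; extract_json_array is a hypothetical helper and may differ from how the tool parses responses.

```python
import json
from typing import Any, List


def extract_json_array(text: str) -> List[Any]:
    """Return the JSON array embedded in a model reply.

    Takes everything from the first '[' to the last ']' so that surrounding
    prose or code fences are ignored. Raises ValueError if nothing parses
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in response")
    return json.loads(text[start:end + 1])


fence = "`" * 3  # a markdown code fence, built this way to keep the sample readable
reply = f'Here is the data:\n{fence}json\n[{{"id": 1}}, {{"id": 2}}]\n{fence}'
print(extract_json_array(reply))
```

A reply that was cut off mid-array still fails to parse, which is why reducing the number of records per batch helps.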
❌ Rate Limit Errors
Solutions:
- Reduce batch size in code (change batch_size=50 to batch_size=20)
- Add delays between batches
- Upgrade your Anthropic API plan
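Adding delays can be as simple as retrying a failed batch after a growing pause. The following is a minimal sketch of exponential backoff; call_with_backoff and its parameters are hypothetical names, and the sleep argument exists only so the pause can be swapped out in tests.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponentially growing pauses on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    raise AssertionError("unreachable")
```

With the Anthropic SDK, passing retry_on=(anthropic.RateLimitError,) would limit retries to rate-limit responses rather than every exception.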
📊 Output Format
DataFrame Preview
View the generated data directly in the browser in a scrollable table.
CSV Download
- Automatic CSV generation
- Proper encoding (UTF-8)
- No index column
- Ready for Excel, Pandas, or any data tool
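The CSV properties listed above correspond to a couple of pandas options. The sketch below assumes the export goes through DataFrame.to_csv; the save_csv helper, the sample records, and the file name are made up for illustration.

```python
import os
import tempfile
from typing import Dict, List

import pandas as pd


def save_csv(records: List[Dict], path: str) -> None:
    """Write records to a UTF-8 CSV file without the DataFrame index column."""
    df = pd.DataFrame(records)
    df.to_csv(path, index=False, encoding="utf-8")


# Made-up sample records, including non-ASCII characters to exercise UTF-8.
records = [
    {"id": "USR-001", "name": "Jane Smith", "email": "jane.smith@example.com"},
    {"id": "USR-002", "name": "José Álvarez", "email": "jose.alvarez@example.com"},
]
path = os.path.join(tempfile.gettempdir(), "synthetic_data.csv")
save_csv(records, path)

# Reading the file back confirms the columns and the non-ASCII text survive.
round_trip = pd.read_csv(path, encoding="utf-8")
print(round_trip)
```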
🧑‍💻 Skill Level
Beginner Friendly ✅
- No ML/AI expertise required
- Basic Python knowledge helpful
- Simple natural language interface
- Pre-configured examples included
💡 Tips for Best Results
- Be Specific: Include data types, ranges, and formats
- Use Examples: Provide sample JSON for complex schemas
- Start Small: Test with 5-10 records before scaling up
- Iterate: Refine your schema based on initial results
- Validate: Check the first few records before using the entire dataset
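Validation can be a few explicit checks per record. The sketch below checks three fields from the Customer Data example schema; validate_customer is a hypothetical helper, and it assumes XXXX in the ID format means four digits.

```python
import re
from typing import Dict, List

ALLOWED_SUBSCRIPTIONS = {"Free", "Basic", "Premium"}
CUSTOMER_ID = re.compile(r"CUST-\d{4}")  # assumption: XXXX is four digits


def validate_customer(record: Dict) -> List[str]:
    """Return a list of problems found in one customer record (empty if none)."""
    problems = []
    if not CUSTOMER_ID.fullmatch(str(record.get("customer_id", ""))):
        problems.append("customer_id does not match CUST-XXXX")
    age = record.get("age")
    if not isinstance(age, int) or not 18 <= age <= 80:
        problems.append("age outside 18-80")
    if record.get("subscription_type") not in ALLOWED_SUBSCRIPTIONS:
        problems.append("unknown subscription_type")
    return problems


sample = {"customer_id": "CUST-0042", "age": 34, "subscription_type": "Basic"}
print(validate_customer(sample))
```

Running a check like this over the first few records before a large run makes it easier to spot a schema description that needs tightening.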
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
🙏 Acknowledgments
- Anthropic for the Claude API
- Gradio for the UI framework
- Pandas for data manipulation
📞 Support
- 📧 Email: udayslathia16@gmail.com
Made with ❤️ using Claude 3 Haiku
⭐ Star this repo if you find it useful!