# 🤖 Synthetic Dataset Generator

## AI-Powered Synthetic Data Generation with Claude 3 Haiku

## 📥 Installation

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
```

### 2️⃣ Create Virtual Environment (Recommended)

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

### 3️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

**Requirements file (`requirements.txt`):**
```txt
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2
```

### 4️⃣ Set Up API Key

Create a `.env` file in the project root:

```bash
# .env
ANTHROPIC_API_KEY=your_api_key_here
```

> **Note**: Never commit your `.env` file to version control. Add it to `.gitignore`.

---

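## 🚀 Usage

### Running the Application

The application is a Jupyter notebook, so `python` cannot run it directly. Open it in Jupyter and run all cells:

```bash
# Requires Jupyter (for example: pip install notebook)
jupyter notebook app.ipynb
```

By default, the Gradio interface launches at `http://localhost:7860`.
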
### Basic Workflow

1. **Enter API Key** (if not in `.env`)
2. **Describe Your Schema** in plain English
3. **Set Number of Records** (1-200)
4. **Add Example Format** (optional, but recommended)
5. **Click Generate** 🎉
6. **Download CSV** when ready

---

## 📝 Example Schemas

### 👥 Customer Data
```
Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)
```

### 👨‍💼 Employee Records
```
Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)
```

### 🛒 E-commerce Products
```
Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)
```

---

## 🎯 Advanced Usage

### Batch Generation

For datasets larger than 50 records, the tool automatically:
- Splits generation into batches of 50
- Combines the results into a single dataset
- Helps avoid API timeouts by keeping each request small
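
The batching described above can be sketched as follows. This is a minimal illustration rather than the tool's actual code: `plan_batches` is a hypothetical helper, and only the batch size of 50 comes from this README.

```python
def plan_batches(total_records: int, batch_size: int = 50) -> list[int]:
    """Return the number of records to request in each batch."""
    if total_records < 1:
        raise ValueError("total_records must be at least 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    full_batches, remainder = divmod(total_records, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)  # final, smaller batch
    return sizes


if __name__ == "__main__":
    print(plan_batches(120))  # [50, 50, 20]
```

Each size would then drive one API request, with the returned records appended to a single combined list.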

### Custom Formats

Provide example JSON to guide the output format:

```json
{
  "id": "USR-001",
  "name": "Jane Smith",
  "email": "jane.smith@example.com",
  "created_at": "2024-01-15T10:30:00Z"
}
```
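
An example like this can also serve as a quick check on the generated records. The sketch below compares each record's keys with the example's keys. `find_mismatched_records` is a hypothetical helper, the sample records are made up for illustration, and the tool itself may not perform this check.

```python
import json

# The example format from this README, parsed into a dict.
EXAMPLE = json.loads("""
{
  "id": "USR-001",
  "name": "Jane Smith",
  "email": "jane.smith@example.com",
  "created_at": "2024-01-15T10:30:00Z"
}
""")


def find_mismatched_records(records: list[dict], example: dict) -> list[int]:
    """Return indexes of records whose keys differ from the example's keys."""
    expected = set(example)
    return [i for i, record in enumerate(records) if set(record) != expected]


if __name__ == "__main__":
    generated = [
        {"id": "USR-002", "name": "Sam Lee", "email": "sam.lee@example.com",
         "created_at": "2024-02-01T08:00:00Z"},
        {"id": "USR-003", "name": "Ana Cruz"},  # missing email and created_at
    ]
    print(find_mismatched_records(generated, EXAMPLE))  # [1]
```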

---

## 🔧 Troubleshooting

### ❌ Error: `proxies` keyword argument

**Cause**: This error typically appears when `httpx` 0.28 or later is installed, because that release removed the `proxies` argument that older `anthropic` SDK versions still pass.

**Solution**: Downgrade `httpx` to a compatible version:

```bash
pip install "httpx==0.27.2"
```

Then restart your Python kernel or terminal.
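
To confirm which `httpx` version is active after reinstalling, a quick check along these lines may help. It is only a sketch: `supports_proxies_argument` is a hypothetical helper, and the 0.28 cutoff is an assumption based on the `httpx` release that dropped the `proxies` argument.

```python
from importlib import metadata


def supports_proxies_argument(version: str) -> bool:
    """Return True for httpx versions older than 0.28 (assumed cutoff)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (0, 28)


if __name__ == "__main__":
    try:
        installed = metadata.version("httpx")
    except metadata.PackageNotFoundError:
        print("httpx is not installed in this environment")
    else:
        status = "compatible" if supports_proxies_argument(installed) else "too new"
        print(f"httpx {installed}: {status}")
```

Run it inside the same virtual environment (or notebook kernel) that the application uses, since each environment has its own installed packages.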

### ❌ API Key Not Found

**Solutions**:
1. Check that the `.env` file exists in the project root
2. Verify that `ANTHROPIC_API_KEY` is spelled correctly
3. Ensure there are no extra spaces in the `.env` file
4. Restart the application after creating `.env`
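
The first three checks can be automated with a small script. The sketch below uses only the standard library and mirrors the checklist above. `diagnose_env_text` is a hypothetical helper and is not part of the tool or of `python-dotenv`.

```python
from pathlib import Path

KEY_NAME = "ANTHROPIC_API_KEY"


def diagnose_env_text(text: str, key_name: str = KEY_NAME) -> list[str]:
    """Return human-readable problems found in the contents of a .env file."""
    problems = []
    found = False
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and lines that are not assignments.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        name, value = stripped.split("=", 1)
        if name.strip() != key_name:
            continue
        found = True
        if name != name.strip() or value != value.strip():
            problems.append(f"extra spaces around '=' in the {key_name} line")
        if not value.strip():
            problems.append(f"{key_name} has an empty value")
    if not found:
        problems.append(f"{key_name} not found (check the spelling)")
    return problems


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file in the current directory")
    else:
        for problem in diagnose_env_text(env_path.read_text(encoding="utf-8")):
            print(problem)
```

Run it from the project root. No output means none of the listed problems were detected.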

### ❌ JSON Parsing Error

**Solutions**:
1. Make your schema description more specific
2. Add an example format
3. Reduce the number of records per batch
4. Check that your API key has sufficient credits
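
One common cause of parsing failures, assumed here rather than confirmed for this tool, is a model reply that wraps the JSON in extra text or a Markdown code fence. The sketch below pulls the array out before parsing. `extract_json_array` is a hypothetical helper, not a function from the tool.

```python
import json


def extract_json_array(response_text: str) -> list:
    """Parse the JSON array spanning the first '[' to the last ']'."""
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in response")
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON.
    return json.loads(response_text[start:end + 1])


if __name__ == "__main__":
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    reply = "Here is your data:\n" + fence + 'json\n[{"id": 1}, {"id": 2}]\n' + fence
    print(extract_json_array(reply))  # [{'id': 1}, {'id': 2}]
```

If the reply is cut off mid-array, parsing still fails, which is where reducing the number of records per batch helps.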

### ❌ Rate Limit Errors

**Solutions**:
1. Reduce the batch size in the code (change `batch_size=50` to `batch_size=20`)
2. Add delays between batches
3. Upgrade your Anthropic API plan
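
Adding a delay between batches could look roughly like the sketch below. `run_batches_with_delay` and `fake_generate` are hypothetical names, the 2-second default is an arbitrary starting point, and the real API call is replaced by a stand-in so the example runs without a key.

```python
import time
from typing import Callable, Sequence


def run_batches_with_delay(
    batch_sizes: Sequence[int],
    generate_batch: Callable[[int], list],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call generate_batch once per batch, pausing between consecutive calls."""
    records = []
    for index, size in enumerate(batch_sizes):
        if index > 0:
            sleep(delay_seconds)  # pause before every batch except the first
        records.extend(generate_batch(size))
    return records


def fake_generate(size: int) -> list:
    """Stand-in for a real API call: returns `size` placeholder records."""
    return [{"row": i} for i in range(size)]


if __name__ == "__main__":
    rows = run_batches_with_delay([20, 20, 10], fake_generate, delay_seconds=0.1)
    print(len(rows))  # 50
```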

---

## 📊 Output Format

### DataFrame Preview
View the generated data directly in the browser in a scrollable table.

### CSV Download
- Automatic CSV generation
- Proper encoding (UTF-8)
- No index column
- Ready for Excel, Pandas, or any data tool
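
With pandas, these properties typically correspond to `DataFrame.to_csv(path, index=False, encoding="utf-8")`. The sketch below shows the same idea using only the standard library `csv` module, so it runs without extra dependencies. `records_to_csv` is a hypothetical helper and not the tool's own export code.

```python
import csv
import io


def records_to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text with a header row and no index column."""
    if not records:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"id": "USR-001", "name": "Jane Smith"}]
    # To save to disk, write this text to a file opened with
    # encoding="utf-8" and newline="".
    print(records_to_csv(sample), end="")
```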

---

## 🧑‍💻 Skill Level

**Beginner Friendly** ✅

- No ML/AI expertise required
- Basic Python knowledge helpful
- Simple natural language interface
- Pre-configured examples included

---

## 💡 Tips for Best Results

1. **Be Specific**: Include data types, ranges, and formats
2. **Use Examples**: Provide sample JSON for complex schemas
3. **Start Small**: Test with 5-10 records before scaling up
4. **Iterate**: Refine your schema based on initial results
5. **Validate**: Check the first few records before using the entire dataset

---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

---

## 🙏 Acknowledgments

- **Anthropic** for the Claude API
- **Gradio** for the UI framework
- **Pandas** for data manipulation

---

## 📞 Support

- 📧 Email: udayslathia16@gmail.com

---

## 🔗 Related Projects

- [Claude API Documentation](https://docs.anthropic.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [Pandas Documentation](https://pandas.pydata.org/)

---

<div align="center">

**Made with ❤️ using Claude 3 Haiku**

⭐ Star this repo if you find it useful!

</div>