# 🤖 Synthetic Dataset Generator
## AI-Powered Synthetic Data Generation with Claude 3 Haiku
## 📥 Installation

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
```

### 2️⃣ Create Virtual Environment (Recommended)

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

### 3️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

**Requirements file (`requirements.txt`):**
```txt
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2
```

### 4️⃣ Set Up API Key

Create a `.env` file in the project root:

```bash
# .env
ANTHROPIC_API_KEY=your_api_key_here
```

> **Note**: Never commit your `.env` file to version control. Add it to `.gitignore`.
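
A minimal sketch of how the key could be resolved at startup, assuming a key typed into the UI takes precedence and `ANTHROPIC_API_KEY` from the environment is the fallback. `resolve_api_key` is a hypothetical helper, not a function from this repository; in the real app, `load_dotenv()` from `python-dotenv` would normally populate the environment from `.env` first.

```python
import os


def resolve_api_key(ui_value=None, env=None):
    """Return the API key from the UI field, else from the environment.

    Hypothetical helper for illustration; not part of this repository.
    """
    env = os.environ if env is None else env
    # Strip whitespace so a stray space or newline does not break authentication.
    for candidate in (ui_value, env.get("ANTHROPIC_API_KEY")):
        if candidate and candidate.strip():
            return candidate.strip()
    raise ValueError("ANTHROPIC_API_KEY not found: set it in .env or enter it in the UI")
```

Passing `env` explicitly makes the lookup easy to test without touching the real environment.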

---

## 🚀 Usage

### Running the Application

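`app.ipynb` is a Jupyter notebook, so it cannot be run with `python` directly. Open it in Jupyter (install it with `pip install notebook` if needed) and run all cells:

```bash
jupyter notebook app.ipynb
```

Once the cells have run, the Gradio interface launches at `http://localhost:7860` by default.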

### Basic Workflow

1. **Enter API Key** (if not in `.env`)
2. **Describe Your Schema** in plain English
3. **Set Number of Records** (1-200)
4. **Add Example Format** (optional, but recommended)
5. **Click Generate** 🎉
6. **Download CSV** when ready

---

## 📝 Example Schemas

### 👥 Customer Data
```
Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)
```

### 👨‍💼 Employee Records
```
Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)
```

### 🛒 E-commerce Products
```
Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)
```

---

## 🎯 Advanced Usage

### Batch Generation

For datasets larger than 50 records, the tool automatically:
- Splits generation into batches of 50
- Combines results into a single dataset
- Prevents API timeout issues
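
The splitting step can be sketched as follows. This is an illustration of the idea, not the app's actual code; `plan_batches` is a hypothetical name, and only the default `batch_size=50` comes from this README.

```python
def plan_batches(total_records, batch_size=50):
    """Split a requested record count into per-request batch sizes."""
    if total_records < 1:
        raise ValueError("total_records must be at least 1")
    full, remainder = divmod(total_records, batch_size)
    sizes = [batch_size] * full
    if remainder:
        # The last batch carries whatever is left over.
        sizes.append(remainder)
    return sizes


print(plan_batches(120))  # [50, 50, 20]
```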

### Custom Formats

Provide example JSON to guide the output format:

```json
{
  "id": "USR-001",
  "name": "Jane Smith",
  "email": "jane.smith@example.com",
  "created_at": "2024-01-15T10:30:00Z"
}
```
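
One way such an example might be folded into the request is sketched below. The prompt wording and the `build_prompt` helper are assumptions made for illustration; the prompt the app actually sends may differ.

```python
import json


def build_prompt(schema_description, num_records, example=None):
    """Assemble a generation prompt; hypothetical, not the app's actual prompt."""
    parts = [
        f"Generate {num_records} records as a JSON array of objects.",
        f"Schema:\n{schema_description.strip()}",
    ]
    if example is not None:
        # Serialising the example keeps field names and value formats exact.
        parts.append("Match this example record's format:\n" + json.dumps(example, indent=2))
    parts.append("Return only the JSON array, with no commentary.")
    return "\n\n".join(parts)


example = {"id": "USR-001", "created_at": "2024-01-15T10:30:00Z"}
prompt = build_prompt("id (format: USR-XXX), created_at (ISO 8601 timestamp)", 5, example)
```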

---

## 🔧 Troubleshooting

### ❌ Error: `proxies` keyword argument

**Solution**: Downgrade httpx to a compatible version:

```bash
pip install "httpx==0.27.2"
```

Then restart your Python kernel/terminal.
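
To check whether an environment is affected before pinning, the installed version can be inspected. This sketch assumes the error comes from httpx 0.28 and later, which removed the `proxies` argument that older `anthropic` releases still pass; `httpx_needs_pin` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def httpx_needs_pin(version_string):
    """Return True if this httpx version is 0.28 or newer (assumed incompatible)."""
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) >= (0, 28)


try:
    installed = version("httpx")
    print(installed, "needs pinning" if httpx_needs_pin(installed) else "looks compatible")
except PackageNotFoundError:
    print("httpx is not installed in this environment")
```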

### ❌ API Key Not Found

**Solutions**:
1. Check `.env` file exists in project root
2. Verify `ANTHROPIC_API_KEY` is spelled correctly
3. Ensure no extra spaces in the `.env` file
4. Restart the application after creating `.env`

### ❌ JSON Parsing Error

**Solutions**:
1. Make your schema description more specific
2. Add an example format
3. Reduce the number of records per batch
4. Check your API key has sufficient credits
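
Parsing errors often come from extra text around the JSON or from a reply that was cut off. The sketch below shows one defensive way to pull the array out of a reply; `extract_records` is a hypothetical helper and the app's own parsing may work differently.

```python
import json


def extract_records(reply_text):
    """Pull a JSON array of objects out of a model reply that may contain extra text."""
    start = reply_text.find("[")
    end = reply_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("No JSON array found in the reply")
    # json.JSONDecodeError subclasses ValueError, so callers can catch one type.
    records = json.loads(reply_text[start : end + 1])
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("Expected a JSON array of objects")
    return records


reply = 'Sure! Here are the records: [{"id": 1}, {"id": 2}] Let me know if you need more.'
print(extract_records(reply))  # [{'id': 1}, {'id': 2}]
```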

### ❌ Rate Limit Errors

**Solutions**:
1. Reduce batch size in code (change `batch_size=50` to `batch_size=20`)
2. Add delays between batches
3. Upgrade your Anthropic API plan
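
A delay between batches could look like the sketch below. `run_batches_with_delay` and `generate_batch` are hypothetical names; `generate_batch` stands in for whatever function calls the API and returns a list of records.

```python
import time


def run_batches_with_delay(batch_sizes, generate_batch, delay_seconds=2.0, sleep=time.sleep):
    """Call generate_batch for each size, pausing between calls."""
    records = []
    for index, size in enumerate(batch_sizes):
        if index > 0:
            sleep(delay_seconds)  # Pause only between batches, not before the first.
        records.extend(generate_batch(size))
    return records


def fake_batch(size):
    """Stand-in generator so the sketch runs without an API key."""
    return [{"row": i} for i in range(size)]


rows = run_batches_with_delay([20, 20, 10], fake_batch, delay_seconds=0)
print(len(rows))  # 50
```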

---

## 📊 Output Format

### DataFrame Preview
View generated data directly in the browser in a scrollable table.

### CSV Download
- Automatic CSV generation
- Proper encoding (UTF-8)
- No index column
- Ready for Excel, Pandas, or any data tool
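
For reference, the same kind of file can be produced with the standard library alone, as sketched below. This is not the app's code; with pandas the equivalent call is `df.to_csv(path, index=False, encoding="utf-8")`. `records_to_csv` is a hypothetical helper.

```python
import csv
import io


def records_to_csv(records):
    """Serialise a list of dicts to CSV text with a header row and no index column."""
    if not records:
        return ""
    buffer = io.StringIO(newline="")
    # Column order follows the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv([{"id": 1, "name": "Zoë"}, {"id": 2, "name": "Li"}])
```

When writing to disk, open the file with `encoding="utf-8"` and `newline=""` so the output stays UTF-8 and line endings are not doubled.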

---

## 🧑‍💻 Skill Level

**Beginner Friendly** ✅

- No ML/AI expertise required
- Basic Python knowledge helpful
- Simple natural language interface
- Pre-configured examples included

---

## 💡 Tips for Best Results

1. **Be Specific**: Include data types, ranges, and formats
2. **Use Examples**: Provide sample JSON for complex schemas
3. **Start Small**: Test with 5-10 records before scaling up
4. **Iterate**: Refine your schema based on initial results
5. **Validate**: Check the first few records before using the entire dataset
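
The validation tip can be sketched as a quick check over the first few records. The field rules below mirror the Customer Data example schema; `find_invalid_rows` is a hypothetical helper, not part of the app.

```python
def find_invalid_rows(records, checks, sample_size=5):
    """Return (row_index, field) pairs that fail a check in the first sample_size rows."""
    problems = []
    for index, record in enumerate(records[:sample_size]):
        for field, is_valid in checks.items():
            # A missing field counts as a failure.
            if field not in record or not is_valid(record[field]):
                problems.append((index, field))
    return problems


checks = {
    "age": lambda v: isinstance(v, int) and 18 <= v <= 80,
    "subscription_type": lambda v: v in {"Free", "Basic", "Premium"},
}
sample = [
    {"age": 34, "subscription_type": "Basic"},
    {"age": 17, "subscription_type": "Gold"},
]
print(find_invalid_rows(sample, checks))  # [(1, 'age'), (1, 'subscription_type')]
```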

---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

---


## 🙏 Acknowledgments

- **Anthropic** for the Claude API
- **Gradio** for the UI framework
- **Pandas** for data manipulation

---

## 📞 Support

- 📧 Email: udayslathia16@gmail.com

---

## 🔗 Related Projects

- [Claude API Documentation](https://docs.anthropic.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [Pandas Documentation](https://pandas.pydata.org/)

---

<div align="center">

**Made with ❤️ using Claude 3 Haiku**

⭐ Star this repo if you find it useful!

</div>