Files
LLM_Engineering_OLD/week3/community-contributions/Synthetic Dataset Generator/README.md
2025-10-09 12:45:04 +05:30

4.9 KiB
Raw Blame History

🤖 Synthetic Dataset Generator

AI-Powered Synthetic Data Generation with Claude 3 Haiku

📥 Installation

1 Clone the Repository

git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

3 Install Dependencies

pip install -r requirements.txt

Requirements file (requirements.txt):

gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2

4 Set Up API Key

Create a .env file in the project root:

# .env
ANTHROPIC_API_KEY=your_api_key_here

Note

: Never commit your .env file to version control. Add it to .gitignore.


🚀 Usage

Running the Application

python app.ipynb

The Gradio interface will launch at http://localhost:7860

Basic Workflow

  1. Enter API Key (if not in .env)
  2. Describe Your Schema in plain English
  3. Set Number of Records (1-200)
  4. Add Example Format (optional, but recommended)
  5. Click Generate 🎉
  6. Download CSV when ready

📝 Example Schemas

👥 Customer Data

Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)

👨‍💼 Employee Records

Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)

🛒 E-commerce Products

Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)

🎯 Advanced Usage

Batch Generation

For datasets larger than 50 records, the tool automatically:

  • Splits generation into batches of 50
  • Combines results into a single dataset
  • Prevents API timeout issues

Custom Formats

Provide example JSON to guide the output format:

{
  "id": "USR-001",
  "name": "Jane Smith",
  "email": "jane.smith@example.com",
  "created_at": "2024-01-15T10:30:00Z"
}

🔧 Troubleshooting

Error: proxies keyword argument

Solution: Downgrade httpx to compatible version

pip install "httpx==0.27.2"

Then restart your Python kernel/terminal.

API Key Not Found

Solutions:

  1. Check .env file exists in project root
  2. Verify ANTHROPIC_API_KEY is spelled correctly
  3. Ensure no extra spaces in the .env file
  4. Restart the application after creating .env

JSON Parsing Error

Solutions:

  1. Make your schema description more specific
  2. Add an example format
  3. Reduce the number of records per batch
  4. Check your API key has sufficient credits

Rate Limit Errors

Solutions:

  1. Reduce batch size in code (change batch_size=50 to batch_size=20)
  2. Add delays between batches
  3. Upgrade your Anthropic API plan

📊 Output Format

DataFrame Preview

View generated data directly in the browser with scrollable table.

CSV Download

  • Automatic CSV generation
  • Proper encoding (UTF-8)
  • No index column
  • Ready for Excel, Pandas, or any data tool

🧑‍💻 Skill Level

Beginner Friendly

  • No ML/AI expertise required
  • Basic Python knowledge helpful
  • Simple natural language interface
  • Pre-configured examples included

💡 Tips for Best Results

  1. Be Specific: Include data types, ranges, and formats
  2. Use Examples: Provide sample JSON for complex schemas
  3. Start Small: Test with 5-10 records before scaling up
  4. Iterate: Refine your schema based on initial results
  5. Validate: Check the first few records before using the entire dataset

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

🙏 Acknowledgments

  • Anthropic for the Claude API
  • Gradio for the UI framework
  • Pandas for data manipulation

📞 Support



Made with ❤️ using Claude 3 Haiku

Star this repo if you find it useful!