@@ -0,0 +1,251 @@
# 🤖 Synthetic Dataset Generator

## AI-Powered Synthetic Data Generation with Claude 3 Haiku

## 📥 Installation

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
```

### 2️⃣ Create Virtual Environment (Recommended)

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

### 3️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

**Requirements file (`requirements.txt`):**

```txt
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2
```

### 4️⃣ Set Up API Key

Create a `.env` file in the project root:

```bash
# .env
ANTHROPIC_API_KEY=your_api_key_here
```

> **Note**: Never commit your `.env` file to version control. Add it to `.gitignore`.
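The app calls python-dotenv's `load_dotenv()` to pull the key out of this file. In miniature, that amounts to the following stdlib-only sketch (the real library also handles quoting, `export` prefixes, and more):

```python
# Stdlib-only sketch of what load_dotenv() effectively does for a simple
# KEY=value file: skip comments and blanks, split on the first "=".
env_text = "# .env\nANTHROPIC_API_KEY=your_api_key_here\n"

parsed = {}
for line in env_text.splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip()

print(parsed["ANTHROPIC_API_KEY"])  # -> your_api_key_here
```

This is why stray spaces around the `=` or the key name cause "API Key Not Found" errors: the variable name must match `ANTHROPIC_API_KEY` exactly.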
---

## 🚀 Usage

### Running the Application

The app lives in a Jupyter notebook, so open it and run all cells:

```bash
jupyter notebook app.ipynb
```

The Gradio interface will launch at `http://localhost:7860`.

### Basic Workflow

1. **Enter API Key** (if not in `.env`)
2. **Describe Your Schema** in plain English
3. **Set Number of Records** (1-200)
4. **Add Example Format** (optional, but recommended)
5. **Click Generate** 🎉
6. **Download CSV** when ready

---

## 📝 Example Schemas

### 👥 Customer Data
```
Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)
```

### 👨💼 Employee Records
```
Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)
```

### 🛒 E-commerce Products
```
Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)
```

---

## 🎯 Advanced Usage

### Batch Generation

For datasets larger than 50 records, the tool automatically:
- Splits generation into batches of 50
- Combines results into a single dataset
- Prevents API timeout issues
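The split is plain ceiling division. A small stand-alone sketch (not lifted verbatim from the app) of how, say, a 120-record request becomes three API calls:

```python
def plan_batches(total_records: int, batch_size: int = 50) -> list[int]:
    """Return the per-batch record counts used for a large request."""
    # Ceiling division: how many batches are needed in total.
    num_batches = (total_records + batch_size - 1) // batch_size
    sizes = []
    generated = 0
    for _ in range(num_batches):
        # The last batch picks up whatever remainder is left.
        n = min(batch_size, total_records - generated)
        sizes.append(n)
        generated += n
    return sizes

print(plan_batches(120))  # -> [50, 50, 20]
```

Each entry in the returned list corresponds to one call to the underlying single-batch generator; the per-batch DataFrames are then concatenated into one dataset.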
### Custom Formats

Provide example JSON to guide the output format:

```json
{
  "id": "USR-001",
  "name": "Jane Smith",
  "email": "jane.smith@example.com",
  "created_at": "2024-01-15T10:30:00Z"
}
```

---

## 🔧 Troubleshooting

### ❌ Error: `proxies` keyword argument

**Solution**: Pin httpx to a compatible version:

```bash
pip install "httpx==0.27.2"
```

Then restart your Python kernel/terminal.

### ❌ API Key Not Found

**Solutions**:
1. Check that the `.env` file exists in the project root
2. Verify `ANTHROPIC_API_KEY` is spelled correctly
3. Ensure there are no extra spaces in the `.env` file
4. Restart the application after creating `.env`

### ❌ JSON Parsing Error

**Solutions**:
1. Make your schema description more specific
2. Add an example format
3. Reduce the number of records per batch
4. Check that your API key has sufficient credits

### ❌ Rate Limit Errors

**Solutions**:
1. Reduce the batch size in code (change `batch_size=50` to `batch_size=20`)
2. Add delays between batches
3. Upgrade your Anthropic API plan
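The stock batch loop fires its API calls back to back. A hypothetical modification (the `make_batch` callback here is a stand-in, not a function from `app.ipynb`) showing where a delay would slot in:

```python
import time

def run_batches_with_delay(batch_sizes, make_batch, pause_s: float = 2.0):
    """Call make_batch(n) for each batch size, sleeping between calls
    to stay under per-minute rate limits."""
    results = []
    for i, n in enumerate(batch_sizes):
        if i > 0:
            time.sleep(pause_s)  # back off between consecutive API calls
        results.append(make_batch(n))
    return results

# Example with a stand-in for the real API call:
out = run_batches_with_delay([2, 2, 1], make_batch=lambda n: ["record"] * n,
                             pause_s=0.0)
print(sum(len(r) for r in out))  # -> 5
```

In the real app the equivalent change would go inside the batch loop, between successive calls to the single-batch generator.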
---

## 📊 Output Format

### DataFrame Preview
View the generated data directly in the browser in a scrollable table.

### CSV Download
- Automatic CSV generation
- Proper encoding (UTF-8)
- No index column
- Ready for Excel, Pandas, or any data tool
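Because the download is a plain UTF-8 CSV with a header row and no index column, any CSV reader handles it. A quick sketch using the stdlib (with an inline stand-in for a downloaded file; the column names follow the customer schema above):

```python
import csv
import io

# Stand-in for the contents of a downloaded CSV file.
csv_text = "customer_id,name,age\nCUST-0001,Jane Smith,34\nCUST-0002,John Doe,52\n"

# DictReader maps the header row onto each record.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(len(rows))                # -> 2
print(rows[0]["customer_id"])   # -> CUST-0001
```

With pandas the equivalent is a one-liner, `pd.read_csv("synthetic_data.csv")`; note that stdlib `csv` yields all values as strings, so cast numeric columns yourself.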
---

## 🧑💻 Skill Level

**Beginner Friendly** ✅

- No ML/AI expertise required
- Basic Python knowledge helpful
- Simple natural language interface
- Pre-configured examples included

---

## 💡 Tips for Best Results

1. **Be Specific**: Include data types, ranges, and formats
2. **Use Examples**: Provide sample JSON for complex schemas
3. **Start Small**: Test with 5-10 records before scaling up
4. **Iterate**: Refine your schema based on initial results
5. **Validate**: Check the first few records before using the entire dataset
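Tip 5 can be automated with a few field checks. A sketch for the customer schema above (field names and ranges taken from that example; the sample records are hypothetical):

```python
# Sanity-check a few generated records before trusting the full dataset.
sample = [
    {"customer_id": "CUST-1042", "email": "a@example.com", "age": 27},
    {"customer_id": "CUST-9310", "email": "b@example.com", "age": 61},
]

def validate(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty means OK)."""
    problems = []
    if not record.get("customer_id", "").startswith("CUST-"):
        problems.append("bad customer_id format")
    if "@" not in record.get("email", ""):
        problems.append("bad email")
    if not (18 <= record.get("age", -1) <= 80):
        problems.append("age out of range")
    return problems

for r in sample:
    assert validate(r) == [], validate(r)
print("all sample records passed")
```

The same pattern extends to any schema: one check per constraint you asked the model for, run over the first handful of rows.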
---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

---

## 🙏 Acknowledgments

- **Anthropic** for the Claude API
- **Gradio** for the UI framework
- **Pandas** for data manipulation

---

## 📞 Support

- 📧 Email: udayslathia16@gmail.com

---

## 🔗 Related Projects

- [Claude API Documentation](https://docs.anthropic.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [Pandas Documentation](https://pandas.pydata.org/)

---

<div align="center">

**Made with ❤️ using Claude 3 Haiku**

⭐ Star this repo if you find it useful!

</div>
@@ -0,0 +1,502 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1faf5626-864e-4287-af11-535f9a3f59ae",
"metadata": {},
"source": [
"# 🤖 Synthetic Dataset Generator\n",
"## AI-Powered Synthetic Data Generation with Claude 3 Haiku\n",
"Generate custom synthetic datasets by simply describing your data schema. This tool uses Claude 3 Haiku to create realistic, diverse datasets for testing, development, and prototyping.\n",
"\n",
"## ✨ Features\n",
"\n",
"- 🎯 Schema-Based Generation - Describe your data structure in plain English\n",
"- 🚀 Fast & Efficient - Powered by Claude 3 Haiku for cost-effective generation\n",
"- 📊 Batch Processing - Automatically handles large datasets (200+ records)\n",
"- 💾 Export Ready - Download as CSV for immediate use\n",
"- 🎨 User-Friendly UI - Built with Gradio for easy interaction\n",
"- 🔒 Secure - API key management via .env files\n",
"- 📝 Built-in Examples - Pre-configured schemas for common use cases\n",
"\n",
"## 🌍 Use Cases\n",
"\n",
"+ 🧪 Testing & Development - Generate test data for applications\n",
"+ 📈 Data Science - Create training datasets for ML models\n",
"+ 🎓 Education - Generate sample datasets for learning\n",
"+ 🏢 Prototyping - Quick data mockups for demos\n",
"+ 🔬 Research - Synthetic data for experiments\n",
"\n",
"## 🧠 Model\n",
"\n",
"- AI Model: Anthropic's claude-3-haiku-20240307\n",
"- Task: Structured data generation based on natural language schemas\n",
"- Output Format: JSON arrays converted to Pandas DataFrames and CSV\n",
"\n",
"## 🛠️ Requirements\n",
"### ⚙️ Hardware\n",
"\n",
"- ✅ CPU is sufficient; no GPU required\n",
"- 💾 Minimal RAM (2GB+)\n",
"\n",
"### 📦 Software\n",
"\n",
"- Python 3.8 or higher\n",
"- Anthropic API key\n",
"\n",
"### See `README.md` if you run into errors"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "7ece01a4-0676-4176-86b9-91b0be3a9786",
"metadata": {},
"outputs": [],
"source": [
"import gradio as gr\n",
"import json\n",
"import pandas as pd\n",
"from typing import List, Dict\n",
"import os\n",
"from dotenv import load_dotenv\n",
"import tempfile"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "01665d8a-c483-48c7-92e1-0d92ca4c9731",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load environment variables from .env file\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "3cf53df7-175a-46b0-8508-a8ae34afb65b",
"metadata": {},
"outputs": [],
"source": [
"# Get API key from environment\n",
"ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "53a0686e-26c7-49c0-b048-a113be756c7c",
"metadata": {},
"outputs": [],
"source": [
"# Import anthropic after other imports to avoid conflicts\n",
"try:\n",
"    from anthropic import Anthropic, APIError\n",
"except ImportError:\n",
"    import anthropic\n",
"    Anthropic = anthropic.Anthropic\n",
"    APIError = anthropic.APIError\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5f9cb807-ad4c-45b1-bedf-d342a14ebe4a",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Anthropic client\n",
"def create_client(api_key: str):\n",
"    \"\"\"Create Anthropic client with proper initialization\"\"\"\n",
"    try:\n",
"        # Try normal initialization\n",
"        return Anthropic(api_key=api_key)\n",
"    except TypeError as e:\n",
"        if 'proxies' in str(e):\n",
"            # Workaround for httpx version mismatch\n",
"            import httpx\n",
"            # Create a basic httpx client without proxies\n",
"            http_client = httpx.Client()\n",
"            return Anthropic(api_key=api_key, http_client=http_client)\n",
"        else:\n",
"            raise e\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dea61271-a138-4f9b-979e-77a998a6950c",
"metadata": {},
"outputs": [],
"source": [
"def generate_synthetic_data(\n",
"    api_key: str,\n",
"    schema_description: str,\n",
"    num_records: int,\n",
"    example_format: str = \"\"\n",
") -> tuple:\n",
"    \"\"\"\n",
"    Generate synthetic dataset using Claude 3 Haiku\n",
"    \n",
"    Args:\n",
"        api_key: Anthropic API key\n",
"        schema_description: Description of the data schema\n",
"        num_records: Number of records to generate\n",
"        example_format: Optional example of desired format\n",
"    \n",
"    Returns:\n",
"        tuple: (DataFrame, status message, csv_file_path)\n",
"    \"\"\"\n",
"    try:\n",
"        # Create client\n",
"        client = create_client(api_key)\n",
"        \n",
"        # Construct the prompt\n",
"        example_section = f\"\\n\\nExample format:\\n{example_format}\" if example_format else \"\"\n",
"        \n",
"        prompt = f\"\"\"Generate {num_records} synthetic data records based on the following schema:\n",
"\n",
"{schema_description}{example_section}\n",
"\n",
"Requirements:\n",
"1. Return ONLY a valid JSON array of objects\n",
"2. Each object should be one record matching the schema\n",
"3. Make the data realistic and diverse\n",
"4. Ensure data types are appropriate (strings, numbers, booleans, dates, etc.)\n",
"5. Do not include any explanation, only the JSON array\n",
"\n",
"Generate exactly {num_records} records.\"\"\"\n",
"\n",
"        # Call Claude API with explicit parameters\n",
"        message = client.messages.create(\n",
"            model=\"claude-3-haiku-20240307\",\n",
"            max_tokens=4096,\n",
"            messages=[\n",
"                {\"role\": \"user\", \"content\": prompt}\n",
"            ]\n",
"        )\n",
"        \n",
"        # Extract the response\n",
"        response_text = message.content[0].text\n",
"        \n",
"        # Try to parse JSON from the response\n",
"        # Sometimes Claude might wrap it in markdown code blocks\n",
"        if \"```json\" in response_text:\n",
"            json_str = response_text.split(\"```json\")[1].split(\"```\")[0].strip()\n",
"        elif \"```\" in response_text:\n",
"            json_str = response_text.split(\"```\")[1].split(\"```\")[0].strip()\n",
"        else:\n",
"            json_str = response_text.strip()\n",
"        \n",
"        # Parse JSON\n",
"        data = json.loads(json_str)\n",
"        \n",
"        # Convert to DataFrame\n",
"        df = pd.DataFrame(data)\n",
"        \n",
"        # Save to temporary CSV file with proper path\n",
"        fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_')\n",
"        os.close(fd)  # Close the file descriptor\n",
"        \n",
"        # Write CSV to the temp file\n",
"        df.to_csv(temp_path, index=False)\n",
"        \n",
"        status = f\"✅ Successfully generated {len(df)} records!\"\n",
"        return df, status, temp_path\n",
"    \n",
"    except json.JSONDecodeError as e:\n",
"        return None, f\"❌ Error parsing JSON: {str(e)}\\n\\nResponse received:\\n{response_text[:500] if 'response_text' in locals() else 'N/A'}...\", None\n",
"    except APIError as e:\n",
"        return None, f\"❌ API Error: {str(e)}\", None\n",
"    except Exception as e:\n",
"        return None, f\"❌ Error: {type(e).__name__}: {str(e)}\", None"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "aa95c2aa-ac99-4919-94bd-981cb7bd42b7",
"metadata": {},
"outputs": [],
"source": [
"def generate_batch_data(\n",
"    api_key: str,\n",
"    schema_description: str,\n",
"    total_records: int,\n",
"    example_format: str = \"\",\n",
"    batch_size: int = 50\n",
") -> tuple:\n",
"    \"\"\"\n",
"    Generate larger datasets in batches\n",
"    \"\"\"\n",
"    all_data = []\n",
"    batches = (total_records + batch_size - 1) // batch_size\n",
"    \n",
"    for i in range(batches):\n",
"        records_in_batch = min(batch_size, total_records - len(all_data))\n",
"        df_batch, status, csv_path = generate_synthetic_data(\n",
"            api_key, schema_description, records_in_batch, example_format\n",
"        )\n",
"        \n",
"        if df_batch is not None:\n",
"            all_data.extend(df_batch.to_dict('records'))\n",
"        else:\n",
"            return None, f\"❌ Error in batch {i+1}: {status}\", None\n",
"    \n",
"    final_df = pd.DataFrame(all_data)\n",
"    \n",
"    # Save final CSV with proper temp file handling\n",
"    fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_batch_')\n",
"    os.close(fd)\n",
"    \n",
"    final_df.to_csv(temp_path, index=False)\n",
"    \n",
"    status = f\"✅ Successfully generated {len(final_df)} records in {batches} batches!\"\n",
"    return final_df, status, temp_path\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "b73aff00-c0c0-43d4-96a9-43b0cd84de2b",
"metadata": {},
"outputs": [],
"source": [
"# Create Gradio Interface\n",
"def create_interface():\n",
"    with gr.Blocks(title=\"Synthetic Dataset Generator\", theme=gr.themes.Soft()) as demo:\n",
"        gr.Markdown(\"\"\"\n",
"        # 🤖 Synthetic Dataset Generator\n",
"        ### Powered by Claude 3 Haiku\n",
"        \n",
"        Create custom synthetic datasets by describing your schema. Claude will generate realistic data matching your specifications.\n",
"        \"\"\")\n",
"        \n",
"        with gr.Row():\n",
"            with gr.Column(scale=1):\n",
"                # Show API key input only if not found in environment\n",
"                if not ANTHROPIC_API_KEY:\n",
"                    api_key_input = gr.Textbox(\n",
"                        label=\"Anthropic API Key\",\n",
"                        type=\"password\",\n",
"                        placeholder=\"sk-ant-...\",\n",
"                        info=\"API key not found in .env file\"\n",
"                    )\n",
"                else:\n",
"                    api_key_input = gr.Textbox(\n",
"                        label=\"Anthropic API Key\",\n",
"                        type=\"password\",\n",
"                        value=ANTHROPIC_API_KEY,\n",
"                        placeholder=\"Loaded from .env\",\n",
"                        info=\"✅ API key loaded from environment\",\n",
"                        interactive=False\n",
"                    )\n",
"                \n",
"                schema_input = gr.Textbox(\n",
"                    label=\"Data Schema Description\",\n",
"                    placeholder=\"\"\"Example: Generate customer data with:\n",
"- name (full name)\n",
"- email (valid email address)\n",
"- age (between 18-80)\n",
"- city (US cities)\n",
"- purchase_amount (between $10-$1000)\n",
"- join_date (dates in 2023-2024)\"\"\",\n",
"                    lines=10\n",
"                )\n",
"                \n",
"                example_input = gr.Textbox(\n",
"                    label=\"Example Format (Optional)\",\n",
"                    placeholder=\"\"\"{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 35, \"city\": \"New York\", \"purchase_amount\": 299.99, \"join_date\": \"2023-05-15\"}\"\"\",\n",
"                    lines=4\n",
"                )\n",
"                \n",
"                num_records = gr.Slider(\n",
"                    minimum=1,\n",
"                    maximum=200,\n",
"                    value=10,\n",
"                    step=1,\n",
"                    label=\"Number of Records\"\n",
"                )\n",
"                \n",
"                generate_btn = gr.Button(\"🚀 Generate Dataset\", variant=\"primary\")\n",
"                \n",
"            with gr.Column(scale=2):\n",
"                status_output = gr.Textbox(label=\"Status\", lines=2)\n",
"                dataframe_output = gr.Dataframe(\n",
"                    label=\"Generated Dataset\",\n",
"                    wrap=True\n",
"                )\n",
"                csv_output = gr.File(label=\"Download CSV\", type=\"filepath\")\n",
"        \n",
"        # Examples\n",
"        gr.Markdown(\"### 📝 Example Schemas\")\n",
"        gr.Examples(\n",
"            examples=[\n",
"                [\n",
"                    \"\"\"Generate employee records with:\n",
"- employee_id (format: EMP001, EMP002, etc.)\n",
"- name (full name)\n",
"- department (Engineering, Sales, Marketing, HR, Finance)\n",
"- salary (between $40,000-$150,000)\n",
"- hire_date (between 2020-2024)\n",
"- performance_rating (1-5)\"\"\",\n",
"                    10\n",
"                ],\n",
"                [\n",
"                    \"\"\"Generate e-commerce product data with:\n",
"- product_id (format: PRD-XXXX)\n",
"- product_name (creative product names)\n",
"- category (Electronics, Clothing, Home, Books, Sports)\n",
"- price (between $5-$500)\n",
"- stock_quantity (between 0-1000)\n",
"- rating (1.0-5.0)\n",
"- num_reviews (0-500)\"\"\",\n",
"                    15\n",
"                ],\n",
"                [\n",
"                    \"\"\"Generate student records with:\n",
"- student_id (format: STU2024XXX)\n",
"- name (full name)\n",
"- major (Computer Science, Biology, Business, Arts, Engineering)\n",
"- gpa (2.0-4.0)\n",
"- year (Freshman, Sophomore, Junior, Senior)\n",
"- credits_completed (0-120)\"\"\",\n",
"                    20\n",
"                ]\n",
"            ],\n",
"            inputs=[schema_input, num_records]\n",
"        )\n",
"        \n",
"        def generate_wrapper(api_key, schema, num_rec, example):\n",
"            # Use environment API key if available, otherwise use input\n",
"            final_api_key = ANTHROPIC_API_KEY or api_key\n",
"            \n",
"            if not final_api_key:\n",
"                return None, \"❌ Please provide your Anthropic API key (either in .env file or input field)\", None\n",
"            if not schema:\n",
"                return None, \"❌ Please describe your data schema\", None\n",
"            \n",
"            # For larger datasets, use batch generation\n",
"            if num_rec > 50:\n",
"                return generate_batch_data(final_api_key, schema, num_rec, example)\n",
"            else:\n",
"                return generate_synthetic_data(final_api_key, schema, num_rec, example)\n",
"        \n",
"        generate_btn.click(\n",
"            fn=generate_wrapper,\n",
"            inputs=[api_key_input, schema_input, num_records, example_input],\n",
"            outputs=[dataframe_output, status_output, csv_output]\n",
"        )\n",
"        \n",
"        gr.Markdown(\"\"\"\n",
"        ---\n",
"        ### 💡 Tips:\n",
"        - Be specific about data types, ranges, and formats\n",
"        - Provide examples for better results\n",
"        - For large datasets (>50 records), generation happens in batches\n",
"        - Claude 3 Haiku is fast and cost-effective for this task\n",
"        \n",
"        ### 🔑 API Key Setup:\n",
"        Create a `.env` file in the same directory with:\n",
"        ```\n",
"        ANTHROPIC_API_KEY=your_api_key_here\n",
"        ```\n",
"        \n",
"        ### ⚠️ Troubleshooting:\n",
"        If you see a \"proxies\" error, pin httpx to a compatible version:\n",
"        ```\n",
"        pip install \"httpx==0.27.2\"\n",
"        ```\n",
"        \"\"\")\n",
"    \n",
"    return demo\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "cef71337-b446-46b2-b84b-d23b7dd4f13e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7867\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7867/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"demo = create_interface()\n",
"demo.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec34fee8-eeb1-4015-95fe-62276927d25a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,5 @@
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2