{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1faf5626-864e-4287-af11-535f9a3f59ae",
   "metadata": {},
   "source": [
    "# 🤖 Synthetic Dataset Generator\n",
    "## AI-Powered Synthetic Data Generation with Claude 3 Haiku\n",
    "Generate custom synthetic datasets by simply describing your data schema. This tool uses Claude 3 Haiku to create realistic, diverse datasets for testing, development, and prototyping.\n",
    "\n",
    "  \n",
    "\n",
    "## ✨ Features\n",
    "\n",
    "- 🎯 Schema-Based Generation - Describe your data structure in plain English\n",
    "- 🚀 Fast & Efficient - Powered by Claude 3 Haiku for cost-effective generation\n",
    "- 📊 Batch Processing - Automatically splits requests of more than 50 records into batches (up to 200 records per run)\n",
    "- 💾 Export Ready - Download as CSV for immediate use\n",
    "- 🎨 User-Friendly UI - Built with Gradio for easy interaction\n",
    "- 🔒 Secure - API key management via .env files\n",
    "- 📝 Built-in Examples - Pre-configured schemas for common use cases\n",
    "\n",
    "## 🌍 Use Cases\n",
    "\n",
    "+ 🧪 Testing & Development - Generate test data for applications\n",
    "+ 📈 Data Science - Create training datasets for ML models\n",
    "+ 🎓 Education - Generate sample datasets for learning\n",
    "+ 🏢 Prototyping - Quick data mockups for demos\n",
    "+ 🔬 Research - Synthetic data for experiments\n",
    "\n",
    "## 🧠 Model\n",
    "\n",
    "- AI Model: Anthropic's claude-3-haiku-20240307\n",
    "- Task: Structured data generation based on natural language schemas\n",
    "- Output Format: JSON arrays converted to pandas DataFrames and CSV\n",
    "\n",
    "## 🛠️ Requirements\n",
    "### ⚙️ Hardware\n",
    "\n",
    "- ✅ CPU is sufficient - no GPU required\n",
    "- 💾 Minimal RAM (2GB+)\n",
    "\n",
    "### 📦 Software\n",
    "\n",
    "- Python 3.8 or higher\n",
    "- Anthropic API key\n",
    "\n",
    "### See `README.md` for help with errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "7ece01a4-0676-4176-86b9-91b0be3a9786",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "import json\n",
    "import pandas as pd\n",
    "from typing import List, Dict\n",
    "import os\n",
    "from dotenv import load_dotenv\n",
    "import tempfile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "01665d8a-c483-48c7-92e1-0d92ca4c9731",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load environment variables from .env file\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "3cf53df7-175a-46b0-8508-a8ae34afb65b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get API key from environment\n",
    "ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "53a0686e-26c7-49c0-b048-a113be756c7c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the Anthropic client and its base API error class.\n",
    "# A fallback to `import anthropic` adds nothing here: if this import fails,\n",
    "# the package is missing or too old, and the fallback would fail the same way.\n",
    "from anthropic import Anthropic, APIError\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5f9cb807-ad4c-45b1-bedf-d342a14ebe4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize Anthropic client\n",
    "def create_client(api_key: str):\n",
    "    \"\"\"Create Anthropic client with proper initialization\"\"\"\n",
    "    try:\n",
    "        # Try normal initialization\n",
    "        return Anthropic(api_key=api_key)\n",
    "    except TypeError as e:\n",
    "        if 'proxies' in str(e):\n",
    "            # Workaround for httpx version mismatch\n",
    "            import httpx\n",
    "            # Create a basic httpx client without proxies\n",
    "            http_client = httpx.Client()\n",
    "            return Anthropic(api_key=api_key, http_client=http_client)\n",
    "        else:\n",
    "            raise\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "dea61271-a138-4f9b-979e-77a998a6950c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_synthetic_data(\n",
    "    api_key: str,\n",
    "    schema_description: str,\n",
    "    num_records: int,\n",
    "    example_format: str = \"\"\n",
    ") -> tuple:\n",
    "    \"\"\"\n",
    "    Generate synthetic dataset using Claude 3 Haiku\n",
    "    \n",
    "    Args:\n",
    "        api_key: Anthropic API key\n",
    "        schema_description: Description of the data schema\n",
    "        num_records: Number of records to generate\n",
    "        example_format: Optional example of desired format\n",
    "    \n",
    "    Returns:\n",
    "        tuple: (DataFrame, status message, csv_file_path)\n",
    "    \"\"\"\n",
    "    # Defined before the try block so the JSON error handler can always read it\n",
    "    response_text = \"\"\n",
    "    try:\n",
    "        # Create client\n",
    "        client = create_client(api_key)\n",
    "        \n",
    "        # Construct the prompt\n",
    "        example_section = f\"\\n\\nExample format:\\n{example_format}\" if example_format else \"\"\n",
    "        \n",
    "        prompt = f\"\"\"Generate {num_records} synthetic data records based on the following schema:\n",
    "\n",
    "{schema_description}{example_section}\n",
    "\n",
    "Requirements:\n",
    "1. Return ONLY a valid JSON array of objects\n",
    "2. Each object should be one record matching the schema\n",
    "3. Make the data realistic and diverse\n",
    "4. Ensure data types are appropriate (strings, numbers, booleans, dates, etc.)\n",
    "5. Do not include any explanation, only the JSON array\n",
    "\n",
    "Generate exactly {num_records} records.\"\"\"\n",
    "\n",
    "        # Call Claude API with explicit parameters\n",
    "        message = client.messages.create(\n",
    "            model=\"claude-3-haiku-20240307\",\n",
    "            max_tokens=4096,\n",
    "            messages=[\n",
    "                {\"role\": \"user\", \"content\": prompt}\n",
    "            ]\n",
    "        )\n",
    "        \n",
    "        # Extract the response\n",
    "        response_text = message.content[0].text\n",
    "        \n",
    "        # Try to parse JSON from the response\n",
    "        # Sometimes Claude might wrap it in markdown code blocks\n",
    "        if \"```json\" in response_text:\n",
    "            json_str = response_text.split(\"```json\")[1].split(\"```\")[0].strip()\n",
    "        elif \"```\" in response_text:\n",
    "            json_str = response_text.split(\"```\")[1].split(\"```\")[0].strip()\n",
    "        else:\n",
    "            json_str = response_text.strip()\n",
    "        \n",
    "        # Parse JSON\n",
    "        data = json.loads(json_str)\n",
    "        \n",
    "        # Convert to DataFrame\n",
    "        df = pd.DataFrame(data)\n",
    "        \n",
    "        # Save to temporary CSV file with proper path\n",
    "        fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_')\n",
    "        os.close(fd)  # Close the file descriptor\n",
    "        \n",
    "        # Write CSV to the temp file\n",
    "        df.to_csv(temp_path, index=False)\n",
    "        \n",
    "        status = f\"✅ Successfully generated {len(df)} records!\"\n",
    "        return df, status, temp_path\n",
    "    \n",
    "    except json.JSONDecodeError as e:\n",
    "        return None, f\"❌ Error parsing JSON: {str(e)}\\n\\nResponse received:\\n{response_text[:500] or 'N/A'}...\", None\n",
    "    except APIError as e:\n",
    "        return None, f\"❌ API Error: {str(e)}\", None\n",
    "    except Exception as e:\n",
    "        return None, f\"❌ Error: {type(e).__name__}: {str(e)}\", None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "aa95c2aa-ac99-4919-94bd-981cb7bd42b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_batch_data(\n",
    "    api_key: str,\n",
    "    schema_description: str,\n",
    "    total_records: int,\n",
    "    example_format: str = \"\",\n",
    "    batch_size: int = 50\n",
    ") -> tuple:\n",
    "    \"\"\"\n",
    "    Generate larger datasets in batches\n",
    "    \"\"\"\n",
    "    all_data = []\n",
    "    batches = (total_records + batch_size - 1) // batch_size\n",
    "    batches_run = 0\n",
    "    \n",
    "    for i in range(batches):\n",
    "        records_in_batch = min(batch_size, total_records - len(all_data))\n",
    "        if records_in_batch <= 0:\n",
    "            break  # Earlier batches already returned enough records\n",
    "        df_batch, status, csv_path = generate_synthetic_data(\n",
    "            api_key, schema_description, records_in_batch, example_format\n",
    "        )\n",
    "        \n",
    "        if df_batch is not None:\n",
    "            all_data.extend(df_batch.to_dict('records'))\n",
    "            os.remove(csv_path)  # Only the combined CSV is returned, so drop the per-batch file\n",
    "            batches_run += 1\n",
    "        else:\n",
    "            return None, f\"❌ Error in batch {i+1}: {status}\", None\n",
    "    \n",
    "    # Trim any surplus in case the model returned more records than requested\n",
    "    final_df = pd.DataFrame(all_data[:total_records])\n",
    "    \n",
    "    # Save final CSV with proper temp file handling\n",
    "    fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_batch_')\n",
    "    os.close(fd)\n",
    "    \n",
    "    final_df.to_csv(temp_path, index=False)\n",
    "    \n",
    "    status = f\"✅ Successfully generated {len(final_df)} records in {batches_run} batches!\"\n",
    "    return final_df, status, temp_path\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "b73aff00-c0c0-43d4-96a9-43b0cd84de2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create Gradio Interface\n",
    "def create_interface():\n",
    "    with gr.Blocks(title=\"Synthetic Dataset Generator\", theme=gr.themes.Soft()) as demo:\n",
    "        gr.Markdown(\"\"\"\n",
    "        # 🤖 Synthetic Dataset Generator\n",
    "        ### Powered by Claude 3 Haiku\n",
    "        \n",
    "        Create custom synthetic datasets by describing your schema. Claude will generate realistic data matching your specifications.\n",
    "        \"\"\")\n",
    "        \n",
    "        with gr.Row():\n",
    "            with gr.Column(scale=1):\n",
    "                # Ask for the API key only if it was not found in the environment\n",
    "                if not ANTHROPIC_API_KEY:\n",
    "                    api_key_input = gr.Textbox(\n",
    "                        label=\"Anthropic API Key\",\n",
    "                        type=\"password\",\n",
    "                        placeholder=\"sk-ant-...\",\n",
    "                        info=\"API key not found in .env file\"\n",
    "                    )\n",
    "                else:\n",
    "                    # Leave the field empty and read-only so the key is never sent to the browser;\n",
    "                    # generate_wrapper reads ANTHROPIC_API_KEY directly\n",
    "                    api_key_input = gr.Textbox(\n",
    "                        label=\"Anthropic API Key\",\n",
    "                        type=\"password\",\n",
    "                        placeholder=\"Loaded from .env\",\n",
    "                        info=\"✅ API key loaded from environment\",\n",
    "                        interactive=False\n",
    "                    )\n",
    "                \n",
    "                schema_input = gr.Textbox(\n",
    "                    label=\"Data Schema Description\",\n",
    "                    placeholder=\"\"\"Example: Generate customer data with:\n",
    "- name (full name)\n",
    "- email (valid email address)\n",
    "- age (between 18-80)\n",
    "- city (US cities)\n",
    "- purchase_amount (between $10-$1000)\n",
    "- join_date (dates in 2023-2024)\"\"\",\n",
    "                    lines=10\n",
    "                )\n",
    "                \n",
    "                example_input = gr.Textbox(\n",
    "                    label=\"Example Format (Optional)\",\n",
    "                    placeholder=\"\"\"{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 35, \"city\": \"New York\", \"purchase_amount\": 299.99, \"join_date\": \"2023-05-15\"}\"\"\",\n",
    "                    lines=4\n",
    "                )\n",
    "                \n",
    "                num_records = gr.Slider(\n",
    "                    minimum=1,\n",
    "                    maximum=200,\n",
    "                    value=10,\n",
    "                    step=1,\n",
    "                    label=\"Number of Records\"\n",
    "                )\n",
    "                \n",
    "                generate_btn = gr.Button(\"🚀 Generate Dataset\", variant=\"primary\")\n",
    "            \n",
    "            with gr.Column(scale=2):\n",
    "                status_output = gr.Textbox(label=\"Status\", lines=2)\n",
    "                dataframe_output = gr.Dataframe(\n",
    "                    label=\"Generated Dataset\",\n",
    "                    wrap=True\n",
    "                )\n",
    "                csv_output = gr.File(label=\"Download CSV\", type=\"filepath\")\n",
    "        \n",
    "        # Examples\n",
    "        gr.Markdown(\"### 📝 Example Schemas\")\n",
    "        gr.Examples(\n",
    "            examples=[\n",
    "                [\n",
    "                    \"\"\"Generate employee records with:\n",
    "- employee_id (format: EMP001, EMP002, etc.)\n",
    "- name (full name)\n",
    "- department (Engineering, Sales, Marketing, HR, Finance)\n",
    "- salary (between $40,000-$150,000)\n",
    "- hire_date (between 2020-2024)\n",
    "- performance_rating (1-5)\"\"\",\n",
    "                    10\n",
    "                ],\n",
    "                [\n",
    "                    \"\"\"Generate e-commerce product data with:\n",
    "- product_id (format: PRD-XXXX)\n",
    "- product_name (creative product names)\n",
    "- category (Electronics, Clothing, Home, Books, Sports)\n",
    "- price (between $5-$500)\n",
    "- stock_quantity (between 0-1000)\n",
    "- rating (1.0-5.0)\n",
    "- num_reviews (0-500)\"\"\",\n",
    "                    15\n",
    "                ],\n",
    "                [\n",
    "                    \"\"\"Generate student records with:\n",
    "- student_id (format: STU2024XXX)\n",
    "- name (full name)\n",
    "- major (Computer Science, Biology, Business, Arts, Engineering)\n",
    "- gpa (2.0-4.0)\n",
    "- year (Freshman, Sophomore, Junior, Senior)\n",
    "- credits_completed (0-120)\"\"\",\n",
    "                    20\n",
    "                ]\n",
    "            ],\n",
    "            inputs=[schema_input, num_records]\n",
    "        )\n",
    "        \n",
    "        def generate_wrapper(api_key, schema, num_rec, example):\n",
    "            # Use environment API key if available, otherwise use input\n",
    "            final_api_key = ANTHROPIC_API_KEY or api_key\n",
    "            \n",
    "            if not final_api_key:\n",
    "                return None, \"❌ Please provide your Anthropic API key (either in .env file or input field)\", None\n",
    "            if not schema:\n",
    "                return None, \"❌ Please describe your data schema\", None\n",
    "            \n",
    "            # Guard against the slider passing a float such as 10.0\n",
    "            num_rec = int(num_rec)\n",
    "            \n",
    "            # For larger datasets, use batch generation\n",
    "            if num_rec > 50:\n",
    "                return generate_batch_data(final_api_key, schema, num_rec, example)\n",
    "            else:\n",
    "                return generate_synthetic_data(final_api_key, schema, num_rec, example)\n",
    "        \n",
    "        generate_btn.click(\n",
    "            fn=generate_wrapper,\n",
    "            inputs=[api_key_input, schema_input, num_records, example_input],\n",
    "            outputs=[dataframe_output, status_output, csv_output]\n",
    "        )\n",
    "        \n",
    "        gr.Markdown(\"\"\"\n",
    "        ---\n",
    "        ### 💡 Tips:\n",
    "        - Be specific about data types, ranges, and formats\n",
    "        - Provide examples for better results\n",
    "        - For large datasets (>50 records), generation happens in batches\n",
    "        - Claude 3 Haiku is fast and cost-effective for this task\n",
    "        \n",
    "        ### 🔑 API Key Setup:\n",
    "        Create a `.env` file in the same directory with:\n",
    "        ```\n",
    "        ANTHROPIC_API_KEY=your_api_key_here\n",
    "        ```\n",
    "        \n",
    "        ### ⚠️ Troubleshooting:\n",
    "        If you see a \"proxies\" error, your `anthropic` package is likely too old for the installed httpx (older SDK releases pass a `proxies` argument that newer httpx versions no longer accept). Upgrade the SDK:\n",
    "        ```\n",
    "        pip install --upgrade anthropic\n",
    "        ```\n",
    "        \"\"\")\n",
    "    \n",
    "    return demo\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "cef71337-b446-46b2-b84b-d23b7dd4f13e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* Running on local URL: http://127.0.0.1:7867\n",
      "\n",
      "To create a public link, set `share=True` in `launch()`.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><iframe src=\"http://127.0.0.1:7867/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "demo = create_interface()\n",
    "demo.launch()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec34fee8-eeb1-4015-95fe-62276927d25a",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}