{
"cells": [
{
"cell_type": "markdown",
"id": "1faf5626-864e-4287-af11-535f9a3f59ae",
"metadata": {},
"source": [
"# 🤖 Synthetic Dataset Generator\n",
"## AI-Powered Synthetic Data Generation with Claude 3 Haiku\n",
"Generate custom synthetic datasets by simply describing your data schema. This tool uses Claude 3 Haiku to create realistic, diverse datasets for testing, development, and prototyping.\n",
"\n",
"![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg) ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)\n",
"\n",
"## ✨ Features\n",
"\n",
"- 🎯 Schema-Based Generation - Describe your data structure in plain English\n",
"- 🚀 Fast & Efficient - Powered by Claude 3 Haiku for cost-effective generation\n",
"- 📊 Batch Processing - Requests for more than 50 records are split into batches of 50 (the UI allows up to 200 records)\n",
"- 💾 Export Ready - Download as CSV for immediate use\n",
"- 🎨 User-Friendly UI - Built with Gradio for easy interaction\n",
"- 🔒 Secure - API key management via .env files\n",
"- 📝 Built-in Examples - Pre-configured schemas for common use cases\n",
"\n",
"## 🌍 Use Cases\n",
"\n",
"+ 🧪 Testing & Development - Generate test data for applications\n",
"+ 📈 Data Science - Create training datasets for ML models\n",
"+ 🎓 Education - Generate sample datasets for learning\n",
"+ 🏢 Prototyping - Quick data mockups for demos\n",
"+ 🔬 Research - Synthetic data for experiments\n",
"\n",
"## 🧠 Model\n",
"\n",
"- AI Model: Anthropic's claude-3-haiku-20240307\n",
"- Task: Structured data generation based on natural language schemas\n",
"- Output Format: JSON arrays converted to Pandas DataFrames and CSV\n",
"\n",
"## 🛠️ Requirements\n",
"### ⚙️ Hardware\n",
"\n",
"- ✅ CPU is sufficient — No GPU required\n",
"- 💾 Minimal RAM (2GB+)\n",
"\n",
"### 📦 Software\n",
"\n",
"- Python 3.8 or higher\n",
"- Anthropic API key\n",
"\n",
"### See `README.md` for help with errors"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "7ece01a4-0676-4176-86b9-91b0be3a9786",
"metadata": {},
"outputs": [],
"source": [
"import gradio as gr\n",
"import json\n",
"import pandas as pd\n",
"from typing import List, Dict\n",
"import os\n",
"from dotenv import load_dotenv\n",
"import tempfile"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "01665d8a-c483-48c7-92e1-0d92ca4c9731",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load environment variables from .env file\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "3cf53df7-175a-46b0-8508-a8ae34afb65b",
"metadata": {},
"outputs": [],
"source": [
"# Get API key from environment\n",
"ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')"
]
},
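{
"cell_type": "markdown",
"id": "sketch-api-key-check",
"metadata": {},
"source": [
"The cell above reads `ANTHROPIC_API_KEY` from the environment. As a minimal sketch, not part of the original notebook, the hypothetical helper `describe_api_key` below reports whether the variable is set while showing only its last four characters, so the secret itself does not appear in cell output.\n",
"\n",
"```python\n",
"import os\n",
"\n",
"\n",
"def describe_api_key(env_var='ANTHROPIC_API_KEY'):\n",
"    '''Hypothetical helper: report whether a key is set without revealing it.'''\n",
"    key = os.getenv(env_var)\n",
"    if not key:\n",
"        return f'{env_var} is not set'\n",
"    # Show only the last four characters so the full secret is never printed\n",
"    return f'{env_var} is set (ends with ...{key[-4:]})'\n",
"\n",
"\n",
"print(describe_api_key())\n",
"```\n",
"\n",
"Running this right after `load_dotenv()` is one way to confirm that the `.env` file was picked up before launching the interface."
]
},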
{
"cell_type": "code",
"execution_count": 31,
"id": "53a0686e-26c7-49c0-b048-a113be756c7c",
"metadata": {},
"outputs": [],
"source": [
"# Import anthropic after other imports to avoid conflicts\n",
"try:\n",
"    from anthropic import Anthropic, APIError\n",
"except ImportError:\n",
"    import anthropic\n",
"    Anthropic = anthropic.Anthropic\n",
"    APIError = anthropic.APIError\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5f9cb807-ad4c-45b1-bedf-d342a14ebe4a",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Anthropic client\n",
"def create_client(api_key: str):\n",
"    \"\"\"Create Anthropic client with proper initialization\"\"\"\n",
"    try:\n",
"        # Try normal initialization\n",
"        return Anthropic(api_key=api_key)\n",
"    except TypeError as e:\n",
"        if 'proxies' in str(e):\n",
"            # Workaround for httpx version mismatch\n",
"            import httpx\n",
"            # Create a basic httpx client without proxies\n",
"            http_client = httpx.Client()\n",
"            return Anthropic(api_key=api_key, http_client=http_client)\n",
"        else:\n",
"            # Re-raise unrelated errors with the original traceback\n",
"            raise\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dea61271-a138-4f9b-979e-77a998a6950c",
"metadata": {},
"outputs": [],
"source": [
"def generate_synthetic_data(\n",
"    api_key: str,\n",
"    schema_description: str,\n",
"    num_records: int,\n",
"    example_format: str = \"\"\n",
") -> tuple:\n",
"    \"\"\"\n",
"    Generate synthetic dataset using Claude 3 Haiku\n",
"\n",
"    Args:\n",
"        api_key: Anthropic API key\n",
"        schema_description: Description of the data schema\n",
"        num_records: Number of records to generate\n",
"        example_format: Optional example of desired format\n",
"\n",
"    Returns:\n",
"        tuple: (DataFrame, status message, csv_file_path)\n",
"    \"\"\"\n",
"    # Initialized up front so the JSON error handler can always reference it\n",
"    response_text = \"\"\n",
"    try:\n",
"        # Create client\n",
"        client = create_client(api_key)\n",
"\n",
"        # Construct the prompt\n",
"        example_section = f\"\\n\\nExample format:\\n{example_format}\" if example_format else \"\"\n",
"\n",
"        prompt = f\"\"\"Generate {num_records} synthetic data records based on the following schema:\n",
"\n",
"{schema_description}{example_section}\n",
"\n",
"Requirements:\n",
"1. Return ONLY a valid JSON array of objects\n",
"2. Each object should be one record matching the schema\n",
"3. Make the data realistic and diverse\n",
"4. Ensure data types are appropriate (strings, numbers, booleans, dates, etc.)\n",
"5. Do not include any explanation, only the JSON array\n",
"\n",
"Generate exactly {num_records} records.\"\"\"\n",
"\n",
"        # Call Claude API with explicit parameters\n",
"        message = client.messages.create(\n",
"            model=\"claude-3-haiku-20240307\",\n",
"            max_tokens=4096,\n",
"            messages=[\n",
"                {\"role\": \"user\", \"content\": prompt}\n",
"            ]\n",
"        )\n",
"\n",
"        # Extract the response\n",
"        response_text = message.content[0].text\n",
"\n",
"        # Try to parse JSON from the response\n",
"        # Sometimes Claude might wrap it in markdown code blocks\n",
"        if \"```json\" in response_text:\n",
"            json_str = response_text.split(\"```json\")[1].split(\"```\")[0].strip()\n",
"        elif \"```\" in response_text:\n",
"            json_str = response_text.split(\"```\")[1].split(\"```\")[0].strip()\n",
"        else:\n",
"            json_str = response_text.strip()\n",
"\n",
"        # Parse JSON\n",
"        data = json.loads(json_str)\n",
"\n",
"        # Convert to DataFrame\n",
"        df = pd.DataFrame(data)\n",
"\n",
"        # Save to temporary CSV file with proper path\n",
"        fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_')\n",
"        os.close(fd)  # Close the file descriptor\n",
"\n",
"        # Write CSV to the temp file\n",
"        df.to_csv(temp_path, index=False)\n",
"\n",
"        status = f\"✅ Successfully generated {len(df)} records!\"\n",
"        return df, status, temp_path\n",
"\n",
"    except json.JSONDecodeError as e:\n",
"        return None, f\"❌ Error parsing JSON: {str(e)}\\n\\nResponse received:\\n{response_text[:500] or 'N/A'}...\", None\n",
"    except APIError as e:\n",
"        return None, f\"❌ API Error: {str(e)}\", None\n",
"    except Exception as e:\n",
"        return None, f\"❌ Error: {type(e).__name__}: {str(e)}\", None"
]
},
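{
"cell_type": "markdown",
"id": "sketch-json-extraction",
"metadata": {},
"source": [
"The function above strips an optional markdown code fence from the model's reply before calling `json.loads`. The sketch below isolates that step so it can be tried without an API call. The name `extract_json_array` and the constant `FENCE` are hypothetical and are not used elsewhere in this notebook. Unlike the original, the sketch also rejects replies that parse as something other than a list.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Three backticks, built this way so the marker does not end this code block\n",
"FENCE = '`' * 3\n",
"\n",
"\n",
"def extract_json_array(response_text):\n",
"    '''Hypothetical helper: parse a reply that may be wrapped in a code fence.'''\n",
"    text = response_text.strip()\n",
"    if FENCE in text:\n",
"        # Take the body of the first fenced block and drop an optional 'json' tag\n",
"        body = text.split(FENCE)[1]\n",
"        if body.startswith('json'):\n",
"            body = body[len('json'):]\n",
"        text = body.strip()\n",
"    data = json.loads(text)\n",
"    if not isinstance(data, list):\n",
"        raise ValueError('expected a JSON array of records')\n",
"    return data\n",
"\n",
"\n",
"plain = '[{\"id\": 1}, {\"id\": 2}]'\n",
"fenced = FENCE + 'json\\n[{\"id\": 3}]\\n' + FENCE\n",
"print(extract_json_array(plain))   # [{'id': 1}, {'id': 2}]\n",
"print(extract_json_array(fenced))  # [{'id': 3}]\n",
"```\n",
"\n",
"A reply that is cut off by the `max_tokens` limit will still fail in `json.loads`, which is why the function above reports the first 500 characters of the raw response on a parse error."
]
},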
{
"cell_type": "code",
"execution_count": 34,
"id": "aa95c2aa-ac99-4919-94bd-981cb7bd42b7",
"metadata": {},
"outputs": [],
"source": [
"def generate_batch_data(\n",
"    api_key: str,\n",
"    schema_description: str,\n",
"    total_records: int,\n",
"    example_format: str = \"\",\n",
"    batch_size: int = 50\n",
") -> tuple:\n",
"    \"\"\"\n",
"    Generate larger datasets in batches\n",
"    \"\"\"\n",
"    all_data = []\n",
"    batches = (total_records + batch_size - 1) // batch_size\n",
"\n",
"    for i in range(batches):\n",
"        records_in_batch = min(batch_size, total_records - len(all_data))\n",
"        # Stop early if earlier batches already returned enough records\n",
"        if records_in_batch <= 0:\n",
"            break\n",
"        df_batch, status, csv_path = generate_synthetic_data(\n",
"            api_key, schema_description, records_in_batch, example_format\n",
"        )\n",
"\n",
"        if df_batch is not None:\n",
"            all_data.extend(df_batch.to_dict('records'))\n",
"            # Delete the per-batch temp CSV; only the combined file is returned\n",
"            if csv_path and os.path.exists(csv_path):\n",
"                os.remove(csv_path)\n",
"        else:\n",
"            return None, f\"❌ Error in batch {i+1}: {status}\", None\n",
"\n",
"    final_df = pd.DataFrame(all_data)\n",
"\n",
"    # Save final CSV with proper temp file handling\n",
"    fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_batch_')\n",
"    os.close(fd)\n",
"\n",
"    final_df.to_csv(temp_path, index=False)\n",
"\n",
"    status = f\"✅ Successfully generated {len(final_df)} records in {batches} batches!\"\n",
"    return final_df, status, temp_path\n"
]
},
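{
"cell_type": "markdown",
"id": "sketch-batch-plan",
"metadata": {},
"source": [
"The batch count above is a ceiling division, and each batch requests at most `batch_size` records. As a sketch of that arithmetic only, the hypothetical helper `plan_batches` below lists the sizes that would be requested when every call returns exactly what was asked for. It makes no API calls and is not used by the notebook.\n",
"\n",
"```python\n",
"def plan_batches(total_records, batch_size=50):\n",
"    '''Hypothetical helper: list the record count requested in each batch.'''\n",
"    if total_records < 0 or batch_size <= 0:\n",
"        raise ValueError('total_records must be >= 0 and batch_size must be > 0')\n",
"    sizes = []\n",
"    remaining = total_records\n",
"    while remaining > 0:\n",
"        size = min(batch_size, remaining)\n",
"        sizes.append(size)\n",
"        remaining -= size\n",
"    return sizes\n",
"\n",
"\n",
"print(plan_batches(120))  # [50, 50, 20]\n",
"# The number of batches matches the ceiling division used above\n",
"print(len(plan_batches(120)) == (120 + 50 - 1) // 50)  # True\n",
"```\n",
"\n",
"In practice the model may return fewer records than requested, so the real function recomputes the remaining count from `len(all_data)` on every iteration instead of using a fixed plan."
]
},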
{
"cell_type": "code",
"execution_count": 39,
"id": "b73aff00-c0c0-43d4-96a9-43b0cd84de2b",
"metadata": {},
"outputs": [],
"source": [
"# Create Gradio Interface\n",
"def create_interface():\n",
"    with gr.Blocks(title=\"Synthetic Dataset Generator\", theme=gr.themes.Soft()) as demo:\n",
"        gr.Markdown(\"\"\"\n",
"        # 🤖 Synthetic Dataset Generator\n",
"        ### Powered by Claude 3 Haiku\n",
"\n",
"        Create custom synthetic datasets by describing your schema. Claude will generate realistic data matching your specifications.\n",
"        \"\"\")\n",
"\n",
"        with gr.Row():\n",
"            with gr.Column(scale=1):\n",
"                # Show API key input only if not found in environment\n",
"                if not ANTHROPIC_API_KEY:\n",
"                    api_key_input = gr.Textbox(\n",
"                        label=\"Anthropic API Key\",\n",
"                        type=\"password\",\n",
"                        placeholder=\"sk-ant-...\",\n",
"                        info=\"API key not found in .env file\"\n",
"                    )\n",
"                else:\n",
"                    # The key itself is not placed in the widget, so it is never sent to the browser;\n",
"                    # generate_wrapper reads ANTHROPIC_API_KEY directly\n",
"                    api_key_input = gr.Textbox(\n",
"                        label=\"Anthropic API Key\",\n",
"                        type=\"password\",\n",
"                        placeholder=\"Loaded from .env\",\n",
"                        info=\"✅ API key loaded from environment\",\n",
"                        interactive=False\n",
"                    )\n",
"\n",
"                schema_input = gr.Textbox(\n",
"                    label=\"Data Schema Description\",\n",
"                    placeholder=\"\"\"Example: Generate customer data with:\n",
"- name (full name)\n",
"- email (valid email address)\n",
"- age (between 18-80)\n",
"- city (US cities)\n",
"- purchase_amount (between $10-$1000)\n",
"- join_date (dates in 2023-2024)\"\"\",\n",
"                    lines=10\n",
"                )\n",
"\n",
"                example_input = gr.Textbox(\n",
"                    label=\"Example Format (Optional)\",\n",
"                    placeholder=\"\"\"{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 35, \"city\": \"New York\", \"purchase_amount\": 299.99, \"join_date\": \"2023-05-15\"}\"\"\",\n",
"                    lines=4\n",
"                )\n",
"\n",
"                num_records = gr.Slider(\n",
"                    minimum=1,\n",
"                    maximum=200,\n",
"                    value=10,\n",
"                    step=1,\n",
"                    label=\"Number of Records\"\n",
"                )\n",
"\n",
"                generate_btn = gr.Button(\"🚀 Generate Dataset\", variant=\"primary\")\n",
"\n",
"            with gr.Column(scale=2):\n",
"                status_output = gr.Textbox(label=\"Status\", lines=2)\n",
"                dataframe_output = gr.Dataframe(\n",
"                    label=\"Generated Dataset\",\n",
"                    wrap=True\n",
"                )\n",
"                csv_output = gr.File(label=\"Download CSV\", type=\"filepath\")\n",
"\n",
"        # Examples\n",
"        gr.Markdown(\"### 📝 Example Schemas\")\n",
"        gr.Examples(\n",
"            examples=[\n",
"                [\n",
"                    \"\"\"Generate employee records with:\n",
"- employee_id (format: EMP001, EMP002, etc.)\n",
"- name (full name)\n",
"- department (Engineering, Sales, Marketing, HR, Finance)\n",
"- salary (between $40,000-$150,000)\n",
"- hire_date (between 2020-2024)\n",
"- performance_rating (1-5)\"\"\",\n",
"                    10\n",
"                ],\n",
"                [\n",
"                    \"\"\"Generate e-commerce product data with:\n",
"- product_id (format: PRD-XXXX)\n",
"- product_name (creative product names)\n",
"- category (Electronics, Clothing, Home, Books, Sports)\n",
"- price (between $5-$500)\n",
"- stock_quantity (between 0-1000)\n",
"- rating (1.0-5.0)\n",
"- num_reviews (0-500)\"\"\",\n",
"                    15\n",
"                ],\n",
"                [\n",
"                    \"\"\"Generate student records with:\n",
"- student_id (format: STU2024XXX)\n",
"- name (full name)\n",
"- major (Computer Science, Biology, Business, Arts, Engineering)\n",
"- gpa (2.0-4.0)\n",
"- year (Freshman, Sophomore, Junior, Senior)\n",
"- credits_completed (0-120)\"\"\",\n",
"                    20\n",
"                ]\n",
"            ],\n",
"            inputs=[schema_input, num_records]\n",
"        )\n",
"\n",
"        def generate_wrapper(api_key, schema, num_rec, example):\n",
"            # Use environment API key if available, otherwise use input\n",
"            final_api_key = ANTHROPIC_API_KEY or api_key\n",
"\n",
"            if not final_api_key:\n",
"                return None, \"❌ Please provide your Anthropic API key (either in .env file or input field)\", None\n",
"            if not schema:\n",
"                return None, \"❌ Please describe your data schema\", None\n",
"\n",
"            # For larger datasets, use batch generation\n",
"            if num_rec > 50:\n",
"                return generate_batch_data(final_api_key, schema, num_rec, example)\n",
"            else:\n",
"                return generate_synthetic_data(final_api_key, schema, num_rec, example)\n",
"\n",
"        generate_btn.click(\n",
"            fn=generate_wrapper,\n",
"            inputs=[api_key_input, schema_input, num_records, example_input],\n",
"            outputs=[dataframe_output, status_output, csv_output]\n",
"        )\n",
"\n",
"        gr.Markdown(\"\"\"\n",
"        ---\n",
"        ### 💡 Tips:\n",
"        - Be specific about data types, ranges, and formats\n",
"        - Provide examples for better results\n",
"        - For large datasets (>50 records), generation happens in batches\n",
"        - Claude 3 Haiku is fast and cost-effective for this task\n",
"\n",
"        ### 🔑 API Key Setup:\n",
"        Create a `.env` file in the same directory with:\n",
"        ```\n",
"        ANTHROPIC_API_KEY=your_api_key_here\n",
"        ```\n",
"\n",
"        ### ⚠️ Troubleshooting:\n",
"        If you see a \"proxies\" error, update httpx:\n",
"        ```\n",
"        pip install --upgrade httpx\n",
"        ```\n",
"        \"\"\")\n",
"\n",
"    return demo\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "cef71337-b446-46b2-b84b-d23b7dd4f13e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7867\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7867/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"demo = create_interface()\n",
"demo.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec34fee8-eeb1-4015-95fe-62276927d25a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}