Merge branch 'main' of github.com:ed-donner/llm_engineering

This commit is contained in:
Edward Donner
2025-10-12 10:46:51 -04:00
6 changed files with 6240 additions and 0 deletions


@@ -0,0 +1,251 @@
# 🤖 Synthetic Dataset Generator
## AI-Powered Synthetic Data Generation with Claude 3 Haiku
## 📥 Installation
### 1️⃣ Clone the Repository
```bash
git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
```
### 2️⃣ Create Virtual Environment (Recommended)
```bash
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```
### 3️⃣ Install Dependencies
```bash
pip install -r requirements.txt
```
**Requirements file (`requirements.txt`):**
```txt
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2
```
### 4️⃣ Set Up API Key
Create a `.env` file in the project root:
```bash
# .env
ANTHROPIC_API_KEY=your_api_key_here
```
> **Note**: Never commit your `.env` file to version control. Add it to `.gitignore`.
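For reference, `load_dotenv()` from `python-dotenv` essentially parses `KEY=value` lines from `.env` into the process environment. A stdlib-only sketch of the same idea (the helper name is illustrative; the real library handles quoting and other edge cases this skips):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env parser; python-dotenv's load_dotenv() covers edge cases this skips."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env_file()
api_key = os.environ.get("ANTHROPIC_API_KEY")
```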
---
## 🚀 Usage
### Running the Application
The app is a Jupyter notebook, so open it and run all cells:
```bash
jupyter notebook app.ipynb
```
The Gradio interface will launch at `http://localhost:7860` (or the next free port).
### Basic Workflow
1. **Enter API Key** (if not in `.env`)
2. **Describe Your Schema** in plain English
3. **Set Number of Records** (1-200)
4. **Add Example Format** (optional, but recommended)
5. **Click Generate** 🎉
6. **Download CSV** when ready
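Under the hood, steps 2–4 are folded into a single prompt sent to Claude. A simplified sketch of how the app assembles it (mirroring the notebook's `generate_synthetic_data`):

```python
def build_prompt(schema_description: str, num_records: int, example_format: str = "") -> str:
    """Assemble the generation prompt; the model is asked for a bare JSON array."""
    example_section = f"\n\nExample format:\n{example_format}" if example_format else ""
    return (
        f"Generate {num_records} synthetic data records based on the following schema:\n\n"
        f"{schema_description}{example_section}\n\n"
        "Requirements:\n"
        "1. Return ONLY a valid JSON array of objects\n"
        "2. Make the data realistic and diverse\n"
        f"Generate exactly {num_records} records."
    )
```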
---
## 📝 Example Schemas
### 👥 Customer Data
```
Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)
```
### 👨‍💼 Employee Records
```
Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)
```
### 🛒 E-commerce Products
```
Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)
```
---
## 🎯 Advanced Usage
### Batch Generation
For datasets larger than 50 records, the tool automatically:
- Splits generation into batches of 50
- Combines results into a single dataset
- Prevents API timeout issues
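The split uses ceiling division, as in the notebook's `generate_batch_data`. A small sketch of just the planning step:

```python
def plan_batches(total_records: int, batch_size: int = 50) -> list:
    """Return per-batch record counts, e.g. 120 records -> [50, 50, 20]."""
    num_batches = (total_records + batch_size - 1) // batch_size  # ceiling division
    sizes = []
    remaining = total_records
    for _ in range(num_batches):
        take = min(batch_size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes
```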
### Custom Formats
Provide example JSON to guide the output format:
```json
{
"id": "USR-001",
"name": "Jane Smith",
"email": "jane.smith@example.com",
"created_at": "2024-01-15T10:30:00Z"
}
```
---
## 🔧 Troubleshooting
### ❌ Error: `proxies` keyword argument
**Solution**: Downgrade httpx to a compatible version
```bash
pip install "httpx==0.27.2"
```
Then restart your Python kernel/terminal.
### ❌ API Key Not Found
**Solutions**:
1. Check `.env` file exists in project root
2. Verify `ANTHROPIC_API_KEY` is spelled correctly
3. Ensure no extra spaces in the `.env` file
4. Restart the application after creating `.env`
### ❌ JSON Parsing Error
**Solutions**:
1. Make your schema description more specific
2. Add an example format
3. Reduce the number of records per batch
4. Check your API key has sufficient credits
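Most parse failures happen because the model wraps its JSON in a markdown code fence. The generator strips any fence before calling `json.loads` — a sketch of that cleanup (the snippet spells the fence marker as `FENCE` so it can live inside this fenced block):

```python
import json

FENCE = "`" * 3  # the three-backtick markdown fence marker

def extract_json(response_text: str):
    """Strip an optional markdown code fence, then parse the JSON payload."""
    if FENCE + "json" in response_text:
        body = response_text.split(FENCE + "json")[1].split(FENCE)[0]
    elif FENCE in response_text:
        body = response_text.split(FENCE)[1].split(FENCE)[0]
    else:
        body = response_text
    return json.loads(body.strip())
```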
### ❌ Rate Limit Errors
**Solutions**:
1. Reduce batch size in code (change `batch_size=50` to `batch_size=20`)
2. Add delays between batches
3. Upgrade your Anthropic API plan
---
## 📊 Output Format
### DataFrame Preview
View generated data directly in the browser in a scrollable table.
### CSV Download
- Automatic CSV generation
- Proper encoding (UTF-8)
- No index column
- Ready for Excel, Pandas, or any data tool
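The download file is produced with pandas and a proper temporary path, roughly as the notebook does it:

```python
import os
import tempfile
import pandas as pd

def dataframe_to_csv_file(df: pd.DataFrame) -> str:
    """Write df to a temporary CSV (UTF-8, no index column) and return the path."""
    fd, path = tempfile.mkstemp(suffix=".csv", prefix="synthetic_data_")
    os.close(fd)  # close the raw descriptor; pandas reopens the file by path
    df.to_csv(path, index=False, encoding="utf-8")
    return path
```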
---
## 🧑‍💻 Skill Level
**Beginner Friendly**
- No ML/AI expertise required
- Basic Python knowledge helpful
- Simple natural language interface
- Pre-configured examples included
---
## 💡 Tips for Best Results
1. **Be Specific**: Include data types, ranges, and formats
2. **Use Examples**: Provide sample JSON for complex schemas
3. **Start Small**: Test with 5-10 records before scaling up
4. **Iterate**: Refine your schema based on initial results
5. **Validate**: Check the first few records before using the entire dataset
---
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
---
## 🙏 Acknowledgments
- **Anthropic** for the Claude API
- **Gradio** for the UI framework
- **Pandas** for data manipulation
---
## 📞 Support
- 📧 Email: udayslathia16@gmail.com
---
## 🔗 Related Projects
- [Claude API Documentation](https://docs.anthropic.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [Pandas Documentation](https://pandas.pydata.org/)
---
<div align="center">
**Made with ❤️ using Claude 3 Haiku**
⭐ Star this repo if you find it useful!
</div>


@@ -0,0 +1,502 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1faf5626-864e-4287-af11-535f9a3f59ae",
"metadata": {},
"source": [
"# 🤖 Synthetic Dataset Generator\n",
"## AI-Powered Synthetic Data Generation with Claude 3 Haiku\n",
"Generate custom synthetic datasets by simply describing your data schema. This tool uses Claude 3 Haiku to create realistic, diverse datasets for testing, development, and prototyping.\n",
"\n",
"![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg) ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)\n",
"\n",
"## ✨ Features\n",
"\n",
"- 🎯 Schema-Based Generation - Describe your data structure in plain English\n",
"- 🚀 Fast & Efficient - Powered by Claude 3 Haiku for cost-effective generation\n",
"- 📊 Batch Processing - Automatically handles large datasets (200+ records)\n",
"- 💾 Export Ready - Download as CSV for immediate use\n",
"- 🎨 User-Friendly UI - Built with Gradio for easy interaction\n",
"- 🔒 Secure - API key management via .env files\n",
"- 📝 Built-in Examples - Pre-configured schemas for common use cases\n",
"\n",
"## 🌍 Use Cases\n",
"\n",
"+ 🧪 Testing & Development - Generate test data for applications\n",
"+ 📈 Data Science - Create training datasets for ML models\n",
"+ 🎓 Education - Generate sample datasets for learning\n",
"+ 🏢 Prototyping - Quick data mockups for demos\n",
"+ 🔬 Research - Synthetic data for experiments\n",
"\n",
"## 🧠 Model\n",
"\n",
"- AI Model: Anthropic's claude-3-haiku-20240307\n",
"- Task: Structured data generation based on natural language schemas\n",
"- Output Format: JSON arrays converted to Pandas DataFrames and CSV\n",
"\n",
"## 🛠️ Requirements\n",
"### ⚙️ Hardware\n",
"\n",
"- ✅ CPU is sufficient — No GPU required\n",
"- 💾 Minimal RAM (2GB+)\n",
"\n",
"### 📦 Software\n",
"\n",
"- Python 3.8 or higher\n",
"- Anthropic API Key\n",
"\n",
"### See `README.md` for troubleshooting"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "7ece01a4-0676-4176-86b9-91b0be3a9786",
"metadata": {},
"outputs": [],
"source": [
"import gradio as gr\n",
"import json\n",
"import pandas as pd\n",
"from typing import List, Dict\n",
"import os\n",
"from dotenv import load_dotenv\n",
"import tempfile"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "01665d8a-c483-48c7-92e1-0d92ca4c9731",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load environment variables from .env file\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "3cf53df7-175a-46b0-8508-a8ae34afb65b",
"metadata": {},
"outputs": [],
"source": [
"# Get API key from environment\n",
"ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "53a0686e-26c7-49c0-b048-a113be756c7c",
"metadata": {},
"outputs": [],
"source": [
"# Import anthropic after other imports to avoid conflicts\n",
"try:\n",
" from anthropic import Anthropic, APIError\n",
"except ImportError:\n",
" import anthropic\n",
" Anthropic = anthropic.Anthropic\n",
" APIError = anthropic.APIError\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5f9cb807-ad4c-45b1-bedf-d342a14ebe4a",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Anthropic client\n",
"def create_client(api_key: str):\n",
" \"\"\"Create Anthropic client with proper initialization\"\"\"\n",
" try:\n",
" # Try normal initialization\n",
" return Anthropic(api_key=api_key)\n",
" except TypeError as e:\n",
" if 'proxies' in str(e):\n",
" # Workaround for httpx version mismatch\n",
" import httpx\n",
" # Create a basic httpx client without proxies\n",
" http_client = httpx.Client()\n",
" return Anthropic(api_key=api_key, http_client=http_client)\n",
" else:\n",
" raise e\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dea61271-a138-4f9b-979e-77a998a6950c",
"metadata": {},
"outputs": [],
"source": [
"def generate_synthetic_data(\n",
" api_key: str,\n",
" schema_description: str,\n",
" num_records: int,\n",
" example_format: str = \"\"\n",
") -> tuple:\n",
" \"\"\"\n",
" Generate synthetic dataset using Claude 3 Haiku\n",
" \n",
" Args:\n",
" api_key: Anthropic API key\n",
" schema_description: Description of the data schema\n",
" num_records: Number of records to generate\n",
" example_format: Optional example of desired format\n",
" \n",
" Returns:\n",
" tuple: (DataFrame, status message, csv_file_path)\n",
" \"\"\"\n",
" try:\n",
" # Create client\n",
" client = create_client(api_key)\n",
" \n",
" # Construct the prompt\n",
" example_section = f\"\\n\\nExample format:\\n{example_format}\" if example_format else \"\"\n",
" \n",
" prompt = f\"\"\"Generate {num_records} synthetic data records based on the following schema:\n",
"\n",
"{schema_description}{example_section}\n",
"\n",
"Requirements:\n",
"1. Return ONLY a valid JSON array of objects\n",
"2. Each object should be one record matching the schema\n",
"3. Make the data realistic and diverse\n",
"4. Ensure data types are appropriate (strings, numbers, booleans, dates, etc.)\n",
"5. Do not include any explanation, only the JSON array\n",
"\n",
"Generate exactly {num_records} records.\"\"\"\n",
"\n",
" # Call Claude API with explicit parameters\n",
" message = client.messages.create(\n",
" model=\"claude-3-haiku-20240307\",\n",
" max_tokens=4096,\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": prompt}\n",
" ]\n",
" )\n",
" \n",
" # Extract the response\n",
" response_text = message.content[0].text\n",
" \n",
" # Try to parse JSON from the response\n",
" # Sometimes Claude might wrap it in markdown code blocks\n",
" if \"```json\" in response_text:\n",
" json_str = response_text.split(\"```json\")[1].split(\"```\")[0].strip()\n",
" elif \"```\" in response_text:\n",
" json_str = response_text.split(\"```\")[1].split(\"```\")[0].strip()\n",
" else:\n",
" json_str = response_text.strip()\n",
" \n",
" # Parse JSON\n",
" data = json.loads(json_str)\n",
" \n",
" # Convert to DataFrame\n",
" df = pd.DataFrame(data)\n",
" \n",
" # Save to temporary CSV file with proper path\n",
" fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_')\n",
" os.close(fd) # Close the file descriptor\n",
" \n",
" # Write CSV to the temp file\n",
" df.to_csv(temp_path, index=False)\n",
" \n",
" status = f\"✅ Successfully generated {len(df)} records!\"\n",
" return df, status, temp_path\n",
" \n",
" except json.JSONDecodeError as e:\n",
" return None, f\"❌ Error parsing JSON: {str(e)}\\n\\nResponse received:\\n{response_text[:500] if 'response_text' in locals() else 'N/A'}...\", None\n",
" except APIError as e:\n",
" return None, f\"❌ API Error: {str(e)}\", None\n",
" except Exception as e:\n",
" return None, f\"❌ Error: {type(e).__name__}: {str(e)}\", None"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "aa95c2aa-ac99-4919-94bd-981cb7bd42b7",
"metadata": {},
"outputs": [],
"source": [
"def generate_batch_data(\n",
" api_key: str,\n",
" schema_description: str,\n",
" total_records: int,\n",
" example_format: str = \"\",\n",
" batch_size: int = 50\n",
") -> tuple:\n",
" \"\"\"\n",
" Generate larger datasets in batches\n",
" \"\"\"\n",
" all_data = []\n",
" batches = (total_records + batch_size - 1) // batch_size\n",
" \n",
" for i in range(batches):\n",
" records_in_batch = min(batch_size, total_records - len(all_data))\n",
" df_batch, status, csv_path = generate_synthetic_data(\n",
" api_key, schema_description, records_in_batch, example_format\n",
" )\n",
" \n",
" if df_batch is not None:\n",
" all_data.extend(df_batch.to_dict('records'))\n",
" else:\n",
" return None, f\"❌ Error in batch {i+1}: {status}\", None\n",
" \n",
" final_df = pd.DataFrame(all_data)\n",
" \n",
" # Save final CSV with proper temp file handling\n",
" fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_batch_')\n",
" os.close(fd)\n",
" \n",
" final_df.to_csv(temp_path, index=False)\n",
" \n",
" status = f\"✅ Successfully generated {len(final_df)} records in {batches} batches!\"\n",
" return final_df, status, temp_path\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "b73aff00-c0c0-43d4-96a9-43b0cd84de2b",
"metadata": {},
"outputs": [],
"source": [
"# Create Gradio Interface\n",
"def create_interface():\n",
" with gr.Blocks(title=\"Synthetic Dataset Generator\", theme=gr.themes.Soft()) as demo:\n",
" gr.Markdown(\"\"\"\n",
" # 🤖 Synthetic Dataset Generator\n",
" ### Powered by Claude 3 Haiku\n",
" \n",
" Create custom synthetic datasets by describing your schema. Claude will generate realistic data matching your specifications.\n",
" \"\"\")\n",
" \n",
" with gr.Row():\n",
" with gr.Column(scale=1):\n",
" # Show API key input only if not found in environment\n",
" if not ANTHROPIC_API_KEY:\n",
" api_key_input = gr.Textbox(\n",
" label=\"Anthropic API Key\",\n",
" type=\"password\",\n",
" placeholder=\"sk-ant-...\",\n",
" info=\"API key not found in .env file\"\n",
" )\n",
" else:\n",
" api_key_input = gr.Textbox(\n",
" label=\"Anthropic API Key\",\n",
" type=\"password\",\n",
" value=ANTHROPIC_API_KEY,\n",
" placeholder=\"Loaded from .env\",\n",
" info=\"✅ API key loaded from environment\",\n",
" interactive=False\n",
" )\n",
" \n",
" schema_input = gr.Textbox(\n",
" label=\"Data Schema Description\",\n",
" placeholder=\"\"\"Example: Generate customer data with:\n",
"- name (full name)\n",
"- email (valid email address)\n",
"- age (between 18-80)\n",
"- city (US cities)\n",
"- purchase_amount (between $10-$1000)\n",
"- join_date (dates in 2023-2024)\"\"\",\n",
" lines=10\n",
" )\n",
" \n",
" example_input = gr.Textbox(\n",
" label=\"Example Format (Optional)\",\n",
" placeholder=\"\"\"{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 35, \"city\": \"New York\", \"purchase_amount\": 299.99, \"join_date\": \"2023-05-15\"}\"\"\",\n",
" lines=4\n",
" )\n",
" \n",
" num_records = gr.Slider(\n",
" minimum=1,\n",
" maximum=200,\n",
" value=10,\n",
" step=1,\n",
" label=\"Number of Records\"\n",
" )\n",
" \n",
" generate_btn = gr.Button(\"🚀 Generate Dataset\", variant=\"primary\")\n",
" \n",
" with gr.Column(scale=2):\n",
" status_output = gr.Textbox(label=\"Status\", lines=2)\n",
" dataframe_output = gr.Dataframe(\n",
" label=\"Generated Dataset\",\n",
" wrap=True\n",
" )\n",
" csv_output = gr.File(label=\"Download CSV\", type=\"filepath\")\n",
" \n",
" # Examples\n",
" gr.Markdown(\"### 📝 Example Schemas\")\n",
" gr.Examples(\n",
" examples=[\n",
" [\n",
" \"\"\"Generate employee records with:\n",
"- employee_id (format: EMP001, EMP002, etc.)\n",
"- name (full name)\n",
"- department (Engineering, Sales, Marketing, HR, Finance)\n",
"- salary (between $40,000-$150,000)\n",
"- hire_date (between 2020-2024)\n",
"- performance_rating (1-5)\"\"\",\n",
" 10\n",
" ],\n",
" [\n",
" \"\"\"Generate e-commerce product data with:\n",
"- product_id (format: PRD-XXXX)\n",
"- product_name (creative product names)\n",
"- category (Electronics, Clothing, Home, Books, Sports)\n",
"- price (between $5-$500)\n",
"- stock_quantity (between 0-1000)\n",
"- rating (1.0-5.0)\n",
"- num_reviews (0-500)\"\"\",\n",
" 15\n",
" ],\n",
" [\n",
" \"\"\"Generate student records with:\n",
"- student_id (format: STU2024XXX)\n",
"- name (full name)\n",
"- major (Computer Science, Biology, Business, Arts, Engineering)\n",
"- gpa (2.0-4.0)\n",
"- year (Freshman, Sophomore, Junior, Senior)\n",
"- credits_completed (0-120)\"\"\",\n",
" 20\n",
" ]\n",
" ],\n",
" inputs=[schema_input, num_records]\n",
" )\n",
" \n",
" def generate_wrapper(api_key, schema, num_rec, example):\n",
" # Use environment API key if available, otherwise use input\n",
" final_api_key = ANTHROPIC_API_KEY or api_key\n",
" \n",
" if not final_api_key:\n",
" return None, \"❌ Please provide your Anthropic API key (either in .env file or input field)\", None\n",
" if not schema:\n",
" return None, \"❌ Please describe your data schema\", None\n",
" \n",
" # For larger datasets, use batch generation\n",
" if num_rec > 50:\n",
" return generate_batch_data(final_api_key, schema, num_rec, example)\n",
" else:\n",
" return generate_synthetic_data(final_api_key, schema, num_rec, example)\n",
" \n",
" generate_btn.click(\n",
" fn=generate_wrapper,\n",
" inputs=[api_key_input, schema_input, num_records, example_input],\n",
" outputs=[dataframe_output, status_output, csv_output]\n",
" )\n",
" \n",
" gr.Markdown(\"\"\"\n",
" ---\n",
" ### 💡 Tips:\n",
" - Be specific about data types, ranges, and formats\n",
" - Provide examples for better results\n",
" - For large datasets (>50 records), generation happens in batches\n",
" - Claude 3 Haiku is fast and cost-effective for this task\n",
" \n",
" ### 🔑 API Key Setup:\n",
" Create a `.env` file in the same directory with:\n",
" ```\n",
" ANTHROPIC_API_KEY=your_api_key_here\n",
" ```\n",
" \n",
" ### ⚠️ Troubleshooting:\n",
"    If you see a \"proxies\" error, pin httpx to the compatible version:\n",
"    ```\n",
"    pip install \"httpx==0.27.2\"\n",
" ```\n",
" \"\"\")\n",
" \n",
" return demo\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "cef71337-b446-46b2-b84b-d23b7dd4f13e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7867\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7867/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"demo = create_interface()\n",
"demo.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec34fee8-eeb1-4015-95fe-62276927d25a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,5 @@
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


@@ -0,0 +1,202 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6afa6324",
"metadata": {},
"source": [
"Website Summarizer using LangChain's RecursiveUrlLoader and OpenAI GPT-4o."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd0aa282",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community beautifulsoup4 lxml"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ff0ba859",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import glob\n",
"from dotenv import load_dotenv\n",
"import gradio as gr\n",
"\n",
"# imports for langchain\n",
"\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.schema import Document\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain_chroma import Chroma\n",
"\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"from langchain_community.document_loaders import RecursiveUrlLoader\n",
"import re\n",
"\n",
"from bs4 import BeautifulSoup\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e2be45ee",
"metadata": {},
"outputs": [],
"source": [
"MODEL = \"gpt-4o\"\n",
"db_name = \"vector_db\"\n",
"\n",
"\n",
"load_dotenv(override=True)\n",
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2cd21d56",
"metadata": {},
"outputs": [],
"source": [
"def bs4_extractor(html: str) -> str:\n",
" soup = BeautifulSoup(html, \"lxml\")\n",
" return re.sub(r\"\\n\\n+\", \"\\n\\n\", soup.text).strip()\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c07925ce",
"metadata": {},
"outputs": [],
"source": [
"def prepareLLM(website_url):\n",
" loader = RecursiveUrlLoader(website_url, extractor=bs4_extractor)\n",
" docs = loader.load()\n",
" print(f\"Loaded {len(docs)} documents\")\n",
" text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
" chunks = text_splitter.split_documents(docs)\n",
" print(f\"Loaded {len(chunks)} chunks\")\n",
"\n",
" embeddings = OpenAIEmbeddings()\n",
"\n",
" # Delete if already exists\n",
"\n",
" if os.path.exists(db_name):\n",
" Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n",
"\n",
" # Create vectorstore\n",
"\n",
" vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)\n",
" print(f\"Vectorstore created with {vectorstore._collection.count()} documents\")\n",
"\n",
" # create a new Chat with OpenAI\n",
" llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
"\n",
" # set up the conversation memory for the chat\n",
" memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
"\n",
" # the retriever is an abstraction over the VectorStore that will be used during RAG\n",
" retriever = vectorstore.as_retriever()\n",
"\n",
"    # putting it together: set up the conversation chain with the GPT-4o LLM, the vector store and memory\n",
" conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)\n",
"\n",
" return conversation_chain"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8cc26a70",
"metadata": {},
"outputs": [],
"source": [
"website_global = None\n",
"conversational_chain_global = None"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "809e7afa",
"metadata": {},
"outputs": [],
"source": [
"def chat(website,question):\n",
" global website_global\n",
" global conversational_chain_global\n",
" if website_global != website:\n",
" conversation_chain = prepareLLM(website)\n",
" website_global = website\n",
" conversational_chain_global = conversation_chain\n",
" result = conversational_chain_global.invoke({\"question\":question})\n",
" return result['answer']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e1e9c0e9",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks() as ui:\n",
" website = gr.Textbox(label=\"Website URL (Only required for the first submit)\")\n",
" question = gr.Textbox(label=\"Your Question\")\n",
" submit = gr.Button(\"Submit\")\n",
" answer = gr.Textbox(label=\"Response\")\n",
" submit.click(fn=chat, inputs=[website,question], outputs=[answer])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80ef8c02",
"metadata": {},
"outputs": [],
"source": [
"ui.launch()"
]
},
{
"cell_type": "markdown",
"id": "fef26a4b",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}