Merge branch 'main' of github.com:ed-donner/llm_engineering

This commit is contained in:
Edward Donner
2025-10-12 10:46:51 -04:00
6 changed files with 6240 additions and 0 deletions


@@ -0,0 +1,251 @@
# 🤖 Synthetic Dataset Generator
## AI-Powered Synthetic Data Generation with Claude 3 Haiku
## 📥 Installation
### 1️⃣ Clone the Repository
```bash
git clone https://github.com/yourusername/synthetic-dataset-generator.git
cd synthetic-dataset-generator
```
### 2️⃣ Create Virtual Environment (Recommended)
```bash
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```
### 3️⃣ Install Dependencies
```bash
pip install -r requirements.txt
```
**Requirements file (`requirements.txt`):**
```txt
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2
```
### 4️⃣ Set Up API Key
Create a `.env` file in the project root:
```bash
# .env
ANTHROPIC_API_KEY=your_api_key_here
```
> **Note**: Never commit your `.env` file to version control. Add it to `.gitignore`.
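For reference, `load_dotenv()` from `python-dotenv` essentially parses `KEY=value` lines from `.env` into the process environment. A stdlib-only sketch of the same idea (the helper name is illustrative; the real library handles quoting and other edge cases this skips):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env parser; python-dotenv's load_dotenv() covers edge cases this skips."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env_file()
api_key = os.environ.get("ANTHROPIC_API_KEY")
```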
---
## 🚀 Usage
### Running the Application
The app is a Jupyter notebook, so open it and run all cells:
```bash
jupyter notebook app.ipynb
```
The Gradio interface will launch at `http://localhost:7860` (or the next free port).
### Basic Workflow
1. **Enter API Key** (if not in `.env`)
2. **Describe Your Schema** in plain English
3. **Set Number of Records** (1-200)
4. **Add Example Format** (optional, but recommended)
5. **Click Generate** 🎉
6. **Download CSV** when ready
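Under the hood, steps 2–4 are folded into a single prompt sent to Claude. A simplified sketch of how the app assembles it (mirroring the notebook's `generate_synthetic_data`):

```python
def build_prompt(schema_description: str, num_records: int, example_format: str = "") -> str:
    """Assemble the generation prompt; the model is asked for a bare JSON array."""
    example_section = f"\n\nExample format:\n{example_format}" if example_format else ""
    return (
        f"Generate {num_records} synthetic data records based on the following schema:\n\n"
        f"{schema_description}{example_section}\n\n"
        "Requirements:\n"
        "1. Return ONLY a valid JSON array of objects\n"
        "2. Make the data realistic and diverse\n"
        f"Generate exactly {num_records} records."
    )
```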
---
## 📝 Example Schemas
### 👥 Customer Data
```
Generate customer data with:
- customer_id (format: CUST-XXXX)
- name (full name)
- email (valid email address)
- age (between 18-80)
- city (US cities)
- purchase_amount (between $10-$1000)
- join_date (dates in 2023-2024)
- subscription_type (Free, Basic, Premium)
```
### 👨‍💼 Employee Records
```
Generate employee records with:
- employee_id (format: EMP001, EMP002, etc.)
- name (full name)
- department (Engineering, Sales, Marketing, HR, Finance)
- salary (between $40,000-$150,000)
- hire_date (between 2020-2024)
- performance_rating (1-5)
- is_remote (true/false)
```
### 🛒 E-commerce Products
```
Generate e-commerce product data with:
- product_id (format: PRD-XXXX)
- product_name (creative product names)
- category (Electronics, Clothing, Home, Books, Sports)
- price (between $5-$500)
- stock_quantity (between 0-1000)
- rating (1.0-5.0)
- num_reviews (0-500)
- in_stock (true/false)
```
---
## 🎯 Advanced Usage
### Batch Generation
For datasets larger than 50 records, the tool automatically:
- Splits generation into batches of 50
- Combines results into a single dataset
- Prevents API timeout issues
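The split uses ceiling division, as in the notebook's `generate_batch_data`. A small sketch of just the planning step:

```python
def plan_batches(total_records: int, batch_size: int = 50) -> list:
    """Return per-batch record counts, e.g. 120 records -> [50, 50, 20]."""
    num_batches = (total_records + batch_size - 1) // batch_size  # ceiling division
    sizes = []
    remaining = total_records
    for _ in range(num_batches):
        take = min(batch_size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes
```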
### Custom Formats
Provide example JSON to guide the output format:
```json
{
"id": "USR-001",
"name": "Jane Smith",
"email": "jane.smith@example.com",
"created_at": "2024-01-15T10:30:00Z"
}
```
---
## 🔧 Troubleshooting
### ❌ Error: `proxies` keyword argument
**Solution**: Downgrade httpx to a compatible version
```bash
pip install "httpx==0.27.2"
```
Then restart your Python kernel/terminal.
### ❌ API Key Not Found
**Solutions**:
1. Check `.env` file exists in project root
2. Verify `ANTHROPIC_API_KEY` is spelled correctly
3. Ensure no extra spaces in the `.env` file
4. Restart the application after creating `.env`
### ❌ JSON Parsing Error
**Solutions**:
1. Make your schema description more specific
2. Add an example format
3. Reduce the number of records per batch
4. Check your API key has sufficient credits
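Most parse failures happen because the model wraps its JSON in a markdown code fence. The generator strips any fence before calling `json.loads` — a sketch of that cleanup (the snippet spells the fence marker as `FENCE` so it can live inside this fenced block):

```python
import json

FENCE = "`" * 3  # the three-backtick markdown fence marker

def extract_json(response_text: str):
    """Strip an optional markdown code fence, then parse the JSON payload."""
    if FENCE + "json" in response_text:
        body = response_text.split(FENCE + "json")[1].split(FENCE)[0]
    elif FENCE in response_text:
        body = response_text.split(FENCE)[1].split(FENCE)[0]
    else:
        body = response_text
    return json.loads(body.strip())
```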
### ❌ Rate Limit Errors
**Solutions**:
1. Reduce batch size in code (change `batch_size=50` to `batch_size=20`)
2. Add delays between batches
3. Upgrade your Anthropic API plan
---
## 📊 Output Format
### DataFrame Preview
View generated data directly in the browser in a scrollable table.
### CSV Download
- Automatic CSV generation
- Proper encoding (UTF-8)
- No index column
- Ready for Excel, Pandas, or any data tool
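The download file is produced with pandas and a proper temporary path, roughly as the notebook does it:

```python
import os
import tempfile
import pandas as pd

def dataframe_to_csv_file(df: pd.DataFrame) -> str:
    """Write df to a temporary CSV (UTF-8, no index column) and return the path."""
    fd, path = tempfile.mkstemp(suffix=".csv", prefix="synthetic_data_")
    os.close(fd)  # close the raw descriptor; pandas reopens the file by path
    df.to_csv(path, index=False, encoding="utf-8")
    return path
```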
---
## 🧑‍💻 Skill Level
**Beginner Friendly**
- No ML/AI expertise required
- Basic Python knowledge helpful
- Simple natural language interface
- Pre-configured examples included
---
## 💡 Tips for Best Results
1. **Be Specific**: Include data types, ranges, and formats
2. **Use Examples**: Provide sample JSON for complex schemas
3. **Start Small**: Test with 5-10 records before scaling up
4. **Iterate**: Refine your schema based on initial results
5. **Validate**: Check the first few records before using the entire dataset
---
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
---
## 🙏 Acknowledgments
- **Anthropic** for the Claude API
- **Gradio** for the UI framework
- **Pandas** for data manipulation
---
## 📞 Support
- 📧 Email: udayslathia16@gmail.com
---
## 🔗 Related Projects
- [Claude API Documentation](https://docs.anthropic.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [Pandas Documentation](https://pandas.pydata.org/)
---
<div align="center">
**Made with ❤️ using Claude 3 Haiku**
⭐ Star this repo if you find it useful!
</div>


@@ -0,0 +1,502 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1faf5626-864e-4287-af11-535f9a3f59ae",
"metadata": {},
"source": [
"# 🤖 Synthetic Dataset Generator\n",
"## AI-Powered Synthetic Data Generation with Claude 3 Haiku\n",
"Generate custom synthetic datasets by simply describing your data schema. This tool uses Claude 3 Haiku to create realistic, diverse datasets for testing, development, and prototyping.\n",
"\n",
"![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg) ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)\n",
"\n",
"## ✨ Features\n",
"\n",
"- 🎯 Schema-Based Generation - Describe your data structure in plain English\n",
"- 🚀 Fast & Efficient - Powered by Claude 3 Haiku for cost-effective generation\n",
"- 📊 Batch Processing - Automatically handles large datasets (200+ records)\n",
"- 💾 Export Ready - Download as CSV for immediate use\n",
"- 🎨 User-Friendly UI - Built with Gradio for easy interaction\n",
"- 🔒 Secure - API key management via .env files\n",
"- 📝 Built-in Examples - Pre-configured schemas for common use cases\n",
"\n",
"## 🌍 Use Cases\n",
"\n",
"+ 🧪 Testing & Development - Generate test data for applications\n",
"+ 📈 Data Science - Create training datasets for ML models\n",
"+ 🎓 Education - Generate sample datasets for learning\n",
"+ 🏢 Prototyping - Quick data mockups for demos\n",
"+ 🔬 Research - Synthetic data for experiments\n",
"\n",
"## 🧠 Model\n",
"\n",
"- AI Model: Anthropic's claude-3-haiku-20240307\n",
"- Task: Structured data generation based on natural language schemas\n",
"- Output Format: JSON arrays converted to Pandas DataFrames and CSV\n",
"\n",
"## 🛠️ Requirements\n",
"### ⚙️ Hardware\n",
"\n",
"- ✅ CPU is sufficient — No GPU required\n",
"- 💾 Minimal RAM (2GB+)\n",
"\n",
"### 📦 Software\n",
"\n",
"- Python 3.8 or higher\n",
"- Anthropic API Key\n",
"\n",
"### See `README.md` for troubleshooting"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "7ece01a4-0676-4176-86b9-91b0be3a9786",
"metadata": {},
"outputs": [],
"source": [
"import gradio as gr\n",
"import json\n",
"import pandas as pd\n",
"from typing import List, Dict\n",
"import os\n",
"from dotenv import load_dotenv\n",
"import tempfile"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "01665d8a-c483-48c7-92e1-0d92ca4c9731",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load environment variables from .env file\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "3cf53df7-175a-46b0-8508-a8ae34afb65b",
"metadata": {},
"outputs": [],
"source": [
"# Get API key from environment\n",
"ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "53a0686e-26c7-49c0-b048-a113be756c7c",
"metadata": {},
"outputs": [],
"source": [
"# Import anthropic after other imports to avoid conflicts\n",
"try:\n",
" from anthropic import Anthropic, APIError\n",
"except ImportError:\n",
" import anthropic\n",
" Anthropic = anthropic.Anthropic\n",
" APIError = anthropic.APIError\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5f9cb807-ad4c-45b1-bedf-d342a14ebe4a",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Anthropic client\n",
"def create_client(api_key: str):\n",
" \"\"\"Create Anthropic client with proper initialization\"\"\"\n",
" try:\n",
" # Try normal initialization\n",
" return Anthropic(api_key=api_key)\n",
" except TypeError as e:\n",
" if 'proxies' in str(e):\n",
" # Workaround for httpx version mismatch\n",
" import httpx\n",
" # Create a basic httpx client without proxies\n",
" http_client = httpx.Client()\n",
" return Anthropic(api_key=api_key, http_client=http_client)\n",
" else:\n",
" raise e\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dea61271-a138-4f9b-979e-77a998a6950c",
"metadata": {},
"outputs": [],
"source": [
"def generate_synthetic_data(\n",
" api_key: str,\n",
" schema_description: str,\n",
" num_records: int,\n",
" example_format: str = \"\"\n",
") -> tuple:\n",
" \"\"\"\n",
" Generate synthetic dataset using Claude 3 Haiku\n",
" \n",
" Args:\n",
" api_key: Anthropic API key\n",
" schema_description: Description of the data schema\n",
" num_records: Number of records to generate\n",
" example_format: Optional example of desired format\n",
" \n",
" Returns:\n",
" tuple: (DataFrame, status message, csv_file_path)\n",
" \"\"\"\n",
" try:\n",
" # Create client\n",
" client = create_client(api_key)\n",
" \n",
" # Construct the prompt\n",
" example_section = f\"\\n\\nExample format:\\n{example_format}\" if example_format else \"\"\n",
" \n",
" prompt = f\"\"\"Generate {num_records} synthetic data records based on the following schema:\n",
"\n",
"{schema_description}{example_section}\n",
"\n",
"Requirements:\n",
"1. Return ONLY a valid JSON array of objects\n",
"2. Each object should be one record matching the schema\n",
"3. Make the data realistic and diverse\n",
"4. Ensure data types are appropriate (strings, numbers, booleans, dates, etc.)\n",
"5. Do not include any explanation, only the JSON array\n",
"\n",
"Generate exactly {num_records} records.\"\"\"\n",
"\n",
" # Call Claude API with explicit parameters\n",
" message = client.messages.create(\n",
" model=\"claude-3-haiku-20240307\",\n",
" max_tokens=4096,\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": prompt}\n",
" ]\n",
" )\n",
" \n",
" # Extract the response\n",
" response_text = message.content[0].text\n",
" \n",
" # Try to parse JSON from the response\n",
" # Sometimes Claude might wrap it in markdown code blocks\n",
" if \"```json\" in response_text:\n",
" json_str = response_text.split(\"```json\")[1].split(\"```\")[0].strip()\n",
" elif \"```\" in response_text:\n",
" json_str = response_text.split(\"```\")[1].split(\"```\")[0].strip()\n",
" else:\n",
" json_str = response_text.strip()\n",
" \n",
" # Parse JSON\n",
" data = json.loads(json_str)\n",
" \n",
" # Convert to DataFrame\n",
" df = pd.DataFrame(data)\n",
" \n",
" # Save to temporary CSV file with proper path\n",
" fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_')\n",
" os.close(fd) # Close the file descriptor\n",
" \n",
" # Write CSV to the temp file\n",
" df.to_csv(temp_path, index=False)\n",
" \n",
" status = f\"✅ Successfully generated {len(df)} records!\"\n",
" return df, status, temp_path\n",
" \n",
" except json.JSONDecodeError as e:\n",
" return None, f\"❌ Error parsing JSON: {str(e)}\\n\\nResponse received:\\n{response_text[:500] if 'response_text' in locals() else 'N/A'}...\", None\n",
" except APIError as e:\n",
" return None, f\"❌ API Error: {str(e)}\", None\n",
" except Exception as e:\n",
" return None, f\"❌ Error: {type(e).__name__}: {str(e)}\", None"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "aa95c2aa-ac99-4919-94bd-981cb7bd42b7",
"metadata": {},
"outputs": [],
"source": [
"def generate_batch_data(\n",
" api_key: str,\n",
" schema_description: str,\n",
" total_records: int,\n",
" example_format: str = \"\",\n",
" batch_size: int = 50\n",
") -> tuple:\n",
" \"\"\"\n",
" Generate larger datasets in batches\n",
" \"\"\"\n",
" all_data = []\n",
" batches = (total_records + batch_size - 1) // batch_size\n",
" \n",
" for i in range(batches):\n",
" records_in_batch = min(batch_size, total_records - len(all_data))\n",
" df_batch, status, csv_path = generate_synthetic_data(\n",
" api_key, schema_description, records_in_batch, example_format\n",
" )\n",
" \n",
" if df_batch is not None:\n",
" all_data.extend(df_batch.to_dict('records'))\n",
" else:\n",
" return None, f\"❌ Error in batch {i+1}: {status}\", None\n",
" \n",
" final_df = pd.DataFrame(all_data)\n",
" \n",
" # Save final CSV with proper temp file handling\n",
" fd, temp_path = tempfile.mkstemp(suffix='.csv', prefix='synthetic_data_batch_')\n",
" os.close(fd)\n",
" \n",
" final_df.to_csv(temp_path, index=False)\n",
" \n",
" status = f\"✅ Successfully generated {len(final_df)} records in {batches} batches!\"\n",
" return final_df, status, temp_path\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "b73aff00-c0c0-43d4-96a9-43b0cd84de2b",
"metadata": {},
"outputs": [],
"source": [
"# Create Gradio Interface\n",
"def create_interface():\n",
" with gr.Blocks(title=\"Synthetic Dataset Generator\", theme=gr.themes.Soft()) as demo:\n",
" gr.Markdown(\"\"\"\n",
" # 🤖 Synthetic Dataset Generator\n",
" ### Powered by Claude 3 Haiku\n",
" \n",
" Create custom synthetic datasets by describing your schema. Claude will generate realistic data matching your specifications.\n",
" \"\"\")\n",
" \n",
" with gr.Row():\n",
" with gr.Column(scale=1):\n",
" # Show API key input only if not found in environment\n",
" if not ANTHROPIC_API_KEY:\n",
" api_key_input = gr.Textbox(\n",
" label=\"Anthropic API Key\",\n",
" type=\"password\",\n",
" placeholder=\"sk-ant-...\",\n",
" info=\"API key not found in .env file\"\n",
" )\n",
" else:\n",
" api_key_input = gr.Textbox(\n",
" label=\"Anthropic API Key\",\n",
" type=\"password\",\n",
" value=ANTHROPIC_API_KEY,\n",
" placeholder=\"Loaded from .env\",\n",
" info=\"✅ API key loaded from environment\",\n",
" interactive=False\n",
" )\n",
" \n",
" schema_input = gr.Textbox(\n",
" label=\"Data Schema Description\",\n",
" placeholder=\"\"\"Example: Generate customer data with:\n",
"- name (full name)\n",
"- email (valid email address)\n",
"- age (between 18-80)\n",
"- city (US cities)\n",
"- purchase_amount (between $10-$1000)\n",
"- join_date (dates in 2023-2024)\"\"\",\n",
" lines=10\n",
" )\n",
" \n",
" example_input = gr.Textbox(\n",
" label=\"Example Format (Optional)\",\n",
" placeholder=\"\"\"{\"name\": \"John Doe\", \"email\": \"john@example.com\", \"age\": 35, \"city\": \"New York\", \"purchase_amount\": 299.99, \"join_date\": \"2023-05-15\"}\"\"\",\n",
" lines=4\n",
" )\n",
" \n",
" num_records = gr.Slider(\n",
" minimum=1,\n",
" maximum=200,\n",
" value=10,\n",
" step=1,\n",
" label=\"Number of Records\"\n",
" )\n",
" \n",
" generate_btn = gr.Button(\"🚀 Generate Dataset\", variant=\"primary\")\n",
" \n",
" with gr.Column(scale=2):\n",
" status_output = gr.Textbox(label=\"Status\", lines=2)\n",
" dataframe_output = gr.Dataframe(\n",
" label=\"Generated Dataset\",\n",
" wrap=True\n",
" )\n",
" csv_output = gr.File(label=\"Download CSV\", type=\"filepath\")\n",
" \n",
" # Examples\n",
" gr.Markdown(\"### 📝 Example Schemas\")\n",
" gr.Examples(\n",
" examples=[\n",
" [\n",
" \"\"\"Generate employee records with:\n",
"- employee_id (format: EMP001, EMP002, etc.)\n",
"- name (full name)\n",
"- department (Engineering, Sales, Marketing, HR, Finance)\n",
"- salary (between $40,000-$150,000)\n",
"- hire_date (between 2020-2024)\n",
"- performance_rating (1-5)\"\"\",\n",
" 10\n",
" ],\n",
" [\n",
" \"\"\"Generate e-commerce product data with:\n",
"- product_id (format: PRD-XXXX)\n",
"- product_name (creative product names)\n",
"- category (Electronics, Clothing, Home, Books, Sports)\n",
"- price (between $5-$500)\n",
"- stock_quantity (between 0-1000)\n",
"- rating (1.0-5.0)\n",
"- num_reviews (0-500)\"\"\",\n",
" 15\n",
" ],\n",
" [\n",
" \"\"\"Generate student records with:\n",
"- student_id (format: STU2024XXX)\n",
"- name (full name)\n",
"- major (Computer Science, Biology, Business, Arts, Engineering)\n",
"- gpa (2.0-4.0)\n",
"- year (Freshman, Sophomore, Junior, Senior)\n",
"- credits_completed (0-120)\"\"\",\n",
" 20\n",
" ]\n",
" ],\n",
" inputs=[schema_input, num_records]\n",
" )\n",
" \n",
" def generate_wrapper(api_key, schema, num_rec, example):\n",
" # Use environment API key if available, otherwise use input\n",
" final_api_key = ANTHROPIC_API_KEY or api_key\n",
" \n",
" if not final_api_key:\n",
" return None, \"❌ Please provide your Anthropic API key (either in .env file or input field)\", None\n",
" if not schema:\n",
" return None, \"❌ Please describe your data schema\", None\n",
" \n",
" # For larger datasets, use batch generation\n",
" if num_rec > 50:\n",
" return generate_batch_data(final_api_key, schema, num_rec, example)\n",
" else:\n",
" return generate_synthetic_data(final_api_key, schema, num_rec, example)\n",
" \n",
" generate_btn.click(\n",
" fn=generate_wrapper,\n",
" inputs=[api_key_input, schema_input, num_records, example_input],\n",
" outputs=[dataframe_output, status_output, csv_output]\n",
" )\n",
" \n",
" gr.Markdown(\"\"\"\n",
" ---\n",
" ### 💡 Tips:\n",
" - Be specific about data types, ranges, and formats\n",
" - Provide examples for better results\n",
" - For large datasets (>50 records), generation happens in batches\n",
" - Claude 3 Haiku is fast and cost-effective for this task\n",
" \n",
" ### 🔑 API Key Setup:\n",
" Create a `.env` file in the same directory with:\n",
" ```\n",
" ANTHROPIC_API_KEY=your_api_key_here\n",
" ```\n",
" \n",
" ### ⚠️ Troubleshooting:\n",
"    If you see a \"proxies\" error, pin httpx to the compatible version:\n",
"    ```\n",
"    pip install \"httpx==0.27.2\"\n",
" ```\n",
" \"\"\")\n",
" \n",
" return demo\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "cef71337-b446-46b2-b84b-d23b7dd4f13e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7867\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7867/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"demo = create_interface()\n",
"demo.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec34fee8-eeb1-4015-95fe-62276927d25a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,5 @@
gradio>=4.0.0
anthropic>=0.25.0
pandas>=1.5.0
python-dotenv>=1.0.0
httpx==0.27.2

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


@@ -0,0 +1,202 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6afa6324",
"metadata": {},
"source": [
"Website Summarizer using LangChain's RecursiveUrlLoader and OpenAI GPT-4o."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd0aa282",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community beautifulsoup4 lxml"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ff0ba859",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import glob\n",
"from dotenv import load_dotenv\n",
"import gradio as gr\n",
"\n",
"# imports for langchain\n",
"\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.schema import Document\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain_chroma import Chroma\n",
"\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"from langchain_community.document_loaders import RecursiveUrlLoader\n",
"import re\n",
"\n",
"from bs4 import BeautifulSoup\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e2be45ee",
"metadata": {},
"outputs": [],
"source": [
"MODEL = \"gpt-4o\"\n",
"db_name = \"vector_db\"\n",
"\n",
"\n",
"load_dotenv(override=True)\n",
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2cd21d56",
"metadata": {},
"outputs": [],
"source": [
"def bs4_extractor(html: str) -> str:\n",
" soup = BeautifulSoup(html, \"lxml\")\n",
" return re.sub(r\"\\n\\n+\", \"\\n\\n\", soup.text).strip()\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c07925ce",
"metadata": {},
"outputs": [],
"source": [
"def prepareLLM(website_url):\n",
" loader = RecursiveUrlLoader(website_url, extractor=bs4_extractor)\n",
" docs = loader.load()\n",
" print(f\"Loaded {len(docs)} documents\")\n",
" text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
" chunks = text_splitter.split_documents(docs)\n",
" print(f\"Loaded {len(chunks)} chunks\")\n",
"\n",
" embeddings = OpenAIEmbeddings()\n",
"\n",
" # Delete if already exists\n",
"\n",
" if os.path.exists(db_name):\n",
" Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n",
"\n",
" # Create vectorstore\n",
"\n",
" vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)\n",
" print(f\"Vectorstore created with {vectorstore._collection.count()} documents\")\n",
"\n",
" # create a new Chat with OpenAI\n",
" llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
"\n",
" # set up the conversation memory for the chat\n",
" memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
"\n",
" # the retriever is an abstraction over the VectorStore that will be used during RAG\n",
" retriever = vectorstore.as_retriever()\n",
"\n",
"    # putting it together: set up the conversation chain with the GPT-4o LLM, the vector store and memory\n",
" conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)\n",
"\n",
" return conversation_chain"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8cc26a70",
"metadata": {},
"outputs": [],
"source": [
"website_global = None\n",
"conversational_chain_global = None"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "809e7afa",
"metadata": {},
"outputs": [],
"source": [
"def chat(website,question):\n",
" global website_global\n",
" global conversational_chain_global\n",
" if website_global != website:\n",
" conversation_chain = prepareLLM(website)\n",
" website_global = website\n",
" conversational_chain_global = conversation_chain\n",
" result = conversational_chain_global.invoke({\"question\":question})\n",
" return result['answer']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e1e9c0e9",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks() as ui:\n",
" website = gr.Textbox(label=\"Website URL (Only required for the first submit)\")\n",
" question = gr.Textbox(label=\"Your Question\")\n",
" submit = gr.Button(\"Submit\")\n",
" answer = gr.Textbox(label=\"Response\")\n",
" submit.click(fn=chat, inputs=[website,question], outputs=[answer])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80ef8c02",
"metadata": {},
"outputs": [],
"source": [
"ui.launch()"
]
},
{
"cell_type": "markdown",
"id": "fef26a4b",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}