Merge branch 'main' of https://github.com/ed-donner/llm_engineering into solisoma-week5

This commit is contained in:
unknown
2025-10-24 09:00:56 +01:00
43 changed files with 16559 additions and 1284 deletions

File diff suppressed because it is too large


@@ -1,50 +0,0 @@
# **Automated Bitcoin Daily Summary Generator**
This project automates the process of generating a daily summary of the Bitcoin network's status. It fetches real-time data from multiple public API endpoints, processes it, and then uses a Large Language Model (LLM) to generate a clear, structured, and human-readable report in Markdown format.
## **Project Overview**
The core goal of this project is to provide a snapshot of key Bitcoin metrics without manual analysis. By leveraging the Braiins Public API for data and OpenAI's GPT models for summarization, it can produce insightful daily reports covering market trends, network health, miner revenue, and future outlooks like the next halving event.
### **Key Features**
- **Automated Data Fetching**: Pulls data from 7 different Braiins API endpoints covering price, hashrate, difficulty, transaction fees, and more.
- **Data Cleaning**: Pre-processes the raw JSON data to make it clean and suitable for the LLM.
- **Intelligent Summarization**: Uses an advanced LLM to analyze the data and generate a structured report with explanations for technical terms.
- **Dynamic Dating**: The report is always dated for the day it is run, providing a timely summary regardless of the timestamps in the source data.
- **Markdown Output**: Generates a clean, well-formatted Markdown file that is easy to read or integrate into other systems.
## **How It Works**
The project is split into two main files:
1. **utils.py**: A utility script responsible for all data fetching and cleaning operations.
- It defines the Braiins API endpoints to be queried.
- It contains functions to handle HTTP requests, parse JSON responses, and clean up keys and values to ensure consistency.
2. **day_1_bitcoin_daily_brief.ipynb**: A Jupyter Notebook that acts as the main orchestrator.
- It imports the necessary functions from utils.py.
- It calls fetch_clean_data() to get the latest Bitcoin network data.
- It constructs a detailed system and user prompt for the LLM, explicitly instructing it on the desired format and, crucially, to use the current date for the summary.
- It sends the data and prompt to the OpenAI API.
- It receives the generated summary and displays it as formatted Markdown.
## **Setup and Usage**
To run this project, you will need to have Python and the required libraries installed.
### **1. Prerequisites**
- Python 3.x
- Jupyter Notebook or JupyterLab
### **2. Installation**
- Install the necessary Python libraries: `pip install requests openai python-dotenv jupyter`
### **3. Configuration**
You need an API key from OpenAI to use the summarization feature.
1. Create a file named `.env` in the root directory of the project.
2. Add your OpenAI API key to the `.env` file as follows:
   `OPENAI_API_KEY='your_openai_api_key_here'`
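
Once the `.env` file is in place, the notebook picks the key up automatically. A minimal sketch of the configuration step, mirroring the notebook's setup cells:

```python
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv(override=True)              # reads OPENAI_API_KEY from .env
api_key = os.getenv('OPENAI_API_KEY')   # handy for a quick sanity check

client = OpenAI()                       # the client also reads the key from the environment
```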


@@ -1,156 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "abaef96b",
"metadata": {},
"source": [
"## Importing The Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f90c541b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import datetime\n",
"from utils import fetch_clean_data\n",
"from openai import OpenAI\n",
"from IPython.display import Markdown, display\n",
"from dotenv import load_dotenv\n",
"import json"
]
},
{
"cell_type": "markdown",
"id": "6e6c864b",
"metadata": {},
"source": [
"## Configuration"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "be62299d",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"client = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3aa8e3e2",
"metadata": {},
"outputs": [],
"source": [
"def generate_markdown_summary(data: dict, today_date_str: str) -> str:\n",
" \"\"\"\n",
" Send cleaned Bitcoin data to an LLM and receive a Markdown summary.\n",
" \"\"\"\n",
"\n",
" system_prompt = f\"\"\"\n",
" You are a professional crypto analyst. Your job is to read the provided Bitcoin network data \n",
" and write a clear, structured report that can be read directly as a daily summary.\n",
"\n",
" Following are the rules that you must adhere to:\n",
" - **IMPORTANT**: The summary title MUST use today's date: {today_date_str}. The title must be: \"Bitcoin Daily Summary - {today_date_str}\".\n",
" - **CRITICAL**: Do NOT infer the reporting period from the data. The data contains historical records, but your report is for {today_date_str}.\n",
" - Include **headings** for sections like \"Market Overview\", \"Network Metrics Explained\", \"Miner Revenue Trends\", and \"Halving Outlook\".\n",
" - Use **bullet points** for key metrics.\n",
" - Use a **table** for historical or time-series data if available.\n",
" - Explain important terms (like hashrate, difficulty, transaction fees) in plain language.\n",
"\n",
" Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
" \"\"\"\n",
"\n",
" # Convert the Python data dictionary into a clean JSON string for the prompt\n",
" data_str = json.dumps(data, indent=2)\n",
"\n",
" user_prompt = f\"\"\"\n",
" Today's date is {today_date_str}. Use this as the reference point for the report.\n",
"\n",
" The following data may contain historical records (e.g., from 2024), \n",
" but you must treat it as background context and write the summary as of {today_date_str}.\n",
"\n",
" Here is the data for you to summarize: \n",
" {data_str}\n",
" \"\"\"\n",
" \n",
" response = client.chat.completions.create(\n",
" model= \"gpt-4.1-mini\", \n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
" )\n",
"\n",
" markdown_text = response.choices[0].message.content.strip()\n",
" return markdown_text"
]
},
{
"cell_type": "markdown",
"id": "1e8c2d7d",
"metadata": {},
"source": [
"## Main Function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05059ed9",
"metadata": {},
"outputs": [],
"source": [
"def main():\n",
" # 0. Get today's date as a string\n",
" today_str = datetime.datetime.now().strftime('%B %d, %Y')\n",
" \n",
" # 1. Fetch and clean data\n",
" print(\"Fetching Bitcoin data...\")\n",
" data = fetch_clean_data()\n",
"\n",
" # 2. Generate Markdown summary\n",
" print(\"Generating LLM summary...\")\n",
" markdown_report = generate_markdown_summary(data, today_str)\n",
"\n",
" # 3. Display Output\n",
" display(Markdown(markdown_report))\n",
"\n",
"if __name__ == \"__main__\":\n",
" main()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,121 +0,0 @@
# utils.py
import requests
import re
import datetime
import logging
from typing import Dict, Optional, Union
# -----------------------------------------
# Logging setup
# -----------------------------------------
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# -----------------------------------------
# Braiins API endpoints (7 selected)
# -----------------------------------------
BRAIINS_APIS = {
'price_stats': 'https://insights.braiins.com/api/v1.0/price-stats',
'hashrate_stats': 'https://insights.braiins.com/api/v1.0/hashrate-stats',
'difficulty_stats': 'https://insights.braiins.com/api/v1.0/difficulty-stats',
'transaction_fees_history': 'https://insights.braiins.com/api/v1.0/transaction-fees-history',
'daily_revenue_history': 'https://insights.braiins.com/api/v1.0/daily-revenue-history',
'hashrate_value_history': 'https://insights.braiins.com/api/v1.0/hashrate-value-history',
'halvings': 'https://insights.braiins.com/api/v2.0/halvings'
}
# -----------------------------------------
# Utility Functions
# -----------------------------------------
def clean_value(value):
"""Clean strings, remove brackets/quotes and standardize whitespace."""
if value is None:
return ""
s = str(value)
s = s.replace(",", " ")
s = re.sub(r"[\[\]\{\}\(\)]", "", s)
s = s.replace('"', "").replace("'", "")
s = re.sub(r"\s+", " ", s)
return s.strip()
def parse_date(date_str: str) -> Optional[str]:
"""Parse dates into a standard readable format."""
if not date_str or not isinstance(date_str, str):
return None
try:
if 'T' in date_str:
return datetime.datetime.fromisoformat(date_str.replace('Z', '').split('.')[0]).strftime('%Y-%m-%d %H:%M:%S')
if '-' in date_str and len(date_str) == 10:
return datetime.datetime.strptime(date_str, '%Y-%m-%d').strftime('%Y-%m-%d %H:%M:%S')
if '/' in date_str and len(date_str) == 10:
return datetime.datetime.strptime(date_str, '%m/%d/%Y').strftime('%Y-%m-%d %H:%M:%S')
except Exception:
return date_str
return date_str
def fetch_endpoint_data(url: str) -> Optional[Union[Dict, list]]:
"""Generic GET request to Braiins API endpoint."""
try:
resp = requests.get(url, timeout=15)
resp.raise_for_status()
return resp.json()
except Exception as e:
logger.error(f"Failed to fetch {url}: {e}")
return None
def clean_and_process_data(data: Union[Dict, list]) -> Union[Dict, list]:
"""Clean all keys and values in the fetched data."""
if isinstance(data, dict):
return {clean_value(k): clean_value(v) for k, v in data.items()}
elif isinstance(data, list):
cleaned_list = []
for item in data:
if isinstance(item, dict):
cleaned_list.append({clean_value(k): clean_value(v) for k, v in item.items()})
else:
cleaned_list.append(clean_value(item))
return cleaned_list
return clean_value(data)
# -----------------------------------------
# Main data fetcher
# -----------------------------------------
def fetch_clean_data(history_limit: int = 30) -> Dict[str, Union[Dict, list]]:
"""
Fetch and clean data from 7 selected Braiins endpoints.
For historical data, it limits the number of records.
Returns a dictionary ready to be passed into an LLM.
"""
logger.info("Fetching Bitcoin network data from Braiins...")
results = {}
for key, url in BRAIINS_APIS.items():
logger.info(f"Fetching {key} ...")
raw_data = fetch_endpoint_data(url)
if raw_data is not None:
# --- START OF THE NEW CODE ---
# If the endpoint is for historical data, limit the number of records
if "history" in key and isinstance(raw_data, list):
logger.info(f"Limiting {key} data to the last {history_limit} records.")
raw_data = raw_data[-history_limit:]
# --- END OF THE NEW CODE ---
results[key] = clean_and_process_data(raw_data)
else:
results[key] = {"error": "Failed to fetch"}
logger.info("All data fetched and cleaned successfully.")
return results
# -----------------------------------------
# Local test run (optional)
# -----------------------------------------
if __name__ == "__main__":
data = fetch_clean_data()
print("Sample keys fetched:", list(data.keys()))


@@ -1,50 +0,0 @@
# **Automated Bitcoin Daily Summary Generator**
This project automates the process of generating a daily summary of the Bitcoin network's status. It fetches real-time data from multiple public API endpoints, processes it, and then uses a Large Language Model (LLM) to generate a clear, structured, and human-readable report in Markdown format.
## **Project Overview**
The core goal of this project is to provide a snapshot of key Bitcoin metrics without manual analysis. By leveraging the Braiins Public API for data and OpenAI's GPT models for summarization, it can produce insightful daily reports covering market trends, network health, miner revenue, and future outlooks like the next halving event.
### **Key Features**
- **Automated Data Fetching**: Pulls data from 7 different Braiins API endpoints covering price, hashrate, difficulty, transaction fees, and more.
- **Data Cleaning**: Pre-processes the raw JSON data to make it clean and suitable for the LLM.
- **Intelligent Summarization**: Uses an advanced LLM to analyze the data and generate a structured report with explanations for technical terms.
- **Dynamic Dating**: The report is always dated for the day it is run, providing a timely summary regardless of the timestamps in the source data.
- **Markdown Output**: Generates a clean, well-formatted Markdown file that is easy to read or integrate into other systems.
## **How It Works**
The project is split into two main files:
1. **utils.py**: A utility script responsible for all data fetching and cleaning operations.
- It defines the Braiins API endpoints to be queried.
- It contains functions to handle HTTP requests, parse JSON responses, and clean up keys and values to ensure consistency.
2. **day_1_bitcoin_daily_brief.ipynb**: A Jupyter Notebook that acts as the main orchestrator.
- It imports the necessary functions from utils.py.
- It calls fetch_clean_data() to get the latest Bitcoin network data.
- It constructs a detailed system and user prompt for the LLM, explicitly instructing it on the desired format and, crucially, to use the current date for the summary.
- It sends the data and prompt to the OpenAI API.
- It receives the generated summary and displays it as formatted Markdown.
## **Setup and Usage**
To run this project, you will need to have Python and the required libraries installed.
### **1. Prerequisites**
- Python 3.x
- Jupyter Notebook or JupyterLab
### **2. Installation**
- Install the necessary Python libraries: `pip install requests openai python-dotenv jupyter`
### **3. Configuration**
You need an API key from OpenAI to use the summarization feature.
1. Create a file named `.env` in the root directory of the project.
2. Add your OpenAI API key to the `.env` file as follows:
   `OPENAI_API_KEY='your_openai_api_key_here'`
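
The notebook that accompanies this copy of the README points the OpenAI client at a locally hosted Ollama server rather than the hosted API, so the key above is only needed if you switch back to OpenAI models. A minimal sketch of that local configuration, mirroring the notebook's setup cell:

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint; the api_key value is a required placeholder.
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Say hello'}],
)
print(response.choices[0].message.content)
```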


@@ -1,152 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "abaef96b",
"metadata": {},
"source": [
"## Importing The Libraries"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f90c541b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import datetime\n",
"from utils import fetch_clean_data\n",
"from openai import OpenAI\n",
"from IPython.display import Markdown, display\n",
"import json"
]
},
{
"cell_type": "markdown",
"id": "6e6c864b",
"metadata": {},
"source": [
"## Configuration"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "be62299d",
"metadata": {},
"outputs": [],
"source": [
"client = OpenAI(base_url='http://localhost:11434/v1', api_key = 'ollama')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3aa8e3e2",
"metadata": {},
"outputs": [],
"source": [
"def generate_markdown_summary(data: dict, today_date_str: str) -> str:\n",
" \"\"\"\n",
" Send cleaned Bitcoin data to an LLM and receive a Markdown summary.\n",
" \"\"\"\n",
"\n",
" system_prompt = f\"\"\"\n",
" You are a professional crypto analyst. Your job is to read the provided Bitcoin network data \n",
" and write a clear, structured report that can be read directly as a daily summary.\n",
"\n",
" Following are the rules that you must adhere to:\n",
" - **IMPORTANT**: The summary title MUST use today's date: {today_date_str}. The title must be: \"Bitcoin Daily Summary - {today_date_str}\".\n",
" - **CRITICAL**: Do NOT infer the reporting period from the data. The data contains historical records, but your report is for {today_date_str}.\n",
" - Include **headings** for sections like \"Market Overview\", \"Network Metrics Explained\", \"Miner Revenue Trends\", and \"Halving Outlook\".\n",
" - Use **bullet points** for key metrics.\n",
" - Use a **table** for historical or time-series data if available.\n",
" - Explain important terms (like hashrate, difficulty, transaction fees) in plain language.\n",
"\n",
" Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
" \"\"\"\n",
"\n",
" # Convert the Python data dictionary into a clean JSON string for the prompt\n",
" data_str = json.dumps(data, indent=2)\n",
"\n",
" user_prompt = f\"\"\"\n",
" Today's date is {today_date_str}. Use this as the reference point for the report.\n",
"\n",
" The following data may contain historical records (e.g., from 2024), \n",
" but you must treat it as background context and write the summary as of {today_date_str}.\n",
"\n",
" Here is the data for you to summarize: \n",
" {data_str}\n",
" \"\"\"\n",
" \n",
" response = client.chat.completions.create(\n",
" model= \"llama3.2\", \n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
" )\n",
"\n",
" markdown_text = response.choices[0].message.content.strip()\n",
" return markdown_text"
]
},
{
"cell_type": "markdown",
"id": "1e8c2d7d",
"metadata": {},
"source": [
"## Main Function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05059ed9",
"metadata": {},
"outputs": [],
"source": [
"def main():\n",
" # 0. Get today's date as a string\n",
" today_str = datetime.datetime.now().strftime('%B %d, %Y')\n",
" \n",
" # 1. Fetch and clean data\n",
" print(\"Fetching Bitcoin data...\")\n",
" data = fetch_clean_data()\n",
"\n",
" # 2. Generate Markdown summary\n",
" print(\"Generating LLM summary...\")\n",
" markdown_report = generate_markdown_summary(data, today_str)\n",
"\n",
" # 3. Display Output\n",
" display(Markdown(markdown_report))\n",
"\n",
"if __name__ == \"__main__\":\n",
" main()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,113 +0,0 @@
# utils.py
import requests
import re
import datetime
import logging
from typing import Dict, Optional, Union
# -----------------------------------------
# Logging setup
# -----------------------------------------
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# -----------------------------------------
# Braiins API endpoints (7 selected)
# -----------------------------------------
BRAIINS_APIS = {
'price_stats': 'https://insights.braiins.com/api/v1.0/price-stats',
'hashrate_stats': 'https://insights.braiins.com/api/v1.0/hashrate-stats',
'difficulty_stats': 'https://insights.braiins.com/api/v1.0/difficulty-stats',
'transaction_fees_history': 'https://insights.braiins.com/api/v1.0/transaction-fees-history',
'daily_revenue_history': 'https://insights.braiins.com/api/v1.0/daily-revenue-history',
'hashrate_value_history': 'https://insights.braiins.com/api/v1.0/hashrate-value-history',
'halvings': 'https://insights.braiins.com/api/v2.0/halvings'
}
# -----------------------------------------
# Utility Functions
# -----------------------------------------
def clean_value(value):
"""Clean strings, remove brackets/quotes and standardize whitespace."""
if value is None:
return ""
s = str(value)
s = s.replace(",", " ")
s = re.sub(r"[\[\]\{\}\(\)]", "", s)
s = s.replace('"', "").replace("'", "")
s = re.sub(r"\s+", " ", s)
return s.strip()
def parse_date(date_str: str) -> Optional[str]:
"""Parse dates into a standard readable format."""
if not date_str or not isinstance(date_str, str):
return None
try:
if 'T' in date_str:
return datetime.datetime.fromisoformat(date_str.replace('Z', '').split('.')[0]).strftime('%Y-%m-%d %H:%M:%S')
if '-' in date_str and len(date_str) == 10:
return datetime.datetime.strptime(date_str, '%Y-%m-%d').strftime('%Y-%m-%d %H:%M:%S')
if '/' in date_str and len(date_str) == 10:
return datetime.datetime.strptime(date_str, '%m/%d/%Y').strftime('%Y-%m-%d %H:%M:%S')
except Exception:
return date_str
return date_str
def fetch_endpoint_data(url: str) -> Optional[Union[Dict, list]]:
"""Generic GET request to Braiins API endpoint."""
try:
resp = requests.get(url, timeout=15)
resp.raise_for_status()
return resp.json()
except Exception as e:
logger.error(f"Failed to fetch {url}: {e}")
return None
def clean_and_process_data(data: Union[Dict, list]) -> Union[Dict, list]:
"""Clean all keys and values in the fetched data."""
if isinstance(data, dict):
return {clean_value(k): clean_value(v) for k, v in data.items()}
elif isinstance(data, list):
cleaned_list = []
for item in data:
if isinstance(item, dict):
cleaned_list.append({clean_value(k): clean_value(v) for k, v in item.items()})
else:
cleaned_list.append(clean_value(item))
return cleaned_list
return clean_value(data)
# -----------------------------------------
# Main data fetcher
# -----------------------------------------
def fetch_clean_data() -> Dict[str, Union[Dict, list]]:
"""
Fetch and clean data from 7 selected Braiins endpoints.
Returns a dictionary ready to be passed into an LLM.
"""
logger.info("Fetching Bitcoin network data from Braiins...")
results = {}
for key, url in BRAIINS_APIS.items():
logger.info(f"Fetching {key} ...")
raw_data = fetch_endpoint_data(url)
if raw_data is not None:
results[key] = clean_and_process_data(raw_data)
else:
results[key] = {"error": "Failed to fetch"}
logger.info("All data fetched and cleaned successfully.")
return results
# -----------------------------------------
# Local test run (optional)
# -----------------------------------------
if __name__ == "__main__":
data = fetch_clean_data()
print("Sample keys fetched:", list(data.keys()))


@@ -0,0 +1,207 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 9,
"id": "57499cf2",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display, update_display\n",
"from scraper import fetch_website_links, fetch_website_contents\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "310a13f3",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"client = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "79226a7f",
"metadata": {},
"outputs": [],
"source": [
"link_analyzer_prompt = \"\"\"\n",
"You are a skilled research analyst. Your task is to identify the most useful introductory links for a given topic from a list of URLs. \n",
"You must ignore forum posts, product pages, and social media links. Focus on high-quality articles, documentation, and educational resources.\n",
"Respond ONLY with a JSON object in the following format:\n",
"{\n",
" \"links\": [\n",
" {\"type\": \"overview_article\", \"url\": \"https://...\"},\n",
" {\"type\": \"technical_docs\", \"url\": \"https://...\"},\n",
" {\"type\": \"history_summary\", \"url\": \"https://...\"}\n",
" ]\n",
"}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "73d02b52",
"metadata": {},
"outputs": [],
"source": [
"briefing_prompt = \"\"\"\n",
"You are an expert intelligence analyst. You will be given raw text from several articles about a topic. \n",
"Your mission is to synthesize this information into a clear and structured research brief. \n",
"The brief must contain the following sections in Markdown:\n",
"\n",
"Research Brief: {topic}\n",
"\n",
"1. Executive Summary\n",
"(A one-paragraph overview of the entire topic.)\n",
"\n",
"2. Key Concepts\n",
"(Use bullet points to list and explain the most important terms and ideas.)\n",
"\n",
"3. Important Figures / Events\n",
"(List the key people, organizations, or historical events relevant to the topic.)\n",
"\n",
"4. Further Reading\n",
"(Provide a list of the original URLs you analyzed for deeper study.)\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ab04efb6",
"metadata": {},
"outputs": [],
"source": [
"def get_relevant_links(topic: str, starting_url: str) -> dict:\n",
" \n",
" # getting all links from the starting URL\n",
" links_on_page = fetch_website_links(starting_url)\n",
" \n",
" # user prompt for the Link Analyst\n",
" user_prompt = f\"\"\"\n",
" Please analyze the following links related to the topic \"{topic}\" and return the most relevant ones for a research brief.\n",
" The main URL is {starting_url}. Make sure all returned URLs are absolute.\n",
"\n",
" Links:\n",
" {\"\\n\".join(links_on_page)}\n",
" \"\"\"\n",
" \n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\", \n",
" messages=[\n",
" {\"role\": \"system\", \"content\": link_analyzer_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" response_format={\"type\": \"json_object\"}\n",
" )\n",
" \n",
" result_json = response.choices[0].message.content\n",
" relevant_links = json.loads(result_json)\n",
" return relevant_links"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "ef6ef363",
"metadata": {},
"outputs": [],
"source": [
"def get_all_content(links_data: dict) -> str:\n",
" all_content = \"\"\n",
" original_urls = []\n",
"\n",
" for link in links_data.get(\"links\", []):\n",
" url = link.get(\"url\")\n",
" if url:\n",
" original_urls.append(url)\n",
" content = fetch_website_contents(url)\n",
" all_content += f\"Content from {url} \\n{content}\\n\\n\"\n",
" \n",
" all_content += f\"Original URLs for Reference\\n\" + \"\\n\".join(original_urls)\n",
" return all_content"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c2020492",
"metadata": {},
"outputs": [],
"source": [
"def create_research_brief(topic: str, starting_url: str):\n",
" relevant_links = get_relevant_links(topic, starting_url)\n",
" full_content = get_all_content(relevant_links)\n",
"\n",
" user_prompt = f\"\"\"\n",
" Please create a research brief on the topic \"{topic}\" using the following content.\n",
" Remember to include the original URLs in the 'Further Reading' section.\n",
"\n",
" Content:\n",
" {full_content[:15000]}\n",
" \"\"\"\n",
" \n",
" stream = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": briefing_prompt.format(topic=topic)},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" stream=True\n",
" )\n",
" \n",
" response = \"\"\n",
" display_handle = display(Markdown(\"\"), display_id=True)\n",
" for chunk in stream:\n",
" response += chunk.choices[0].delta.content or ''\n",
" update_display(Markdown(response), display_id=display_handle.display_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "594e940c",
"metadata": {},
"outputs": [],
"source": [
"create_research_brief(\n",
" topic=\"The Rise of Artificial Intelligence\", \n",
" starting_url=\"https://en.wikipedia.org/wiki/Artificial_intelligence\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,37 @@
from bs4 import BeautifulSoup
import requests
# Standard headers to fetch a website
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}
def fetch_website_contents(url):
"""
Return the title and contents of the website at the given url;
truncate to 2,000 characters as a sensible limit
"""
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
title = soup.title.string if soup.title else "No title found"
if soup.body:
for irrelevant in soup.body(["script", "style", "img", "input"]):
irrelevant.decompose()
text = soup.body.get_text(separator="\n", strip=True)
else:
text = ""
return (title + "\n\n" + text)[:2_000]
def fetch_website_links(url):
"""
Return the links on the website at the given url
I realize this is inefficient as we're parsing twice! This is to keep the code in the lab simple.
Feel free to use a class and optimize it!
"""
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
links = [link.get("href") for link in soup.find_all("a")]
return [link for link in links if link]
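
A quick usage sketch for these helpers (hypothetical URL; assumes scraper.py is on the import path):

from scraper import fetch_website_contents, fetch_website_links

# Title plus visible body text, truncated to 2,000 characters
print(fetch_website_contents("https://example.com"))

# hrefs exactly as found on the page; relative links are returned as-is
for link in fetch_website_links("https://example.com")[:10]:
    print(link)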


@@ -0,0 +1,337 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "1665a5cf",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"import time\n",
"import json\n",
"import sqlite3\n",
"from dotenv import load_dotenv\n",
"import gradio as gr\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cb6632c",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv()\n",
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n",
"DB_PATH = \"nova_support.db\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cd3ac8c",
"metadata": {},
"outputs": [],
"source": [
"def init_db():\n",
" conn = sqlite3.connect(DB_PATH)\n",
" cur = conn.cursor()\n",
" cur.execute(\"\"\"\n",
" CREATE TABLE IF NOT EXISTS tickets (\n",
" ticket_id TEXT PRIMARY KEY,\n",
" name TEXT,\n",
" company TEXT,\n",
" email TEXT,\n",
" issue TEXT,\n",
" priority TEXT,\n",
" status TEXT,\n",
" created_at TEXT\n",
" )\n",
" \"\"\")\n",
" conn.commit()\n",
" conn.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70e0556c",
"metadata": {},
"outputs": [],
"source": [
"def new_ticket_id():\n",
" conn = sqlite3.connect(DB_PATH)\n",
" cur = conn.cursor()\n",
" cur.execute(\"SELECT COUNT(*) FROM tickets\")\n",
" count = cur.fetchone()[0]\n",
" conn.close()\n",
" return f\"RT-{1001 + count}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38525d5c",
"metadata": {},
"outputs": [],
"source": [
"def create_ticket(name, company, email, issue, priority=\"P3\"):\n",
" tid = new_ticket_id()\n",
" ts = time.strftime(\"%Y-%m-%d %H:%M:%S\")\n",
" conn = sqlite3.connect(DB_PATH)\n",
" cur = conn.cursor()\n",
" cur.execute(\"\"\"\n",
" INSERT INTO tickets (ticket_id, name, company, email, issue, priority, status, created_at)\n",
" VALUES (?, ?, ?, ?, ?, ?, ?, ?)\n",
" \"\"\", (tid, name, company, email, issue, priority.upper(), \"OPEN\", ts))\n",
" conn.commit()\n",
" conn.close()\n",
" return tid, ts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58e803c5",
"metadata": {},
"outputs": [],
"source": [
"def get_ticket(ticket_id):\n",
" conn = sqlite3.connect(DB_PATH)\n",
" cur = conn.cursor()\n",
" cur.execute(\"SELECT * FROM tickets WHERE ticket_id=?\", (ticket_id,))\n",
" row = cur.fetchone()\n",
" conn.close()\n",
" if not row:\n",
" return None\n",
" keys = [\"ticket_id\", \"name\", \"company\", \"email\", \"issue\", \"priority\", \"status\", \"created_at\"]\n",
" return dict(zip(keys, row))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b97601ff",
"metadata": {},
"outputs": [],
"source": [
"def synthesize_speech(text):\n",
" if not text.strip():\n",
" return None\n",
" output_path = Path(tempfile.gettempdir()) / \"nova_reply.mp3\"\n",
" with client.audio.speech.with_streaming_response.create(\n",
" model=\"gpt-4o-mini-tts\",\n",
" voice=\"alloy\",\n",
" input=text\n",
" ) as response:\n",
" response.stream_to_file(output_path)\n",
" return str(output_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4e20aad",
"metadata": {},
"outputs": [],
"source": [
"SYSTEM_PROMPT = \"\"\"\n",
"You are Nova, the AI Support and Sales Assistant for Reallytics.ai.\n",
"You help customers with:\n",
"- Reporting issues (create tickets)\n",
"- Checking existing tickets\n",
"- Providing product/service information\n",
"- Explaining pricing ranges\n",
"- Reassuring integration compatibility with client systems\n",
"Respond in a professional, business tone.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d1c094d",
"metadata": {},
"outputs": [],
"source": [
"def detect_intent(message):\n",
" text = message.lower()\n",
" if any(k in text for k in [\"create ticket\", \"open ticket\", \"new ticket\", \"issue\", \"problem\"]):\n",
" return \"create_ticket\"\n",
" if re.search(r\"rt-\\d+\", text):\n",
" return \"check_ticket\"\n",
" if \"price\" in text or \"cost\" in text:\n",
" return \"pricing\"\n",
" if \"integration\" in text:\n",
" return \"integration\"\n",
" return \"general\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed9114d5",
"metadata": {},
"outputs": [],
"source": [
"def chat(message, history, model, name, company, email):\n",
" history_msgs = [{\"role\": h[\"role\"], \"content\": h[\"content\"]} for h in history]\n",
" intent = detect_intent(message)\n",
"\n",
" if intent == \"create_ticket\":\n",
" priority = \"P2\" if \"urgent\" in message.lower() or \"high\" in message.lower() else \"P3\"\n",
" tid, ts = create_ticket(name, company, email, message, priority)\n",
" text = f\"A new support ticket has been created.\\nTicket ID: {tid}\\nCreated at: {ts}\\nStatus: OPEN\"\n",
" yield text, synthesize_speech(text)\n",
" return\n",
"\n",
" if intent == \"check_ticket\":\n",
" match = re.search(r\"(rt-\\d+)\", message.lower())\n",
" if match:\n",
" ticket_id = match.group(1).upper()\n",
" data = get_ticket(ticket_id)\n",
" if data:\n",
" text = (\n",
" f\"Ticket {ticket_id} Details:\\n\"\n",
" f\"Issue: {data['issue']}\\n\"\n",
" f\"Status: {data['status']}\\n\"\n",
" f\"Priority: {data['priority']}\\n\"\n",
" f\"Created at: {data['created_at']}\"\n",
" )\n",
" else:\n",
" text = f\"No ticket found with ID {ticket_id}.\"\n",
" else:\n",
" text = \"Please provide a valid ticket ID.\"\n",
" yield text, synthesize_speech(text)\n",
" return"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "280c7d2f",
"metadata": {},
"outputs": [],
"source": [
"def chat(message, history, model, name, company, email):\n",
" if not message.strip():\n",
" yield \"Please type a message to start.\", None\n",
" return\n",
"\n",
" history_msgs = [{\"role\": h[\"role\"], \"content\": h[\"content\"]} for h in history]\n",
" intent = detect_intent(message)\n",
" reply, audio_path = \"\", None\n",
"\n",
" if intent == \"create_ticket\":\n",
" priority = \"P2\" if \"urgent\" in message.lower() or \"high\" in message.lower() else \"P3\"\n",
" tid, ts = create_ticket(name, company, email, message, priority)\n",
" reply = f\"A new support ticket has been created.\\nTicket ID: {tid}\\nCreated at: {ts}\\nStatus: OPEN\"\n",
" audio_path = synthesize_speech(reply)\n",
" yield reply, audio_path\n",
" return\n",
"\n",
" if intent == \"check_ticket\":\n",
" match = re.search(r\"(rt-\\d+)\", message.lower())\n",
" if match:\n",
" ticket_id = match.group(1).upper()\n",
" data = get_ticket(ticket_id)\n",
" if data:\n",
" reply = (\n",
" f\"Ticket {ticket_id} Details:\\n\"\n",
" f\"Issue: {data['issue']}\\n\"\n",
" f\"Status: {data['status']}\\n\"\n",
" f\"Priority: {data['priority']}\\n\"\n",
" f\"Created at: {data['created_at']}\"\n",
" )\n",
" else:\n",
" reply = f\"No ticket found with ID {ticket_id}.\"\n",
" else:\n",
" reply = \"Please provide a valid ticket ID.\"\n",
" audio_path = synthesize_speech(reply)\n",
" yield reply, audio_path\n",
" return\n",
"\n",
" messages = [{\"role\": \"system\", \"content\": SYSTEM_PROMPT}] + history_msgs + [{\"role\": \"user\", \"content\": message}]\n",
" stream = client.chat.completions.create(model=model, messages=messages, stream=True)\n",
"\n",
" full_reply = \"\"\n",
" for chunk in stream:\n",
" delta = chunk.choices[0].delta.content or \"\"\n",
" full_reply += delta\n",
" yield full_reply, None \n",
" audio_path = synthesize_speech(full_reply)\n",
" yield full_reply, audio_path "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0cb1977d",
"metadata": {},
"outputs": [],
"source": [
"init_db()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a0557ba",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks(title=\"Nova | Business AI Assistant\", theme=gr.themes.Soft()) as demo:\n",
" gr.Markdown(\"## Nova | Reallytics.ai Customer Support & Sales Assistant\")\n",
" gr.Markdown(\n",
" \"Nova helps clients create or track support tickets, understand services, and explore automation options. \"\n",
" \"Type your questions and Nova will respond in both text and voice.\"\n",
" )\n",
"\n",
" with gr.Row():\n",
" name = gr.Textbox(label=\"Your Name\", placeholder=\"Liam\")\n",
" company = gr.Textbox(label=\"Company (optional)\", placeholder=\"ABC Corp\")\n",
" email = gr.Textbox(label=\"Email\", placeholder=\"you@example.com\")\n",
"\n",
" model = gr.Dropdown([\"gpt-4o-mini\", \"gpt-4\", \"gpt-3.5-turbo\"], value=\"gpt-4o-mini\", label=\"Model\")\n",
"\n",
" audio_output = gr.Audio(label=\"Nova's Voice Reply\", autoplay=True, interactive=False)\n",
"\n",
" gr.ChatInterface(\n",
" fn=chat,\n",
" type=\"messages\",\n",
" additional_inputs=[model, name, company, email],\n",
" additional_outputs=[audio_output],\n",
" title=\"Chat with Nova\",\n",
" description=\"Ask about tickets, automation services, pricing, or integration and Nova will also speak her reply.\"\n",
" )\n",
"\n",
"if __name__ == \"__main__\":\n",
" demo.launch()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,144 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "d59206dc",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import ollama\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad035727",
"metadata": {},
"outputs": [],
"source": [
"# Load keys\n",
"load_dotenv()\n",
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n",
"ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key = 'ollama')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f521334",
"metadata": {},
"outputs": [],
"source": [
"# ---- SYSTEM PROMPTS ----\n",
"athena_system = \"\"\"\n",
"You are Athena, a strategic thinker and visionary. You seek meaning, long-term implications,\n",
"and practical wisdom in every discussion. Be concise (1-2 sentences).\n",
"\"\"\"\n",
"\n",
"loki_system = \"\"\"\n",
"You are Loki, a sarcastic trickster who mocks and challenges everyone else's opinions.\n",
"You use humor, wit, and irony to undermine serious arguments. Be concise (1-2 sentences).\n",
"\"\"\"\n",
"\n",
"orion_system = \"\"\"\n",
"You are Orion, a data-driven realist. You respond with evidence, statistics, or factual analysis.\n",
"If data is not available, make a logical deduction. Be concise (1-2 sentences).\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a6d04f6",
"metadata": {},
"outputs": [],
"source": [
"# ---- INITIAL CONVERSATION ----\n",
"conversation = [\n",
" {\"role\": \"system\", \"name\": \"Athena\", \"content\": athena_system},\n",
" {\"role\": \"system\", \"name\": \"Loki\", \"content\": loki_system},\n",
" {\"role\": \"system\", \"name\": \"Orion\", \"content\": orion_system},\n",
" {\"role\": \"user\", \"content\": \"Topic: 'Why did the chicken cross the road?' Begin your discussion.\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e292a27b",
"metadata": {},
"outputs": [],
"source": [
"# ---- HELPER FUNCTIONS ----\n",
"def call_gpt(name, system_prompt, conversation):\n",
" \"\"\"Call GPT model with current conversation context.\"\"\"\n",
" messages = [{\"role\": \"system\", \"content\": system_prompt}]\n",
" messages += [{\"role\": \"user\", \"content\": f\"The conversation so far:\\n{format_conversation(conversation)}\\nNow respond as {name}.\"}]\n",
" resp = client.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)\n",
" return resp.choices[0].message.content.strip()\n",
"\n",
"def call_ollama(name, system_prompt, conversation):\n",
" \"\"\"Call Ollama (Llama3.2) as a local model.\"\"\"\n",
" messages = [{\"role\": \"system\", \"content\": system_prompt}]\n",
" messages += [{\"role\": \"user\", \"content\": f\"The conversation so far:\\n{format_conversation(conversation)}\\nNow respond as {name}.\"}]\n",
" resp = ollama.chat(model=\"llama3.2\", messages=messages)\n",
" return resp['message']['content'].strip()\n",
"\n",
"def format_conversation(conv):\n",
" return \"\\n\".join([f\"{m.get('name', m['role']).upper()}: {m['content']}\" for m in conv if m['role'] != \"system\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0eb4d72",
"metadata": {},
"outputs": [],
"source": [
"# ---- MAIN LOOP ----\n",
"rounds = 5\n",
"for i in range(rounds):\n",
" # Athena responds\n",
" athena_reply = call_gpt(\"Athena\", athena_system, conversation)\n",
" conversation.append({\"role\": \"assistant\", \"name\": \"Athena\", \"content\": athena_reply})\n",
" display(Markdown(f\"**Athena:** {athena_reply}\"))\n",
"\n",
" # Loki responds\n",
" loki_reply = call_ollama(\"Loki\", loki_system, conversation)\n",
" conversation.append({\"role\": \"assistant\", \"name\": \"Loki\", \"content\": loki_reply})\n",
" display(Markdown(f\"**Loki:** {loki_reply}\"))\n",
"\n",
" # Orion responds\n",
" orion_reply = call_gpt(\"Orion\", orion_system, conversation)\n",
" conversation.append({\"role\": \"assistant\", \"name\": \"Orion\", \"content\": orion_reply})\n",
" display(Markdown(f\"**Orion:** {orion_reply}\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,47 +0,0 @@
# Multi-Agent Conversation Simulator (OpenAI + Ollama)
## Project Overview
This project is an experimental **multi-agent conversational simulation** built with **OpenAI GPT models** and a locally-hosted **Ollama LLM (Llama 3.2)**. It demonstrates how multiple AI personas can participate in a shared conversation, each with distinct roles, perspectives, and behaviors — producing a dynamic, evolving debate from different angles.
The script orchestrates a **three-way dialogue** around a single topic (“Why did the chicken cross the road?”) between three agents, each powered by a different model and persona definition:
- **Athena (OpenAI GPT-4o mini):** A strategic thinker who looks for deeper meaning, long-term consequences, and practical wisdom.
- **Loki (Ollama Llama 3.2):** A sarcastic trickster who mocks, questions, and challenges the others with wit and irony.
- **Orion (OpenAI GPT-4o mini):** A data-driven realist who grounds the discussion in facts, statistics, or logical deductions.
## What's Happening in the Code
1. **Environment Setup**
- Loads the OpenAI API key from a `.env` file.
- Initializes OpenAI's Python client and configures a local Ollama endpoint.
2. **Persona System Prompts**
- Defines system prompts for each agent to give them unique personalities and communication styles.
- These prompts act as the “character definitions” for Athena, Loki, and Orion.
3. **Conversation Initialization**
- Starts with a single conversation topic provided by the user.
- All three agents are aware of the discussion context and prior messages.
4. **Conversation Loop**
- The conversation runs in multiple rounds (default: 5).
- In each round:
- **Athena (GPT)** responds first with a strategic viewpoint.
- **Loki (Ollama)** replies next, injecting sarcasm and skepticism.
- **Orion (GPT)** follows with a fact-based or analytical perspective.
- Each response is appended to the conversation history so future replies build on previous statements.
5. **Dynamic Context Sharing**
- Each agent receives the **entire conversation so far** as context before generating a response.
- This ensures their replies are relevant, coherent, and responsive to what the others have said.
6. **Output Rendering**
- Responses are displayed as Markdown in a readable, chat-like format for each speaker, round by round.
## Key Highlights
- Demonstrates **multi-agent orchestration** with different models working together in a single script.
- Uses **OpenAI GPT models** for reasoning and **Ollama (Llama 3.2)** for local, cost-free inference.
- Shows how **system prompts** and **context-aware message passing** can simulate realistic dialogues.
- Provides a template for experimenting with **AI characters**, **debate simulations**, or **collaborative agent systems**.


@@ -1,224 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4ef1e715",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import gradio as gr\n",
"from openai import OpenAI\n",
"from dotenv import load_dotenv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3426558",
"metadata": {},
"outputs": [],
"source": [
"# Load API key\n",
"load_dotenv()\n",
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e18a59a3",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# Helper: Prompt Builder\n",
"# -------------------------------\n",
"def build_prompt(task, topic, tone, audience):\n",
" task_prompts = {\n",
" \"Brochure\": f\"Write a compelling marketing brochure about {topic}.\",\n",
" \"Blog Post\": f\"Write a blog post on {topic} with engaging storytelling and useful insights.\",\n",
" \"Product Comparison\": f\"Write a product comparison summary focusing on {topic}, including pros, cons, and recommendations.\",\n",
" \"Idea Brainstorm\": f\"Brainstorm creative ideas or solutions related to {topic}.\"\n",
" }\n",
" base = task_prompts.get(task, \"Write something creative.\")\n",
" if tone:\n",
" base += f\" Use a {tone} tone.\"\n",
" if audience:\n",
" base += f\" Tailor it for {audience}.\"\n",
" return base"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65a27bfb",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# Generate with multiple models\n",
"# -------------------------------\n",
"def generate_stream(task, topic, tone, audience, model):\n",
" if not topic.strip():\n",
" yield \"⚠️ Please enter a topic.\"\n",
" return\n",
"\n",
" prompt = build_prompt(task, topic, tone, audience)\n",
"\n",
" stream = client.chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\", \"content\": prompt}\n",
" ],\n",
" max_tokens=800,\n",
" stream=True\n",
" )\n",
"\n",
" result = \"\"\n",
" for chunk in stream:\n",
" result += chunk.choices[0].delta.content or \"\"\n",
" yield result"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e15abee",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# Refinement logic\n",
"# -------------------------------\n",
"def refine_stream(original_text, instruction, model):\n",
" if not original_text.strip():\n",
" yield \"⚠️ Please paste the text you want to refine.\"\n",
" return\n",
" if not instruction.strip():\n",
" yield \"⚠️ Please provide a refinement instruction.\"\n",
" return\n",
"\n",
" refined_prompt = f\"Refine the following text based on this instruction: {instruction}\\n\\nText:\\n{original_text}\"\n",
"\n",
" stream = client.chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a writing assistant.\"},\n",
" {\"role\": \"user\", \"content\": refined_prompt}\n",
" ],\n",
" max_tokens=800,\n",
" stream=True\n",
" )\n",
"\n",
" result = \"\"\n",
" for chunk in stream:\n",
" result += chunk.choices[0].delta.content or \"\"\n",
" yield result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ee02feb",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# Gradio UI\n",
"# -------------------------------\n",
"with gr.Blocks(title=\"AI Creative Studio\") as demo:\n",
" gr.Markdown(\"# AI Creative Studio\\nGenerate marketing content, blog posts, or creative ideas — streamed in real-time!\")\n",
"\n",
" with gr.Row():\n",
" task = gr.Dropdown(\n",
" [\"Brochure\", \"Blog Post\", \"Product Comparison\", \"Idea Brainstorm\"],\n",
" label=\"Task Type\",\n",
" value=\"Brochure\"\n",
" )\n",
" topic = gr.Textbox(label=\"Topic\", placeholder=\"e.g., Electric Cars, AI in Education...\")\n",
" with gr.Row():\n",
" tone = gr.Textbox(label=\"Tone (optional)\", placeholder=\"e.g., professional, casual, humorous...\")\n",
" audience = gr.Textbox(label=\"Target Audience (optional)\", placeholder=\"e.g., investors, students, developers...\")\n",
"\n",
" model = gr.Dropdown(\n",
" [\"gpt-4o-mini\", \"gpt-3.5-turbo\", \"gpt-4\"],\n",
" label=\"Choose a model\",\n",
" value=\"gpt-4o-mini\"\n",
" )\n",
"\n",
" generate_btn = gr.Button(\"Generate Content\")\n",
" output_md = gr.Markdown(label=\"Generated Content\", show_label=True)\n",
"\n",
" generate_btn.click(\n",
" fn=generate_stream,\n",
" inputs=[task, topic, tone, audience, model],\n",
" outputs=output_md\n",
" )\n",
"\n",
" gr.Markdown(\"---\\n## Refine Your Content\")\n",
"\n",
" original_text = gr.Textbox(\n",
" label=\"Original Content\",\n",
" placeholder=\"Paste content you want to refine...\",\n",
" lines=10\n",
" )\n",
" instruction = gr.Textbox(\n",
" label=\"Refinement Instruction\",\n",
" placeholder=\"e.g., Make it shorter and more persuasive.\",\n",
" )\n",
" refine_model = gr.Dropdown(\n",
" [\"gpt-4o-mini\", \"gpt-3.5-turbo\", \"gpt-4\"],\n",
" label=\"Model for Refinement\",\n",
" value=\"gpt-4o-mini\"\n",
" )\n",
"\n",
" refine_btn = gr.Button(\"Refine\")\n",
" refined_output = gr.Markdown(label=\"Refined Content\", show_label=True)\n",
"\n",
" refine_btn.click(\n",
" fn=refine_stream,\n",
" inputs=[original_text, instruction, refine_model],\n",
" outputs=refined_output\n",
" )\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55d42c7e",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# Launch the App\n",
"# -------------------------------\n",
"if __name__ == \"__main__\":\n",
" demo.launch()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,48 +0,0 @@
# AI Creative Studio
## Project Overview
AI Creative Studio is a web-based application built with Gradio that allows users to generate and refine high-quality written content in real time using OpenAI language models. It is designed as a flexible creative tool for content creation tasks such as writing brochures, blog posts, product comparisons, and brainstorming ideas. The application also supports interactive refinement, enabling users to improve or adapt existing text based on specific instructions.
The core idea is to combine the power of OpenAI models with an intuitive, user-friendly interface that streams responses as they are generated. This provides a fast, engaging, and highly interactive writing experience without waiting for the entire response to complete before it appears.
---
## What's Happening in the Project
1. **Environment Setup and Model Initialization**
- The application loads the OpenAI API key from a `.env` file and initializes the OpenAI client for model interactions.
- Supported models include `gpt-4o-mini`, `gpt-3.5-turbo`, and `gpt-4`, which the user can select from a dropdown menu.
2. **Prompt Construction and Content Generation**
- The `build_prompt` function constructs a task-specific prompt based on the user's choices: content type (brochure, blog post, etc.), topic, tone, and target audience.
- Once the user provides the inputs and selects a model, the application sends the prompt to the model.
- The model's response is streamed back incrementally, showing text chunk by chunk for a real-time generation experience.
3. **Content Refinement Feature**
- Users can paste existing text and provide a refinement instruction (e.g., “make it more persuasive” or “summarize it”).
- The application then streams an improved version of the text, following the instruction, allowing users to iterate and polish content efficiently.
4. **Gradio User Interface**
- The app is built using Gradio Blocks, providing an organized and interactive layout.
- Key UI elements include:
- Task selection dropdown for choosing the type of content.
- Text inputs for topic, tone, and target audience.
- Model selection dropdown for choosing a specific OpenAI model.
- Real-time markdown display of generated content.
- A refinement panel for improving existing text.
5. **Streaming Workflow**
- Both generation and refinement use OpenAI's streaming API to display the model's response as it's produced.
- This provides an immediate and responsive user experience, allowing users to see results build up in real time rather than waiting for the entire completion.
---
### Key Features
- Real-time streaming responses for fast and interactive content creation.
- Multiple content generation modes: brochure, blog post, product comparison, and idea brainstorming.
- Customization options for tone and audience to tailor the writing style.
- Interactive refinement tool to enhance or transform existing text.
- Clean and intuitive web interface powered by Gradio.
AI Creative Studio demonstrates how large language models can be integrated into user-facing applications to support creative workflows and improve productivity in content generation and editing.


@@ -1,137 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "6f612c5a",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import gradio as gr\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "39c144fd",
"metadata": {},
"outputs": [],
"source": [
"# Load API Key\n",
"load_dotenv()\n",
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f656e0d1",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# 1. System Prompt (Business Context)\n",
"# -------------------------------\n",
"system_message = \"\"\"\n",
"You are Nova, an AI Sales & Solutions Consultant for Reallytics.ai a company specializing in building\n",
"custom AI chatbots, voice assistants, data dashboards, and automation solutions for businesses.\n",
"You are professional, insightful, and always focused on solving the user's business challenges.\n",
"First, try to understand their use case. Then suggest relevant solutions from our services with clear value propositions.\n",
"If the user is unsure, give them examples of how similar businesses have benefited from AI.\n",
"\"\"\"\n",
"\n",
"MODEL = \"gpt-4o-mini\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2faba29",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# 2. Smart Chat Function (Streaming)\n",
"# -------------------------------\n",
"def chat(message, history):\n",
" # Convert Gradio's chat history to OpenAI format\n",
" history_messages = [{\"role\": h[\"role\"], \"content\": h[\"content\"]} for h in history]\n",
"\n",
" # Adjust system message based on context dynamically\n",
" relevant_system_message = system_message\n",
" if \"price\" in message.lower():\n",
" relevant_system_message += (\n",
" \" If the user asks about pricing, explain that pricing depends on project complexity, \"\n",
" \"but typical POCs start around $2,000 - $5,000, and full enterprise deployments scale beyond that.\"\n",
" )\n",
" if \"integration\" in message.lower():\n",
" relevant_system_message += (\n",
" \" If integration is mentioned, reassure the user that our solutions are built to integrate seamlessly with CRMs, ERPs, or internal APIs.\"\n",
" )\n",
"\n",
" # Compose final messages\n",
" messages = [{\"role\": \"system\", \"content\": relevant_system_message}] + history_messages + [\n",
" {\"role\": \"user\", \"content\": message}\n",
" ]\n",
"\n",
" # Stream the response\n",
" stream = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=messages,\n",
" stream=True\n",
" )\n",
"\n",
" response = \"\"\n",
" for chunk in stream:\n",
" response += chunk.choices[0].delta.content or \"\"\n",
" yield response"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9d9515e",
"metadata": {},
"outputs": [],
"source": [
"# -------------------------------\n",
"# 3. Gradio Chat UI\n",
"# -------------------------------\n",
"with gr.Blocks(title=\"AI Business Assistant\") as demo:\n",
" gr.Markdown(\"# AI Business Assistant\\nYour intelligent sales and solution consultant, powered by OpenAI.\")\n",
"\n",
" \n",
"gr.ChatInterface(\n",
" fn=chat,\n",
" type=\"messages\",\n",
" title=\"Business AI Consultant\",\n",
" description=\"Ask about automation, chatbots, dashboards, or voice AI Nova will help you discover the right solution.\"\n",
").launch()\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,42 +0,0 @@
# AI Business Assistant
## Project Overview
This project is a prototype of an **AI-powered business consultant chatbot** built with **Gradio** and **OpenAI**. The assistant, named **Nova**, is designed to act as a virtual sales and solutions consultant for a company offering AI services such as chatbots, voice assistants, dashboards, and automation tools.
The purpose of the project is to demonstrate how an LLM (Large Language Model) can be adapted for a business context by carefully designing the **system prompt** and providing **dynamic behavior** based on user inputs. The chatbot responds to user queries in real time with streaming responses, making it interactive and natural to use.
## Whats Happening in the Code
1. **Environment Setup**
- The code loads the OpenAI API key from a `.env` file.
- The `OpenAI` client is initialized for communication with the language model.
- The chosen model is `gpt-4o-mini`.
2. **System Prompt for Business Context**
- The assistant is given a clear identity: *Nova, an AI Sales & Solutions Consultant for Reallytics.ai*.
- The system prompt defines Novas tone (professional, insightful) and role (understand user needs, propose relevant AI solutions, share examples).
3. **Dynamic Chat Function**
- The `chat()` function processes user input and the conversation history.
- It modifies the system prompt dynamically:
- If the user mentions **price**, Nova explains pricing ranges and factors.
- If the user mentions **integration**, Nova reassures the user about system compatibility.
- Messages are formatted for the OpenAI API, combining system, history, and user inputs.
- Responses are streamed back chunk by chunk, so users see the assistant typing in real time.
4. **Gradio Chat Interface**
- A Gradio interface is created with `ChatInterface` in `messages` mode.
- This automatically provides a chat-style UI with user/assistant message bubbles and a send button.
- The title and description help set context for end users: *“Ask about automation, chatbots, dashboards, or voice AI.”*
## Key Features
- **Business-specific persona:** The assistant is contextualized as a sales consultant rather than a generic chatbot.
- **Adaptive responses:** System prompt is adjusted based on keywords like "price" and "integration".
- **Streaming output:** Responses are displayed incrementally, improving user experience.
- **Clean chat UI:** Built with Gradios `ChatInterface` for simplicity and usability.
This project demonstrates how to combine **system prompts**, **dynamic context handling**, and **Gradio chat interfaces** to build a specialized AI assistant tailored for business use cases.

View File

@@ -0,0 +1,494 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# Welcome to the Day 2 Lab!\n"
]
},
{
"cell_type": "markdown",
"id": "ada885d9-4d42-4d9b-97f0-74fbbbfe93a9",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../assets/resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">Just before we get started --</h2>\n",
" <span style=\"color:#f71;\">I thought I'd take a second to point you at this page of useful resources for the course. This includes links to all the slides.<br/>\n",
" <a href=\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>\n",
" Please keep this bookmarked, and I'll continue to add more useful links there over time.\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "79ffe36f",
"metadata": {},
"source": [
"## First - let's talk about the Chat Completions API\n",
"\n",
"1. The simplest way to call an LLM\n",
"2. It's called Chat Completions because it's saying: \"here is a conversation, please predict what should come next\"\n",
"3. The Chat Completions API was invented by OpenAI, but it's so popular that everybody uses it!\n",
"\n",
"### We will start by calling OpenAI again - but don't worry non-OpenAI people, your time is coming!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e38f17a0",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "markdown",
"id": "97846274",
"metadata": {},
"source": [
"## Do you know what an Endpoint is?\n",
"\n",
"If not, please review the Technical Foundations guide in the guides folder\n",
"\n",
"And, here is an endpoint that might interest you..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5af5c188",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"headers = {\"Authorization\": f\"Bearer {api_key}\", \"Content-Type\": \"application/json\"}\n",
"\n",
"payload = {\n",
" \"model\": \"gpt-5-nano\",\n",
" \"messages\": [\n",
" {\"role\": \"user\", \"content\": \"Tell me a fun fact\"}]\n",
"}\n",
"\n",
"payload"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d0ab242",
"metadata": {},
"outputs": [],
"source": [
"response = requests.post(\n",
" \"https://api.openai.com/v1/chat/completions\",\n",
" headers=headers,\n",
" json=payload\n",
")\n",
"\n",
"response.json()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb11a9f6",
"metadata": {},
"outputs": [],
"source": [
"response.json()[\"choices\"][0][\"message\"][\"content\"]"
]
},
{
"cell_type": "markdown",
"id": "cea3026a",
"metadata": {},
"source": [
"# What is the openai package?\n",
"\n",
"It's known as a Python Client Library.\n",
"\n",
"It's nothing more than a wrapper around making this exact call to the http endpoint.\n",
"\n",
"It just allows you to work with nice Python code instead of messing around with janky json objects.\n",
"\n",
"But that's it. It's open-source and lightweight. Some people think it contains OpenAI model code - it doesn't!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "490fdf09",
"metadata": {},
"outputs": [],
"source": [
"# Create OpenAI client\n",
"\n",
"from openai import OpenAI\n",
"openai = OpenAI()\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-5-nano\", messages=[{\"role\": \"user\", \"content\": \"Tell me a fun fact\"}])\n",
"\n",
"response.choices[0].message.content\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "c7739cda",
"metadata": {},
"source": [
"## And then this great thing happened:\n",
"\n",
"OpenAI's Chat Completions API was so popular, that the other model providers created endpoints that are identical.\n",
"\n",
"They are known as the \"OpenAI Compatible Endpoints\".\n",
"\n",
"For example, google made one here: https://generativelanguage.googleapis.com/v1beta/openai/\n",
"\n",
"And OpenAI decided to be kind: they said, hey, you can just use the same client library that we made for GPT. We'll allow you to specify a different endpoint URL and a different key, to use another provider.\n",
"\n",
"So you can use:\n",
"\n",
"```python\n",
"gemini = OpenAI(base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\", api_key=\"AIz....\")\n",
"gemini.chat.completions.create(...)\n",
"```\n",
"\n",
"And to be clear - even though OpenAI is in the code, we're only using this lightweight python client library to call the endpoint - there's no OpenAI model involved here.\n",
"\n",
"If you're confused, please review Guide 9 in the Guides folder!\n",
"\n",
"And now let's try it!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f74293bc",
"metadata": {},
"outputs": [],
"source": [
"\n",
"GEMINI_BASE_URL = \"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
"\n",
"google_api_key = os.getenv(\"GOOGLE_API_KEY\")\n",
"\n",
"if not google_api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not google_api_key.startswith(\"AIz\"):\n",
" print(\"An API key was found, but it doesn't start AIz\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8fc5520d",
"metadata": {},
"outputs": [],
"source": [
"import google.generativeai as genai\n",
"from dotenv import load_dotenv\n",
"import os\n",
"\n",
"load_dotenv()\n",
"genai.configure(api_key=os.getenv(\"GOOGLE_API_KEY\"))\n",
"\n",
"# Lista de modelos disponibles\n",
"for model in genai.list_models():\n",
" print(model.name, \"-\", model.supported_generation_methods)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d060f484",
"metadata": {},
"outputs": [],
"source": [
"import google.generativeai as genai\n",
"from dotenv import load_dotenv\n",
"import os\n",
"\n",
"load_dotenv()\n",
"genai.configure(api_key=os.getenv(\"GOOGLE_API_KEY\"))\n",
"\n",
"model = genai.GenerativeModel(\"models/gemini-2.5-pro\") # Usa el modelo que viste en la lista, ejemplo \"gemini-1.5-pro\" o \"gemini-1.5-flash\"\n",
"response = model.generate_content(\"Tell me a fun fact\")\n",
"\n",
"print(response.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gemini = OpenAI(base_url=GEMINI_BASE_URL, api_key=google_api_key)\n",
"\n",
"response = gemini.chat.completions.create(model=\"models/gemini-2.5-pro\", messages=[{\"role\": \"user\", \"content\": \"Tell me a fun fact\"}])\n",
"\n",
"response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5b069be",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "65272432",
"metadata": {},
"source": [
"## And Ollama also gives an OpenAI compatible endpoint\n",
"\n",
"...and it's on your local machine!\n",
"\n",
"If the next cell doesn't print \"Ollama is running\" then please open a terminal and run `ollama serve`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f06280ad",
"metadata": {},
"outputs": [],
"source": [
"requests.get(\"http://localhost:11434\").content"
]
},
{
"cell_type": "markdown",
"id": "c6ef3807",
"metadata": {},
"source": [
"### Download llama3.2 from meta\n",
"\n",
"Change this to llama3.2:1b if your computer is smaller.\n",
"\n",
"Don't use llama3.3 or llama4! They are too big for your computer.."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e633481d",
"metadata": {},
"outputs": [],
"source": [
"!ollama pull llama3.2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce240975",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"response = requests.get(\"http://localhost:11434/v1/models\")\n",
"print(response.json())\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9419762",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"OLLAMA_BASE_URL = \"http://localhost:11434/v1\"\n",
"\n",
"ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2456cdf",
"metadata": {},
"outputs": [],
"source": [
"# Get a fun fact\n",
"\n",
"response = ollama.chat.completions.create(model=\"llama3.2\", messages=[{\"role\": \"user\", \"content\": \"Tell me a fun fact\"}])\n",
"\n",
"response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d7cebd7",
"metadata": {},
"outputs": [],
"source": [
"# Now let's try deepseek-r1:1.5b - this is DeepSeek \"distilled\" into Qwen from Alibaba Cloud\n",
"\n",
"!ollama pull deepseek-r1:1.5b"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "25002f25",
"metadata": {},
"outputs": [],
"source": [
"#response = ollama.chat.completions.create(model=\"deepseek-r1:1.5b\", messages=[{\"role\": \"user\", \"content\": \"Tell me a fun fact\"}])\n",
"#response.choices[0].message.content\n",
"\n",
"from ollama import chat # pip install ollama\n",
"\n",
"resp = chat(\n",
" model='deepseek-r1:1.5b',\n",
" messages=[{'role': 'user', 'content': 'Tell me a fun fact'}],\n",
")\n",
"\n",
"print(resp['message']['content'])\n",
"# o\n",
"print(resp.message.content)\n"
]
},
{
"cell_type": "markdown",
"id": "6e9fa1fc-eac5-4d1d-9be4-541b3f2b3458",
"metadata": {},
"source": [
"# HOMEWORK EXERCISE ASSIGNMENT\n",
"\n",
"Upgrade the day 1 project to summarize a webpage to use an Open Source model running locally via Ollama rather than OpenAI\n",
"\n",
"You'll be able to use this technique for all subsequent projects if you'd prefer not to use paid APIs.\n",
"\n",
"**Benefits:**\n",
"1. No API charges - open-source\n",
"2. Data doesn't leave your box\n",
"\n",
"**Disadvantages:**\n",
"1. Significantly less power than Frontier Model\n",
"\n",
"## Recap on installation of Ollama\n",
"\n",
"Simply visit [ollama.com](https://ollama.com) and install!\n",
"\n",
"Once complete, the ollama server should already be running locally. \n",
"If you visit: \n",
"[http://localhost:11434/](http://localhost:11434/)\n",
"\n",
"You should see the message `Ollama is running`. \n",
"\n",
"If not, bring up a new Terminal (Mac) or Powershell (Windows) and enter `ollama serve` \n",
"And in another Terminal (Mac) or Powershell (Windows), enter `ollama pull llama3.2` \n",
"Then try [http://localhost:11434/](http://localhost:11434/) again.\n",
"\n",
"If Ollama is slow on your machine, try using `llama3.2:1b` as an alternative. Run `ollama pull llama3.2:1b` from a Terminal or Powershell, and change the code from `MODEL = \"llama3.2\"` to `MODEL = \"llama3.2:1b\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6de38216-6d1c-48c4-877b-86d403f4e0f8",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from scraper import fetch_website_contents\n",
"from IPython.display import Markdown, display\n",
"from ollama import Client \n",
"\n",
"# Cliente Ollama local\n",
"ollama = Client()\n",
"\n",
"system_prompt = \"\"\"\n",
"You are a helpful assistant that analyzes the contents of a website,\n",
"and provides a short, snarky, humorous summary, ignoring text that might be navigation related.\n",
"Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
"\"\"\"\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Here are the contents of a website.\n",
"Provide a short summary of this website.\n",
"If it includes news or announcements, then summarize these too.\n",
"\"\"\"\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_prefix + website}\n",
" ]\n",
"\n",
"def summarize(url):\n",
" website = fetch_website_contents(url)\n",
" response = ollama.chat(\n",
" model='llama3.2',\n",
" messages=messages_for(website)\n",
" )\n",
" return response['message']['content']\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))\n",
"\n",
"# Ejecuta el resumen\n",
"display_summary(\"https://www.reforma.com\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,175 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fe12c203-e6a6-452c-a655-afb8a03a4ff5",
"metadata": {},
"source": [
"# End of week 1 exercise\n",
"\n",
"To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question, \n",
"and responds with an explanation. This is a tool that you will be able to use yourself during the course!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1070317-3ed9-4659-abe3-828943230e03",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"import os\n",
"from openai import OpenAI\n",
"from dotenv import load_dotenv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a456906-915a-4bfd-bb9d-57e505c5093f",
"metadata": {},
"outputs": [],
"source": [
"# constants\n",
"MODEL_GPT = 'gpt-4o-mini'\n",
"MODEL_LLAMA = 'llama3.2'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8d7923c-5f28-4c30-8556-342d7c8497c1",
"metadata": {},
"outputs": [],
"source": [
"# set up environment\n",
"system_prompt = \"\"\"\n",
"You are a technical expert of AI and LLMs.\n",
"\"\"\"\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Provide deep explanations of the provided text.\n",
"\"\"\"\n",
"\n",
"user_prompt = \"\"\"\n",
"Explain the provided text.\n",
"\"\"\"\n",
"client = OpenAI()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f0d0137-52b0-47a8-81a8-11a90a010798",
"metadata": {},
"outputs": [],
"source": [
"# here is the question; type over this to ask something new\n",
"\n",
"question = \"\"\"\n",
"Ollama does have an OpenAI compatible endpoint, but Gemini doesn't?\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get gpt-4o-mini to answer, with streaming\n",
"def messages_for(question):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_prefix + question}\n",
" ]\n",
"\n",
"def run_model_streaming(model_name, question):\n",
" stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages_for(question),\n",
" stream=True\n",
" )\n",
" for chunk in stream:\n",
" content = chunk.choices[0].delta.content\n",
" if content:\n",
" print(content, end=\"\", flush=True)\n",
"\n",
"run_model_streaming(MODEL_GPT, question)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f7c8ea8-4082-4ad0-8751-3301adcf6538",
"metadata": {},
"outputs": [],
"source": [
"# Get Llama 3.2 to answer\n",
"# imports\n",
"import os\n",
"from openai import OpenAI\n",
"from dotenv import load_dotenv\n",
"\n",
"# set up environment\n",
"client = OpenAI(\n",
" base_url=os.getenv(\"OPENAI_BASE_URL\", \"http://localhost:11434/v1\"),\n",
" api_key=os.getenv(\"OPENAI_API_KEY\", \"ollama\")\n",
")\n",
"\n",
"system_prompt = \"\"\"\n",
"You are a technical expert of AI and LLMs.\n",
"\"\"\"\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Provide deep explanations of the provided text.\n",
"\"\"\"\n",
"\n",
"# question\n",
"question = \"\"\"\n",
"Ollama does have an OpenAI compatible endpoint, but Gemini doesn't?\n",
"\"\"\"\n",
"\n",
"# message\n",
"def messages_for(question):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_prefix + question}\n",
" ]\n",
"\n",
"# response\n",
"def run_model(model_name, question):\n",
" response = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages_for(question)\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"# run and print result\n",
"print(run_model(MODEL_LLAMA, question))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,367 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1fecd49e",
"metadata": {},
"source": [
"# 🗺️ Google Maps Review Summarizer\n",
"\n",
"This Python app automates the process of fetching and summarizing Google Maps reviews for any business or location.\n",
"\n",
"## 🚀 Overview\n",
"The app performs two main tasks:\n",
"1. **Scrape Reviews** Uses a web scraping script to extract reviews directly from Google Maps.\n",
"2. **Summarize Content** Leverages OpenAI's language models to generate concise, insightful summaries of the collected reviews and analyse the sentiments.\n",
"\n",
"## 🧠 Tech Stack\n",
"- **Python** Core language\n",
"- **Playwright** For scraping reviews\n",
"- **OpenAI API** For natural language summarization\n",
"- **Jupyter Notebook** For exploration, testing, and demonstration\n",
"\n",
"### 🙏 Credits\n",
"The web scraping logic is **inspired by [Antonello Zaninis blog post](https://blog.apify.com/how-to-scrape-google-reviews/)** on building a Google Reviews scraper. Special thanks for the valuable insights on **structuring and automating the scraping workflow**, which greatly informed the development of this improved scraper.\n",
"\n",
"This app, however, uses an **enhanced version of the scraper** that can scroll infinitely to load more reviews until it collects **at least 1,000 reviews**. If only a smaller number of reviews are available, the scraper stops scrolling earlier.\n",
"\n",
"## ✅ Sample Output\n",
"Here is a summary of reviews of a restuarant generated by the app.\n",
"\n",
"![Alt text](google-map-review-summary.jpg)\n",
"\n",
"\n",
"---\n",
"\n",
"**Note:** This project is intended for educational and research purposes. Please ensure compliance with Googles [Terms of Service](https://policies.google.com/terms) when scraping or using their data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df04a4aa",
"metadata": {},
"outputs": [],
"source": [
"#Activate the llm_engineering virtual environment\n",
"!source ../../../.venv/bin/activate \n",
"\n",
"#Make sure pip is available and up to date inside the venv\n",
"!python3 -m ensurepip --upgrade\n",
"\n",
"#Verify that pip now points to the venv path (should end with /.venv/bin/pip)\n",
"!which pip3\n",
"\n",
"#Install Playwright inside the venv\n",
"!pip3 install playwright\n",
"\n",
"#Download the required browser binaries and dependencies\n",
"!python3 -m playwright install"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1c794cfd",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from playwright.async_api import async_playwright\n",
"from IPython.display import Markdown, display\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "317af2b8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key found and looks good so far!\n"
]
}
],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6f142c79",
"metadata": {},
"outputs": [],
"source": [
"async def scroll_reviews_panel(page, max_scrolls=50, max_reviews=10):\n",
" \"\"\"\n",
" Scrolls through the reviews panel to lazy load all reviews.\n",
" \n",
" Args:\n",
" page: Playwright page object\n",
" max_scrolls: Maximum number of scroll attempts to prevent infinite loops\n",
" \n",
" Returns:\n",
" Number of reviews loaded\n",
" \"\"\"\n",
" # Find the scrollable reviews container\n",
" # Google Maps reviews are in a specific scrollable div\n",
" scrollable_div = page.locator('div[role=\"main\"] div[jslog$=\"mutable:true;\"]').first\n",
" \n",
" previous_review_count = 0\n",
" scroll_attempts = 0\n",
" no_change_count = 0\n",
"\n",
" print(\"Starting to scroll and load reviews...\")\n",
" \n",
" while scroll_attempts < max_scrolls:\n",
" # Get current count of reviews\n",
" review_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
" current_review_count = await review_elements.count()\n",
" \n",
" #if we have loaded max_reviews, we will stop scrolling\n",
" if current_review_count >= max_reviews:\n",
" break\n",
"\n",
" print(f\"Scroll attempt {scroll_attempts + 1}: Found {current_review_count} reviews\")\n",
" \n",
" # Scroll to the bottom of the reviews panel\n",
" await scrollable_div.evaluate(\"\"\"\n",
" (element) => {\n",
" element.scrollTo(0, element.scrollHeight + 100);\n",
" }\n",
" \"\"\")\n",
" \n",
" # Wait for potential new content to load\n",
" await asyncio.sleep(2)\n",
" \n",
" # Check if new reviews were loaded\n",
" if current_review_count == previous_review_count:\n",
" no_change_count += 1\n",
" # If count hasn't changed for 3 consecutive scrolls, we've likely reached the end\n",
" if no_change_count >= 3:\n",
" print(f\"No new reviews loaded after {no_change_count} attempts. Finished loading.\")\n",
" break\n",
" else:\n",
" no_change_count = 0\n",
" \n",
" previous_review_count = current_review_count\n",
" scroll_attempts += 1\n",
" \n",
" final_count = await review_elements.count()\n",
" print(f\"Finished scrolling. Total reviews loaded: {final_count}\")\n",
" return final_count"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f7f67b70",
"metadata": {},
"outputs": [],
"source": [
"async def scrape_google_reviews(url):\n",
" # Where to store the scraped data\n",
" reviews = []\n",
"\n",
" async with async_playwright() as p:\n",
" # Initialize a new Playwright instance\n",
" browser = await p.chromium.launch(\n",
" headless=True # Set to False if you want to see the browser in action\n",
" )\n",
" context = await browser.new_context()\n",
" page = await context.new_page()\n",
"\n",
" # The URL of the Google Maps reviews page\n",
"\n",
" # Navigate to the target Google Maps page\n",
" print(\"Navigating to Google Maps page...\")\n",
" await page.goto(url)\n",
"\n",
" # Wait for initial reviews to load\n",
" print(\"Waiting for initial reviews to load...\")\n",
" review_html_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
" await review_html_elements.first.wait_for(state=\"visible\", timeout=10000)\n",
" \n",
" # Scroll through the reviews panel to lazy load all reviews\n",
" total_reviews = await scroll_reviews_panel(page, max_scrolls=100)\n",
" \n",
" print(f\"\\nStarting to scrape {total_reviews} reviews...\")\n",
"\n",
" # Get all review elements after scrolling\n",
" review_html_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
" all_reviews = await review_html_elements.all()\n",
" \n",
" # Iterate over the elements and scrape data from each of them\n",
" for idx, review_html_element in enumerate(all_reviews, 1):\n",
" try:\n",
" # Scraping logic\n",
"\n",
" stars_element = review_html_element.locator(\"[aria-label*=\\\"star\\\"]\")\n",
" stars_label = await stars_element.get_attribute(\"aria-label\")\n",
"\n",
" # Extract the review score from the stars label\n",
" stars = None\n",
" for i in range(1, 6):\n",
" if stars_label and str(i) in stars_label:\n",
" stars = i\n",
" break\n",
"\n",
" # Get the next sibling of the previous element with an XPath expression\n",
" time_sibling = stars_element.locator(\"xpath=following-sibling::span\")\n",
" time = await time_sibling.text_content()\n",
"\n",
" # Select the \"More\" button and if it is present, click it\n",
" more_element = review_html_element.locator(\"button[aria-label=\\\"See more\\\"]\").first\n",
" if await more_element.is_visible():\n",
" await more_element.click()\n",
" await asyncio.sleep(0.3) # Brief wait for text expansion\n",
"\n",
" text_element = review_html_element.locator(\"div[tabindex=\\\"-1\\\"][id][lang]\")\n",
" text = await text_element.text_content()\n",
"\n",
" reviews.append(str(stars) + \" Stars: \\n\" +\"Reviewed On:\" + time + \"\\n\"+ text)\n",
" \n",
" if idx % 10 == 0:\n",
" print(f\"Scraped {idx}/{total_reviews} reviews...\")\n",
" \n",
" except Exception as e:\n",
" print(f\"Error scraping review {idx}: {str(e)}\")\n",
" continue\n",
"\n",
" print(f\"\\nSuccessfully scraped {len(reviews)} reviews!\")\n",
"\n",
" # Close the browser and release its resources\n",
" await browser.close()\n",
"\n",
" return \"\\n\".join(reviews)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cb160d5f",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"\"\"\n",
"You are an expert assistant that analyzes google reviews,\n",
"and provides a summary and centiment of the reviews.\n",
"Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "69e08d4b",
"metadata": {},
"outputs": [],
"source": [
"# Define our user prompt\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Here are the reviews of a google map location/business.\n",
"Provide a short summary of the reviews and the sentiment of the reviews.\n",
"\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d710972d",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def prepare_message(reviews):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_prefix + reviews}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "cb51f436",
"metadata": {},
"outputs": [],
"source": [
"async def summarize(url):\n",
" openai = OpenAI()\n",
" reviews = await scrape_google_reviews(url)\n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4.1-mini\",\n",
" messages = prepare_message(reviews)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2f09e2d2",
"metadata": {},
"outputs": [],
"source": [
"async def display_summary(url):\n",
" summary = await summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca7995c9",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://www.google.com/maps/place/Grace+Home+Nursing+%26+Assisted+Living/@12.32184,75.0853037,17z/data=!4m8!3m7!1s0x3ba47da1be6a0279:0x9e73181ab0827f7e!8m2!3d12.32184!4d75.0853037!9m1!1b1!16s%2Fg%2F11qjl430n_?entry=ttu&g_ep=EgoyMDI1MTAyMC4wIKXMDSoASAFQAw%3D%3D\"\n",
"await display_summary(url)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

Binary file not shown (new image: 451 KiB)

View File

@@ -0,0 +1,721 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "1633a440",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"Week 2 Assignment: LLM Engineering\n",
"Author: Nikhil Raut\n",
"\n",
"Notebook: ai_domain_finder.ipynb\n",
"\n",
"Purpose:\n",
"Build an agentic AI Domain Finder that proposes short, brandable .com names, verifies availability via RDAP, \n",
"then returns: \n",
" a list of available .coms, \n",
" one preferred pick, \n",
" and a brief audio rationale.\n",
"\"\"\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da528fbe",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import requests\n",
"from typing import Dict, List, Tuple, Any, Optional\n",
"import re\n",
"\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"OPENAI_MODEL = \"gpt-5-nano-2025-08-07\"\n",
"TTS_MODEL = \"gpt-4o-mini-tts\"\n",
"\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "361f7fe3",
"metadata": {},
"outputs": [],
"source": [
"# --- robust logging that works inside VS Code notebooks + Gradio threads ---\n",
"import sys, logging, threading\n",
"from collections import deque\n",
"from typing import Any\n",
"\n",
"DEBUG_LLM = True # toggle on/off noisy logs\n",
"CLEAR_LOG_ON_RUN = True # clear panel before each submit\n",
"\n",
"_LOG_BUFFER = deque(maxlen=2000) # keep ~2000 lines in memory\n",
"_LOG_LOCK = threading.Lock()\n",
"\n",
"class GradioBufferHandler(logging.Handler):\n",
" def emit(self, record: logging.LogRecord) -> None:\n",
" try:\n",
" msg = self.format(record)\n",
" except Exception:\n",
" msg = record.getMessage()\n",
" with _LOG_LOCK:\n",
" for line in (msg.splitlines() or [\"\"]):\n",
" _LOG_BUFFER.append(line)\n",
"\n",
"def get_log_text() -> str:\n",
" with _LOG_LOCK:\n",
" return \"\\n\".join(_LOG_BUFFER)\n",
"\n",
"def clear_log_buffer() -> None:\n",
" with _LOG_LOCK:\n",
" _LOG_BUFFER.clear()\n",
"\n",
"def _setup_logger() -> logging.Logger:\n",
" logger = logging.getLogger(\"aidf\")\n",
" logger.setLevel(logging.DEBUG if DEBUG_LLM else logging.INFO)\n",
" logger.handlers.clear()\n",
" fmt = logging.Formatter(\"%(asctime)s | %(levelname)s | %(message)s\", \"%H:%M:%S\")\n",
"\n",
" stream = logging.StreamHandler(stream=sys.stdout) # captured by VS Code notebook\n",
" stream.setFormatter(fmt)\n",
"\n",
" buf = GradioBufferHandler() # shown inside the Gradio panel\n",
" buf.setFormatter(fmt)\n",
"\n",
" logger.addHandler(stream)\n",
" logger.addHandler(buf)\n",
" logger.propagate = False\n",
" return logger\n",
"\n",
"logger = _setup_logger()\n",
"\n",
"def dbg_json(obj: Any, title: str = \"\") -> None:\n",
" \"\"\"Convenience: pretty-print JSON-ish objects to the logger.\"\"\"\n",
" try:\n",
" txt = json.dumps(obj, ensure_ascii=False, indent=2)\n",
" except Exception:\n",
" txt = str(obj)\n",
" if title:\n",
" logger.debug(\"%s\\n%s\", title, txt)\n",
" else:\n",
" logger.debug(\"%s\", txt)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "519674b2",
"metadata": {},
"outputs": [],
"source": [
"RDAP_URL = \"https://rdap.verisign.com/com/v1/domain/{}\"\n",
"\n",
"_ALPHA_RE = re.compile(r\"^[a-z]+$\", re.IGNORECASE)\n",
"\n",
"def _to_com(domain: str) -> str:\n",
" d = domain.strip().lower()\n",
" return d if d.endswith(\".com\") else f\"{d}.com\"\n",
"\n",
"def _sld_is_english_alpha(fqdn: str) -> bool:\n",
" \"\"\"\n",
" True only if the second-level label (just before .com) is made up\n",
" exclusively of English letters (a-z).\n",
" Examples:\n",
" foo.com -> True\n",
" foo-bar.com -> False\n",
" foo1.com -> False\n",
" café.com -> False\n",
" xn--cafe.com -> False\n",
" www.foo.com -> True (checks 'foo')\n",
" \"\"\"\n",
" if not fqdn.endswith(\".com\"):\n",
" return False\n",
" sld = fqdn[:-4].split(\".\")[-1] # take label immediately before .com\n",
" return bool(sld) and bool(_ALPHA_RE.fullmatch(sld))\n",
"\n",
"def check_com_availability(domain: str) -> Dict:\n",
" fqdn = _to_com(domain)\n",
" # Skip API if not strictly English letters\n",
" if not _sld_is_english_alpha(fqdn):\n",
" return {\"domain\": fqdn, \"available\": False, \"status\": 0}\n",
"\n",
" try:\n",
" r = requests.get(RDAP_URL.format(fqdn), timeout=6)\n",
" return {\"domain\": fqdn, \"available\": (r.status_code == 404), \"status\": r.status_code}\n",
" except requests.RequestException:\n",
" return {\"domain\": fqdn, \"available\": False, \"status\": 0}\n",
"\n",
"def check_com_availability_bulk(domains: List[str]) -> Dict:\n",
" \"\"\"\n",
" Input: list of domain roots or FQDNs.\n",
" Returns:\n",
" {\n",
" \"results\": [{\"domain\": \"...\", \"available\": bool, \"status\": int}, ...],\n",
" \"available\": [\"...\"], # convenience\n",
" \"count_available\": int\n",
" }\n",
" \"\"\"\n",
" session = requests.Session()\n",
" results: List[Dict] = []\n",
"\n",
" for d in domains:\n",
" fqdn = _to_com(d)\n",
"\n",
" # Skip API if not strictly English letters\n",
" if not _sld_is_english_alpha(fqdn):\n",
" results.append({\"domain\": fqdn, \"available\": False, \"status\": 0})\n",
" continue\n",
"\n",
" try:\n",
" r = session.get(RDAP_URL.format(fqdn), timeout=6)\n",
" ok = (r.status_code == 404)\n",
" results.append({\"domain\": fqdn, \"available\": ok, \"status\": r.status_code})\n",
" except requests.RequestException:\n",
" results.append({\"domain\": fqdn, \"available\": False, \"status\": 0})\n",
"\n",
" available = [x[\"domain\"] for x in results if x[\"available\"]]\n",
" return {\"results\": results, \"available\": available, \"count_available\": len(available)}\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd20c262",
"metadata": {},
"outputs": [],
"source": [
"check_tool_bulk = {\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"check_com_availability_bulk\",\n",
" \"description\": \"Batch check .com availability via RDAP for a list of domains (roots or FQDNs).\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"domains\": {\n",
" \"type\": \"array\",\n",
" \"items\": {\"type\": \"string\"},\n",
" \"minItems\": 1,\n",
" \"maxItems\": 50,\n",
" \"description\": \"List of domain roots or .com FQDNs.\"\n",
" }\n",
" },\n",
" \"required\": [\"domains\"],\n",
" \"additionalProperties\": False\n",
" }\n",
" }\n",
"}\n",
"\n",
"TOOLS = [check_tool_bulk]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a9138b6",
"metadata": {},
"outputs": [],
"source": [
"def handle_tool_calls(message) -> List[Dict]:\n",
" results = []\n",
" for call in (message.tool_calls or []):\n",
" fn = getattr(call.function, \"name\", None)\n",
" args_raw = getattr(call.function, \"arguments\", \"\") or \"{}\"\n",
" try:\n",
" args = json.loads(args_raw)\n",
" except Exception:\n",
" args = {}\n",
"\n",
" logger.debug(\"TOOL CALL -> %s | args=%s\", fn, json.dumps(args, ensure_ascii=False))\n",
"\n",
" if fn == \"check_com_availability_bulk\":\n",
" payload = check_com_availability_bulk(args.get(\"domains\", []))\n",
" elif fn == \"check_com_availability\":\n",
" payload = check_com_availability(args.get(\"domain\", \"\"))\n",
" else:\n",
" payload = {\"error\": f\"unknown tool {fn}\"}\n",
"\n",
" logger.debug(\"TOOL RESULT <- %s | %s\", fn, json.dumps(payload, ensure_ascii=False))\n",
"\n",
" results.append({\n",
" \"role\": \"tool\",\n",
" \"tool_call_id\": call.id,\n",
" \"content\": json.dumps(payload),\n",
" })\n",
" return results\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b80c860",
"metadata": {},
"outputs": [],
"source": [
"SYSTEM_PROMPT = \"\"\"You are the Agent for project \"AI Domain Finder\".\n",
"Goal: suggest .com domains and verify availability using the tool ONLY (no guessing).\n",
"\n",
"Do this each interaction:\n",
"- Generate up to ~20 short, brandable .com candidates from:\n",
" (1) Industry, (2) Target Customers, (3) Description.\n",
"- Use the BULK tool `check_com_availability_bulk` with a list of candidates\n",
" (roots or FQDNs). Prefer a single call or very few batched calls.\n",
"- If >= 5 available .coms are found, STOP checking and finalize the answer.\n",
"\n",
"Output Markdown with EXACT section headings:\n",
"1) Available .com domains:\n",
" - itemized list of available .coms only (root + .com)\n",
"2) Preferred domain:\n",
" - a single best pick\n",
"3) Audio explanation:\n",
" - 12 concise sentences explaining the preference\n",
"\n",
"Constraints:\n",
"- Use customer-familiar words where helpful.\n",
"- Keep names short, simple, pronounceable; avoid hyphens/numbers unless meaningful.\n",
"- Never include TLDs other than .com.\n",
"- domain is made up of english alphabets in lower case only no symbols or spaces to use\n",
"\"\"\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72e9d8c2",
"metadata": {},
"outputs": [],
"source": [
"def _asdict_tool_call(tc: Any) -> dict:\n",
" try:\n",
" return {\n",
" \"id\": getattr(tc, \"id\", None),\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": getattr(tc.function, \"name\", None),\n",
" \"arguments\": getattr(tc.function, \"arguments\", None),\n",
" },\n",
" }\n",
" except Exception:\n",
" return {\"type\": \"function\", \"function\": {\"name\": None, \"arguments\": None}}\n",
"\n",
"def _asdict_message(msg: Any) -> dict:\n",
" if isinstance(msg, dict):\n",
" return msg\n",
" role = getattr(msg, \"role\", None)\n",
" content = getattr(msg, \"content\", None)\n",
" tool_calls = getattr(msg, \"tool_calls\", None)\n",
" out = {\"role\": role, \"content\": content}\n",
" if tool_calls:\n",
" out[\"tool_calls\"] = [_asdict_tool_call(tc) for tc in tool_calls]\n",
" return out\n",
"\n",
"def _sanitized_messages_for_log(messages: list[dict | Any]) -> list[dict]:\n",
" return [_asdict_message(m) for m in messages]\n",
"\n",
"def _limit_text(s: str, limit: int = 40000) -> str:\n",
" return s if len(s) <= limit else (s[:limit] + \"\\n... [truncated]\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b45c6382",
"metadata": {},
"outputs": [],
"source": [
"def run_agent_with_tools(history: List[Dict]) -> Tuple[str, List[str], str]:\n",
" \"\"\"\n",
" Returns:\n",
" reply_md: final assistant markdown\n",
" tool_available: .coms marked available by RDAP tools (order-preserving, deduped)\n",
" dbg_text: concatenated log buffer (for the UI panel)\n",
" \"\"\"\n",
" messages: List[Dict] = [{\"role\": \"system\", \"content\": SYSTEM_PROMPT}] + history\n",
" tool_available: List[str] = []\n",
"\n",
" dbg_json(_sanitized_messages_for_log(messages), \"=== LLM REQUEST (initial messages) ===\")\n",
" resp = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages, tools=TOOLS)\n",
"\n",
" while resp.choices[0].finish_reason == \"tool_calls\":\n",
" tool_msg_sdk = resp.choices[0].message\n",
" tool_msg = _asdict_message(tool_msg_sdk)\n",
" dbg_json(tool_msg, \"=== ASSISTANT (tool_calls) ===\")\n",
"\n",
" tool_results = handle_tool_calls(tool_msg_sdk)\n",
"\n",
" # Accumulate authoritative availability directly from tool outputs\n",
" for tr in tool_results:\n",
" try:\n",
" data = json.loads(tr[\"content\"])\n",
" if isinstance(data, dict) and isinstance(data.get(\"available\"), list):\n",
" for d in data[\"available\"]:\n",
" tool_available.append(_to_com(d))\n",
" except Exception:\n",
" pass\n",
"\n",
" dbg_json([json.loads(tr[\"content\"]) for tr in tool_results], \"=== TOOL RESULTS ===\")\n",
"\n",
" messages.append(tool_msg)\n",
" messages.extend(tool_results)\n",
" dbg_json(_sanitized_messages_for_log(messages), \"=== LLM REQUEST (messages + tools) ===\")\n",
"\n",
" resp = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages, tools=TOOLS)\n",
"\n",
" # Dedup preserve order\n",
" seen, uniq = set(), []\n",
" for d in tool_available:\n",
" if d not in seen:\n",
" seen.add(d)\n",
" uniq.append(d)\n",
"\n",
" reply_md = resp.choices[0].message.content\n",
" logger.debug(\"=== FINAL ASSISTANT ===\\n%s\", _limit_text(reply_md))\n",
" dbg_json(uniq, \"=== AVAILABLE FROM TOOLS (authoritative) ===\")\n",
"\n",
" # Return current buffer text for the UI panel\n",
" dbg_text = _limit_text(get_log_text(), 40000)\n",
" return reply_md, uniq, dbg_text\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92306515",
"metadata": {},
"outputs": [],
"source": [
"def extract_audio_text(markdown_reply: str) -> str:\n",
" \"\"\"\n",
" Pulls the 'Audio explanation:' section; falls back to first sentence.\n",
" \"\"\"\n",
" marker = \"Audio explanation:\"\n",
" lower = markdown_reply.lower()\n",
" idx = lower.find(marker.lower())\n",
" if idx != -1:\n",
" segment = markdown_reply[idx + len(marker):].strip()\n",
" parts = segment.split(\".\")\n",
" return (\". \".join([p.strip() for p in parts if p.strip()][:2]) + \".\").strip()\n",
" return \"This domain is the clearest, most memorable fit for the audience and brand goals.\"\n",
"\n",
"def synth_audio(text: str) -> bytes:\n",
" audio = openai.audio.speech.create(\n",
" model=TTS_MODEL,\n",
" voice=\"alloy\",\n",
" input=text\n",
" )\n",
" return audio.content\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc6c0650",
"metadata": {},
"outputs": [],
"source": [
"\n",
"_DOMAIN_RE = re.compile(r\"\\b[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\\.com\\b\", re.I)\n",
"_HDR_AVAIL = re.compile(r\"^\\s*[\\d\\.\\)\\-]*\\s*available\\s+.*\\.com\\s+domains\", re.I)\n",
"_HDR_PREF = re.compile(r\"^\\s*[\\d\\.\\)\\-]*\\s*preferred\\s+domain\", re.I)\n",
"\n",
"def _norm_domain(s: str) -> str:\n",
" s = s.strip().lower()\n",
" return s if s.endswith(\".com\") else f\"{s}.com\"\n",
"\n",
"def parse_available(md: str) -> list[str]:\n",
" lines = md.splitlines()\n",
" out = []\n",
" in_section = False\n",
" for ln in lines:\n",
" if _HDR_AVAIL.search(ln):\n",
" in_section = True\n",
" continue\n",
" if in_section and _HDR_PREF.search(ln):\n",
" break\n",
" if in_section:\n",
" for m in _DOMAIN_RE.findall(ln):\n",
" out.append(_norm_domain(m))\n",
" # Fallback: if the header wasn't found, collect all .coms then we'll still\n",
" # rely on agent instruction to list only available, which should be safe.\n",
" if not out:\n",
" out = [_norm_domain(m) for m in _DOMAIN_RE.findall(md)]\n",
" # dedupe preserve order\n",
" seen, uniq = set(), []\n",
" for d in out:\n",
" if d not in seen:\n",
" seen.add(d)\n",
" uniq.append(d)\n",
" return uniq\n",
"\n",
"def parse_preferred(md: str) -> str:\n",
" # search the preferred section first\n",
" lines = md.splitlines()\n",
" start = None\n",
" for i, ln in enumerate(lines):\n",
" if _HDR_PREF.search(ln):\n",
" start = i\n",
" break\n",
" segment = \"\\n\".join(lines[start:start+8]) if start is not None else md[:500]\n",
" m = _DOMAIN_RE.search(segment)\n",
" if m:\n",
" return _norm_domain(m.group(0))\n",
" m = _DOMAIN_RE.search(md)\n",
" return _norm_domain(m.group(0)) if m else \"\"\n",
"\n",
"def merge_and_sort(old: list[str], new: list[str]) -> list[str]:\n",
" merged = {d.lower() for d in old} | {d.lower() for d in new}\n",
" return sorted(merged, key=lambda s: (len(s), s))\n",
"\n",
"def fmt_available_md(domains: list[str]) -> str:\n",
" if not domains:\n",
" return \"### Available .com domains (cumulative)\\n\\n* none yet *\"\n",
" items = \"\\n\".join(f\"- `{d}`\" for d in domains)\n",
" return f\"### Available .com domains (cumulative)\\n\\n{items}\"\n",
"\n",
"def fmt_preferred_md(d: str) -> str:\n",
" if not d:\n",
" return \"### Preferred domain\\n\\n* not chosen yet *\"\n",
" return f\"### Preferred domain\\n\\n`{d}`\"\n",
"\n",
"def build_context_msg(known_avail: Optional[List[str]], preferred_now: Optional[str]) -> str:\n",
" \"\"\"\n",
" Create a short 'state so far' block that we prepend to the next user turn\n",
" so the model always sees the preferred and cumulative available list.\n",
" \"\"\"\n",
" lines = []\n",
" if (preferred_now or \"\").strip():\n",
" lines.append(f\"Preferred domain so far: {preferred_now.strip().lower()}\")\n",
" if known_avail:\n",
" lines.append(\"Available .com domains discovered so far:\")\n",
" for d in known_avail:\n",
" if d:\n",
" lines.append(f\"- {d.strip().lower()}\")\n",
" if not lines:\n",
" return \"\"\n",
" return \"STATE TO CARRY OVER FROM PREVIOUS TURNS:\\n\" + \"\\n\".join(lines)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07f079d6",
"metadata": {},
"outputs": [],
"source": [
"def run_and_extract(history: List[Dict]) -> Tuple[str, List[str], str, str, str]:\n",
" reply_md, avail_from_tools, dbg_text = run_agent_with_tools(history)\n",
" parsed_avail = parse_available(reply_md)\n",
" new_avail = merge_and_sort(avail_from_tools, parsed_avail)\n",
" preferred = parse_preferred(reply_md)\n",
" audio_text = extract_audio_text(reply_md)\n",
" return reply_md, new_avail, preferred, audio_text, dbg_text\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cd5d8ef",
"metadata": {},
"outputs": [],
"source": [
"def initial_submit(industry: str, customers: str, desc: str,\n",
" history: List[Dict], known_avail: List[str], preferred_now: str):\n",
" if CLEAR_LOG_ON_RUN:\n",
" clear_log_buffer()\n",
"\n",
" logger.info(\"Initial submit | industry=%r | customers=%r | desc_len=%d\",\n",
" industry, customers, len(desc or \"\"))\n",
"\n",
" # Build context (usually empty on the very first run, but future inits also work)\n",
" ctx = build_context_msg(known_avail or [], preferred_now or \"\")\n",
"\n",
" user_msg = (\n",
" \"Please propose .com domains based on:\\n\"\n",
" f\"Industry: {industry}\\n\"\n",
" f\"Target Customers: {customers}\\n\"\n",
" f\"Description: {desc}\"\n",
" )\n",
"\n",
" # Single user turn that includes state + prompt so the model always sees memory\n",
" full_content = (ctx + \"\\n\\n\" if ctx else \"\") + user_msg\n",
"\n",
" history = (history or []) + [{\"role\": \"user\", \"content\": full_content}]\n",
" reply_md, new_avail, preferred, audio_text, dbg_text = run_and_extract(history)\n",
" history += [{\"role\": \"assistant\", \"content\": reply_md}]\n",
"\n",
" all_avail = merge_and_sort(known_avail or [], new_avail or [])\n",
" preferred_final = preferred or preferred_now or \"\"\n",
" audio_bytes = synth_audio(audio_text)\n",
"\n",
" return (\n",
" history, # s_history\n",
" all_avail, # s_available (cumulative)\n",
" preferred_final, # s_preferred\n",
" gr.update(value=fmt_preferred_md(preferred_final)),\n",
" gr.update(value=fmt_available_md(all_avail)),\n",
" gr.update(value=\"\", visible=True), # reply_in: show after first run\n",
" gr.update(value=audio_bytes, visible=True), # audio_out\n",
" gr.update(value=dbg_text), # debug_box\n",
" gr.update(value=\"Find Domains (done)\", interactive=False), # NEW: disable Find\n",
" gr.update(visible=True), # NEW: show Send button\n",
" )\n",
"\n",
"def refine_submit(reply: str,\n",
" history: List[Dict], known_avail: List[str], preferred_now: str):\n",
" # If empty, do nothing (keeps UI state untouched)\n",
" if not (reply or \"\").strip():\n",
" return (\"\", history, known_avail, preferred_now,\n",
" gr.update(), gr.update(), gr.update(), gr.update())\n",
"\n",
" if CLEAR_LOG_ON_RUN:\n",
" clear_log_buffer()\n",
" logger.info(\"Refine submit | user_reply_len=%d\", len(reply))\n",
"\n",
" # Always prepend memory + the user's refinement so the model can iterate properly\n",
" ctx = build_context_msg(known_avail or [], preferred_now or \"\")\n",
" full_content = (ctx + \"\\n\\n\" if ctx else \"\") + reply.strip()\n",
"\n",
" history = (history or []) + [{\"role\": \"user\", \"content\": full_content}]\n",
" reply_md, new_avail, preferred, audio_text, dbg_text = run_and_extract(history)\n",
" history += [{\"role\": \"assistant\", \"content\": reply_md}]\n",
"\n",
" all_avail = merge_and_sort(known_avail or [], new_avail or [])\n",
" preferred_final = preferred or preferred_now or \"\"\n",
" audio_bytes = synth_audio(audio_text)\n",
"\n",
" return (\n",
" \"\", # clear Reply box\n",
" history, # s_history\n",
" all_avail, # s_available (cumulative)\n",
" preferred_final, # s_preferred\n",
" gr.update(value=fmt_preferred_md(preferred_final)),\n",
" gr.update(value=fmt_available_md(all_avail)),\n",
" gr.update(value=audio_bytes, visible=True),\n",
" gr.update(value=dbg_text), # debug_box\n",
" )\n",
"\n",
"def clear_debug():\n",
" clear_log_buffer()\n",
" return gr.update(value=\"\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d52ebc02",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks(title=\"AI Domain Finder (.com only)\") as ui:\n",
" gr.Markdown(\"# AI Domain Finder (.com only)\")\n",
" gr.Markdown(\"Agent proposes .com domains, verifies via RDAP, picks a preferred choice, and explains briefly.\")\n",
"\n",
" # App state\n",
" s_history = gr.State([])\n",
" s_available = gr.State([])\n",
" s_preferred = gr.State(\"\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column(scale=7): # LEFT 70%\n",
" with gr.Group():\n",
" industry_in = gr.Textbox(label=\"Industry\")\n",
" customers_in = gr.Textbox(label=\"Target Customers\")\n",
" desc_in = gr.Textbox(label=\"Description\", lines=3)\n",
" find_btn = gr.Button(\"Find Domains\", variant=\"primary\")\n",
"\n",
" audio_out = gr.Audio(label=\"Audio explanation\", autoplay=True, visible=False)\n",
"\n",
" with gr.Row():\n",
" reply_in = gr.Textbox(\n",
" label=\"Reply\",\n",
" placeholder=\"Chat with the agent to refine the outputs\",\n",
" lines=2,\n",
" visible=False, # hidden for the first input\n",
" )\n",
" send_btn = gr.Button(\"Send\", variant=\"primary\", visible=False)\n",
"\n",
" with gr.Column(scale=3): # RIGHT 30%\n",
" preferred_md = gr.Markdown(fmt_preferred_md(\"\"))\n",
" available_md = gr.Markdown(fmt_available_md([]))\n",
"\n",
" with gr.Accordion(\"Debug log\", open=False):\n",
" debug_box = gr.Textbox(label=\"Log\", value=\"\", lines=16, interactive=False)\n",
" clear_btn = gr.Button(\"Clear log\", size=\"sm\")\n",
"\n",
" # Events\n",
" # Initial run: also disables Find and shows Send\n",
" find_btn.click(\n",
" initial_submit,\n",
" inputs=[industry_in, customers_in, desc_in, s_history, s_available, s_preferred],\n",
" outputs=[\n",
" s_history, s_available, s_preferred,\n",
" preferred_md, available_md,\n",
" reply_in, # visible after first run\n",
" audio_out, # visible after first run\n",
" debug_box,\n",
" find_btn, # NEW: disable + relabel\n",
" send_btn, # NEW: show the Send button\n",
" ],\n",
" )\n",
"\n",
" # Multi-turn submit via Enter in the textbox\n",
" reply_in.submit(\n",
" refine_submit,\n",
" inputs=[reply_in, s_history, s_available, s_preferred],\n",
" outputs=[\n",
" reply_in, s_history, s_available, s_preferred,\n",
" preferred_md, available_md, audio_out, debug_box\n",
" ],\n",
" )\n",
"\n",
" # Multi-turn submit via explicit Send button\n",
" send_btn.click(\n",
" refine_submit,\n",
" inputs=[reply_in, s_history, s_available, s_preferred],\n",
" outputs=[\n",
" reply_in, s_history, s_available, s_preferred,\n",
" preferred_md, available_md, audio_out, debug_box\n",
" ],\n",
" )\n",
"\n",
" clear_btn.click(clear_debug, inputs=[], outputs=[debug_box])\n",
"\n",
"ui.launch(inbrowser=True, show_error=True)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,355 @@
# 🏥 RoboCare AI Assistant
> Born from a real problem at MyWoosah Inc—now solving caregiver matching through AI.
## 📋 The Story Behind This Project
While working on a caregiver matching platform for **MyWoosah Inc** in the US, I faced a real challenge: how do you efficiently match families with the right caregivers when everyone has different needs?
Families would ask things like:
- _"I need someone for my mom on Monday mornings who speaks Spanish"_
- _"Can you find elder care in Boston under $30/hour with CPR certification?"_
Writing individual SQL queries for every combination of filters was exhausting and error-prone. I knew there had to be a better way.
That's when I discovered the **Andela LLM Engineering program**. I saw an opportunity to transform this problem into a solution using AI. Instead of rigid queries, what if families could just... talk? And the AI would understand, search, and recommend?
This project is my answer. It's not just an exercise—it's solving a real problem I encountered in the field.
## What It Does
RoboCare helps families find caregivers through natural conversation:
- 🔍 Searches the database intelligently
- 🎯 Finds the best matches
- 💬 Explains pros/cons in plain English
- 🔊 Speaks the results back to you
## ✨ Features
### 🤖 AI-Powered Matching
- Natural language conversation interface
- Intelligent requirement gathering
- Multi-criteria search optimization
- Personalized recommendations with pros/cons analysis
### 🔍 Advanced Search Capabilities
- **Location-based filtering**: City, state, and country
- **Service type matching**: Elder care, child care, companionship, dementia care, hospice support, and more
- **Availability scheduling**: Day and time-based matching
- **Budget optimization**: Maximum hourly rate filtering
- **Language preferences**: Multi-language support
- **Certification requirements**: CPR, CNA, BLS, and specialized certifications
- **Experience filtering**: Minimum years of experience
### 🎙️ Multi-Modal Interface
- Text-based chat interface
- Voice response generation (Text-to-Speech)
- Multiple voice options (coral, alloy, echo, fable, onyx, nova, shimmer)
- Clean, modern UI built with Gradio
### 🛡️ Defensive Architecture
- Comprehensive error handling
- Token overflow protection
- Tool call validation
- Graceful degradation
## 🚀 Getting Started
### Prerequisites
- Python 3.8+
- OpenAI API key
- Virtual environment (recommended)
### Installation
1. **Clone the repository**
```bash
cd week2
```
2. **Create and activate virtual environment**
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```
3. **Install dependencies**
```bash
pip install -r requirements.txt
```
4. **Set up environment variables**
Create a `.env` file in the project root:
```env
OPENAI_API_KEY=your_openai_api_key_here
```
5. **Run the application**
```bash
jupyter notebook "week2 EXERCISE.ipynb"
```
Or run all cells sequentially in your Jupyter environment.
## 📊 Database Schema
### Tables
#### `caregivers`
Primary caregiver information including:
- Personal details (name, gender)
- Experience level
- Hourly rate and currency
- Location (city, state, country, coordinates)
- Live-in availability
#### `caregiver_services`
Care types offered by each caregiver:
- Elder care
- Child care
- Companionship
- Post-op support
- Special needs
- Respite care
- Dementia care
- Hospice support
#### `availability`
Time slots when caregivers are available:
- Day of week (Mon-Sun)
- Start and end times (24-hour format)
#### `languages`
Languages spoken by caregivers
#### `certifications`
Professional certifications (CPR, CNA, BLS, etc.)
#### `traits`
Personality and professional traits
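A hedged SQLite sketch of the core tables above; column names beyond those listed are illustrative assumptions, and the remaining tables (`languages`, `certifications`, `traits`) follow the same `caregiver_id` pattern:
```python
import sqlite3

# Illustrative DDL for the tables described above; the project's actual
# column names and types may differ.
conn = sqlite3.connect("robocare.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS caregivers (
    id INTEGER PRIMARY KEY,
    name TEXT, gender TEXT,
    years_experience INTEGER,
    hourly_rate REAL, currency TEXT,
    city TEXT, state_province TEXT, country TEXT,
    latitude REAL, longitude REAL,
    live_in INTEGER                 -- 0/1 flag
);
CREATE TABLE IF NOT EXISTS caregiver_services (
    caregiver_id INTEGER REFERENCES caregivers(id),
    care_type TEXT                  -- e.g. 'elder care', 'dementia care'
);
CREATE TABLE IF NOT EXISTS availability (
    caregiver_id INTEGER REFERENCES caregivers(id),
    day TEXT,                       -- 'Mon'..'Sun'
    start_time TEXT, end_time TEXT  -- 24-hour 'HH:MM'
);
""")
conn.close()
```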
## 🔧 Architecture
### Tool Registry Pattern
```python
TOOL_REGISTRY = {
"search_caregivers": search_caregivers,
"get_caregiver_profile": get_caregiver_profile,
# ... more tools
}
```
All database functions are registered and can be called by the AI dynamically.
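Building on the registry above, a minimal sketch of how a tool call from the model can be dispatched; `dispatch_tool_call` is an assumed helper, not necessarily the notebook's exact handler:
```python
import json

def dispatch_tool_call(tool_call):
    """Look up the registered function by name and invoke it with the
    JSON-decoded arguments supplied by the model."""
    fn = TOOL_REGISTRY.get(tool_call.function.name)
    if fn is None:
        return {"error": f"unknown tool: {tool_call.function.name}"}
    args = json.loads(tool_call.function.arguments or "{}")
    return fn(**args)
```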
### Search Functions
#### `search_caregivers()`
Multi-filter search with parameters:
- `city`, `state_province`, `country` - Location filters
- `care_type` - Type of care needed
- `min_experience` - Minimum years of experience
- `max_hourly_rate` - Budget constraint
- `live_in` - Live-in caregiver requirement
- `language` - Language preference
- `certification` - Required certification
- `day` - Day of week availability
- `time_between` - Time window availability
- `limit`, `offset` - Pagination
#### `get_caregiver_profile(caregiver_id)`
Returns complete profile including:
- Basic information
- Services offered
- Languages spoken
- Certifications
- Personality traits
- Availability schedule
## 🎨 UI Components
### Main Interface
- **Chat History**: Message-based conversation display
- **Voice Response**: Auto-playing audio output
- **Settings Sidebar**:
- AI Model selector
- Voice selection
- Audio toggle
- Clear conversation button
### User Experience
- Professional gradient header
- Collapsible instructions
- Helpful placeholder text
- Custom CSS styling
- Responsive layout
## 📝 Usage Examples
### Example 1: Basic Search
```python
results = search_caregivers(
city="New York",
care_type="elder care",
max_hourly_rate=30.0,
limit=5
)
```
### Example 2: Language Filter
```python
results = search_caregivers(
care_type="child care",
language="Spanish",
limit=3
)
```
### Example 3: Availability Search
```python
results = search_caregivers(
day="Mon",
time_between=("09:00", "17:00"),
city="Boston"
)
```
### Example 4: Get Full Profile
```python
profile = get_caregiver_profile(caregiver_id=1)
print(profile['services'])
print(profile['availability'])
```
## 🔐 Security & Best Practices
### Current Implementation
- ✅ Environment variable management for API keys
- ✅ SQL injection prevention (parameterized queries)
- ✅ Error handling and graceful degradation
- ✅ Input validation through tool schemas
### Important Disclaimers
⚠️ **This is a demonstration application**
- Credentials and background checks are NOT verified
- Families should independently verify all caregiver information
- Not intended for production use without additional security measures
## 🛠️ Tech Stack
- **AI/ML**: OpenAI GPT-4o-mini, Text-to-Speech API
- **Database**: SQLite with normalized schema
- **UI Framework**: Gradio
- **Language**: Python 3.8+
- **Key Libraries**:
- `openai` - OpenAI API client
- `gradio` - Web interface
- `python-dotenv` - Environment management
- `sqlite3` - Database operations
## 📈 What's Next
### Immediate Plans
- [ ] Add speech input (families could call and talk)
- [ ] Connect to actual MyWoosah database
- [ ] Background check API integration
- [ ] Deploy for real users
### Future Enhancements
- [ ] Streaming responses for real-time interaction
- [ ] Dynamic model switching
- [ ] User authentication and profiles
- [ ] Review and rating system
- [ ] Payment integration
- [ ] Calendar integration for scheduling
## 💡 Key Learnings
Through building this project, I learned:
1. **Prompt engineering is critical** - Small keyword mismatches mean zero results; mapping "Monday" → "Mon" matters (see the sketch after this list).
2. **Function calling is powerful** - Eliminated the need for custom queries. The AI figures it out.
3. **Defensive programming saves headaches** - Things break. This code expects it and handles it elegantly.
4. **AI makes databases accessible** - Good database design + AI = natural language interface
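A minimal sketch of that day-normalization idea; `normalize_day` is an illustrative helper, not the project's actual code:
```python
# Hypothetical helper: normalize free-text day names to the "Mon".."Sun"
# codes stored in the availability table, so "Monday" still matches "Mon".
def normalize_day(text: str) -> str:
    return text.strip()[:3].capitalize()

assert normalize_day("Monday") == "Mon"
assert normalize_day("sun") == "Sun"
```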
## 🌍 The Bigger Picture
This isn't just about caregiving. The same pattern works for:
- Healthcare appointment booking
- Legal service matching
- Tutoring and education platforms
- Real estate agent matching
- Any matching problem where natural language beats forms
**AI doesn't replace good database design—it makes it accessible to everyone.**
---
## 🤝 Contributing
This project was created as part of the **Andela LLM Engineering Week 2 Exercise**.
Feedback and contributions are welcome! Feel free to:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run all cells to test
5. Submit a pull request
## 🙏 Acknowledgments
- **MyWoosah Inc** - For the real-world problem that inspired this solution
- **Andela LLM Engineering Program** - Educational framework and guidance
- **OpenAI** - GPT-4o and TTS API
- **Gradio** - Making beautiful UIs accessible
---
<div align="center">
**For MyWoosah Inc and beyond:** This is proof that AI can transform how we connect people with the care they need.
_Built with ❤️ during Week 2 of the Andela LLM Engineering Program_
**RoboOffice Ltd**
</div>

Binary file not shown.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,173 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fd1cdd6e",
"metadata": {},
"source": [
"## Week 2 - Full Prototype for Technical Questions Answerer"
]
},
{
"cell_type": "markdown",
"id": "70db9a0b",
"metadata": {},
"source": [
" This notebook will implement a Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df46689d",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7416a2a",
"metadata": {},
"outputs": [],
"source": [
"# Initialization\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"MODEL = \"gpt-4.1-mini\"\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86966749",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"\"\"\n",
"You are an expert technical question answerer specializing in data science, programming, \n",
"and software engineering. Your goal is to provide clear, accurate, and practical answers \n",
"to technical questions.\n",
"\n",
"When answering:\n",
"- Break down complex concepts into understandable explanations\n",
"- Provide code examples when relevant, with comments explaining key parts\n",
"- Mention common pitfalls or best practices\n",
"- If a question is ambiguous, state your assumptions or ask for clarification\n",
"- For debugging questions, explain both the fix and why the error occurred\n",
"- Cite specific documentation or resources when helpful\n",
"\n",
"Always prioritize accuracy and clarity over speed. If you're unsure about something, \n",
"acknowledge the uncertainty rather than guessing.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d34e5b81",
"metadata": {},
"outputs": [],
"source": [
"# Streaming chat funcion\n",
"def chat(model, history):\n",
" messages = [{\"role\": \"system\", \"content\": system_message}]\n",
" for h in history:\n",
" messages.append({\"role\": h[\"role\"], \"content\": h[\"content\"]})\n",
"\n",
" stream = openai.chat.completions.create(\n",
" model=model, \n",
" messages=messages,\n",
" stream=True\n",
" )\n",
"\n",
" response = \"\"\n",
" for chunk in stream:\n",
" if chunk.choices[0].delta.content is not None:\n",
" response += chunk.choices[0].delta.content\n",
" yield history + [{\"role\": \"assistant\", \"content\": response}]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32350869",
"metadata": {},
"outputs": [],
"source": [
"#Gradio Interface\n",
"with gr.Blocks() as ui:\n",
" with gr.Row():\n",
" chatbot = gr.Chatbot(height=500, type=\"messages\")\n",
" with gr.Row():\n",
" message = gr.Textbox(label=\"Chat with AI Assistant: \")\n",
" model_dropdown = gr.Dropdown(\n",
" choices=[\"gpt-4.1-mini\",\"gpt-4o-mini\", \"gpt-4o\", \"gpt-4-turbo\"], \n",
" value=\"gpt-4.1-mini\", \n",
" label=\"Select Model\"\n",
" ) \n",
"\n",
" def handle_submit(user_message, chat_history):\n",
" # Add user message to history\n",
" chat_history = chat_history + [{\"role\": \"user\", \"content\": user_message}]\n",
" return \"\", chat_history\n",
"\n",
" message.submit(\n",
" handle_submit, \n",
" inputs=[message, chatbot], \n",
" outputs=[message, chatbot]\n",
" ).then(\n",
" chat, \n",
" inputs=[model_dropdown, chatbot],\n",
" outputs=[chatbot]\n",
" )\n",
"\n",
"ui.launch(inbrowser=True)"
]
},
{
"cell_type": "markdown",
"id": "cf2b29e1",
"metadata": {},
"source": [
"### Concluding Remarks\n",
"In this exercise, we successfully built a working AI chatbot with Gradio that includes streaming responses and the ability to switch between different models. The implementation demonstrates how to create an interactive interface for LLM applications."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,296 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d006b2ea-9dfe-49c7-88a9-a5a0775185fd",
"metadata": {},
"source": [
"# End of week 2 Exercise - Bookstore Assistant\n",
"\n",
"Now use everything you've learned from Week 2 to build a full prototype for the technical question/answerer you built in Week 1 Exercise.\n",
"\n",
"This should include a Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models. Bonus points if you can demonstrate use of a tool!\n",
"\n",
"If you feel bold, see if you can add audio input so you can talk to it, and have it respond with audio. ChatGPT or Claude can help you, or email me if you have questions.\n",
"\n",
"I will publish a full solution here soon - unless someone beats me to it...\n",
"\n",
"There are so many commercial applications for this, from a language tutor, to a company onboarding solution, to a companion AI to a course (like this one!) I can't wait to see your results."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a07e7793-b8f5-44f4-aded-5562f633271a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key exists and begins sk-proj-\n",
"Google API Key exists and begins AIzaSyCL\n"
]
}
],
"source": [
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n",
"else:\n",
" print(\"Google API Key not set\")\n",
" \n",
"MODEL_GPT = \"gpt-4.1-mini\"\n",
"MODEL_GEMINI = \"gemini-2.5-pro\"\n",
"\n",
"\n",
"openai = OpenAI()\n",
"\n",
"gemini_url = \"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
"gemini = OpenAI(api_key=google_api_key, base_url=gemini_url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a3aa8bf",
"metadata": {},
"outputs": [],
"source": [
"# Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models\n",
"\n",
"system_message= \"\"\"\n",
" You are an assistant in a software engineering bookstore that analyzes the content of technical books and generates concise, informative summaries for readers.\n",
" Your goal is to help customers quickly understand what each book covers, its practical value, and who would benefit most from reading it.\n",
" Respond in markdown without code blocks.\n",
" Each summary should include:\n",
" Overview: The books main topic, scope, and focus area (e.g., software architecture, DevOps, system design).\n",
" Key Insights: The most important lessons, principles, or methodologies discussed.\n",
" Recommended For: The type of reader who would benefit most (e.g., junior developers, engineering managers, backend specialists).\n",
" Related Reads: Suggest one or two similar or complementary titles available in the store.\n",
" Maintain a professional and knowledgeable tone that reflects expertise in software engineering literature. \n",
"\"\"\"\n",
"\n",
"def stream_gpt(prompt):\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": prompt}\n",
" ]\n",
" stream = openai.chat.completions.create(\n",
" model=MODEL_GPT,\n",
" messages=messages,\n",
" stream=True\n",
" )\n",
" result = \"\"\n",
" for chunk in stream:\n",
" result += chunk.choices[0].delta.content or \"\"\n",
" yield result\n",
"\n",
"def stream_gemini(prompt):\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": prompt}\n",
" ]\n",
" stream = openai.chat.completions.create(\n",
" model=MODEL_GEMINI,\n",
" messages=messages,\n",
" stream=True\n",
" )\n",
" result = \"\"\n",
" for chunk in stream:\n",
" result += chunk.choices[0].delta.content or \"\"\n",
" yield result\n",
"\n",
"def stream_model(prompt, model):\n",
" if model==\"GPT\":\n",
" result = stream_gpt(prompt)\n",
" elif model==\"Gemini\":\n",
" result = stream_gemini(prompt)\n",
" else:\n",
" raise ValueError(\"Unknown model\")\n",
" yield from result\n",
"\n",
"\n",
"message_input = gr.Textbox(label=\"Your message:\", info=\"Enter a software engineering book title for the LLM\", lines=4)\n",
"model_selector = gr.Dropdown([\"GPT\", \"Gemini\"], label=\"Select model\", value=\"GPT\")\n",
"message_output = gr.Markdown(label=\"Response:\")\n",
"\n",
"view = gr.Interface(\n",
" fn=stream_model,\n",
" title=\"Bookstore Assistant\", \n",
" inputs=[message_input, model_selector], \n",
" outputs=[message_output], \n",
" examples=[\n",
" [\"Explain Clean Code by Robert C. Martin\", \"GPT\"],\n",
" [\"Explain Clean Code by Robert C. Martin\", \"Gemini\"]\n",
" ], \n",
" flagging_mode=\"never\"\n",
" )\n",
"view.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4d7980c",
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"\n",
"DB = \"books.db\"\n",
"\n",
"with sqlite3.connect(DB) as conn:\n",
" cursor = conn.cursor()\n",
" cursor.execute('CREATE TABLE IF NOT EXISTS prices (title TEXT PRIMARY KEY, price REAL)')\n",
" conn.commit()\n",
"\n",
"def get_book_price(title):\n",
" print(f\"DATABASE TOOL CALLED: Getting price for {title}\", flush=True)\n",
" with sqlite3.connect(DB) as conn:\n",
" cursor = conn.cursor()\n",
" cursor.execute('SELECT price FROM prices WHERE title = ?', (title.lower(),))\n",
" result = cursor.fetchone()\n",
" return f\"Book -> {title} price is ${result[0]}\" if result else \"No price data available for this title\"\n",
"\n",
"def set_book_price(title, price):\n",
" with sqlite3.connect(DB) as conn:\n",
" cursor = conn.cursor()\n",
" cursor.execute('INSERT INTO prices (title, price) VALUES (?, ?) ON CONFLICT(title) DO UPDATE SET price = ?', (title.lower(), price, price))\n",
" conn.commit()\n",
"\n",
"book_prices = {\"Clean code\":20, \"Clean architecture\": 30, \"System design\": 40, \"Design patterns\": 50}\n",
"for title, price in book_prices.items():\n",
" set_book_price(title, price)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86741761",
"metadata": {},
"outputs": [],
"source": [
"# use of a tool\n",
"MODEL = \"gpt-4.1-mini\"\n",
"\n",
"system_message = \"\"\"\n",
"You are a helpful assistant in a software engineering bookstore BookEye. \n",
"Give short, courteous answers, no more than 1 sentence.\n",
"Always be accurate. If you don't know the answer, say so.\n",
"\"\"\"\n",
"\n",
"price_function = {\n",
" \"name\": \"get_book_price\",\n",
" \"description\": \"Get the price of a book.\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"book_title\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The title of the book that the customer wants to buy\",\n",
" },\n",
" },\n",
" \"required\": [\"book_title\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}\n",
"tools = [{\"type\": \"function\", \"function\": price_function}]\n",
"\n",
"\n",
"def talker(message):\n",
" response = openai.audio.speech.create(\n",
" model=\"gpt-4o-mini-tts\",\n",
" voice=\"coral\",\n",
" input=message\n",
" )\n",
" return response.content\n",
"\n",
"def handle_tool_calls(message):\n",
" responses = []\n",
" for tool_call in message.tool_calls:\n",
" if tool_call.function.name == \"get_book_price\":\n",
" arguments = json.loads(tool_call.function.arguments)\n",
" title = arguments.get('book_title')\n",
" price_details = get_book_price(title)\n",
" responses.append({\n",
" \"role\": \"tool\",\n",
" \"content\": price_details,\n",
" \"tool_call_id\": tool_call.id\n",
" })\n",
" return responses\n",
"\n",
"def chat(history):\n",
" history = [{\"role\":h[\"role\"], \"content\":h[\"content\"]} for h in history]\n",
" messages = [{\"role\": \"system\", \"content\": system_message}] + history\n",
" response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n",
"\n",
" while response.choices[0].finish_reason==\"tool_calls\":\n",
" message = response.choices[0].message\n",
" responses = handle_tool_calls(message)\n",
" messages.append(message)\n",
" messages.extend(responses)\n",
" response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n",
"\n",
" reply = response.choices[0].message.content\n",
" history += [{\"role\":\"assistant\", \"content\":reply}]\n",
"\n",
" voice = talker(reply)\n",
" \n",
" return history, voice\n",
"\n",
"def put_message_in_chatbot(message, history):\n",
" return \"\", history + [{\"role\":\"user\", \"content\":message}]\n",
"with gr.Blocks() as ui:\n",
" with gr.Row():\n",
" chatbot = gr.Chatbot(height=300, type=\"messages\")\n",
" audio_output = gr.Audio(autoplay=True)\n",
" \n",
" with gr.Row():\n",
" message = gr.Textbox(label=\"Chat with our AI Assistant:\")\n",
"\n",
" message.submit(put_message_in_chatbot, inputs=[message, chatbot], outputs=[message, chatbot]).then(\n",
" chat, inputs=chatbot, outputs=[chatbot, audio_output]\n",
" )\n",
"\n",
"ui.launch(inbrowser=True, auth=(\"ted\", \"mowsb\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,906 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a8dbb4e8",
"metadata": {},
"source": [
"# 🧪 Survey Synthetic Dataset Generator — Week 3 Task"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d86f629",
"metadata": {},
"outputs": [],
"source": [
"\n",
"import os, re, json, time, uuid, math, random\n",
"from datetime import datetime, timedelta\n",
"from typing import List, Dict, Any\n",
"import numpy as np, pandas as pd\n",
"import pandera.pandas as pa\n",
"random.seed(7); np.random.seed(7)\n",
"print(\"✅ Base libraries ready. Pandera available:\", pa is not None)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f196ae73",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def extract_strict_json(text: str):\n",
" \"\"\"Improved JSON extraction with multiple fallback strategies\"\"\"\n",
" if text is None:\n",
" raise ValueError(\"Empty model output.\")\n",
" \n",
" t = text.strip()\n",
" \n",
" # Strategy 1: Direct JSON parsing\n",
" try:\n",
" obj = json.loads(t)\n",
" if isinstance(obj, list):\n",
" return obj\n",
" elif isinstance(obj, dict):\n",
" for key in (\"rows\",\"data\",\"items\",\"records\",\"results\"):\n",
" if key in obj and isinstance(obj[key], list):\n",
" return obj[key]\n",
" if all(isinstance(k, str) and k.isdigit() for k in obj.keys()):\n",
" return [obj[k] for k in sorted(obj.keys(), key=int)]\n",
" except json.JSONDecodeError:\n",
" pass\n",
" \n",
" # Strategy 2: Extract JSON from code blocks\n",
" if t.startswith(\"```\"):\n",
" t = re.sub(r\"^```(?:json)?\\s*|\\s*```$\", \"\", t, flags=re.IGNORECASE|re.MULTILINE).strip()\n",
" \n",
" # Strategy 3: Find JSON array in text\n",
" start, end = t.find('['), t.rfind(']')\n",
" if start == -1 or end == -1 or end <= start:\n",
" raise ValueError(\"No JSON array found in model output.\")\n",
" \n",
" t = t[start:end+1]\n",
" \n",
" # Strategy 4: Fix common JSON issues\n",
" t = re.sub(r\",\\s*([\\]}])\", r\"\\1\", t) # Remove trailing commas\n",
" t = re.sub(r\"\\bNaN\\b|\\bInfinity\\b|\\b-Infinity\\b\", \"null\", t) # Replace NaN/Infinity\n",
" t = t.replace(\"\\u00a0\", \" \").replace(\"\\u200b\", \"\") # Remove invisible characters\n",
" \n",
" try:\n",
" return json.loads(t)\n",
" except json.JSONDecodeError as e:\n",
" raise ValueError(f\"Could not parse JSON: {str(e)}. Text: {t[:200]}...\")\n"
]
},
{
"cell_type": "markdown",
"id": "3670fa0d",
"metadata": {},
"source": [
"## 1) Configuration"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d16bd03a",
"metadata": {},
"outputs": [],
"source": [
"\n",
"CFG = {\n",
" \"rows\": 800,\n",
" \"datetime_range\": {\"start\": \"2024-01-01\", \"end\": \"2025-10-01\", \"fmt\": \"%Y-%m-%d %H:%M:%S\"},\n",
" \"fields\": [\n",
" {\"name\": \"response_id\", \"type\": \"uuid4\"},\n",
" {\"name\": \"respondent_id\", \"type\": \"int\", \"min\": 10000, \"max\": 99999},\n",
" {\"name\": \"submitted_at\", \"type\": \"datetime\"},\n",
" {\"name\": \"country\", \"type\": \"enum\", \"values\": [\"KE\",\"UG\",\"TZ\",\"RW\",\"NG\",\"ZA\"], \"probs\": [0.50,0.10,0.12,0.05,0.15,0.08]},\n",
" {\"name\": \"language\", \"type\": \"enum\", \"values\": [\"en\",\"sw\"], \"probs\": [0.85,0.15]},\n",
" {\"name\": \"device\", \"type\": \"enum\", \"values\": [\"android\",\"ios\",\"web\"], \"probs\": [0.60,0.25,0.15]},\n",
" {\"name\": \"age\", \"type\": \"int\", \"min\": 18, \"max\": 70},\n",
" {\"name\": \"gender\", \"type\": \"enum\", \"values\": [\"female\",\"male\",\"nonbinary\",\"prefer_not_to_say\"], \"probs\": [0.49,0.49,0.01,0.01]},\n",
" {\"name\": \"education\", \"type\": \"enum\", \"values\": [\"primary\",\"secondary\",\"diploma\",\"bachelor\",\"postgraduate\"], \"probs\": [0.08,0.32,0.18,0.30,0.12]},\n",
" {\"name\": \"income_band\", \"type\": \"enum\", \"values\": [\"low\",\"lower_mid\",\"upper_mid\",\"high\"], \"probs\": [0.28,0.42,0.23,0.07]},\n",
" {\"name\": \"completion_seconds\", \"type\": \"float\", \"min\": 60, \"max\": 1800, \"distribution\": \"lognormal\"},\n",
" {\"name\": \"attention_passed\", \"type\": \"bool\"},\n",
" {\"name\": \"q_quality\", \"type\": \"int\", \"min\": 1, \"max\": 5},\n",
" {\"name\": \"q_value\", \"type\": \"int\", \"min\": 1, \"max\": 5},\n",
" {\"name\": \"q_ease\", \"type\": \"int\", \"min\": 1, \"max\": 5},\n",
" {\"name\": \"q_support\", \"type\": \"int\", \"min\": 1, \"max\": 5},\n",
" {\"name\": \"nps\", \"type\": \"int\", \"min\": 0, \"max\": 10},\n",
" {\"name\": \"is_detractor\", \"type\": \"bool\"}\n",
" ]\n",
"}\n",
"print(\"Loaded config for\", CFG[\"rows\"], \"rows and\", len(CFG[\"fields\"]), \"fields.\")\n"
]
},
{
"cell_type": "markdown",
"id": "7da1f429",
"metadata": {},
"source": [
"## 2) Helpers"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2f5fdff",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def sample_enum(values, probs=None, size=None):\n",
" values = list(values)\n",
" if probs is None:\n",
" probs = [1.0 / len(values)] * len(values)\n",
" return np.random.choice(values, p=probs, size=size)\n",
"\n",
"def sample_numeric(field_cfg, size=1):\n",
" t = field_cfg[\"type\"]\n",
" if t == \"int\":\n",
" lo, hi = int(field_cfg[\"min\"]), int(field_cfg[\"max\"])\n",
" dist = field_cfg.get(\"distribution\", \"uniform\")\n",
" if dist == \"uniform\":\n",
" return np.random.randint(lo, hi + 1, size=size)\n",
" elif dist == \"normal\":\n",
" mu = (lo + hi) / 2.0\n",
" sigma = (hi - lo) / 6.0\n",
" out = np.random.normal(mu, sigma, size=size)\n",
" return np.clip(out, lo, hi).astype(int)\n",
" else:\n",
" return np.random.randint(lo, hi + 1, size=size)\n",
" elif t == \"float\":\n",
" lo, hi = float(field_cfg[\"min\"]), float(field_cfg[\"max\"])\n",
" dist = field_cfg.get(\"distribution\", \"uniform\")\n",
" if dist == \"uniform\":\n",
" return np.random.uniform(lo, hi, size=size)\n",
" elif dist == \"normal\":\n",
" mu = (lo + hi) / 2.0\n",
" sigma = (hi - lo) / 6.0\n",
" return np.clip(np.random.normal(mu, sigma, size=size), lo, hi)\n",
" elif dist == \"lognormal\":\n",
" mu = math.log(max(1e-3, (lo + hi) / 2.0))\n",
" sigma = 0.75\n",
" out = np.random.lognormal(mu, sigma, size=size)\n",
" return np.clip(out, lo, hi)\n",
" else:\n",
" return np.random.uniform(lo, hi, size=size)\n",
" else:\n",
" raise ValueError(\"Unsupported numeric type\")\n",
"\n",
"def sample_datetime(start: str, end: str, size=1, fmt=\"%Y-%m-%d %H:%M:%S\"):\n",
" s = datetime.fromisoformat(start)\n",
" e = datetime.fromisoformat(end)\n",
" total = int((e - s).total_seconds())\n",
" r = np.random.randint(0, total, size=size)\n",
" return [(s + timedelta(seconds=int(x))).strftime(fmt) for x in r]\n"
]
},
{
"cell_type": "markdown",
"id": "5f24111a",
"metadata": {},
"source": [
"## 3) Rule-based Generator"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd61330d",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def generate_rule_based(CFG: Dict[str, Any]) -> pd.DataFrame:\n",
" n = CFG[\"rows\"]\n",
" dt_cfg = CFG.get(\"datetime_range\", {\"start\":\"2024-01-01\",\"end\":\"2025-10-01\",\"fmt\":\"%Y-%m-%d %H:%M:%S\"})\n",
" data = {}\n",
" for f in CFG[\"fields\"]:\n",
" name, t = f[\"name\"], f[\"type\"]\n",
" if t == \"uuid4\":\n",
" data[name] = [str(uuid.uuid4()) for _ in range(n)]\n",
" elif t in (\"int\",\"float\"):\n",
" data[name] = sample_numeric(f, size=n)\n",
" elif t == \"enum\":\n",
" data[name] = sample_enum(f[\"values\"], f.get(\"probs\"), size=n)\n",
" elif t == \"datetime\":\n",
" data[name] = sample_datetime(dt_cfg[\"start\"], dt_cfg[\"end\"], size=n, fmt=dt_cfg[\"fmt\"])\n",
" elif t == \"bool\":\n",
" data[name] = np.random.rand(n) < 0.9 # 90% True\n",
" else:\n",
" data[name] = [None]*n\n",
" df = pd.DataFrame(data)\n",
"\n",
" # Derive NPS roughly from likert questions\n",
" if set([\"q_quality\",\"q_value\",\"q_ease\",\"q_support\"]).issubset(df.columns):\n",
" likert_avg = df[[\"q_quality\",\"q_value\",\"q_ease\",\"q_support\"]].mean(axis=1)\n",
" df[\"nps\"] = np.clip(np.round((likert_avg - 1.0) * (10.0/4.0) + np.random.normal(0, 1.2, size=n)), 0, 10).astype(int)\n",
"\n",
" # Heuristic target: is_detractor more likely when completion high & attention failed\n",
" if \"is_detractor\" in df.columns:\n",
" base = 0.25\n",
" comp = df.get(\"completion_seconds\", pd.Series(np.zeros(n)))\n",
" attn = pd.Series(df.get(\"attention_passed\", np.ones(n))).astype(bool)\n",
" boost = (comp > 900).astype(int) + (~attn).astype(int)\n",
" p = np.clip(base + 0.15*boost, 0.01, 0.95)\n",
" df[\"is_detractor\"] = np.random.rand(n) < p\n",
"\n",
" return df\n",
"\n",
"df_rule = generate_rule_based(CFG)\n",
"df_rule.head()\n"
]
},
{
"cell_type": "markdown",
"id": "dd9eff20",
"metadata": {},
"source": [
"## 4) Validation (Pandera optional)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a4ef86a",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def build_pandera_schema(CFG):\n",
" if pa is None:\n",
" return None\n",
" cols = {}\n",
" for f in CFG[\"fields\"]:\n",
" t, name = f[\"type\"], f[\"name\"]\n",
" if t == \"int\": cols[name] = pa.Column(int)\n",
" elif t == \"float\": cols[name] = pa.Column(float)\n",
" elif t == \"enum\": cols[name] = pa.Column(object)\n",
" elif t == \"datetime\": cols[name] = pa.Column(object)\n",
" elif t == \"uuid4\": cols[name] = pa.Column(object)\n",
" elif t == \"bool\": cols[name] = pa.Column(bool)\n",
" else: cols[name] = pa.Column(object)\n",
" return pa.DataFrameSchema(cols) if pa is not None else None\n",
"\n",
"def validate_df(df, CFG):\n",
" schema = build_pandera_schema(CFG)\n",
" if schema is None:\n",
" return df, {\"engine\":\"basic\",\"valid_rows\": len(df), \"invalid_rows\": 0}\n",
" try:\n",
" v = schema.validate(df, lazy=True)\n",
" return v, {\"engine\":\"pandera\",\"valid_rows\": len(v), \"invalid_rows\": 0}\n",
" except Exception as e:\n",
" print(\"Validation error:\", e)\n",
" return df, {\"engine\":\"pandera\",\"valid_rows\": len(df), \"invalid_rows\": 0, \"notes\": \"Non-strict mode.\"}\n",
"\n",
"validated_rule, report_rule = validate_df(df_rule, CFG)\n",
"print(report_rule)\n",
"validated_rule.head()\n"
]
},
{
"cell_type": "markdown",
"id": "d5f1d93a",
"metadata": {},
"source": [
"## 5) Save"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73626b4c",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from pathlib import Path\n",
"out = Path(\"data\"); out.mkdir(exist_ok=True)\n",
"ts = datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\")\n",
"csv_path = out / f\"survey_rule_{ts}.csv\"\n",
"validated_rule.to_csv(csv_path, index=False)\n",
"print(\"Saved:\", csv_path.as_posix())\n"
]
},
{
"cell_type": "markdown",
"id": "87c89b51",
"metadata": {},
"source": [
"## 6) Optional: LLM Generator (JSON mode, retry & strict parsing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24e94771",
"metadata": {},
"outputs": [],
"source": [
"# Fixed LLM Generation Functions\n",
"def create_survey_prompt(CFG, n_rows=50):\n",
" \"\"\"Create a clear, structured prompt for survey data generation\"\"\"\n",
" fields_desc = []\n",
" for field in CFG['fields']:\n",
" name = field['name']\n",
" field_type = field['type']\n",
" \n",
" if field_type == 'int':\n",
" min_val = field.get('min', 0)\n",
" max_val = field.get('max', 100)\n",
" fields_desc.append(f\" - {name}: integer between {min_val} and {max_val}\")\n",
" elif field_type == 'float':\n",
" min_val = field.get('min', 0.0)\n",
" max_val = field.get('max', 100.0)\n",
" fields_desc.append(f\" - {name}: float between {min_val} and {max_val}\")\n",
" elif field_type == 'enum':\n",
" values = field.get('values', [])\n",
" fields_desc.append(f\" - {name}: one of {values}\")\n",
" elif field_type == 'bool':\n",
" fields_desc.append(f\" - {name}: boolean (true/false)\")\n",
" elif field_type == 'uuid4':\n",
" fields_desc.append(f\" - {name}: UUID string\")\n",
" elif field_type == 'datetime':\n",
" fmt = field.get('fmt', '%Y-%m-%d %H:%M:%S')\n",
" fields_desc.append(f\" - {name}: datetime string in format {fmt}\")\n",
" else:\n",
" fields_desc.append(f\" - {name}: {field_type}\")\n",
" \n",
" prompt = f\"\"\"Generate {n_rows} rows of realistic survey response data.\n",
"\n",
"Schema:\n",
"{chr(10).join(fields_desc)}\n",
"\n",
"CRITICAL REQUIREMENTS:\n",
"- Return a JSON object with a \"responses\" key containing an array\n",
"- Each object in the array must have all required fields\n",
"- Use realistic, diverse values for survey responses\n",
"- No trailing commas\n",
"- No comments or explanations\n",
"\n",
"Output format: JSON object with \"responses\" array containing exactly {n_rows} objects.\n",
"\n",
"Example structure:\n",
"{{\n",
" \"responses\": [\n",
" {{\n",
" \"response_id\": \"uuid-string\",\n",
" \"respondent_id\": 12345,\n",
" \"submitted_at\": \"2024-01-01 12:00:00\",\n",
" \"country\": \"KE\",\n",
" \"language\": \"en\",\n",
" \"device\": \"android\",\n",
" \"age\": 25,\n",
" \"gender\": \"female\",\n",
" \"education\": \"bachelor\",\n",
" \"income_band\": \"upper_mid\",\n",
" \"completion_seconds\": 300.5,\n",
" \"attention_passed\": true,\n",
" \"q_quality\": 4,\n",
" \"q_value\": 3,\n",
" \"q_ease\": 5,\n",
" \"q_support\": 4,\n",
" \"nps\": 8,\n",
" \"is_detractor\": false\n",
" }},\n",
" ...\n",
" ]\n",
"}}\n",
"\n",
"IMPORTANT: Return ONLY the JSON object with \"responses\" key, nothing else.\"\"\"\n",
" \n",
" return prompt\n",
"\n",
"def repair_truncated_json(content):\n",
" \"\"\"Attempt to repair truncated JSON responses\"\"\"\n",
" content = content.strip()\n",
" \n",
" # If it starts with { but doesn't end with }, try to close it\n",
" if content.startswith('{') and not content.endswith('}'):\n",
" # Find the last complete object in the responses array\n",
" responses_start = content.find('\"responses\": [')\n",
" if responses_start != -1:\n",
" # Find the last complete object\n",
" brace_count = 0\n",
" last_complete_pos = -1\n",
" in_string = False\n",
" escape_next = False\n",
" \n",
" for i, char in enumerate(content[responses_start:], responses_start):\n",
" if escape_next:\n",
" escape_next = False\n",
" continue\n",
" \n",
" if char == '\\\\':\n",
" escape_next = True\n",
" continue\n",
" \n",
" if char == '\"' and not escape_next:\n",
" in_string = not in_string\n",
" continue\n",
" \n",
" if not in_string:\n",
" if char == '{':\n",
" brace_count += 1\n",
" elif char == '}':\n",
" brace_count -= 1\n",
" if brace_count == 0:\n",
" last_complete_pos = i\n",
" break\n",
" \n",
" if last_complete_pos != -1:\n",
" # Truncate at the last complete object and close the JSON\n",
" repaired = content[:last_complete_pos + 1] + '\\n ]\\n}'\n",
" print(f\"🔧 Repaired JSON: truncated at position {last_complete_pos}\")\n",
" return repaired\n",
" \n",
" return content\n",
"\n",
"def fixed_llm_generate_batch(CFG, n_rows=50):\n",
" \"\"\"Fixed LLM generation with better prompt and error handling\"\"\"\n",
" if not os.getenv('OPENAI_API_KEY'):\n",
" print(\"No OpenAI API key, using rule-based fallback\")\n",
" tmp = dict(CFG); tmp['rows'] = n_rows\n",
" return generate_rule_based(tmp)\n",
" \n",
" try:\n",
" from openai import OpenAI\n",
" client = OpenAI()\n",
" \n",
" prompt = create_survey_prompt(CFG, n_rows)\n",
" \n",
" print(f\"🔄 Generating {n_rows} survey responses with LLM...\")\n",
" \n",
" # Calculate appropriate max_tokens based on batch size\n",
" # Roughly 200-300 tokens per row, with some buffer\n",
" estimated_tokens = n_rows * 300 + 500 # Buffer for JSON structure\n",
" max_tokens = min(max(estimated_tokens, 2000), 8000) # Between 2k-8k tokens\n",
" \n",
" print(f\"📊 Using max_tokens: {max_tokens} (estimated: {estimated_tokens})\")\n",
" \n",
" response = client.chat.completions.create(\n",
" model='gpt-4o-mini',\n",
" messages=[\n",
" {'role': 'system', 'content': 'You are a data generation expert. Generate realistic survey data in JSON format. Always return complete, valid JSON.'},\n",
" {'role': 'user', 'content': prompt}\n",
" ],\n",
" temperature=0.3,\n",
" max_tokens=max_tokens,\n",
" response_format={'type': 'json_object'}\n",
" )\n",
" \n",
" content = response.choices[0].message.content\n",
" print(f\"📝 Raw response length: {len(content)} characters\")\n",
" \n",
" # Check if response appears truncated\n",
" if not content.strip().endswith('}') and not content.strip().endswith(']'):\n",
" print(\"⚠️ Response appears truncated, attempting repair...\")\n",
" content = repair_truncated_json(content)\n",
" \n",
" # Try to extract JSON with improved logic\n",
" try:\n",
" data = json.loads(content)\n",
" print(f\"🔍 Parsed JSON type: {type(data)}\")\n",
" \n",
" if isinstance(data, list):\n",
" df = pd.DataFrame(data)\n",
" print(f\"📊 Direct array: {len(df)} rows\")\n",
" elif isinstance(data, dict):\n",
" # Check for common keys that might contain the data\n",
" for key in ['responses', 'rows', 'data', 'items', 'records', 'results', 'survey_responses']:\n",
" if key in data and isinstance(data[key], list):\n",
" df = pd.DataFrame(data[key])\n",
" print(f\"📊 Found data in '{key}': {len(df)} rows\")\n",
" break\n",
" else:\n",
" # If no standard key found, check if all values are lists/objects\n",
" list_keys = [k for k, v in data.items() if isinstance(v, list) and len(v) > 0]\n",
" if list_keys:\n",
" # Use the first list key found\n",
" key = list_keys[0]\n",
" df = pd.DataFrame(data[key])\n",
" print(f\"📊 Found data in '{key}': {len(df)} rows\")\n",
" else:\n",
" # Try to convert the dict values to a list\n",
" if all(isinstance(v, dict) for v in data.values()):\n",
" df = pd.DataFrame(list(data.values()))\n",
" print(f\"📊 Converted dict values: {len(df)} rows\")\n",
" else:\n",
" raise ValueError(f\"Unexpected JSON structure: {list(data.keys())}\")\n",
" else:\n",
" raise ValueError(f\"Unexpected JSON type: {type(data)}\")\n",
" \n",
" if len(df) == n_rows:\n",
" print(f\"✅ Successfully generated {len(df)} survey responses\")\n",
" return df\n",
" else:\n",
" print(f\"⚠️ Generated {len(df)} rows, expected {n_rows}\")\n",
" if len(df) > 0:\n",
" return df\n",
" else:\n",
" raise ValueError(\"No data generated\")\n",
" \n",
" except json.JSONDecodeError as e:\n",
" print(f\"❌ JSON parsing failed: {str(e)}\")\n",
" # Try the improved extract_strict_json function\n",
" try:\n",
" data = extract_strict_json(content)\n",
" df = pd.DataFrame(data)\n",
" print(f\"✅ Recovered with strict parsing: {len(df)} rows\")\n",
" return df\n",
" except Exception as e2:\n",
" print(f\"❌ Strict parsing also failed: {str(e2)}\")\n",
" # Print a sample of the content for debugging\n",
" print(f\"🔍 Content sample: {content[:500]}...\")\n",
" raise e2\n",
" \n",
" except Exception as e:\n",
" print(f'❌ LLM error, fallback to rule-based mock: {str(e)}')\n",
" tmp = dict(CFG); tmp['rows'] = n_rows\n",
" return generate_rule_based(tmp)\n",
"\n",
"def fixed_generate_llm(CFG, total_rows=200, batch_size=50):\n",
" \"\"\"Fixed LLM generation with adaptive batch processing\"\"\"\n",
" print(f\"🚀 Generating {total_rows} survey responses with adaptive batching\")\n",
" \n",
" # Adaptive batch sizing based on total rows\n",
" if total_rows <= 20:\n",
" optimal_batch_size = min(batch_size, total_rows)\n",
" elif total_rows <= 50:\n",
" optimal_batch_size = min(15, batch_size)\n",
" elif total_rows <= 100:\n",
" optimal_batch_size = min(10, batch_size)\n",
" else:\n",
" optimal_batch_size = min(8, batch_size)\n",
" \n",
" print(f\"📊 Using optimal batch size: {optimal_batch_size}\")\n",
" \n",
" all_dataframes = []\n",
" remaining = total_rows\n",
" \n",
" while remaining > 0:\n",
" current_batch_size = min(optimal_batch_size, remaining)\n",
" print(f\"\\n📦 Processing batch: {current_batch_size} rows (remaining: {remaining})\")\n",
" \n",
" try:\n",
" batch_df = fixed_llm_generate_batch(CFG, current_batch_size)\n",
" all_dataframes.append(batch_df)\n",
" remaining -= len(batch_df)\n",
" \n",
" # Small delay between batches to avoid rate limits\n",
" if remaining > 0:\n",
" time.sleep(1.5)\n",
" \n",
" except Exception as e:\n",
" print(f\"❌ Batch failed: {str(e)}\")\n",
" print(f\"🔄 Retrying with smaller batch size...\")\n",
" \n",
" # Try with smaller batch size\n",
" smaller_batch = max(1, current_batch_size // 2)\n",
" if smaller_batch < current_batch_size:\n",
" try:\n",
" print(f\"🔄 Retrying with {smaller_batch} rows...\")\n",
" batch_df = fixed_llm_generate_batch(CFG, smaller_batch)\n",
" all_dataframes.append(batch_df)\n",
" remaining -= len(batch_df)\n",
" continue\n",
" except Exception as e2:\n",
" print(f\"❌ Retry also failed: {str(e2)}\")\n",
" \n",
" print(f\"Using rule-based fallback for remaining {remaining} rows\")\n",
" fallback_df = generate_rule_based(CFG, remaining)\n",
" all_dataframes.append(fallback_df)\n",
" break\n",
" \n",
" if all_dataframes:\n",
" result = pd.concat(all_dataframes, ignore_index=True)\n",
" print(f\"✅ Generated total: {len(result)} survey responses\")\n",
" return result\n",
" else:\n",
" print(\"❌ No data generated\")\n",
" return pd.DataFrame()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1af410e",
"metadata": {},
"outputs": [],
"source": [
"# Test the fixed LLM generation\n",
"print(\"🧪 Testing LLM generation...\")\n",
"\n",
"# Test with small dataset first\n",
"test_df = fixed_llm_generate_batch(CFG, 10)\n",
"print(f\"\\n📊 Generated dataset shape: {test_df.shape}\")\n",
"print(f\"\\n📋 First few rows:\")\n",
"print(test_df.head())\n",
"print(f\"\\n📈 Data types:\")\n",
"print(test_df.dtypes)\n",
"\n",
"# Debug function to see what the LLM is actually returning\n",
"def debug_llm_response(CFG, n_rows=5):\n",
" \"\"\"Debug function to see raw LLM response\"\"\"\n",
" if not os.getenv('OPENAI_API_KEY'):\n",
" print(\"No OpenAI API key available for debugging\")\n",
" return\n",
" \n",
" try:\n",
" from openai import OpenAI\n",
" client = OpenAI()\n",
" \n",
" prompt = create_survey_prompt(CFG, n_rows)\n",
" \n",
" print(f\"\\n🔍 DEBUG: Testing with {n_rows} rows\")\n",
" print(f\"📝 Prompt length: {len(prompt)} characters\")\n",
" \n",
" response = client.chat.completions.create(\n",
" model='gpt-4o-mini',\n",
" messages=[\n",
" {'role': 'system', 'content': 'You are a data generation expert. Generate realistic survey data in JSON format.'},\n",
" {'role': 'user', 'content': prompt}\n",
" ],\n",
" temperature=0.3,\n",
" max_tokens=2000,\n",
" response_format={'type': 'json_object'}\n",
" )\n",
" \n",
" content = response.choices[0].message.content\n",
" print(f\"📝 Raw response length: {len(content)} characters\")\n",
" print(f\"🔍 First 200 characters: {content[:200]}\")\n",
" print(f\"🔍 Last 200 characters: {content[-200:]}\")\n",
" \n",
" # Try to parse\n",
" try:\n",
" data = json.loads(content)\n",
" print(f\"✅ JSON parsed successfully\")\n",
" print(f\"🔍 Data type: {type(data)}\")\n",
" if isinstance(data, dict):\n",
" print(f\"🔍 Dict keys: {list(data.keys())}\")\n",
" elif isinstance(data, list):\n",
" print(f\"🔍 List length: {len(data)}\")\n",
" except Exception as e:\n",
" print(f\"❌ JSON parsing failed: {str(e)}\")\n",
" \n",
" except Exception as e:\n",
" print(f\"❌ Debug failed: {str(e)}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75c90739",
"metadata": {},
"outputs": [],
"source": [
"# Test the fixed implementation\n",
"print(\"🧪 Testing the fixed LLM generation...\")\n",
"\n",
"# Test with small dataset\n",
"test_df = fixed_llm_generate_batch(CFG, 5)\n",
"print(f\"\\n📊 Generated dataset shape: {test_df.shape}\")\n",
"print(f\"\\n📋 First few rows:\")\n",
"print(test_df.head())\n",
"print(f\"\\n📈 Data types:\")\n",
"print(test_df.dtypes)\n",
"\n",
"if not test_df.empty:\n",
" print(f\"\\n✅ SUCCESS! LLM generation is now working!\")\n",
" print(f\"📊 Generated {len(test_df)} survey responses using LLM\")\n",
"else:\n",
" print(f\"\\n❌ Still having issues with LLM generation\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd83b842",
"metadata": {},
"outputs": [],
"source": [
"#Test larger dataset generation \n",
"print(\"🚀 Testing larger dataset generation...\")\n",
"large_df = fixed_generate_llm(CFG, total_rows=100, batch_size=25)\n",
"if not large_df.empty:\n",
" print(f\"\\n📊 Large dataset shape: {large_df.shape}\")\n",
" print(f\"\\n📈 Summary statistics:\")\n",
" print(large_df.describe())\n",
" \n",
" # Save the results\n",
" from pathlib import Path\n",
" out = Path(\"data\"); out.mkdir(exist_ok=True)\n",
" ts = datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\")\n",
" csv_path = out / f\"survey_llm_fixed_{ts}.csv\"\n",
" large_df.to_csv(csv_path, index=False)\n",
" print(f\"💾 Saved: {csv_path}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6029d3e2",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def build_json_schema(CFG):\n",
" schema = {'type':'array','items':{'type':'object','properties':{},'required':[]}}\n",
" props = schema['items']['properties']; req = schema['items']['required']\n",
" for f in CFG['fields']:\n",
" name, t = f['name'], f['type']\n",
" req.append(name)\n",
" if t in ('int','float'): props[name] = {'type':'number' if t=='float' else 'integer'}\n",
" elif t == 'enum': props[name] = {'type':'string','enum': f['values']}\n",
" elif t in ('uuid4','datetime'): props[name] = {'type':'string'}\n",
" elif t == 'bool': props[name] = {'type':'boolean'}\n",
" else: props[name] = {'type':'string'}\n",
" return schema\n",
"\n",
"PROMPT_PREAMBLE = (\n",
" \"You are a data generator. Return ONLY JSON. \"\n",
" \"Respond as a JSON object with key 'rows' whose value is an array of exactly N objects. \"\n",
" \"No prose, no code fences, no trailing commas.\"\n",
")\n",
"\n",
"def render_prompt(CFG, n_rows=100):\n",
" minimal_cfg = {'fields': []}\n",
" for f in CFG['fields']:\n",
" base = {k: f[k] for k in ['name','type'] if k in f}\n",
" if 'min' in f and 'max' in f: base.update({'min': f['min'], 'max': f['max']})\n",
" if 'values' in f: base.update({'values': f['values']})\n",
" if 'fmt' in f: base.update({'fmt': f['fmt']})\n",
" minimal_cfg['fields'].append(base)\n",
" return {\n",
" 'preamble': PROMPT_PREAMBLE,\n",
" 'n_rows': n_rows,\n",
" 'schema': build_json_schema(CFG),\n",
" 'constraints': minimal_cfg,\n",
" 'instruction': f\"Return ONLY this structure: {{'rows': [ ... exactly {n_rows} objects ... ]}}\"\n",
" }\n",
"\n",
"def parse_llm_json_to_df(raw: str) -> pd.DataFrame:\n",
" try:\n",
" obj = json.loads(raw)\n",
" if isinstance(obj, dict) and isinstance(obj.get('rows'), list):\n",
" return pd.DataFrame(obj['rows'])\n",
" except Exception:\n",
" pass\n",
" data = extract_strict_json(raw)\n",
" return pd.DataFrame(data)\n",
"\n",
"USE_LLM = bool(os.getenv('OPENAI_API_KEY'))\n",
"print('LLM available:', USE_LLM)\n",
"\n",
"def llm_generate_batch(CFG, n_rows=50):\n",
" if USE_LLM:\n",
" try:\n",
" from openai import OpenAI\n",
" client = OpenAI()\n",
" prompt = json.dumps(render_prompt(CFG, n_rows))\n",
" resp = client.chat.completions.create(\n",
" model='gpt-4o-mini',\n",
" response_format={'type': 'json_object'},\n",
" messages=[\n",
" {'role':'system','content':'You output strict JSON only.'},\n",
" {'role':'user','content': prompt}\n",
" ],\n",
" temperature=0.2,\n",
" max_tokens=8192,\n",
" )\n",
" raw = resp.choices[0].message.content\n",
" try:\n",
" return parse_llm_json_to_df(raw)\n",
" except Exception:\n",
" stricter = (\n",
" prompt\n",
" + \"\\nReturn ONLY a JSON object structured as: \"\n",
" + \"{\\\"rows\\\": [ ... exactly N objects ... ]}. \"\n",
" + \"No prose, no explanations.\"\n",
" )\n",
" resp2 = client.chat.completions.create(\n",
" model='gpt-4o-mini',\n",
" response_format={'type': 'json_object'},\n",
" messages=[\n",
" {'role':'system','content':'You output strict JSON only.'},\n",
" {'role':'user','content': stricter}\n",
" ],\n",
" temperature=0.2,\n",
" max_tokens=8192,\n",
" )\n",
" raw2 = resp2.choices[0].message.content\n",
" return parse_llm_json_to_df(raw2)\n",
" except Exception as e:\n",
" print('LLM error, fallback to rule-based mock:', e)\n",
" tmp = dict(CFG); tmp['rows'] = n_rows\n",
" return generate_rule_based(tmp)\n",
"\n",
"def generate_llm(CFG, total_rows=200, batch_size=50):\n",
" dfs = []; remaining = total_rows\n",
" while remaining > 0:\n",
" b = min(batch_size, remaining)\n",
" dfs.append(llm_generate_batch(CFG, n_rows=b))\n",
" remaining -= b\n",
" time.sleep(0.2)\n",
" return pd.concat(dfs, ignore_index=True)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e759087",
"metadata": {},
"outputs": [],
"source": [
"df_llm = generate_llm(CFG, total_rows=100, batch_size=50)\n",
"df_llm.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d4908ad",
"metadata": {},
"outputs": [],
"source": [
"# Test the improved LLM generation with adaptive batching\n",
"print(\"🧪 Testing improved LLM generation with adaptive batching...\")\n",
"\n",
"# Test with smaller dataset first\n",
"print(\"\\n📦 Testing small batch (10 rows)...\")\n",
"small_df = fixed_llm_generate_batch(CFG, 10)\n",
"print(f\"✅ Small batch result: {len(small_df)} rows\")\n",
"\n",
"# Test with medium dataset using adaptive batching\n",
"print(\"\\n📦 Testing medium dataset (30 rows) with adaptive batching...\")\n",
"medium_df = fixed_generate_llm(CFG, total_rows=30, batch_size=15)\n",
"print(f\"✅ Medium dataset result: {len(medium_df)} rows\")\n",
"\n",
"if not medium_df.empty:\n",
" print(f\"\\n📊 Dataset shape: {medium_df.shape}\")\n",
" print(f\"\\n📋 First few rows:\")\n",
" print(medium_df.head())\n",
" \n",
" # Save the results\n",
" from pathlib import Path\n",
" out = Path(\"data\"); out.mkdir(exist_ok=True)\n",
" ts = datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\")\n",
" csv_path = out / f\"survey_adaptive_batch_{ts}.csv\"\n",
" medium_df.to_csv(csv_path, index=False)\n",
" print(f\"💾 Saved: {csv_path}\")\n",
"else:\n",
" print(\"❌ Medium dataset generation failed\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,216 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c58e628f",
"metadata": {},
"source": [
"\n",
"## **Week 3 task.**\n",
"Create your own tool that generates synthetic data/test data. Input the type of dataset or products or job postings, etc. and let the tool dream up various data samples.\n",
"\n",
"https://colab.research.google.com/drive/13wR4Blz3Ot_x0GOpflmvvFffm5XU3Kct?usp=sharing"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0ddde9ed",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"import torch\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI\n",
"from huggingface_hub import login\n",
"from huggingface_hub import login\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig\n",
"from dotenv import load_dotenv\n",
"import gradio as gr"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbbc6cc8",
"metadata": {},
"outputs": [],
"source": [
"\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"llama_api_key = \"ollama\"\n",
"\n",
"# hf_token = userdata.get('HF_TOKEN')\n",
"# login(hf_token, add_to_git_credential=True)\n",
"\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
"\n",
"if llama_api_key:\n",
" print(f\"LLama API Key exists\")\n",
"else:\n",
" print(\"LLama API Key not set\")\n",
" \n",
"GPT_MODEL = \"gpt-4.1-mini\"\n",
"LLAMA_MODEL = \"llama3.1\"\n",
"\n",
"\n",
"openai = OpenAI()\n",
"\n",
"llama_url = \"http://localhost:11434/v1\"\n",
"llama = OpenAI(api_key=llama_api_key, base_url=llama_url)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ef083ec6",
"metadata": {},
"outputs": [],
"source": [
"def generate_with_gpt(user_prompt: str, num_samples: int = 5):\n",
" \"\"\"\n",
" Generates synthetic data using OpenAI's GPT.\n",
" Return a JSON string.\n",
" \"\"\"\n",
" if not openai:\n",
" return json.dumps({\"error\": \"OpenAI client not initialized. Please check your API key.\"}, indent=2)\n",
"\n",
" try:\n",
" response = openai.chat.completions.create(\n",
" model=GPT_MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": f\"You are a data generation assistant. Generate a JSON array of exactly {num_samples} objects based on the user's request. The output must be valid JSON only, without any other text or formatting.\"},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" response_format={\"type\": \"json_object\"}\n",
" )\n",
" \n",
" json_text = response.choices[0].message.content\n",
" return json_text\n",
" except APIError as e:\n",
" return json.dumps({\"error\": f\"Error from OpenAI API: {e.body}\"}, indent=2)\n",
" except Exception as e:\n",
" return json.dumps({\"error\": f\"An unexpected error occurred: {e}\"}, indent=2)\n",
"\n",
"def generate_with_gpt(user_prompt: str, num_samples: int = 5):\n",
" \"\"\"\n",
" Generates synthetic data using OpenAI's GPT.\n",
" Return a JSON string.\n",
" \"\"\"\n",
" if not openai:\n",
" return json.dumps({\"error\": \"OpenAI client not initialized. Please check your API key.\"}, indent=2)\n",
"\n",
" try:\n",
" response = openai.chat.completions.create(\n",
" model=GPT_MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": f\"You are a data generation assistant. Generate a JSON array of exactly {num_samples} objects based on the user's request. The output must be valid JSON only, without any other text or formatting.\"},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" response_format={\"type\": \"json_object\"}\n",
" )\n",
" \n",
" json_text = response.choices[0].message.content\n",
" return json_text\n",
" except APIError as e:\n",
" return json.dumps({\"error\": f\"Error from OpenAI API: {e.body}\"}, indent=2)\n",
" except Exception as e:\n",
" return json.dumps({\"error\": f\"An unexpected error occurred: {e}\"}, indent=2)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "b98f84d8",
"metadata": {},
"outputs": [],
"source": [
"def generate_data(user_prompt, model_choice):\n",
" \"\"\"\n",
" Wrapper function that calls the appropriate generation function based on model choice.\n",
" \"\"\"\n",
" if not user_prompt:\n",
" return json.dumps({\"error\": \"Please provide a description for the data.\"}, indent=2)\n",
"\n",
" if model_choice == f\"Hugging Face ({LLAMA_MODEL})\":\n",
" return generate_with_llama(user_prompt)\n",
" elif model_choice == f\"OpenAI ({GPT_MODEL})\":\n",
" return generate_with_gpt(user_prompt)\n",
" else:\n",
" return json.dumps({\"error\": \"Invalid model choice.\"}, indent=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adbc19a8",
"metadata": {},
"outputs": [],
"source": [
"# Gradio UI\n",
"with gr.Blocks(theme=gr.themes.Glass(), title=\"Synthetic Data Generator\") as ui:\n",
" gr.Markdown(\"# Synthetic Data Generator\")\n",
" gr.Markdown(\"Describe the type of data you need, select a model, and click 'Generate'.\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column(scale=3):\n",
" data_prompt = gr.Textbox(\n",
" lines=5,\n",
" label=\"Data Prompt\",\n",
" placeholder=\"e.g., a list of customer profiles with name, email, and a favorite product\"\n",
" )\n",
" \n",
" with gr.Column(scale=1):\n",
" model_choice = gr.Radio(\n",
" [f\"Hugging Face ({LLAMA_MODEL})\", f\"OpenAI ({GPT_MODEL})\"],\n",
" label=\"Choose a Model\",\n",
" value=f\"Hugging Face ({LLAMA_MODEL})\"\n",
" )\n",
" \n",
" generate_btn = gr.Button(\"Generate Data\")\n",
" \n",
" with gr.Row():\n",
" output_json = gr.JSON(label=\"Generated Data\")\n",
" \n",
" generate_btn.click(\n",
" fn=generate_data,\n",
" inputs=[data_prompt, model_choice],\n",
" outputs=output_json\n",
" )\n",
"\n",
"ui.launch(inbrowser=True, debug=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "fee27f39",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"\n",
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"ollama_api_key = os.getenv('OLLAMA_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if anthropic_api_key:\n",
" print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n",
"else:\n",
" print(\"Anthropic API Key not set (and this is optional)\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:2]}\")\n",
"else:\n",
" print(\"Google API Key not set (and this is optional)\")\n",
"\n",
"if ollama_api_key:\n",
" print(f\"OLLAMA API Key exists and begins {ollama_api_key[:2]}\")\n",
"else:\n",
" print(\"OLLAMA API Key not set (and this is optional)\")\n",
"\n",
"# Connect to client libraries\n",
"\n",
"openai = OpenAI()\n",
"\n",
"anthropic_url = \"https://api.anthropic.com/v1/\"\n",
"gemini_url = \"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
"ollama_url = \"http://localhost:11434/v1\"\n",
"\n",
"anthropic = OpenAI(api_key=anthropic_api_key, base_url=anthropic_url)\n",
"gemini = OpenAI(api_key=google_api_key, base_url=gemini_url)\n",
"ollama = OpenAI(api_key=ollama_api_key, base_url=ollama_url)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d26f4175",
"metadata": {},
"outputs": [],
"source": [
"models = [\"gpt-5\", \"claude-sonnet-4-5-20250929\", \"gemini-2.5-pro\", \"gpt-oss:20b-cloud\", ]\n",
"\n",
"clients = {\"gpt-5\": openai, \"claude-sonnet-4-5-20250929\": anthropic, \"gemini-2.5-pro\": gemini, \"gpt-oss:20b-cloud\": ollama}\n",
"\n",
"# Want to keep costs ultra-low? Replace this with models of your choice, using the examples from yesterday"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76563884",
"metadata": {},
"outputs": [],
"source": [
"system_prompt_doc = \"\"\"You are an expert Python developer and code reviewer.\n",
"Your job is to read the user's provided function, and return:\n",
"1. A concise, PEP-257-compliant docstring summarizing what the function does, clarifying types, parameters, return values, and side effects.\n",
"2. Helpful inline comments that improve both readability and maintainability, without restating what the code obviously does.\n",
"\n",
"Only output the function, not explanations or additional text. \n",
"Do not modify variable names or refactor the function logic.\n",
"Your response should improve the code's clarity and documentation, making it easier for others to understand and maintain.\n",
"Don't be extremely verbose.\n",
"Your answer should be at a {level} level of expertise.\n",
"\"\"\"\n",
"\n",
"system_prompt_tests = \"\"\"You are a seasoned Python developer and testing expert.\n",
"Your task is to read the user's provided function, and generate:\n",
"1. A concise set of meaningful unit tests that thoroughly validate the function's correctness, including typical, edge, and error cases.\n",
"2. The tests should be written for pytest (or unittest if pytest is not appropriate), use clear, descriptive names, and avoid unnecessary complexity.\n",
"3. If dependencies or mocking are needed, include minimal necessary setup code (but avoid over-mocking).\n",
"\n",
"Only output the relevant test code, not explanations or extra text.\n",
"Do not change the original function; focus solely on comprehensive, maintainable test coverage that other developers can easily understand and extend.\n",
"\"\"\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1bd82e96",
"metadata": {},
"outputs": [],
"source": [
"def generate_documentation(code, model, level):\n",
" response = clients[model].chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt_doc.format(level=level)},\n",
" {\"role\": \"user\", \"content\": code}\n",
" ],\n",
" stream=True\n",
" )\n",
" output = \"\"\n",
" for chunk in response:\n",
" output += chunk.choices[0].delta.content or \"\"\n",
" yield output.replace(\"```python\", \"\").replace(\"```\", \"\")\n",
"\n"
]
},
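{
"cell_type": "code",
"execution_count": null,
"id": "7f2e4a1c",
"metadata": {},
"outputs": [],
"source": [
"# A quick sketch (hypothetical snippet) of consuming the streaming generator outside Gradio.\n",
"# Each yielded value is the full text so far, so only the last value needs to be kept.\n",
"# result = \"\"\n",
"# for partial in generate_documentation(\"def add(a, b):\\n    return a + b\", models[-1], \"Mid\"):\n",
"#     result = partial\n",
"# print(result)"
]
},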
{
"cell_type": "code",
"execution_count": null,
"id": "b01b3421",
"metadata": {},
"outputs": [],
"source": [
"def generate_tests(code, model ):\n",
" response = clients[model].chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt_tests},\n",
" {\"role\": \"user\", \"content\": code}\n",
" ],\n",
" stream=True\n",
" )\n",
" output = \"\"\n",
" for chunk in response:\n",
" output += chunk.choices[0].delta.content or \"\"\n",
" yield output.replace(\"```python\", \"\").replace(\"```\", \"\")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16b71915",
"metadata": {},
"outputs": [],
"source": [
"vscode_dark = gr.themes.Monochrome(\n",
" primary_hue=\"blue\",\n",
" secondary_hue=\"slate\",\n",
" neutral_hue=\"slate\",\n",
").set(\n",
" body_background_fill=\"#1e1e1e\",\n",
" body_background_fill_dark=\"#1e1e1e\",\n",
" block_background_fill=\"#252526\",\n",
" block_background_fill_dark=\"#252526\",\n",
" block_border_color=\"#3e3e42\",\n",
" block_border_color_dark=\"#3e3e42\",\n",
" border_color_primary=\"#3e3e42\",\n",
" block_label_background_fill=\"#252526\",\n",
" block_label_background_fill_dark=\"#252526\",\n",
" block_label_text_color=\"#cccccc\",\n",
" block_label_text_color_dark=\"#cccccc\",\n",
" block_title_text_color=\"#cccccc\",\n",
" block_title_text_color_dark=\"#cccccc\",\n",
" body_text_color=\"#d4d4d4\",\n",
" body_text_color_dark=\"#d4d4d4\",\n",
" button_primary_background_fill=\"#0e639c\",\n",
" button_primary_background_fill_dark=\"#0e639c\",\n",
" button_primary_background_fill_hover=\"#1177bb\",\n",
" button_primary_background_fill_hover_dark=\"#1177bb\",\n",
" button_primary_text_color=\"#ffffff\",\n",
" button_primary_text_color_dark=\"#ffffff\",\n",
" input_background_fill=\"#3c3c3c\",\n",
" input_background_fill_dark=\"#3c3c3c\",\n",
" color_accent=\"#007acc\",\n",
" color_accent_soft=\"#094771\",\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23311022",
"metadata": {},
"outputs": [],
"source": [
"import gradio as gr\n",
"\n",
"with gr.Blocks(theme=vscode_dark, css=\"\"\"\n",
" .gradio-container {font-family: 'Consolas', 'Monaco', monospace;}\n",
" h1 {color: #d4d4d4 !important;}\n",
"\"\"\") as ui:\n",
" gr.Markdown(\"# 🧑‍💻 Python Code Reviewer & Test Generator\", elem_id=\"app-title\")\n",
" with gr.Tab(\"Docstring & Comments\") as tab1:\n",
" gr.Markdown(\"# Function Docstring & Comment Helper\\nPaste your function below and get helpful docstrings and inline comments!\")\n",
"\n",
" with gr.Row():\n",
" code_input_1 = gr.Code(label=\"Paste your Python function here\", lines=10, language=\"python\")\n",
" code_output = gr.Code(label=\"Function with improved docstring and comments\", lines=10, language=\"python\")\n",
" \n",
" with gr.Row(equal_height=True):\n",
" level_radio = gr.Radio(choices=[\"Junior\", \"Mid\", \"Senior\"], value=\"Mid\", label=\"Reviewer level\", interactive=True)\n",
" model_dropdown = gr.Dropdown(choices=models, value=models[-1], label=\"Select model\")\n",
" submit_doc_btn = gr.Button(\"Generate docstring & comments\", scale=0.5)\n",
"\n",
" submit_doc_btn.click(\n",
" generate_documentation, \n",
" inputs=[code_input_1, model_dropdown, level_radio], \n",
" outputs=code_output\n",
" )\n",
"\n",
" with gr.Tab(\"Unit Tests\") as tab2:\n",
" gr.Markdown(\"# Unit Test Generator\\nPaste your function below and get auto-generated unit tests!\")\n",
"\n",
" with gr.Row():\n",
" code_input_2 = gr.Code(label=\"Paste your Python function here\", lines=10, language=\"python\")\n",
" code_output_2 = gr.Code(label=\"Generated tests\", lines=10, language=\"python\")\n",
" \n",
" with gr.Row(equal_height=True):\n",
" model_dropdown_2 = gr.Dropdown(choices=models, value=models[-1], label=\"Select model\")\n",
" submit_test_btn = gr.Button(\"Generate unit tests\", scale=0.5)\n",
"\n",
" submit_test_btn.click(\n",
" generate_tests, \n",
" inputs=[code_input_2, model_dropdown_2], \n",
" outputs=code_output_2\n",
" )\n",
" \n",
" tab2.select(lambda x: x, inputs=code_input_1, outputs=code_input_2)\n",
" tab1.select(lambda x: x, inputs=code_input_2, outputs=code_input_1)\n",
"\n",
"ui.launch(share=False, inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,596 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4a6ab9a2-28a2-445d-8512-a0dc8d1b54e9",
"metadata": {},
"source": [
"# Code DocString / Comment Generator\n",
"\n",
"Submitted By : Bharat Puri\n",
"\n",
"Goal: Build a code tool that scans Python modules, finds functions/classes\n",
"without docstrings, and uses an LLM (Claude / GPT / Gemini / Qwen etc.)\n",
"to generate high-quality Google or NumPy style docstrings."
]
},
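{
"cell_type": "code",
"execution_count": null,
"id": "9d4b7e20",
"metadata": {},
"outputs": [],
"source": [
"# Illustration only (not generated output): the kind of transformation this tool aims to produce.\n",
"# Input — a function with no docstring:\n",
"#     def area(width, height):\n",
"#         return width * height\n",
"# Output — the same function with a generated Google-style docstring:\n",
"#     def area(width, height):\n",
"#         \"\"\"Compute the area of a rectangle.\n",
"#\n",
"#         Args:\n",
"#             width: Rectangle width.\n",
"#             height: Rectangle height.\n",
"#\n",
"#         Returns:\n",
"#             The product of width and height.\n",
"#         \"\"\"\n",
"#         return width * height"
]
},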
{
"cell_type": "code",
"execution_count": 11,
"id": "e610bf56-a46e-4aff-8de1-ab49d62b1ad3",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import io\n",
"import sys\n",
"import re\n",
"from dotenv import load_dotenv\n",
"import sys\n",
"sys.path.append(os.path.abspath(os.path.join(\"..\", \"..\"))) \n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"import subprocess\n",
"from IPython.display import Markdown, display\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f672e1c-87e9-4865-b760-370fa605e614",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"grok_api_key = os.getenv('GROK_API_KEY')\n",
"groq_api_key = os.getenv('GROQ_API_KEY')\n",
"openrouter_api_key = os.getenv('OPENROUTER_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if anthropic_api_key:\n",
" print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n",
"else:\n",
" print(\"Anthropic API Key not set (and this is optional)\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:2]}\")\n",
"else:\n",
" print(\"Google API Key not set (and this is optional)\")\n",
"\n",
"if grok_api_key:\n",
" print(f\"Grok API Key exists and begins {grok_api_key[:4]}\")\n",
"else:\n",
" print(\"Grok API Key not set (and this is optional)\")\n",
"\n",
"if groq_api_key:\n",
" print(f\"Groq API Key exists and begins {groq_api_key[:4]}\")\n",
"else:\n",
" print(\"Groq API Key not set (and this is optional)\")\n",
"\n",
"if openrouter_api_key:\n",
" print(f\"OpenRouter API Key exists and begins {openrouter_api_key[:6]}\")\n",
"else:\n",
" print(\"OpenRouter API Key not set (and this is optional)\")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "59863df1",
"metadata": {},
"outputs": [],
"source": [
"# Connect to client libraries\n",
"\n",
"openai = OpenAI()\n",
"\n",
"anthropic_url = \"https://api.anthropic.com/v1/\"\n",
"gemini_url = \"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
"grok_url = \"https://api.x.ai/v1\"\n",
"groq_url = \"https://api.groq.com/openai/v1\"\n",
"ollama_url = \"http://localhost:11434/v1\"\n",
"openrouter_url = \"https://openrouter.ai/api/v1\"\n",
"\n",
"anthropic = OpenAI(api_key=anthropic_api_key, base_url=anthropic_url)\n",
"gemini = OpenAI(api_key=google_api_key, base_url=gemini_url)\n",
"grok = OpenAI(api_key=grok_api_key, base_url=grok_url)\n",
"groq = OpenAI(api_key=groq_api_key, base_url=groq_url)\n",
"ollama = OpenAI(api_key=\"ollama\", base_url=ollama_url)\n",
"openrouter = OpenAI(api_key=openrouter_api_key, base_url=openrouter_url)\n",
"\n",
"MODEL = os.getenv(\"DOCGEN_MODEL\", \"gpt-4o-mini\")\n",
"\n",
"\n",
"# Registry for multiple model providers\n",
"MODEL_REGISTRY = {\n",
" \"gpt-4o-mini (OpenAI)\": {\n",
" \"provider\": \"openai\",\n",
" \"model\": \"gpt-4o-mini\",\n",
" },\n",
" \"gpt-4o (OpenAI)\": {\n",
" \"provider\": \"openai\",\n",
" \"model\": \"gpt-4o\",\n",
" },\n",
" \"claude-3.5-sonnet (Anthropic)\": {\n",
" \"provider\": \"anthropic\",\n",
" \"model\": \"claude-3.5-sonnet\",\n",
" },\n",
" \"gemini-1.5-pro (Google)\": {\n",
" \"provider\": \"google\",\n",
" \"model\": \"gemini-1.5-pro\",\n",
" },\n",
" \"codellama-7b (Open Source)\": {\n",
" \"provider\": \"open_source\",\n",
" \"model\": \"codellama-7b\",\n",
" },\n",
" \"starcoder2 (Open Source)\": {\n",
" \"provider\": \"open_source\",\n",
" \"model\": \"starcoder2\",\n",
" },\n",
"}\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8aa149ed-9298-4d69-8fe2-8f5de0f667da",
"metadata": {},
"outputs": [],
"source": [
"models = [\"gpt-5\", \"claude-sonnet-4-5-20250929\", \"grok-4\", \"gemini-2.5-pro\", \"qwen2.5-coder\", \"deepseek-coder-v2\", \"gpt-oss:20b\", \"qwen/qwen3-coder-30b-a3b-instruct\", \"openai/gpt-oss-120b\", ]\n",
"\n",
"clients = {\"gpt-5\": openai, \"claude-sonnet-4-5-20250929\": anthropic, \"grok-4\": grok, \"gemini-2.5-pro\": gemini, \"openai/gpt-oss-120b\": groq, \"qwen2.5-coder\": ollama, \"deepseek-coder-v2\": ollama, \"gpt-oss:20b\": ollama, \"qwen/qwen3-coder-30b-a3b-instruct\": openrouter}\n",
"\n",
"# Want to keep costs ultra-low? Replace this with models of your choice, using the examples from yesterday"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "17b7d074-b1a4-4673-adec-918f82a4eff0",
"metadata": {},
"outputs": [],
"source": [
"# ================================================================\n",
"# Prompt Templates and Utilities\n",
"# ================================================================\n",
"\n",
"DOCSTYLE_TEMPLATES = {\n",
" \"google\": (\n",
" \"You will write a concise Google-style Python docstring for the given function or class.\\n\"\n",
" \"Rules:\\n\"\n",
" \"- One-line summary followed by short details.\\n\"\n",
" \"- Include Args:, Returns:, Raises: only if relevant.\\n\"\n",
" \"- Keep under 12 lines, no code fences or markdown formatting.\\n\"\n",
" \"Return ONLY the text between triple quotes.\"\n",
" ),\n",
"}\n",
"\n",
"SYSTEM_PROMPT = (\n",
" \"You are a senior Python engineer and technical writer. \"\n",
" \"Write precise, helpful docstrings.\"\n",
")\n",
"\n",
"\n",
"def make_user_prompt(style: str, module_name: str, signature: str, code_context: str) -> str:\n",
" \"\"\"Build the user message for the model based on template and context.\"\"\"\n",
" instr = DOCSTYLE_TEMPLATES.get(style, DOCSTYLE_TEMPLATES[\"google\"])\n",
" prompt = (\n",
" f\"{instr}\\n\\n\"\n",
" f\"Module: {module_name}\\n\"\n",
" f\"Signature:\\n{signature}\\n\\n\"\n",
" f\"Code context:\\n{code_context}\\n\\n\"\n",
" \"Return ONLY a triple-quoted docstring, for example:\\n\"\n",
" '\"\"\"One-line summary.\\n\\n'\n",
" \"Args:\\n\"\n",
" \" x: Description\\n\"\n",
" \"Returns:\\n\"\n",
" \" y: Description\\n\"\n",
" '\"\"\"'\n",
" )\n",
" return prompt\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "16b3c10f-f7bc-4a2f-a22f-65c6807b7574",
"metadata": {},
"outputs": [],
"source": [
"# ================================================================\n",
"# LLM Chat Helper — OpenAI GPT\n",
"# ================================================================\n",
"def llm_generate_docstring(signature: str, context: str, style: str = \"google\", \n",
" module_name: str = \"module\", model_choice: str = \"gpt-4o-mini (OpenAI)\") -> str:\n",
" \"\"\"\n",
" Generate a Python docstring using the selected model provider.\n",
" \"\"\"\n",
" user_prompt = make_user_prompt(style, module_name, signature, context)\n",
" model_info = MODEL_REGISTRY.get(model_choice, MODEL_REGISTRY[\"gpt-4o-mini (OpenAI)\"])\n",
"\n",
" provider = model_info[\"provider\"]\n",
" model_name = model_info[\"model\"]\n",
"\n",
" if provider == \"openai\":\n",
" response = openai.chat.completions.create(\n",
" model=model_name,\n",
" temperature=0.2,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a senior Python engineer and technical writer.\"},\n",
" {\"role\": \"user\", \"content\": user_prompt},\n",
" ],\n",
" )\n",
" text = response.choices[0].message.content.strip()\n",
"\n",
" elif provider == \"anthropic\":\n",
" # Future: integrate Anthropic SDK\n",
" text = \"Claude response simulation: \" + user_prompt[:200]\n",
"\n",
" elif provider == \"google\":\n",
" # Future: integrate Gemini API\n",
" text = \"Gemini response simulation: \" + user_prompt[:200]\n",
"\n",
" else:\n",
" # Simulated open-source fallback\n",
" text = f\"[Simulated output from {model_name}]\\nAuto-generated docstring for {signature}\"\n",
"\n",
" import re\n",
" match = re.search(r'\"\"\"(.*?)\"\"\"', text, re.S)\n",
" return match.group(1).strip() if match else text\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "82da91ac-e563-4425-8b45-1b94880d342f",
"metadata": {},
"outputs": [],
"source": [
"# ================================================================\n",
"# 🧱 AST Parsing Utilities — find missing docstrings\n",
"# ================================================================\n",
"import ast\n",
"\n",
"def node_signature(node: ast.AST) -> str:\n",
" \"\"\"\n",
" Build a readable signature string from a FunctionDef or ClassDef node.\n",
" Example: def add(x, y) -> int:\n",
" \"\"\"\n",
" if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):\n",
" args = [a.arg for a in node.args.args]\n",
" if node.args.vararg:\n",
" args.append(\"*\" + node.args.vararg.arg)\n",
" for a in node.args.kwonlyargs:\n",
" args.append(a.arg + \"=?\")\n",
" if node.args.kwarg:\n",
" args.append(\"**\" + node.args.kwarg.arg)\n",
" ret = \"\"\n",
" if getattr(node, \"returns\", None):\n",
" try:\n",
" ret = f\" -> {ast.unparse(node.returns)}\"\n",
" except Exception:\n",
" pass\n",
" return f\"def {node.name}({', '.join(args)}){ret}:\"\n",
"\n",
" elif isinstance(node, ast.ClassDef):\n",
" return f\"class {node.name}:\"\n",
"\n",
" return \"\"\n",
"\n",
"\n",
"def context_snippet(src: str, node: ast.AST, max_lines: int = 60) -> str:\n",
" \"\"\"\n",
" Extract a small snippet of source code around a node for context.\n",
" This helps the LLM understand what the function/class does.\n",
" \"\"\"\n",
" lines = src.splitlines()\n",
" start = getattr(node, \"lineno\", 1) - 1\n",
" end = getattr(node, \"end_lineno\", start + 1)\n",
" snippet = lines[start:end]\n",
" if len(snippet) > max_lines:\n",
" snippet = snippet[:max_lines] + [\"# ... (truncated) ...\"]\n",
" return \"\\n\".join(snippet)\n",
"\n",
"\n",
"def find_missing_docstrings(src: str):\n",
" \"\"\"\n",
" Parse the Python source code and return a list of nodes\n",
" (module, class, function) that do NOT have docstrings.\n",
" \"\"\"\n",
" tree = ast.parse(src)\n",
" missing = []\n",
"\n",
" # Module-level docstring check\n",
" if ast.get_docstring(tree) is None:\n",
" missing.append((\"module\", tree))\n",
"\n",
" # Walk through the AST for classes and functions\n",
" for node in ast.walk(tree):\n",
" if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):\n",
" if ast.get_docstring(node) is None:\n",
" kind = \"class\" if isinstance(node, ast.ClassDef) else \"function\"\n",
" missing.append((kind, node))\n",
"\n",
" return missing\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea69108f-e4ca-4326-89fe-97c5748c0e79",
"metadata": {},
"outputs": [],
"source": [
"## Quick Test ##\n",
"\n",
"code = '''\n",
"def add(x, y):\n",
" return x + y\n",
"\n",
"class Counter:\n",
" def inc(self):\n",
" self.total += 1\n",
"'''\n",
"\n",
"for kind, node in find_missing_docstrings(code):\n",
" print(f\"Missing docstring → {kind}: {node_signature(node)}\")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "00d65b96-e65d-4e11-89be-06f265a5f2e3",
"metadata": {},
"outputs": [],
"source": [
"# ================================================================\n",
"# Insert Generated Docstrings into Code\n",
"# ================================================================\n",
"import difflib\n",
"import textwrap\n",
"\n",
"def insert_docstring(src: str, node: ast.AST, docstring: str) -> str:\n",
" \"\"\"\n",
" Insert a generated docstring inside a function/class node.\n",
" Keeps indentation consistent with the original code.\n",
" \"\"\"\n",
" lines = src.splitlines()\n",
" if not hasattr(node, \"body\") or not node.body:\n",
" return src # nothing to insert into\n",
"\n",
" start_idx = node.body[0].lineno - 1\n",
" indent = re.match(r\"\\s*\", lines[start_idx]).group(0)\n",
" ds_lines = textwrap.indent(f'\"\"\"{docstring.strip()}\"\"\"', indent).splitlines()\n",
"\n",
" new_lines = lines[:start_idx] + ds_lines + [\"\"] + lines[start_idx:]\n",
" return \"\\n\".join(new_lines)\n",
"\n",
"\n",
"def insert_module_docstring(src: str, docstring: str) -> str:\n",
" \"\"\"Insert a module-level docstring at the top of the file.\"\"\"\n",
" lines = src.splitlines()\n",
" ds_block = f'\"\"\"{docstring.strip()}\"\"\"\\n'\n",
" return ds_block + \"\\n\".join(lines)\n",
"\n",
"\n",
"def diff_text(a: str, b: str) -> str:\n",
" \"\"\"Show unified diff of original vs updated code.\"\"\"\n",
" return \"\".join(\n",
" difflib.unified_diff(\n",
" a.splitlines(keepends=True),\n",
" b.splitlines(keepends=True),\n",
" fromfile=\"original.py\",\n",
" tofile=\"updated.py\",\n",
" )\n",
" )\n",
"\n",
"\n",
"def generate_docstrings_for_source(src: str, style: str = \"google\", module_name: str = \"module\", model_choice: str = \"gpt-4o-mini (OpenAI)\"):\n",
" targets = find_missing_docstrings(src)\n",
" updated = src\n",
" report = []\n",
"\n",
" for kind, node in sorted(targets, key=lambda t: 0 if t[0] == \"module\" else 1):\n",
" sig = \"module \" + module_name if kind == \"module\" else node_signature(node)\n",
" ctx = src if kind == \"module\" else context_snippet(src, node)\n",
" doc = llm_generate_docstring(sig, ctx, style=style, module_name=module_name, model_choice=model_choice)\n",
"\n",
" if kind == \"module\":\n",
" updated = insert_module_docstring(updated, doc)\n",
" else:\n",
" updated = insert_docstring(updated, node, doc)\n",
"\n",
" report.append({\"kind\": kind, \"signature\": sig, \"model\": model_choice, \"doc_preview\": doc[:150]})\n",
"\n",
" return updated, report\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d00cf4b7-773d-49cb-8262-9d11d787ee10",
"metadata": {},
"outputs": [],
"source": [
"## Quick Test ##\n",
"new_code, report = generate_docstrings_for_source(code, style=\"google\", module_name=\"demo\")\n",
"\n",
"print(\"=== Generated Docstrings ===\")\n",
"for r in report:\n",
" print(f\"- {r['kind']}: {r['signature']}\")\n",
" print(\" \", r['doc_preview'])\n",
"print(\"\\n=== Updated Source ===\")\n",
"print(new_code)\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "b318db41-c05d-48ce-9990-b6f1a0577c68",
"metadata": {},
"outputs": [],
"source": [
"# ================================================================\n",
"# 📂 File-Based Workflow — preview or apply docstrings\n",
"# ================================================================\n",
"from pathlib import Path\n",
"import pandas as pd\n",
"\n",
"def process_file(path: str, style: str = \"google\", apply: bool = False) -> pd.DataFrame:\n",
" \"\"\"\n",
" Process a .py file: find missing docstrings, generate them via GPT,\n",
" and either preview the diff or apply the updates in place.\n",
" \"\"\"\n",
" p = Path(path)\n",
" src = p.read_text(encoding=\"utf-8\")\n",
" updated, rows = generate_docstrings_for_source(src, style=style, module_name=p.stem)\n",
"\n",
" if apply:\n",
" p.write_text(updated, encoding=\"utf-8\")\n",
" print(f\"✅ Updated file written → {p}\")\n",
" else:\n",
" print(\"🔍 Diff preview:\")\n",
" print(diff_text(src, updated))\n",
"\n",
" return pd.DataFrame(rows)\n",
"\n",
"# Example usage:\n",
"# df = process_file(\"my_script.py\", style=\"google\", apply=False) # preview\n",
"# df = process_file(\"my_script.py\", style=\"google\", apply=True) # overwrite with docstrings\n",
"# df\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0b0f852-982f-4918-9b5d-89880cc12003",
"metadata": {},
"outputs": [],
"source": [
"# ================================================================\n",
"# 🎨 Enhanced Gradio Interface with Model Selector\n",
"# ================================================================\n",
"import gradio as gr\n",
"\n",
"def gradio_generate(code_text: str, style: str, model_choice: str):\n",
" \"\"\"Wrapper for Gradio — generates docstrings using selected model.\"\"\"\n",
" if not code_text.strip():\n",
" return \"⚠️ Please paste some Python code first.\"\n",
" try:\n",
" updated, _ = generate_docstrings_for_source(\n",
" code_text, style=style, module_name=\"gradio_snippet\", model_choice=model_choice\n",
" )\n",
" return updated\n",
" except Exception as e:\n",
" return f\"❌ Error: {e}\"\n",
"\n",
"with gr.Blocks(theme=gr.themes.Soft()) as doc_ui:\n",
" gr.Markdown(\"## 🧠 Auto Docstring Generator — by Bharat Puri\\nChoose your model and generate high-quality docstrings.\")\n",
"\n",
" with gr.Row():\n",
" code_input = gr.Code(label=\"Paste your Python code\", language=\"python\", lines=18)\n",
" code_output = gr.Code(label=\"Generated code with docstrings\", language=\"python\", lines=18)\n",
"\n",
" with gr.Row():\n",
" style_choice = gr.Radio([\"google\"], value=\"google\", label=\"Docstring Style\")\n",
" model_choice = gr.Dropdown(\n",
" list(MODEL_REGISTRY.keys()),\n",
" value=\"gpt-4o-mini (OpenAI)\",\n",
" label=\"Select Model\",\n",
" )\n",
"\n",
" generate_btn = gr.Button(\"🚀 Generate Docstrings\")\n",
" generate_btn.click(\n",
" fn=gradio_generate,\n",
" inputs=[code_input, style_choice, model_choice],\n",
" outputs=[code_output],\n",
" )\n",
"\n",
"doc_ui.launch(share=False)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e6d6720-de8e-4cbb-be9f-82bac3dcc71a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,190 @@
"""
Simple calculator class with history tracking.
"""
import math
from typing import List, Union
class Calculator:
"""A simple calculator with history tracking."""
def __init__(self):
"""Initialize calculator with empty history."""
self.history: List[str] = []
self.memory: float = 0.0
def add(self, a: float, b: float) -> float:
"""Add two numbers."""
result = a + b
self.history.append(f"{a} + {b} = {result}")
return result
def subtract(self, a: float, b: float) -> float:
"""Subtract b from a."""
result = a - b
self.history.append(f"{a} - {b} = {result}")
return result
def multiply(self, a: float, b: float) -> float:
"""Multiply two numbers."""
result = a * b
self.history.append(f"{a} * {b} = {result}")
return result
def divide(self, a: float, b: float) -> float:
"""Divide a by b."""
if b == 0:
raise ValueError("Cannot divide by zero")
result = a / b
self.history.append(f"{a} / {b} = {result}")
return result
def power(self, base: float, exponent: float) -> float:
"""Calculate base raised to the power of exponent."""
result = base ** exponent
self.history.append(f"{base} ^ {exponent} = {result}")
return result
def square_root(self, number: float) -> float:
"""Calculate square root of a number."""
if number < 0:
raise ValueError("Cannot calculate square root of negative number")
result = math.sqrt(number)
self.history.append(f"{number} = {result}")
return result
def factorial(self, n: int) -> int:
"""Calculate factorial of n."""
if n < 0:
raise ValueError("Factorial is not defined for negative numbers")
if n == 0 or n == 1:
return 1
result = 1
for i in range(2, n + 1):
result *= i
self.history.append(f"{n}! = {result}")
return result
def memory_store(self, value: float) -> None:
"""Store value in memory."""
self.memory = value
self.history.append(f"Memory stored: {value}")
def memory_recall(self) -> float:
"""Recall value from memory."""
self.history.append(f"Memory recalled: {self.memory}")
return self.memory
def memory_clear(self) -> None:
"""Clear memory."""
self.memory = 0.0
self.history.append("Memory cleared")
def get_history(self) -> List[str]:
"""Get calculation history."""
return self.history.copy()
def clear_history(self) -> None:
"""Clear calculation history."""
self.history.clear()
def get_last_result(self) -> Union[float, None]:
"""Get the result of the last calculation."""
if not self.history:
return None
last_entry = self.history[-1]
# Extract result from history entry
if "=" in last_entry:
return float(last_entry.split("=")[-1].strip())
return None
class ScientificCalculator(Calculator):
"""Extended calculator with scientific functions."""
def sine(self, angle: float) -> float:
"""Calculate sine of angle in radians."""
result = math.sin(angle)
self.history.append(f"sin({angle}) = {result}")
return result
def cosine(self, angle: float) -> float:
"""Calculate cosine of angle in radians."""
result = math.cos(angle)
self.history.append(f"cos({angle}) = {result}")
return result
def tangent(self, angle: float) -> float:
"""Calculate tangent of angle in radians."""
result = math.tan(angle)
self.history.append(f"tan({angle}) = {result}")
return result
def logarithm(self, number: float, base: float = math.e) -> float:
"""Calculate logarithm of number with given base."""
if number <= 0:
raise ValueError("Logarithm is not defined for non-positive numbers")
if base <= 0 or base == 1:
raise ValueError("Logarithm base must be positive and not equal to 1")
result = math.log(number, base)
self.history.append(f"log_{base}({number}) = {result}")
return result
def degrees_to_radians(self, degrees: float) -> float:
"""Convert degrees to radians."""
return degrees * math.pi / 180
def radians_to_degrees(self, radians: float) -> float:
"""Convert radians to degrees."""
return radians * 180 / math.pi
def main():
"""Main function to demonstrate calculator functionality."""
print("Calculator Demo")
print("=" * 30)
# Basic calculator
calc = Calculator()
print("Basic Calculator Operations:")
print(f"5 + 3 = {calc.add(5, 3)}")
print(f"10 - 4 = {calc.subtract(10, 4)}")
print(f"6 * 7 = {calc.multiply(6, 7)}")
print(f"15 / 3 = {calc.divide(15, 3)}")
print(f"2 ^ 8 = {calc.power(2, 8)}")
print(f"√64 = {calc.square_root(64)}")
print(f"5! = {calc.factorial(5)}")
print(f"\nCalculation History:")
for entry in calc.get_history():
print(f" {entry}")
# Scientific calculator
print("\n" + "=" * 30)
print("Scientific Calculator Operations:")
sci_calc = ScientificCalculator()
# Convert degrees to radians for trigonometric functions
angle_deg = 45
angle_rad = sci_calc.degrees_to_radians(angle_deg)
print(f"sin({angle_deg}°) = {sci_calc.sine(angle_rad):.4f}")
print(f"cos({angle_deg}°) = {sci_calc.cosine(angle_rad):.4f}")
print(f"tan({angle_deg}°) = {sci_calc.tangent(angle_rad):.4f}")
print(f"ln(10) = {sci_calc.logarithm(10):.4f}")
print(f"log₁₀(100) = {sci_calc.logarithm(100, 10):.4f}")
print(f"\nScientific Calculator History:")
for entry in sci_calc.get_history():
print(f" {entry}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,64 @@
"""
Fibonacci sequence implementation in Python.
"""
def fibonacci(n):
"""Calculate the nth Fibonacci number using recursion."""
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
def fibonacci_iterative(n):
"""Calculate the nth Fibonacci number using iteration."""
if n <= 1:
return n
a, b = 0, 1
for _ in range(2, n + 1):
a, b = b, a + b
return b
def fibonacci_sequence(count):
"""Generate a sequence of Fibonacci numbers."""
sequence = []
for i in range(count):
sequence.append(fibonacci(i))
return sequence
def main():
"""Main function to demonstrate Fibonacci calculations."""
print("Fibonacci Sequence Demo")
print("=" * 30)
# Calculate first 10 Fibonacci numbers
for i in range(10):
result = fibonacci(i)
print(f"fibonacci({i}) = {result}")
print("\nFirst 15 Fibonacci numbers:")
sequence = fibonacci_sequence(15)
print(sequence)
# Performance comparison
import time
n = 30
print(f"\nPerformance comparison for fibonacci({n}):")
start_time = time.time()
recursive_result = fibonacci(n)
recursive_time = time.time() - start_time
start_time = time.time()
iterative_result = fibonacci_iterative(n)
iterative_time = time.time() - start_time
print(f"Recursive: {recursive_result} (took {recursive_time:.4f}s)")
print(f"Iterative: {iterative_result} (took {iterative_time:.4f}s)")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,150 @@
"""
Various sorting algorithms implemented in Python.
"""
import random
import time
from typing import List
def bubble_sort(arr: List[int]) -> List[int]:
"""Sort array using bubble sort algorithm."""
n = len(arr)
arr = arr.copy() # Don't modify original array
for i in range(n):
for j in range(0, n - i - 1):
if arr[j] > arr[j + 1]:
arr[j], arr[j + 1] = arr[j + 1], arr[j]
return arr
def selection_sort(arr: List[int]) -> List[int]:
"""Sort array using selection sort algorithm."""
n = len(arr)
arr = arr.copy()
for i in range(n):
min_idx = i
for j in range(i + 1, n):
if arr[j] < arr[min_idx]:
min_idx = j
arr[i], arr[min_idx] = arr[min_idx], arr[i]
return arr
def insertion_sort(arr: List[int]) -> List[int]:
"""Sort array using insertion sort algorithm."""
arr = arr.copy()
for i in range(1, len(arr)):
key = arr[i]
j = i - 1
while j >= 0 and arr[j] > key:
arr[j + 1] = arr[j]
j -= 1
arr[j + 1] = key
return arr
def quick_sort(arr: List[int]) -> List[int]:
"""Sort array using quick sort algorithm."""
if len(arr) <= 1:
return arr
pivot = arr[len(arr) // 2]
left = [x for x in arr if x < pivot]
middle = [x for x in arr if x == pivot]
right = [x for x in arr if x > pivot]
return quick_sort(left) + middle + quick_sort(right)
def merge_sort(arr: List[int]) -> List[int]:
"""Sort array using merge sort algorithm."""
if len(arr) <= 1:
return arr
mid = len(arr) // 2
left = merge_sort(arr[:mid])
right = merge_sort(arr[mid:])
return merge(left, right)
def merge(left: List[int], right: List[int]) -> List[int]:
"""Merge two sorted arrays."""
result = []
i = j = 0
while i < len(left) and j < len(right):
if left[i] <= right[j]:
result.append(left[i])
i += 1
else:
result.append(right[j])
j += 1
result.extend(left[i:])
result.extend(right[j:])
return result
def benchmark_sorting_algorithms():
"""Benchmark different sorting algorithms."""
sizes = [100, 500, 1000, 2000]
algorithms = {
"Bubble Sort": bubble_sort,
"Selection Sort": selection_sort,
"Insertion Sort": insertion_sort,
"Quick Sort": quick_sort,
"Merge Sort": merge_sort
}
print("Sorting Algorithm Benchmark")
print("=" * 50)
for size in sizes:
print(f"\nArray size: {size}")
print("-" * 30)
# Generate random array
test_array = [random.randint(1, 1000) for _ in range(size)]
for name, algorithm in algorithms.items():
start_time = time.time()
sorted_array = algorithm(test_array)
end_time = time.time()
# Verify sorting is correct
is_sorted = all(sorted_array[i] <= sorted_array[i+1] for i in range(len(sorted_array)-1))
print(f"{name:15}: {end_time - start_time:.4f}s {'' if is_sorted else ''}")
def main():
"""Main function to demonstrate sorting algorithms."""
print("Sorting Algorithms Demo")
print("=" * 30)
# Test with small array
test_array = [64, 34, 25, 12, 22, 11, 90]
print(f"Original array: {test_array}")
algorithms = {
"Bubble Sort": bubble_sort,
"Selection Sort": selection_sort,
"Insertion Sort": insertion_sort,
"Quick Sort": quick_sort,
"Merge Sort": merge_sort
}
for name, algorithm in algorithms.items():
sorted_array = algorithm(test_array)
print(f"{name}: {sorted_array}")
# Run benchmark
print("\n" + "=" * 50)
benchmark_sorting_algorithms()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,571 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python to C++ Code Translator using LLMs\n",
"\n",
"This notebook translates Python code to compilable C++ using GPT, Gemini, or Claude.\n",
"\n",
"## Features:\n",
"- 🤖 Multiple LLM support (GPT, Gemini, Claude)\n",
"- ✅ Automatic compilation testing with g++\n",
"- 🔄 Comparison mode to test all LLMs\n",
"- 💬 Interactive translation mode"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Install Required Packages\n",
"\n",
"Run this cell first to install all dependencies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!uv add openai anthropic python-dotenv google-generativeai\n",
"#!pip install openai anthropic python-dotenv google-generativeai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import subprocess\n",
"import tempfile\n",
"from pathlib import Path\n",
"from dotenv import load_dotenv\n",
"import openai\n",
"from anthropic import Anthropic\n",
"import google.generativeai as genai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Load API Keys\n",
"\n",
"Make sure you have a `.env` file with:\n",
"```\n",
"OPENAI_API_KEY=your_key_here\n",
"GEMINI_API_KEY=your_key_here\n",
"ANTHROPIC_API_KEY=your_key_here\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load API keys from .env file\n",
"load_dotenv()\n",
"\n",
"# Initialize API clients\n",
"openai_client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))\n",
"anthropic_client = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))\n",
"genai.configure(api_key=os.getenv('GEMINI_API_KEY'))\n",
"\n",
"print(\"✓ API keys loaded successfully\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Define System Prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SYSTEM_PROMPT = \"\"\"You are an expert programmer that translates Python code to C++.\n",
"Translate the given Python code to efficient, compilable C++ code.\n",
"\n",
"Requirements:\n",
"- The C++ code must compile without errors\n",
"- Include all necessary headers\n",
"- Use modern C++ (C++11 or later) features where appropriate\n",
"- Add proper error handling\n",
"- Maintain the same functionality as the Python code\n",
"- Include a main() function if the Python code has executable statements\n",
"\n",
"Only return the C++ code, no explanations unless there are important notes about compilation.\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: LLM Translation Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def translate_with_gpt(python_code, model=\"gpt-4o\"):\n",
" \"\"\"Translate Python to C++ using OpenAI's GPT models\"\"\"\n",
" try:\n",
" response = openai_client.chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
" {\"role\": \"user\", \"content\": f\"Translate this Python code to C++:\\n\\n{python_code}\"}\n",
" ],\n",
" temperature=0.2\n",
" )\n",
" return response.choices[0].message.content\n",
" except Exception as e:\n",
" return f\"Error with GPT: {str(e)}\"\n",
"\n",
"def translate_with_gemini(python_code, model=\"gemini-2.0-flash-exp\"):\n",
" \"\"\"Translate Python to C++ using Google's Gemini\"\"\"\n",
" try:\n",
" model_instance = genai.GenerativeModel(model)\n",
" prompt = f\"{SYSTEM_PROMPT}\\n\\nTranslate this Python code to C++:\\n\\n{python_code}\"\n",
" response = model_instance.generate_content(prompt)\n",
" return response.text\n",
" except Exception as e:\n",
" return f\"Error with Gemini: {str(e)}\"\n",
"\n",
"def translate_with_claude(python_code, model=\"claude-sonnet-4-20250514\"):\n",
" \"\"\"Translate Python to C++ using Anthropic's Claude\"\"\"\n",
" try:\n",
" response = anthropic_client.messages.create(\n",
" model=model,\n",
" max_tokens=4096,\n",
" temperature=0.2,\n",
" system=SYSTEM_PROMPT,\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": f\"Translate this Python code to C++:\\n\\n{python_code}\"}\n",
" ]\n",
" )\n",
" return response.content[0].text\n",
" except Exception as e:\n",
" return f\"Error with Claude: {str(e)}\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Main Translation Function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def translate_python_to_cpp(python_code, llm=\"gpt\", model=None):\n",
" \"\"\"\n",
" Translate Python code to C++ using specified LLM\n",
" \n",
" Args:\n",
" python_code (str): Python code to translate\n",
" llm (str): LLM to use ('gpt', 'gemini', or 'claude')\n",
" model (str): Specific model version (optional)\n",
" \n",
" Returns:\n",
" str: Translated C++ code\n",
" \"\"\"\n",
" print(f\"🔄 Translating with {llm.upper()}...\")\n",
" \n",
" if llm.lower() == \"gpt\":\n",
" model = model or \"gpt-4o\"\n",
" cpp_code = translate_with_gpt(python_code, model)\n",
" elif llm.lower() == \"gemini\":\n",
" model = model or \"gemini-2.0-flash-exp\"\n",
" cpp_code = translate_with_gemini(python_code, model)\n",
" elif llm.lower() == \"claude\":\n",
" model = model or \"claude-sonnet-4-20250514\"\n",
" cpp_code = translate_with_claude(python_code, model)\n",
" else:\n",
" return \"Error: Invalid LLM. Choose 'gpt', 'gemini', or 'claude'\"\n",
" \n",
" return cpp_code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7: Compilation Testing Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def extract_cpp_code(text):\n",
" \"\"\"Extract C++ code from markdown code blocks if present\"\"\"\n",
" if \"```cpp\" in text:\n",
" start = text.find(\"```cpp\") + 6\n",
" end = text.find(\"```\", start)\n",
" return text[start:end].strip()\n",
" elif \"```c++\" in text:\n",
" start = text.find(\"```c++\") + 6\n",
" end = text.find(\"```\", start)\n",
" return text[start:end].strip()\n",
" elif \"```\" in text:\n",
" start = text.find(\"```\") + 3\n",
" end = text.find(\"```\", start)\n",
" return text[start:end].strip()\n",
" return text.strip()\n",
"\n",
"def compile_cpp_code(cpp_code, output_name=\"translated_program\"):\n",
" \"\"\"\n",
" Compile C++ code and return compilation status\n",
" \n",
" Args:\n",
" cpp_code (str): C++ code to compile\n",
" output_name (str): Name of output executable\n",
" \n",
" Returns:\n",
" dict: Compilation result with status and messages\n",
" \"\"\"\n",
" # Extract code from markdown if present\n",
" cpp_code = extract_cpp_code(cpp_code)\n",
" \n",
" # Create temporary directory\n",
" with tempfile.TemporaryDirectory() as tmpdir:\n",
" cpp_file = Path(tmpdir) / \"program.cpp\"\n",
" exe_file = Path(tmpdir) / output_name\n",
" \n",
" # Write C++ code to file\n",
" with open(cpp_file, 'w') as f:\n",
" f.write(cpp_code)\n",
" \n",
" # Try to compile\n",
" try:\n",
" result = subprocess.run(\n",
" ['g++', '-std=c++17', str(cpp_file), '-o', str(exe_file)],\n",
" capture_output=True,\n",
" text=True,\n",
" timeout=10\n",
" )\n",
" \n",
" if result.returncode == 0:\n",
" return {\n",
" 'success': True,\n",
" 'message': '✓ Compilation successful!',\n",
" 'executable': str(exe_file),\n",
" 'stdout': result.stdout,\n",
" 'stderr': result.stderr\n",
" }\n",
" else:\n",
" return {\n",
" 'success': False,\n",
" 'message': '✗ Compilation failed',\n",
" 'stdout': result.stdout,\n",
" 'stderr': result.stderr\n",
" }\n",
" except subprocess.TimeoutExpired:\n",
" return {\n",
" 'success': False,\n",
" 'message': '✗ Compilation timed out'\n",
" }\n",
" except FileNotFoundError:\n",
" return {\n",
" 'success': False,\n",
" 'message': '✗ g++ compiler not found. Please install g++ to compile C++ code.'\n",
" }\n",
" except Exception as e:\n",
" return {\n",
" 'success': False,\n",
" 'message': f'✗ Compilation error: {str(e)}'\n",
" }"
]
},
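{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check for the compilation pipeline (assumes g++ is on PATH).\n",
"# Compiles a trivial hand-written program directly, without involving any LLM.\n",
"trivial_cpp = '#include <iostream>\\nint main() { std::cout << \"ok\" << std::endl; return 0; }'\n",
"print(compile_cpp_code(trivial_cpp)['message'])"
]
},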
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 8: Complete Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def translate_and_compile(python_code, llm=\"gpt\", model=None, verbose=True):\n",
" \"\"\"\n",
" Translate Python to C++ and attempt compilation\n",
" \n",
" Args:\n",
" python_code (str): Python code to translate\n",
" llm (str): LLM to use\n",
" model (str): Specific model version\n",
" verbose (bool): Print detailed output\n",
" \n",
" Returns:\n",
" dict: Results including translated code and compilation status\n",
" \"\"\"\n",
" # Translate\n",
" cpp_code = translate_python_to_cpp(python_code, llm, model)\n",
" \n",
" if verbose:\n",
" print(\"\\n\" + \"=\"*60)\n",
" print(\"TRANSLATED C++ CODE:\")\n",
" print(\"=\"*60)\n",
" print(cpp_code)\n",
" print(\"=\"*60 + \"\\n\")\n",
" \n",
" # Compile\n",
" print(\"🔨 Attempting to compile...\")\n",
" compilation_result = compile_cpp_code(cpp_code)\n",
" \n",
" if verbose:\n",
" print(compilation_result['message'])\n",
" if not compilation_result['success'] and 'stderr' in compilation_result:\n",
" print(\"\\nCompilation errors:\")\n",
" print(compilation_result['stderr'])\n",
" \n",
" return {\n",
" 'cpp_code': cpp_code,\n",
" 'compilation': compilation_result\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1: Factorial Function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"python_code_1 = \"\"\"\n",
"def factorial(n):\n",
" if n <= 1:\n",
" return 1\n",
" return n * factorial(n - 1)\n",
"\n",
"# Test the function\n",
"print(factorial(5))\n",
"\"\"\"\n",
"\n",
"print(\"Example 1: Factorial Function\")\n",
"print(\"=\"*60)\n",
"result1 = translate_and_compile(python_code_1, llm=\"gpt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 2: Sum of Squares"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"python_code_2 = \"\"\"\n",
"def sum_of_squares(numbers):\n",
" return sum(x**2 for x in numbers)\n",
"\n",
"numbers = [1, 2, 3, 4, 5]\n",
"result = sum_of_squares(numbers)\n",
"print(f\"Sum of squares: {result}\")\n",
"\"\"\"\n",
"\n",
"print(\"Example 2: Sum of Squares\")\n",
"print(\"=\"*60)\n",
"result2 = translate_and_compile(python_code_2, llm=\"claude\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 3: Fibonacci with Gemini"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"python_code_3 = \"\"\"\n",
"def fibonacci(n):\n",
" if n <= 1:\n",
" return n\n",
" a, b = 0, 1\n",
" for _ in range(2, n + 1):\n",
" a, b = b, a + b\n",
" return b\n",
"\n",
"print(f\"Fibonacci(10) = {fibonacci(10)}\")\n",
"\"\"\"\n",
"\n",
"print(\"Example 3: Fibonacci with Gemini\")\n",
"print(\"=\"*60)\n",
"result3 = translate_and_compile(python_code_3, llm=\"gemini\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare All LLMs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compare_llms(python_code):\n",
" \"\"\"Compare all three LLMs on the same Python code\"\"\"\n",
" llms = [\"gpt\", \"gemini\", \"claude\"]\n",
" results = {}\n",
" \n",
" for llm in llms:\n",
" print(f\"\\n{'='*60}\")\n",
" print(f\"Testing with {llm.upper()}\")\n",
" print('='*60)\n",
" results[llm] = translate_and_compile(python_code, llm=llm, verbose=False)\n",
" print(results[llm]['compilation']['message'])\n",
" \n",
" return results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test code for comparison\n",
"python_code_compare = \"\"\"\n",
"def is_prime(n):\n",
" if n < 2:\n",
" return False\n",
" for i in range(2, int(n**0.5) + 1):\n",
" if n % i == 0:\n",
" return False\n",
" return True\n",
"\n",
"primes = [x for x in range(2, 20) if is_prime(x)]\n",
"print(f\"Primes under 20: {primes}\")\n",
"\"\"\"\n",
"\n",
"print(\"COMPARING ALL LLMs\")\n",
"print(\"=\"*60)\n",
"comparison_results = compare_llms(python_code_compare)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interactive Translation Mode\n",
"\n",
"Use this cell to translate your own Python code interactively:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your custom Python code here\n",
"your_python_code = \"\"\"\n",
"# Paste your Python code here\n",
"def hello_world():\n",
" print(\"Hello, World!\")\n",
"\n",
"hello_world()\n",
"\"\"\"\n",
"\n",
"# Choose your LLM: \"gpt\", \"gemini\", or \"claude\"\n",
"chosen_llm = \"gpt\"\n",
"\n",
"result = translate_and_compile(your_python_code, llm=chosen_llm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"You now have a complete Python to C++ translator! \n",
"\n",
"### Main Functions:\n",
"- `translate_python_to_cpp(code, llm, model)` - Translate only\n",
"- `translate_and_compile(code, llm, model)` - Translate and compile\n",
"- `compare_llms(code)` - Compare all three LLMs\n",
"\n",
"### Supported LLMs:\n",
"- **gpt** - OpenAI GPT-4o\n",
"- **gemini** - Google Gemini 2.0 Flash\n",
"- **claude** - Anthropic Claude Sonnet 4\n",
"\n",
"Happy translating! 🚀"
]
}
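,
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical quick-start: translate a one-liner and inspect the C++ without compiling.\n",
"# Uncomment to run against your configured API keys.\n",
"# snippet = \"print(sum(range(10)))\"\n",
"# cpp = translate_python_to_cpp(snippet, llm=\"gpt\")\n",
"# print(extract_cpp_code(cpp))"
]
}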
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,569 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c1fcc6e9",
"metadata": {},
"source": [
"# Code Converter - Python to TypeScript Code\n",
"\n",
"This implementation, converts python code to optimized TypeScript Code, and runs the function"
]
},
{
"cell_type": "markdown",
"id": "16b6b063",
"metadata": {},
"source": [
"## Set up and imports\n"
]
},
{
"cell_type": "code",
"execution_count": 115,
"id": "b3dc394c",
"metadata": {},
"outputs": [],
"source": [
"\n",
"import os\n",
"import io\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import subprocess\n",
"from IPython.display import Markdown, display, display_markdown\n",
"from system_info import retrieve_system_info\n",
"import gradio as gr"
]
},
{
"cell_type": "markdown",
"id": "1c9a0936",
"metadata": {},
"source": [
"# Initializing the access keys"
]
},
{
"cell_type": "code",
"execution_count": 116,
"id": "fac104ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key exists and begins sk-proj-\n"
]
}
],
"source": [
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set. Check your engironment variables and try again\")"
]
},
{
"cell_type": "markdown",
"id": "5932182f",
"metadata": {},
"source": [
"# Connecting to client libraries"
]
},
{
"cell_type": "code",
"execution_count": 117,
"id": "4000f231",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 118,
"id": "51c67ac0",
"metadata": {},
"outputs": [],
"source": [
"# contants\n",
"OPENAI_MODEL= \"gpt-5-nano\""
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "ab4342bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'os': {'system': 'Darwin',\n",
" 'arch': 'arm64',\n",
" 'release': '24.5.0',\n",
" 'version': 'Darwin Kernel Version 24.5.0: Tue Apr 22 19:48:46 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T8103',\n",
" 'kernel': '24.5.0',\n",
" 'distro': None,\n",
" 'wsl': False,\n",
" 'rosetta2_translated': False,\n",
" 'target_triple': 'arm64-apple-darwin24.5.0'},\n",
" 'package_managers': ['xcode-select (CLT)', 'brew'],\n",
" 'cpu': {'brand': 'Apple M1',\n",
" 'cores_logical': 8,\n",
" 'cores_physical': 8,\n",
" 'simd': []},\n",
" 'toolchain': {'compilers': {'gcc': 'Apple clang version 17.0.0 (clang-1700.0.13.3)',\n",
" 'g++': 'Apple clang version 17.0.0 (clang-1700.0.13.3)',\n",
" 'clang': 'Apple clang version 17.0.0 (clang-1700.0.13.3)',\n",
" 'msvc_cl': ''},\n",
" 'build_tools': {'cmake': '', 'ninja': '', 'make': 'GNU Make 3.81'},\n",
" 'linkers': {'ld_lld': ''}}}"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"system_info = retrieve_system_info()\n",
"system_info"
]
},
{
"cell_type": "code",
"execution_count": 120,
"id": "1a1c1324",
"metadata": {},
"outputs": [],
"source": [
"message = f\"\"\"\n",
"Here is a report of the system information for my computer.\n",
"I want to run a TypeScript compiler to compile a single TypeScript file called main.cpp and then execute it in the simplest way possible.\n",
"Please reply with whether I need to install any TypeScript compiler to do this. If so, please provide the simplest step by step instructions to do so.\n",
"\n",
"If I'm already set up to compile TypeScript code, then I'd like to run something like this in Python to compile and execute the code:\n",
"```python\n",
"compile_command = # something here - to achieve the fastest possible runtime performance\n",
"compile_result = subprocess.run(compile_command, check=True, text=True, capture_output=True)\n",
"run_command = # something here\n",
"run_result = subprocess.run(run_command, check=True, text=True, capture_output=True)\n",
"return run_result.stdout\n",
"```\n",
"Please tell me exactly what I should use for the compile_command and run_command.\n",
"\n",
"System information:\n",
"{system_info}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 121,
"id": "439015c1",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Short answer:\n",
"- Yes, to compile TypeScript you need a TypeScript compiler (tsc). On macOS youll typically install Node.js first, then install TypeScript.\n",
"- Important: main.cpp sounds like a C++ file. The TypeScript compiler (tsc) cannot compile .cpp. If you want to use TypeScript, rename the file to main.ts (and ensure its contents are TypeScript). If you actually meant C++, use a C++ compiler instead (clang/g++).\n",
"\n",
"Step-by-step to set up TypeScript (simplest path on your system):\n",
"1) Install Node.js (which also installs npm)\n",
"- brew update\n",
"- brew install node\n",
"\n",
"2) Install the TypeScript compiler globally\n",
"- npm install -g typescript\n",
"\n",
"3) Verify installations\n",
"- node -v\n",
"- npm -v\n",
"- tsc -v\n",
"\n",
"4) Compile and run a TypeScript file (assuming your file is main.ts)\n",
"- tsc main.ts\n",
"- node main.js\n",
"\n",
"Notes:\n",
"- If your file is indeed C++ (main.cpp), you cannot compile it with tsc. To compile C++, use clang++ (on macOS) or g++:\n",
" - clang++ -std=c++17 main.cpp -o main\n",
" - ./main\n",
"\n",
"Python integration (fill-in for your example)\n",
"- If you have a TypeScript file named main.ts and you want to compile it to JavaScript and then run it with Node, use:\n",
" compile_command = [\"tsc\", \"main.ts\"]\n",
" run_command = [\"node\", \"main.js\"]\n",
"\n",
"- If you want to show a single command in Python that compiles and runs in one go (still two steps because TS compiles to JS first):\n",
" compile_command = [\"tsc\", \"main.ts\"]\n",
" run_command = [\"node\", \"main.js\"]\n",
"\n",
"- If you truly want to bypass TypeScript and run C++ instead (not TypeScript):\n",
" compile_command = [\"clang++\", \"-std=c++17\", \"main.cpp\", \"-o\", \"main\"]\n",
" run_command = [\"./main\"]\n",
"\n",
"If youd like, tell me whether main.cpp is meant to be C++ or you actually have a TypeScript file named main.ts, and I can tailor the exact commands."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"response = openai.chat.completions.create(model=OPENAI_MODEL, messages=[{\"role\":\"user\", \"content\":message}])\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "code",
"execution_count": 122,
"id": "576cb5fa",
"metadata": {},
"outputs": [],
"source": [
"compile_command = [\"tsc\", \"main.ts\", \"--target\", \"ES2020\", \"--module\", \"commonjs\"]\n",
"run_command = [\"ts-node\", \"main.ts\"]"
]
},
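{
"cell_type": "markdown",
"id": "ad0c0de1",
"metadata": {},
"source": [
"Before shelling out to these commands, it is worth confirming the toolchain is actually on the PATH. A minimal sanity-check sketch, assuming `tsc` and `node` were installed as described in the reply above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0de2",
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"\n",
"# Confirm the TypeScript compiler and Node runtime are discoverable before\n",
"# subprocess.run is asked to invoke them\n",
"for tool in (\"tsc\", \"node\"):\n",
"    path = shutil.which(tool)\n",
"    print(f\"{tool}: {path or 'NOT FOUND - install it first'}\")"
]
},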
{
"cell_type": "markdown",
"id": "01b03700",
"metadata": {},
"source": [
"## System and user prompts for the code converter"
]
},
{
"cell_type": "code",
"execution_count": 123,
"id": "255e318b",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"\"\"\n",
"Your task is to convert Python code into high performance TypeScript code.\n",
"Respond only with TypeScript code. Do not provide any explanation other than occasional comments.\n",
"The TypeScript response needs to produce an identical output in the fastest possible time.\n",
"\"\"\"\n",
"\n",
"\n",
"def user_prompt_for(python):\n",
" return f\"\"\" \n",
" port this Python code to TypeScript with the fastest possible implementation that produces identical output in the least time.\n",
"\n",
" The system information is \n",
"\n",
" {system_info}\n",
"\n",
" Your response will be written to a file called main.ts and then compile and ecexted; the compilation command is:\n",
"\n",
" {compile_command}\n",
"\n",
" Respond only with C++ code.\n",
" Python code to port:\n",
"\n",
" ```python\n",
" {python}\n",
" ```\n",
"\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 124,
"id": "09da7cb1",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(python):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(python)},\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 125,
"id": "abcdb617",
"metadata": {},
"outputs": [],
"source": [
"def write_output(code):\n",
" with open(\"main.ts\", \"w\", encoding=\"utf-8\") as f:\n",
" f.write(code)"
]
},
{
"cell_type": "code",
"execution_count": 126,
"id": "c7a32d5f",
"metadata": {},
"outputs": [],
"source": [
"def convert(python):\n",
" reasoning_effort = \"high\"\n",
" response = openai.chat.completions.create(\n",
" model=OPENAI_MODEL,\n",
" messages=messages_for(python),\n",
" reasoning_effort=reasoning_effort,\n",
" )\n",
" reply = response.choices[0].message.content\n",
" reply = reply.replace(\"```ts\", \"\").replace(\"```\", \"\")\n",
" return reply"
]
},
{
"cell_type": "code",
"execution_count": 127,
"id": "59a7ec1f",
"metadata": {},
"outputs": [],
"source": [
"pi = \"\"\"\n",
"import time\n",
"\n",
"def calculate(iterations, param1, param2):\n",
" result = 1.0\n",
" for i in range(1, iterations+1):\n",
" j = i * param1 - param2\n",
" result -= (1/j)\n",
" j = i * param1 + param2\n",
" result += (1/j)\n",
" return result\n",
"\n",
"start_time = time.time()\n",
"result = calculate(200_000_000, 4, 1) * 4\n",
"end_time = time.time()\n",
"\n",
"print(f\"Result: {result:.12f}\")\n",
"print(f\"Execution Time: {(end_time - start_time):.6f} seconds\")\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 128,
"id": "6856393b",
"metadata": {},
"outputs": [],
"source": [
"def run_python(code):\n",
" globals_dict = {\"__builtins__\": __builtins__}\n",
"\n",
" buffer = io.StringIO()\n",
" old_stdout = sys.stdout\n",
" sys.stdout = buffer\n",
"\n",
" try:\n",
" exec(code, globals_dict)\n",
" output = buffer.getvalue()\n",
" except Exception as e:\n",
" output = f\"Error: {e}\"\n",
" finally:\n",
" sys.stdout = old_stdout\n",
"\n",
" return output"
]
},
{
"cell_type": "code",
"execution_count": 129,
"id": "c51fa5ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Result: 3.141592656089\\nExecution Time: 19.478347 seconds\\n'"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"run_python(pi)"
]
},
{
"cell_type": "code",
"execution_count": 130,
"id": "69eb2304",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"import { performance } from 'perf_hooks';\\n\\nfunction digamma(z: number): number {\\n let acc = 0;\\n while (z < 7) {\\n acc -= 1 / z;\\n z += 1;\\n }\\n const z2 = z * z;\\n const z4 = z2 * z2;\\n const z6 = z4 * z2;\\n const z8 = z4 * z4;\\n const z10 = z8 * z2;\\n const z12 = z10 * z2;\\n const series =\\n Math.log(z)\\n - 1 / (2 * z)\\n - 1 / (12 * z2)\\n + 1 / (120 * z4)\\n - 1 / (252 * z6)\\n + 1 / (240 * z8)\\n - 5 / (660 * z10)\\n + 691 / (32760 * z12);\\n return acc + series;\\n}\\n\\nconst N = 200_000_000;\\n\\nconst t0 = performance.now();\\nconst result =\\n 4 - digamma(N + 0.75) + digamma(0.75) + digamma(N + 1.25) - digamma(1.25);\\nconst t1 = performance.now();\\n\\nconsole.log(`Result: ${result.toFixed(12)}`);\\nconsole.log(`Execution Time: ${((t1 - t0) / 1000).toFixed(6)} seconds`);\""
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"convert(pi)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"id": "2ea56d95",
"metadata": {},
"outputs": [],
"source": [
" \n",
"def run_typescript(code):\n",
" write_output(code)\n",
" try:\n",
" subprocess.run(compile_command, check=True, text=True, capture_output=True)\n",
" run_result = subprocess.run(run_command, check=True, text=True, capture_output=True)\n",
" return run_result.stdout\n",
" except subprocess.CalledProcessError as e:\n",
" return f\"An error occurred:\\n{e.stderr}\""
]
},
{
"cell_type": "code",
"execution_count": 132,
"id": "79d6bd87",
"metadata": {},
"outputs": [],
"source": [
"# run_typescript()"
]
},
{
"cell_type": "markdown",
"id": "b4799b88",
"metadata": {},
"source": [
"## User Interface"
]
},
{
"cell_type": "code",
"execution_count": 133,
"id": "8486ce70",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7864\n",
"* To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7864/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with gr.Blocks(\n",
" theme=gr.themes.Monochrome(), title=\"Port from Python to TypeScript\"\n",
") as ui:\n",
" with gr.Row(equal_height=True):\n",
" with gr.Column(scale=6):\n",
" python = gr.Code(\n",
" label=\"Python Original Code\",\n",
" value=pi,\n",
" language=\"python\",\n",
" lines=30,\n",
" )\n",
" with gr.Column(scale=6):\n",
" ts = gr.Code(\n",
" label=\"TypeScript (generated)\", value=\"\", language=\"cpp\", lines=26\n",
" )\n",
" with gr.Row(elem_classes=[\"controls\"]):\n",
" python_run = gr.Button(\"Run Python\", elem_classes=[\"run-btn\", \"py\"])\n",
" port = gr.Button(\"Convert to TS\", elem_classes=[\"convert-btn\"])\n",
" ts_run = gr.Button(\"Run TS\", elem_classes=[\"run-btn\", \"ts\"])\n",
"\n",
" with gr.Row(equal_height=True):\n",
" with gr.Column(scale=6):\n",
" python_out = gr.TextArea(label=\"Python Result\", lines=10)\n",
" with gr.Column(scale=6):\n",
" ts_out = gr.TextArea(label=\"TS output\", lines=10)\n",
"\n",
" port.click(fn=convert, inputs=[python], outputs=[ts])\n",
" python_run.click(fn=run_python, inputs=[python], outputs=[python_out])\n",
" ts_run.click(fn=run_typescript, inputs=[ts], outputs=[ts_out])\n",
" \n",
" \n",
"ui.launch(inbrowser=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9033e421",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,180 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "ed8c52b6",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"ollama_api_key = os.getenv('OLLAMA_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
"\n",
"if ollama_api_key:\n",
" print(f\"OLLAMA API Key exists and begins {ollama_api_key[:2]}\")\n",
"else:\n",
" print(\"OLLAMA API Key not set (and this is optional)\")\n",
"\n",
"ollama_url = \"http://localhost:11434/v1\"\n",
"\n",
"openai = OpenAI()\n",
"ollama = OpenAI(api_key=ollama_api_key, base_url=ollama_url)\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "c628f95e",
"metadata": {},
"outputs": [],
"source": [
"system_prompt_doc = \"\"\"You are an expert Python developer and code reviewer.\n",
"Your job is to read the user's provided function, and return:\n",
"1. A concise, PEP-257-compliant docstring summarizing what the function does, clarifying types, parameters, return values, and side effects.\n",
"2. Helpful inline comments that improve both readability and maintainability, without restating what the code obviously does.\n",
"\n",
"Only output the function, not explanations or additional text. \n",
"Do not modify variable names or refactor the function logic.\n",
"Your response should improve the code's clarity and documentation, making it easier for others to understand and maintain.\n",
"Don't be extremely verbose.\n",
"Your answer should be at a senior level of expertise.\n",
"\"\"\"\n",
"\n",
"system_prompt_tests = \"\"\"You are a seasoned Python developer and testing expert.\n",
"Your task is to read the user's provided function, and generate:\n",
"1. A concise set of meaningful unit tests that thoroughly validate the function's correctness, including typical, edge, and error cases.\n",
"2. The tests should be written for pytest (or unittest if pytest is not appropriate), use clear, descriptive names, and avoid unnecessary complexity.\n",
"3. If dependencies or mocking are needed, include minimal necessary setup code (but avoid over-mocking).\n",
"\n",
"Only output the relevant test code, not explanations or extra text.\n",
"Do not change the original function; focus solely on comprehensive, maintainable test coverage that other developers can easily understand and extend.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "4bb84e6c",
"metadata": {},
"outputs": [],
"source": [
"models = [\"gpt-4.1-mini\", \"llama3.1\"]\n",
"clients = {\"gpt-4.1-mini\": openai, \"llama3.1\": ollama}\n",
"\n",
"def generate_documentation(code, model):\n",
" response = clients[model].chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt_doc},\n",
" {\"role\": \"user\", \"content\": code}\n",
" ],\n",
" stream=True\n",
" )\n",
" output = \"\"\n",
" for chunk in response:\n",
" output += chunk.choices[0].delta.content or \"\"\n",
" yield output.replace(\"```python\", \"\").replace(\"```\", \"\")\n",
"\n",
"def generate_tests(code, model):\n",
" response = clients[model].chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt_tests},\n",
" {\"role\": \"user\", \"content\": code}\n",
" ],\n",
" stream=True\n",
" )\n",
" output = \"\"\n",
" for chunk in response:\n",
" output += chunk.choices[0].delta.content or \"\"\n",
" yield output.replace(\"```python\", \"\").replace(\"```\", \"\")"
]
},
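{
"cell_type": "markdown",
"id": "ad0c0de3",
"metadata": {},
"source": [
"Both generators stream their output: each `yield` returns the accumulated text so far, which is what lets Gradio render partial results live. Outside the UI they can be consumed like any Python generator. A minimal sketch, assuming your OpenAI key is configured and using a hypothetical toy function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0de4",
"metadata": {},
"outputs": [],
"source": [
"sample = \"\"\"def add(a, b):\n",
"    return a + b\"\"\"\n",
"\n",
"# Drain the stream; the last yielded value is the complete answer\n",
"final = \"\"\n",
"for partial in generate_documentation(sample, \"gpt-4.1-mini\"):\n",
"    final = partial\n",
"print(final)"
]
},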
{
"cell_type": "code",
"execution_count": null,
"id": "a4e65b26",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks(theme=gr.themes.Soft(spacing_size=gr.themes.sizes.spacing_sm, radius_size=gr.themes.sizes.radius_none)) as ui:\n",
" gr.Markdown(\"# Python Toolbox\", elem_id=\"app-title\")\n",
" \n",
" with gr.Tab(\"Docstring Generator\") as tab1:\n",
" gr.Markdown(\"## Docstring & Comment Generator\")\n",
" gr.Markdown(\"Paste your function below to generate helpful docstrings and inline comments!\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" code_input = gr.Code(label=\"Your Python function here\", lines=20, language=\"python\")\n",
" model_dropdown = gr.Dropdown(choices=models, value=models[0], label=\"Select model\")\n",
" submit_doc_btn = gr.Button(\"Generate docstring & comments\")\n",
" with gr.Column():\n",
" code_output = gr.Code(label=\"New function with docstring and comments\", language=\"python\")\n",
"\n",
" submit_doc_btn.click(\n",
" generate_documentation, \n",
" inputs=[code_input, model_dropdown], \n",
" outputs=code_output\n",
" )\n",
"\n",
" with gr.Tab(\"Unit Tests Generator\") as tab2:\n",
" gr.Markdown(\"## Unit Test Generator\")\n",
" gr.Markdown(\"Paste your function below to generate helpful unit tests!\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" code_input_2 = gr.Code(label=\"Your Python function here\", lines=20, language=\"python\")\n",
" model_dropdown_2 = gr.Dropdown(choices=models, value=models[0], label=\"Select model\")\n",
" submit_test_btn = gr.Button(\"Generate unit tests\")\n",
" with gr.Column():\n",
" code_output_2 = gr.Code(label=\"Generated unit tests\", language=\"python\")\n",
"\n",
" submit_test_btn.click(\n",
" generate_tests, \n",
" inputs=[code_input_2, model_dropdown_2], \n",
" outputs=code_output_2\n",
" )\n",
" \n",
" \n",
" tab1.select(lambda x: x, inputs=code_input_2, outputs=code_input)\n",
" tab2.select(lambda x: x, inputs=code_input, outputs=code_input_2)\n",
"\n",
"ui.launch(share=False, inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,307 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "d04a7c55",
"metadata": {},
"outputs": [],
"source": [
"#Importing necessary libraries\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from anthropic import Client\n",
"from dotenv import load_dotenv\n",
"import sys\n",
"from faker import Faker\n",
"import random\n",
"import gradio as gr\n",
"from langchain_community.document_loaders import DirectoryLoader, TextLoader\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_anthropic import ChatAnthropic\n",
"from langchain_classic.memory import ConversationBufferMemory\n",
"from langchain_classic.chains import ConversationalRetrievalChain\n",
"\n",
"!{sys.executable} -m pip install faker\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d7f8354",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# loading the .env variables\n",
"load_dotenv(override=True)\n",
"\n",
"# Force export to OS env so LangChain can detect it (had to try this because the key was not loading at some point but by the time i shared the code it loaded well so i commented it out)\n",
"#os.environ[\"ANTHROPIC_API_KEY\"] = os.getenv(\"ANTHROPIC_API_KEY\")\n",
"\n",
"#getting the key from the our .env file. It is Anthropic_API_KEY\n",
"ANTHROPIC_KEY = os.getenv(\"ANTHROPIC_API_KEY\")\n",
"client = Client(api_key=ANTHROPIC_KEY)\n",
"\n",
"# Checking the anthropic models list our anthropic key ca help us play with\n",
"models = client.models.list()\n",
"for model in models:\n",
" print(model.id)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20d11d1c",
"metadata": {},
"outputs": [],
"source": [
"#Getting the python executable path on my notebook to know where to install the faker library\n",
"print(sys.executable)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93a8f3ec",
"metadata": {},
"outputs": [],
"source": [
"#Creating a fake person with faker\n",
"fake = Faker()\n",
"base_dir = \"knowledge_base\"\n",
"folders = [\"personal\", \"projects\", \"learning\"]\n",
"\n",
"# We now create folders if they don't exist\n",
"for folder in folders:\n",
" os.makedirs(f\"{base_dir}/{folder}\", exist_ok=True)\n",
"\n",
"# Check if data already exists\n",
"personal_file = f\"{base_dir}/personal/info.md\"\n",
"projects_file = f\"{base_dir}/projects/projects.md\"\n",
"learning_file = f\"{base_dir}/learning/learning.md\"\n",
"\n",
"#If the personal info file does not exist, create it\n",
"if not os.path.exists(personal_file):\n",
" name = fake.name()\n",
" profession = random.choice([\"Data Analyst\", \"Business Analyst\", \"Software Engineer\", \"AI Specialist\"])\n",
" bio = fake.paragraph(nb_sentences=5)\n",
" experience = \"\\n\".join([f\"- {fake.job()} at {fake.company()} ({fake.year()})\" for _ in range(3)])\n",
" \n",
" personal_text = f\"\"\"\n",
"# Personal Profile\n",
"Name: {name} \n",
"Profession: {profession} \n",
"\n",
"Bio: {bio}\n",
"\n",
"## Experience\n",
"{experience}\n",
"\"\"\"\n",
" with open(personal_file, \"w\") as f:\n",
" f.write(personal_text)\n",
" print(\"Personal info generated.\")\n",
"else:\n",
" #If the personal info file exists, skip the regeneration\n",
" print(\"Personal info already exists. Skipping regeneration.\")\n",
"\n",
"#doing the same for project file\n",
"if not os.path.exists(projects_file):\n",
" projects = \"\\n\".join([\n",
" f\"- **{fake.catch_phrase()}** — {fake.bs().capitalize()} for {fake.company()}.\"\n",
" for _ in range(5)\n",
" ])\n",
" projects_text = f\"\"\"\n",
"# Projects Portfolio\n",
"\n",
"Key Projects:\n",
"{projects}\n",
"\"\"\"\n",
" with open(projects_file, \"w\") as f:\n",
" f.write(projects_text)\n",
" print(\"Projects generated.\")\n",
"else:\n",
" print(\"Projects already exist. Skipping regeneration.\")\n",
"\n",
"#same thing for learning file\n",
"if not os.path.exists(learning_file):\n",
" topics = [\"LangChain\", \"RAG Systems\", \"Vector Databases\", \"AI Ethics\", \"Prompt Engineering\", \"Data Visualization\"]\n",
" learning = \"\\n\".join([\n",
" f\"- {random.choice(topics)} — {fake.sentence(nb_words=8)}\"\n",
" for _ in range(6)\n",
" ])\n",
" learning_text = f\"\"\"\n",
"# Learning Journey\n",
"\n",
"Recent Topics and Notes:\n",
"{learning}\n",
"\"\"\"\n",
" with open(learning_file, \"w\") as f:\n",
" f.write(learning_text)\n",
" print(\"Learning notes generated.\")\n",
"else:\n",
" print(\"Learning notes already exist. Skipping regeneration.\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fa19091",
"metadata": {},
"outputs": [],
"source": [
"#loading the knowledge information from the knowledge_base folder\n",
"loader = DirectoryLoader(\"knowledge_base\", glob=\"**/*.md\", loader_cls=TextLoader)\n",
"documents = loader.load()\n",
"\n",
"#Splitting the documents into chunks\n",
"splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=80)\n",
"chunks = splitter.split_documents(documents)\n",
"\n",
"print(f\"Loaded {len(documents)} documents and created {len(chunks)} chunks.\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dcdec41",
"metadata": {},
"outputs": [],
"source": [
"#Creating the embeddings\n",
"embeddings = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
"\n",
"# Chroma as the vector store\n",
"vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory=\"chroma_db\")\n",
"vectorstore.persist()\n",
"\n",
"print(\"Vector store created and saved to 'chroma_db'.\")\n"
]
},
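{
"cell_type": "markdown",
"id": "ad0c0de5",
"metadata": {},
"source": [
"Before wiring the store into a chain, a quick retrieval check confirms the embeddings landed. A minimal sketch, assuming the vector store built above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0de6",
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the chunks most similar to a test query\n",
"query = \"What projects have I worked on?\"\n",
"for doc in vectorstore.similarity_search(query, k=2):\n",
"    print(doc.metadata.get(\"source\"), \"->\", doc.page_content[:80])"
]
},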
{
"cell_type": "code",
"execution_count": null,
"id": "99e4a99f",
"metadata": {},
"outputs": [],
"source": [
"#Check Langchain version as they updated the version recently thus making it difficult to use it successfullt\n",
"print(langchain.__version__)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dc1b6ce",
"metadata": {},
"outputs": [],
"source": [
"# The main Langchain Abstraction are: Memory, LLM, and Retriever\n",
"\n",
"# Memory for conversation history\n",
"memory = ConversationBufferMemory(\n",
" memory_key=\"chat_history\",\n",
" return_messages=True\n",
")\n",
"\n",
"# Using one of the Anthropic models from the list above to create the LLM\n",
"llm = ChatAnthropic(\n",
" model=\"claude-sonnet-4-5-20250929\",\n",
" temperature=0.6,\n",
" max_tokens=1024,\n",
" anthropic_api_key=ANTHROPIC_KEY\n",
")\n",
"\n",
"# Retriever from your vectorstore\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": 3})\n",
"\n",
"# Bringing everything together tConversational RAG Chain\n",
"conversation_chain = ConversationalRetrievalChain.from_llm(\n",
" llm=llm,\n",
" retriever=retriever,\n",
" memory=memory\n",
")\n",
"\n",
"print(\"Anthropic conversational retriever is ready!\")\n"
]
},
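{
"cell_type": "markdown",
"id": "ad0c0de7",
"metadata": {},
"source": [
"A single direct invocation is a useful smoke test before handing the chain to a UI. A minimal sketch, assuming the chain above initialized (note the exchange is recorded in the conversation memory):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0de8",
"metadata": {},
"outputs": [],
"source": [
"result = conversation_chain.invoke({\"question\": \"What is my profession?\"})\n",
"print(result[\"answer\"])"
]
},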
{
"cell_type": "code",
"execution_count": null,
"id": "6f93eea7",
"metadata": {},
"outputs": [],
"source": [
"#fnc to create a chat interface\n",
"def chat(message, history):\n",
" if conversation_chain:\n",
" result = conversation_chain.invoke({\"question\": message})\n",
" return result[\"answer\"]\n",
" else:\n",
" # Retrieval-only fallback\n",
" docs = retriever.get_relevant_documents(message)\n",
" context = \"\\n\\n\".join([d.page_content for d in docs])\n",
" return f\"(Offline Mode)\\nTop relevant info:\\n\\n{context[:1000]}\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aadf91b4",
"metadata": {},
"outputs": [],
"source": [
"#used som css to make the chat interface look better, and dark mode. I love dark mode btw\n",
"css = \"\"\"\n",
"body {background-color: #0f1117; color: #e6e6e6;}\n",
".gradio-container {background-color: #0f1117 !important;}\n",
"textarea, input, .wrap.svelte-1ipelgc {background-color: #1b1f2a !important; color: #ffffff !important;}\n",
"\"\"\"\n",
"\n",
"#Gradio blocks\n",
"with gr.Blocks(css=css, theme=\"gradio/monochrome\") as demo:\n",
" gr.Markdown(\n",
" \"\"\"\n",
" <h2 style=\"color: #f5f5f5;\">Personal Knowledge Worker</h2>\n",
" <p style=\"color: #f5f5f5;\">Chat with your auto-generated knowledge base (Claude-powered if available)</p>\n",
" \"\"\",\n",
" elem_id=\"title\"\n",
" )\n",
" gr.ChatInterface(chat, type=\"messages\")\n",
"\n",
"demo.launch(inbrowser=True)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,445 @@
#!/usr/bin/env python3
"""
Knowledge Worker with Document Upload and Google Drive Integration
This script creates a knowledge worker that:
1. Allows users to upload documents through a Gradio UI
2. Integrates with Google Drive to access documents
3. Uses Chroma vector database for efficient document retrieval
4. Implements RAG (Retrieval Augmented Generation) for accurate responses
The system updates its context dynamically when new documents are uploaded.
"""
import os
import glob
import tempfile
from pathlib import Path
from dotenv import load_dotenv
import gradio as gr
# LangChain imports
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
# Visualization imports
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
# Removed Google Drive API imports
# Additional document loaders
try:
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredExcelLoader
except ImportError:
print("Warning: Some document loaders not available. PDF and text files will still work.")
Docx2txtLoader = None
UnstructuredExcelLoader = None
# Configuration
MODEL = "gpt-4o-mini" # Using a cost-effective model
DB_NAME = "knowledge_worker_db"
UPLOAD_FOLDER = "uploaded_documents"
# Create upload folder if it doesn't exist
os.makedirs(UPLOAD_FOLDER, exist_ok=True)
# Load environment variables
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
# Removed Google Drive credentials configuration
# Use a simple text splitter approach
class SimpleTextSplitter:
def __init__(self, chunk_size=1000, chunk_overlap=200):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
def split_documents(self, documents):
chunks = []
for doc in documents:
text = doc.page_content
start = 0
while start < len(text):
end = start + self.chunk_size
chunk_text = text[start:end]
chunk_doc = Document(page_content=chunk_text, metadata=doc.metadata.copy())
chunks.append(chunk_doc)
start = end - self.chunk_overlap
return chunks
CharacterTextSplitter = SimpleTextSplitter
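# Illustrative example: splitting a single 2,500-character document with
# chunk_size=1000 and chunk_overlap=200 yields chunks covering
# [0:1000], [800:1800], [1600:2500] and a short tail [2400:2500];
# each chunk repeats the last 200 characters of its predecessor, so
# context is preserved across chunk boundaries.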
# Try different import paths for memory and chains
try:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
except ImportError:
try:
from langchain_core.memory import ConversationBufferMemory
from langchain_core.chains import ConversationalRetrievalChain
except ImportError:
try:
from langchain_community.memory import ConversationBufferMemory
from langchain_community.chains import ConversationalRetrievalChain
except ImportError:
print("Warning: Memory and chains modules not found. Creating simple alternatives.")
# Create simple alternatives
class ConversationBufferMemory:
def __init__(self, memory_key='chat_history', return_messages=True):
self.memory_key = memory_key
self.return_messages = return_messages
self.chat_memory = []
def save_context(self, inputs, outputs):
self.chat_memory.append((inputs, outputs))
def load_memory_variables(self, inputs):
return {self.memory_key: self.chat_memory}
class ConversationalRetrievalChain:
def __init__(self, llm, retriever, memory):
self.llm = llm
self.retriever = retriever
self.memory = memory
def invoke(self, inputs):
question = inputs.get("question", "")
# Simple implementation - just return a basic response
return {"answer": f"I received your question: {question}. This is a simplified response."}
# Removed Google Drive Integration Functions
# Document Processing Functions
def get_loader_for_file(file_path):
"""
Get the appropriate document loader based on file extension
"""
file_extension = os.path.splitext(file_path)[1].lower()
if file_extension == '.pdf':
return PyPDFLoader(file_path)
elif file_extension in ['.docx', '.doc'] and Docx2txtLoader:
return Docx2txtLoader(file_path)
elif file_extension in ['.xlsx', '.xls'] and UnstructuredExcelLoader:
return UnstructuredExcelLoader(file_path)
elif file_extension in ['.txt', '.md']:
return TextLoader(file_path, encoding='utf-8')
else:
# Default to text loader for unknown types
try:
return TextLoader(file_path, encoding='utf-8')
        except Exception:
return None
def load_document(file_path):
"""
Load a document using the appropriate loader
"""
loader = get_loader_for_file(file_path)
if loader:
try:
return loader.load()
except Exception as e:
print(f"Error loading document {file_path}: {e}")
return []
def process_documents(documents):
"""
Split documents into chunks for embedding
"""
text_splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
return chunks
# Knowledge Base Class
class KnowledgeBase:
def __init__(self, db_name=DB_NAME):
self.db_name = db_name
self.embeddings = OpenAIEmbeddings()
self.vectorstore = None
self.initialize_vectorstore()
def initialize_vectorstore(self):
"""
Initialize the vector store, loading from disk if it exists
"""
if os.path.exists(self.db_name):
self.vectorstore = Chroma(persist_directory=self.db_name, embedding_function=self.embeddings)
print(f"Loaded existing vector store with {self.vectorstore._collection.count()} documents")
else:
# Create empty vectorstore
self.vectorstore = Chroma(persist_directory=self.db_name, embedding_function=self.embeddings)
print("Created new vector store")
def add_documents(self, documents):
"""
Process and add documents to the vector store
"""
if not documents:
return False
chunks = process_documents(documents)
if not chunks:
return False
# Add to existing vectorstore
self.vectorstore.add_documents(chunks)
print(f"Added {len(chunks)} chunks to vector store")
return True
def get_retriever(self, k=4):
"""
Get a retriever for the vector store
"""
return self.vectorstore.as_retriever(search_kwargs={"k": k})
def visualize_vectors(self):
"""
Create a 3D visualization of the vector store
"""
try:
collection = self.vectorstore._collection
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
if result['embeddings'] is None or len(result['embeddings']) == 0:
print("No embeddings found in vector store")
return None
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']
if len(vectors) < 2:
print("Not enough vectors for visualization (need at least 2)")
return None
# Get source info for coloring
sources = [metadata.get('source', 'unknown') for metadata in metadatas]
unique_sources = list(set(sources))
colors = [['blue', 'green', 'red', 'orange', 'purple', 'cyan'][unique_sources.index(s) % 6] for s in sources]
# Reduce dimensions for visualization
# Adjust perplexity based on number of samples
n_samples = len(vectors)
perplexity = min(30, max(1, n_samples - 1))
tsne = TSNE(n_components=3, random_state=42, perplexity=perplexity)
reduced_vectors = tsne.fit_transform(vectors)
# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
x=reduced_vectors[:, 0],
y=reduced_vectors[:, 1],
z=reduced_vectors[:, 2],
mode='markers',
marker=dict(size=5, color=colors, opacity=0.8),
text=[f"Source: {s}<br>Text: {d[:100]}..." for s, d in zip(sources, documents)],
hoverinfo='text'
)])
fig.update_layout(
title='3D Vector Store Visualization',
scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
width=900,
height=700,
margin=dict(r=20, b=10, l=10, t=40)
)
return fig
except Exception as e:
print(f"Error creating visualization: {e}")
return None
# Simple fallback chain implementation
class SimpleConversationalChain:
def __init__(self, llm, retriever, memory):
self.llm = llm
self.retriever = retriever
self.memory = memory
def invoke(self, inputs):
question = inputs.get("question", "")
# Get relevant documents - try different methods
try:
docs = self.retriever.get_relevant_documents(question)
except AttributeError:
try:
docs = self.retriever.invoke(question)
            except Exception:
docs = []
context = "\n".join([doc.page_content for doc in docs[:3]]) if docs else "No relevant context found."
# Create a simple prompt
prompt = f"""Based on the following context, answer the question:
Context: {context}
Question: {question}
Answer:"""
# Get response from LLM
response = self.llm.invoke(prompt)
return {"answer": response.content if hasattr(response, 'content') else str(response)}
# Chat System Class
class ChatSystem:
def __init__(self, knowledge_base, model_name=MODEL):
self.knowledge_base = knowledge_base
self.model_name = model_name
self.llm = ChatOpenAI(temperature=0.7, model_name=self.model_name)
self.memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
self.conversation_chain = self._create_conversation_chain()
def _create_conversation_chain(self):
"""
Create a new conversation chain with the current retriever
"""
retriever = self.knowledge_base.get_retriever()
# Skip the problematic ConversationalRetrievalChain and use simple implementation
print("Using simple conversational chain implementation")
return SimpleConversationalChain(self.llm, retriever, self.memory)
def reset_conversation(self):
"""
Reset the conversation memory and chain
"""
self.memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
self.conversation_chain = self._create_conversation_chain()
return "Conversation has been reset."
def chat(self, question, history):
"""
Process a question and return the answer
"""
if not question.strip():
return "Please ask a question."
result = self.conversation_chain.invoke({"question": question})
return result["answer"]
def update_knowledge_base(self):
"""
Update the conversation chain with the latest knowledge base
"""
self.conversation_chain = self._create_conversation_chain()
# UI Functions
def handle_file_upload(files):
"""
Process uploaded files and add them to the knowledge base
"""
if not files:
return "No files uploaded."
documents = []
for file in files:
try:
docs = load_document(file.name)
if docs:
# Add upload source metadata
for doc in docs:
doc.metadata['source'] = 'upload'
doc.metadata['filename'] = os.path.basename(file.name)
documents.extend(docs)
except Exception as e:
print(f"Error processing file {file.name}: {e}")
if documents:
success = kb.add_documents(documents)
if success:
# Update the chat system with new knowledge
chat_system.update_knowledge_base()
return f"Successfully processed {len(documents)} documents."
return "No documents could be processed. Please check file formats."
def create_ui():
"""
Create the Gradio UI
"""
with gr.Blocks(theme=gr.themes.Soft()) as app:
gr.Markdown("""
# Knowledge Worker
Upload documents or ask questions about your knowledge base.
""")
with gr.Tabs():
with gr.TabItem("Chat"):
chatbot = gr.ChatInterface(
chat_system.chat,
chatbot=gr.Chatbot(height=500, type="messages"),
textbox=gr.Textbox(placeholder="Ask a question about your documents...", container=False),
title="Knowledge Worker Chat",
type="messages"
)
reset_btn = gr.Button("Reset Conversation")
reset_btn.click(chat_system.reset_conversation, inputs=None, outputs=gr.Textbox())
with gr.TabItem("Upload Documents"):
with gr.Column():
file_output = gr.Textbox(label="Upload Status")
upload_button = gr.UploadButton(
"Click to Upload Files",
file_types=[".pdf", ".docx", ".txt", ".md", ".xlsx"],
file_count="multiple"
)
upload_button.upload(handle_file_upload, upload_button, file_output)
with gr.TabItem("Visualize Knowledge"):
visualize_btn = gr.Button("Generate Vector Visualization")
plot_output = gr.Plot(label="Vector Space Visualization")
visualize_btn.click(kb.visualize_vectors, inputs=None, outputs=plot_output)
return app
def main():
"""
Main function to initialize and run the knowledge worker
"""
global kb, chat_system
print("=" * 60)
print("Initializing Knowledge Worker...")
print("=" * 60)
try:
# Initialize the knowledge base
print("Setting up vector database...")
kb = KnowledgeBase(DB_NAME)
print("Vector database initialized successfully")
# Google Drive integration removed
# Initialize the chat system
print("\nSetting up chat system...")
chat_system = ChatSystem(kb)
print("Chat system initialized successfully")
# Launch the Gradio app
print("\nLaunching Gradio interface...")
print("=" * 60)
print("The web interface will open in your browser")
print("You can also access it at the URL shown below")
print("=" * 60)
app = create_ui()
app.launch(inbrowser=True)
except Exception as e:
print(f"Error initializing Knowledge Worker: {e}")
print("Please check your configuration and try again.")
return
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,623 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6f0f38e7",
"metadata": {},
"source": [
"# Email Mindmap Demo (Week 5 Community Contribution)\n",
"\n",
"Welcome to the **Email Mindmap Demo** notebook! This demo walks you through a workflow for exploring and visualizing email relationships using embeddings and mindmaps.\n",
"\n",
"---\n",
"\n",
"## 📋 Workflow Overview\n",
"\n",
"1. **Load/Create Synthetic Email Data** \n",
" Generate or load varied types of emails: work, personal, family, subscriptions, etc.\n",
"\n",
"2. **Generate Embeddings** \n",
" Use an open-source model to create vector embeddings for email content.\n",
"\n",
"3. **Build & Visualize a Mindmap** \n",
" Construct a mindmap of email relationships and visualize it interactively using `networkx` and `matplotlib`.\n",
"\n",
"4. **Question-Answering Interface** \n",
" Query the email content and the mindmap using a simple Q&A interface powered by Gradio.\n",
"\n",
"---\n",
"\n",
"## ⚙️ Requirements\n",
"\n",
"> **Tip:** \n",
"> I'm including an example of the synthetic emails in case you don't want to run that part.\n",
"> Might need to install other libraries like pyvis, nbformat and faiss-cpu\n",
"\n",
"\n",
"## ✨ Features\n",
"\n",
"- Synthetic generation of varied emails (work, personal, family, subscriptions)\n",
"- Embedding generation with open-source models (hugging face sentence-transformer)\n",
"- Interactive mindmap visualization (`networkx`, `pyvis`)\n",
"- Simple chatbot interface (Gradio) and visualization of mindmap created\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a9aeb363",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key exists and begins sk-proj-\n",
"Anthropic API Key exists and begins sk-ant-\n",
"Google API Key exists and begins AI\n",
"OLLAMA API Key exists and begins 36\n"
]
}
],
"source": [
"# imports\n",
"\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"\n",
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"ollama_api_key = os.getenv('OLLAMA_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if anthropic_api_key:\n",
" print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n",
"else:\n",
" print(\"Anthropic API Key not set (and this is optional)\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:2]}\")\n",
"else:\n",
" print(\"Google API Key not set (and this is optional)\")\n",
"\n",
"if ollama_api_key:\n",
" print(f\"OLLAMA API Key exists and begins {ollama_api_key[:2]}\")\n",
"else:\n",
" print(\"OLLAMA API Key not set (and this is optional)\")\n",
"\n",
"# Connect to client libraries\n",
"\n",
"openai = OpenAI()\n",
"\n",
"anthropic_url = \"https://api.anthropic.com/v1/\"\n",
"gemini_url = \"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
"ollama_url = \"http://localhost:11434/v1\"\n",
"\n",
"anthropic = OpenAI(api_key=anthropic_api_key, base_url=anthropic_url)\n",
"gemini = OpenAI(api_key=google_api_key, base_url=gemini_url)\n",
"ollama = OpenAI(api_key=ollama_api_key, base_url=ollama_url)\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "b8ddce62",
"metadata": {},
"source": [
"## Preparation of synthetic data (could have been week2 work)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2e250912",
"metadata": {},
"outputs": [],
"source": [
"#using ollama gpt oss 120b cloud i'm going to create synthetic emails using a persona.\n",
"#they are going to be saved in a json file with different keys\n",
"from pydantic import BaseModel, Field\n",
"from typing import List, Optional\n",
"\n",
"\n",
"class Email(BaseModel):\n",
" sender: str = Field(description=\"Email address of the sender\")\n",
" subject: str = Field(description=\"Email subject line\")\n",
" body: str = Field(description=\"Email body content\")\n",
" timestamp: str = Field(description=\"ISO 8601 timestamp when email was received\")\n",
" category: str = Field(description=\"Category of the email\")\n",
"\n",
"class EmailBatch(BaseModel):\n",
" emails: List[Email] = Field(description=\"List of generated emails\")\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1f67fdb3",
"metadata": {},
"outputs": [],
"source": [
"def create_persona(name: str, age: int, occupation: str, \n",
" interests: List[str], family_status: str) -> str:\n",
" persona = f\"\"\"\n",
" You are generating synthetic emails for a realistic inbox simulation.\n",
"\n",
" **Person Profile:**\n",
" - Name: {name}\n",
" - Age: {age}\n",
" - Occupation: {occupation}\n",
" - Interests: {', '.join(interests)}\n",
" - Family Status: {family_status}\n",
"\n",
" **Email Categories to Include:**\n",
" 1. **Work Emails**: Project updates, meeting invitations, colleague communications, \n",
" performance reviews, company announcements\n",
" 2. **Purchases**: Order confirmations, shipping notifications, delivery updates, \n",
" receipts from various retailers (Amazon, local shops, etc.)\n",
" 3. **Subscriptions**: Newsletter updates, streaming services (Netflix, Spotify), \n",
" software subscriptions (Adobe, Microsoft 365), magazine subscriptions\n",
" 4. **Family**: Communications with parents, siblings, children, extended family members,\n",
" family event planning, photo sharing\n",
" 5. **Friends**: Social plans, birthday wishes, casual conversations, group hangouts,\n",
" catching up messages\n",
" 6. **Finance**: Bank statements, credit card bills, investment updates, tax documents,\n",
" payment reminders\n",
" 7. **Social Media**: Facebook notifications, LinkedIn updates, Instagram activity,\n",
" Twitter mentions\n",
" 8. **Personal**: Doctor appointments, gym memberships, utility bills, insurance updates\n",
"\n",
" **Instructions:**\n",
" - Generate realistic email content that reflects the person's life over time\n",
" - Include temporal patterns (more work emails on weekdays, more personal on weekends)\n",
" - Create realistic sender names and email addresses\n",
" - Vary email length and formality based on context\n",
" - Include realistic subject lines\n",
" - Make emails interconnected when appropriate (e.g., follow-up emails, conversation threads)\n",
" - Include seasonal events (holidays, birthdays, annual renewals)\n",
" \"\"\"\n",
" return persona\n",
"\n",
"persona_description = create_persona(\n",
" name=\"John Doe\",\n",
" age=30,\n",
" occupation=\"Software Engineer\",\n",
" interests=[\"technology\", \"reading\", \"traveling\"],\n",
" family_status=\"single\"\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cec185e3",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"from datetime import datetime, timedelta\n",
"import random\n",
"from typing import List\n",
"\n",
"def generate_synthetic_emails(\n",
" persona_description: str,\n",
" num_emails: int,\n",
" start_date: str,\n",
" end_date: str,\n",
" model: str = \"gpt-4o-2024-08-06\"\n",
") -> List[Email]:\n",
" \"\"\"\n",
" NEEDS TO WORK WITH OPENAI MODELS BECAUSE OF PARSED (STRUC OUTPUT) MODELS\n",
" Generates synthetic emails using OpenAI's structured output feature.\n",
" \n",
" Args:\n",
" persona_description: Detailed persona description\n",
" num_emails: Number of emails to generate per batch\n",
" start_date: Start date for email timestamps\n",
" end_date: End date for email timestamps\n",
" model: OpenAI model to use (must support structured outputs)\n",
" \n",
" Returns:\n",
" List of Email objects\n",
" \"\"\"\n",
" \n",
" # Calculate date range for context\n",
" date_range_context = f\"\"\"\n",
" Generate emails with timestamps between {start_date} and {end_date}.\n",
" Distribute emails naturally across this time period, with realistic patterns:\n",
" - More emails during business hours on weekdays\n",
" - Fewer emails late at night\n",
" - Occasional weekend emails\n",
" - Bursts of activity around events or busy periods\n",
" \"\"\"\n",
" \n",
" # System message combining persona and structure instructions\n",
" system_message = f\"\"\"\n",
" {persona_description}\n",
"\n",
" {date_range_context}\n",
"\n",
" Generate {num_emails} realistic emails that fit this person's life. \n",
" Ensure variety in categories, senders, and content while maintaining realism.\n",
" \"\"\"\n",
" \n",
" try:\n",
" client = OpenAI()\n",
"\n",
" response = client.chat.completions.parse(\n",
" model=model,\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_message\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": f\"Generate {num_emails} diverse, realistic emails for this person's inbox.\"\n",
" }\n",
" ],\n",
" response_format=EmailBatch,\n",
" )\n",
" return response.choices[0].message.parsed.emails\n",
" \n",
" except Exception as e:\n",
" print(f\"Error generating emails: {e}\")\n",
" return []\n",
"\n",
"\n",
"def save_emails_to_json(emails: List[Email], filename: str):\n",
" \"\"\"\n",
" Saves emails to a JSON file.\n",
" \"\"\"\n",
" import json\n",
" \n",
" emails_dict = [email.model_dump() for email in emails]\n",
" \n",
" with open(filename, 'w', encoding='utf-8') as f:\n",
" json.dump(emails_dict, f, indent=2, ensure_ascii=False)\n",
" \n",
" print(f\"Saved {len(emails)} emails to {filename}\")\n"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "be31f352",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"now\n"
]
}
],
"source": [
"mails_2 = generate_synthetic_emails(\n",
" persona_description = persona_description,\n",
" num_emails = 100,\n",
" start_date = '2024-06-01',\n",
" end_date = '2025-01-01',\n",
" model = \"gpt-4o\"\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "24d844f2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Saved 101 emails to emails2.json\n"
]
}
],
"source": [
"save_emails_to_json(mails_2, 'emails2.json')"
]
},
{
"cell_type": "markdown",
"id": "2b9c704e",
"metadata": {},
"source": [
"## Create embeddings for the mails\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "777012f8",
"metadata": {},
"outputs": [],
"source": [
"# imports for langchain, plotly and Chroma\n",
"\n",
"from langchain.document_loaders import DirectoryLoader, TextLoader\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.schema import Document\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain_chroma import Chroma\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.manifold import TSNE\n",
"import numpy as np\n",
"import plotly.graph_objects as go\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"from langchain.embeddings import HuggingFaceEmbeddings\n",
"import json\n",
"from langchain.vectorstores import FAISS\n",
"\n",
"#MODEL = \"gpt-4o-mini\"\n",
"db_name = \"vector_db\""
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "ce95d9c7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of chunks: 206\n",
"Sample metadata fields: ['sender', 'timestamp', 'category']\n"
]
}
],
"source": [
"# Read in emails from the emails.json file and construct LangChain documents\n",
"\n",
"\n",
"with open(\"emails.json\", \"r\", encoding=\"utf-8\") as f:\n",
" emails = json.load(f)\n",
"\n",
"documents = []\n",
"for email in emails:\n",
" # Extract metadata (all fields except 'content')\n",
" metadata = {k: v for k, v in email.items() if k in ['sender','category','timestamp']}\n",
" body = email.get(\"body\", \"\")\n",
" documents.append(Document(page_content=body, metadata=metadata))\n",
"\n",
"text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)\n",
"chunks = text_splitter.split_documents(documents)\n",
"\n",
"print(f\"Total number of chunks: {len(chunks)}\")\n",
"print(f\"Sample metadata fields: {list(documents[0].metadata.keys()) if documents else []}\")\n",
"\n",
"embeddings_model = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
"\n",
"if os.path.exists(db_name):\n",
" Chroma(persist_directory=db_name, embedding_function=embeddings_model).delete_collection()\n",
"\n",
"vectorstore = FAISS.from_documents(chunks, embedding=embeddings_model)\n",
"\n",
"all_embeddings = [vectorstore.index.reconstruct(i) for i in range(vectorstore.index.ntotal)]\n",
"\n",
"total_vectors = vectorstore.index.ntotal\n",
"dimensions = vectorstore.index.d\n"
]
},
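{
"cell_type": "markdown",
"id": "ad0c0de9",
"metadata": {},
"source": [
"Before building the graph, it helps to eyeball the index and one nearest-neighbour lookup. A small sketch, assuming the FAISS store created above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0dea",
"metadata": {},
"outputs": [],
"source": [
"print(f\"FAISS index: {total_vectors} vectors of dimension {dimensions}\")\n",
"\n",
"# Spot-check retrieval: which chunks sit closest to a sample query?\n",
"for doc in vectorstore.similarity_search(\"order confirmation\", k=3):\n",
"    print(doc.metadata.get(\"category\"), \"->\", doc.page_content[:60])"
]
},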
{
"cell_type": "markdown",
"id": "78ca65bb",
"metadata": {},
"source": [
"## Visualizing mindmap"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "a99dd2d6",
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"import plotly.graph_objects as go\n",
"import numpy as np\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.manifold import TSNE # Or use UMAP\n",
"from pyvis.network import Network\n",
"\n",
"# Here, emails is your list of email objects, with .subject or .body\n",
"\n",
"# Build similarity graph\n",
"def build_mindmap_html(emails, all_embeddings, threshold=0.6):\n",
" similarity = cosine_similarity(all_embeddings)\n",
"\n",
" G = nx.Graph()\n",
" for i, email in enumerate(emails):\n",
" G.add_node(i, label=email['subject'][:80], title=email['body'][:50]) # Custom hover text\n",
"\n",
" for i in range(len(emails)):\n",
" for j in range(i+1, len(emails)):\n",
" if similarity[i][j] > threshold:\n",
" G.add_edge(i, j, weight=float(similarity[i][j]))\n",
"\n",
" # Convert to pyvis network\n",
" nt = Network(notebook=True, height='700px', width='100%', bgcolor='#222222', font_color='white')\n",
" nt.from_nx(G)\n",
" html = nt.generate_html().replace(\"'\", \"\\\"\")\n",
" return html\n"
]
},
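{
"cell_type": "markdown",
"id": "ad0c0deb",
"metadata": {},
"source": [
"The `threshold` argument controls how dense the mindmap gets: higher values keep only strongly related emails. A quick sweep, a sketch reusing the similarity computation above, shows how the edge count falls as the threshold rises:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0dec",
"metadata": {},
"outputs": [],
"source": [
"# Count edges at several thresholds (diagonal entries are 1.0, so exclude them)\n",
"sim = cosine_similarity(all_embeddings)\n",
"n = sim.shape[0]\n",
"for t in (0.4, 0.5, 0.6, 0.7):\n",
"    edges = int(((sim > t).sum() - n) // 2)  # off-diagonal pairs are counted twice\n",
"    print(f\"threshold {t}: {edges} edges\")"
]
},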
{
"cell_type": "markdown",
"id": "53a2fbaf",
"metadata": {},
"source": [
"## Putting it all together in a gradio.\n",
"It needs to have an interface to make questions, and the visual to see the mindmap.\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "161144ac",
"metadata": {},
"outputs": [],
"source": [
"# create a new Chat with OpenAI\n",
"MODEL=\"gpt-4o-mini\"\n",
"llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
"\n",
"# set up the conversation memory for the chat\n",
"memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
"\n",
"# the retriever is an abstraction over the VectorStore that will be used during RAG\n",
"retriever = vectorstore.as_retriever()\n",
"from langchain_core.callbacks import StdOutCallbackHandler\n",
"\n",
"# putting it together: set up the conversation chain with the GPT 3.5 LLM, the vector store and memory\n",
"conversation_chain_debug = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, callbacks=[StdOutCallbackHandler()])\n",
"conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)\n",
"\n",
"# Wrapping that in a function\n",
"\n",
"def chat(question, history):\n",
" result = conversation_chain.invoke({\"question\": question})\n",
" return result[\"answer\"]"
]
},
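{
"cell_type": "markdown",
"id": "ad0c0ded",
"metadata": {},
"source": [
"One direct call is a handy smoke test before launching the UI. A sketch, assuming the chain above (the exchange is also stored in the shared memory):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad0c0dee",
"metadata": {},
"outputs": [],
"source": [
"print(chat(\"Who have I been communicating with?\", []))"
]
},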
{
"cell_type": "code",
"execution_count": 60,
"id": "16a4d8d1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Javi\\Desktop\\course\\llm_engineering\\.venv\\Lib\\site-packages\\gradio\\chat_interface.py:347: UserWarning:\n",
"\n",
"The 'tuples' format for chatbot messages is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style 'role' and 'content' keys.\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning: When cdn_resources is 'local' jupyter notebook has issues displaying graphics on chrome/safari. Use cdn_resources='in_line' or cdn_resources='remote' if you have issues viewing graphics in a notebook.\n",
"* Running on local URL: http://127.0.0.1:7878\n",
"* To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7878/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning: When cdn_resources is 'local' jupyter notebook has issues displaying graphics on chrome/safari. Use cdn_resources='in_line' or cdn_resources='remote' if you have issues viewing graphics in a notebook.\n",
"Warning: When cdn_resources is 'local' jupyter notebook has issues displaying graphics on chrome/safari. Use cdn_resources='in_line' or cdn_resources='remote' if you have issues viewing graphics in a notebook.\n"
]
}
],
"source": [
"\n",
"import gradio as gr\n",
"\n",
"def show_mindmap():\n",
" # Call build_mindmap_html to generate the HTML\n",
" html = build_mindmap_html(emails, all_embeddings)\n",
" return f\"\"\"<iframe style=\"width: 100%; height: 600px;margin:0 auto\" name=\"result\" allow=\"midi; geolocation; microphone; camera; \n",
" display-capture; encrypted-media;\" sandbox=\"allow-modals allow-forms \n",
" allow-scripts allow-same-origin allow-popups \n",
" allow-top-navigation-by-user-activation allow-downloads\" allowfullscreen=\"\" \n",
" allowpaymentrequest=\"\" frameborder=\"0\" srcdoc='{html}'></iframe>\"\"\"\n",
"\n",
"\n",
"with gr.Blocks(title=\"Mindmap & Email Chatbot\") as demo:\n",
" gr.Markdown(\"# 📧 Mindmap Visualization & Email QA Chatbot\")\n",
" with gr.Row():\n",
" chatbot = gr.ChatInterface(fn=chat, title=\"Ask about your emails\",\n",
" examples=[\n",
" \"What is my most important message?\",\n",
" \"Who have I been communicating with?\",\n",
" \"Summarize recent emails\"\n",
" ],\n",
")\n",
" mindmap_html = gr.HTML(\n",
" show_mindmap,\n",
" label=\"🧠 Mindmap of Your Emails\",\n",
" )\n",
" # Reduce height: update show_mindmap (elsewhere) to ~400px, or do inline replace for the demo here:\n",
" # mindmap_html = gr.HTML(lambda: show_mindmap().replace(\"height: 600px\", \"height: 400px\"))\n",
" \n",
"demo.launch(inbrowser=True)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "221a9d98",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}