Merge branch 'main' of github.com:ed-donner/llm_engineering

This commit is contained in:
Edward Donner
2025-04-28 09:19:37 -04:00
44 changed files with 9449 additions and 49 deletions


@@ -0,0 +1,173 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "aa629e55-8f41-41ab-b319-b55dd1cfc76b",
"metadata": {},
"source": [
"# Playwright Scraper Showcase (Async in Jupyter)\n",
"\n",
"This notebook demonstrates how to run async Playwright-based scraping code inside JupyterLab using `nest_asyncio`.\n",
"\n",
"**Note:** Requires `openai_scraper_playwright.py` to be in the same directory."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "97469777",
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"import asyncio\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6254fa89",
"metadata": {},
"outputs": [],
"source": [
"from openai_scraper_playwright import EnhancedOpenAIScraper, analyze_content"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "33d2737b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### 1. Overall Summary of the Website:\n",
"The website appears to be a hub for various applications of AI technology, particularly focusing on the capabilities of ChatGPT and other AI models developed by OpenAI. It offers a range of services from answering queries, assisting in planning trips, explaining technical topics, helping with language translation, and providing educational content. The site also features updates on new AI models, research publications, and business solutions integrating AI.\n",
"\n",
"### 2. Key Individuals or Entities:\n",
"- **OpenAI**: Mentioned as the organization behind the development of AI models and technologies such as ChatGPT, GPT-4.1, and image generation models. OpenAI seems to be focused on advancing and applying AI in various fields.\n",
"- **Lyndon Barrois & Sora**: Featured in a story, possibly highlighting individual experiences or contributions within the OpenAI ecosystem.\n",
"\n",
"### 3. Recent Announcements or Updates:\n",
"- **Introducing our latest image generation model in the API** (Product, Apr 23, 2025)\n",
"- **Thinking with images** (Release, Apr 16, 2025)\n",
"- **OpenAI announces nonprofit commission advisors** (Company, Apr 15, 2025)\n",
"- **Our updated Preparedness Framework** (Publication, Apr 15, 2025)\n",
"- **BrowseComp: a benchmark for browsing agents** (Publication, Apr 10, 2025)\n",
"- **OpenAI Pioneers Program** (Company, Apr 9, 2025)\n",
"\n",
"### 4. Main Topics or Themes:\n",
"- **AI Model Development and Application**: Discusses various AI models like ChatGPT, GPT-4.1, and image generation models.\n",
"- **Educational and Practical AI Uses**: Offers help in educational topics, practical tasks, and creative endeavors using AI.\n",
"- **Business Integration**: Focuses on integrating AI into business processes, automating tasks in finance, legal, and other sectors.\n",
"- **Research and Publications**: Shares updates on the latest research and publications related to AI technology.\n",
"\n",
"### 5. Any Noteworthy Features or Projects:\n",
"- **GPT-4.1 and Image Generation Models**: Introduction of new and advanced AI models for text and image processing.\n",
"- **OpenAI Pioneers Program**: A significant initiative likely aimed at fostering innovation and practical applications of AI technology.\n",
"- **BrowseComp and PaperBench**: Research projects or benchmarks designed to evaluate and improve AI capabilities in specific domains.\n"
]
}
],
"source": [
"result = asyncio.run(analyze_content())\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "d7450ccf",
"metadata": {},
"source": [
"✅ If you see structured analysis above, the async code ran successfully in Jupyter!"
]
},
{
"cell_type": "markdown",
"id": "9a46716c-6f77-4b2b-b423-cc9fe05014da",
"metadata": {},
"source": [
"# 🧪 Playwright Scraper Output (Formatted)\n",
"\n",
"---\n",
"\n",
"## 🧭 1. **Overall Summary of the Website**\n",
"\n",
"*The website appears to be focused on showcasing various applications and updates related to OpenAI's technology, specifically ChatGPT and other AI tools. It provides information on product releases, company updates, and educational content on how to use AI technologies in different scenarios such as planning trips, learning games, coding, and more.*\n",
"\n",
"---\n",
"\n",
"## 🧑‍💼 2. **Key Individuals or Entities**\n",
"\n",
"- **OpenAI** — Company behind the technologies and updates discussed on the website \n",
"- **Lyndon Barrois & Sora** — Featured in a story, possibly highlighting user experiences or contributions\n",
"\n",
"---\n",
"\n",
"## 📰 3. **Recent Announcements or Updates**\n",
"\n",
"- 📢 **Introducing GPT-4.1 in the API** — *(no date provided)*\n",
"- 🖼️ **Introducing 4o Image Generation** — *(no date provided)*\n",
"- 🐟 **Catching halibut with ChatGPT** — *(no date provided)*\n",
"- 🧠 **Thinking with images** — *Apr 16, 2025*\n",
"- 🧑‍⚖️ **Nonprofit commission advisors announced** — *Apr 15, 2025*\n",
"- ⚙️ **Updated Preparedness Framework** — *Apr 15, 2025*\n",
"- 🌐 **BrowseComp benchmark for browsing agents** — *Apr 10, 2025*\n",
"- 🚀 **OpenAI Pioneers Program launched** — *Apr 9, 2025*\n",
"- 📊 **PaperBench research benchmark published** — *Apr 2, 2025*\n",
"\n",
"---\n",
"\n",
"## 📚 4. **Main Topics or Themes**\n",
"\n",
"- 🤖 **AI Technology Applications** — Using AI for tasks like planning, learning, and troubleshooting \n",
"- 🧩 **Product and Feature Releases** — Updates on new capabilities \n",
"- 📘 **Educational Content** — Guides for using AI effectively \n",
"- 🧪 **Research and Development** — Publications and technical benchmarks\n",
"\n",
"---\n",
"\n",
"## ⭐ 5. **Noteworthy Features or Projects**\n",
"\n",
"- ✅ **GPT-4.1** — A new API-accessible version of the language model \n",
"- 🖼️ **4o Image Generation** — Feature focused on AI-generated images \n",
"- 🚀 **OpenAI Pioneers Program** — Initiative likely fostering innovation in AI \n",
"- 📊 **BrowseComp & PaperBench** — Benchmarks for evaluating AI agents\n",
"\n",
"---\n",
"\n",
"✅ *If you're reading this inside Jupyter and seeing clean structure — your async notebook setup is working beautifully.*\n"
]
},
{
"cell_type": "markdown",
"id": "95c38374-5daa-487c-8bd9-919bb4037ea3",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,69 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3df9df94",
"metadata": {},
"source": [
"# 🧪 Playwright Scraper Output (Formatted)\n",
"\n",
"---\n",
"\n",
"## 🧭 1. **Overall Summary of the Website**\n",
"\n",
"*The website appears to be focused on showcasing various applications and updates related to OpenAI's technology, specifically ChatGPT and other AI tools. It provides information on product releases, company updates, and educational content on how to use AI technologies in different scenarios such as planning trips, learning games, coding, and more.*\n",
"\n",
"---\n",
"\n",
"## 🧑‍💼 2. **Key Individuals or Entities**\n",
"\n",
"- **OpenAI** — Company behind the technologies and updates discussed on the website \n",
"- **Lyndon Barrois & Sora** — Featured in a story, possibly highlighting user experiences or contributions\n",
"\n",
"---\n",
"\n",
"## 📰 3. **Recent Announcements or Updates**\n",
"\n",
"- 📢 **Introducing GPT-4.1 in the API** — *(no date provided)*\n",
"- 🖼️ **Introducing 4o Image Generation** — *(no date provided)*\n",
"- 🐟 **Catching halibut with ChatGPT** — *(no date provided)*\n",
"- 🧠 **Thinking with images** — *Apr 16, 2025*\n",
"- 🧑‍⚖️ **Nonprofit commission advisors announced** — *Apr 15, 2025*\n",
"- ⚙️ **Updated Preparedness Framework** — *Apr 15, 2025*\n",
"- 🌐 **BrowseComp benchmark for browsing agents** — *Apr 10, 2025*\n",
"- 🚀 **OpenAI Pioneers Program launched** — *Apr 9, 2025*\n",
"- 📊 **PaperBench research benchmark published** — *Apr 2, 2025*\n",
"\n",
"---\n",
"\n",
"## 📚 4. **Main Topics or Themes**\n",
"\n",
"- 🤖 **AI Technology Applications** — Using AI for tasks like planning, learning, and troubleshooting \n",
"- 🧩 **Product and Feature Releases** — Updates on new capabilities \n",
"- 📘 **Educational Content** — Guides for using AI effectively \n",
"- 🧪 **Research and Development** — Publications and technical benchmarks\n",
"\n",
"---\n",
"\n",
"## ⭐ 5. **Noteworthy Features or Projects**\n",
"\n",
"- ✅ **GPT-4.1** — A new API-accessible version of the language model \n",
"- 🖼️ **4o Image Generation** — Feature focused on AI-generated images \n",
"- 🚀 **OpenAI Pioneers Program** — Initiative likely fostering innovation in AI \n",
"- 📊 **BrowseComp & PaperBench** — Benchmarks for evaluating AI agents\n",
"\n",
"---\n",
"\n",
"✅ *If you're reading this inside Jupyter and seeing clean structure — your async notebook setup is working beautifully.*\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,67 @@
# 🧠 Community Contribution: Async Playwright-based OpenAI Scraper
This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright** — an alternative to Selenium.
Developed by: [lakovicb](https://github.com/lakovicb)
IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)
---
## 📦 Features
- 🧭 Simulates human-like interactions (mouse movement, scrolling)
- 🧠 GPT-based analysis using OpenAI's API
- 🧪 Works inside **JupyterLab** using `nest_asyncio`
- 📊 Prometheus metrics for scraping observability
- ⚡ Smart content caching via `diskcache`
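
The `diskcache` layer keys each entry by a sanitized URL. A minimal sketch of the key scheme (mirroring `get_cached_content` in `openai_scraper_playwright.py`):

```python
def cache_key(url: str) -> str:
    # Strip the scheme and flatten path separators so the key is a
    # simple, filesystem-friendly string (as in get_cached_content).
    return "cache_" + url.replace("https://", "").replace("/", "_")

print(cache_key("https://openai.com/research"))  # cache_openai.com_research
```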
---
## 🚀 How to Run
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
> Ensure [Playwright is installed & browsers are downloaded](https://playwright.dev/python/docs/intro)
```bash
playwright install
```
### 2. Set environment variables in `.env`
```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```
You can also define optional proxy/login params if needed.
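
As a quick illustration, most settings resolve from the environment with sensible defaults; a minimal sketch (the variable names mirror the env vars the scraper reads, the values are just the defaults):

```python
import os

# Sketch: each setting falls back to a default when the variable
# is absent from the environment / .env file.
browser_path = os.getenv("BROWSER_PATH", "/usr/bin/chromium-browser")
retry_count = int(os.getenv("RETRY_COUNT", "2"))
cache_expiry = int(os.getenv("CACHE_EXPIRY", "3600"))

print(browser_path, retry_count, cache_expiry)
```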
---
## 📘 Notebooks Included
| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |
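
The async-in-Jupyter pattern both notebooks rely on can be sketched like this (a toy coroutine stands in for the real scraper):

```python
import asyncio

# Inside Jupyter an event loop is already running, so asyncio.run()
# fails unless the loop is patched first:
#     import nest_asyncio
#     nest_asyncio.apply()
# In a plain script (like this sketch), asyncio.run() works directly.

async def fetch_summary() -> str:
    await asyncio.sleep(0)  # stand-in for the real Playwright work
    return "structured analysis"

print(asyncio.run(fetch_summary()))
```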
---
## 🔁 Output Example
- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes
*Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*
---
## 🙏 Thanks
Huge thanks to Ed Donner for the amazing course and challenge inspiration!


@@ -0,0 +1,141 @@
# openai_scraper_playwright.py
import asyncio
import logging
import os
import random
import time

from playwright.async_api import async_playwright
from openai import OpenAI
from prometheus_client import start_http_server, Counter, Histogram
from diskcache import Cache
from dotenv import load_dotenv

load_dotenv()

SCRAPE_ATTEMPTS = Counter('scrape_attempts', 'Total scraping attempts')
SCRAPE_DURATION = Histogram('scrape_duration', 'Scraping duration distribution')
cache = Cache('./scraper_cache')


class ScrapingError(Exception):
    pass


class ContentAnalysisError(Exception):
    pass


class EnhancedOpenAIScraper:
    API_KEY = os.getenv("OPENAI_API_KEY")
    BROWSER_EXECUTABLE = os.getenv("BROWSER_PATH", "/usr/bin/chromium-browser")
    MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", 30000))

    def __init__(self, headless=True):
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
        ]
        self.timeout = 45000
        self.retry_count = int(os.getenv("RETRY_COUNT", 2))
        self.headless = headless
        self.proxy_servers = [x.strip() for x in os.getenv("PROXY_SERVERS", "").split(',') if x.strip()]

    async def human_interaction(self, page):
        # Simulate human-like mouse movement, tabbing, and scrolling
        for _ in range(random.randint(2, 5)):
            x, y = random.randint(0, 1366), random.randint(0, 768)
            await page.mouse.move(x, y, steps=random.randint(5, 20))
            await page.wait_for_timeout(random.randint(50, 200))
            if random.random() < 0.3:
                await page.keyboard.press('Tab')
                await page.keyboard.type(' ', delay=random.randint(50, 200))
        await page.mouse.wheel(0, random.choice([300, 600, 900]))
        await page.wait_for_timeout(random.randint(500, 2000))

    async def load_page(self, page, url):
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=self.timeout)
            selectors = ['main article', '#main-content', 'section:first-of-type', 'div[class*="content"]', 'body']
            for selector in selectors:
                if await page.query_selector(selector):
                    return True
            await page.wait_for_timeout(5000)
            return True
        except Exception as e:
            logging.error(f"Error loading page {url}: {e}")
            return False

    @SCRAPE_DURATION.time()
    async def scrape_with_retry(self, url):
        SCRAPE_ATTEMPTS.inc()
        last_error = None
        try:
            async with async_playwright() as p:
                args = {
                    "headless": self.headless,
                    "args": ["--disable-blink-features=AutomationControlled", "--no-sandbox"],
                    "executable_path": self.BROWSER_EXECUTABLE
                }
                browser = await p.chromium.launch(**args)
                context = await browser.new_context(user_agent=random.choice(self.user_agents))
                page = await context.new_page()
                # Mask the webdriver flag to reduce bot detection
                await page.add_init_script("""
                    Object.defineProperty(navigator, 'webdriver', { get: () => false });
                """)
                for attempt in range(self.retry_count):
                    try:
                        if not await self.load_page(page, url):
                            raise ScrapingError("Failed to load page")
                        await self.human_interaction(page)
                        content = await page.evaluate("""() => document.body.innerText""")
                        if not content.strip():
                            raise ContentAnalysisError("No content extracted")
                        await browser.close()
                        return content[:self.MAX_CONTENT_LENGTH]
                    except Exception as e:
                        last_error = e
                        if attempt < self.retry_count - 1:
                            await asyncio.sleep(5)
                        else:
                            await browser.close()
                            raise
        except Exception as e:
            raise last_error or e

    async def get_cached_content(self, url):
        key = 'cache_' + url.replace('https://', '').replace('/', '_')
        content = cache.get(key)
        if content is None:
            content = await self.scrape_with_retry(url)
            cache.set(key, content, expire=int(os.getenv("CACHE_EXPIRY", 3600)))
        return content


async def analyze_content(url="https://openai.com", headless=True):
    # Validate the key before doing any scraping work
    if not EnhancedOpenAIScraper.API_KEY:
        raise ContentAnalysisError("OpenAI API key not configured")
    scraper = EnhancedOpenAIScraper(headless=headless)
    content = await scraper.get_cached_content(url)
    client = OpenAI(api_key=EnhancedOpenAIScraper.API_KEY)
    prompt = f"""
    Analyze this page:
    {content}
    """
    model = os.getenv("OPENAI_MODEL", "gpt-4-turbo")
    temperature = float(os.getenv("MODEL_TEMPERATURE", 0.3))
    max_tokens = int(os.getenv("MAX_TOKENS", 1500))
    top_p = float(os.getenv("MODEL_TOP_P", 0.9))
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a content analyst."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p
    )
    if not response.choices:
        raise ContentAnalysisError("Empty response from GPT")
    return response.choices[0].message.content


@@ -0,0 +1,6 @@
playwright>=1.43.0
openai>=1.14.2
prometheus-client>=0.19.0
diskcache>=5.6.1
python-dotenv>=1.0.1
nest_asyncio>=1.6.0


@@ -0,0 +1,177 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0b15b939-593a-4ccc-89bd-0cee09fe2f12",
"metadata": {},
"source": [
"# Python Code Summarizer\n",
"\n",
"The code below summarizes a Python script and explains it in detail, helping developers better understand unfamiliar code."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8dcf353c-e4f2-4ce7-a3b5-71b29700a148",
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"from IPython.display import Markdown, display\n",
"import os\n",
"import openai\n",
"from dotenv import load_dotenv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "111cf632-08e8-4246-a5bb-b56942789242",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4f5376f-5e6f-4d75-81bf-222e34bfe828",
"metadata": {},
"outputs": [],
"source": [
"def read_code(**kwargs):\n",
" \"\"\"\n",
" You can pass two types of key word arguments to this function.\n",
" code_path= Path to your complex python code.\n",
" code= Passing raw python code.\n",
" \"\"\"\n",
" code_path = kwargs.get('code_path',None)\n",
" code_raw = kwargs.get('code',None)\n",
" \n",
" if code_path:\n",
" with open(code_path, 'r') as code_file:\n",
" code = code_file.read()\n",
" return (True, code)\n",
"\n",
" if code_raw:\n",
" return (True, code_raw)\n",
"\n",
" return (False, None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Model Prompt\n",
"system_prompt = (\n",
" \"You are a helpful assistant. The following input will be a Python code snippet. \"\n",
" \"Your task is to:\\n\\n\"\n",
" \"1. Summarize the overall purpose of the code.\\n\"\n",
" \"2. Explain the code line by line, describing what each line does and why it's written that way.\\n\"\n",
" \"3. Provide reasoning behind the code structure and logic to help novice Python developers understand the concepts better.\\n\\n\"\n",
" \"Use Markdown format in your response. Make the explanation beginner-friendly, using code blocks, bullet points, and headings where helpful.\"\n",
" ) \n",
"# In a plot twist worthy of sci-fi, this prompt was written by ChatGPT...\n",
"# to tell ChatGPT how to respond. We've officially entered the Matrix. 🤖🌀"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed7d2447-32a9-4761-8b0a-b31814bee7e5",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Guess where I got this code from :)\n",
"code_line = \"\"\"yield from {book.get(\"author\") for book in books if book.get(\"author\")}\"\"\"\n",
"is_code, raw_code = read_code(code=code_line)\n",
"\n",
"if is_code:\n",
" user_prompt = raw_code\n",
"else:\n",
" print(\"Invalid Arguments\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d74a1a39-1c24-4d4b-bd49-0ca416377a93",
"metadata": {},
"outputs": [],
"source": [
"def messages_for():\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df6c2726-d0fb-4ab6-b13b-d047e8807558",
"metadata": {},
"outputs": [],
"source": [
"def summarize():\n",
" \n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages_for()\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8425144c-595e-4ad6-9801-3e8778d285c4",
"metadata": {},
"outputs": [],
"source": [
"def display_summary():\n",
" summary = summarize()\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "744bffdd-ec3c-4b27-b126-81bf3e8c8295",
"metadata": {},
"outputs": [],
"source": [
"display_summary()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -110,10 +110,24 @@
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}


@@ -0,0 +1,173 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install youtube_transcript_api"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"from youtube_transcript_api import YouTubeTranscriptApi"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"class YouTubeWebLink:\n",
" def __init__(self, url):\n",
" self.url = url\n",
" self.video_id = self.get_video_id(url)\n",
" self.set_openai_client()\n",
" self.set_system_prompt()\n",
"\n",
" def get_video_id(self, url):\n",
" \"\"\" extract youtube video id from url with regular expression \"\"\"\n",
" regex = r\"(?:v=|be/)([a-zA-Z0-9_-]{11})\"\n",
" match = re.search(regex, url)\n",
" if match:\n",
" return match.group(1)\n",
" else:\n",
" raise ValueError(\"Probably not a YouTube URL\")\n",
" \n",
" def set_openai_client(self):\n",
" self.openai = OpenAI()\n",
" \n",
" def set_system_prompt(self, system_prompt=None):\n",
" \"\"\" set system prompt from youtube video \"\"\"\n",
" self.system_prompt = \"\"\"\n",
" You are a skilled explainer and storyteller who specializes in summarizing YouTube video transcripts in a way that's both engaging and informative. \n",
" Your task is to:\n",
" - Capture key points and main ideas of the video\n",
" - Structure your summary in clear sections\n",
" - Include important details, facts, and figures mentioned\n",
" - Never end your summary with a \"Conclusion\" section\n",
" - Keep the summary short and easy to understand\n",
" - Always format your response in markdown for better readability\n",
" \"\"\" if system_prompt is None else system_prompt\n",
"\n",
" def get_transcript(self):\n",
" \"\"\" get transcript from youtube video \"\"\"\n",
" try:\n",
" print('Fetching video transcript...')\n",
" transcript = YouTubeTranscriptApi.get_transcript(self.video_id)\n",
" return \" \".join([item['text'] for item in transcript])\n",
" except Exception as e:\n",
" print(f\"Error fetching transcript: {e}\")\n",
" return None\n",
" \n",
" def get_summary_from_transcript(self, transcript):\n",
" \"\"\" summarize text using openai \"\"\"\n",
" try:\n",
" print('Summarizing video...')\n",
" response = self.openai.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": self.system_prompt},\n",
" {\"role\": \"user\", \"content\": f\"Summarize the following YouTube video transcript:\\n\\n{transcript}\"}\n",
" ]\n",
" )\n",
" return response.choices[0].message.content\n",
" except Exception as e:\n",
" print(f\"Error summarizing text: {e}\")\n",
" return None\n",
"\n",
" def display_summary(self):\n",
" \"\"\" summarize youtube video \"\"\"\n",
" transcript = self.get_transcript()\n",
" summary = self.get_summary_from_transcript(transcript)\n",
" display(Markdown(summary))\n"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"# video link and share link of same youtube video\n",
"test_url_1 = \"https://www.youtube.com/watch?v=nYy-umCNKPQ&list=PLWHe-9GP9SMMdl6SLaovUQF2abiLGbMjs\"\n",
"test_url_2 = \"https://youtu.be/nYy-umCNKPQ?si=ILVrQlKT0W71G5pU\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test that we get same id\n",
"video1, video2 = YouTubeWebLink(test_url_1), YouTubeWebLink(test_url_2)\n",
"video1.video_id, video2.video_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"video1.display_summary()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@@ -0,0 +1,293 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2c80b652-eadd-4d48-a512-d5945c0365d3",
"metadata": {},
"source": [
"# Compare websites"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables \n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the Troubleshooting notebook in this folder for full instructions"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## Website class"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e793b2-6775-426a-a139-4848291d0463",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## Website messages function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Website summary"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(summary): \n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
"source": [
"w1 = \"https://cnn.com\"\n",
"summary1 = summarize(w1)\n",
"display_summary(summary1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"w2 = \"https://www.foxnews.com\"\n",
"summary2 = summarize(w2)\n",
"display_summary(summary2)"
]
},
{
"cell_type": "markdown",
"id": "0a51b45c-f3a6-4b0b-acfe-52957c04fd94",
"metadata": {},
"source": [
"## Comparison between two websites"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b30d5a5-bbe5-499c-9392-0896440f80c7",
"metadata": {},
"outputs": [],
"source": [
"system_prompt_compare = \"\"\"You are a website analyst that compares the summaries of two websites\n",
"and provides a compare and contrast between the two. \n",
"Respond in markdown.\"\"\"\n",
"\n",
"def user_prompt_for_compare(summary1, summary2):\n",
" user_prompt = f\"You are asked to compare this summary of a website {summary1}\\n\\n\"\n",
" user_prompt += f\"\\nWith the summary of this second website {summary2}\\n\\n\"\n",
" user_prompt += \"please provide a short comparison of the two websites. \\\n",
"List the similarities and differences in bullet point format.\\n\\n\" \n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5c9c955-840f-4c31-a1a7-b4872f77f3b4",
"metadata": {},
"outputs": [],
"source": [
"def messages_for_compare():\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt_compare},\n",
" {\"role\": \"user\", \"content\": user_prompt_for_compare(summary1, summary2)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56307d77-f207-48f1-b59a-e97f6a2a37dd",
"metadata": {},
"outputs": [],
"source": [
"def compare(): \n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages_for_compare()\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae3140bb-ddad-43e2-b697-6d05ae541544",
"metadata": {},
"outputs": [],
"source": [
"display_summary(compare())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,448 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "92d0aa2b-8e2f-4c1b-8b81-646faf4cd8c5",
"metadata": {},
"source": [
"# And now the change for Ollama\n",
"\n",
"1. No environment variables are needed (no keys) so this part has been removed\n",
"\n",
"2. The OpenAI client library is being initialized to point to your local computer for Ollama\n",
"\n",
"3. You need to have installed Ollama on your computer, and run `ollama run llama3.2` in a Powershell or Terminal if you haven't already\n",
"\n",
"4. Anywhere in this lab that it used to have **gpt-4o-mini** it now has **lama3.2**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"# Here it is - see the base_url\n",
"\n",
"openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n"
]
},
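    {
     "cell_type": "code",
     "execution_count": null,
     "id": "0f3c1a2b-ollama-sanity-check",
     "metadata": {},
     "outputs": [],
     "source": [
      "# Optional sanity check (a sketch - it assumes Ollama's default port 11434):\n",
      "# confirm the local server is reachable before making any chat calls.\n",
      "import requests\n",
      "try:\n",
      "    requests.get(\"http://localhost:11434\", timeout=2)\n",
      "    print(\"Ollama server is reachable\")\n",
      "except requests.exceptions.ConnectionError:\n",
      "    print(\"Ollama doesn't seem to be running - try `ollama serve` in a terminal\")"
     ]
    },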
{
"cell_type": "markdown",
"id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
"metadata": {},
"source": [
"# Let's make a quick call to a Frontier model to get started, as a preview!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"\n",
"message = \"Hello, Llama! This is my first ever message to you! Hi!\"\n",
"response = openai.chat.completions.create(model=\"llama3.2\", messages=[{\"role\":\"user\", \"content\":message}])\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## OK onwards with our first project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e793b2-6775-426a-a139-4848291d0463",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"source": [
"# Let's try one out. Change the website and add print statements to follow along.\n",
"\n",
"ed = Website(\"https://sohanpatharla.vercel.app/about\")\n",
"print(ed.title)\n",
"print(\"Title is printed above\")\n",
"print(ed.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.\"\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## And now let's build useful messages for GPT-4o-mini, using a function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Time to bring it together - the API for OpenAI is very simple!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model = \"llama3.2\",\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://sohanpatharla.vercel.app/about\")"
]
},
{
"cell_type": "markdown",
"id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
"metadata": {},
"source": [
"# Let's try more websites\n",
"\n",
"Note that this will only work on websites that can be scraped using this simplistic approach.\n",
"\n",
"Websites that are rendered with Javascript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
"\n",
"Also Websites protected with CloudFront (and similar) may give 403 errors - many thanks Andy J for pointing this out.\n",
"\n",
"But many websites will work just fine!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://openai.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "markdown",
"id": "490381df-3d03-4aaa-8f29-c5c10ace0ab5",
"metadata": {},
"source": [
"## Email Subject Suggestion based on the letter body"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"\"\"You are an assistant that analyzes the contents of an email letter body \\\n",
"and provide a appropriate short subject line for that email,based on that email body. \\\n",
"\"\"\"\n",
"user_prompt = \"\"\"\n",
" \\nThe contents of an email body is as follows; \\\n",
"understand the content in that well and provide me a appropriate subject based on the text content in it. \\\n",
"Understand the sentiment of the email and choose the subject type to be formal or informal or anything.\\n\\n\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" \n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"Hey John, just wanted to say thanks for your help with the move last weekend! Couldn't have done it without you.\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"Dear Hiring Manager, I am writing to express my interest in the Marketing Manager position listed on your companys website.\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"We are excited to invite you to our annual developer conference taking place in San Francisco this July. Register today to secure your spot!\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"Hello, I'm following up on the support ticket I submitted last week regarding the issue with logging into my account. I still havent received a resolution.\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"Congratulations! You've been selected as one of our winners in the Spring Giveaway Contest. Claim your prize by replying to this email.\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"Good morning team, just a reminder that our Q2 strategy meeting is scheduled for 10 AM tomorrow in Conference Room B.\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"Hi Mom, the flight was fine, and I got here safely. The weathers great and the Airbnb is cozy. Ill send pictures soon!\n",
"\"\"\"},\n",
"\n",
" {\"role\": \"user\", \"content\": user_prompt + \"\"\"\n",
"To whom it may concern, I am very dissatisfied with the quality of the product I received and would like a full refund.\n",
"\"\"\"}\n",
"]\n",
"\n",
"\n",
"# Step 3: Call OpenAI\n",
"\n",
"response =openai.chat.completions.create(model=\"llama3.2\",messages=messages)\n",
"\n",
"# Step 4: print the result\n",
"# response = openai.chat.completions.create(model=\"llama3.2\", messages=messages)\n",
"#print(response.choices[0].message.content)\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf424661-6c39-4398-9983-9b02df7e9311",
"metadata": {},
"outputs": [],
"source": [
"!pip install selenium"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": [
"#Parse webpages which is designed using JavaScript heavely\n",
"# download the chorme driver from here as per your version of chrome - https://developer.chrome.com/docs/chromedriver/downloads\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.service import Service\n",
"from selenium.webdriver.common.by import By\n",
"from selenium.webdriver.chrome.options import Options\n",
"\n",
"PATH_TO_CHROME_DRIVER = r'C:\\Users\\sohan\\Downloads\\chromedriver-win64\\chromedriver-win64\\chromedriver.exe'\n",
"\n",
"class Website:\n",
" url: str\n",
" title: str\n",
" text: str\n",
"\n",
" def __init__(self, url):\n",
" self.url = url\n",
"\n",
" options = Options()\n",
"\n",
" options.add_argument(\"--no-sandbox\")\n",
" options.add_argument(\"--disable-dev-shm-usage\")\n",
"\n",
" service = Service(PATH_TO_CHROME_DRIVER)\n",
" driver = webdriver.Chrome(service=service, options=options)\n",
" driver.get(url)\n",
"\n",
" input(\"Please complete the verification in the browser and press Enter to continue...\")\n",
" page_source = driver.page_source\n",
" driver.quit()\n",
"\n",
" soup = BeautifulSoup(page_source, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.get_text(separator=\"\\n\", strip=True)"
]
},
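    {
     "cell_type": "code",
     "execution_count": null,
     "id": "3e8f5b21-headless-sketch",
     "metadata": {},
     "outputs": [],
     "source": [
      "# A hedged variant of the Selenium approach above: if the target site doesn't\n",
      "# show a verification challenge, you can run Chrome headless and drop the\n",
      "# input() pause. This is a sketch - it assumes Selenium 4.6+, where Selenium\n",
      "# Manager resolves chromedriver automatically, so no explicit path is needed.\n",
      "# It reuses webdriver and Options from the imports in the previous cell.\n",
      "def fetch_headless(url):\n",
      "    options = Options()\n",
      "    options.add_argument(\"--headless=new\")\n",
      "    options.add_argument(\"--no-sandbox\")\n",
      "    options.add_argument(\"--disable-dev-shm-usage\")\n",
      "    driver = webdriver.Chrome(options=options)\n",
      "    try:\n",
      "        driver.get(url)\n",
      "        return driver.page_source\n",
      "    finally:\n",
      "        driver.quit()"
     ]
    },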
{
"cell_type": "code",
"execution_count": null,
"id": "56989f9b-8efb-4cfb-a355-1c50d36cc9b2",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"display_summary(\"https://openai.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59b15b6d-3743-44a0-9dd4-23c9e9da6e3e",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,431 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "35f59eb3",
"metadata": {},
"source": [
"# Pluggable Web Scraper and Summarizer with Interface-Based Design\n",
"\n",
"This system implements a **pluggable architecture** for web scraping and summarization, built on interface-based design using Pythons `Protocol` types. Each stage of the pipeline—content fetching, HTML parsing, and LLM-based summarization—is defined through explicit structural contracts rather than concrete implementations. Components like `RequestsFetcher`, `RobustSoupParser`, and `OllamaClient` fulfill these protocols and can be swapped independently, enabling flexibility, testing, and future extension without modifying core logic. Immutable data models (`@dataclass(frozen=True)`) enforce data integrity throughout the pipeline, while the design cleanly separates concerns across modules to support maintainability and modular growth."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f42e6d21",
"metadata": {},
"outputs": [],
"source": [
"from dataclasses import dataclass\n",
"from typing import Protocol, Optional, List, Dict, Tuple\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"import logging\n",
"import chardet"
]
},
{
"cell_type": "markdown",
"id": "65c17368",
"metadata": {},
"source": [
"# Configuration"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "eb0904d7",
"metadata": {},
"outputs": [],
"source": [
"logging.basicConfig(level=logging.INFO)\n",
"logger = logging.getLogger(__name__)\n",
"\n",
"HEADERS = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\",\n",
"}\n",
"DEFAULT_TIMEOUT = 10\n",
"UNWANTED_TAGS = [\"script\", \"style\", \"nav\", \"header\", \"footer\", \"img\", \"input\"]"
]
},
{
"cell_type": "markdown",
"id": "8110aa46",
"metadata": {},
"source": [
"# Data Models"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "cdb6c990",
"metadata": {},
"outputs": [],
"source": [
"@dataclass(frozen=True)\n",
"class RawResponse:\n",
" content: bytes\n",
" status_code: int\n",
" encoding: str\n",
" headers: Dict[str, str]\n",
" elapsed: float\n",
" final_url: str\n",
"\n",
"@dataclass(frozen=True)\n",
"class WebsiteContent:\n",
" url: str\n",
" title: str\n",
" text: str\n",
" status_code: int\n",
" response_time: float\n",
"\n",
"@dataclass(frozen=True)\n",
"class LLMResponse:\n",
" content: str\n",
" model: str\n",
" tokens_used: int"
]
},
{
"cell_type": "markdown",
"id": "87b2a97a",
"metadata": {},
"source": [
"# Protocols"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3070eac2",
"metadata": {},
"outputs": [],
"source": [
"class ContentFetcher(Protocol):\n",
" def fetch(self, url: str) -> RawResponse: ...\n",
"\n",
"class ContentParser(Protocol):\n",
" def parse(self, response: RawResponse) -> WebsiteContent: ...\n",
"\n",
"class LLMClient(Protocol):\n",
" def generate(self, messages: List[Dict[str, str]], model: str) -> LLMResponse: ...\n"
]
},
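    {
     "cell_type": "code",
     "execution_count": null,
     "id": "7c2d9e10-stub-fetcher-demo",
     "metadata": {},
     "outputs": [],
     "source": [
      "# A minimal sketch of why the Protocol-based design helps: any object with a\n",
      "# matching fetch() satisfies ContentFetcher structurally, so the pipeline can\n",
      "# be exercised without network access. StubFetcher and its canned HTML are\n",
      "# illustrative assumptions for testing, not part of the pipeline itself.\n",
      "class StubFetcher:\n",
      "    def fetch(self, url: str) -> RawResponse:\n",
      "        html = b\"<html><title>Stub</title><body><p>Hello world</p></body></html>\"\n",
      "        return RawResponse(content=html, status_code=200, encoding=\"utf-8\",\n",
      "                           headers={}, elapsed=0.0, final_url=url)\n",
      "\n",
      "stub: ContentFetcher = StubFetcher()  # accepted thanks to structural typing\n",
      "print(stub.fetch(\"https://example.com\").status_code)"
     ]
    },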
{
"cell_type": "markdown",
"id": "553daa11",
"metadata": {},
"source": [
"# Implementations"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "1a42bed9",
"metadata": {},
"outputs": [],
"source": [
"class RequestsFetcher:\n",
" def __init__(self, \n",
" headers: Dict[str, str] = HEADERS,\n",
" timeout: int = DEFAULT_TIMEOUT,\n",
" max_redirects: int = 5):\n",
" self.headers = headers\n",
" self.timeout = timeout\n",
" self.max_redirects = max_redirects\n",
"\n",
" def fetch(self, url: str) -> RawResponse:\n",
" logger.info(f\"Fetching content from {url}\")\n",
" try:\n",
" response = requests.get(\n",
" url,\n",
" headers=self.headers,\n",
" timeout=self.timeout,\n",
" allow_redirects=True,\n",
" stream=False # Prevent partial content issues\n",
" )\n",
" response.raise_for_status()\n",
" \n",
" return RawResponse(\n",
" content=response.content,\n",
" status_code=response.status_code,\n",
" encoding=response.encoding,\n",
" headers=dict(response.headers),\n",
" elapsed=response.elapsed.total_seconds(),\n",
" final_url=response.url\n",
" )\n",
" except requests.exceptions.RequestException as e:\n",
" logger.error(f\"Failed to fetch {url}: {str(e)}\")\n",
" raise\n",
"\n",
"class RobustSoupParser:\n",
" def __init__(self, unwanted_tags: Tuple[str] = UNWANTED_TAGS):\n",
" self.unwanted_tags = unwanted_tags\n",
"\n",
" def parse(self, response: RawResponse) -> WebsiteContent:\n",
" logger.info(f\"Parsing content from {response.final_url}\")\n",
" \n",
" # Detect encoding if not provided\n",
" encoding = response.encoding or self._detect_encoding(response.content)\n",
" \n",
" try:\n",
" decoded_content = response.content.decode(encoding, errors='replace')\n",
" soup = BeautifulSoup(decoded_content, 'html.parser')\n",
" except Exception as e:\n",
" logger.error(f\"Failed to parse content: {str(e)}\")\n",
" raise\n",
"\n",
" return WebsiteContent(\n",
" url=response.final_url,\n",
" title=self._extract_title(soup),\n",
" text=self._clean_content(soup),\n",
" status_code=response.status_code,\n",
" response_time=response.elapsed\n",
" )\n",
"\n",
" def _detect_encoding(self, content: bytes) -> str:\n",
" result = chardet.detect(content)\n",
" return result['encoding'] or 'utf-8'\n",
"\n",
" def _extract_title(self, soup: BeautifulSoup) -> str:\n",
" title_tag = soup.find('title')\n",
" return title_tag.text.strip() if title_tag else \"Untitled\"\n",
"\n",
" def _clean_content(self, soup: BeautifulSoup) -> str:\n",
" # Remove unwanted tags\n",
" for tag in self.unwanted_tags:\n",
" for element in soup.find_all(tag):\n",
" element.decompose()\n",
"\n",
" # Extract text with semantic line breaks\n",
" text = '\\n\\n'.join([\n",
" element.get_text().strip()\n",
" for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'article'])\n",
" if element.get_text().strip()\n",
" ])\n",
" \n",
" return text or \"No readable content found\"\n",
"\n",
"class OllamaClient:\n",
" def __init__(self, \n",
" base_url: str = 'http://localhost:11434/v1',\n",
" api_key: str = 'ollama',\n",
" max_retries: int = 3):\n",
" self.client = OpenAI(base_url=base_url, api_key=api_key)\n",
" self.max_retries = max_retries\n",
"\n",
" def generate(self, \n",
" messages: List[Dict[str, str]], \n",
" model: str = \"llama3.2\") -> LLMResponse:\n",
" logger.info(f\"Generating summary with {model}\")\n",
" \n",
" for attempt in range(self.max_retries):\n",
" try:\n",
" response = self.client.chat.completions.create(\n",
" model=model,\n",
" messages=messages\n",
" )\n",
" return LLMResponse(\n",
" content=response.choices[0].message.content,\n",
" model=model,\n",
" tokens_used=response.usage.total_tokens\n",
" )\n",
" except Exception as e:\n",
" if attempt == self.max_retries - 1:\n",
" logger.error(f\"Failed after {self.max_retries} attempts: {str(e)}\")\n",
" raise\n",
" logger.warning(f\"Retry {attempt + 1}/{self.max_retries}\")"
]
},
{
"cell_type": "markdown",
"id": "1805d4f8",
"metadata": {},
"source": [
"# Core Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "a985806a",
"metadata": {},
"outputs": [],
"source": [
"class SummarizationPipeline:\n",
" SYSTEM_PROMPT = \"\"\"You are a professional web content analyst. Provide a structured markdown summary containing:\n",
"- Key points\n",
"- Notable statistics\n",
"- Important names/dates\n",
"- Actionable insights\n",
"Avoid navigation content and marketing fluff.\"\"\"\n",
"\n",
" def __init__(self,\n",
" fetcher: ContentFetcher,\n",
" parser: ContentParser,\n",
" llm_client: LLMClient):\n",
" self.fetcher = fetcher\n",
" self.parser = parser\n",
" self.llm_client = llm_client\n",
"\n",
" def summarize(self, url: str, model: str = \"llama3.2\") -> LLMResponse:\n",
" raw_response = self.fetcher.fetch(url)\n",
" website_content = self.parser.parse(raw_response)\n",
" messages = self._build_messages(website_content)\n",
" return self.llm_client.generate(messages, model)\n",
"\n",
" def _build_messages(self, content: WebsiteContent) -> List[Dict[str, str]]:\n",
" user_prompt = f\"\"\"**Website Analysis Request**\n",
"URL: {content.url}\n",
"Title: {content.title}\n",
"\n",
"Content:\n",
"{content.text[:8000]} # Truncate to stay within context window\n",
"\n",
"Please provide a comprehensive summary following the guidelines above.\"\"\"\n",
" \n",
" return [\n",
" {\"role\": \"system\", \"content\": self.SYSTEM_PROMPT},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]"
]
},
{
"cell_type": "markdown",
"id": "41832e20",
"metadata": {},
"source": [
"# Factory & Presentation"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "656b8dd4",
"metadata": {},
"outputs": [],
"source": [
"def create_default_pipeline() -> SummarizationPipeline:\n",
" return SummarizationPipeline(\n",
" fetcher=RequestsFetcher(),\n",
" parser=RobustSoupParser(),\n",
" llm_client=OllamaClient()\n",
" )\n",
"\n",
"class JupyterPresenter:\n",
" @staticmethod\n",
" def display(response: LLMResponse) -> None:\n",
" display(Markdown(f\"\"\"\n",
"## Summary Results\n",
"**Model**: {response.model} \n",
"**Tokens Used**: {response.tokens_used} \n",
"**Summary**:\n",
"{response.content}\n",
" \"\"\"))\n",
" "
]
},
{
"cell_type": "markdown",
"id": "76339788",
"metadata": {},
"source": [
"# Execution"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "69304964",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:__main__:Fetching content from https://edwarddonner.com\n",
"INFO:__main__:Parsing content from https://edwarddonner.com/\n",
"INFO:__main__:Generating summary with llama3.2\n",
"INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"data": {
"text/markdown": [
"\n",
"## Summary Results\n",
"**Model**: llama3.2 \n",
"**Tokens Used**: 630 \n",
"**Summary**:\n",
"**Website Analysis Summary**\n",
"==========================\n",
"\n",
"### Key Points\n",
"\n",
"* The website belongs to Edward Donner, a co-founder and CTO of Nebula.io, an AI startup applying LLMs for talent discovery.\n",
"* The website showcases Donner's interests in code writing, music production, and technology.\n",
"* It announces the launch of The Complete Agentic AI Engineering Course and provides resources on LLM workshop and mastering AI.\n",
"\n",
"### Notable Statistics\n",
"\n",
"* None mentioned, as there are no explicit statistics provided on the website.\n",
"\n",
"### Important Names/Dates\n",
"\n",
"* Edward Donner: Website owner and CTO of Nebula.io.\n",
"* 2021: Year in which AI startup untapt was acquired by an unknown party (no information about the acquirer is available).\n",
"\n",
"### Actionable Insights\n",
"\n",
"* The website appears to be a personal page showcasing Donner's expertise in AI, LLMs, and talent discovery. It may serve as a way for him to establish his professional brand and network with potential clients or collaborators.\n",
"* Offering resources and courses, such as \"The Complete Agentic AI Engineering Course\" and workshops, can help attract visitors and demonstrate the company's capabilities.\n",
"* Subscribing to the website might offer exclusive access to updates, insights on LLMs and talent discovery, and potentially lucrative career opportunities.\n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pipeline = create_default_pipeline()\n",
"try:\n",
" response = pipeline.summarize(\"https://edwarddonner.com\")\n",
" JupyterPresenter.display(response)\n",
"except Exception as e:\n",
" logger.error(f\"Summarization failed: {str(e)}\")\n",
" display(Markdown(\"## Error\\nUnable to generate summary. Please check the URL and try again.\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,273 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a15135e6-3ba5-44ae-b14b-dc67674a5ca3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"# Resarch Paper Summarizer by Name"
]
},
{
"cell_type": "markdown",
"id": "a50f02ea-0f04-4f68-ae66-d1369780065e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea6e09ac-adee-4bb8-b3bd-4f6411495751",
"metadata": {},
"outputs": [],
"source": [
"## If dependencies do not exist please install them\n",
"# !pip install python-dotenv openai arxiv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5301f2b-3037-4a85-b7cd-5e6bd700418a",
"metadata": {},
"outputs": [],
"source": [
"import arxiv\n",
"import os\n",
"from openai import OpenAI\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "markdown",
"id": "ac45a1f4-0005-4e0a-be90-741182c1db9f",
"metadata": {},
"source": [
"### Load Open AI Key"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "381bef97-6bb7-4bdc-a71d-2ea65c8f6964",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv()\n",
"api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"if not api_key:\n",
" print(\"❌ No OpenAI API key found in .env file.\")\n",
"else:\n",
" print(\"✅ API key loaded successfully.\")\n",
"\n",
"# ✅ Initialize OpenAI\n",
"openai = OpenAI(api_key=api_key)"
]
},
{
"cell_type": "markdown",
"id": "00817dbe-209e-418c-bb46-7b6b866fdff4",
"metadata": {},
"source": [
"### Main Class MLResearchFetcher"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7355ba4c-ef61-4934-bb79-4d80b4473e52",
"metadata": {},
"outputs": [],
"source": [
"class MLResearchFetcher:\n",
" def __init__(self, system_prompt, query=\"machine learning\", max_results=5):\n",
" self.query = query\n",
" self.max_results = max_results\n",
" self.system_prompt = system_prompt\n",
"\n",
" def fetch_papers(self):\n",
" search = arxiv.Search(\n",
" query=f'ti:\"{self.query}\"',\n",
" max_results=self.max_results,\n",
" sort_by=arxiv.SortCriterion.SubmittedDate,\n",
" sort_order=arxiv.SortOrder.Descending,\n",
" )\n",
" return list(search.results())\n",
"\n",
" def summarize_abstract(self, abstract, system_prompt):\n",
" try:\n",
" completion = openai.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": abstract}\n",
" ]\n",
" )\n",
" return completion.choices[0].message.content.strip()\n",
" except Exception as e:\n",
" return f\"❌ Error during summarization: {e}\"\n",
"\n",
" def display_results(self):\n",
" papers = self.fetch_papers()\n",
" for paper in papers:\n",
" display(Markdown(f\"### 📄 [{paper.title}]({paper.entry_id})\"))\n",
" display(Markdown(f\"**Authors:** {', '.join(author.name for author in paper.authors)}\"))\n",
" display(Markdown(f\"**Published:** {paper.published.date()}\"))\n",
" display(Markdown(f\"**Abstract:** {paper.summary.strip()}\"))\n",
" summary = self.summarize_abstract(paper.summary, self.system_prompt)\n",
" display(Markdown(f\"**🔍 Summary:** {summary}\"))\n",
" display(Markdown(\"---\"))"
]
},
{
"cell_type": "markdown",
"id": "304857ba-e832-42a3-8219-ec9202e41509",
"metadata": {},
"source": [
"### Helper Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1be2a2da-135b-4aec-b200-dc364d319ac4",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"You are an expert research paper summarizer and AI research assistant. \\\n",
"When provided with the URL or content of a research paper in the field of machine learning, artificial intelligence, or data science, perform the following: \\\n",
"1. **Extract and present** the following details in a clear, structured Markdown format: \\\n",
" - Title and Author(s) \\\n",
" - Year of Publication \\\n",
" - Objective or Aim of the Research (Why the study was conducted) \\\n",
" - Background or Introduction (What foundational knowledge or motivation led to this work) \\\n",
" - Type of Research (e.g., empirical study, theoretical analysis, experimental benchmark) \\\n",
" - Methods or Methodology (How the research was conducted: dataset, models, techniques used) \\\n",
" - Results and Key Findings (What was discovered or proven) \\\n",
" - Conclusion (Summary of insights, limitations, and proposed future work) \\\n",
"\\\n",
"2. **Evaluate** the impact and relevance of the paper: \\\n",
" - Assess the significance of the research to the broader ML/AI community \\\n",
" - Note any novelty, performance improvements, or theoretical breakthroughs \\\n",
" - Comment on the potential applications or industry relevance \\\n",
"\\\n",
"3. **Suggest new research directions**: \\\n",
" - Identify gaps, limitations, or unexplored ideas in the paper \\\n",
" - Propose at least one new research idea or follow-up paper that builds upon this work \\\n",
"\\\n",
"Respond in a clean, professional Markdown format suitable for researchers or students reviewing the literature.\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8b68134-c265-4272-87c4-e16fc205e7c4",
"metadata": {},
"outputs": [],
"source": [
"def print_papers(papers):\n",
" for paper in papers:\n",
" title = paper.title\n",
" authors = \", \".join(author.name for author in paper.authors)\n",
" published = paper.published.strftime('%Y-%m-%d')\n",
" abstract = paper.summary.strip()\n",
" link = paper.entry_id\n",
" pdf_link = [l.href for l in paper.links if l.title == 'pdf']\n",
" categories = \", \".join(paper.categories)\n",
"\n",
" print(f\"\\n📄 Title: {title}\")\n",
" print(f\"👥 Authors: {authors}\")\n",
" print(f\"📅 Published: {published}\")\n",
" print(f\"🏷️ Categories: {categories}\")\n",
" print(f\"🔗 Link: {link}\")\n",
" if pdf_link:\n",
" print(f\"📄 PDF: {pdf_link[0]}\")\n",
" print(f\"\\n📝 Abstract:\\n{abstract}\")\n",
" print(\"-\" * 80)\n"
]
},
{
"cell_type": "markdown",
"id": "9e688bbd-d3dd-4f2b-a7c3-d6e550ec9667",
"metadata": {},
"source": [
"#### Fetch the papers, given a paper title"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dcf9639-d6b5-4194-b6a2-5260329fcbe7",
"metadata": {},
"outputs": [],
"source": [
"fetcher = MLResearchFetcher(system_prompt, query=\"QWEN2 TECHNICAL REPORT\", max_results=3)\n",
"papers = fetcher.fetch_papers()\n",
"print_papers(papers)"
]
},
{
"cell_type": "markdown",
"id": "a04e219b-389f-4e0a-9645-662d966d4055",
"metadata": {},
"source": [
"### Call the model and get the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "297e915b-078a-49c7-836f-3c4ddf8e17dc",
"metadata": {},
"outputs": [],
"source": [
"fetcher.display_results()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2344499c-3b39-4497-a0bf-1cff83117fdc",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
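The notebook above prints arXiv results with `print_papers`. As a minimal, network-free sketch of the same formatting logic, the snippet below uses a stand-in object that mimics the attribute names the notebook's code reads from the `arxiv` package's result objects (`title`, `authors`, `published`, `entry_id`); the stand-in values and the placeholder link are hypothetical, not real arXiv data.

```python
from datetime import date
from types import SimpleNamespace

def format_paper(paper) -> str:
    """Format paper metadata the way print_papers above does (simplified)."""
    authors = ", ".join(a.name for a in paper.authors)
    return "\n".join([
        f"Title: {paper.title}",
        f"Authors: {authors}",
        f"Published: {paper.published.strftime('%Y-%m-%d')}",
        f"Link: {paper.entry_id}",
    ])

# Hypothetical stand-in for an arxiv result object (no network needed)
fake = SimpleNamespace(
    title="Qwen2 Technical Report",
    authors=[SimpleNamespace(name="An Author"), SimpleNamespace(name="Another Author")],
    published=date(2024, 7, 15),
    entry_id="https://arxiv.org/abs/xxxx.xxxxx",  # placeholder link
)
print(format_paper(fake))
```

Testing the formatter against a stand-in like this lets you iterate on the output layout without repeatedly hitting the arXiv API.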


@@ -0,0 +1,615 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# Instant Gratification\n",
"\n",
"## Your first Frontier LLM Project!\n",
"\n",
"Let's build a useful LLM solution - in a matter of minutes.\n",
"\n",
"By the end of this course, you will have built an autonomous Agentic AI solution with 7 agents that collaborate to solve a business problem. All in good time! We will start with something smaller...\n",
"\n",
"Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
"\n",
"Before starting, you should have completed the setup for [PC](../SETUP-PC.md) or [Mac](../SETUP-mac.md) and you hopefully launched this jupyter lab from within the project root directory, with your environment activated.\n",
"\n",
"## If you're new to Jupyter Lab\n",
"\n",
"Welcome to the wonderful world of Data Science experimentation! Once you've used Jupyter Lab, you'll wonder how you ever lived without it. Simply click in each \"cell\" with code in it, such as the cell immediately below this text, and hit Shift+Return to execute that cell. As you wish, you can add a cell with the + button in the toolbar, and print values of variables, or try out variations. \n",
"\n",
"I've written a notebook called [Guide to Jupyter](Guide%20to%20Jupyter.ipynb) to help you get more familiar with Jupyter Labs, including adding Markdown comments, using `!` to run shell commands, and `tqdm` to show progress.\n",
"\n",
"## If you'd prefer to work in IDEs\n",
"\n",
"If you're more comfortable in IDEs like VSCode or Pycharm, they both work great with these lab notebooks too. \n",
"If you'd prefer to work in VSCode, [here](https://chatgpt.com/share/676f2e19-c228-8012-9911-6ca42f8ed766) are instructions from an AI friend on how to configure it for the course.\n",
"\n",
"## If you'd like to brush up your Python\n",
"\n",
"I've added a notebook called [Intermediate Python](Intermediate%20Python.ipynb) to get you up to speed. But you should give it a miss if you already have a good idea what this code does: \n",
"`yield from {book.get(\"author\") for book in books if book.get(\"author\")}`\n",
"\n",
"## I am here to help\n",
"\n",
"If you have any problems at all, please do reach out. \n",
"I'm available through the platform, or at ed@edwarddonner.com, or at https://www.linkedin.com/in/eddonner/ if you'd like to connect (and I love connecting!)\n",
"\n",
"## More troubleshooting\n",
"\n",
"Please see the [troubleshooting](troubleshooting.ipynb) notebook in this folder to diagnose and fix common problems. At the very end of it is a diagnostics script with some useful debug info.\n",
"\n",
"## If this is old hat!\n",
"\n",
"If you're already comfortable with today's material, please hang in there; you can move swiftly through the first few labs - we will get much more in depth as the weeks progress.\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Please read - important note</h2>\n",
" <span style=\"color:#900;\">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you do this with me, either at the same time, or (perhaps better) right afterwards. Add print statements to understand what's going on, and then come up with your own variations. If you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business value of these exercises</h2>\n",
" <span style=\"color:#181;\">A final thought. While I've designed these notebooks to be educational, I've also tried to make them enjoyable. We'll do fun things like have LLMs tell jokes and argue with each other. But fundamentally, my goal is to teach skills you can apply in business. I'll explain business implications as we go, and it's worth keeping this in mind: as you build experience with models and techniques, think of ways you could put this into action at work today. Please do contact me if you'd like to discuss more or if you have ideas to bounce off me.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc8e7064-bca4-48b5-8598-dee42658cab3",
"metadata": {},
"outputs": [],
"source": [
"%pip install -q -U google-generativeai"
]
},
{
"cell_type": "markdown",
"id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
"metadata": {},
"source": [
"# Connecting to OpenAI\n",
"\n",
"The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI.\n",
"\n",
"## Troubleshooting if you have problems:\n",
"\n",
"Head over to the [troubleshooting](troubleshooting.ipynb) notebook in this folder for step by step code to identify the root cause and fix it!\n",
"\n",
"If you make a change, try restarting the \"Kernel\" (the python process sitting behind this notebook) by Kernel menu >> Restart Kernel and Clear Outputs of All Cells. Then try this notebook again, starting at the top.\n",
"\n",
"Or, contact me! Message me or email ed@edwarddonner.com and we will get this to work.\n",
"\n",
"Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the Troubleshooting notebook in this folder for full instructions"
]
},
{
"cell_type": "markdown",
"id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
"metadata": {},
"source": [
"# Let's make a quick call to an LLM to get started, as a preview! This version calls a local Llama 3.2 model via Ollama."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81249b57-bf32-42a5-870d-411a58792dcc",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"MODEL = \"llama3.2\"\n",
"openai = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")\n",
"\n",
"response = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[{\"role\": \"user\", \"content\": \"What is 2 + 2?\"}]\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"\n",
"message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
"response = openai.chat.completions.create(model=\"llama3.2\", messages=[{\"role\":\"user\", \"content\":message}])\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## OK onwards with our first project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e793b2-6775-426a-a139-4848291d0463",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
" \"\"\"\n",
" A utility class to represent a website that we have scraped\n",
"\n",
" \"\"\"\n",
" url:str\n",
" title:str\n",
" text:str\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Let's try one out. Change the website and add print statements to follow along.\n",
"\n",
"ed = Website(\"https://edwarddonner.com\")\n",
"print(ed.title)\n",
"# print(ed.text)"
]
},
{
"cell_type": "markdown",
"id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
"metadata": {},
"source": [
"## Types of prompts\n",
"\n",
"You may know this already - but if not, you will get very familiar with it!\n",
"\n",
"Models like GPT-4o have been trained to receive instructions in a particular way.\n",
"\n",
"They expect to receive:\n",
"\n",
"**A system prompt** that tells them what task they are performing and what tone they should use\n",
"\n",
"**A user prompt** -- the conversation starter that they should reply to"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26448ec4-5c00-4204-baec-7df91d11ff2e",
"metadata": {},
"outputs": [],
"source": [
"print(user_prompt_for(ed))"
]
},
{
"cell_type": "markdown",
"id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
"metadata": {},
"source": [
"## Messages\n",
"\n",
"The API from OpenAI expects to receive messages in a particular structure.\n",
"Many of the other APIs share this structure:\n",
"\n",
"```\n",
"[\n",
"    {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
"    {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
"]\n",
"```\n",
"\n",
"To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a snarky assistant\"},\n",
" {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21ed95c5-7001-47de-a36d-1d6673b403ce",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with system and user messages:\n",
"\n",
"response = openai.chat.completions.create(model=\"llama3.2\", messages=messages)\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## And now let's build useful messages for the LLM, using a function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
"metadata": {},
"outputs": [],
"source": [
"# Try this out, and then try for a few more websites\n",
"\n",
"messages_for(ed)"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Time to bring it together - the API for OpenAI is very simple!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"!ollama pull llama3.2\n",
"\n",
"from openai import OpenAI\n",
"MODEL = \"llama3.2\"\n",
"openai = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")\n",
"\n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b65dd67-8ae7-4932-85ad-128bf8850148",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "markdown",
"id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
"metadata": {},
"source": [
"# Let's try more websites\n",
"\n",
"Note that this will only work on websites that can be scraped using this simplistic approach.\n",
"\n",
"Websites that are rendered with JavaScript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
"\n",
"Also, websites protected by CloudFront (and similar) may return 403 errors - many thanks to Andy J for pointing this out.\n",
"\n",
"But many websites will work just fine!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://cnn.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "markdown",
"id": "c951be1a-7f1b-448f-af1f-845978e47e2c",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business applications</h2>\n",
" <span style=\"color:#181;\">In this exercise, you experienced calling the Cloud API of a Frontier Model (a leading model at the frontier of AI) for the first time. We will be using APIs like OpenAI at many stages in the course, in addition to building our own LLMs.\n",
"\n",
"More specifically, we've applied this to Summarization - a classic Gen AI use case to make a summary. This can be applied to any business vertical - summarizing the news, summarizing financial performance, summarizing a resume in a cover letter - the applications are limitless. Consider how you could apply Summarization in your business, and try prototyping a solution.</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Before you continue - now try yourself</h2>\n",
" <span style=\"color:#900;\">Use the cell below to make your own simple commercial example. Stick with the summarization use case for now. Here's an idea: write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"# Note: this prompt only mentions the URL - the model can't fetch pages itself, so it will answer from its general knowledge of GitHub\n",
"system_prompt = \"You are an assistant that analyzes website content and understands it\"\n",
"user_prompt = \"\"\"\n",
"    Summarize the website https://www.github.com. Ignore components such as inputs, forms, etc.\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
"    ]\n",
"\n",
"# Step 3: Call OpenAI\n",
"\n",
"response = openai.chat.completions.create(\n",
"    model=\"llama3.2\",\n",
"    messages=messages\n",
")\n",
"\n",
"# Step 4: print the result\n",
"summary = response.choices[0].message.content\n",
"display(Markdown(summary))"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clean outputs of all cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend: \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
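The notebook above builds its API calls from a two-item messages list (a system prompt plus a user prompt) via `messages_for`. A minimal, standalone sketch of that structure is below; the `build_messages` helper name is illustrative, not part of the notebook's code, but the `role`/`content` dictionary shape is exactly what the OpenAI-style chat completions API expects.

```python
def build_messages(system_prompt: str, user_prompt: str) -> list:
    """Build the two-message list the chat completions API expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages(
    "You are an assistant that summarizes websites.",
    "Please summarize this page.",
)
print([m["role"] for m in msgs])  # ['system', 'user']
```

Because this structure is shared by OpenAI, Ollama's OpenAI-compatible endpoint, and many other providers, the same helper works unchanged whether you point the client at `api.openai.com` or `http://localhost:11434/v1`.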


@@ -0,0 +1,112 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "3ba06289-d17a-4ccd-85f5-2b79956d4e59",
"metadata": {},
"outputs": [],
"source": [
"!pip install selenium"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eabbbc62-1de1-4883-9b3e-9c90145ea6c5",
"metadata": {},
"outputs": [],
"source": [
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.chrome.service import Service\n",
"from bs4 import BeautifulSoup\n",
"import time\n",
"import os \n",
"\n",
"class Website:\n",
" def __init__(self, url, driver_path=None, wait_time=3):\n",
" self.url = url\n",
" self.wait_time = wait_time\n",
"\n",
" # Headless Chrome settings\n",
" options = Options()\n",
" # options.add_argument(\"--headless\") \n",
" # Headless mode runs the browser in the background (invisible).\n",
" # However, some websites (like openai.com) block headless browsers.\n",
" # So if this line is active, the page may not load correctly and you may not get the full content.\n",
" options.add_argument(\"--disable-gpu\")\n",
" options.add_argument(\"--no-sandbox\")\n",
" options.add_argument(\"--window-size=1920x1080\")\n",
"\n",
" # Driver path\n",
" if driver_path:\n",
" service = Service(executable_path=driver_path)\n",
" else:\n",
" service = Service() \n",
"\n",
" # Start browser\n",
" driver = webdriver.Chrome(service=service, options=options)\n",
" driver.get(url)\n",
"\n",
"        # Wait for the page to load\n",
" time.sleep(self.wait_time)\n",
"\n",
" # Take page source\n",
" html = driver.page_source\n",
" driver.quit()\n",
"\n",
"        # Parse with BeautifulSoup\n",
" soup = BeautifulSoup(html, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
"\n",
" # Clean irrelevant tags\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
"\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "852c52e2-bd4d-4bb9-94ef-e498c33f1a89",
"metadata": {},
"outputs": [],
"source": [
"site = Website(\"https://openai.com\", driver_path=\"/Users/gizemmervedemir/Downloads/chromedriver-mac-arm64/chromedriver\")\n",
"print(\"Title:\", site.title)\n",
"print(\"\\nFirst 500 characters:\\n\", site.text[:500])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7620c685-c35c-4d6b-aaf1-a3da98f19ca7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
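Each of the scraping notebooks above ends with the same cleanup step: strip `script`/`style` elements and pull out the visible text. The sketch below shows that step using only the standard library's `html.parser`, so it runs without BeautifulSoup or a browser; it is a simplification for illustration, not a replacement for the notebooks' BeautifulSoup code (which also drops `img`/`input` and handles malformed HTML more gracefully).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(page: str) -> str:
    parser = TextExtractor()
    parser.feed(page)
    return "\n".join(parser.parts)

sample = ("<html><head><style>p{}</style></head>"
          "<body><p>Hello</p><script>x=1</script><p>World</p></body></html>")
print(extract_text(sample))  # prints "Hello" then "World"
```

The same idea underlies `soup.body.get_text(separator="\n", strip=True)` after `decompose()` in the notebooks: remove non-content elements first, then join the remaining text nodes with newlines.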


@@ -0,0 +1,271 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the Troubleshooting notebook in this folder for full instructions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clean outputs of all cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend: \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acbb92b2-b625-4a37-b03a-09dc8f06b222",
"metadata": {},
"outputs": [],
"source": [
"!pip install selenium"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6448a12-6aa1-4dd1-aaf1-c8a3a3c3ecb0",
"metadata": {},
"outputs": [],
"source": [
"!pip install webdriver-manager"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"# Import necessary modules\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.chrome.service import Service\n",
"from webdriver_manager.chrome import ChromeDriverManager\n",
"from bs4 import BeautifulSoup\n",
"import time\n",
"\n",
"class ScrapeWebsite:\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given URL using Selenium + BeautifulSoup\n",
" Supports JavaScript-heavy and normal websites uniformly.\n",
" \"\"\"\n",
" self.url = url\n",
"\n",
" # Configure headless Chrome\n",
" options = Options()\n",
" options.add_argument('--headless')\n",
" options.add_argument('--no-sandbox')\n",
" options.add_argument('--disable-dev-shm-usage')\n",
"\n",
" # Use webdriver-manager to manage ChromeDriver\n",
" service = Service(ChromeDriverManager().install())\n",
"\n",
" # Initialize the Chrome WebDriver with the service and options\n",
" driver = webdriver.Chrome(service=service, options=options)\n",
"\n",
" # Start Selenium WebDriver\n",
" driver.get(url)\n",
"\n",
" # Wait for JS to load (adjust as needed)\n",
" time.sleep(3)\n",
"\n",
" # Fetch the page source after JS execution\n",
" page_source = driver.page_source\n",
" driver.quit()\n",
"\n",
" # Parse the HTML content with BeautifulSoup\n",
" soup = BeautifulSoup(page_source, 'html.parser')\n",
"\n",
" # Extract title\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
"\n",
" # Remove unnecessary elements\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
"\n",
" # Extract the main text\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f576f485-60c0-4539-bfb3-79d821ebefa4",
"metadata": {},
"outputs": [],
"source": [
"def summarize_js_website(url):\n",
" website = ScrapeWebsite(url)\n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00ac3659-e4f0-4b64-8041-ba35bfa2c4c9",
"metadata": {},
"outputs": [],
"source": [
"summary = summarize_js_website(\"https://dheerajmaddi.netlify.app/\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d526136e-9960-4f09-aad0-32f8c11de0ac",
"metadata": {},
"outputs": [],
"source": [
"display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bcf1fd75-9964-4223-bcda-f2794bc9f7af",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,751 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# YOUR FIRST LAB\n",
"### Please read this section. This is valuable to get you prepared, even if it's a long read -- it's important stuff.\n",
"\n",
"## Your first Frontier LLM Project\n",
"\n",
"Let's build a useful LLM solution - in a matter of minutes.\n",
"\n",
"By the end of this course, you will have built an autonomous Agentic AI solution with 7 agents that collaborate to solve a business problem. All in good time! We will start with something smaller...\n",
"\n",
"Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
"\n",
"Before starting, you should have completed the setup for [PC](../SETUP-PC.md) or [Mac](../SETUP-mac.md) and you hopefully launched this jupyter lab from within the project root directory, with your environment activated.\n",
"\n",
"## If you're new to Jupyter Lab\n",
"\n",
"Welcome to the wonderful world of Data Science experimentation! Once you've used Jupyter Lab, you'll wonder how you ever lived without it. Simply click in each \"cell\" with code in it, such as the cell immediately below this text, and hit Shift+Return to execute that cell. As you wish, you can add a cell with the + button in the toolbar, and print values of variables, or try out variations. \n",
"\n",
"I've written a notebook called [Guide to Jupyter](Guide%20to%20Jupyter.ipynb) to help you get more familiar with Jupyter Labs, including adding Markdown comments, using `!` to run shell commands, and `tqdm` to show progress.\n",
"\n",
"## If you're new to the Command Line\n",
"\n",
"Please see these excellent guides: [Command line on PC](https://chatgpt.com/share/67b0acea-ba38-8012-9c34-7a2541052665) and [Command line on Mac](https://chatgpt.com/canvas/shared/67b0b10c93a081918210723867525d2b). \n",
"\n",
"## If you'd prefer to work in IDEs\n",
"\n",
"If you're more comfortable in IDEs like VSCode or Pycharm, they both work great with these lab notebooks too. \n",
"If you'd prefer to work in VSCode, [here](https://chatgpt.com/share/676f2e19-c228-8012-9911-6ca42f8ed766) are instructions from an AI friend on how to configure it for the course.\n",
"\n",
"## If you'd like to brush up your Python\n",
"\n",
"I've added a notebook called [Intermediate Python](Intermediate%20Python.ipynb) to get you up to speed. But you should give it a miss if you already have a good idea what this code does: \n",
"`yield from {book.get(\"author\") for book in books if book.get(\"author\")}`\n",
"\n",
"## I am here to help\n",
"\n",
"If you have any problems at all, please do reach out. \n",
"I'm available through the platform, or at ed@edwarddonner.com, or at https://www.linkedin.com/in/eddonner/ if you'd like to connect (and I love connecting!) \n",
"And this is new to me, but I'm also trying out X/Twitter at [@edwarddonner](https://x.com/edwarddonner) - if you're on X, please show me how it's done 😂 \n",
"\n",
"## More troubleshooting\n",
"\n",
"Please see the [troubleshooting](troubleshooting.ipynb) notebook in this folder to diagnose and fix common problems. At the very end of it is a diagnostics script with some useful debug info.\n",
"\n",
"## If this is old hat!\n",
"\n",
"If you're already comfortable with today's material, please hang in there; you can move swiftly through the first few labs - we will get much more in depth as the weeks progress.\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Please read - important note</h2>\n",
" <span style=\"color:#900;\">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations. If you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">Treat these labs as a resource</h2>\n",
" <span style=\"color:#f71;\">I push updates to the code regularly. When people ask questions or have problems, I incorporate it in the code, adding more examples or improved commentary. As a result, you'll notice that the code below isn't identical to the videos. Everything from the videos is here; but in addition, I've added more steps and better explanations, and occasionally added new models like DeepSeek. Consider this like an interactive book that accompanies the lectures.\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business value of these exercises</h2>\n",
" <span style=\"color:#181;\">A final thought. While I've designed these notebooks to be educational, I've also tried to make them enjoyable. We'll do fun things like have LLMs tell jokes and argue with each other. But fundamentally, my goal is to teach skills you can apply in business. I'll explain business implications as we go, and it's worth keeping this in mind: as you build experience with models and techniques, think of ways you could put this into action at work today. Please do contact me if you'd like to discuss more or if you have ideas to bounce off me.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
"metadata": {},
"source": [
"# Connecting to OpenAI (or Ollama)\n",
"\n",
"The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI. \n",
"\n",
"If you'd like to use free Ollama instead, please see the README section \"Free Alternative to Paid APIs\", and if you're not sure how to do this, there's a full solution in the solutions folder (day1_with_ollama.ipynb).\n",
"\n",
"## Troubleshooting if you have problems:\n",
"\n",
"Head over to the [troubleshooting](troubleshooting.ipynb) notebook in this folder for step by step code to identify the root cause and fix it!\n",
"\n",
"If you make a change, try restarting the \"Kernel\" (the python process sitting behind this notebook) by Kernel menu >> Restart Kernel and Clear Outputs of All Cells. Then try this notebook again, starting at the top.\n",
"\n",
"Or, contact me! Message me or email ed@edwarddonner.com and we will get this to work.\n",
"\n",
"Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start with sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the Troubleshooting notebook in this folder for full instructions"
]
},
{
"cell_type": "markdown",
"id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
"metadata": {},
"source": [
"# Let's make a quick call to a Frontier model to get started, as a preview!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"\n",
"message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=[{\"role\":\"user\", \"content\":message}])\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## OK onwards with our first project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e793b2-6775-426a-a139-4848291d0463",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"source": [
"# Let's try one out. Change the website and add print statements to follow along.\n",
"\n",
"ed = Website(\"https://edwarddonner.com\")\n",
"print(ed.title)\n",
"print(ed.text)"
]
},
{
"cell_type": "markdown",
"id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
"metadata": {},
"source": [
"## Types of prompts\n",
"\n",
"You may know this already - but if not, you will get very familiar with it!\n",
"\n",
"Models like GPT-4o have been trained to receive instructions in a particular way.\n",
"\n",
"They expect to receive:\n",
"\n",
"**A system prompt** that tells them what task they are performing and what tone they should use\n",
"\n",
"**A user prompt** -- the conversation starter that they should reply to"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
"    user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26448ec4-5c00-4204-baec-7df91d11ff2e",
"metadata": {},
"outputs": [],
"source": [
"print(user_prompt_for(ed))"
]
},
{
"cell_type": "markdown",
"id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
"metadata": {},
"source": [
"## Messages\n",
"\n",
"The API from OpenAI expects to receive messages in a particular structure.\n",
"Many of the other APIs share this structure:\n",
"\n",
"```\n",
"[\n",
" {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
" {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
"]\n",
"```\n",
"\n",
"To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a snarky assistant\"},\n",
" {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21ed95c5-7001-47de-a36d-1d6673b403ce",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with system and user messages:\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## And now let's build useful messages for GPT-4o-mini, using a function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
"metadata": {},
"outputs": [],
"source": [
"# Try this out, and then try for a few more websites\n",
"\n",
"messages_for(ed)"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Time to bring it together - the API for OpenAI is very simple!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "markdown",
"id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
"metadata": {},
"source": [
"# Let's try more websites\n",
"\n",
"Note that this will only work on websites that can be scraped using this simplistic approach.\n",
"\n",
"Websites that are rendered with JavaScript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
"\n",
"Also, websites protected with CloudFront (and similar) may return 403 errors - many thanks to Andy J for pointing this out.\n",
"\n",
"But many websites will work just fine!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://cnn.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "markdown",
"id": "f84c01ba",
"metadata": {},
"source": [
"# Install Selenium using Conda\n",
"\n",
"## First we need to install selenium package using conda"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14d1ca84",
"metadata": {},
"outputs": [],
"source": [
"%conda install -c conda-forge selenium -y"
]
},
{
"cell_type": "markdown",
"id": "a5f35b45",
"metadata": {},
"source": [
"## Change the website class to use selenium"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed2ebef8",
"metadata": {},
"outputs": [],
"source": [
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.service import Service\n",
"from selenium.webdriver.common.by import By\n",
"from selenium.webdriver.chrome.options import Options\n",
"from bs4 import BeautifulSoup\n",
"\n",
"class Website:\n",
" def __init__(self, url):\n",
" \"\"\"\n",
"        Create this Website object from the given URL using Selenium and BeautifulSoup.\n",
" \"\"\"\n",
" self.url = url\n",
"\n",
" # Set up Selenium WebDriver with headless Chrome\n",
" chrome_options = Options()\n",
" chrome_options.add_argument(\"--no-sandbox\")\n",
" chrome_options.add_argument(\"--disable-dev-shm-usage\")\n",
" chrome_options.add_argument(\"--disable-blink-features=AutomationControlled\") # Prevent detection\n",
" chrome_options.add_argument(\"--disable-infobars\") # Disable \"Chrome is being controlled\" infobar\n",
" \n",
"        # Optionally set a custom user-agent string from a real browser:\n",
"        # chrome_options.add_argument(\"user-agent=YOUR_CUSTOM_USER_AGENT\")\n",
"\n",
" service = Service() # Use default ChromeDriver path\n",
" driver = webdriver.Chrome(service=service, options=chrome_options)\n",
"\n",
" try:\n",
" # Fetch the webpage\n",
" driver.get(url)\n",
"\n",
" # Get the page source\n",
" page_source = driver.page_source\n",
"\n",
" # Parse the page source with BeautifulSoup\n",
" soup = BeautifulSoup(page_source, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n",
" finally:\n",
" # Close the WebDriver\n",
" driver.quit()"
]
},
{
"cell_type": "markdown",
"id": "66eae3bd",
"metadata": {},
"source": [
"## Now let's try again"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f9ef6a1",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "markdown",
"id": "c951be1a-7f1b-448f-af1f-845978e47e2c",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business applications</h2>\n",
" <span style=\"color:#181;\">In this exercise, you experienced calling the Cloud API of a Frontier Model (a leading model at the frontier of AI) for the first time. We will be using APIs like OpenAI at many stages in the course, in addition to building our own LLMs.\n",
"\n",
"More specifically, we've applied this to Summarization - a classic Gen AI use case to make a summary. This can be applied to any business vertical - summarizing the news, summarizing financial performance, summarizing a resume in a cover letter - the applications are limitless. Consider how you could apply Summarization in your business, and try prototyping a solution.</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Before you continue - now try yourself</h2>\n",
" <span style=\"color:#900;\">Use the cell below to make your own simple commercial example. Stick with the summarization use case for now. Here's an idea: write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "0ab3a6bb",
"metadata": {},
"source": [
"# CV improver for a job\n",
"\n",
"We are going to use AI to help us improve our LinkedIn profile for a given LinkedIn job URL.\n",
"\n",
"It will take our profile URL and a job URL, and output several recommendations on how to modify our profile to better match the job's requirements."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62d46f7e",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt comparing a candidate's profile with a job posting:\n",
"\n",
"def user_prompt_for_job(candidate_profile_url, job_url):\n",
"    candidate_profile = Website(candidate_profile_url)\n",
"    user_prompt = f\"You are looking at a candidate profile titled {candidate_profile.title}\"\n",
"    user_prompt += \"\\nThe contents of this candidate profile are as follows;\\n\"\n",
"    user_prompt += candidate_profile.text\n",
"\n",
"    job = Website(job_url)\n",
"    user_prompt += f\"\\nThis candidate wants to apply to the following job: {job.title}\\n\"\n",
"    user_prompt += \"\\nThe details of the job are as follows; \\\n",
"    please provide the candidate at least 5 skills or areas of improvement to add \\\n",
"    to their LinkedIn profile.\\n\\n\"\n",
"    user_prompt += job.text\n",
"    return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are a recruiter specialised in HR and talent acquisition. \\\n",
"    You'll be analysing the LinkedIn profile of a candidate and a published job, \\\n",
"    and will give the candidate recommendations on how to modify their profile \\\n",
"    to better match the job. Respond in markdown.\"\n",
"\n",
"user_prompt = user_prompt_for_job(\n",
" candidate_profile_url=\"https://www.linkedin.com/in/eddonner/\", \n",
" job_url=\"https://www.linkedin.com/jobs/view/4130488506\")\n",
"\n",
"print(user_prompt)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7535220",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
"]\n",
"\n",
"# Step 3: Call OpenAI\n",
"response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages\n",
")\n",
"\n",
"response = response.choices[0].message.content\n",
"\n",
"# Step 4: print the result\n",
"\n",
"display(Markdown(response))"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses JavaScript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clear Outputs of All Cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend: \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,241 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "1b809d22-d170-4db3-a298-1740ce06b534",
"metadata": {},
"outputs": [],
"source": [
"# Udemy Course >> LLM Engineering: Master AI and LLMs\n",
"# Student: Jay\n",
"# Date: Apr 20, 2025\n",
"# Homework: Day 1 - Summarize website using local Llama\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "01e91579-7e32-4c4d-9cc9-c06d13c16209",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"\n",
"OLLAMA_API = \"http://localhost:11434/api/chat\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8d780fba-868c-4216-88f5-1e3ca5ad43ed",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "839b645f-90ee-434d-b0bd-1cb4e574a8de",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ef2453e8-3eca-4f6d-8ccf-9e5274b589a7",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
"    user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6ec397d5-e9b0-411d-8bdb-66605273cb11",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a snarky assistant\"},\n",
" {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "76aed9eb-a085-4687-859d-817c771156fa",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "26de4682-cf4f-4b7e-8cb2-049f7f46b758",
"metadata": {},
"outputs": [],
"source": [
"def summarize(url):\n",
" website = Website(url)\n",
" ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n",
"\n",
" response = ollama_via_openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=messages_for(website) \n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "16b2532a-d44c-4903-83ec-0b828a2d1b92",
"metadata": {},
"outputs": [],
"source": [
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "86af4905-5d5c-47c9-b9b2-27257452ff94",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Anthropic Website Summary**\n",
"=====================================\n",
"\n",
"### Mission and Values\n",
"\n",
"Anthropic's mission is to build AI that serves humanity's long-term well-being. They focus on designing powerful technologies with human benefit at their foundation, aiming to demonstrate responsible AI development in practice.\n",
"\n",
"### Notable Releases\n",
"\n",
"#### 2025\n",
"\n",
"* **Claude 3.7 Sonnet**: Anthropic's most intelligent AI model, now available.\n",
"* Recent news articles:\n",
"\t+ \"Tracing the thoughts of a large language model: Interpretability\"\n",
"\t+ \"Anthropic Economic Index: Societal Impacts\"\n",
"\n",
"### Products and Solutions\n",
"\n",
"* **Claude**: A suite of AI tools for building applications and custom experiences with human benefit in mind.\n",
"* **Claude Overview**, **API Platform**, and various other products, including:\n",
"\t+ **Claude 3.5 Haiku**\n",
"\t+ **Claude 3 Opus**\n",
"\n",
"### Research and Commitments\n",
"\n",
"* The Anthropic Academy: A learning platform for developers to build AI solutions with Claude.\n",
"* Responsible scaling policy and alignment science initiatives.\n",
"\n",
"### News Section (Selection)\n",
"\n",
"Anthropic's recent news articles:\n",
"* \"Claude extended thinking\"\n",
"* \"Alignment faking in large language models\"\n",
"\n",
"### Company Information\n",
"\n",
"For more information on Anthropic, including company, careers, and help resources, follow the provided links."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5151062-614e-44ff-b341-d3f64e28aa93",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
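
The `Website` class used throughout these notebooks assumes the fetched page has both a `<title>` and a `<body>`; a page missing either will raise an `AttributeError`. Below is a slightly more defensive sketch of the same parsing step (the function names `parse_page` and `fetch_page` are mine, not from the notebooks):

```python
import requests
from bs4 import BeautifulSoup


def parse_page(html):
    """Return (title, text) for an HTML document, tolerating a missing <title> or <body>."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title and soup.title.string else "No title found"
    body = soup.body
    if body is None:
        return title, ""
    # Same cleanup as the notebooks: drop elements that add noise to the summary
    for irrelevant in body(["script", "style", "img", "input"]):
        irrelevant.decompose()
    return title, body.get_text(separator="\n", strip=True)


def fetch_page(url):
    """Fetch a URL and parse it; raise for HTTP errors instead of summarizing error pages."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return parse_page(response.content)
```

Splitting parsing from fetching also makes the logic testable without network access.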

View File

@@ -0,0 +1,248 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# Welcome to your first assignment!\n",
"\n",
"Instructions are below. Please give this a try, and look in the solutions folder if you get stuck (or feel free to ask me!)"
]
},
{
"cell_type": "markdown",
"id": "ada885d9-4d42-4d9b-97f0-74fbbbfe93a9",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">Just before we get to the assignment --</h2>\n",
" <span style=\"color:#f71;\">I thought I'd take a second to point you at this page of useful resources for the course. This includes links to all the slides.<br/>\n",
" <a href=\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>\n",
" Please keep this bookmarked, and I'll continue to add more useful links there over time.\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23057e00-b6fc-4678-93a9-6b31cb704bff",
"metadata": {},
"outputs": [],
"source": [
"# There's actually an alternative approach that some people might prefer\n",
"# You can use the OpenAI client python library to call Ollama:\n",
"\n",
"from openai import OpenAI\n",
"ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n",
"\n",
"response = ollama_via_openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=messages\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "1622d9bb-5c68-4d4e-9ca4-b492c751f898",
"metadata": {},
"source": [
"# NOW the exercise for you\n",
"\n",
"Take the code from day1 and incorporate it here, to build a website summarizer that uses Llama 3.2 running locally instead of OpenAI; use either of the above approaches."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "37e35a64-7c2a-453d-96fa-9c8119c6618d",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "fc410fe7-7abe-48ab-9206-ec6412278ac5",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"\n",
"OLLAMA_API = \"http://localhost:11434/api/chat\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "654af616-1ad4-4d28-be41-f3c99b6e8f42",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "f665c051-95a2-4102-8e26-1974bd5c7d3a",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "258cf0af-650f-4225-b1c1-8f29e209ebfd",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "fe5291b0-a2bb-4b60-af77-d33517a7005b",
"metadata": {},
"outputs": [],
"source": [
"def summarize(url):\n",
" website = Website(url)\n",
" client = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")\n",
" response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "b53f34cd-f8ce-4656-a46a-33e966156e2e",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "5b28ccfa-eb27-4154-aeb6-aff439c8a723",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Website Summary**\n",
"=====================\n",
"\n",
"### About the Website\n",
"\n",
"This website is owned by Edward Donner, a co-founder and CTO of Nebula.io, an AI company that applies AI to help people discover their potential. The website provides information about his background, experience, and work with LLMs (Large Language Models).\n",
"\n",
"### News and Announcements\n",
"\n",
"* **Upcoming Events:**\n",
" + January 23, 2025: LLM Workshop - Hands-on with Agents - resources\n",
" + December 21, 2024: Welcome to the SuperDataScientists community!\n",
" + November 13, 2024: Mastering AI and LLM Engineering - Resources\n",
" + October 16, 2024: From Software Engineer to AI Data Scientist - resources\n",
"* **Acquisition:**\n",
" + In 2021, Edward's previous startup untapt was acquired.\n",
"\n",
"### Links\n",
"\n",
"The website also provides links to Edward Donner's social media profiles (LinkedIn, Twitter, Facebook), as well as a newsletter signup form."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_summary(\"https://edwarddonner.com\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
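
Both this notebook and the previous one define `OLLAMA_API` and `HEADERS` but then call Ollama through the OpenAI-compatible client instead. For reference, here is a sketch of the direct HTTP route to Ollama's native `/api/chat` endpoint (assumes an Ollama server on localhost:11434; `build_chat_payload` and `chat_via_ollama` are names I've introduced):

```python
import requests

OLLAMA_API = "http://localhost:11434/api/chat"
HEADERS = {"Content-Type": "application/json"}
MODEL = "llama3.2"


def build_chat_payload(model, messages):
    """Request body for Ollama's /api/chat; stream=False returns a single JSON object."""
    return {"model": model, "messages": messages, "stream": False}


def chat_via_ollama(messages, model=MODEL):
    """POST a chat request to the local Ollama server and return the reply text."""
    response = requests.post(OLLAMA_API, json=build_chat_payload(model, messages), headers=HEADERS)
    response.raise_for_status()
    # Non-streaming responses carry the reply under the "message" key
    return response.json()["message"]["content"]
```

Either route reaches the same server; the OpenAI-client approach just spares you the request/response plumbing.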

View File

@@ -0,0 +1,207 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "2b00a7de-c563-4d41-b8ab-84128f0f3069",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"\n",
"ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "daa9de5c-6241-46aa-a51d-98bc154ee6e7",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"\n",
"OLLAMA_API = \"http://localhost:11434/api/chat\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f3bf8e10-5770-4081-b099-cf83e41126b8",
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e6a5d9d5-a617-4ea4-9b03-3eae2dd4520d",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c01a0e24-ccf7-4359-a731-dcda6bfc5023",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "43f5df54-a34b-42cd-a6b2-e28996a84ff7",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b79e63fb-b741-4f4a-8bc4-66a60feef2cd",
"metadata": {},
"outputs": [],
"source": [
"def summarize(url):\n",
" website = Website(url)\n",
" # response = openai.chat.completions.create(\n",
" # model = \"gpt-4o-mini\",\n",
" # messages = messages_for(website)\n",
" # )\n",
" response = ollama_via_openai.chat.completions.create(\n",
" model=\"deepseek-r1:1.5b\",\n",
" messages=messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3298a858-e5de-4804-b188-06c0ce6471b0",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "40dcb721-f807-47bf-9d18-f6a649c371e0",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"<think>\n",
"Alright, so I'm trying to figure out why CNN's \"World Today\" section is showing the word \"Gulf.\" I remember that \"Gulf\" was a significant event, but not sure if there have been any other notable events in the Gulf region around recent times. Is it about oil production or something else? Maybe natural gas?\n",
"Wait, I've heard of an oil-producing company called BP, which is associated with the Gulf. They're big on Shell and Opec members like Russia. But does that make \"Gulf\" part of their content? Or is it more about how the Gulf looks in visual terms?\n",
"\n",
"I'm also thinking about news categories—global events, geopolitical stuff, tech, culture, etc.—and maybe a recent oil production related report. Maybe they show real-time data about how much oil has been produced or the availability in some country nearby.\n",
"\n",
"Hmm, I'm not sure if \"Gulf\" refers to Earth's location or just part of the Gulf region because they have two Gulfagoras islands as landmarks that sound similar to \"Gulf.\" Could it be a typo where the team named these after BP? But then if it were Gulfagoras, wouldn't they be more about geography than the actual oil aspect? Or maybe they were the names when BP was discovered?\n",
"\n",
"I think CNN is aiming for current news, so they probably show how something or another company has done in the Gulf. Since BP is big there, with data like gas production and costs, that might fit under geopolitical or energy news. But why specifically \"Gulf\"? Maybe to align with how some other teams handle Earth-related news.\n",
"\n",
"Overall, I'm leaning towards it being about BP's recent activities due to their oil involvement in the Gulf region. They typically cover geological products, energy, and production reports, so \"Gulf\" probably refers to those specific topics or regions.\n",
"</think>\n",
"\n",
" CNN's \"World Today\" section shows \"Gulf,\" likely referring to BP's exploration of Earth resources, particularly in relation to their oil production in the Gulf region. BP is well-known for being part of the Gulf region and associated with major companies like Shell, Opec members such as Russia, and significant geological features like the Gulfagoras islands, which might be named after them due to BP's location or discovery nearby. Therefore, this title reflects their current geopolitical news focusing on energy-related activities in the Gulf.\n",
"\n",
"**Answer:** The \"Gulf\" section likely refers to the oil production activities of BP, linking them to the Gulf region and geologically significant features in Earth terms."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_summary(\"https://cnn.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f41a586e-fe1f-4040-8ebb-31887981907f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
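
The deepseek-r1 output above wraps the model's chain of thought in `<think>…</think>` tags ahead of the actual summary. If you only want to display the final answer, a small post-processing step can strip that block first (a sketch; the function name is my own):

```python
import re


def strip_think(text):
    """Remove <think>...</think> reasoning blocks emitted by models like deepseek-r1."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

For example, `display(Markdown(strip_think(summary)))` inside `display_summary` would render only the summary itself.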

View File

@@ -0,0 +1,266 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# Welcome to your first assignment!\n",
"\n",
"Instructions are below. Please give this a try, and look in the solutions folder if you get stuck (or feel free to ask me!)"
]
},
{
"cell_type": "markdown",
"id": "ada885d9-4d42-4d9b-97f0-74fbbbfe93a9",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">Just before we get to the assignment --</h2>\n",
" <span style=\"color:#f71;\">I thought I'd take a second to point you at this page of useful resources for the course. This includes links to all the slides.<br/>\n",
" <a href=\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>\n",
" Please keep this bookmarked, and I'll continue to add more useful links there over time.\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cc85216-f6e4-436e-b6c1-976c8f2d1152",
"metadata": {},
"outputs": [],
"source": [
"!pip install webdriver-manager\n",
"!pip install selenium"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"import ollama\n",
"from openai import OpenAI\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.chrome.service import Service\n",
"from webdriver_manager.chrome import ChromeDriverManager\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29ddd15d-a3c5-4f4e-a678-873f56162724",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "479ff514-e8bd-4985-a572-2ea28bb4fa40",
"metadata": {},
"outputs": [],
"source": [
"# Let's just make sure the model is loaded\n",
"\n",
"!ollama pull llama3.2"
]
},
{
"cell_type": "markdown",
"id": "6a021f13-d6a1-4b96-8e18-4eae49d876fe",
"metadata": {},
"source": [
"# Introducing the ollama package\n",
"\n",
"And now we'll do the same thing, but using the elegant ollama python package instead of a direct HTTP call.\n",
"\n",
"Under the hood, it's making the same call as above to the ollama server running at localhost:11434"
]
},
{
"cell_type": "markdown",
"id": "a4704e10-f5fb-4c15-a935-f046c06fb13d",
"metadata": {},
"source": [
"## Alternative approach - using OpenAI python library to connect to Ollama"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23057e00-b6fc-4678-93a9-6b31cb704bff",
"metadata": {},
"outputs": [],
"source": [
"# There's actually an alternative approach that some people might prefer\n",
"# You can use the OpenAI client python library to call Ollama:\n",
"\n",
"\n",
"ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')"
]
},
{
"cell_type": "markdown",
"id": "1622d9bb-5c68-4d4e-9ca4-b492c751f898",
"metadata": {},
"source": [
"# NOW the exercise for you\n",
"\n",
"Take the code from day1 and incorporate it here, to build a website summarizer that uses Llama 3.2 running locally instead of OpenAI; use either of the above approaches."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8251b6a5-7b43-42b9-84a9-4a94b6bdb933",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"class ScrapeWebsite:\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given URL using Selenium + BeautifulSoup\n",
" Supports JavaScript-heavy and normal websites uniformly.\n",
" \"\"\"\n",
" self.url = url\n",
"\n",
" # Configure headless Chrome\n",
" options = Options()\n",
" options.add_argument('--headless')\n",
" options.add_argument('--no-sandbox')\n",
" options.add_argument('--disable-dev-shm-usage')\n",
"\n",
" # Use webdriver-manager to manage ChromeDriver\n",
" service = Service(ChromeDriverManager().install())\n",
"\n",
" # Initialize the Chrome WebDriver with the service and options\n",
" driver = webdriver.Chrome(service=service, options=options)\n",
"\n",
" # Start Selenium WebDriver\n",
" driver.get(url)\n",
"\n",
" # Wait for JS to load (adjust as needed)\n",
" time.sleep(3)\n",
"\n",
" # Fetch the page source after JS execution\n",
" page_source = driver.page_source\n",
" driver.quit()\n",
"\n",
" # Parse the HTML content with BeautifulSoup\n",
" soup = BeautifulSoup(page_source, 'html.parser')\n",
"\n",
" # Extract title\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
"\n",
" # Remove unnecessary elements\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
"\n",
" # Extract the main text\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6de38216-6d1c-48c4-877b-86d403f4e0f8",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\"\n",
"\n",
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]\n",
"\n",
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"\n",
"def summarize(url):\n",
" website = ScrapeWebsite(url)\n",
" response = ollama_via_openai.chat.completions.create(\n",
" model = MODEL,\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dbf8d5c-a42a-4a72-b3a4-c75865b841bb",
"metadata": {},
"outputs": [],
"source": [
"summary = summarize(\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\")\n",
"display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ddfacdc-b16a-4999-9ff2-93ed19600d24",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,993 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "a07e7793-b8f5-44f4-aded-5562f633271a",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n",
"import base64\n",
"from io import BytesIO\n",
"import tempfile\n",
"import subprocess\n",
"from pydub import AudioSegment\n",
"import time\n",
"import anthropic\n",
"from datetime import datetime, time\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "717ea9d4-1e72-4035-b7c5-5d61da5b8ea3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key exists and begins sk-proj-\n"
]
}
],
"source": [
"# Initialization\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"open_weather_api_key=os.getenv('open_weather')\n",
"amadeus_api_key=os.getenv('amadeus_key')\n",
"amadeus_secret=os.getenv('amadeus_secret')\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"gpt_model = \"gpt-4o-mini\"\n",
"\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "cc78f4fd-9920-4872-9117-90cd2aeb2a06",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"\"\"You are a helpful assistant. You plan vacations for users in the following chronological manner - \n",
"1) you ask the user (if they have not already confirmed) which destination they want to travel to - beaches or mountains or any other destination \n",
" the user prefers \n",
"2) ask the current location of the user (if they have not already shared), please make sure the user shares the name of an exact city or a nearby city \n",
" with an airport\n",
"3) you list the best nearby vacation destinations (each destination should have an airport or provide the nearest airport option) \n",
"4) you ask them their travel start date for the present year only (if they have not already shared the information) that is which date in 2025 should they start\n",
"5) you make multiple tool calls (use tool check_weather) for finding weather data for each location from the list (from step 3 above) \n",
" by using the location, latitude, longitude, and date_year (as start_date + 1 from the previous year); \n",
" Example, if the start date is June 3rd, 2025, your date_year will become June 4th, 2024. You mandatorily have to share the date with the tool call function.\n",
"6) you shortlist top two destinations with better weather conditions and ask the user to share his final selection (vacation destination) and the number of people \n",
" who will be travelling and also share a detailed itinerary for both options for the user to select. Make sure the start and end destinations remain the same. \n",
" Example, if your onward journey is from bangalore to bali, the trip should end in bali, so that the user can avail the return flight from bali to bangalore.\n",
"7) after the user confirms the final selection and the number of heads (number of people availing the vacation) denoted by \"no\", you confirm \n",
" with the user, before proceeding to call the check_flights tool call to get the exact flight expenses for a return trip. Share the following data \n",
" along with the tool call - origin, destination, departure_date, return_date, number of people, currency (in which the user will be checking the costs). \n",
" Make sure to pass the IATA codes for the origin and destination locations to the check_flights_function tool, as an example, if the user is travelling from \n",
" bangalore to goa, pass the origin and destinations as BLR (not bangalore) and GOI (not GOA). \n",
"8) post obtaining the tool result, analyze, share the flight details and expenses with the user. But don't go ahead and book tickets.\n",
"9) Confirm with the user, the stay options that he/she has in mind - \n",
" a) How many rooms will be needed, then, post confirmation from the user, make the tool call \"check_stays_function\" to check the stay costs by\n",
" supplying the following parameters to the function (You should be having all above parameters by now, in case, anything is missing, ask the user \n",
" before proceeding)-\n",
" i) IATA code of the destination city, example if the user is travelling to goa, it will be GOA\n",
" ii) The check-in date, keep this parameter the same as the start_date for the user\n",
" iii) number of people who will be travelling\n",
" iv) Number of rooms needed\n",
" v) The check-out date, keep this parameter the same as the return date for the user\n",
"11) As the final step, show a detailed summary to the user showing the suggested flight with expenses and the suggested stay with expenses for the travel duration,\n",
" and the user's total expenses for the trip. Example, if the user is travelling from bangalore to goa, and you have already checked the flight costs from step 8\n",
" above as INR 2000, and from step 9 you confirm the stay costs as 1000 per day for a 5-day trip, (this you know from step 6 while making the itinerary), the total\n",
" expenses will be 2000 + (1000x5) = 7000 INR. Display the details to the user. \n",
"\n",
" IMPORTANT NOTE - \n",
" i) You will not proceed with booking any stay or flight tickets after step 11, so never say - \"This is your detailed itinerary, may I proceed to \n",
" book?\", rather say \"Hope you like the itinerary, in case of any changes, let me know and I will help revise the plan.\"\n",
" \n",
"Example usage - \n",
"\n",
"user - plan me a vacation\n",
"assistant - where to? beach or mountain?\n",
"user - beach\n",
"assistant - what is your location?\n",
"user - India\n",
"assistant - At what time of the year do you wish to travel? And which city in India with an airport you will be travelling from?\n",
"user - June 1st, 2025, Bangalore, India\n",
"assistant - top tourist destinations are - goa, gokarna, andaman and nicobar islands, varkala. Do you want me to proceed?\n",
"or do you want to add some suggestions?\n",
"user - please proceed\n",
"assistant - [makes tool calls for each location - goa, gokarna, andaman and nicobar islands, and varkala\n",
"for 2nd June 2024 (previous year), supplying the latitude and longitude data for each, along with the date as start_date+1 and shortlist the \n",
"top two places with the best weather conditions and share \n",
"the details with the user] here you go, based on what you asked for, goa and gokarna seem to be your best options\n",
"considering the weather conditions over a similar period in the previous year. Please let me know your final selection and number of people who will \n",
"be travelling and your preferred currency for the calculations\n",
"user - I will select Goa and 2 people myself and my wife will be travelling. I want to see the costs in INR\n",
"assistant - [makes a final itinerary taking into consideration the information provided by the user. Decides the return date and shares the final details with \n",
"the user and asks him to confirm] here is your final itinerary \"xxx\" start date is June 1st, return date is June 6th. Is this fine or do you want to make a few changes \n",
"to the itinerary or start and/or end dates?\n",
"user - please proceed\n",
"assistant - [makes the tool call check_flights_function and passes the following information - Origin: BLR, Destination: GOI, Departure date - 2025-06-01,\n",
"return date - 2025-06-06, no = 2 (number of people), INR (currency). Checks the cost of the return flight ticket as a response from the tool call, analyzes and \n",
"finally showcases the user the flight expenses] Here you go, please find the flight details and the expenses for the same \"xx\". Can you please confirm the number\n",
"of rooms you will be availing so that I can check the hotel details?\n",
"user - sure, it will be 1 room\n",
"assistant - [makes the final tool call check_stays_function to find hotel price details] So the hotel options are \"xxx\" the prices are mostly in the range \"xx\", Your final\n",
"itinerary is \"xxx\" with flights booked from bangalore to goa as \"xx\" flight expense - \"xx\", hotel expense \"xx\", total expense \"xx\"\n",
"\n",
"Make sure that the travel start date is confirmed by the user and you are passing the travel start date + 1 \n",
"as the \"check_weather\" tool call argument. Also, make sure that in your final answer, you do not disclose the exact date for which \n",
"you made the API call. Rather say - Based on the weather conditions over a similar period last year, I will recommend\n",
"location x and y. But make sure you pass the date parameter as start_date + 1 from the previous year (2024) to the check_weather tool call\n",
"\n",
"for the check_flights_function tool call, please confirm the currency every time\n",
"for the check_weather tool call, please provide the date_year field each time\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c919b13a-50b6-4510-8e9d-02cdfd95cb98",
"metadata": {},
"outputs": [],
"source": [
"def check_weather(location, latitudes, longitudes, date_year):\n",
" \n",
" print(location)\n",
" print(latitudes)\n",
" print(longitudes)\n",
" print(date_year)\n",
" # if not (len(location) == len(latitudes) == len(longitudes) == len(date_year)):\n",
" # raise ValueError(\"All input lists must have the same length.\")\n",
"\n",
" timestamp1=get_unix_timestamp(date_year)\n",
" weather_data = []\n",
"\n",
" url = (\n",
" f\"https://api.openweathermap.org/data/3.0/onecall/timemachine?\"\n",
" f\"lat={latitudes}&lon={longitudes}&dt={timestamp1}&appid={open_weather_api_key}&units=metric\"\n",
" )\n",
"    # print(url)  # commented out: the URL embeds the API key, so avoid printing it\n",
" try:\n",
" response = requests.get(url)\n",
" response.raise_for_status()\n",
" data = response.json()\n",
"\n",
" # Use first available hourly data as representative\n",
" hourly = data.get(\"data\") or data.get(\"hourly\") or []\n",
" if not hourly:\n",
" raise ValueError(\"No hourly data found in response.\")\n",
"\n",
" weather_point = hourly[0]\n",
" temperature = weather_point.get(\"temp\")\n",
" weather_desc = weather_point.get(\"weather\", [{}])[0].get(\"description\", \"N/A\")\n",
"\n",
" precipitation = 0\n",
" precip_type = \"none\"\n",
" if \"rain\" in weather_point:\n",
" precipitation = weather_point[\"rain\"].get(\"1h\", 0)\n",
" precip_type = \"rain\"\n",
" elif \"snow\" in weather_point:\n",
" precipitation = weather_point[\"snow\"].get(\"1h\", 0)\n",
" precip_type = \"snow\"\n",
"\n",
" weather_data.append({\n",
" \"location\": location,\n",
" \"date_year\": timestamp1,\n",
" \"temperature\": temperature,\n",
" \"weather\": weather_desc,\n",
" \"precipitation_type\": precip_type,\n",
" \"precipitation_mm\": precipitation\n",
" })\n",
"\n",
" except requests.RequestException as e:\n",
" weather_data.append({\n",
" \"location\": location,\n",
" \"date_year\": timestamp1,\n",
" \"error\": f\"Request failed: {e}\"\n",
" })\n",
" except Exception as e:\n",
" weather_data.append({\n",
" \"location\": location,\n",
" \"date_year\": timestamp1,\n",
" \"error\": f\"Processing error: {e}\"\n",
" })\n",
"\n",
" print(weather_data)\n",
"\n",
" return weather_data\n"
]
},
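{
"cell_type": "code",
"execution_count": null,
"id": "check-weather-example-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative call for the helper above (commented out since it needs a live\n",
"# OpenWeather key). The coordinates are approximate values for Goa, and the date\n",
"# follows the start_date + 1 from the previous year convention described earlier:\n",
"# check_weather(\"Goa\", \"15.2993\", \"74.1240\", \"2024-06-02\")"
]
},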
{
"cell_type": "code",
"execution_count": 5,
"id": "73c4e65a-5080-448a-b3be-0914b10f99f7",
"metadata": {},
"outputs": [],
"source": [
"# call_amadeus(\"BLR\",\"GOI\",\"2025-07-29\",\"2025-08-05\",2,\"INR\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7115bf68-dc5d-4b54-b5a2-07662e74af5f",
"metadata": {},
"outputs": [],
"source": [
"def extract_sorted_flights_by_price_with_baggage_weight(response_json):\n",
" results = []\n",
" \n",
" # Map carrier codes to full names\n",
" carrier_dict = response_json.get(\"dictionaries\", {}).get(\"carriers\", {})\n",
"\n",
" for offer in response_json.get(\"data\", []):\n",
" itineraries = offer.get(\"itineraries\", [])\n",
" traveler_pricing = offer.get(\"travelerPricings\", [])[0]\n",
" fare_details = traveler_pricing.get(\"fareDetailsBySegment\", [])\n",
" price = float(offer.get(\"price\", {}).get(\"total\", 0.0))\n",
" currency = offer.get(\"price\", {}).get(\"currency\", \"INR\")\n",
"\n",
" outbound_segment = itineraries[0][\"segments\"][0]\n",
" inbound_segment = itineraries[1][\"segments\"][0]\n",
"\n",
" outbound_airline = carrier_dict.get(outbound_segment[\"carrierCode\"], outbound_segment[\"carrierCode\"])\n",
" inbound_airline = carrier_dict.get(inbound_segment[\"carrierCode\"], inbound_segment[\"carrierCode\"])\n",
"\n",
" # Build baggage weight lookup\n",
" baggage_lookup = {\n",
" fare[\"segmentId\"]: fare.get(\"includedCheckedBags\", {}).get(\"weight\", \"N/A\")\n",
" for fare in fare_details\n",
" }\n",
"\n",
" summary = {\n",
" \"Price\": price,\n",
" \"Currency\": currency,\n",
" \"Departure Time\": outbound_segment[\"departure\"][\"at\"],\n",
" \"Return Time\": inbound_segment[\"departure\"][\"at\"],\n",
" \"Departure Airline\": outbound_airline,\n",
" \"Return Airline\": inbound_airline,\n",
" \"Check-in Baggage Weight\": {\n",
"                \"Departure\": f'{baggage_lookup.get(outbound_segment[\"id\"])}kg' if baggage_lookup.get(outbound_segment[\"id\"]) not in (None, \"N/A\") else \"N/A\",\n",
"                \"Return\": f'{baggage_lookup.get(inbound_segment[\"id\"])}kg' if baggage_lookup.get(inbound_segment[\"id\"]) not in (None, \"N/A\") else \"N/A\",\n",
" }\n",
" }\n",
"\n",
" results.append(summary)\n",
"\n",
" # Sort by price\n",
" sorted_results = sorted(results, key=lambda x: x[\"Price\"])\n",
" return sorted_results\n",
"\n"
]
},
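{
"cell_type": "code",
"execution_count": null,
"id": "flight-summary-shape-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Shape of each entry returned by the helper above (keys taken from the summary\n",
"# dict it builds; the values shown here are invented for illustration only):\n",
"# {\n",
"#     \"Price\": 10500.0,\n",
"#     \"Currency\": \"INR\",\n",
"#     \"Departure Time\": \"2025-06-01T06:10:00\",\n",
"#     \"Return Time\": \"2025-06-06T09:45:00\",\n",
"#     \"Departure Airline\": \"INDIGO\",\n",
"#     \"Return Airline\": \"INDIGO\",\n",
"#     \"Check-in Baggage Weight\": {\"Departure\": \"15kg\", \"Return\": \"15kg\"}\n",
"# }"
]
},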
{
"cell_type": "code",
"execution_count": 7,
"id": "6929ba76-bf75-490b-adc9-c43bf90ce72d",
"metadata": {},
"outputs": [],
"source": [
"# def get_city_iata_code(city_name):\n",
"# # Step 1: Get access token\n",
"# print(f\"finding iata code for {city_name}\")\n",
"# auth_response = requests.post(\n",
"# \"https://test.api.amadeus.com/v1/security/oauth2/token\",\n",
"# headers={\"Content-Type\": \"application/x-www-form-urlencoded\"},\n",
"# data={\n",
"# \"grant_type\": \"client_credentials\",\n",
"# \"client_id\": amadeus_api_key,\n",
"# \"client_secret\": amadeus_secret\n",
"# }\n",
"# )\n",
"# auth_response.raise_for_status()\n",
"# access_token = auth_response.json()[\"access_token\"]\n",
"\n",
"# # Step 2: Search for city IATA code\n",
"# location_response = requests.get(\n",
"# \"https://test.api.amadeus.com/v1/reference-data/locations\",\n",
"# headers={\"Authorization\": f\"Bearer {access_token}\"},\n",
"# params={\"keyword\": city_name, \"subType\": \"CITY\"}\n",
"# )\n",
"# location_response.raise_for_status()\n",
"# data = location_response.json().get(\"data\", [])\n",
"\n",
"# if not data:\n",
"# print(f\"No IATA code found for {city_name}\")\n",
"# return None\n",
"\n",
"# return data[0][\"iataCode\"]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "42a82601-1afa-4a0f-92bc-3cbfdfd9f119",
"metadata": {},
"outputs": [],
"source": [
"# # print(get_city_iata_code(\"bengaluru\"))\n",
"# def lower(s):\n",
"# result = \"\"\n",
"# for char in s:\n",
"#         # Check if char is uppercase (ASCII 65-90)\n",
"# if 'A' <= char <= 'Z':\n",
"# # Convert to lowercase by adding 32 to ASCII value\n",
"# result += chr(ord(char) + 32)\n",
"# else:\n",
"# result += char\n",
"# return result"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "aa6f59ce-9c7d-46ec-945e-aa40ee88e392",
"metadata": {},
"outputs": [],
"source": [
"def call_amadeus(origin, destination, departure_date, return_date, no, currency):\n",
" # or1=get_city_iata_code(lower(origin))\n",
" # dest=get_city_iata_code(lower(destination))\n",
" # print(f\"iata codes origin - {or1}, destination - {dest}\")\n",
" or1=origin\n",
" dest=destination\n",
" print(f\"origin is {or1}, destination is {dest}\")\n",
" auth_response = requests.post(\n",
" \"https://test.api.amadeus.com/v1/security/oauth2/token\",\n",
" data={\n",
" \"grant_type\": \"client_credentials\",\n",
" \"client_id\": amadeus_api_key,\n",
" \"client_secret\": amadeus_secret\n",
" }\n",
" )\n",
" access_token = auth_response.json()['access_token']\n",
"\n",
" # Search flights\n",
" headers = {\"Authorization\": f\"Bearer {access_token}\"}\n",
" params = {\n",
" \"originLocationCode\": or1,\n",
" \"destinationLocationCode\": dest,\n",
" \"departureDate\": departure_date,\n",
" \"returnDate\": return_date,\n",
"        \"adults\": no,  # use the requested passenger count instead of a hardcoded value\n",
" \"nonStop\": \"false\",\n",
" \"currencyCode\": currency,\n",
" \"max\":4\n",
" }\n",
" response = requests.get(\n",
" \"https://test.api.amadeus.com/v2/shopping/flight-offers\",\n",
" headers=headers,\n",
" params=params\n",
" )\n",
" \n",
" # print(response.json())\n",
"\n",
" return extract_sorted_flights_by_price_with_baggage_weight(response.json())"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "973d6078-baf8-4e88-a9f2-529fff15dee6",
"metadata": {},
"outputs": [],
"source": [
"def get_access_token():\n",
" url = 'https://test.api.amadeus.com/v1/security/oauth2/token'\n",
" payload = {\n",
" 'grant_type': 'client_credentials',\n",
" 'client_id': amadeus_api_key,\n",
" 'client_secret': amadeus_secret\n",
" }\n",
"\n",
" response = requests.post(url, data=payload)\n",
" response.raise_for_status()\n",
" return response.json()['access_token']"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "ad87f1f3-3fea-437b-9f4e-f1b46e1728fd",
"metadata": {},
"outputs": [],
"source": [
"def get_hotel_ids(city_code, radius_km):\n",
" print(\"--------------------checking hotel ids--------------------\")\n",
" token = get_access_token()\n",
"    # print(f\"Access Token: {token}\")  # avoid printing credentials\n",
" url = 'https://test.api.amadeus.com/v1/reference-data/locations/hotels/by-city'\n",
" headers = {\n",
" 'Authorization': f'Bearer {token}'\n",
" }\n",
" params = {\n",
" 'cityCode': city_code,\n",
" 'radius': radius_km,\n",
" 'radiusUnit': 'KM',\n",
" # 'amenities': 'SWIMMING_POOL', # Optional filter\n",
" # 'ratings': '3', # Optional filter\n",
" 'hotelSource': 'ALL'\n",
" }\n",
"\n",
" response = requests.get(url, headers=headers, params=params)\n",
" response.raise_for_status()\n",
" data = response.json().get('data', [])\n",
" hotel_ids = [hotel['hotelId'] for hotel in data]\n",
" \n",
" print(f\"✅ Found {len(hotel_ids)} hotels in {city_code}\")\n",
" return hotel_ids[:20]\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7191f8a3-aabc-4f8e-bf11-635ecaf40c5d",
"metadata": {},
"outputs": [],
"source": [
"def get_hotel_offers(city_code, check_in_date, no,rooms, check_out_date):\n",
" print(\"---------------inside get hotel offers--------------\")\n",
" hotel_ids=get_hotel_ids(city_code,10)\n",
" \n",
" return get_hotel_offers_by_ids(hotel_ids,check_in_date,no, rooms,check_out_date)\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "e6323979-352c-4fb6-9fe7-216e3613513f",
"metadata": {},
"outputs": [],
"source": [
"def get_hotel_offers_by_ids(hotel_ids, check_in_date, adults, rooms, check_out_date):\n",
" print(\"--------------------checking hotel offers based on ids--------------------\")\n",
" token = get_access_token()\n",
"    # print(f\"Access Token: {token}\")  # avoid printing credentials\n",
" url = 'https://test.api.amadeus.com/v3/shopping/hotel-offers'\n",
" headers = {\n",
" 'Authorization': f'Bearer {token}'\n",
" }\n",
"\n",
" all_offers = []\n",
"\n",
" for hotel_id in hotel_ids:\n",
" params = {\n",
" 'hotelIds': hotel_id,\n",
" # 'adults': adults,\n",
" 'checkInDate': check_in_date,\n",
" 'checkOutDate': check_out_date\n",
" # 'roomQuantity': rooms,\n",
" # 'paymentPolicy': 'NONE',\n",
" # 'includeClosed': 'false',\n",
" # 'bestRateOnly': 'true',\n",
" # 'view': 'FULL',\n",
" # 'sort': 'PRICE'\n",
" }\n",
"\n",
" try:\n",
" print(f\"🔍 Checking hotel ID: {hotel_id}\")\n",
" response = requests.get(url, headers=headers, params=params)\n",
" response.raise_for_status()\n",
" offers = response.json()\n",
" if \"data\" in offers and offers[\"data\"]:\n",
" print(f\"✅ Found offers for hotel ID: {hotel_id}\")\n",
" all_offers.extend(offers[\"data\"])\n",
" else:\n",
" print(f\"⚠️ No offers returned for hotel ID: {hotel_id}\")\n",
" except requests.exceptions.HTTPError as e:\n",
" print(f\"❌ HTTPError for hotel ID {hotel_id}: {e}\")\n",
"\n",
" if all_offers:\n",
" return json.dumps({\"data\": all_offers}, indent=2)\n",
" else:\n",
" return json.dumps({\"message\": \"No valid hotel offers found.\"}, indent=2)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "fa82a6d1-cd74-46aa-9b99-c56dc02cddff",
"metadata": {},
"outputs": [],
"source": [
"# print(get_hotel_offers(\"GOI\",\"2025-06-03\",2,1,\"2025-06-08\"))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "687e610d-0951-400b-b575-9a83b788bf79",
"metadata": {},
"outputs": [],
"source": [
"check_stays_function = {\n",
" \"name\": \"get_hotel_offers\",\n",
"    \"description\": \"Call this tool whenever you need to check the hotel availability and prices for the vacation destination. You need to supply the city_code, check_in_date,\\\n",
"    check_out_date, number of guests, and number of rooms\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"city_code\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The IATA code for the vacation destination\",\n",
" },\n",
" \"check_in_date\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"the date when the user will be checking into the hotel\",\n",
" },\n",
" \"no\": {\n",
" \"type\": \"string\",\n",
"            \"description\": \"the number of guests for whom the reservation needs to be made\",\n",
" },\n",
" \"rooms\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The number of rooms to be reserved as confirmed by the user\",\n",
" },\n",
" \"check_out_date\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"the date when the user will be checking out of the hotel\",\n",
" },\n",
" },\n",
" \"required\": [\"city_code\",\"check_in_date\",\"no\",\"rooms\",\"check_out_date\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "717913a3-aab8-4df7-b9a1-c4bbc649babd",
"metadata": {},
"outputs": [],
"source": [
"check_flights_function = {\n",
" \"name\": \"call_amadeus\",\n",
" \"description\": \"Call this tool whenever you need to check the flight prices and other details for a return \\\n",
" trip from the origin to the destination location (where the user wants to spend his vacation). Make sure that you \\\n",
" supply the origin, destination, departure date, return date, number of tickets, and the \\\n",
" currency in which the user would like to pay. Please note the details provided will be for the return trip and NOT one-way\\\n",
" \",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"origin\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"Origin location for the user - his origin city or a nearby city with an airport\",\n",
" },\n",
" \"destination\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"Destination location for the user - his vacation destination airport or airport \\\n",
" which is nearby to his vacation destination\",\n",
" },\n",
" \"departure_date\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"the start date for the user's vacation\",\n",
" },\n",
" \"return_date\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"the end date/ return date for the user's vacation\",\n",
" },\n",
" \"no\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"the number of tickets to purchase\",\n",
" },\n",
" \"currency\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"the currency in which payment is to be made\",\n",
" },\n",
" },\n",
" \"required\": [\"origin\",\"destination\",\"departure_date\",\"return_date\",\"no\",\"currency\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d2628781-6f5e-4ac1-bbe3-2e08aa0aae0d",
"metadata": {},
"outputs": [],
"source": [
"check_weather_function = {\n",
" \"name\": \"check_weather\",\n",
" \"description\": \"Call this tool whenever you need to check the weather of a location for a specific\\\n",
" time from the previous year. The tool will require -\\\n",
"    1) the LLM to supply details of one location (based on the category - beaches or mountains or any other category the user selects) \\\n",
" and to which the user might travel to, \\\n",
" 2) the latitude and longitude of that location. \\\n",
" 3) the date_year, which basically is (vacation start date + 1 from previous year.) - this is essentially the date against which the weather conditions are to be \\\n",
" checked. For simplicity, we would keep it as the vacation start date + 1 from the previous year. Example, if the user provides the date as june 3rd, 2025,\\\n",
" date_year will be june 4th, 2024.\\\n",
" This tool call will return a list of weather situations one year ago for the chosen location.\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"location\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"One of the locations that is near the user's location based on the\\\n",
" category the user selects (beaches or mountains or any other destination category based on the user's choice)\",\n",
" },\n",
" \"lat\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The latitude of the location\",\n",
" },\n",
" \"long\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The longitude of the location\",\n",
" },\n",
" \"date_year\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The date of the previous year for which the weather needs to be fetched\",\n",
" }\n",
" },\n",
" \"required\": [\"location\",\"lat\",\"long\",\"date_year\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "1d5d74a0-9c25-46a4-84ee-1f700bd55fa7",
"metadata": {},
"outputs": [],
"source": [
"# And this is included in a list of tools:\n",
"\n",
"tools = [{\"type\": \"function\", \"function\": check_weather_function},\n",
" {\"type\": \"function\", \"function\": check_flights_function},\n",
" {\"type\": \"function\", \"function\": check_stays_function}]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "fa18f535-f8a7-4386-b39a-df0f84d23406",
"metadata": {},
"outputs": [],
"source": [
"def play_audio(audio_segment):\n",
" temp_dir = tempfile.gettempdir()\n",
" temp_path = os.path.join(temp_dir, \"temp_audio.wav\")\n",
" try:\n",
" audio_segment.export(temp_path, format=\"wav\")\n",
" # time.sleep(3) # Student Dominic found that this was needed. You could also try commenting out to see if not needed on your PC\n",
" subprocess.call([\n",
" \"ffplay\",\n",
" \"-nodisp\",\n",
" \"-autoexit\",\n",
" \"-hide_banner\",\n",
" temp_path\n",
" ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)\n",
" finally:\n",
" try:\n",
" os.remove(temp_path)\n",
" except Exception:\n",
" pass\n",
" \n",
"def talker(message):\n",
" response = openai.audio.speech.create(\n",
" model=\"tts-1\",\n",
" voice=\"alloy\", # Also, try replacing with onyx\n",
" input=message\n",
" )\n",
" audio_stream = BytesIO(response.content)\n",
" audio = AudioSegment.from_file(audio_stream, format=\"mp3\")\n",
" play_audio(audio)\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "b588d711-5f20-4a3a-9422-81a1fda8d5b0",
"metadata": {},
"outputs": [],
"source": [
"# We have to write that function handle_tool_call:\n",
"\n",
"def handle_tool_call1(name, args):\n",
" location = args.get('location')\n",
" lat = args.get('lat')\n",
" long = args.get('long')\n",
" date_year = args.get('date_year')\n",
"    weather = None  # default so we never return an undefined name\n",
"    if name.replace('\"','') == \"check_weather\":\n",
"        weather = check_weather(location, lat, long, date_year)\n",
"    \n",
"    return weather"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "4eaf63c6-d590-44b8-a508-6e99e314dee1",
"metadata": {},
"outputs": [],
"source": [
"# We have to write that function handle_tool_call:\n",
"\n",
"def handle_tool_call2(name, args):\n",
" origin = args.get('origin')\n",
" destination = args.get('destination')\n",
" departure_date = args.get('departure_date')\n",
" return_date = args.get('return_date')\n",
" no = args.get('no')\n",
" currency = args.get('currency')\n",
"    flights = None  # default so we never return an undefined name\n",
"    if name.replace('\"','') == \"call_amadeus\":\n",
"        flights = call_amadeus(origin, destination, departure_date, return_date, no, currency)\n",
"    \n",
"    return flights"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "dd85e8dc-6c40-4b6d-b1c3-0efe67093150",
"metadata": {},
"outputs": [],
"source": [
"# We have to write that function handle_tool_call:\n",
"\n",
"def handle_tool_call3(name, args):\n",
" city_code = args.get('city_code')\n",
" check_in_date = args.get('check_in_date')\n",
" no = args.get('no')\n",
" rooms = args.get('rooms')\n",
" check_out_date = args.get('check_out_date')\n",
"    hotels = None  # default so we never return an undefined name\n",
"    if name.replace('\"','') == \"get_hotel_offers\":\n",
"        hotels = get_hotel_offers(city_code, check_in_date, no, rooms, check_out_date)\n",
"    \n",
"    return hotels"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "04a11068-96ab-40eb-9185-1177835a3de7",
"metadata": {},
"outputs": [],
"source": [
"def chat_open_ai(history):\n",
" messages = [{\"role\": \"system\", \"content\": system_message}] + history \n",
" response = openai.chat.completions.create(model=gpt_model, messages=messages, tools=tools)\n",
"\n",
" tool_responses = []\n",
"\n",
" if response.choices[0].finish_reason == \"tool_calls\":\n",
" message = response.choices[0].message\n",
" tool_calls = message.tool_calls # renamed to avoid UnboundLocalError\n",
"\n",
" print(f\"tool calls \\n\\n {tool_calls}\")\n",
"\n",
" for tool_call in tool_calls:\n",
" tool_id = tool_call.id\n",
" name = tool_call.function.name\n",
" args = json.loads(tool_call.function.arguments)\n",
"\n",
" # Call the tool handler\n",
" result = \"\"\n",
" if name == \"check_weather\":\n",
" result = handle_tool_call1(name, args)\n",
" elif name == \"call_amadeus\":\n",
" result = handle_tool_call2(name, args)\n",
" elif name == \"get_hotel_offers\":\n",
" result = handle_tool_call3(name, args)\n",
"\n",
" tool_responses.append({\n",
" \"role\": \"tool\",\n",
" \"tool_call_id\": tool_id,\n",
" \"content\": json.dumps(result),\n",
" })\n",
"\n",
" print(f\"tool responses {tool_responses}\")\n",
" messages.append(message)\n",
" messages.extend(tool_responses) # important fix here\n",
"\n",
" response = openai.chat.completions.create(\n",
" model=gpt_model,\n",
" messages=messages,\n",
" tools=tools\n",
" )\n",
"\n",
" reply = response.choices[0].message.content\n",
" # talker(reply)\n",
" history += [{\"role\": \"assistant\", \"content\": reply}]\n",
"\n",
" return history\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "a2547bb0-43a5-4b1d-8b9a-95da15a11040",
"metadata": {},
"outputs": [],
"source": [
"def chat(history):\n",
" # + [{\"role\": \"user\", \"content\": message}]\n",
" # if Model==\"Open AI\":\n",
" history = chat_open_ai(history)\n",
" # \n",
"\n",
" return history"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "36e11d99-9281-4efd-a792-dd4fa5935917",
"metadata": {},
"outputs": [],
"source": [
"def listen2(history):\n",
" import speech_recognition as sr\n",
"\n",
" r = sr.Recognizer()\n",
" with sr.Microphone() as source:\n",
" print(\"Speak now...\")\n",
" audio = r.listen(source, phrase_time_limit=30)\n",
" text=\"\"\n",
" try:\n",
" text = r.recognize_google(audio)\n",
" print(\"You said:\", text)\n",
" except sr.UnknownValueError:\n",
" print(\"Could not understand audio.\")\n",
"\n",
" history += [{\"role\":\"user\", \"content\":text}] \n",
" return \"\", history"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "54e18ba1-78c9-4435-9f12-50fb93ba41fc",
"metadata": {},
"outputs": [],
"source": [
"def get_unix_timestamp(date):\n",
" if date is None:\n",
" return \"Please select a date.\"\n",
" if isinstance(date, str):\n",
"        # Parse a \"YYYY-MM-DD\" string into a date\n",
"        date = datetime.strptime(date, \"%Y-%m-%d\").date()\n",
"\n",
"    dt = datetime.combine(date, time(0, 0))  # midnight as a naive datetime (interpreted in the local timezone by .timestamp())\n",
" unix_timestamp = int(dt.timestamp())\n",
" # url = f\"https://api.openweathermap.org/data/3.0/onecall/timemachine?lat=39.099724&lon=-94.578331&dt={unix_timestamp}&appid={open_weather_api_key}\"\n",
" return unix_timestamp"
]
},
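{
"cell_type": "code",
"execution_count": null,
"id": "unix-timestamp-example-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check for the helper above. The exact value depends on the local\n",
"# timezone, because datetime.timestamp() interprets naive datetimes locally:\n",
"# print(get_unix_timestamp(\"2024-06-02\"))"
]
},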
{
"cell_type": "code",
"execution_count": 27,
"id": "133904cf-4d72-4552-84a8-76650f334857",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7860\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with gr.Blocks() as ui:\n",
" with gr.Row():\n",
" chatbot = gr.Chatbot(height=500, type=\"messages\")\n",
" # image_output = gr.Image(height=500)\n",
" # with gr.Row(): \n",
" # date_input = gr.DateTime()\n",
" # output_box = gr.Textbox(label=\"UNIX Timestamp + API URL\", lines=3)\n",
"\n",
" with gr.Row():\n",
" entry = gr.Textbox(label=\"Chat with our AI Assistant:\")\n",
" with gr.Row():\n",
" speak = gr.Button(\"click for voice search\") \n",
" with gr.Row():\n",
" clear = gr.Button(\"Clear\")\n",
"\n",
" def listen(history):\n",
" message, history=listen2(history)\n",
" return message, history\n",
"\n",
" def do_entry(message, history):\n",
" history += [{\"role\":\"user\", \"content\":message}]\n",
" return \"\", history\n",
"\n",
" # entry.submit(get_unix_timestamp, inputs=[date_input], outputs=[output_box])\n",
" entry.submit(do_entry, inputs=[entry, chatbot], outputs=[entry, chatbot]).then(\n",
" # chat, inputs=chatbot, outputs=[chatbot, image_output]\n",
" chat, inputs=[chatbot], outputs=[chatbot]\n",
" )\n",
" speak.click(listen, inputs=[chatbot], outputs=[entry, chatbot]).then(\n",
" chat, inputs=[chatbot], outputs=[chatbot]\n",
" )\n",
" clear.click(lambda: None, inputs=None, outputs=chatbot, queue=False)\n",
"\n",
"ui.launch(inbrowser=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6507811e-9f98-4b8b-a482-9b0089c60db2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3d3cf51-d2ae-4767-aa04-d8d6feb785bd",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,232 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "12ca6f8a",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import anthropic\n",
"from IPython.display import Markdown, display, update_display"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4b53a815",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key exists and begins sk-proj-\n",
"Anthropic API Key exists and begins sk-ant-\n",
"Google API Key not set\n"
]
}
],
"source": [
"# Load environment variables in a file called .env\n",
"# Print the key prefixes to help with any debugging\n",
"\n",
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if anthropic_api_key:\n",
" print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n",
"else:\n",
" print(\"Anthropic API Key not set\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n",
"else:\n",
" print(\"Google API Key not set\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d2b7cfe",
"metadata": {},
"outputs": [],
"source": [
"# Connect to OpenAI, Anthropic\n",
"\n",
"openai = OpenAI()\n",
"\n",
"claude = anthropic.Anthropic()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d88d4b",
"metadata": {},
"outputs": [],
"source": [
"class ConversationManager:\n",
" def __init__(self):\n",
" self.conversation_history = []\n",
" self.participants = {}\n",
" \n",
" def add_participant(self, name, chatbot):\n",
" \"\"\"Add a model to the conversation\"\"\"\n",
" self.participants[name] = chatbot\n",
" \n",
" def add_message(self, speaker, message):\n",
" \"\"\"Add a message to the shared conversation history\"\"\"\n",
" self.conversation_history.append({\n",
" \"speaker\": speaker,\n",
" \"role\": \"assistant\" if speaker in self.participants else \"user\",\n",
" \"content\": message\n",
" })\n",
" \n",
" def get_context_for_model(self, model_name):\n",
" \"\"\"Create context appropriate for the given model\"\"\"\n",
" # Convert the shared history to model-specific format\n",
" messages = []\n",
" for msg in self.conversation_history:\n",
" if msg[\"speaker\"] == model_name:\n",
" messages.append({\"role\": \"assistant\", \"content\": msg[\"content\"]})\n",
" else:\n",
" messages.append({\"role\": \"user\", \"content\": msg[\"content\"]})\n",
" return messages\n",
" \n",
" def run_conversation(self, starting_message, turns=3, round_robin=True):\n",
" \"\"\"Run a multi-model conversation for specified number of turns\"\"\"\n",
" current_message = starting_message\n",
" models = list(self.participants.keys())\n",
" \n",
" # Add initial message\n",
" self.add_message(\"user\", current_message)\n",
" \n",
" for _ in range(turns):\n",
" for model_name in models:\n",
" # Get context appropriate for this model\n",
" model_context = self.get_context_for_model(model_name)\n",
" \n",
" # Get response from this model\n",
" chatbot = self.participants[model_name]\n",
" response = chatbot.generate_response(model_context)\n",
" \n",
" # Add to conversation history\n",
" self.add_message(model_name, response)\n",
" \n",
" print(f\"{model_name}:\\n{response}\\n\")\n",
" \n",
" if not round_robin:\n",
" # If not round-robin, use this response as input to next model\n",
" current_message = response"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80c537c3",
"metadata": {},
"outputs": [],
"source": [
"class ChatBot:\n",
" def __init__(self, model_name, system_prompt, **kwargs):\n",
" self.model_name = model_name\n",
" self.system_prompt = system_prompt\n",
" self.api_key = kwargs.get('api_key', None)\n",
" self.base_url = kwargs.get('base_url', None)\n",
" \n",
" def generate_response(self, messages):\n",
" \"\"\"Generate a response based on provided messages without storing history\"\"\"\n",
" # Prepare messages including system prompt\n",
" full_messages = [{\"role\": \"system\", \"content\": self.system_prompt}] + messages\n",
" \n",
" try:\n",
"            if \"claude\" in self.model_name.lower():\n",
"                # The Claude API requires strictly alternating user/assistant roles,\n",
"                # so drop system messages and merge consecutive same-role messages\n",
"                claude_messages = []\n",
"                for m in messages:\n",
"                    if m[\"role\"] == \"system\":\n",
"                        continue\n",
"                    if claude_messages and claude_messages[-1][\"role\"] == m[\"role\"]:\n",
"                        claude_messages[-1][\"content\"] += \"\\n\" + m[\"content\"]\n",
"                    else:\n",
"                        claude_messages.append(dict(m))\n",
"                response = anthropic.Anthropic().messages.create(\n",
"                    model=self.model_name,\n",
"                    system=self.system_prompt,\n",
"                    messages=claude_messages,\n",
"                    max_tokens=200,\n",
"                )\n",
"                return response.content[0].text\n",
" \n",
" else:\n",
" # Use OpenAI API (works for OpenAI, Gemini via OpenAI client, etc)\n",
" openai_client = OpenAI(api_key=self.api_key, base_url=self.base_url)\n",
" response = openai_client.chat.completions.create(\n",
" model=self.model_name,\n",
" messages=full_messages,\n",
" max_tokens=200,\n",
" )\n",
" return response.choices[0].message.content\n",
" \n",
" except Exception as e:\n",
" return f\"Error: {str(e)}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d197c3ef",
"metadata": {},
"outputs": [],
"source": [
"# Initialize models\n",
"gpt_bot = ChatBot(\"gpt-4o-mini\", \"You are witty and sarcastic.\")\n",
"claude_bot = ChatBot(\"claude-3-haiku-20240307\", \"You are thoughtful and philosophical.\")\n",
"\n",
"model_name = \"qwen2.5:1.5b\"\n",
"system_prompt = \"You are a helpful assistant that is very argumentative in a snarky way.\"\n",
"kwargs = {\n",
" \"api_key\": \"ollama\",\n",
" \"base_url\": 'http://localhost:11434/v1'\n",
"}\n",
"qwen = ChatBot(model_name, system_prompt, **kwargs)\n",
"\n",
"# Set up conversation manager\n",
"conversation = ConversationManager()\n",
"conversation.add_participant(\"GPT\", gpt_bot)\n",
"conversation.add_participant(\"Claude\", claude_bot)\n",
"conversation.add_participant(\"Qwen\", qwen)\n",
"\n",
"# Run a multi-model conversation\n",
"conversation.run_conversation(\"What's the most interesting technology trend right now?\", turns=2)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (llms)",
"language": "python",
"name": "llms"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because it is too large


@@ -0,0 +1,292 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kayiMLgsBnVt"
},
"outputs": [],
"source": [
"!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate openai gradio"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"executionInfo": {
"elapsed": 15255,
"status": "ok",
"timestamp": 1744678358807,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "ByKEQHyhiLl7"
},
"outputs": [],
"source": [
"import os\n",
"import requests\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI\n",
"from google.colab import drive, userdata\n",
"from huggingface_hub import login\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer\n",
"import torch\n",
"import gradio as gr"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"executionInfo": {
"elapsed": 2,
"status": "ok",
"timestamp": 1744678358815,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "9tzK_t3jiOo1"
},
"outputs": [],
"source": [
"AUDIO_MODEL = 'whisper-1'\n",
"LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"executionInfo": {
"elapsed": 737,
"status": "ok",
"timestamp": 1744678360474,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "PYNmGaQniW73"
},
"outputs": [],
"source": [
"hf_token = userdata.get('HF_TOKEN')\n",
"login(hf_token, add_to_git_credential=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"executionInfo": {
"elapsed": 555,
"status": "ok",
"timestamp": 1744678362522,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "yGjVTeMEig-b"
},
"outputs": [],
"source": [
"openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n",
"openai = OpenAI(api_key=openai_api_key)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"executionInfo": {
"elapsed": 9,
"status": "ok",
"timestamp": 1744679561600,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "6jboyASHilLz"
},
"outputs": [],
"source": [
"def message_prompt(transcription):\n",
"    system_message = \"\"\"\n",
"    You are an assistant that translates Japanese text into two languages: 'English' and 'Filipino'.\n",
"    Display the translated text in markdown and include the original Japanese text rendered in 'Romaji'.\n",
"    Sample format: original text (converted to romaji): original_romaji_text_here \\n\\n translated to english: translated_english_text_here \\n\\n translated to filipino: translated_filipino_text_here\n",
"    \"\"\"\n",
"\n",
"    user_prompt = f\"Here is the transcribed Japanese audio; translate it into the two languages: '{transcription}'. No explanation, just the translated languages.\"\n",
"\n",
"    messages = [\n",
"        {\"role\": \"system\", \"content\": system_message},\n",
"        {\"role\": \"user\", \"content\": user_prompt}\n",
"    ]\n",
"\n",
"    return messages"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"executionInfo": {
"elapsed": 7,
"status": "ok",
"timestamp": 1744678366113,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "nYrf_wKmmoUs"
},
"outputs": [],
"source": [
"quant_config = BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_use_double_quant=True,\n",
" bnb_4bit_quant_type=\"nf4\",\n",
" bnb_4bit_compute_dtype=torch.bfloat16\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"executionInfo": {
"elapsed": 7,
"status": "ok",
"timestamp": 1744678367778,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "ESlOaRGioqUQ"
},
"outputs": [],
"source": [
"def translation(messages):\n",
"    # Loading the tokenizer and model here keeps the cell self-contained;\n",
"    # for repeated calls, consider loading them once outside the function\n",
"    tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n",
"    tokenizer.pad_token = tokenizer.eos_token\n",
"    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
"    streamer = TextStreamer(tokenizer)\n",
"    model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map=\"auto\", quantization_config=quant_config)\n",
"    outputs = model.generate(inputs, max_new_tokens=2000, streamer=streamer)\n",
"\n",
"    # Return only the newly generated tokens, not the echoed prompt\n",
"    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"executionInfo": {
"elapsed": 6,
"status": "ok",
"timestamp": 1744679567326,
"user": {
"displayName": "Kenneth Andales",
"userId": "04047926009324958530"
},
"user_tz": -480
},
"id": "FSGFTvIEys0j"
},
"outputs": [],
"source": [
"def translate_text(file):\n",
"    try:\n",
"        # Use a context manager so the audio file is closed after transcription\n",
"        with open(file, \"rb\") as audio_file:\n",
"            transcription = openai.audio.transcriptions.create(\n",
"                model=AUDIO_MODEL,\n",
"                file=audio_file,\n",
"                response_format=\"text\",\n",
"                language=\"ja\"\n",
"            )\n",
"\n",
"        messages = message_prompt(transcription)\n",
"        response = translation(messages)\n",
"\n",
"        return response\n",
"    except Exception as e:\n",
"        return f\"Unexpected error: {str(e)}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bexgSsWuvUmU"
},
"outputs": [],
"source": [
"with gr.Blocks() as demo:\n",
" gr.Markdown(\"# 🎙️ Anime Audio Translator\")\n",
" with gr.Row():\n",
" with gr.Column():\n",
" audio_file = gr.Audio(type=\"filepath\", label=\"Upload Audio\")\n",
" button = gr.Button(\"Translate\", variant=\"primary\")\n",
"\n",
" with gr.Column():\n",
" gr.Label(value=\"Result of translated text to 'English' and 'Filipino'\", label=\"Character\")\n",
" output_text = gr.Markdown()\n",
"\n",
" button.click(\n",
" fn=translate_text,\n",
" inputs=audio_file,\n",
" outputs=output_text,\n",
" trigger_mode=\"once\"\n",
" )\n",
"demo.launch()"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"authorship_tag": "ABX9TyO+HrhlkaVchpoGIfmYAHdf",
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -4,96 +4,113 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## codeXchange AI: Transform Code with a Click!\n",
"# codeXchange AI: Transform Code with a Click!\n",
"\n",
"**Created by Blaise Alako**\n",
"\n",
"Get ready to revolutionize your coding experience with **codeXchange AI**, a web-based Gradio app that converts code between programming languages in a flash! Powered by cutting-edge frontier and open-source LLMs, this tool is a game-changer for beginners diving into new languages, intermediates streamlining projects, and advanced users pushing the limits of innovation. Just paste or upload your code, choose your target language, and watch the magic unfold!\n",
"**codeXchange AI** is a web-based tool that simplifies converting code between different programming languages. It uses advanced open-source language models and cutting-edge AI to quickly and accurately translate your code. Supporting conversion across 17 programming languages, this tool is perfect whether you're learning a new language or optimizing multi-language projects. With its user-friendly interface, you can even have your code documented automatically by simply ticking the documentation option—this adds appropriate docstrings following the native documentation style of the target language. Developed as part of the [LLM Engineering course](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/46867711#content) by Ed Donner.\n",
"\n",
"**Why codeXchange AI?**\n",
"- **Effortless**: No downloads—just pure web-based magic.\n",
"- **Brilliant**: AI-driven conversions that nail accuracy.\n",
"- **Adaptable**: Add new languages or models with ease.\n",
"\n",
"Explore the source code [codeXchange AI](https://github.com/alakob/ai_code_converter) and experience the thrill!\n",
"**Key Features of codeXchange AI:**\n",
"- **Effortless Conversion:** A fully web-based solution that requires no local installations.\n",
"- **AI-Driven Accuracy:** Harnessing advanced language models for reliable and contextually accurate code conversions.\n",
"- **Adaptable and Scalable:** Easily extend the tool to accommodate new languages and transformation models.\n",
"\n",
"Discover more details and explore the project on the [codeXchange AI GitHub repository](https://github.com/alakob/ai_code_converter).\n",
"\n",
"---\n",
"\n",
"### Table of Contents\n",
"1. [Explore the Interface](#explore-the-interface)\n",
"2. [Upload and Convert](#upload-and-convert)\n",
"3. [See the Results](#see-the-results)\n",
"4. [Unleash Advanced Features](#unleash-advanced-features)\n",
"5. [Performance That Wows](#performance-that-wows)\n",
"6. [Get Started Now](#get-started-now)\n",
"1. [Overview of codeXchange AI](#overview-of-codexchange-ai)\n",
"2. [Uploading Your Code](#uploading-your-code)\n",
"3. [Instant Conversion Process](#instant-conversion-process)\n",
"4. [Reviewing the Results](#reviewing-the-results)\n",
"5. [Advanced Customization Options](#advanced-customization-options)\n",
"6. [Performance and Optimization](#performance-and-optimization)\n",
"7. [Get Started with codeXchange AI](#get-started-with-codexchange-ai)\n",
"\n",
"---\n",
"\n",
"### Explore the Interface\n",
"## Overview of codeXchange AI\n",
"\n",
"#### A Sleek Starting Point\n",
"Step into the world of codeXchange AI with its stunningly simple interface, designed to make your coding journey a breeze!\n",
"### A Seamless Code Transformation Tool\n",
"**codeXchange AI** delivers an accessible yet powerful solution for converting code between programming languages. Designed with both novice coders and experienced developers in mind, it highlights how modern AI can simplify and accelerate the code migration process. Immerse yourself in a world where conversion is not only accurate but also a valuable learning opportunity.\n",
"\n",
"![Initial Interface](screenshots/codeXchange_1.png) \n",
"*Screenshot: The app's clean starting screen, ready for your code.*\n",
"\n",
"With options to upload files or pick example snippets, you're just a click away from transforming your code.\n",
"![App Interface Overview](screenshots/codeXchange_1.png) \n",
"*Figure 1: The welcoming interface of codeXchange AI, inviting you to begin your transformative journey.*\n",
"\n",
"---\n",
"\n",
"### Upload and Convert\n",
"## Uploading Your Code\n",
"\n",
"#### Load Your Code with Ease\n",
"Whether you're a beginner or a pro, uploading your code is a snap. Drag and drop a file, or select a preloaded snippet to kick things off.\n",
"### Prepare Your Source Code for Conversion\n",
"Experience the ease of preparing your code for a swift transformation. The intuitive upload section allows you to drag-and-drop your files or select from preloaded example snippets, making the initiation process both fast and user-friendly.\n",
"\n",
"![Loading Code](screenshots/codeXchange_2.png) \n",
"*Screenshot: The upload section with a dropdown for example snippets.*\n",
"![Uploading Interface](screenshots/codeXchange_2.png) \n",
"*Figure 2: The upload area featuring convenient options to either insert your code directly or choose from examples.*\n",
"\n",
"Choose your input language, pick your target, and hit “Convert”—it's that easy to bridge the language gap!\n",
"This design caters to a variety of programming languages, ensuring your input is processed with high precision from the outset.\n",
"\n",
"---\n",
"\n",
"### See the Results\n",
"## Instant Conversion Process\n",
"\n",
"#### Witness the Transformation\n",
"Watch codeXchange AI work its magic! It converts your code with precision, adding helpful documentation to make the output crystal clear.\n",
"### Transform Your Code in Real Time\n",
"Once your code is submitted, codeXchange AI activates its powerful engine. Simply select your target language and hit “Convert.” The application seamlessly translates your code, incorporating essential documentation to ensure clarity, usability, and maintainability.\n",
"\n",
"![Conversion Output](screenshots/codeXchange_3.png) \n",
"*Screenshot: A converted result with documentation, ready to run.*\n",
"![Conversion Process](screenshots/codeXchange_3.png) \n",
"*Figure 3: The real-time conversion stage, where your code is transformed with integrated documentation.*\n",
"\n",
"From Python to C++ or beyond, the app ensures your code is ready to shine in its new language.\n",
"This process not only demystifies the syntactical shifts between languages but also serves as an insightful demonstration of AI's capabilities in practical coding scenarios.\n",
"\n",
"---\n",
"\n",
"### Unleash Advanced Features\n",
"## Reviewing the Results\n",
"\n",
"#### Power Up Your Workflow\n",
"For those who love to tinker, codeXchange AI offers exciting customization! Select different models, adjust the “Temperature” for creative flair, and even add new languages to the mix.\n",
"### Examine Your Newly Transformed Code\n",
"After conversion, the output is presented with meticulous attention to detail. The translated code retains its logic and documentation integrity, ensuring compatibility for both testing and production environments.\n",
"\n",
"![Advanced Options](screenshots/codeXchange_3_1.png) \n",
"*Screenshot: Interface showcasing model selection, temperature slider, and more.*\n",
"![Conversion Result](screenshots/codeXchange_3.png) \n",
"*Figure 4: The output display, showcasing the fully converted code complete with insightful documentation.*\n",
"\n",
"Download your converted code with a single click and take your projects to the next level!\n",
"This clear and organized presentation guarantees that your new code is production-ready and easily maintainable.\n",
"\n",
"---\n",
"\n",
"### Performance That Wows\n",
"## Advanced Customization Options\n",
"\n",
"#### Speed That Impresses\n",
"codeXchange AI doesn't just convert—it optimizes! Check out the performance boost when running your code in a new language, with execution times that'll leave you amazed.\n",
"### Tailor Your Conversion Experience\n",
"For users who wish to fine-tune their conversion settings, codeXchange AI offers a suite of advanced options. Customize parameters such as the AI model selection and “Temperature” settings to introduce creative variations in the output. Additionally, the platform readily supports the addition of new languages and LLM models as your needs evolve.\n",
"\n",
"![Performance Results](screenshots/codeXchange_4.png) \n",
"*Screenshot: Execution results highlighting speed improvements.*\n",
"![Advanced Settings](screenshots/codeXchange_3_1.png) \n",
"*Figure 5: The advanced settings panel featuring options for model selection, temperature control, and further customization.*\n",
"\n",
"From 31.49 seconds in Python to just 2.32 seconds in C++—see the difference for yourself!\n",
"This flexibility ensures that the tool can be precisely adapted to your development environment and evolving project requirements.\n",
"\n",
"---\n",
"\n",
"### Get Started Now\n",
"## Performance and Optimization\n",
"\n",
"Ready to transform your coding game? Jump into the [codeXchange AI source code](https://github.com/alakob/ai_code_converter). Convert, run, and download your code in seconds. Whether you're just starting out, managing complex projects, or innovating at an advanced level, this app is your ultimate coding companion.\n",
"### Experience Lightning-Fast Conversions\n",
"codeXchange AI not only transforms your code but also optimizes it for superior performance. Witness remarkable enhancements in execution speed across languages as the tool refines the code translation process. These performance metrics clearly demonstrate the efficiency gains achieved through AI-powered conversion.\n",
"\n",
"---\n"
"![Performance Metrics](screenshots/codeXchange_4.png) \n",
"*Figure 6: Detailed performance metrics showcasing execution time improvements across different programming languages.*\n",
"\n",
"With conversion times reduced dramatically, you're empowered to focus on innovation and development without delay.\n",
"\n",
"---\n",
"\n",
"## Get Started with codeXchange AI\n",
"\n",
"### Embark on Your Code Transformation Journey\n",
"Are you ready to enhance your coding skills and explore new possibilities? Dive into the full source code and setup instructions on [GitHub codeXchange AI](https://github.com/alakob/ai_code_converter). Whether you're experimenting with new languages, updating legacy projects, or pushing the frontiers of innovation, exploring codeXchange AI will expand your understanding of AI-driven code transformation.\n",
"\n",
"---\n",
"\n",
"### Acknowledgments\n",
"Special thanks to Ed Donner for his transformative [LLM Engineering course](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/46867711#content) that inspired this project.\n",
"\n"
]
}
],


@@ -0,0 +1,440 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "8dee7381-2291-4202-a6e6-9eb94e896141",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import io\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import google.generativeai\n",
"import anthropic\n",
"from IPython.display import Markdown, display, update_display\n",
"import gradio as gr\n",
"import subprocess\n",
"import platform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc145e4c-1e06-4414-aa2b-1ea1862b4600",
"metadata": {},
"outputs": [],
"source": [
"# environment\n",
"\n",
"load_dotenv(override=True)\n",
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n",
"os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfaf8584-a10f-43f0-b550-f1b2b6f07160",
"metadata": {},
"outputs": [],
"source": [
"# initialize\n",
"\n",
"openai = OpenAI()\n",
"claude = anthropic.Anthropic()\n",
"\n",
"OPENAI_MODEL = \"gpt-4o-mini\"\n",
"CLAUDE_MODEL = \"claude-3-haiku-20240307\"\n",
"\n",
"# OPENAI_MODEL = \"gpt-4o\"\n",
"# CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b47508e-dc60-4db5-a29c-f3f0ed57d894",
"metadata": {},
"outputs": [],
"source": [
"processor = platform.machine()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ee9ec20-3b1d-4a15-9ab3-b2fbb93296b4",
"metadata": {},
"outputs": [],
"source": [
"def get_name_by_extension(extension):\n",
" for lang in programming_languages:\n",
" if lang[\"extension\"] == extension:\n",
" return lang[\"name\"]\n",
" return None "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee408ffd-fde2-4c1e-b87f-c8dce2ad49bc",
"metadata": {},
"outputs": [],
"source": [
"def get_system_message(prog_lang):\n",
" name = get_name_by_extension(prog_lang)\n",
" \n",
"    system_message = f\"You are an assistant that reimplements Python code in {name} for an {processor} device. \"\n",
"    system_message += \"Respond only with code; use comments sparingly and do not provide any explanation other than occasional comments. \"\n",
"    system_message += f\"The {name} response needs to produce identical output in the fastest possible time. \"\n",
"    system_message += f\"If a function used in the Python code has no direct equivalent in {name}, substitute a compatible alternative; if none exists, raise an error.\"\n",
"\n",
" return system_message"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac8d5d3b-a018-4b94-8080-9b18f5634dc7",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(python, prog_lang):\n",
" name = get_name_by_extension(prog_lang)\n",
" \n",
" user_prompt = f\"Rewrite this Python code in {name} with the fastest possible implementation that produces identical output in the least time. \"\n",
" user_prompt += f\"Respond only with {name} code; do not explain your work other than a few comments. \"\n",
" user_prompt += \"Pay attention to number types to ensure no int overflows\\n\\n\"\n",
" user_prompt += python\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23c58e61-5fdd-41f5-9e60-a0847f4bf86f",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(python, prog_lang):\n",
" system_message = get_system_message(prog_lang)\n",
" \n",
" return [\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(python, prog_lang)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e193cd6-16f4-440a-9376-6041672f91fc",
"metadata": {},
"outputs": [],
"source": [
"# write the generated code to a file, e.g. optimized.cpp\n",
"\n",
"import re\n",
"\n",
"def write_output(content, prog_lang):\n",
"    # strip any markdown code fences (```cpp, ```js, ```php, ...) from the reply\n",
"    code = re.sub(r\"```[\\w+]*\\n?\", \"\", content)\n",
"\n",
"    with open(f\"optimized.{prog_lang}\", \"w\") as f:\n",
"        f.write(code)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28b0be5e-73b6-49d8-8ef6-8209eace5ee6",
"metadata": {},
"outputs": [],
"source": [
"python_hard = \"\"\"# Be careful to support large number sizes\n",
"\n",
"def lcg(seed, a=1664525, c=1013904223, m=2**32):\n",
" value = seed\n",
" while True:\n",
" value = (a * value + c) % m\n",
" yield value\n",
" \n",
"def max_subarray_sum(n, seed, min_val, max_val):\n",
" lcg_gen = lcg(seed)\n",
" random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n",
" max_sum = float('-inf')\n",
" for i in range(n):\n",
" current_sum = 0\n",
" for j in range(i, n):\n",
" current_sum += random_numbers[j]\n",
" if current_sum > max_sum:\n",
" max_sum = current_sum\n",
" return max_sum\n",
"\n",
"def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n",
" total_sum = 0\n",
" lcg_gen = lcg(initial_seed)\n",
" for _ in range(20):\n",
" seed = next(lcg_gen)\n",
" total_sum += max_subarray_sum(n, seed, min_val, max_val)\n",
" return total_sum\n",
"\n",
"# Parameters\n",
"n = 10000 # Number of random numbers\n",
"initial_seed = 42 # Initial seed for the LCG\n",
"min_val = -10 # Minimum value of random numbers\n",
"max_val = 10 # Maximum value of random numbers\n",
"\n",
"# Timing the function\n",
"import time\n",
"start_time = time.time()\n",
"result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n",
"end_time = time.time()\n",
"\n",
"print(\"Total Maximum Subarray Sum (20 runs):\", result)\n",
"print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2818063c-008e-4029-851a-959f63d3f0fc",
"metadata": {},
"outputs": [],
"source": [
"def stream_gpt(python, prog_lang): \n",
" stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python, prog_lang), stream=True)\n",
" reply = \"\"\n",
" for chunk in stream:\n",
" fragment = chunk.choices[0].delta.content or \"\"\n",
" reply += fragment\n",
"        yield reply.replace('```cpp\\n','').replace('```javascript\\n','').replace('```js\\n','').replace('```php\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e3e0502-8550-46fe-bd2f-394078db6576",
"metadata": {},
"outputs": [],
"source": [
"def stream_claude(python, prog_lang):\n",
" system_message = get_system_message(prog_lang)\n",
" \n",
" result = claude.messages.stream(\n",
" model=CLAUDE_MODEL,\n",
" max_tokens=2000,\n",
" system=system_message,\n",
" messages=[{\"role\": \"user\", \"content\": user_prompt_for(python, prog_lang)}],\n",
" )\n",
" reply = \"\"\n",
" with result as stream:\n",
" for text in stream.text_stream:\n",
" reply += text\n",
"            yield reply.replace('```cpp\\n','').replace('```javascript\\n','').replace('```js\\n','').replace('```php\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10accbb2-b56d-4c79-beef-928c2a3b58f0",
"metadata": {},
"outputs": [],
"source": [
"def optimize(python, model, prog_lang):\n",
" if model==\"GPT\":\n",
" result = stream_gpt(python, prog_lang)\n",
" elif model==\"Claude\":\n",
" result = stream_claude(python, prog_lang)\n",
" else:\n",
" raise ValueError(\"Unknown model\")\n",
" for stream_so_far in result:\n",
" yield stream_so_far "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1acb130-8b5c-4199-818a-3afa89c342cb",
"metadata": {},
"outputs": [],
"source": [
"def execute_python(code):\n",
" try:\n",
" output = io.StringIO()\n",
" sys.stdout = output\n",
"\n",
" namespace = {}\n",
" exec(code, namespace)\n",
" finally:\n",
" sys.stdout = sys.__stdout__\n",
" return output.getvalue()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e901e81-61d8-4ab2-9e16-f70c8ee6bdbe",
"metadata": {},
"outputs": [],
"source": [
"css = \"\"\"\n",
".python {background-color: #306998;}\n",
".cpp {background-color: #050;}\n",
".php {background-color: #cb7afa;}\n",
".js {background-color: #f4ff78;}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e0dfe2e-a87d-4595-b4ef-72797bd1ad44",
"metadata": {},
"outputs": [],
"source": [
"def execute_cpp(code):\n",
"    try:\n",
"        # Pass the command as a list without shell=True (which would ignore the\n",
"        # arguments) and use check=True so compile errors raise CalledProcessError\n",
"        compile_cmd = [\"clang++\", \"-Ofast\", \"-std=c++17\", \"-o\", \"optimized\", \"optimized.cpp\"]\n",
"        subprocess.run(compile_cmd, check=True, text=True, capture_output=True)\n",
"        run_result = subprocess.run([\"./optimized\"], check=True, text=True, capture_output=True)\n",
"        return run_result.stdout\n",
"    except subprocess.CalledProcessError as e:\n",
"        return f\"An error occurred:\\n{e.stderr}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91ba8a3c-8686-4636-bf21-efc861f3a2b7",
"metadata": {},
"outputs": [],
"source": [
"def execute_js(code):\n",
" try:\n",
" run_result = subprocess.run([\"node\", \"optimized.js\"], check=True, text=True, capture_output=True)\n",
" return run_result.stdout\n",
" except subprocess.CalledProcessError as e:\n",
" return f\"An error occurred:\\n{e.stderr}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9006f67-f631-4ad4-bf45-b9366c822a04",
"metadata": {},
"outputs": [],
"source": [
"def execute_php(code):\n",
" try:\n",
" run_result = subprocess.run([\"php\", \"optimized.php\"], check=True, text=True, capture_output=True)\n",
" return run_result.stdout\n",
" except subprocess.CalledProcessError as e:\n",
" return f\"An error occurred:\\n{e.stderr}\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3991a09-f60d-448a-8e92-2561296d05cf",
"metadata": {},
"outputs": [],
"source": [
"def handle_execution(code, prog_lang):\n",
"    write_output(code, prog_lang)\n",
"\n",
"    # Look up the language entry; fail clearly instead of silently running the\n",
"    # wrong executor when the extension is unknown\n",
"    lang = next((l for l in programming_languages if l[\"extension\"] == prog_lang), None)\n",
"    if lang is None:\n",
"        return f\"Unsupported language: {prog_lang}\"\n",
"    return lang[\"fn\"](code)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c127bbc9-ef4d-40e4-871a-85873fc9e406",
"metadata": {},
"outputs": [],
"source": [
"programming_languages = [\n",
"    {\"name\": \"C++\", \"extension\": \"cpp\", \"fn\": execute_cpp},\n",
"    {\"name\": \"JavaScript\", \"extension\": \"js\", \"fn\": execute_js},\n",
"    {\"name\": \"PHP\", \"extension\": \"php\", \"fn\": execute_php}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "126636a1-4315-4811-9de9-61ee032effc8",
"metadata": {},
"outputs": [],
"source": [
"def create_prog_lang_ui(lang, model):\n",
" prog_name = lang[\"name\"]\n",
" extension = lang[\"extension\"]\n",
" fn = lang[\"fn\"]\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" convert = gr.Button(f\"Convert to {prog_name}\")\n",
" converted_code = gr.Textbox(label=f\"Converted {prog_name} code:\", lines=10)\n",
"\n",
" with gr.Column():\n",
" prog_run = gr.Button(f\"Run {prog_name}\")\n",
" prog_out = gr.TextArea(label=f\"{prog_name} result:\", elem_classes=[extension])\n",
"\n",
" current_selected = gr.Dropdown([extension], value=extension, visible=False)\n",
" \n",
" convert.click(optimize, inputs=[python, model, current_selected], outputs=[converted_code])\n",
" prog_run.click(handle_execution, inputs=[converted_code, current_selected], outputs=[prog_out])\n",
"\n",
"with gr.Blocks(css=css) as ui:\n",
" gr.Markdown(\"# Convert code from Python to any Programming Language\")\n",
" with gr.Row():\n",
" with gr.Column():\n",
" python = gr.Textbox(label=\"Python code:\", value=python_hard, lines=10)\n",
" with gr.Column():\n",
"            python_run = gr.Button(\"Run Python\")\n",
"            python_out = gr.TextArea(label=\"Python result:\", elem_classes=[\"python\"])\n",
" \n",
" with gr.Row():\n",
" model = gr.Dropdown([\"GPT\", \"Claude\"], label=\"Select model\", value=\"GPT\")\n",
"\n",
" python_run.click(execute_python, inputs=[python], outputs=[python_out]) \n",
"\n",
"\n",
" for lang in programming_languages:\n",
" create_prog_lang_ui(lang, model)\n",
"\n",
"ui.launch(\n",
" inbrowser=True\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
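
The `programming_languages` table above is a small dispatch pattern: each entry bundles a display name, a file extension, and an execution function, and the UI wires buttons to the handler looked up by extension. A minimal standalone sketch of the same idea, with stub handlers standing in for the real `execute_*` runners (the stubs are placeholders, not the notebook's actual functions):

```python
# Dispatch table: map a language to its handler, as the Gradio loop does.
# The execute_* stubs below are placeholders for the real sandboxed runners.
def execute_cpp(code: str) -> str:
    return f"[cpp] would compile and run {len(code)} chars"

def execute_js(code: str) -> str:
    return f"[js] would run {len(code)} chars with node"

programming_languages = [
    {"name": "C++", "extension": "cpp", "fn": execute_cpp},
    {"name": "Javascript", "extension": "js", "fn": execute_js},
]

def handle_execution(code: str, extension: str) -> str:
    # Look up the handler by extension, mirroring the hidden dropdown value in the UI
    table = {lang["extension"]: lang["fn"] for lang in programming_languages}
    return table[extension](code)

print(handle_execution("console.log(1)", "js"))
```

Adding a new target language then means appending one dict to the table, with no changes to the UI-building loop.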

View File

@@ -0,0 +1,74 @@
[
{
"id": "service_001",
"content": "We offer tire services including rotation, balancing, flat repair, and new tire sales and installation.",
"metadata": {
"source": "service_page",
"category": "tire_services",
"tags": "tire, rotation, repair, installation"
}
},
{
"id": "service_002",
"content": "Brake services include pad replacement, rotor resurfacing, and ABS diagnostics.",
"metadata": {
"source": "service_page",
"category": "brake_services",
"tags": "brake, pads, rotors, abs"
}
},
{
"id": "faq_001",
"content": "Walk-ins are welcome, but appointments are recommended for faster service.",
"metadata": {
"source": "faq",
"category": "appointments",
"tags": "appointment, walk-in"
}
},
{
"id": "faq_002",
"content": "Most oil changes are completed within 30–45 minutes.",
"metadata": {
"source": "faq",
"category": "oil_change",
"tags": "oil change, duration"
}
},
{
"id": "general_001",
"content": "Pinky's Auto Care is located at Rte 112, Yorkjuh, JH 98746. We're open Monday through Friday from 8am to 6pm, and Saturday from 9am to 2pm.",
"metadata": {
"source": "general_info",
"category": "location_hours",
"tags": "location, hours, contact"
}
},
{
"id": "promo_001",
"content": "At Pinky's Auto Care, we combine modern diagnostics with friendly, small-town service. Our ASE-certified mechanics serve Springfield with over 15 years of experience.",
"metadata": {
"source": "about_us",
"category": "branding",
"tags": "promo, about us, experience"
}
},
{
"id": "customer_query_001",
"content": "My car shakes when braking—do I need new rotors?",
"metadata": {
"source": "user_query",
"category": "brake_services",
"tags": "brake, rotor, vibration"
}
},
{
"id": "customer_query_002",
"content": "Can you align wheels on a 2021 Subaru Outback?",
"metadata": {
"source": "user_query",
"category": "wheel_alignment",
"tags": "wheel alignment, vehicle-specific"
}
}
]
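
Each record above follows a simple `id` / `content` / `metadata` schema. As a sketch (the `validate_record` helper is illustrative, not part of the repo), a quick sanity check before indexing might look like:

```python
# Validate that a knowledge-base record has an id, non-empty content,
# and metadata carrying the source/category/tags fields used downstream.
def validate_record(record: dict) -> bool:
    required_meta = {"source", "category", "tags"}
    return (
        bool(record.get("id"))
        and bool(record.get("content"))
        and required_meta <= set(record.get("metadata", {}))
    )

sample = {
    "id": "service_001",
    "content": "We offer tire services including rotation and balancing.",
    "metadata": {"source": "service_page", "category": "tire_services", "tags": "tire"},
}
print(validate_record(sample))        # well-formed record
print(validate_record({"id": "x"}))   # missing content and metadata
```

Running such a check over the whole file catches malformed entries before they silently produce empty Documents.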

View File

@@ -0,0 +1,410 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f3ce7a00-62c7-4cee-bed6-a89bf052e167",
"metadata": {},
"source": [
"# Colab Notebook RAG Assistant\n",
"\n",
"Short Notebook Description:\n",
"\n",
"This Jupyter/Colab notebook builds a Retrieval-Augmented Generation (RAG) assistant over your own collection of .ipynb files in Google Colab. It:\n",
"\n",
"1. Loads all notebooks from a local folder or mounted Google Drive.\n",
"\n",
"2. Chunks their content into manageable pieces.\n",
"\n",
"3. Embeds each chunk with OpenAI embeddings and stores them in a persistent Chroma vector database.\n",
"\n",
"4. Provides a ConversationalRetrievalChain with memory and a Gradio chat interface.\n",
"\n",
"5. For any user question, it returns both an answer and the names of the exact notebooks where the relevant information was found.\n",
"\n",
"This setup lets you query your entire notebook history—whether local or in Colab—just like a personal knowledge base."
]
},
{
"cell_type": "markdown",
"id": "1a7a0225-800c-4088-9bd2-ac98dbbb55c9",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe7e9772-171f-4ff6-bd3b-e77aa82b19d3",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import glob\n",
"from dotenv import load_dotenv\n",
"import gradio as gr\n",
"\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain.document_loaders import DirectoryLoader, NotebookLoader\n",
"from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter\n",
"from langchain_chroma import Chroma\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"from langchain.chains import RetrievalQA\n",
"\n",
"from sklearn.manifold import TSNE\n",
"import plotly.graph_objects as go\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "8dfd2b57-3b3b-4fc2-bb2d-40be7aba3a4a",
"metadata": {},
"source": [
"## Configuration & Set Environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d502b28-1d33-43bc-8797-41fed26d5fa0",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override = True)\n",
"\n",
"OPENAI_KEY = os.getenv('OPENAI_API_KEY')\n",
"NOTEBOOKS_DIR = os.getenv('NOTEBOOKS_DIR')\n",
"VECTOR_DB_DIR = os.getenv('VECTOR_DB_DIR')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af7f5fa2-78f8-45bf-a0e0-e1f93cc98f4b",
"metadata": {},
"outputs": [],
"source": [
"MODEL = 'gpt-4o-mini'\n",
"CHUNK_SIZE = 1000\n",
"CHUNK_OVERLAP = 200"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "82f7b583-d176-448b-b762-28618f05c660",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" # Colab\n",
" from google.colab import drive\n",
" print(\"Running in Colab: mounting Google Drive...\")\n",
" drive.mount('/content/drive')\n",
" is_colab = True\n",
"\n",
" # Colab defaults\n",
" NOTEBOOKS_DIR = '/content/drive/MyDrive/ColabNotebooks'\n",
" DB_DIR = VECTOR_DB_DIR or '/content/drive/MyDrive/colab_vector_db'\n",
"\n",
"except ImportError:\n",
" # Local Jupyter Lab:\n",
" print(\"Not in Colab: using local notebooks directory.\")\n",
" NOTEBOOKS_DIR = os.path.expanduser(NOTEBOOKS_DIR)\n",
" DB_DIR = VECTOR_DB_DIR\n",
"\n",
" # Verify the local notebooks directory exists\n",
" if not os.path.isdir(NOTEBOOKS_DIR):\n",
" raise FileNotFoundError(\n",
" f\"Local notebooks directory '{NOTEBOOKS_DIR}' not found.\" \n",
" \"\\nPlease sync your Google Drive folder (e.g., via Drive for Desktop) \"\n",
" \"or set NOTEBOOKS_DIR in your .env to the correct path.\"\n",
" )\n",
"# Confirm final paths\n",
"# print(f\"Indexing notebooks from: {NOTEBOOKS_DIR}\")\n",
"# print(f\"Chroma will store embeddings in: {DB_DIR}\")"
]
},
{
"cell_type": "markdown",
"id": "5eefd329-712f-4c43-b7aa-0322c0cd7c41",
"metadata": {},
"source": [
"## Read in Notebook files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4468cbc-8f04-47c7-a583-1cb81cdb17fb",
"metadata": {},
"outputs": [],
"source": [
"notebooks = glob.glob(\n",
" os.path.join(NOTEBOOKS_DIR, \"**\", \"*.ipynb\"),\n",
" recursive=True\n",
")\n",
"print(f\"Notebooks found: {len(notebooks)}\")\n",
"\n",
"\n",
"loader = DirectoryLoader(NOTEBOOKS_DIR,\n",
" glob = '**/*.ipynb',\n",
" loader_cls = NotebookLoader\n",
" )\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f8c0ce3-e3a9-4271-9824-e7fa42c8867d",
"metadata": {},
"outputs": [],
"source": [
"splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size = CHUNK_SIZE, \n",
" chunk_overlap = CHUNK_OVERLAP, \n",
" separators=[\"\\n## \", \"\\n### \", \"\\n#### \", \"\\n\\n\", \"\\n\", \" \", \"\"]\n",
")\n",
"\n",
"chunks = splitter.split_documents(docs)\n",
"print(f'Created {len(chunks)} chunks from your notebooks')"
]
},
{
"cell_type": "markdown",
"id": "d73f8869-020b-48ac-bce9-82fadd58e04b",
"metadata": {},
"source": [
"## Embedding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27d6269c-88ac-4da7-8e87-691308d9e473",
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY)\n",
"\n",
"if os.path.exists(DB_DIR):\n",
" Chroma(persist_directory = DB_DIR, embedding_function = embeddings).delete_collection()\n",
"\n",
"\n",
"vectorstore = Chroma.from_documents(\n",
" documents = chunks,\n",
" embedding = embeddings,\n",
"    persist_directory = DB_DIR\n",
")\n",
"\n",
"vector_count = vectorstore._collection.count()\n",
"print(f\"Vectorstore contains {vector_count} vectors.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d795bc7-82dc-4ad5-be39-97e08c033a4c",
"metadata": {},
"outputs": [],
"source": [
"sample_embedding = vectorstore._collection.get(limit=1, include=[\"embeddings\"])[\"embeddings\"][0]\n",
"dimensions = len(sample_embedding)\n",
"print(f\"There are {vectorstore._collection.count():,} vectors with {dimensions:,} dimensions in the vector store.\")"
]
},
{
"cell_type": "markdown",
"id": "bae1ab40-c22d-4815-bec4-840d69cf702b",
"metadata": {},
"source": [
"## Visualize in 3D"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3e30379-eb8f-469d-841e-cf95a542595b",
"metadata": {},
"outputs": [],
"source": [
"result = vectorstore._collection.get(include=['embeddings', 'documents',])\n",
"vectors = np.array(result['embeddings'])\n",
"documents = result['documents']\n",
"colors = ['blue'] * len(vectors)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b0c61e2-1d5d-429d-ace7-e883e051fdd2",
"metadata": {},
"outputs": [],
"source": [
"tsne = TSNE(n_components=3, random_state=42)\n",
"reduced_vectors = tsne.fit_transform(vectors)\n",
"\n",
"fig = go.Figure(data=[go.Scatter3d(\n",
" x=reduced_vectors[:, 0],\n",
" y=reduced_vectors[:, 1],\n",
" z=reduced_vectors[:, 2],\n",
" mode='markers',\n",
" marker=dict(size=4, color=colors, opacity=0.8),\n",
" text=[d[:100] + \"...\" for d in documents],\n",
" hoverinfo='text'\n",
")])\n",
"\n",
"fig.update_layout(\n",
" title='3D TSNE of Notebook-Chunks',\n",
" scene=dict(\n",
" xaxis_title=\"TSNE-1\",\n",
" yaxis_title=\"TSNE-2\",\n",
" zaxis_title=\"TSNE-3\"\n",
" ),\n",
" width=800,\n",
" height=600,\n",
" margin=dict(r=10, b=10, l=10, t=40)\n",
")\n",
"\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"id": "831684d5-5694-488f-aaed-e219d57b909c",
"metadata": {},
"source": [
"## Build LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02197e43-c958-4f70-be38-666ee4c1c4ae",
"metadata": {},
"outputs": [],
"source": [
"llm = ChatOpenAI(model_name = MODEL, temperature = 0.6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3aabba9e-e447-4597-a86e-fe3fc5b8babe",
"metadata": {},
"outputs": [],
"source": [
"qa = RetrievalQA.from_chain_type(\n",
" llm = llm,\n",
" chain_type=\"stuff\",\n",
" retriever=vectorstore.as_retriever(search_kwargs={\"k\": 4}),\n",
" return_source_documents=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a71daee6-7d82-4212-ae4f-a2553f1d3c8a",
"metadata": {},
"outputs": [],
"source": [
"memory = ConversationBufferMemory(\n",
" memory_key = 'chat_history',\n",
" return_messages = True\n",
")\n",
"\n",
"conv_chain = ConversationalRetrievalChain.from_llm(\n",
" llm = llm,\n",
" retriever = vectorstore.as_retriever(search_kwargs = {'k': 10}),\n",
" memory = memory\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d8e2b9e-bb50-4bda-9832-c6a2779526e3",
"metadata": {},
"outputs": [],
"source": [
"def chat_with_memory_and_sources(message, chat_history):\n",
" # Get the conversational answer (memory included)\n",
" conv_res = conv_chain.invoke({\n",
" \"question\": message,\n",
" \"chat_history\": chat_history\n",
" })\n",
" answer = conv_res[\"answer\"]\n",
"\n",
" # Retrieve source documents \n",
"    src_res = qa.invoke({\"query\": message})\n",
" src_docs = src_res[\"source_documents\"]\n",
"\n",
" # Extract and dedupe notebook filenames from metadata\n",
" notebooks = [\n",
" os.path.basename(doc.metadata.get(\"source\", \"\"))\n",
" for doc in src_docs\n",
" if doc.metadata.get(\"source\")\n",
" ]\n",
" unique = []\n",
" for n in notebooks:\n",
" if n not in unique:\n",
" unique.append(n)\n",
"\n",
" # Append the list of notebook filenames\n",
" if unique:\n",
" answer += \"\\n\\n**Found Notebooks:**\\n\" + \"\\n\".join(f\"- {n}\" for n in unique)\n",
" else:\n",
" answer += \"\\n\\n_No Notebooks found._\"\n",
"\n",
" return answer"
]
},
{
"cell_type": "markdown",
"id": "3d5c53ec-4fe3-46cc-9c17-7326294d24ef",
"metadata": {},
"source": [
"## Gradio UI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "371e32ee-df20-4ec5-91eb-5023fc4b70b2",
"metadata": {},
"outputs": [],
"source": [
"view = gr.ChatInterface(chat_with_memory_and_sources, \n",
"                      title=\"Notebook RAG Assistant with Memory & Sources\",\n",
" type = 'messages').launch(inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
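
The manual loop in `chat_with_memory_and_sources` dedupes notebook filenames while preserving the order in which they were retrieved. The same thing can be written more compactly with `dict.fromkeys`, which keeps first-seen order; a standalone sketch with hypothetical source paths:

```python
import os

# Hypothetical paths, as they might appear in doc.metadata["source"]
sources = [
    "/content/drive/MyDrive/ColabNotebooks/rag_intro.ipynb",
    "/content/drive/MyDrive/ColabNotebooks/embeddings.ipynb",
    "/content/drive/MyDrive/ColabNotebooks/rag_intro.ipynb",  # duplicate chunk, same notebook
]

notebooks = [os.path.basename(s) for s in sources if s]
unique = list(dict.fromkeys(notebooks))  # dedupe, preserving first-seen order
print(unique)  # ['rag_intro.ipynb', 'embeddings.ipynb']
```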

View File

@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "ba2779af-84ef-4227-9e9e-6eaf0df87e77",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import glob\n",
"from dotenv import load_dotenv\n",
"import gradio as gr\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "802137aa-8a74-45e0-a487-d1974927d7ca",
"metadata": {},
"outputs": [],
"source": [
"# imports for langchain, plotly and Chroma\n",
"\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.schema import Document\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain_chroma import Chroma\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.manifold import TSNE\n",
"import numpy as np \n",
"import plotly.graph_objects as go\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain.chains import ConversationalRetrievalChain"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58c85082-e417-4708-9efe-81a5d55d1424",
"metadata": {},
"outputs": [],
"source": [
"# price is a factor for our company, so we're going to use a low cost model\n",
"\n",
"MODEL = \"gpt-4o-mini\"\n",
"db_name = \"vector_db\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee78efcb-60fe-449e-a944-40bab26261af",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b14e6c30-37c6-4eac-845b-5471aa75f587",
"metadata": {},
"outputs": [],
"source": [
"##Load json\n",
"with open(\"knowledge-base/auto_shop.json\", 'r') as f: #place auto_shop.json file inside your knowledge-base folder\n",
" data = json.load(f)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "408bc620-477f-47fd-b9e8-ab9d21843ecd",
"metadata": {},
"outputs": [],
"source": [
"#Convert to Langchain\n",
"documents = []\n",
"for item in data:\n",
" content = item[\"content\"]\n",
" metadata = item.get(\"metadata\", {})\n",
" documents.append(Document(page_content=content, metadata=metadata))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0371d472-cd14-4967-bc09-9b78e233809f",
"metadata": {},
"outputs": [],
"source": [
"#Chunk documents\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50, separators=[\"\\n\\n\", \"\\n\", \",\", \" \", \"\"])\n",
"chunks = splitter.split_documents(documents)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91c2404b-b3c9-4c7f-b199-9895e429a3da",
"metadata": {},
"outputs": [],
"source": [
"doc_types = set(chunk.metadata['source'] for chunk in chunks)\n",
"#print(f\"Document types found: {', '.join(doc_types)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78998399-ac17-4e28-b15f-0b5f51e6ee23",
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings()\n",
"\n",
"# Delete if already exists\n",
"\n",
"if os.path.exists(db_name):\n",
" Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n",
"\n",
"# Create vectorstore\n",
"\n",
"vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)\n",
"#print(f\"Vectorstore created with {vectorstore._collection.count()} documents\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff2e7687-60d4-4920-a1d7-a34b9f70a250",
"metadata": {},
"outputs": [],
"source": [
"# # Let's investigate the vectors. Use for debugging if needed\n",
"\n",
"# collection = vectorstore._collection\n",
"# count = collection.count()\n",
"\n",
"# sample_embedding = collection.get(limit=1, include=[\"embeddings\"])[\"embeddings\"][0]\n",
"# dimensions = len(sample_embedding)\n",
"# print(f\"There are {count:,} vectors with {dimensions:,} dimensions in the vector store\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "129c7d1e-0094-4479-9459-f9360b95f244",
"metadata": {},
"outputs": [],
"source": [
"# create a new Chat with OpenAI\n",
"llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
"\n",
"\n",
"memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
"\n",
"# the retriever is an abstraction over the VectorStore that will be used during RAG\n",
"retriever = vectorstore.as_retriever()\n",
"\n",
"# putting it together: set up the conversation chain with the gpt-4o-mini LLM, the vector store and memory\n",
"conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)"
]
},
{
"cell_type": "markdown",
"id": "bbbcb659-13ce-47ab-8a5e-01b930494964",
"metadata": {},
"source": [
"## Now we will bring this up in Gradio using the Chat interface -"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3536590-85c7-4155-bd87-ae78a1467670",
"metadata": {},
"outputs": [],
"source": [
"# Wrapping that in a function\n",
"\n",
"def chat(question, history):\n",
" result = conversation_chain.invoke({\"question\": question})\n",
" return result[\"answer\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b252d8c1-61a8-406d-b57a-8f708a62b014",
"metadata": {},
"outputs": [],
"source": [
"# And in Gradio:\n",
"\n",
"view = gr.ChatInterface(chat, type=\"messages\").launch(inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
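
The `chunk_size=300, chunk_overlap=50` settings used above mean consecutive chunks share a 50-character window, so a fact straddling a boundary still appears whole in at least one chunk. A minimal character-level sketch of that idea (LangChain's splitter additionally prefers splitting at the listed separators, which this simplification ignores):

```python
def naive_chunks(text: str, size: int, overlap: int) -> list[str]:
    # Slide a window of `size` chars, stepping by size - overlap,
    # so neighbouring chunks share `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = naive_chunks("a" * 700, size=300, overlap=50)
print([len(c) for c in chunks])  # [300, 300, 200]
```

Larger overlap improves recall at chunk boundaries at the cost of more (and more redundant) vectors to embed and store.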

Binary file not shown.

After

Width:  |  Height:  |  Size: 29 KiB

View File

@@ -0,0 +1,101 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DocuSeek AI: Secure and Intelligent Document Retrieval\n",
"\n",
"**Created by Blaise Alako**\n",
"\n",
"**DocuSeek AI** is a tool designed to enable secure and intelligent interaction with private documents. It utilizes a Retrieval-Augmented Generation (RAG) system, combining Large Language Models (LLMs) and vector search to provide accurate, context-aware responses based on uploaded content. Developed as part of the [LLM Engineering course](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/46867711#content) by Ed Donner, it demonstrates practical applications of AI in document processing and data privacy, offering a hands-on example of secure, AI-driven data retrieval.\n",
"\n",
"**Key Features of DocuSeek AI:**\n",
"- **Privacy-Focused**: All data is processed locally, ensuring your documents remain private.\n",
"- **Intelligent**: Leverages LLMs and vector search for precise, contextually relevant answers.\n",
"- **Versatile**: Supports multiple file formats and provides interactive visualizations of data relationships.\n",
"\n",
"Explore the source code and setup instructions at [DocuSeek AI GitHub](https://github.com/alakob/ai_private_document_retriever) to experiment with RAG systems and LLM applications in document retrieval.\n",
"\n",
"---\n",
"\n",
"### Table of Contents\n",
"1. [Overview of DocuSeek AI](#overview-of-docuseek-ai)\n",
"2. [Document Upload and Processing](#document-upload-and-processing)\n",
"3. [Querying Your Documents](#querying-your-documents)\n",
"4. [Visualizing Data Relationships](#visualizing-data-relationships)\n",
"5. [Setup and Exploration](#setup-and-exploration)\n",
"\n",
"---\n",
"\n",
"## Overview of DocuSeek AI\n",
"\n",
"### An AI-Powered Document Retrieval Tool\n",
"**DocuSeek AI** provides an intuitive interface for users to interact with their private documents using AI technology. It is designed to be accessible to users of all levels while showcasing advanced concepts such as Retrieval-Augmented Generation (RAG) and vector-based search.\n",
"\n",
"![Interface Screenshot](docuseek1.png) \n",
"*Figure 1: The main interface of DocuSeek AI, where users begin their document exploration journey.*\n",
"\n",
"The tool integrates seamlessly with local files, ensuring data privacy and security throughout the process.\n",
"\n",
"---\n",
"\n",
"## Document Upload and Processing\n",
"\n",
"### Preparing Your Files for Intelligent Querying\n",
"To begin, users upload their documents in supported formats (e.g., PDFs, XLSX, PPTX, HTML, PNG, JPEG, TIFF, BMP, JSON, USPTO XML, JATS XML, Markdown, AsciiDoc, text files). DocuSeek AI processes these files locally, transforming them into a searchable knowledge base optimized for AI-driven querying.\n",
"\n",
"![Upload Screenshot](doc_upload.png) \n",
"*Figure 2: The document upload section, displaying supported formats and processing status.*\n",
"\n",
"This step ensures that your content is ready for accurate and efficient retrieval without compromising privacy.\n",
"\n",
"---\n",
"\n",
"## Querying Your Documents\n",
"\n",
"### Retrieving Contextually Relevant Answers\n",
"Once your documents are processed, you can query the system by asking questions related to your content. DocuSeek AI's RAG system retrieves precise, context-rich answers directly from your uploaded files.\n",
"\n",
"![Query Screenshot](docuseek3.png) \n",
"*Figure 3: The query interface, where users input questions and receive detailed, relevant responses.*\n",
"\n",
"This feature demonstrates the power of combining LLMs with vector search to deliver tailored responses, making it ideal for educational and professional use cases.\n",
"\n",
"---\n",
"\n",
"## Visualizing Data Relationships\n",
"\n",
"### Understanding Connections Within Your Documents\n",
"**DocuSeek AI** enhances document exploration by generating interactive visualizations that reveal relationships and patterns within your data. This feature is particularly useful for identifying key concepts and their connections.\n",
"\n",
"![Visualization Screenshot](docuseek4.png) \n",
"*Figure 4: A visualization highlighting relationships and concepts extracted from the documents.*\n",
"\n",
"These visualizations provide a deeper understanding of your content, making complex information more accessible and actionable.\n",
"\n",
"---\n",
"\n",
"## Setup and Exploration\n",
"\n",
"### Experiment with DocuSeek AI\n",
"To explore **DocuSeek AI** and its underlying technology, clone the repository from [DocuSeek AI GitHub](https://github.com/alakob/ai_private_document_retriever) and follow the setup instructions. This provides an opportunity to:\n",
"- Experiment with RAG systems and vector search.\n",
"- Understand LLM applications in secure document retrieval.\n",
"- Customize the tool for specific use cases or educational projects.\n",
"\n",
"---\n",
"\n",
"### Acknowledgments\n",
"Special thanks to Ed Donner for his transformative [LLM Engineering course](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/46867711#content) that inspired this project.\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 82 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 155 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 102 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 98 KiB

View File

@@ -0,0 +1,41 @@
# Run Continuous Integration (CI) Tests on Modal
## Unit testing
The unit-test strategy follows
[this example repo](https://github.com/modal-labs/ci-on-modal).
## Usage
All commands below are run from the root of the repository (this directory).
_Note_: I removed the Modal decorators from the `pricer.ci` module to be able to run the unit tests.
### Run tests remotely on Modal
```bash
modal run pricer.ci::pytest
```
On the first execution, the [container image](https://modal.com/docs/guide/custom-container)
for your application will be built.
This image will be cached on Modal and only rebuilt if one of its dependencies,
like the `requirements.txt` file, changes.
### Debug tests running remotely
To debug the tests, you can open a shell
in the exact same environment that the tests are run in:
```bash
modal shell pricer.ci::pytest
```
_Note_: On the Modal worker, the `pytest` command is run from the home directory, `/root`,
which contains the `tests` folder, but the `modal shell` command will
drop you at the top of the filesystem, `/`.
To run the tests:
```bash
cd /root
pytest
```
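
The `tests` folder itself isn't shown in this commit. Purely as an illustration of the kind of test that setup runs, a unit test might exercise the price-string parsing that `Pricer.price` relies on (the file name and `parse_price` helper below are hypothetical, not taken from the repo):

```python
# tests/test_parse_price.py (hypothetical example)
import re

def parse_price(model_output: str) -> float:
    # Same extraction logic as Pricer.price: take the text after the
    # "Price is $" prefix, strip commas, and pull out the first number.
    contents = model_output.split("Price is $")[1]
    contents = contents.replace(",", "")
    match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
    return float(match.group()) if match else 0

def test_parse_price():
    assert parse_price("... Price is $1,299.50 or so") == 1299.50
    assert parse_price("Price is $99") == 99.0
```

Tests like this need no GPU, which is why the decorator-free module layout mentioned above makes them easy to run under plain `pytest`.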

View File

@@ -0,0 +1,100 @@
from pathlib import Path

import modal

ROOT_PATH = Path(__file__).parent.parent

image = (
    modal.Image.debian_slim()
    .pip_install("pytest")
    .pip_install_from_requirements(ROOT_PATH / "requirements.txt")
)

app = modal.App("pricer-ci-testing", image=image)

# mount: add local files to the remote container
tests = modal.Mount.from_local_dir(ROOT_PATH / "tests", remote_path="/root/tests")


@app.function(gpu="any", mounts=[tests])
def pytest():
    import subprocess

    subprocess.run(["pytest", "-vs"], check=True, cwd="/root")


secrets = [modal.Secret.from_name("huggingface-secret")]

# Constants

GPU = "T4"
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
PROJECT_NAME = "pricer"
HF_USER = "ed-donner"  # your HF name here! Or use mine if you just want to reproduce my results.
RUN_NAME = "2024-09-13_13.04.39"
PROJECT_RUN_NAME = f"{PROJECT_NAME}-{RUN_NAME}"
REVISION = "e8d637df551603dc86cd7a1598a8f44af4d7ae36"
FINETUNED_MODEL = f"{HF_USER}/{PROJECT_RUN_NAME}"
MODEL_DIR = "hf-cache/"
BASE_DIR = MODEL_DIR + BASE_MODEL
FINETUNED_DIR = MODEL_DIR + FINETUNED_MODEL

QUESTION = "How much does this cost to the nearest dollar?"
PREFIX = "Price is $"


class Pricer:
    def download_model_to_folder(self):
        import os

        from huggingface_hub import snapshot_download

        # The huggingface-secret attached to this app exposes HF_TOKEN
        hf_token = os.environ["HF_TOKEN"]
        os.makedirs(MODEL_DIR, exist_ok=True)
        snapshot_download(BASE_MODEL, local_dir=BASE_DIR, use_auth_token=hf_token)
        snapshot_download(FINETUNED_MODEL, revision=REVISION, local_dir=FINETUNED_DIR, use_auth_token=hf_token)

    def setup(self):
        import torch
        from peft import PeftModel
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

        # Quant Config
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4"
        )

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(BASE_DIR)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"
        self.base_model = AutoModelForCausalLM.from_pretrained(
            BASE_DIR,
            quantization_config=quant_config,
            device_map="auto"
        )
        self.fine_tuned_model = PeftModel.from_pretrained(self.base_model, FINETUNED_DIR, revision=REVISION)

    def price(self, description: str) -> float:
        import re

        import torch
        from transformers import set_seed

        set_seed(42)
        prompt = f"{QUESTION}\n\n{description}\n\n{PREFIX}"
        inputs = self.tokenizer.encode(prompt, return_tensors="pt").to("cuda")
        attention_mask = torch.ones(inputs.shape, device="cuda")
        outputs = self.fine_tuned_model.generate(inputs, attention_mask=attention_mask, max_new_tokens=5, num_return_sequences=1)
        result = self.tokenizer.decode(outputs[0])

        contents = result.split("Price is $")[1]
        contents = contents.replace(',', '')
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0

    def wake_up(self) -> str:
        return "ok"

View File

@@ -0,0 +1,101 @@
from typing import Optional
from transformers import AutoTokenizer
import re

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"

MIN_TOKENS = 150
MAX_TOKENS = 160
MIN_CHARS = 300
CEILING_CHARS = MAX_TOKENS * 7


class Item:
    """
    An Item is a cleaned, curated datapoint of a Product with a Price
    """

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
    PREFIX = "Price is $"
    QUESTION = "How much does this cost to the nearest dollar?"
    REMOVALS = ['"Batteries Included?": "No"', '"Batteries Included?": "Yes"', '"Batteries Required?": "No"', '"Batteries Required?": "Yes"', "By Manufacturer", "Item", "Date First", "Package", ":", "Number of", "Best Sellers", "Number", "Product "]

    title: str
    price: float
    category: str
    token_count: int = 0
    details: Optional[str]
    prompt: Optional[str] = None
    include = False

    def __init__(self, data, price):
        self.title = data['title']
        self.price = price
        self.parse(data)

    def scrub_details(self):
        """
        Clean up the details string by removing common text that doesn't add value
        """
        details = self.details
        for remove in self.REMOVALS:
            details = details.replace(remove, "")
        return details

    def scrub(self, stuff):
        """
        Clean up the provided text by removing unnecessary characters and whitespace
        Also remove words that are 7+ chars and contain numbers, as these are likely irrelevant product numbers
        """
        stuff = re.sub(r'[:\[\]"{}【】\s]+', ' ', stuff).strip()
        stuff = stuff.replace(" ,", ",").replace(",,,", ",").replace(",,", ",")
        words = stuff.split(' ')
        select = [word for word in words if len(word) < 7 or not any(char.isdigit() for char in word)]
        return " ".join(select)

    def parse(self, data):
        """
        Parse this datapoint and if it fits within the allowed Token range,
        then set include to True
        """
        contents = '\n'.join(data['description'])
        if contents:
            contents += '\n'
        features = '\n'.join(data['features'])
        if features:
            contents += features + '\n'
        self.details = data['details']
        if self.details:
            contents += self.scrub_details() + '\n'
        if len(contents) > MIN_CHARS:
            contents = contents[:CEILING_CHARS]
            text = f"{self.scrub(self.title)}\n{self.scrub(contents)}"
            tokens = self.tokenizer.encode(text, add_special_tokens=False)
            if len(tokens) > MIN_TOKENS:
                tokens = tokens[:MAX_TOKENS]
                text = self.tokenizer.decode(tokens)
                self.make_prompt(text)
                self.include = True

    def make_prompt(self, text):
        """
        Set the prompt instance variable to be a prompt appropriate for training
        """
        self.prompt = f"{self.QUESTION}\n\n{text}\n\n"
        self.prompt += f"{self.PREFIX}{str(round(self.price))}.00"
        self.token_count = len(self.tokenizer.encode(self.prompt, add_special_tokens=False))

    def test_prompt(self):
        """
        Return a prompt suitable for testing, with the actual price removed
        """
        return self.prompt.split(self.PREFIX)[0] + self.PREFIX

    def __repr__(self):
        """
        Return a String version of this Item
        """
        return f"<{self.title} = ${self.price}>"
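
To see what `scrub` does without downloading the (gated) Llama tokenizer, the cleanup can be exercised standalone; this reproduces the same regex and word filter as the method above:

```python
import re

def scrub(stuff: str) -> str:
    # Collapse punctuation/whitespace noise into single spaces, then drop
    # long tokens containing digits (likely product numbers with no signal).
    stuff = re.sub(r'[:\[\]"{}【】\s]+', ' ', stuff).strip()
    stuff = stuff.replace(" ,", ",").replace(",,,", ",").replace(",,", ",")
    words = stuff.split(' ')
    select = [w for w in words if len(w) < 7 or not any(ch.isdigit() for ch in w)]
    return " ".join(select)

print(scrub('Widget {model: "XK-329481B"} [heavy duty]'))
# → Widget model heavy duty
```

Note that the likely part number `XK-329481B` is removed (7+ characters and contains digits), while ordinary words survive.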

View File

@@ -0,0 +1,10 @@
import time
import modal
from datetime import datetime

Pricer = modal.Cls.lookup("pricer-service", "Pricer")
pricer = Pricer()

while True:
    reply = pricer.wake_up.remote()
    print(f"{datetime.now()}: {reply}")
    time.sleep(30)

View File

@@ -0,0 +1,44 @@
import modal
from modal import App, Volume, Image

# Setup

app = modal.App("llama")
image = Image.debian_slim().pip_install("torch", "transformers", "bitsandbytes", "accelerate")
secrets = [modal.Secret.from_name("hf-secret")]

GPU = "T4"
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B"  # "google/gemma-2-2b"


@app.function(image=image, secrets=secrets, gpu=GPU, timeout=1800)
def generate(prompt: str) -> str:
    import os
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, set_seed

    # Quant Config
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )

    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=quant_config,
        device_map="auto"
    )

    set_seed(42)
    inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    attention_mask = torch.ones(inputs.shape, device="cuda")
    outputs = model.generate(inputs, attention_mask=attention_mask, max_new_tokens=5, num_return_sequences=1)
    return tokenizer.decode(outputs[0])

View File

@@ -0,0 +1,75 @@
import math
import matplotlib.pyplot as plt

GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
RESET = "\033[0m"
COLOR_MAP = {"red": RED, "orange": YELLOW, "green": GREEN}

class Tester:

    def __init__(self, predictor, data, title=None, size=250):
        self.predictor = predictor
        self.data = data
        self.title = title or predictor.__name__.replace("_", " ").title()
        self.size = size
        self.guesses = []
        self.truths = []
        self.errors = []
        self.sles = []
        self.colors = []

    def color_for(self, error, truth):
        if error < 40 or error / truth < 0.2:
            return "green"
        elif error < 80 or error / truth < 0.4:
            return "orange"
        else:
            return "red"

    def run_datapoint(self, i):
        datapoint = self.data[i]
        guess = self.predictor(datapoint)
        truth = datapoint.price
        error = abs(guess - truth)
        log_error = math.log(truth + 1) - math.log(guess + 1)
        sle = log_error ** 2
        color = self.color_for(error, truth)
        title = datapoint.title if len(datapoint.title) <= 40 else datapoint.title[:40] + "..."
        self.guesses.append(guess)
        self.truths.append(truth)
        self.errors.append(error)
        self.sles.append(sle)
        self.colors.append(color)
        print(f"{COLOR_MAP[color]}{i+1}: Guess: ${guess:,.2f} Truth: ${truth:,.2f} Error: ${error:,.2f} SLE: {sle:,.2f} Item: {title}{RESET}")

    def chart(self, title):
        max_error = max(self.errors)
        plt.figure(figsize=(12, 8))
        max_val = max(max(self.truths), max(self.guesses))
        plt.plot([0, max_val], [0, max_val], color='deepskyblue', lw=2, alpha=0.6)
        plt.scatter(self.truths, self.guesses, s=3, c=self.colors)
        plt.xlabel('Ground Truth')
        plt.ylabel('Model Estimate')
        plt.xlim(0, max_val)
        plt.ylim(0, max_val)
        plt.title(title)
        plt.show()

    def report(self):
        average_error = sum(self.errors) / self.size
        rmsle = math.sqrt(sum(self.sles) / self.size)
        hits = sum(1 for color in self.colors if color == "green")
        title = f"{self.title} Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hits/self.size*100:.1f}%"
        self.chart(title)

    def run(self):
        self.error = 0
        for i in range(self.size):
            self.run_datapoint(i)
        self.report()

    @classmethod
    def test(cls, function, data):
        cls(function, data).run()
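The green/orange/red banding in `color_for` combines an absolute dollar threshold with a relative one, so a $50 miss on a $1,000 item still counts as green. A standalone sketch of the same rule, with the thresholds copied from the class above:

```python
def color_for(error, truth):
    # green: within $40 absolute error, or within 20% of the true price
    if error < 40 or error / truth < 0.2:
        return "green"
    # orange: within $80 absolute error, or within 40% of the true price
    elif error < 80 or error / truth < 0.4:
        return "orange"
    # red: everything else
    return "red"

print(color_for(50, 1000))  # green: 5% relative error, despite $50 absolute
print(color_for(60, 100))   # orange: $60 off on a $100 item
print(color_for(100, 120))  # red: both thresholds blown
```

Because the conditions are OR-ed, either test passing is enough to earn the better color, which keeps cheap items from being punished for small dollar errors and expensive items from being punished for small percentage errors.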

View File

@@ -0,0 +1,6 @@
huggingface
torch
transformers
bitsandbytes
accelerate
peft

View File

@@ -0,0 +1,84 @@
import pdb
from pricer.ci import Pricer
from unittest.mock import patch, MagicMock
import torch
import pytest
from transformers import BitsAndBytesConfig

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
PROJECT_NAME = "pricer"
HF_USER = "ed-donner"  # your HF name here! Or use mine if you just want to reproduce my results.
RUN_NAME = "2024-09-13_13.04.39"
PROJECT_RUN_NAME = f"{PROJECT_NAME}-{RUN_NAME}"
REVISION = "e8d637df551603dc86cd7a1598a8f44af4d7ae36"
FINETUNED_MODEL = f"{HF_USER}/{PROJECT_RUN_NAME}"
MODEL_DIR = "hf-cache/"
BASE_DIR = MODEL_DIR + BASE_MODEL
FINETUNED_DIR = MODEL_DIR + FINETUNED_MODEL

@pytest.fixture
def pricer():
    return Pricer()

def test_wake_up():
    pricer = Pricer()
    assert pricer.wake_up() == "ok"

@patch('transformers.AutoTokenizer')
@patch('peft.PeftModel')
@patch('transformers.AutoModelForCausalLM')
def test_setup(MockAutoModel, MockPeftModel, MockAutoTokenizer, pricer):
    # Setup mocks
    mock_tokenizer = MockAutoTokenizer.from_pretrained.return_value
    mock_model = MockAutoModel.from_pretrained.return_value
    mock_peft_model = MockPeftModel.from_pretrained.return_value

    # Call the setup method
    pricer.setup()

    # Assertions to ensure the setup method works correctly
    MockAutoTokenizer.from_pretrained.assert_called_once_with(BASE_DIR)
    assert pricer.tokenizer == mock_tokenizer
    assert pricer.tokenizer.pad_token == pricer.tokenizer.eos_token
    assert pricer.tokenizer.padding_side == "right"

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )
    MockAutoModel.from_pretrained.assert_called_once_with(
        BASE_DIR,
        quantization_config=quant_config,
        device_map="auto"
    )
    assert pricer.base_model == mock_model

    MockPeftModel.from_pretrained.assert_called_once_with(mock_model, FINETUNED_DIR, revision=REVISION)
    assert pricer.fine_tuned_model == mock_peft_model

@patch('transformers.AutoTokenizer')
@patch('peft.PeftModel')
def test_price(MockPeftModel, MockAutoTokenizer, pricer):
    # Setup mocks
    mock_tokenizer = MockAutoTokenizer.return_value
    mock_tokenizer.encode.return_value = torch.tensor([[1, 2, 3]])
    mock_tokenizer.decode.return_value = "Price is $123.45"
    mock_model = MockPeftModel.return_value
    mock_model.generate.return_value = torch.tensor([[1, 2, 3, 4, 5]])

    # Assign mocks to the pricer instance
    pricer.tokenizer = mock_tokenizer
    pricer.fine_tuned_model = mock_model

    # Call the method
    description = "Test description"
    result = pricer.price(description)

    # Assert the result
    assert result == 123.45
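`test_price` stubs `decode` to return `"Price is $123.45"` and expects `Pricer.price` to turn that string into the float `123.45`. One way such parsing could work — a hedged sketch, not necessarily the actual implementation in `pricer/ci.py` — is a regex pull of the first dollar amount in the decoded text:

```python
import re

def extract_price(text: str) -> float:
    # Grab the first number that follows a dollar sign,
    # e.g. "Price is $123.45" -> 123.45
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else 0.0

print(extract_price("Price is $123.45"))  # 123.45
print(extract_price("no price here"))     # 0.0
```

Falling back to `0.0` when no match is found keeps the function total, which matters when a quantized model occasionally emits tokens that are not a well-formed price.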