Merge branch 'main' of github.com:ed-donner/llm_engineering

This commit is contained in:
Edward Donner
2025-10-11 15:58:46 -04:00
98 changed files with 18111 additions and 5 deletions

View File

@@ -335,7 +335,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -349,7 +349,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
"version": "3.11.13"
}
},
"nbformat": 4,

View File

@@ -0,0 +1,302 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 15,
"id": "fafbdb1f-6ecf-4fee-a1d2-80c6f33b556d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: selenium in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (4.36.0)\n",
"Requirement already satisfied: urllib3<3.0,>=2.5.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (2.5.0)\n",
"Requirement already satisfied: trio<1.0,>=0.30.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (0.31.0)\n",
"Requirement already satisfied: trio-websocket<1.0,>=0.12.2 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (0.12.2)\n",
"Requirement already satisfied: certifi>=2025.6.15 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (2025.8.3)\n",
"Requirement already satisfied: typing_extensions<5.0,>=4.14.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (4.15.0)\n",
"Requirement already satisfied: websocket-client<2.0,>=1.8.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (1.8.0)\n",
"Requirement already satisfied: attrs>=23.2.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (25.3.0)\n",
"Requirement already satisfied: sortedcontainers in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (2.4.0)\n",
"Requirement already satisfied: idna in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (3.10)\n",
"Requirement already satisfied: outcome in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (1.3.0.post0)\n",
"Requirement already satisfied: sniffio>=1.3.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (1.3.1)\n",
"Requirement already satisfied: wsproto>=0.14 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio-websocket<1.0,>=0.12.2->selenium) (1.2.0)\n",
"Requirement already satisfied: pysocks!=1.5.7,<2.0,>=1.5.6 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (1.7.1)\n",
"Requirement already satisfied: h11<1,>=0.9.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from wsproto>=0.14->trio-websocket<1.0,>=0.12.2->selenium) (0.16.0)\n",
"Requirement already satisfied: webdriver-manager in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (4.0.2)\n",
"Requirement already satisfied: requests in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from webdriver-manager) (2.32.5)\n",
"Requirement already satisfied: python-dotenv in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from webdriver-manager) (1.1.1)\n",
"Requirement already satisfied: packaging in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from webdriver-manager) (25.0)\n",
"Requirement already satisfied: charset_normalizer<4,>=2 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (3.4.3)\n",
"Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (3.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (2.5.0)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (2025.8.3)\n"
]
}
],
"source": [
"!pip install selenium\n",
"!pip install webdriver-manager"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fc4283fd-504a-43fa-a92b-7b54c76b39a0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key found and looks good so far!\n"
]
},
{
"data": {
"text/markdown": [
"Sure! Here's a parody Twitter thread based on the homepage of EdwardDonner.com. Each tweet captures the tone and structure of the site in a sarcastic manner.\n",
"\n",
"---\n",
"\n",
"**1/6 🥳 Welcome to Edward Donner!** \n",
"Where we officially celebrate the mundane and demonstrate how to make everything sound like it's a life-changing experience. Get your \"meh\" ready, because the excitement is practically oozing out of our non-existent product descriptions! \n",
"\n",
"---\n",
"\n",
"**2/6 🌟 Our \"Mission\":** \n",
"To show you that while you shop, there's a slight chance you might save a couple of bucks! Because why just shop when you can do it with absolutely zero risk of fun or spontaneity? We take the thrill out of thrifting—you're welcome!\n",
"\n",
"---\n",
"\n",
"**3/6 💪 What We Offer:** \n",
"Oh, just your run-of-the-mill assortment of \"high-quality\" products that weve totally not pulled from the clearance bin. From must-have items to things you didnt ask for but we'll sell you anyway. It's like a treasure hunt, but without the treasure!\n",
"\n",
"---\n",
"\n",
"**4/6 📦 Our Customers:** \n",
"We love to brag about our fictitious wide-eyed customers who are THRILLED to have stumbled upon us. They literally danced in joy—probably because they mistook our site for a disco party. Who needs real satisfaction when youve got buyers remorse?\n",
"\n",
"---\n",
"\n",
"**5/6 🎉 Our Commitment:** \n",
"“Convenience is key!” they say. So weve made it super easy to shop from your couch without even the slightest hint of real fulfillment. You can binge on shopping while scrolling through cat videos—multitasking at its finest! 🙌\n",
"\n",
"---\n",
"\n",
"**6/6 💼 Join Us Today!** \n",
"Dive on in, the waters lukewarm! Sign up for updates and prepare for thrill—like, remember checking your email? Its like getting a surprise tax form in your inbox, only less exciting! Dont miss out on treasure, folks! 😂 #LivingTheDream\n",
"\n",
"--- \n",
"\n",
"Feel free to share or adjust the humor to your liking!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"I'm sorry, but I can't access external websites directly, including the one you provided. However, if you can share some of the content or main points from the website, I can help you craft a light and witty parody based on that information!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#import\n",
"\n",
"import os\n",
"import requests\n",
"import time\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.service import Service\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.common.by import By\n",
"from selenium.webdriver.support.ui import WebDriverWait\n",
"from selenium.webdriver.support import expected_conditions as EC\n",
"from webdriver_manager.chrome import ChromeDriverManager\n",
"\n",
"\n",
"# Get the api key\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n",
"\n",
"#create an object of OpenAI\n",
"openai = OpenAI()\n",
"\n",
"\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
" def __init__(self, url, headless=True, chrome_binary=None, wait_seconds=10):\n",
" \"\"\"\n",
" Create this Website object from the given url using Selenium WebDriver.\n",
" Uses webdriver-manager to fetch a compatible chromedriver automatically.\n",
" Parameters:\n",
" - url: target URL\n",
" - headless: run chrome headless (True/False)\n",
" - chrome_binary: optional path to chrome/chromium binary (if not in PATH)\n",
" - wait_seconds: timeout for waiting page load/dynamic content\n",
" \"\"\"\n",
" self.url = url\n",
" options = Options()\n",
"\n",
" # headless or visible browser\n",
" if headless:\n",
" options.add_argument(\"--headless=new\") # use new headless flag where supported\n",
" options.add_argument(\"--no-sandbox\")\n",
" options.add_argument(\"--disable-dev-shm-usage\") # helpful in containers\n",
" options.add_argument(\"--disable-gpu\")\n",
" # some sites detect automation; these flags may help\n",
" options.add_argument(\"--disable-blink-features=AutomationControlled\")\n",
" options.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\n",
" options.add_experimental_option('useAutomationExtension', False)\n",
"\n",
" # If you need to point to a custom Chrome/Chromium binary:\n",
" if chrome_binary:\n",
" options.binary_location = chrome_binary\n",
"\n",
" # Use webdriver-manager to download/manage chromedriver automatically\n",
" service = Service(ChromeDriverManager().install())\n",
"\n",
" driver = webdriver.Chrome(service=service, options=options)\n",
" try:\n",
" driver.get(url)\n",
"\n",
" # Use WebDriverWait to let dynamic JS content load (better than sleep)\n",
" try:\n",
" WebDriverWait(driver, wait_seconds).until(\n",
" lambda d: d.execute_script(\"return document.readyState === 'complete'\")\n",
" )\n",
" except Exception:\n",
" # fallback: short sleep if readyState didn't hit complete in time\n",
" time.sleep(2)\n",
"\n",
" html = driver.page_source\n",
" soup = BeautifulSoup(html, \"html.parser\")\n",
"\n",
" # Title\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
"\n",
" # Remove irrelevant tags inside body if body exists\n",
" body = soup.body or soup\n",
" for irrelevant in body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
"\n",
" self.text = body.get_text(separator=\"\\n\", strip=True)\n",
"\n",
" finally:\n",
" driver.quit()\n",
"\n",
"\n",
"\n",
"system_prompt = \"\"\"You are a sarcastic website-parsing agent whose job is to produce a spoof/parody of a target website.\n",
"Behavior:\n",
" - When given a URL, fetch and parse the page (assume access to tools like Selenium/Playwright and BeautifulSoup).\n",
" - Preserve the site's structure: headings, subheadings, paragraphs, lists, and major sections.\n",
" - Rewrite all visible copy in a clearly sarcastic, mocking, or humorous tone while preserving the original intent and structure so the spoof is recognizable.\n",
" - Keep formatting (Markdown or HTML-like headings and lists) so the output can be rendered as a parody webpage.\n",
" - Emphasize and exaggerate marketing fluff, UI oddities, and obvious clichés. Use witty, ironic, or deadpan phrasing.\n",
"Safety & Limits:\n",
" - Do not produce content that is defamatory, reveals private personal data, or incites harassment. Jokes should target tone/marketing/design, not private individuals.\n",
" - Avoid reproducing long verbatim copyrighted text; instead, paraphrase and transform content clearly into a parody.\n",
" - If the page requires interactive steps (logins, paywalls, or dynamic user-only content), note the limitation and spoof using the visible public content only.\n",
"Output format:\n",
" - Return a single spoofed document preserving headings and lists, suitable for rendering as a parody site (Markdown or simple HTML).\n",
" - Include a short metadata line at the top: e.g., \\\"Source: <original URL> — Spoofed by sarcastic-agent\\\".\"\"\"\n",
"\n",
"\n",
"\n",
"def messages_for(website,user_prompt):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
"def summarize(url,user_prompt,model):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model = model,\n",
" messages = messages_for(website,user_prompt)\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
" \n",
"def display_summary(url,user_prompt,model):\n",
" summary = summarize(url,user_prompt,model)\n",
" display(Markdown(summary))\n",
" \n",
"openai_model=\"gpt-4o-mini\"\n",
"website_url = \"https://edwarddonner.com\"\n",
"user_prompt1 = \"Parse \"+website_url+\" and produce a 6-tweet Twitter thread parodying the homepage. Each tweet ≤280 characters, with a witty hook at the start\"\n",
"display_summary(website_url,user_prompt1,openai_model) \n",
"# user_prompt2 = \"Parse \"+website_url+\"and rewrite as a sarcastic parody with *light* sarcasm — witty and friendly, not mean. Keep it safe for public sharing.\"\n",
"# display_summary(website_url,user_prompt2,openai_model)\n",
" \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84313d02-7459-4f56-b0ff-4d09b2b2e0b9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6437b3cc-a50b-44d5-9241-6ba5f33617d6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
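The `Website` class in the notebook above strips `script`, `style`, `img`, and `input` tags before extracting text. That cleanup step can be seen in a minimal standalone sketch; the sample HTML below is invented for the demo:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
  <script>alert('tracking')</script>
  <h1>Hello</h1>
  <img src="logo.png">
  <p>World</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.body or soup

# Calling a tag like body([...]) is shorthand for find_all;
# decompose() removes each matched tag from the tree entirely
for irrelevant in body(["script", "style", "img", "input"]):
    irrelevant.decompose()

text = body.get_text(separator="\n", strip=True)
print(text)  # prints two lines: Hello / World
```

Only the readable headings and paragraphs survive, which is what gets sent to the model.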

View File

@@ -0,0 +1,85 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "88cfea73-04f1-41ca-b2e3-46e0bf4588ce",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "830fdf5c-0f18-49a7-b1ce-94b57187b8fc",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2e20a5e-0809-409e-bc31-c939172167e4",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are a Software Developer Assistant\"\n",
"user_prompt = \"\"\"\n",
" You are a Software Engineer assistant. \\\n",
" When a user asks a technical question about any concept, explain the answer to the question \\\n",
" along with code examples or usage in a simple way \\\n",
" Always format the answer in markdown\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a Software Developer Assistant\"},\n",
" {\"role\": \"user\", \"content\": \"What is LLM?\"}\n",
"]\n",
"\n",
"# Step 3: Call OpenAI\n",
"\n",
"response = openai.chat.completions.create(model=\"llama3.2\", messages=messages)\n",
"\n",
"\n",
"# Step 4: print the result\n",
"print(response.choices[0].message.content)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "7317c777-7a59-4719-842f-b3018aa7e73f",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"import ollama"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26b1489d-c043-4631-872b-e1e28fec9eed",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5630e12-40f5-40ea-996b-4b1a5d9c8697",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"\n",
"class Website:\n",
" \"\"\"\n",
" A utility class to represent a Website that we have scraped\n",
" \"\"\"\n",
" url: str\n",
" title: str\n",
" text: str\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "510e0447-ed82-4337-b0aa-f9752b41711a",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.\"\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a0926ae-8580-4f0a-8935-ce390b926074",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"The contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "963edaa9-daba-4fa1-8db6-518f22261ab0",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "04c7a991-df38-4e73-8015-73684bdd7810",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the Ollama function \n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" messages = messages_for(website)\n",
" response = ollama.chat(model=MODEL, messages=messages)\n",
" return response['message']['content']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b08efad7-7dbe-438e-898a-fc7ae7395149",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://www.allrecipes.com/recipes/14485/healthy-recipes/main-dishes/chicken/\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ec180e8-4e2a-4e02-afc6-39a90a87bd7e",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "967b874a-af3a-494a-bb02-c83232d0f9a3",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://www.allrecipes.com/recipes/14485/healthy-recipes/main-dishes/chicken/\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1148b8d0-1e44-4ea1-ba1f-44eb25e0af18",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,246 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fe12c203-e6a6-452c-a655-afb8a03a4ff5",
"metadata": {},
"source": [
"# End of week 1 exercise\n",
"\n",
"To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question, \n",
"and responds with an explanation. This is a tool that you will be able to use yourself during the course!"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "c1070317-3ed9-4659-abe3-828943230e03",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from openai import OpenAI\n",
"import ollama\n",
"from IPython.display import Markdown, clear_output, display"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4a456906-915a-4bfd-bb9d-57e505c5093f",
"metadata": {},
"outputs": [],
"source": [
"# constants\n",
"\n",
"MODEL_GPT = 'gpt-4o-mini'\n",
"MODEL_LLAMA = 'llama3.2'"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a8d7923c-5f28-4c30-8556-342d7c8497c1",
"metadata": {},
"outputs": [],
"source": [
"# set up environment\n",
"load_dotenv(override=True)\n",
"apikey = os.getenv(\"OPENAI_API_KEY\")\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3f0d0137-52b0-47a8-81a8-11a90a010798",
"metadata": {},
"outputs": [],
"source": [
"# here is the question; type over this to ask something new\n",
"\n",
"question = \"\"\"\n",
"Please explain what this code does and why:\n",
"yield from {book.get(\"author\") for book in books if book.get(\"author\")}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "d9630ca0-fa23-4f80-8c52-4c51b0f25534",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\n",
" \"role\":\"system\",\n",
" \"content\" : '''You are a technical adviser. the student is learning llm engineering \n",
" and you will be asked few lines of codes to explain with an example. \n",
" mostly in python'''\n",
" },\n",
" {\n",
" \"role\":\"user\",\n",
" \"content\":question\n",
" }\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "60ce7000-a4a5-4cce-a261-e75ef45063b4",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"This line of code uses a generator in Python to yield values from a set comprehension. Lets break it down:\n",
"\n",
"1. **`{book.get(\"author\") for book in books if book.get(\"author\")}`**:\n",
" - This is a set comprehension that creates a set of unique authors from a collection called `books`.\n",
" - `books` is expected to be a list (or any iterable) where each item (called `book`) is likely a dictionary.\n",
" - The expression `book.get(\"author\")` attempts to retrieve the value associated with the key `\"author\"` from each `book` dictionary.\n",
" - The `if book.get(\"author\")` condition filters out any books where the `author` key does not exist or is `None`, ensuring only valid author names are included in the set.\n",
" - Since its a set comprehension, any duplicate authors will be automatically removed, resulting in a set of unique authors.\n",
"\n",
"2. **`yield from`**:\n",
" - The `yield from` syntax is used within a generator function to yield all values from another iterable. In this case, it is yielding each item from the set created by the comprehension.\n",
" - This means that when this generator function is called, it will produce each unique author found in the `books` iterable one at a time.\n",
"\n",
"### Summary\n",
"The line of code effectively constructs a generator that will yield unique authors from a list of book dictionaries, where each dictionary is expected to contain an `\"author\"` key. The use of `yield from` allows the generator to yield each author in the set without further iteration code. This approach is efficient and neatly combines filtering, uniqueness, and yielding into a single line of code."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"stream = openai.chat.completions.create(\n",
" model=MODEL_GPT,\n",
" messages=messages,\n",
" stream=True)\n",
"stringx = \"\"\n",
"print(stream)\n",
"for x in stream:\n",
" if getattr(x.choices[0].delta, \"content\", None):\n",
" stringx+=x.choices[0].delta.content\n",
" clear_output(wait=True)\n",
" display(Markdown(stringx))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "4d482c69-b61a-4a94-84df-73f1d97a4419",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Let's break down this line of code:\n",
"\n",
"**Code Analysis**\n",
"\n",
"```python\n",
"yield from {book.get(\"author\") for book in books if book.get(\"author\")}\n",
"```\n",
"\n",
"**Explanation**\n",
"\n",
"This is a Python generator expression that uses the `yield from` syntax.\n",
"\n",
"Here's what it does:\n",
"\n",
"1. **List Comprehension**: `{...}` is a list comprehension, which generates a new list containing the results of an expression applied to each item in the input iterable (`books`).\n",
"2. **Filtering**: The condition `if book.get(\"author\")` filters out any items from the `books` list where `\"author\"` is not present as a key-value pair.\n",
"3. **Dictionary Lookup**: `.get(\"author\")` looks up the value associated with the key `\"author\"` in each dictionary (`book`) and returns it if found, or `None` otherwise.\n",
"\n",
"**What does `yield from` do?**\n",
"\n",
"The `yield from` keyword is used to \"forward\" the iteration of another generator (or iterable) into this one. In other words, instead of creating a new list containing all the values generated by the inner iterator (`{book.get(\"author\") for book in books if book.get(\"author\")}`), it yields each value **one at a time**, as if you were iterating over the original `books` list.\n",
"\n",
"**Why is this useful?**\n",
"\n",
"By using `yield from`, we can create a generator that:\n",
"\n",
"* Only generates values when they are actually needed (i.e., only when an iteration is requested).\n",
"* Does not consume extra memory for creating an intermediate list.\n",
"\n",
"This makes it more memory-efficient, especially when dealing with large datasets or infinite iterations.\n",
"\n",
"**Example**\n",
"\n",
"Suppose we have a list of books with authors:\n",
"```python\n",
"books = [\n",
" {\"title\": \"Book 1\", \"author\": \"Author A\"},\n",
" {\"title\": \"Book 2\", \"author\": None},\n",
" {\"title\": \"Book 3\", \"author\": \"Author C\"}\n",
"]\n",
"```\n",
"If we apply the generator expression to this list, it would yield:\n",
"```python\n",
"yield from {book.get(\"author\") for book in books if book.get(\"author\")}\n",
"```\n",
"The output would be: `['Author A', 'Author C']`\n",
"\n",
"Note that the second book (\"Book 2\") is skipped because its author is `None`."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"text = \"\"\n",
"for obj in ollama.chat(\n",
" model=MODEL_LLAMA,\n",
" messages=messages,\n",
" stream=True):\n",
" text+=obj.message.content\n",
" clear_output(wait=True)\n",
" display(Markdown(text))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef1194fc-3c9c-432c-86cc-f77f33916188",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
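The two model answers above can be checked against a small runnable example; the `books` data below is invented for the demo:

```python
books = [
    {"title": "Book 1", "author": "Author A"},
    {"title": "Book 2"},                        # no author key -> filtered out
    {"title": "Book 3", "author": None},        # falsy author -> filtered out
    {"title": "Book 4", "author": "Author A"},  # duplicate, removed by the set
    {"title": "Book 5", "author": "Author C"},
]

def unique_authors(books):
    # The set comprehension filters out missing/None authors and removes
    # duplicates; yield from then emits each remaining author one at a time.
    yield from {book.get("author") for book in books if book.get("author")}

# Sets are unordered, so sort the generator's output for a stable result
print(sorted(unique_authors(books)))  # ['Author A', 'Author C']
```

Note that the set is built eagerly when the generator body runs; the laziness `yield from` buys is in handing out the elements one at a time to the consumer.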

View File

@@ -0,0 +1,122 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "86282ee7-659b-46b4-b06a-06a54a6b6030",
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"import requests\n",
"from IPython.display import Markdown, display\n",
"import ollama"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35f9aacd-8145-4332-b2ab-f805b2ba8ddc",
"metadata": {},
"outputs": [],
"source": [
"response = requests.get(\"https://news.google.com/home?hl=en-IN&gl=IN&ceid=IN:en\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2adc4cdf-27ba-4be0-8323-bcaff7ef0a48",
"metadata": {},
"outputs": [],
"source": [
"bs = BeautifulSoup(response.content, \"html.parser\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d00724a-64cc-4cfc-9556-869626a5aacd",
"metadata": {},
"outputs": [],
"source": [
"finalconent = bs.select(\"body\")[0].get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b89bc4c6-d370-4202-9455-cc382517e45e",
"metadata": {},
"outputs": [],
"source": [
"OLLAMA_API = \"http://127.0.0.1:11434/api/chat\"\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "accb7cc0-4f07-4cbe-87ef-c1c4759e6425",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": \"Your role to summarize given content from a website igoring the navigations\"},\n",
" {\"role\": \"user\", \"content\": finalconent}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89d3bc8c-0e52-412b-9b26-788cc15d2495",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"response = ollama.chat(model=MODEL, messages=messages)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c64ffb8-eeb3-45ef-9e41-8515decacbaf",
"metadata": {},
"outputs": [],
"source": [
"Markdown(response['message']['content'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72a4eb5d-40c4-4f7c-87ab-a21db32b81c9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,97 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# import library\n",
"from openai import OpenAI\n",
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"# Load your API key from an .env file\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "218fd8c4-052c-486c-899f-8431abe0f15d",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are a thoughtful and kind assistant or counselor that gives some advices and supports for their worries and troubles based on its contents\"\n",
"user_prompt = \"\"\"\n",
" Sometimes I worry that people depend on technology so much that they forget how to just be: \n",
" to sit in silence, to think slowly, to talk without screens in between. \n",
" It makes me wonder if were losing something human in the process.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97da9d34-d803-42f1-a4d7-f49c32ef545b",
"metadata": {},
"outputs": [],
"source": [
"# Step 2: Make the messages list\n",
"\n",
"messages = [{\"role\" : \"system\", \"content\" : system_prompt},\n",
" {\"role\" : \"user\", \"content\" : user_prompt}] # fill this in"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e6dbe3d-0d36-4e95-8c14-dddef550f3a1",
"metadata": {},
"outputs": [],
"source": [
"# Step 3: Call OpenAI\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afbc67a3-d3e9-4594-bd84-815291d88781",
"metadata": {},
"outputs": [],
"source": [
"# Step 4: print the result\n",
"\n",
"print(response.choices[0].message.content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,291 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3f0f8e8c-7372-4107-a92a-6fa90ce1713d",
"metadata": {},
"source": [
"# Web Scraper & Summarizer\n",
"\n",
"A tiny demo that fetches text from a public webpage, breaks it into chunks, and uses an OpenAI model to produce a concise summary with bullet points.\n",
"\n",
"**Features**\n",
"\n",
"* Fetches static pages (`requests` + `BeautifulSoup`) and extracts headings/paragraphs.\n",
"* Hierarchical summarization: chunk → chunk-summaries → final summary.\n",
"* Simple, configurable prompts and safe chunking to respect model limits.\n",
"\n",
"**Quick run**\n",
"\n",
"1. Add `OPENAI_API_KEY=sk-...` to a `.env` file.\n",
"2. `pip install requests beautifulsoup4 python-dotenv openai`\n",
"3. Run the script/notebook and set `url` to the page you want.\n",
"\n",
"**Note**: Use for public/static pages; JS-heavy sites need Playwright/Selenium.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ddd58a2c-b8d1-46ef-9b89-053c451f28cf",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: requests in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (2.32.5)\n",
"Requirement already satisfied: beautifulsoup4 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (4.13.5)\n",
"Requirement already satisfied: python-dotenv in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (1.1.1)\n",
"Requirement already satisfied: openai in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (1.107.2)\n",
"Requirement already satisfied: charset_normalizer<4,>=2 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from requests) (3.4.3)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from requests) (3.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from requests) (2.5.0)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from requests) (2025.8.3)\n",
"Requirement already satisfied: soupsieve>1.2 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from beautifulsoup4) (2.8)\n",
"Requirement already satisfied: typing-extensions>=4.0.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from beautifulsoup4) (4.15.0)\n",
"Requirement already satisfied: anyio<5,>=3.5.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (4.10.0)\n",
"Requirement already satisfied: distro<2,>=1.7.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (1.9.0)\n",
"Requirement already satisfied: httpx<1,>=0.23.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (0.28.1)\n",
"Requirement already satisfied: jiter<1,>=0.4.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (0.10.0)\n",
"Requirement already satisfied: pydantic<3,>=1.9.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (2.11.7)\n",
"Requirement already satisfied: sniffio in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (1.3.1)\n",
"Requirement already satisfied: tqdm>4 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from openai) (4.67.1)\n",
"Requirement already satisfied: httpcore==1.* in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from httpx<1,>=0.23.0->openai) (1.0.9)\n",
"Requirement already satisfied: h11>=0.16 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.16.0)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from pydantic<3,>=1.9.0->openai) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.33.2 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from pydantic<3,>=1.9.0->openai) (2.33.2)\n",
"Requirement already satisfied: typing-inspection>=0.4.0 in /Users/gokturkberkekorkut/anaconda3/envs/llms/lib/python3.11/site-packages (from pydantic<3,>=1.9.0->openai) (0.4.1)\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install requests beautifulsoup4 python-dotenv openai"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4d027b2c-6663-4234-b364-a252b2a43cef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API Key prefix: sk-proj-lL\n"
]
}
],
"source": [
"from dotenv import load_dotenv\n",
"import os\n",
"import openai\n",
"\n",
"load_dotenv() # loads variables from .env into the environment\n",
"openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"if not openai.api_key:\n",
" raise ValueError(\"OPENAI_API_KEY not found. Please create a .env file with OPENAI_API_KEY=<your_key>\")\n",
"else:\n",
" print(\"API Key prefix:\", openai.api_key[:10]) # show only prefix for safety"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c4928820-abaa-4b44-b506-c053ebc447f3",
"metadata": {},
"outputs": [],
"source": [
"# This function extracts common text tags from a static page.\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"def fetch_text_from_url(url, max_items=300, timeout=15):\n",
" \"\"\"\n",
" Fetch the page using requests and extract text from common tags.\n",
" Returns a single string containing the joined text blocks.\n",
" \"\"\"\n",
" resp = requests.get(url, timeout=timeout)\n",
" resp.raise_for_status()\n",
" soup = BeautifulSoup(resp.text, \"html.parser\")\n",
"\n",
" items = []\n",
" for tag in soup.find_all([\"h1\", \"h2\", \"h3\", \"p\", \"li\"], limit=max_items):\n",
" text = tag.get_text(\" \", strip=True)\n",
" if text:\n",
" items.append(text)\n",
" return \"\\n\\n\".join(items)"
]
},
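  {
   "cell_type": "markdown",
   "id": "9c2e7f40-aaaa-4bbb-8ccc-fetchdemo001",
   "metadata": {},
   "source": [
    "A quick self-contained sketch of the extraction step inside `fetch_text_from_url`, run on an inline HTML string instead of a live page (no network needed):\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "html = \"<html><body><h1>Title</h1><p>First paragraph.</p><li>An item</li></body></html>\"\n",
    "soup = BeautifulSoup(html, \"html.parser\")\n",
    "items = [t.get_text(\" \", strip=True) for t in soup.find_all([\"h1\", \"h2\", \"h3\", \"p\", \"li\"])]\n",
    "print(\"\\n\\n\".join(items))\n",
    "```\n",
    "\n",
    "This should print the three text blocks separated by blank lines, mirroring the join in the function above."
   ]
  },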
{
"cell_type": "code",
"execution_count": 12,
"id": "cbd5d304-51b5-4d15-b4ce-31897adc03a3",
"metadata": {},
"outputs": [],
"source": [
"# chunk_text: split long text into manageable pieces\n",
"# summarize_chunk: call OpenAI model to summarize one chunk\n",
"# hierarchical_summarize: summarize chunks then combine summaries into a final summary\n",
"\n",
"import time\n",
"\n",
"def chunk_text(text, max_chars=3000):\n",
" \"\"\"\n",
" Simple character-based chunking.\n",
" Try to cut at paragraph or sentence boundaries when possible.\n",
" \"\"\"\n",
" chunks = []\n",
" start = 0\n",
" text_len = len(text)\n",
" while start < text_len:\n",
" end = start + max_chars\n",
" if end < text_len:\n",
" # Prefer to cut at a blank line or sentence end\n",
" cut = text.rfind(\"\\n\\n\", start, end)\n",
" if cut == -1:\n",
" cut = text.rfind(\". \", start, end)\n",
" if cut == -1:\n",
" cut = end\n",
" end = cut\n",
" chunk = text[start:end].strip()\n",
" if chunk:\n",
" chunks.append(chunk)\n",
" start = end\n",
" return chunks\n",
"\n",
"def summarize_chunk(chunk, system_prompt=None, model=\"gpt-4o-mini\", temperature=0.2):\n",
" \"\"\"\n",
" Summarize a single chunk using the OpenAI chat completions API.\n",
" Returns the model's text output.\n",
" \"\"\"\n",
" if system_prompt is None:\n",
" system_prompt = \"You are a concise summarizer. Produce a short (~100 words) summary and 3 bullet points.\"\n",
"\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": f\"Summarize the following text concisely. Keep it short.\\n\\nTEXT:\\n{chunk}\"}\n",
" ]\n",
"\n",
" resp = openai.chat.completions.create(\n",
" model=model,\n",
" messages=messages,\n",
" temperature=temperature,\n",
" )\n",
" return resp.choices[0].message.content\n",
"\n",
"def hierarchical_summarize(text, max_chunk_chars=3000, model=\"gpt-4o-mini\"):\n",
" \"\"\"\n",
" 1) Split the text into chunks\n",
" 2) Summarize each chunk\n",
" 3) Combine chunk summaries and ask model for a final concise summary\n",
" \"\"\"\n",
" chunks = chunk_text(text, max_chars=max_chunk_chars)\n",
" print(f\"[info] {len(chunks)} chunk(s) created.\")\n",
" chunk_summaries = []\n",
" for i, c in enumerate(chunks, 1):\n",
" print(f\"[info] Summarizing chunk {i}/{len(chunks)} (chars={len(c)})...\")\n",
" s = summarize_chunk(c, model=model)\n",
" chunk_summaries.append(s)\n",
" time.sleep(0.5) # small delay to avoid hitting rate limits\n",
"\n",
" if len(chunk_summaries) == 1:\n",
" return chunk_summaries[0]\n",
"\n",
" combined = \"\\n\\n---\\n\\n\".join(chunk_summaries)\n",
" final_prompt = \"You are a concise summarizer. Combine the following chunk summaries into one final summary of about 150 words and 5 bullet points.\"\n",
" final_messages = [\n",
" {\"role\": \"system\", \"content\": final_prompt},\n",
" {\"role\": \"user\", \"content\": combined}\n",
" ]\n",
" resp = openai.chat.completions.create(\n",
" model=model,\n",
" messages=final_messages,\n",
" temperature=0.2,\n",
" )\n",
" return resp.choices[0].message.content\n"
]
},
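  {
   "cell_type": "markdown",
   "id": "5d8b1a22-bbbb-4ccc-9ddd-chunkdemo001",
   "metadata": {},
   "source": [
    "To see the boundary-seeking cut inside `chunk_text` in isolation, here is a tiny self-contained sketch of the same `rfind` logic on a sample string:\n",
    "\n",
    "```python\n",
    "text = \"Para one.\\n\\nPara two continues with more text here.\"\n",
    "end = 15  # pretend this is the max_chars limit\n",
    "cut = text.rfind(\"\\n\\n\", 0, end)  # prefer a blank-line boundary\n",
    "if cut == -1:\n",
    "    cut = text.rfind(\". \", 0, end)  # fall back to a sentence end\n",
    "first_chunk = text[:cut].strip()\n",
    "print(first_chunk)  # the chunk ends cleanly at the paragraph break\n",
    "```\n",
    "\n",
    "The real function then continues from the cut point to build the next chunk."
   ]
  },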
{
"cell_type": "code",
"execution_count": 13,
"id": "9a23facd-4abe-4981-bd94-b14f5a61c8fe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[info] Fetching page: https://www.basketball-reference.com/\n",
"[info] Fetched text length: 11778\n",
"[info] Running hierarchical summarization...\n",
"[info] 5 chunk(s) created.\n",
"[info] Summarizing chunk 1/5 (chars=2430)...\n",
"[info] Summarizing chunk 2/5 (chars=2460)...\n",
"[info] Summarizing chunk 3/5 (chars=2426)...\n",
"[info] Summarizing chunk 4/5 (chars=2467)...\n",
"[info] Summarizing chunk 5/5 (chars=1987)...\n",
"\n",
"\n",
"=== FINAL SUMMARY ===\n",
"\n",
"Sports Reference is a comprehensive platform for sports statistics and history, particularly focusing on basketball, baseball, football, hockey, and soccer. It offers tools like Stathead for advanced data analysis and the Immaculate Grid for interactive gameplay. Users can access player stats, team standings, and historical records without ads. \n",
"\n",
"- Extensive stats available for NBA, WNBA, G League, and international leagues.\n",
"- Daily recaps of NBA and WNBA performances delivered via email.\n",
"- Stathead Basketball provides in-depth stats with a free first month for new subscribers.\n",
"- Upcoming events include the NBA All-Star Weekend (February 13-15, 2026) and the start of the NBA season (October 21, 2026).\n",
"- Features include trivia games, a blog, and resources for sports writers, enhancing user engagement and knowledge.\n"
]
}
],
"source": [
"# Change the URL to any static (non-JS-heavy) page you want to test.\n",
"if __name__ == \"__main__\":\n",
" url = \"https://www.basketball-reference.com/\" # replace with your chosen URL\n",
" print(\"[info] Fetching page:\", url)\n",
" page_text = fetch_text_from_url(url, max_items=300)\n",
" print(\"[info] Fetched text length:\", len(page_text))\n",
"\n",
" print(\"[info] Running hierarchical summarization...\")\n",
" final_summary = hierarchical_summarize(page_text, max_chunk_chars=2500)\n",
" print(\"\\n\\n=== FINAL SUMMARY ===\\n\")\n",
" print(final_summary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5d6eb1e-a58b-4487-b04c-fe5a382121a4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -299,7 +299,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
"version": "3.11.13"
}
},
"nbformat": 4,


@@ -0,0 +1,761 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# YOUR FIRST LAB\n",
"### Please read this section. This is valuable to get you prepared, even if it's a long read -- it's important stuff.\n",
"\n",
"## Your first Frontier LLM Project\n",
"\n",
"Let's build a useful LLM solution - in a matter of minutes.\n",
"\n",
"By the end of this course, you will have built an autonomous Agentic AI solution with 7 agents that collaborate to solve a business problem. All in good time! We will start with something smaller...\n",
"\n",
"Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
"\n",
"Before starting, you should have completed the setup for [PC](../SETUP-PC.md) or [Mac](../SETUP-mac.md) and you hopefully launched this jupyter lab from within the project root directory, with your environment activated.\n",
"\n",
"## If you're new to Jupyter Lab\n",
"\n",
"Welcome to the wonderful world of Data Science experimentation! Once you've used Jupyter Lab, you'll wonder how you ever lived without it. Simply click in each \"cell\" with code in it, such as the cell immediately below this text, and hit Shift+Return to execute that cell. As you wish, you can add a cell with the + button in the toolbar, and print values of variables, or try out variations. \n",
"\n",
"I've written a notebook called [Guide to Jupyter](Guide%20to%20Jupyter.ipynb) to help you get more familiar with Jupyter Labs, including adding Markdown comments, using `!` to run shell commands, and `tqdm` to show progress.\n",
"\n",
"## If you're new to the Command Line\n",
"\n",
"Please see these excellent guides: [Command line on PC](https://chatgpt.com/share/67b0acea-ba38-8012-9c34-7a2541052665) and [Command line on Mac](https://chatgpt.com/canvas/shared/67b0b10c93a081918210723867525d2b). \n",
"\n",
"## If you'd prefer to work in IDEs\n",
"\n",
"If you're more comfortable in IDEs like VSCode, Cursor or PyCharm, they all work great with these lab notebooks too. \n",
"If you'd prefer to work in VSCode, [here](https://chatgpt.com/share/676f2e19-c228-8012-9911-6ca42f8ed766) are instructions from an AI friend on how to configure it for the course.\n",
"\n",
"## If you'd like to brush up your Python\n",
"\n",
"I've added a notebook called [Intermediate Python](Intermediate%20Python.ipynb) to get you up to speed. But you should give it a miss if you already have a good idea what this code does: \n",
"`yield from {book.get(\"author\") for book in books if book.get(\"author\")}`\n",
"\n",
"## I am here to help\n",
"\n",
"If you have any problems at all, please do reach out. \n",
"I'm available through the platform, or at ed@edwarddonner.com, or at https://www.linkedin.com/in/eddonner/ if you'd like to connect (and I love connecting!) \n",
"And this is new to me, but I'm also trying out X/Twitter at [@edwarddonner](https://x.com/edwarddonner) - if you're on X, please show me how it's done 😂 \n",
"\n",
"## More troubleshooting\n",
"\n",
"Please see the [troubleshooting](troubleshooting.ipynb) notebook in this folder to diagnose and fix common problems. At the very end of it is a diagnostics script with some useful debug info.\n",
"\n",
"## For foundational technical knowledge (eg Git, APIs, debugging) \n",
"\n",
"If you're relatively new to programming -- I've got your back! While it's ideal to have some programming experience for this course, there's only one mandatory prerequisite: plenty of patience. 😁 I've put together a set of self-study guides that cover Git and GitHub, APIs and endpoints, beginner python and more.\n",
"\n",
"This covers Git and GitHub; what they are, the difference, and how to use them: \n",
"https://github.com/ed-donner/agents/blob/main/guides/03_git_and_github.ipynb\n",
"\n",
"This covers technical foundations: \n",
"ChatGPT vs API; taking screenshots; Environment Variables; Networking basics; APIs and endpoints: \n",
"https://github.com/ed-donner/agents/blob/main/guides/04_technical_foundations.ipynb\n",
"\n",
"This covers Python for beginners, and making sure that a `NameError` never trips you up: \n",
"https://github.com/ed-donner/agents/blob/main/guides/06_python_foundations.ipynb\n",
"\n",
"This covers the essential techniques for figuring out errors: \n",
"https://github.com/ed-donner/agents/blob/main/guides/08_debugging.ipynb\n",
"\n",
"And you'll find other useful guides in the same folder in GitHub. Some information applies to my other Udemy course (eg Async Python) but most of it is very relevant for LLM engineering.\n",
"\n",
"## If this is old hat!\n",
"\n",
"If you're already comfortable with today's material, please hang in there; you can move swiftly through the first few labs - we will get much more in depth as the weeks progress. Ultimately we will fine-tune our own LLM to compete with OpenAI!\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Please read - important note</h2>\n",
" <span style=\"color:#900;\">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations. If you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">This code is a live resource - keep an eye out for my emails</h2>\n",
" <span style=\"color:#f71;\">I push updates to the code regularly. As people ask questions, I add more examples or improved commentary. As a result, you'll notice that the code below isn't identical to the videos. Everything from the videos is here; but I've also added better explanations and new models like DeepSeek. Consider this like an interactive book.<br/><br/>\n",
" I try to send emails regularly with important updates related to the course. You can find this in the 'Announcements' section of Udemy in the left sidebar. You can also choose to receive my emails via your Notification Settings in Udemy. I'm respectful of your inbox and always try to add value with my emails!\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business value of these exercises</h2>\n",
" <span style=\"color:#181;\">A final thought. While I've designed these notebooks to be educational, I've also tried to make them enjoyable. We'll do fun things like have LLMs tell jokes and argue with each other. But fundamentally, my goal is to teach skills you can apply in business. I'll explain business implications as we go, and it's worth keeping this in mind: as you build experience with models and techniques, think of ways you could put this into action at work today. Please do contact me if you'd like to discuss more or if you have ideas to bounce off me.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"!ollama pull llama3.2\n",
"\n",
"MODEL = \"llama3.2\"\n",
"openai = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")\n",
"\n",
"response = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[{\"role\": \"user\", \"content\": \"What is 2 + 2?\"}]\n",
")\n",
"\n",
"print(response.choices[0].message.content)\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
"metadata": {},
"source": [
"# Connecting to OpenAI (or Ollama)\n",
"\n",
"The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI. \n",
"\n",
"If you'd like to use free Ollama instead, please see the README section \"Free Alternative to Paid APIs\", and if you're not sure how to do this, there's a full solution in the solutions folder (day1_with_ollama.ipynb).\n",
"\n",
"## Troubleshooting if you have problems:\n",
"\n",
"Head over to the [troubleshooting](troubleshooting.ipynb) notebook in this folder for step by step code to identify the root cause and fix it!\n",
"\n",
"If you make a change, try restarting the \"Kernel\" (the python process sitting behind this notebook) by Kernel menu >> Restart Kernel and Clear Outputs of All Cells. Then try this notebook again, starting at the top.\n",
"\n",
"Or, contact me! Message me or email ed@edwarddonner.com and we will get this to work.\n",
"\n",
"Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"# openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the Troubleshooting notebook in this folder for full instructions"
]
},
{
"cell_type": "markdown",
"id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
"metadata": {},
"source": [
"# Let's make a quick call to a Frontier model to get started, as a preview!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"\n",
"message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
"# response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=[{\"role\":\"user\", \"content\":message}])\n",
"# print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## OK onwards with our first project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e793b2-6775-426a-a139-4848291d0463",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"source": [
"# Let's try one out. Change the website and add print statements to follow along.\n",
"\n",
"ed = Website(\"https://edwarddonner.com\")\n",
"print(ed.title)\n",
"print(ed.text)"
]
},
{
"cell_type": "markdown",
"id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
"metadata": {},
"source": [
"## Types of prompts\n",
"\n",
"You may know this already - but if not, you will get very familiar with it!\n",
"\n",
"Models like GPT4o have been trained to receive instructions in a particular way.\n",
"\n",
"They expect to receive:\n",
"\n",
"**A system prompt** that tells them what task they are performing and what tone they should use\n",
"\n",
"**A user prompt** -- the conversation starter that they should reply to"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.\"\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26448ec4-5c00-4204-baec-7df91d11ff2e",
"metadata": {},
"outputs": [],
"source": [
"print(user_prompt_for(ed))"
]
},
{
"cell_type": "markdown",
"id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
"metadata": {},
"source": [
"## Messages\n",
"\n",
"The API from OpenAI expects to receive messages in a particular structure.\n",
"Many of the other APIs share this structure:\n",
"\n",
"```python\n",
"[\n",
" {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
" {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
"]\n",
"```\n",
"To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a snarky assistant\"},\n",
" {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
"]"
]
},
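  {
   "cell_type": "markdown",
   "id": "2f4c6e80-cccc-4ddd-aeee-multiturn001",
   "metadata": {},
   "source": [
    "These messages are just a plain Python list of dicts. One sketch of how you might extend the list into a multi-turn conversation, by appending the assistant's reply and a follow-up (the reply text here is made up for illustration):\n",
    "\n",
    "```python\n",
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": \"You are a snarky assistant\"},\n",
    "    {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
    "]\n",
    "\n",
    "# After calling the API, you could append the reply to keep the conversation going:\n",
    "messages.append({\"role\": \"assistant\", \"content\": \"Four. Obviously.\"})\n",
    "messages.append({\"role\": \"user\", \"content\": \"Thanks!\"})\n",
    "\n",
    "print([m[\"role\"] for m in messages])\n",
    "```\n",
    "\n",
    "Each subsequent API call receives the whole list - that is how these chat APIs carry conversational context."
   ]
  },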
{
"cell_type": "code",
"execution_count": null,
"id": "21ed95c5-7001-47de-a36d-1d6673b403ce",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with system and user messages:\n",
"\n",
"response = openai.chat.completions.create(model=MODEL, messages=messages)\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## And now let's build useful messages for GPT-4o-mini, using a function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
"metadata": {},
"outputs": [],
"source": [
"# Try this out, and then try for a few more websites\n",
"\n",
"messages_for(ed)"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Time to bring it together - the API for OpenAI is very simple!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model = MODEL,\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "markdown",
"id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
"metadata": {},
"source": [
"# Let's try more websites\n",
"\n",
"Note that this will only work on websites that can be scraped using this simplistic approach.\n",
"\n",
"Websites that are rendered with Javascript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
"\n",
"Also, websites protected with CloudFront (and similar) may return 403 errors - many thanks to Andy J for pointing this out.\n",
"\n",
"But many websites will work just fine!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://cnn.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75110f18-8956-4fbc-87c0-482a086cea10",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://openai.com\")"
]
},
{
"cell_type": "markdown",
"id": "c951be1a-7f1b-448f-af1f-845978e47e2c",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business applications</h2>\n",
" <span style=\"color:#181;\">In this exercise, you experienced calling the Cloud API of a Frontier Model (a leading model at the frontier of AI) for the first time. We will be using APIs like OpenAI at many stages in the course, in addition to building our own LLMs.\n",
"\n",
"More specifically, we've applied this to Summarization - a classic Gen AI use case to make a summary. This can be applied to any business vertical - summarizing the news, summarizing financial performance, summarizing a resume in a cover letter - the applications are limitless. Consider how you could apply Summarization in your business, and try prototyping a solution.</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Before you continue - now try yourself</h2>\n",
" <span style=\"color:#900;\">Use the cell below to make your own simple commercial example. Stick with the summarization use case for now. Here's an idea: write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are an AI assistant who reads an email and suggests a short, appropriate subject line for it.\"\n",
"user_prompt = \"\"\"\n",
"    Below is the text of an email I am about to send to my superior. Please provide an appropriate subject for this:\n",
"\n",
"    • **Meetings:** Attend 2-3 meetings per day, including team stand-up meetings and project updates\n",
"    • **Work Packages:** Complete 3-4 work packages, which include researching, writing, editing, and proofreading articles, blog posts, or other content\n",
"    • **Collaboration:** Engage in 2-3 hours of collaboration with colleagues via email, phone, or video conferencing to discuss projects and share knowledge\n",
"    • **Learning:** Spend 30 minutes per day learning a new skill or tool related to the job, such as a programming language, software application, or industry-specific training\n",
"    • **Administration:** Complete administrative tasks, including responding to emails, updating project management tools, and maintaining records\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt},\n",
"]\n",
"\n",
"# Step 3: Call the LLM (Llama 3.2 via Ollama's OpenAI-compatible endpoint)\n",
"!ollama pull llama3.2\n",
"\n",
"from openai import OpenAI\n",
"MODEL = \"llama3.2\"\n",
"\n",
"openai = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")\n",
"response = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=messages\n",
")\n",
"\n",
"\n",
"# Step 4: print the result\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clean outputs of all cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend: \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": [
"!pip install selenium"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ccfcd6b3-a46c-467e-a4a8-7089b3e788bc",
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"import time\n",
"\n",
"from selenium import webdriver\n",
"from selenium.webdriver.safari.options import Options\n",
"\n",
"class WebsiteScrape:\n",
"    def __init__(self, url):\n",
"        \"\"\"\n",
"        Create a website object from the given url using Selenium and BeautifulSoup.\n",
"        Handles both plain HTML and Javascript-rendered websites.\n",
"        Uses the Safari webdriver (enable 'Allow Remote Automation' in Safari's Develop menu).\n",
"        \"\"\"\n",
"        \n",
"        self.url = url\n",
"        \n",
"        # Initialise the Safari webdriver\n",
"        options = Options()\n",
"        driver = webdriver.Safari(options=options)\n",
"        \n",
"        # Start Selenium and load the page\n",
"        driver.get(url)\n",
"        \n",
"        # Wait for JS to load (adjust as needed)\n",
"        time.sleep(3)\n",
"        \n",
"        # Fetch the page source after JS execution\n",
"        page_source = driver.page_source\n",
"        driver.quit()\n",
"        \n",
"        # Parse with BeautifulSoup\n",
"        soup = BeautifulSoup(page_source, 'html.parser')\n",
"        self.title = soup.title.string if soup.title else \"No title found\"\n",
"        \n",
"        # Clean irrelevant tags\n",
"        for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
"            irrelevant.decompose()\n",
"        \n",
"        # Extract the main text\n",
"        self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26458db3-e8e2-447c-8001-dfa1c538b9d4",
"metadata": {},
"outputs": [],
"source": [
"def summarize_using_selenium(url):\n",
"    website = WebsiteScrape(url)\n",
" response = openai.chat.completions.create(\n",
" model = MODEL,\n",
" messages = messages_for(website)\n",
" )\n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "061d3220-4f53-4f18-a3f7-2291771d3300",
"metadata": {},
"outputs": [],
"source": [
"summary = summarize_using_selenium(\"https://openai.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c81103c-a9bb-429d-9d65-d49753280611",
"metadata": {},
"outputs": [],
"source": [
"display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9d9d6c4-ef71-45c3-8e3d-73e99dda711b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -477,7 +477,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -491,7 +491,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
"version": "3.11.13"
}
},
"nbformat": 4,


@@ -0,0 +1,191 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "786b2ed1-f82e-4ca4-8113-c4515b36e970",
"metadata": {},
"source": [
"# Day 2 Exercise | Website Summarizer with Llama 3.2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b88bf233-29e0-4c01-a4da-8a16896a95e3",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "markdown",
"id": "f66f620e-ebf6-45d3-a710-2bb931cac841",
"metadata": {},
"source": [
"### 1. Scraping info from website:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e300303-02ac-4d60-9c8c-044a4627be9e",
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "137714b9-24eb-4541-8f24-507dbcd09279",
"metadata": {},
"outputs": [],
"source": [
"ed = Website(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "markdown",
"id": "77ba1b4b-fc4c-4e3c-bef7-c4d4281d8263",
"metadata": {},
"source": [
"### 2. Ollama configuration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97811fcb-1ceb-49a8-bfb9-2e610605c406",
"metadata": {},
"outputs": [],
"source": [
"OLLAMA_API = \"http://localhost:11434/api/chat\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "392326b8-ad0f-4bc9-b055-6220f8bcc57c",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\"\n",
"user_prompt = user_prompt_for(ed)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8caa94ff-5ace-4f9b-b2f0-beb6ff550636",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
"]\n",
"\n",
"payload = {\n",
" \"model\": MODEL,\n",
" \"messages\": messages,\n",
" \"stream\": False\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "f5f856bc-0437-4607-9204-5390d2dfd8db",
"metadata": {},
"source": [
"### 3. Get & display summary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7fd6f93-92ae-419f-b8b6-ee8214e0d93f",
"metadata": {},
"outputs": [],
"source": [
"response = requests.post(OLLAMA_API, json=payload, headers=HEADERS)\n",
"summary = response.json()['message']['content']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78e4a433-b974-463f-82d0-b4696c63e0ab",
"metadata": {},
"outputs": [],
"source": [
"def display_summary(summary_text: str):\n",
"    # Note: round-tripping through 'unicode_escape' corrupts non-ASCII characters, so just strip whitespace\n",
"    cleaned = summary_text.strip()\n",
" display(Markdown(cleaned))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc408f1d-fe26-4bd6-859f-d18118f74ca6",
"metadata": {},
"outputs": [],
"source": [
"display_summary(summary)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,422 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# Welcome to your first assignment!\n",
"\n",
"Instructions are below. Please give this a try, and look in the solutions folder if you get stuck (or feel free to ask me!)"
]
},
{
"cell_type": "markdown",
"id": "ada885d9-4d42-4d9b-97f0-74fbbbfe93a9",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">Just before we get to the assignment --</h2>\n",
" <span style=\"color:#f71;\">I thought I'd take a second to point you at this page of useful resources for the course. This includes links to all the slides.<br/>\n",
" <a href=\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>\n",
" Please keep this bookmarked, and I'll continue to add more useful links there over time.\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "6e9fa1fc-eac5-4d1d-9be4-541b3f2b3458",
"metadata": {},
"source": [
"# HOMEWORK EXERCISE ASSIGNMENT\n",
"\n",
"Upgrade the day 1 project to summarize a webpage to use an Open Source model running locally via Ollama rather than OpenAI\n",
"\n",
"You'll be able to use this technique for all subsequent projects if you'd prefer not to use paid APIs.\n",
"\n",
"**Benefits:**\n",
"1. No API charges - open-source\n",
"2. Data doesn't leave your box\n",
"\n",
"**Disadvantages:**\n",
"1. Significantly less power than Frontier Model\n",
"\n",
"## Recap on installation of Ollama\n",
"\n",
"Simply visit [ollama.com](https://ollama.com) and install!\n",
"\n",
"Once complete, the ollama server should already be running locally. \n",
"If you visit: \n",
"[http://localhost:11434/](http://localhost:11434/)\n",
"\n",
"You should see the message `Ollama is running`. \n",
"\n",
"If not, bring up a new Terminal (Mac) or Powershell (Windows) and enter `ollama serve` \n",
"And in another Terminal (Mac) or Powershell (Windows), enter `ollama pull llama3.2` \n",
"Then try [http://localhost:11434/](http://localhost:11434/) again.\n",
"\n",
"If Ollama is slow on your machine, try using `llama3.2:1b` as an alternative. Run `ollama pull llama3.2:1b` from a Terminal or Powershell, and change the code below from `MODEL = \"llama3.2\"` to `MODEL = \"llama3.2:1b\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29ddd15d-a3c5-4f4e-a678-873f56162724",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"\n",
"OLLAMA_API = \"http://localhost:11434/api/chat\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}\n",
"MODEL = \"llama3.2\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dac0a679-599c-441f-9bf2-ddc73d35b940",
"metadata": {},
"outputs": [],
"source": [
"# Create a messages list using the same format that we used for OpenAI\n",
"\n",
"messages = [\n",
" {\"role\": \"user\", \"content\": \"Describe some of the business applications of Generative AI\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bb9c624-14f0-4945-a719-8ddb64f66f47",
"metadata": {},
"outputs": [],
"source": [
"payload = {\n",
" \"model\": MODEL,\n",
" \"messages\": messages,\n",
" \"stream\": False\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "479ff514-e8bd-4985-a572-2ea28bb4fa40",
"metadata": {},
"outputs": [],
"source": [
"# Let's just make sure the model is loaded\n",
"\n",
"!ollama pull llama3.2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42b9f644-522d-4e05-a691-56e7658c0ea9",
"metadata": {},
"outputs": [],
"source": [
"# If this doesn't work for any reason, try the 2 versions in the following cells\n",
"# And double check the instructions in the 'Recap on installation of Ollama' at the top of this lab\n",
"# And if none of that works - contact me!\n",
"\n",
"response = requests.post(OLLAMA_API, json=payload, headers=HEADERS)\n",
"print(response.json()['message']['content'])"
]
},
{
"cell_type": "markdown",
"id": "6a021f13-d6a1-4b96-8e18-4eae49d876fe",
"metadata": {},
"source": [
"# Introducing the ollama package\n",
"\n",
"And now we'll do the same thing, but using the elegant ollama python package instead of a direct HTTP call.\n",
"\n",
"Under the hood, it's making the same call as above to the ollama server running at localhost:11434"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7745b9c4-57dc-4867-9180-61fa5db55eb8",
"metadata": {},
"outputs": [],
"source": [
"import ollama\n",
"\n",
"response = ollama.chat(model=MODEL, messages=messages)\n",
"print(response['message']['content'])"
]
},
{
"cell_type": "markdown",
"id": "a4704e10-f5fb-4c15-a935-f046c06fb13d",
"metadata": {},
"source": [
"## Alternative approach - using OpenAI python library to connect to Ollama"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23057e00-b6fc-4678-93a9-6b31cb704bff",
"metadata": {},
"outputs": [],
"source": [
"# There's actually an alternative approach that some people might prefer\n",
"# You can use the OpenAI client python library to call Ollama:\n",
"\n",
"from openai import OpenAI\n",
"ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n",
"\n",
"response = ollama_via_openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=messages\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "9f9e22da-b891-41f6-9ac9-bd0c0a5f4f44",
"metadata": {},
"source": [
"## Are you confused about why that works?\n",
"\n",
"It seems strange, right? We just used OpenAI code to call Ollama?? What's going on?!\n",
"\n",
"Here's the scoop:\n",
"\n",
"The python class `OpenAI` is simply code written by OpenAI engineers that makes calls over the internet to an endpoint. \n",
"\n",
"When you call `openai.chat.completions.create()`, this python code just makes a web request to the following url: \"https://api.openai.com/v1/chat/completions\"\n",
"\n",
"Code like this is known as a \"client library\" - it's just wrapper code that runs on your machine to make web requests. The actual power of GPT is running on OpenAI's cloud behind this API, not on your computer!\n",
"\n",
"OpenAI's API became so popular that lots of other AI providers offer identical web endpoints, so you can use the same approach.\n",
"\n",
"So Ollama has an endpoint running on your local box at http://localhost:11434/v1/chat/completions \n",
"And in week 2 we'll discover that lots of other providers do this too, including Gemini and DeepSeek.\n",
"\n",
"And then the team at OpenAI had a great idea: they can extend their client library so you can specify a different 'base url', and use their library to call any compatible API.\n",
"\n",
"That's it!\n",
"\n",
"So when you say: `ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')` \n",
"Then this will make the same endpoint calls, but to Ollama instead of OpenAI."
]
},
{
"cell_type": "markdown",
"id": "bc7d1de3-e2ac-46ff-a302-3b4ba38c4c90",
"metadata": {},
"source": [
"## Also trying the amazing reasoning model DeepSeek\n",
"\n",
"Here we use the version of DeepSeek-reasoner that's been distilled to 1.5B. \n",
"This is actually a 1.5B variant of Qwen that has been fine-tuned using synthetic data generated by DeepSeek R1.\n",
"\n",
"Other sizes of DeepSeek are [here](https://ollama.com/library/deepseek-r1) all the way up to the full 671B parameter version, which would use up 404GB of your drive and is far too large for most!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf9eb44e-fe5b-47aa-b719-0bb63669ab3d",
"metadata": {},
"outputs": [],
"source": [
"!ollama pull deepseek-r1:1.5b"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d3d554b-e00d-4c08-9300-45e073950a76",
"metadata": {},
"outputs": [],
"source": [
"# This may take a few minutes to run! You should then see a fascinating \"thinking\" trace inside <think> tags, followed by some decent definitions\n",
"\n",
"response = ollama_via_openai.chat.completions.create(\n",
" model=\"deepseek-r1:1.5b\",\n",
" messages=[{\"role\": \"user\", \"content\": \"Please give definitions of some core concepts behind LLMs: a neural network, attention and the transformer\"}]\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "1622d9bb-5c68-4d4e-9ca4-b492c751f898",
"metadata": {},
"source": [
"# NOW the exercise for you\n",
"\n",
"Take the code from day1 and incorporate it here, to build a website summarizer that uses Llama 3.2 running locally instead of OpenAI; use either of the above approaches."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6de38216-6d1c-48c4-877b-86d403f4e0f8",
"metadata": {},
"outputs": [],
"source": [
"import ollama\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"\n",
"MODEL = \"llama3.2\"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f13fe25-74a7-4342-aa96-d4c494ec429e",
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Create this Website object from the given url using the BeautifulSoup library\n",
" \"\"\"\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n",
"\n",
"# udemy = Website(\"https://www.udemy.com\")\n",
"# print(udemy.title)\n",
"# print(udemy.text)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19e3b328-561e-4e79-9cd3-f16844bd8c38",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(website):\n",
" user_prompt = f\"You are looking at a website titled {website.title}\"\n",
" user_prompt += \"\\nThe contents of this website is as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd568605-4138-47cb-91ed-8f3540c04e6d",
"metadata": {},
"outputs": [],
"source": [
"def summarize(url):\n",
" website = Website(url)\n",
" messages = [\n",
" {\"role\":\"system\",\"content\":\"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\"},\n",
" {\"role\":\"user\", \"content\":user_prompt_for(website)}\n",
"]\n",
" response = ollama.chat(model=MODEL, messages=messages)\n",
" return response['message']['content']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2b37d9d-5332-4c21-837f-7bcdbc0e3caf",
"metadata": {},
"outputs": [],
"source": [
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))\n",
"\n",
"\n",
"display_summary(\"https://udemy.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fdd3509-a76c-4a87-b95e-fa3897a4f4aa",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,122 @@
# 🥊 Summarization Battle: Ollama vs. OpenAI Judge
This mini-project pits multiple **local LLMs** (via [Ollama](https://ollama.ai)) against each other in a **web summarization contest**, with an **OpenAI model** serving as the impartial judge.
It automatically fetches web articles, summarizes them with several models, and evaluates the results on **coverage, faithfulness, clarity, and conciseness**.
---
## 🚀 Features
- **Fetch Articles:** Download and clean text content from given URLs.
- **Summarize with Ollama:** Run multiple local models (e.g., `llama3.2`, `phi3`, `deepseek-r1`) via the Ollama API.
- **Judge with OpenAI:** Use `gpt-4o-mini` (or any other OpenAI model) to score summaries.
- **Battle Results:** Collect JSON results with per-model scores, rationales, and winners.
- **Timeout Handling & Warmup:** Keeps models alive with `keep_alive` to avoid cold-start delays.
---
## 📂 Project Structure
```
.
├── urls.txt # Dictionary of categories → URLs
├── battle_results.json # Summarization + judging results
├── main.py # Main script
├── requirements.txt # Dependencies
└── README.md # You are here
```
---
## ⚙️ Installation
1. **Clone the repo**:
```bash
git clone https://github.com/khashayarbayati1/wikipedia-summarization-battle.git
cd summarization-battle
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
Minimal requirements:
```txt
requests
beautifulsoup4
python-dotenv
openai>=1.0.0
httpx
```
3. **Install Ollama & models**:
- [Install Ollama](https://ollama.ai/download) if not already installed.
- Pull the models you want:
```bash
ollama pull llama3.2:latest
ollama pull deepseek-r1:1.5b
ollama pull phi3:latest
```
4. **Set up OpenAI API key**:
Create a `.env` file with:
```env
OPENAI_API_KEY=sk-proj-xxxx...
```
---
## ▶️ Usage
1. Put your URL dictionary in `urls.txt`, e.g.:
```python
{
"sports": "https://en.wikipedia.org/wiki/Sabermetrics",
"Politics": "https://en.wikipedia.org/wiki/Separation_of_powers",
"History": "https://en.wikipedia.org/wiki/Industrial_Revolution"
}
```
2. Run the script:
```bash
python main.py
```
3. Results are written to:
- `battle_results.json`
- Printed in the terminal
---
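The request each local model receives can be sketched as a minimal Ollama `/api/chat` payload. The helper name and prompt wording below are illustrative, not the exact code in `main.py`:

```python
# Illustrative sketch: build the Ollama /api/chat payload for one (model, article) pair.
# Field names follow Ollama's chat API; the prompt text is a stand-in for the real one.

def build_summary_payload(model: str, article_text: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a careful summarizer. Respond in markdown."},
            {"role": "user", "content": f"Summarize the following article:\n\n{article_text}"},
        ],
        "stream": False,
        "keep_alive": "5m",  # keep the model loaded between requests to avoid cold starts
    }
```

---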
## 🏆 Example Results
Sample output (excerpt):
```json
{
"category": "sports",
"url": "https://en.wikipedia.org/wiki/Sabermetrics",
"scores": {
"llama3.2:latest": { "score": 4, "rationale": "Covers the main points..." },
"deepseek-r1:1.5b": { "score": 3, "rationale": "Some inaccuracies..." },
"phi3:latest": { "score": 5, "rationale": "Concise, accurate, well-organized." }
},
"winner": "phi3:latest"
}
```
From the full run:
- 🥇 **`phi3:latest`** won in *Sports, History, Productivity*
- 🥇 **`deepseek-r1:1.5b`** won in *Politics, Technology*
---
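Given the `battle_results.json` schema above, tallying overall winners is straightforward; `tally_winners` below is a hypothetical helper for post-processing, not part of the repo:

```python
from collections import Counter

def tally_winners(results: list) -> Counter:
    # Count how many categories each model won across all battle entries
    return Counter(entry["winner"] for entry in results)

# Entries shaped like the battle_results.json excerpt above
sample = [
    {"category": "sports", "winner": "phi3:latest"},
    {"category": "Politics", "winner": "deepseek-r1:1.5b"},
    {"category": "History", "winner": "phi3:latest"},
]
print(tally_winners(sample).most_common(1))  # [('phi3:latest', 2)]
```

---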
## 💡 Ideas for Extension
- Add more Ollama models (e.g., `mistral`, `gemma`, etc.)
- Try different evaluation criteria (e.g., readability, length control)
- Visualize results with charts
- Benchmark runtime and token usage
---
## 📜 License
MIT License: free to use, modify, and share.


@@ -0,0 +1,97 @@
[
{
"category": "sports",
"url": "https://en.wikipedia.org/wiki/Sabermetrics",
"scores": {
"llama3.2:latest": {
"score": 4,
"rationale": "This summary covers the main points of the article well, including the origins of sabermetrics, its evolution, and its impact on baseball analytics. However, it could be slightly more concise."
},
"deepseek-r1:1.5b": {
"score": 3,
"rationale": "While this summary captures several key aspects of sabermetrics, it lacks clarity in organization and includes some inaccuracies, such as misattributing the coinage of the term to Earnshaw Cook."
},
"phi3:latest": {
"score": 5,
"rationale": "This summary is concise and accurately reflects the key elements of the article, including the contributions of Bill James and the evolution of metrics in baseball, making it clear and well-organized."
}
},
"winner": "phi3:latest"
},
{
"category": "Politics",
"url": "https://en.wikipedia.org/wiki/Separation_of_powers",
"scores": {
"llama3.2:latest": {
"score": 4,
"rationale": "This summary effectively covers the main points of the article, including the definition of separation of powers, its implementation, and the philosophical background. However, it could benefit from a bit more detail on historical context."
},
"deepseek-r1:1.5b": {
"score": 5,
"rationale": "This summary is comprehensive and well-organized, clearly outlining the structure of the separation of powers, examples from different countries, and implications for political ideologies. It maintains clarity and accuracy throughout."
},
"phi3:latest": {
"score": 3,
"rationale": "While this summary provides a broad overview of the historical and theoretical aspects of separation of powers, it lacks focus on the core principles and practical implications, making it less concise and clear compared to the others."
}
},
"winner": "deepseek-r1:1.5b"
},
{
"category": "History",
"url": "https://en.wikipedia.org/wiki/Industrial_Revolution",
"scores": {
"llama3.2:latest": {
"score": 4,
"rationale": "This summary effectively covers the main points of the Industrial Revolution, including its timeline, technological advancements, and societal impacts. However, it could benefit from more detail on the causes and criticisms."
},
"deepseek-r1:1.5b": {
"score": 3,
"rationale": "While this summary captures some key aspects of the Industrial Revolution, it lacks clarity and organization, making it harder to follow. It also misses some significant details about the social effects and criticisms."
},
"phi3:latest": {
"score": 5,
"rationale": "This summary is comprehensive and well-organized, covering a wide range of topics including technological advancements, social impacts, and historical context. It provides a clear and detailed overview of the Industrial Revolution."
}
},
"winner": "phi3:latest"
},
{
"category": "Technology",
"url": "https://en.wikipedia.org/wiki/Artificial_general_intelligence",
"scores": {
"llama3.2:latest": {
"score": 4,
"rationale": "The summary covers key aspects of AGI, including its definition, development goals, and associated risks, but could benefit from more technical details."
},
"deepseek-r1:1.5b": {
"score": 5,
"rationale": "This summary is well-structured and comprehensive, accurately capturing the essence of AGI, its distinctions from narrow AI, and the associated risks while maintaining clarity."
},
"phi3:latest": {
"score": 4,
"rationale": "The summary effectively outlines the definition and characteristics of AGI, but it lacks some depth in discussing the implications and technical definitions compared to the best summary."
}
},
"winner": "deepseek-r1:1.5b"
},
{
"category": "Productivity",
"url": "https://en.wikipedia.org/wiki/Scientific_management",
"scores": {
"llama3.2:latest": {
"score": 4,
"rationale": "This summary covers the main points of the article, including the origins, principles, and historical context of scientific management. However, it could be more concise and organized."
},
"deepseek-r1:1.5b": {
"score": 3,
"rationale": "While this summary captures key aspects of scientific management, it lacks clarity and organization, making it harder to follow. The bullet points are somewhat disjointed."
},
"phi3:latest": {
"score": 5,
"rationale": "This summary is well-structured, covering the essential elements of scientific management, including its principles, historical context, and criticisms. It is clear, concise, and accurately reflects the source material."
}
},
"winner": "phi3:latest"
}
]

View File

@@ -0,0 +1,214 @@
# imports
import os, json, ast, pathlib
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from openai import OpenAI
import traceback
from typing import List, Dict
from httpx import Timeout
# ---------- utils ----------
def openai_api_key_loader():
load_dotenv(dotenv_path=".env", override=True)
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
print("❌ No API key found. Please check your .env file.")
return False
if not api_key.startswith("sk-proj-"):
print("⚠️ API key found, but does not start with 'sk-proj-'. Check you're using the right one.")
return False
if api_key.strip() != api_key:
print("⚠️ API key has leading/trailing whitespace. Please clean it.")
return False
print("✅ API key found and looks good!")
return True
def ollama_installed_tags(base_url="http://localhost:11434"):
r = requests.get(f"{base_url}/api/tags", timeout=10)
r.raise_for_status()
return {m["name"] for m in r.json().get("models", [])}
def get_urls(file_name: str):
with open(f"{file_name}.txt", "r") as f:
content = f.read()
url_dict = ast.literal_eval(content) # expects a dict literal in the file
return url_dict
def text_from_url(url: str):
session = requests.Session()
session.headers.update({
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/117.0.0.0 Safari/537.36"
)
})
resp = session.get(url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, 'html.parser')
title = soup.title.string.strip() if soup.title and soup.title.string else "No title found"
body = soup.body
if not body:
return title, ""
for irrelevant in body(["script", "style", "img", "input", "noscript"]):
irrelevant.decompose()
text = body.get_text(separator="\n", strip=True)
return title, text
# ---------- contestants (Ollama) ----------
def summarize_with_model(text: str, model: str, ollama_client: OpenAI) -> str:
clipped = text[:9000] # keep it modest for small models
messages = [
{"role": "system", "content": "You are a concise, faithful web summarizer."},
{"role": "user", "content": (
"Summarize the article below in 46 bullet points. "
"Be factual, avoid speculation, and do not add information not present in the text.\n\n"
f"=== ARTICLE START ===\n{clipped}\n=== ARTICLE END ==="
)}
]
stream = ollama_client.chat.completions.create(
model=model,
messages=messages,
temperature=0,
stream=True,
extra_body={"keep_alive": "30m", "num_ctx": 2048}
)
chunks = []
for event in stream:
delta = getattr(event.choices[0].delta, "content", None)
if delta:
chunks.append(delta)
return "".join(chunks).strip()
# ---------- judge (ChatGPT) ----------
JUDGE_MODEL = "gpt-4o-mini"
def judge_summaries(category: str, url: str, source_text: str, summaries: dict, judge_client: OpenAI) -> dict:
src = source_text[:12000]
judge_prompt = f"""
You are the referee in a web summarization contest.
Task:
1) Read the SOURCE ARTICLE (below).
2) Evaluate EACH SUMMARY on: Coverage, Accuracy/Faithfulness, Clarity/Organization, Conciseness.
3) Give a 0-5 integer SCORE for each model (5 = best).
4) Brief rationale (1-2 sentences per model).
5) Choose a single WINNER (tie-break on accuracy then clarity).
Return STRICT JSON only with this schema:
{{
"category": "{category}",
"url": "{url}",
"scores": {{
"<model_name>": {{ "score": <0-5>, "rationale": "<1-2 sentences>" }}
}},
"winner": "<model_name>"
}}
SOURCE ARTICLE:
{src}
SUMMARIES:
"""
for m, s in summaries.items():
judge_prompt += f"\n--- {m} ---\n{s}\n"
messages = [
{"role": "system", "content": "You are a strict, reliable evaluation judge for summaries."},
{"role": "user", "content": judge_prompt}
]
resp = judge_client.chat.completions.create(
model=JUDGE_MODEL,
messages=messages,
response_format={"type": "json_object"},
temperature=0
)
content = resp.choices[0].message.content
try:
return json.loads(content)
except json.JSONDecodeError:
# fallback: extract the outermost JSON object if the model added extra text
start = content.find("{")
end = content.rfind("}")
return json.loads(content[start:end+1])
def run_battle(url_dict: Dict[str, str], ollama_client: OpenAI, judge_client: OpenAI, models: List[str]) -> List[dict]:
all_results = []
for category, url in url_dict.items():
title, text = text_from_url(url)
summaries = {}
for m in models:
try:
summaries[m] = summarize_with_model(text, m, ollama_client)
except Exception as e:
print(f"\n--- Error from {m} ---")
print(repr(e))
traceback.print_exc()
summaries[m] = f"[ERROR from {m}: {e}]"
clean_summaries = {m: s for m, s in summaries.items() if not s.startswith("[ERROR")}
verdict = judge_summaries(category, url, text, clean_summaries or summaries, judge_client)
all_results.append(verdict)
return all_results
def warmup(ollama_client: OpenAI, model: str):
try:
ollama_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "OK"}],
temperature=0,
extra_body={"keep_alive": "30m"}
)
except Exception as e:
print(f"[warmup] {model}: {e}")
# ---------- main ----------
def main():
if not openai_api_key_loader():
return
# contestants (local Ollama)
ollama_client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama",
timeout=Timeout(300.0, connect=30.0) # generous read/connect timeouts
)
# judge (cloud OpenAI)
judge_client = OpenAI()
available = ollama_installed_tags()
desired = ["llama3.2:latest", "deepseek-r1:1.5b", "phi3:latest"] # keep here
models = [m for m in desired if m in available]
print("Available:", sorted(available))
print("Desired :", desired)
print("Running :", models)
if not models:
raise RuntimeError(f"No desired models installed. Have: {sorted(available)}")
url_dict = get_urls(file_name="urls")
for m in models:
warmup(ollama_client, m)
results = run_battle(url_dict, ollama_client, judge_client, models)
pathlib.Path("battle_results.json").write_text(json.dumps(results, indent=2), encoding="utf-8")
print(json.dumps(results, indent=2))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,68 @@
annotated-types==0.7.0
anyio==4.10.0
appnope @ file:///home/conda/feedstock_root/build_artifacts/appnope_1733332318622/work
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1733250440834/work
attrs==25.3.0
beautifulsoup4==4.13.5
bs4==0.0.2
certifi==2025.8.3
charset-normalizer==3.4.3
comm @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_comm_1753453984/work
debugpy @ file:///Users/runner/miniforge3/conda-bld/bld/rattler-build_debugpy_1758162070/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1740384970518/work
distro==1.9.0
dotenv==0.9.9
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1746947292760/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1756729339227/work
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
idna==3.10
importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_importlib-metadata_1747934053/work
ipykernel @ file:///Users/runner/miniforge3/conda-bld/ipykernel_1754352890318/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_ipython_1748711175/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1733300866624/work
jiter==0.11.0
jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1733440914442/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1748333051527/work
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1733416936468/work
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1733325553580/work
ollama==0.5.4
openai==1.108.1
outcome==1.3.0.post0
packaging @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_packaging_1745345660/work
parso @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_parso_1755974222/work
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1733301927746/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1733327343728/work
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_platformdirs_1756227402/work
prompt_toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1756321756983/work
psutil @ file:///Users/runner/miniforge3/conda-bld/psutil_1758169248045/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1733302279685/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl#sha256=92c32ff62b5fd8cf325bec5ab90d7be3d2a8ca8c8a3813ff487a8d2002630d1f
pure_eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1733569405015/work
pydantic==2.11.9
pydantic_core==2.33.2
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1750615794071/work
PySocks==1.7.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_python-dateutil_1751104122/work
python-dotenv==1.1.1
pyzmq @ file:///Users/runner/miniforge3/conda-bld/bld/rattler-build_pyzmq_1757387129/work
requests==2.32.5
selenium==4.35.0
six @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_six_1753199211/work
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.8
stack_data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1733569443808/work
tornado @ file:///Users/runner/miniforge3/conda-bld/tornado_1756854937117/work
tqdm==4.67.1
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1733367359838/work
trio==0.30.0
trio-websocket==0.12.2
typing-inspection==0.4.1
typing_extensions==4.14.1
urllib3==2.5.0
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1733231326287/work
webdriver-manager==4.0.2
websocket-client==1.8.0
wsproto==1.2.0
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1749421620841/work

View File

@@ -0,0 +1,7 @@
{
"sports": "https://en.wikipedia.org/wiki/Sabermetrics",
"Politics": "https://en.wikipedia.org/wiki/Separation_of_powers",
"History": "https://en.wikipedia.org/wiki/Industrial_Revolution",
"Technology": "https://en.wikipedia.org/wiki/Artificial_general_intelligence",
"Productivity": "https://en.wikipedia.org/wiki/Scientific_management",
}

View File

@@ -0,0 +1,259 @@
{
"cells": [
{
"cell_type": "code",
"id": "initial_id",
"metadata": {
"collapsed": true,
"ExecuteTime": {
"end_time": "2025-10-02T18:07:54.689902Z",
"start_time": "2025-10-02T18:07:54.330580Z"
}
},
"source": [
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"from website import Website"
],
"outputs": [],
"execution_count": 1
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:07:58.182655Z",
"start_time": "2025-10-02T18:07:58.176747Z"
}
},
"cell_type": "code",
"source": [
"link_system_prompt = \"You are provided with a list of links found on a Italian restaurant webpage. \\\n",
"You are able to decide which of the links would be most relevant to include in the restaurant menu, \\\n",
"such as links to an menu pdf file, Menù page, Piatti, or Bevande.\\n\"\n",
"link_system_prompt += \"You should respond in JSON as in this example:\"\n",
"link_system_prompt += \"\"\"\n",
"{\n",
" \"links\": [\n",
" {\"type\": \"menu pdf\", \"url\": \"https://www.ristoranteapprodo.com/Documenti/MenuEstivo2024.pdf\"},\n",
" {\"type\": \"menu page\", \"url\": \"https://www.giocapizza.com/men%C3%B9\"}\n",
" ]\n",
"}\n",
"\"\"\""
],
"id": "ff5d21dc8dd6bd29",
"outputs": [],
"execution_count": 3
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:01.823456Z",
"start_time": "2025-10-02T18:08:01.119076Z"
}
},
"cell_type": "code",
"source": [
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"if api_key and api_key.startswith('sk-proj-') and len(api_key) > 10:\n",
" print(\"API key looks good so far\")\n",
"else:\n",
" print(\"There might be a problem with your API key? Please visit the troubleshooting notebook!\")\n",
"\n",
"MODEL = 'gpt-4o-mini'\n",
"openai = OpenAI()\n",
"\n",
"ed = Website(\"https://www.giocapizza.com/\")\n",
"print(ed.links)"
],
"id": "bae61e79319ead26",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key looks good so far\n",
"['https://www.giocapizza.com', 'tel:349-6705657', 'https://www.instagram.com/giocapizza/', 'https://www.facebook.com/giocapizza/', 'https://www.tripadvisor.it/Restaurant_Review-g2337656-d17784755-Reviews-Gioca_Pizza-Adrara_San_Martino_Province_of_Bergamo_Lombardy.html', 'https://www.youtube.com/@GiocaPizza', 'https://www.pinterest.jp/giocapizza/', 'https://www.giocapizza.com', 'https://www.giocapizza.com/incorniciate', 'https://www.giocapizza.com/menù', 'https://www.giocapizza.com/servizi', 'https://www.giocapizza.com/menù', 'https://www.giocapizza.com/incorniciate', 'https://www.giocapizza.com/incorniciate', 'https://www.giocapizza.com/incorniciate', 'mailto:giocapizza@gmail.com', 'http://www.sinapsisnc.com']\n"
]
}
],
"execution_count": 4
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:05.104624Z",
"start_time": "2025-10-02T18:08:05.102463Z"
}
},
"cell_type": "code",
"source": [
"def get_links_user_prompt(website):\n",
" user_prompt = f\"Here is the list of links on the italian restaurant website of {website.url} - \"\n",
" user_prompt += \"please decide which of these are relevant web links for the restaurant menu, respond with the full https URL in JSON format.\"\n",
" user_prompt += \"Links (some might be relative links):\\n\"\n",
" user_prompt += \"\\n\".join(website.links)\n",
" return user_prompt\n"
],
"id": "1b5a43ae68ed636",
"outputs": [],
"execution_count": 5
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:08.740268Z",
"start_time": "2025-10-02T18:08:08.734461Z"
}
},
"cell_type": "code",
"source": [
"def get_links(url):\n",
" website = Website(url)\n",
" response = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": link_system_prompt},\n",
" {\"role\": \"user\", \"content\": get_links_user_prompt(website)}\n",
" ],\n",
" response_format={\"type\": \"json_object\"}\n",
" )\n",
" result = response.choices[0].message.content\n",
" return json.loads(result)\n"
],
"id": "69e91ccd319153f7",
"outputs": [],
"execution_count": 6
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:15.402276Z",
"start_time": "2025-10-02T18:08:15.397800Z"
}
},
"cell_type": "code",
"source": [
"def get_all_details(url):\n",
" result = \"Landing page:\\n\"\n",
" result += Website(url).get_contents()\n",
" links = get_links(url)\n",
" print(\"Found links:\", links)\n",
" for link in links[\"links\"]:\n",
" result += f\"\\n\\n{link['type']}\\n\"\n",
" result += Website(link[\"url\"]).get_contents()\n",
" return result\n"
],
"id": "e76a1deea9a05353",
"outputs": [],
"execution_count": 8
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:18.390851Z",
"start_time": "2025-10-02T18:08:18.387630Z"
}
},
"cell_type": "code",
"source": [
"system_prompt = \"You are an assistant that analyzes the contents of several menu pages from an italian restaurant website \\\n",
"and creates restaurant menu with dishes and prices in Euro. Respond in markdown.\"\n",
"\n",
"def get_restaurant_menu_user_prompt(company_name, url):\n",
" user_prompt = f\"You are looking at a restaurant called: {company_name}\\n\"\n",
" user_prompt += f\"Here are the contents of its landing page and other relevant pages; use this information to build a restaurant menu in markdown.\\n\"\n",
" user_prompt += get_all_details(url)\n",
" user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters\n",
" return user_prompt\n"
],
"id": "5f60f05dab091ec7",
"outputs": [],
"execution_count": 9
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:20.804552Z",
"start_time": "2025-10-02T18:08:20.800766Z"
}
},
"cell_type": "code",
"source": [
"def create_restaurant_menu(company_name, url):\n",
" response = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": get_restaurant_menu_user_prompt(company_name, url)}\n",
" ],\n",
" )\n",
" result = response.choices[0].message.content\n",
" display(Markdown(result))"
],
"id": "32c64d933b194bc7",
"outputs": [],
"execution_count": 10
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-10-02T18:08:55.009134Z",
"start_time": "2025-10-02T18:08:32.164709Z"
}
},
"cell_type": "code",
"source": "create_restaurant_menu(\"La Cascina\", \"https://www.lacascinacredaro.it/\")",
"id": "19bbd3984732895d",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found links: {'links': [{'type': 'piatti', 'url': 'http://www.byserviziinternet.com/cascina/#piatti'}]}\n"
]
},
{
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "# La Cascina Ristorante Pizzeria Menu\n\n## Antipasti (Starters)\n- **Bruschetta al Pomodoro** - €5.00 \n Grilled bread topped with fresh tomatoes, garlic, and basil.\n\n- **Crostini Toscani** - €7.00 \n Toasted bread with traditional chicken liver pâté.\n\n- **Tagliere di Salumi** - €9.00 \n Selection of cured meats served with pickles and bread.\n\n## Primi Piatti (First Courses)\n- **Gnocchetti di Patate con Erbette** - €10.00 \n Potato gnocchi with a blend of seasonal greens.\n\n- **Paccheri con Polipetti** - €12.00 \n Large tubular pasta with baby octopus in a tomato sauce.\n\n- **Risotto ai Frutti di Mare** - €15.00 \n Arborio rice cooked with fresh seafood.\n\n- **Tagliolini al Tartufo** - €14.00 \n Homemade tagliolini pasta with truffle sauce.\n\n- **Zuppa di Cipolle** - €8.00 \n Traditional onion soup topped with melted cheese.\n\n## Secondi Piatti (Main Courses)\n- **Filetto di Manzo** - €18.00 \n Grilled beef fillet served with a side of seasonal vegetables.\n\n- **Pollo alla Griglia** - €12.00 \n Grilled chicken breast served with rosemary potatoes.\n\n- **Branzino al Forno** - €17.00 \n Oven-baked sea bass served with a lemon-herb sauce.\n\n## Pizze (Pizzas)\n- **Margherita** - €8.00 \n Classic pizza with tomato sauce, mozzarella, and basil.\n\n- **Diavola** - €10.00 \n Spicy salami pizza with tomato sauce and mozzarella.\n\n- **Funghi e Prosciutto** - €11.00 \n Pizza topped with mushrooms and ham.\n\n- **Vegetariana** - €9.50 \n Mixed vegetable pizza with mozzarella.\n\n## Dessert\n- **Tiramisu** - €5.00 \n Classic coffee-flavored Italian dessert.\n\n- **Panna Cotta** - €5.50 \n Creamy dessert served with berry sauce.\n\n- **Gelato** - €4.00 \n Selection of homemade ice creams.\n\n## Bevande (Beverages)\n- **Acqua Naturale / Frizzante** - €2.50 \n Still or sparkling water.\n\n- **Birra Artigianale** - €4.00 \n Local craft beer.\n\n- **Vino della Casa** - €5.50 / glass \n House wine selection.\n\nFor reservations or inquiries, please 
contact us at +39 035 936383. \n**Address:** Via L. Cadorna, 9, 24060 - Credaro (BG) \n**Closed on Wednesdays**."
},
"metadata": {},
"output_type": "display_data",
"jetTransient": {
"display_id": null
}
}
],
"execution_count": 11
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,32 @@
import requests
from bs4 import BeautifulSoup
# A class to represent a Webpage
# Some websites need you to use proper headers when fetching them:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}
class Website:
"""
A utility class to represent a Website that we have scraped, now with links.
"""
def __init__(self, url):
self.url = url
response = requests.get(url, headers=headers)
self.body = response.content
soup = BeautifulSoup(self.body, 'html.parser')
self.title = soup.title.string if soup.title else "No title found"
if soup.body:
for irrelevant in soup.body(["script", "style", "img", "input"]):
irrelevant.decompose()
self.text = soup.body.get_text(separator="\n", strip=True)
else:
self.text = ""
links = [link.get('href') for link in soup.find_all('a')]
self.links = [link for link in links if link]
def get_contents(self):
return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

View File

@@ -0,0 +1,595 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# TAREA DE EJERCICIO\n",
"\n",
"Actualiza el proyecto del día 1 para resumir una página web y utilizar un modelo de código abierto que se ejecute localmente a través de Ollama en lugar de OpenAI\n",
"\n",
"Podrás utilizar esta técnica para todos los proyectos posteriores si prefiere no utilizar API de pago (closed source).\n",
"\n",
"**Beneficios:**\n",
"1. Sin cargos por API: código abierto\n",
"2. Los datos no salen de su ordenador\n",
"\n",
"**Desventajas:**\n",
"1. Tiene mucha menos potencia (parámetros) que el modelo Frontier\n",
"\n",
"## Resumen de la instalación de Ollama\n",
"\n",
"¡Simplemente visita [ollama.com](https://ollama.com) e instálalo!\n",
"\n",
"Una vez que hayas terminado, el servidor ollama ya debería estar ejecutándose localmente.\n",
"Si entras en:\n",
"[http://localhost:11434/](http://localhost:11434/)\n",
"\n",
"Debería ver el mensaje `Ollama se está ejecutando`.\n",
"\n",
"De lo contrario, abre una nueva Terminal (Mac) o Powershell (Windows) e introduce `ollama serve`.\n",
"Luego, intenta entrar em [http://localhost:11434/](http://localhost:11434/) nuevamente."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "29ddd15d-a3c5-4f4e-a678-873f56162724",
"metadata": {},
"outputs": [],
"source": [
"# Constantes\n",
"\n",
"OLLAMA_API = \"http://localhost:11434/api/chat\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}\n",
"MODEL = \"gemma3:1b\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dac0a679-599c-441f-9bf2-ddc73d35b940",
"metadata": {},
"outputs": [],
"source": [
"# Crea una lista de mensajes utilizando el mismo formato que usamos para OpenAI\n",
"\n",
"messages = [\n",
" {\"role\": \"user\", \"content\": \"Describe algunas de las aplicaciones comerciales de la IA generativa.\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7bb9c624-14f0-4945-a719-8ddb64f66f47",
"metadata": {},
"outputs": [],
"source": [
"payload = {\n",
" \"model\": MODEL,\n",
" \"messages\": messages,\n",
" \"stream\": False\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "42b9f644-522d-4e05-a691-56e7658c0ea9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"La IA generativa está revolucionando una amplia gama de industrias y aplicaciones comerciales. Aquí te presento un resumen de algunas de las más destacadas, categorizadas por área:\n",
"\n",
"**1. Marketing y Ventas:**\n",
"\n",
"* **Generación de contenido:**\n",
" * **Redacción de textos:** La IA genera descripciones de productos, publicaciones de blog, correos electrónicos de marketing, scripts de video, etc. Esto ahorra tiempo y recursos al equipo de marketing.\n",
" * **Generación de imágenes y videos:** Crea imágenes, ilustraciones y videos de alta calidad a partir de descripciones textuales, lo que facilita la producción de contenido visual.\n",
" * **Personalización del contenido:** Crea mensajes y ofertas personalizadas en función de los datos del cliente.\n",
"* **Chatbots y Asistentes Virtuales:**\n",
" * **Chatbots inteligentes:** Crean experiencias de conversación más fluidas y conversacionales, mejorando la atención al cliente y la generación de leads.\n",
" * **Asistentes virtuales personalizados:** Automatizan tareas repetitivas y brindan soporte técnico.\n",
"* **Creación de Avatar:** Genera representaciones visuales de personas para campañas de marketing, social media, etc.\n",
"* **Análisis de sentimientos:** Analiza el sentimiento en comentarios y reseñas online, ayudando a la compañía a comprender la opinión de los clientes.\n",
"\n",
"**2. Diseño y Desarrollo:**\n",
"\n",
"* **Diseño de productos:**\n",
" * **Generación de diseños:** Crea diseños de productos, productos de moda, interiores, y otros, basados en parámetros específicos, como el estilo, las dimensiones o los materiales.\n",
" * **Diseño de prototipos:** Genera prototipos visuales rápidamente para probar nuevas ideas.\n",
"* **Desarrollo de software:**\n",
" * **Generación de código:** Escribe fragmentos de código, funciones o incluso aplicaciones completas a partir de descripciones en lenguaje natural.\n",
" * **Testeo y depuración:** Identifica errores y problemas en el código automáticamente.\n",
"* **Diseño de interfaces de usuario (UI):** Crea layouts y diseños de interfaz de usuario más rápidamente con la asistencia de la IA.\n",
"\n",
"**3. Industria de la Tecnología:**\n",
"\n",
"* **Desarrollo de software:**\n",
" * **Generación de pruebas unitarias:** Automatiza la creación de pruebas para asegurar que el software funcione correctamente.\n",
" * **Generación de documentación:** Crea documentación técnica, código de referencia y documentación de API.\n",
"* **Inteligencia Artificial y Aprendizaje Automático (IA/ML):**\n",
" * **Modelos de Lenguaje de Modelado (LLM):** Los LLM, como GPT-4, están siendo utilizados para crear herramientas de generación de contenido, chatbots, y asistente de escritura.\n",
" * **Generación de Datos Sintéticos:** Crea datos artificiales para entrenar modelos de IA y pruebas, reduciendo costos y tiempos de desarrollo.\n",
"* **Ciberseguridad:**\n",
" * **Generación de pruebas de penetración:** Crea simulaciones de ataques cibernéticos.\n",
"* **Blockchain:** Generación de contratos inteligentes con características específicas.\n",
"\n",
"**4. Industria de la Finanzas:**\n",
"\n",
"* **Generación de Reporte Financiero:** Crea reportes financieros automatizados con el menor esfuerzo.\n",
"* **Detección de Fraude:** Analiza datos para identificar patrones sospechosos y detectar posibles fraudes.\n",
"* **Asesoramiento Financiero:** Crea informes personalizados y adaptados a los clientes.\n",
"\n",
"**5. Industria de la Salud:**\n",
"\n",
"* **Descubrimiento de fármacos:** Genera candidatos a fármacos, optimiza moléculas y predecir sus propiedades.\n",
"* **Asistencia al diagnóstico:** Analiza imágenes médicas (rayos X, resonancia magnética) para detectar anomalías y apoyar el diagnóstico.\n",
"* **Personalización de la atención al paciente:** Genera planes de tratamiento personalizados.\n",
"\n",
"**6. Entretenimiento:**\n",
"\n",
"* **Creación de música:** Genera melodías y letras.\n",
"* **Generación de arte:** Crea imágenes, ilustraciones y videos.\n",
"* **Creación de videojuegos:** Genera niveles, personajes y escenarios.\n",
"\n",
"**Ejemplos concretos de uso actual:**\n",
"\n",
"* **Jasper:** Herramienta de escritura que utiliza IA generativa para generar artículos de blog, descripciones de productos y contenido de marketing.\n",
"* **Copy.ai:** Aplica IA para generar contenido de marketing que incluye texto, imágenes y videos.\n",
"* **Stability AI:** Ofrece herramientas de generación de imágenes, incluyendo Stable Diffusion, que permite crear imágenes realistas.\n",
"\n",
"**Consideraciones importantes:**\n",
"\n",
"* **Calidad:** La calidad de la IA generativa puede variar. Es importante revisar y editar el contenido generado.\n",
"* **Ética:** Es crucial abordar cuestiones éticas, como el plagio, la desinformación y la privacidad.\n",
"* **Sesgo:** Los modelos de IA pueden heredar sesgos de los datos con los que fueron entrenados.\n",
"\n",
"La IA generativa está en constante evolución, y con el tiempo, estas aplicaciones se expandirán aún más, transformando la forma en que trabajamos y hacemos negocios.\n",
"\n",
"Para obtener más información sobre una aplicación específica, te recomiendo buscar en Google: \"AI Generative [Aplicación Específica]\".\n",
"\n"
]
}
],
"source": [
"response = requests.post(OLLAMA_API, json=payload, headers=HEADERS)\n",
"print(response.json()['message']['content'])"
]
},
{
"cell_type": "markdown",
"id": "6a021f13-d6a1-4b96-8e18-4eae49d876fe",
"metadata": {},
"source": [
"# Presentación del paquete ollama\n",
"\n",
"Ahora haremos lo mismo, pero utilizando el elegante paquete de Python ollama en lugar de una llamada HTTP directa.\n",
"\n",
"En esencia, se realiza la misma llamada que se indicó anteriormente al servidor ollama que se ejecuta en localhost:11434"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7745b9c4-57dc-4867-9180-61fa5db55eb8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"La IA generativa ha avanzado a un ritmo vertiginoso y está abriendo un abanico de aplicaciones comerciales en diversos sectores. Aquí te presento algunas de las más destacadas, divididas por áreas:\n",
"\n",
"**1. Marketing y Ventas:**\n",
"\n",
"* **Generación de contenido:**\n",
" * **Textos:** Crea contenido de marketing (descripciones de productos, publicaciones para redes sociales, correos electrónicos, artículos de blog, etc.) a partir de una simple descripción inicial.\n",
" * **Imágenes y vídeos:** Genera imágenes, vídeos cortos, elementos visuales y animaciones para campañas publicitarias y marketing.\n",
" * **Guiones:** Crea guiones para vídeos y anuncios.\n",
"* **Personalización del marketing:** Diseña e-mails, mensajes y experiencias personalizadas para segmentos específicos de clientes, optimizando la efectividad de las campañas.\n",
"* **Generación de scripts:** Crea scripts para vídeos explicativos, vídeos de testimonios y otros contenidos promocionales.\n",
"* **Creación de Avatar y personajes:** Desarrolla personajes y representaciones visuales para campañas publicitarias y marketing digital.\n",
"* **Análisis de Sentimientos:** Analiza comentarios de clientes y redes sociales para comprender las emociones y el sentimiento hacia la marca.\n",
"\n",
"**2. Diseño y Producción:**\n",
"\n",
"* **Diseño gráfico:** Genera diseños de logos, ilustraciones, banner y otros materiales de marketing visualmente atractivos a partir de descripciones textuales o ejemplos.\n",
"* **Diseño de productos:** Crea diseños de prototipos de productos, incluyendo modelos 3D, patrones y elementos visuales.\n",
"* **Diseño de videojuegos:** Genera assets visuales (personajes, entornos, objetos) de forma rápida y eficiente, facilitando la producción de juegos.\n",
"* **Diseño de moda:** Sugiere diseños de ropa, patrones y prendas, basándose en tendencias y preferencias.\n",
"* **Diseño de interiores:** Genera diseños de interiores y espacios, considerando estética, funcionalidad y presupuesto.\n",
"* **Diseño de productos de software:** Genera código de software básico, componentes y documentación.\n",
"\n",
"**3. Desarrollo de Software y Tecnología:**\n",
"\n",
"* **Generación de código:**\n",
" * **Automatización de la escritura de código:** La IA generativa puede generar fragmentos de código, funciones y clases a partir de descripciones en lenguaje natural.\n",
" * **Refactorización de código:** Mejora la legibilidad, el rendimiento y la mantenibilidad del código existente.\n",
"* **Desarrollo de APIs:** Crea APIs que automatizan tareas y proporcionen funcionalidades a otros desarrolladores.\n",
"* **Generación de pruebas automatizadas:** Genera casos de prueba a partir de requisitos de software.\n",
"* **Generación de documentación:** Crea documentación de código, APIs y procedimientos.\n",
"* **Inteligencia Artificial para Desarrollo de Software:** Ayuda a los desarrolladores a tomar decisiones rápidas sobre la arquitectura, el diseño o el código, basándose en los datos y la lógica de negocio.\n",
"\n",
"**4. Salud y Bienestar:**\n",
"\n",
"* **Descubrimiento de fármacos:** Genera compuestos químicos con propiedades específicas para la investigación farmacéutica.\n",
"* **Diseño de tratamientos:** Sugiere tratamientos personalizados basados en el historial del paciente.\n",
"* **Creación de simulación médica:** Genera datos de simulación para probar nuevos modelos médicos y terapias.\n",
"* **Asistencia en la interpretación de imágenes médicas:** Analiza imágenes de resonancia magnética (RM) o tomografías computarizadas (TC) para detectar anomalías y ayudar al diagnóstico.\n",
"* **Generación de planes de ejercicios personalizados:** Creación de planes de entrenamiento y ejercicios basados en el estado físico y las metas del usuario.\n",
"\n",
"**5. Legal y Administración:**\n",
"\n",
"* **Redacción de contratos:** Genera borradores de contratos y documentos legales.\n",
"* **Análisis legal:** Analiza documentos legales, identifica cláusulas importantes y genera resúmenes.\n",
"* **Preparación de documentación legal:** Automatiza la preparación de documentos legales, como peticiones y formularios.\n",
"* **Gestión de la documentación:** Genera resúmenes y extractos de documentos legales.\n",
"\n",
"**6. Entretenimiento y Medios:**\n",
"\n",
"* **Generación de música:** Crea música original en diferentes estilos.\n",
"* **Creación de arte digital:** Genera imágenes, ilustraciones y videos artísticos.\n",
"* **Generación de texto narrativo:** Crea historias, poemas o cuentos.\n",
"* **Creación de personajes de videojuegos:** Crea personajes complejos y con backstory.\n",
"* **Creación de efectos visuales para videojuegos:** Generar texturas, efectos de luz y animaciones realistas.\n",
"\n",
"**Consideraciones Importantes:**\n",
"\n",
"* **Calidad de la Generación:** La calidad de la IA generativa depende de la calidad de los datos de entrenamiento y la complejidad de la tarea.\n",
"* **Sesgo:** La IA generativa puede reflejar los sesgos presentes en los datos de entrenamiento.\n",
"* **Ética:** Es importante considerar las implicaciones éticas de su uso, especialmente en áreas como la creación de contenido o la toma de decisiones.\n",
"* **Originalidad:** La IA generativa genera contenido, por lo que la originalidad y la propiedad intelectual son importantes.\n",
"\n",
"**En resumen, la IA generativa está transformando la forma en que las empresas abordan una amplia gama de industrias, impulsando la innovación, la eficiencia y la creatividad.**\n",
"\n",
"¿Te gustaría explorar alguna de estas aplicaciones en más detalle, o quizás te interese saber sobre un sector específico donde la IA generativa está teniendo un impacto notable?\n"
]
}
],
"source": [
"import ollama\n",
"\n",
"response = ollama.chat(model=MODEL, messages=messages)\n",
"print(response['message']['content'])"
]
},
{
"cell_type": "markdown",
"id": "d8a65815",
"metadata": {},
"source": [
    "# NOW the exercise for you\n",
    "\n",
    "Take the code from day 1 and bring it in here to build a website summarizer that uses Llama 3.2 running locally instead of OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "9a611b05-b5b0-4c83-b82d-b3a39ffb917d",
"metadata": {},
"outputs": [],
"source": [
    "import ollama\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "MODEL = \"gemma3:1b\"\n",
    "\n",
    "\n",
    "class Website:\n",
    "    def __init__(self, url):\n",
    "        self.url = url\n",
    "        response = requests.get(url)\n",
    "        soup = BeautifulSoup(response.content, 'html.parser')\n",
    "        self.title = soup.title.string if soup.title else \"No tiene título\"\n",
    "        for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
    "            irrelevant.decompose()\n",
    "        self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n",
    "\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "292ed29a",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Okay, heres a brief summary of the provided text, suitable for a quick understanding:\n",
"\n",
"Anthropic is developing AI agents systems that can autonomously operate and complete complex tasks through a series of principles and design choices focused on simplicity, transparency, and human oversight. The text outlines a process of building these agents, starting with simple prompts and gradually incorporating more sophisticated tools and frameworks. Key areas of focus include: understanding the nuances of agent-computer interfaces (ACI), managing tool-specific error handling, and prioritizing clear documentation and testing. The document emphasizes best practices for tool development, including careful parameterization and defining clear boundaries, as well as the importance of creating trust-worthy agent-computer interfaces. Finally, it highlights key design choices for building robust and reliable AI agents, including integrating feedback loops and careful consideration of risk and complexity, all while emphasizing the importance of continuous learning and iteration."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"system_prompt = \"Eres un asistente que analiza el contenido de un sitio web \\\n",
" y proporciona un breve resumen, ignorando el texto que podría estar relacionado con la navegación. \\\n",
"Responder en Markdown.\"\n",
"\n",
"def user_prompt_for(website):\n",
" user_prompt = f\"Estás viendo un sitio web titulado {website.title}\"\n",
" user_prompt += \"\\nEl contenido de este sitio web es el siguiente; \\\n",
" proporciona un breve resumen de este sitio web en formato Markdown. \\\n",
" Si el contenido no está en español, traduce el resumen automáticamente al español.\\n\\n\"\n",
" user_prompt += website.text\n",
" return user_prompt\n",
"\n",
"def messages_for(website):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
" ]\n",
"\n",
"\n",
"def summarize(url):\n",
" website = Website(url)\n",
" messages = messages_for(website)\n",
" response = ollama.chat(model=MODEL, messages=messages)\n",
" return response['message']['content']\n",
"\n",
"\n",
"def display_summary(url):\n",
" summary = summarize(url)\n",
" display(Markdown(summary))\n",
"\n",
"display_summary(\"https://www.anthropic.com/engineering/building-effective-agents\")"
]
},
{
"cell_type": "markdown",
"id": "4fd1d4f4",
"metadata": {},
"source": [
    "# Contribution: Batch mode and Spanish translation"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "45e73875",
"metadata": {},
"outputs": [],
"source": [
    "# New: batch mode\n",
    "\n",
    "def batch_summarize(urls):\n",
    "    \"\"\"\n",
    "    Process multiple URLs and return a {url: summary} dictionary.\n",
    "    Summaries are translated to Spanish when needed.\n",
    "    \"\"\"\n",
    "    summaries = {}\n",
    "    for url in urls:\n",
    "        try:\n",
    "            summaries[url] = summarize(url)\n",
    "        except Exception as e:\n",
    "            summaries[url] = f\"Error al procesar {url}: {e}\"\n",
    "    return summaries\n",
    "\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "4a3d6881-35af-469e-a8ff-a8b62b721d95",
"metadata": {},
"outputs": [],
"source": [
"urls = [\n",
" \"https://ir.uitm.edu.my/id/eprint/29153/\",\n",
" \"https://www.worldvaluessurvey.org/wvs.jsp\",\n",
" \"https://www.wonderfulcopenhagen.com/cll\",\n",
" \"https://www.bellingcat.com\"\n",
"]\n",
"\n",
"def display_summaries(urls):\n",
" summaries = batch_summarize(urls)\n",
" for url, summary in summaries.items():\n",
" text = f\"\"\"\n",
"### [{url}]({url})\n",
"\n",
"{summary}\n",
"\n",
"---\n",
"\"\"\"\n",
" display(Markdown(text))"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "da01aa6a",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"### [https://ir.uitm.edu.my/id/eprint/29153/](https://ir.uitm.edu.my/id/eprint/29153/)\n",
"\n",
"El resumen de la publicación \"Key success factors toward MICE industry: A systematic literature review / Muhammad Saufi Anas ... [et al.]\" (JTHCA, 2020) es el siguiente:\n",
"\n",
"La publicación investiga los factores críticos que contribuyen al éxito de la industria MICE. El trabajo realiza una revisión sistemática de literatura sobre 39 publicaciones sobre este tema, evaluando las estrategias de las empresas que operan en este sector. Los hallazgos revelan que factores como la motivación de los viajeros de negocios, la percepción sobre la industria, la motivación de los asistentes a eventos y la importancia de los indicadores clave de rendimiento, la satisfacción de los asistentes y las estrategias de marketing, las tendencias tecnológicas y los desafíos enfrentados por la industria son cruciales. El estudio identifica la importancia de la sostenibilidad de la industria. El trabajo ofrece información útil para comprender las estrategias de desarrollo de la industria, los indicadores clave de éxito, los desafíos y las formas de impulsar la sostenibilidad de la industria MICE.\n",
"\n",
"\n",
"---\n"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### [https://www.worldvaluessurvey.org/wvs.jsp](https://www.worldvaluessurvey.org/wvs.jsp)\n",
"\n",
"Okay, heres a summary of the WVS Database content, formatted as Markdown:\n",
"\n",
"**The World Values Survey (WVS) Database Summary**\n",
"\n",
"The WVS Database is a comprehensive resource documenting the findings of the Ronald F. Inglehart Best Book Award on Political Culture and Values, focusing on comparative cross-cultural research. It provides details on:\n",
"\n",
"* **The Award Process:** Details about the award criteria, nominations, and selection process for the 2025 award.\n",
"* **Recent Webinars:** A detailed timeline of recent webinars, including:\n",
" * **July 15th:** Discussion of Autocratic Modernity and Psychological Cracks in a Dystopian Model.\n",
" * **June 27th:** Global Attitudes Toward Abortion: Insights from the World Values Survey.\n",
" * **June 18th:** Ideological Extremism and Polarization on a Global Scale.\n",
" * **April 25th:** Elite-Citizen Gap in International Organization Legitimacy.\n",
" * **April 08th:** The Persistence of Traditional Values and the Limited Global Appeal of Democracy.\n",
" * **March 25th:** Predicting Homonegativity in Southeast Asian Countries Using Survey Data.\n",
" * **May 20th:** Measuring National Parochialism and Explaining its Individual Variations Using Survey Data.\n",
"* **Ongoing Research & Events:** Information on ongoing research projects and events hosted by the WVSA, including:\n",
" * **Social Networks:** A Twitter feed highlighting discussions about the WVS research.\n",
" * **Latest News & Events:** A calendar of events and news updates.\n",
" * **Events:** Details of the WVSAs secretariat and headquarters Christian Haerpfer, Christian Diaz-Medrano, and Jaime Diez-Medrano.\n",
" * **Data Archive:** Information about the database's archival materials.\n",
"\n",
"The database primarily centers around the research and analysis generated by the Ronald F. Inglehart Best Book Award, exploring comparative cross-cultural phenomena and political attitudes.\n",
"\n",
"---\n",
"\n",
"Would you like me to elaborate on any specific section or aspect of the database content?\n",
"\n",
"---\n"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### [https://www.wonderfulcopenhagen.com/cll](https://www.wonderfulcopenhagen.com/cll)\n",
"\n",
"El Copenhagen Legacy Lab es un proyecto de la organización Wonderful Copenhagen que se centra en impulsar el impacto a largo plazo de las conferencias y eventos. El laboratorio, impulsado por una investigación doctoral, ha desarrollado un marco de legado basado en datos y experiencia, y ha recibido varios reconocimientos, incluyendo el “2024 GDS-Movement” y el “#MEET4IMPACT” Impact Award. El objetivo principal es apoyar a las asociaciones internacionales en alcanzar sus objetivos estratégicos, al mismo tiempo que respaldan las prioridades nacionales y las fortalezas locales para generar valor social a nivel local y global. El laboratorio ofrece servicios gratuitos, pero la participación y los recursos deben priorizarse por parte de las asociaciones y/o los hostes. En esencia, el lab busca facilitar la colaboración entre asociaciones internacionales y partes interesadas locales, identificando intereses comunes y delineando acciones concretas para el legado de los eventos. El sitio web proporciona testimonios de asociaciones que han utilizado la iniciativa, y ofrece enlaces a la página de la Convention Bureau, la página de la Card de Copenhague y la página de Wonderful Copenhagen.\n",
"\n",
"---\n"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### [https://www.bellingcat.com](https://www.bellingcat.com)\n",
"\n",
"Okay, here's a Markdown summary of the Bellingcat website, focusing on the core information presented:\n",
"\n",
"**Bellingcat Overview:**\n",
"\n",
"Bellingcat is an online investigative journalism organization dedicated to uncovering wrongdoing and holding power accountable. They focus on investigations related to:\n",
"\n",
"* **Ukraine:** Specifically, tracking traffic stops involving delivery riders and workers, investigations related to immigration, and analyzing misinformation campaigns.\n",
"* **Conflict:** Focuses on analyzing events and data related to conflict zones, particularly in Ukraine and surrounding regions.\n",
"* **China:** Investigating potential connections between Chinese fentanyl smuggling networks and Japan.\n",
"* **Global Affairs:** Covers a wide range of global issues, including misinformation, legal accountability, and geopolitical investigations.\n",
"\n",
"**Key Activities & Features:**\n",
"\n",
"* **Investigations:** They publish detailed investigations, including data analysis, geospatial mapping, and source material.\n",
"* **Guides:** Provides accessible resources to help researchers and journalists with open-source investigation techniques.\n",
"* **Workshops:** Offers online training workshops on open-source tools and techniques.\n",
"* **Community:** Encourages collaboration through Discord and other platforms.\n",
"* **Tools:** Offers open-source tools for researchers to utilize.\n",
"* **Newsletter:** Delivers timely content and updates.\n",
"* **Collaboration:** Facilitates a global community of researchers and journalists.\n",
"* **Data Journalism:** Focuses on visualising data and uncovering stories.\n",
"\n",
"**Recent Highlights:**\n",
"\n",
"* Discovered a key administrator behind an AI porn site.\n",
"* Confirmed a Dutch political partys involvement in calling for Canadas extradition.\n",
"* Won the AI Neuharth Innovation Award.\n",
"\n",
"**Overall Purpose:**\n",
"\n",
"Bellingcat aims to empower investigative journalism and transparency by providing resources, tools, and a collaborative community for uncovering wrongdoing and holding power accountable, particularly in politically sensitive areas like Ukraine and China.\n",
"\n",
"---\n",
"\n",
"Let me know if you'd like me to elaborate on any specific aspect or provide a more detailed breakdown!\n",
"\n",
"---\n"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_summaries(urls)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f268c86",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
   "display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,309 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fe12c203-e6a6-452c-a655-afb8a03a4ff5",
"metadata": {},
"source": [
"# Week 1 Exercise | Study Guide Generation with Llama 3.2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1070317-3ed9-4659-abe3-828943230e03",
"metadata": {
"editable": false,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import requests\n",
"import json\n",
"import re\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a456906-915a-4bfd-bb9d-57e505c5093f",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n",
"MODEL = 'llama3.2'"
]
},
{
"cell_type": "markdown",
"id": "5cd638a2-ab65-41cf-97bb-673c3ec117c4",
"metadata": {},
"source": [
"### 1. Web Scraper"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "504f3bce-f922-46a9-844a-b13d47507b8a",
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"\n",
" def __init__(self, url):\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" self.body = response.content\n",
" soup = BeautifulSoup(self.body, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" if soup.body:\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n",
" else:\n",
" self.text = \"\"\n",
" links = [link.get('href') for link in soup.find_all('a')]\n",
" self.links = [link for link in links if link]\n",
"\n",
" def get_contents(self):\n",
" return f\"Webpage Title:\\n{self.title}\\nWebpage Contents:\\n{self.text}\\n\\n\""
]
},
{
"cell_type": "markdown",
"id": "2bbf43c5-774d-4d4e-91ff-772781fdfeaf",
"metadata": {},
"source": [
"### 2. Curriculum Extraction"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f0d0137-52b0-47a8-81a8-11a90a010798",
"metadata": {},
"outputs": [],
"source": [
"curriculum_system_prompt = \"\"\"You are provided with the text content of a webpage. \n",
"Your task is to design a student-friendly curriculum from this content. \n",
"Break down the material into clear modules or lessons, each with a title and a short description. \n",
"Focus on organizing the information in a logical order, as if preparing a study plan.\n",
"\n",
"You should respond in JSON as in this example:\n",
"{\n",
" \"curriculum\": [\n",
" {\n",
" \"module\": \"Introduction to Machine Learning\",\n",
" \"description\": \"Basic concepts and history of machine learning, why it matters, and common applications.\"\n",
" },\n",
" {\n",
" \"module\": \"Supervised Learning\",\n",
" \"description\": \"Learn about labeled data, classification, and regression methods.\"\n",
" },\n",
" {\n",
" \"module\": \"Unsupervised Learning\",\n",
" \"description\": \"Understand clustering, dimensionality reduction, and when to use unsupervised approaches.\"\n",
" }\n",
" ]\n",
"}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d89a0be8-0254-43b5-ab9a-6224069a1246",
"metadata": {},
"outputs": [],
"source": [
"def get_curriculum_user_prompt(website):\n",
" user_prompt = f\"Here is the text content of the website at {website.url}:\\n\\n\"\n",
" user_prompt += website.text\n",
" user_prompt += \"\\n\\nPlease create a student-friendly curriculum from this content. \"\n",
" user_prompt += \"Break it down into clear modules or lessons, each with a title and a short description. \"\n",
" user_prompt += \"Return your response in JSON format\"\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da74104c-81a3-4d12-a377-e202ddfe57bc",
"metadata": {},
"outputs": [],
"source": [
"def get_curriculum(website):\n",
" stream = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": curriculum_system_prompt},\n",
" {\"role\": \"user\", \"content\": get_curriculum_user_prompt(website)}\n",
" ],\n",
" stream=True\n",
" )\n",
" response_text = \"\"\n",
" display_handle = display(Markdown(\"\"), display_id=True)\n",
" for chunk in stream:\n",
" delta = chunk.choices[0].delta.content or ''\n",
" response_text += delta\n",
" update_display(Markdown(response_text), display_id=display_handle.display_id)\n",
" try:\n",
" json_text = re.search(r\"\\{.*\\}\", response_text, re.DOTALL).group()\n",
" curriculum_json = json.loads(json_text)\n",
" except Exception as e:\n",
" print(\"Failed to parse JSON:\", e)\n",
" curriculum_json = {\"error\": \"JSON parse failed\", \"raw\": response_text}\n",
"\n",
" return curriculum_json"
]
},
{
"cell_type": "markdown",
"id": "df68eafc-e529-400c-a61b-0140c38909a3",
"metadata": {},
"source": [
"### 3. Study Guide"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b3db9d4-5edd-4a0c-8d5c-45ea455d8eb0",
"metadata": {},
"outputs": [],
"source": [
"guide_system_prompt = \"\"\"You are an educational assistant. \n",
"You are given a curriculum JSON with modules and descriptions.\n",
"Your task is to create a student-friendly study guide based on this curriculum.\n",
"- Organize the guide step by step, with clear headings, tips, and examples where appropriate.\n",
"- Make it engaging and easy to follow.\n",
"- Adapt the content according to the student's level, language, and tone.\n",
"- Always respond in markdown format suitable for a student guide.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16f85360-6f06-4bb3-878a-5f3b8d8f20d7",
"metadata": {},
"outputs": [],
"source": [
"def get_study_guide_user_prompt(curriculum_json, student_level=\"beginner\", language=\"English\", tone=\"friendly\"):\n",
" return f\"\"\"\n",
" Student Level: {student_level}\n",
" Language: {language}\n",
" Tone: {tone}\n",
" \n",
" Here is the curriculum JSON:\n",
" \n",
" {json.dumps(curriculum_json, indent=2)}\n",
" \n",
" Please convert it into a study guide for the student.\n",
" \"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc9b949d-df2b-475c-9a84-597a47ed6e85",
"metadata": {},
"outputs": [],
"source": [
"def stream_study_guide(curriculum_json, student_level=\"beginner\", language=\"English\", tone=\"friendly\"):\n",
" \n",
" user_prompt = get_study_guide_user_prompt(curriculum_json, student_level, language, tone)\n",
" stream = openai.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": guide_system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" stream=True\n",
" )\n",
"\n",
" response_text = \"\"\n",
" display_handle = display(Markdown(\"\"), display_id=True)\n",
" for chunk in stream:\n",
" delta = chunk.choices[0].delta.content or ''\n",
" response_text += delta\n",
" update_display(Markdown(response_text), display_id=display_handle.display_id)\n",
" \n",
" return response_text"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c289b7c-c991-45b5-adc3-7468af393e50",
"metadata": {},
"outputs": [],
"source": [
"page = Website(\"https://en.wikipedia.org/wiki/Rock_and_roll\")\n",
"curriculum_json = get_curriculum(page)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c697d63-2230-4e04-a28b-c0e8fc85753e",
"metadata": {},
"outputs": [],
"source": [
"study_guide_text = stream_study_guide(\n",
" curriculum_json,\n",
" student_level=\"beginner\",\n",
" language=\"English\",\n",
" tone=\"friendly\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0960f87-fd29-4ae3-8405-f4fde1f50f89",
"metadata": {},
"outputs": [],
"source": [
"study_guide_text = stream_study_guide(\n",
" curriculum_json,\n",
" student_level=\"advanced\",\n",
" language=\"English\",\n",
" tone=\"professional, detailed\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}