LLM_Engineering_OLD/week1/community-contributions/day2_exercise_ollama_website_summarization.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
   "metadata": {},
   "source": [
    "# Welcome to your first assignment!\n",
    "\n",
    "Instructions are below. Please give this a try, and look in the solutions folder if you get stuck (or feel free to ask me!)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ada885d9-4d42-4d9b-97f0-74fbbbfe93a9",
   "metadata": {},
   "source": [
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#f71;\">Just before we get to the assignment --</h2>\n",
    "            <span style=\"color:#f71;\">I thought I'd take a second to point you at this page of useful resources for the course. This includes links to all the slides.<br/>\n",
    "            <a href=\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>\n",
    "            Please keep this bookmarked, and I'll continue to add more useful links there over time.\n",
    "            </span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cc85216-f6e4-436e-b6c1-976c8f2d1152",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install webdriver-manager\n",
    "!pip install selenium"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from IPython.display import Markdown, display\n",
    "import ollama\n",
    "from openai import OpenAI\n",
    "from selenium import webdriver\n",
    "from selenium.webdriver.chrome.options import Options\n",
    "from selenium.webdriver.chrome.service import Service\n",
    "from webdriver_manager.chrome import ChromeDriverManager\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29ddd15d-a3c5-4f4e-a678-873f56162724",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Constants\n",
    "MODEL = \"llama3.2\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "479ff514-e8bd-4985-a572-2ea28bb4fa40",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's just make sure the model is loaded\n",
    "\n",
    "!ollama pull llama3.2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a021f13-d6a1-4b96-8e18-4eae49d876fe",
   "metadata": {},
   "source": [
    "# Introducing the ollama package\n",
    "\n",
    "And now we'll do the same thing, but using the elegant ollama python package instead of a direct HTTP call.\n",
    "\n",
    "Under the hood, it's making the same call as above to the ollama server running at localhost:11434"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4704e10-f5fb-4c15-a935-f046c06fb13d",
   "metadata": {},
   "source": [
    "## Alternative approach - using OpenAI python library to connect to Ollama"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23057e00-b6fc-4678-93a9-6b31cb704bff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# There's actually an alternative approach that some people might prefer\n",
    "# You can use the OpenAI client python library to call Ollama:\n",
    "\n",
    "\n",
    "ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1622d9bb-5c68-4d4e-9ca4-b492c751f898",
   "metadata": {},
   "source": [
    "# NOW the exercise for you\n",
    "\n",
    "Take the code from day1 and incorporate it here, to build a website summarizer that uses Llama 3.2 running locally instead of OpenAI; use either of the above approaches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8251b6a5-7b43-42b9-84a9-4a94b6bdb933",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A class to represent a Webpage\n",
    "class ScrapeWebsite:\n",
    "    def __init__(self, url):\n",
    "        \"\"\"\n",
    "        Create this Website object from the given URL using Selenium + BeautifulSoup\n",
    "        Supports JavaScript-heavy and normal websites uniformly.\n",
    "        \"\"\"\n",
    "        self.url = url\n",
    "\n",
    "        # Configure headless Chrome\n",
    "        options = Options()\n",
    "        options.add_argument('--headless')\n",
    "        options.add_argument('--no-sandbox')\n",
    "        options.add_argument('--disable-dev-shm-usage')\n",
    "\n",
    "        # Use webdriver-manager to manage ChromeDriver\n",
    "        service = Service(ChromeDriverManager().install())\n",
    "\n",
    "        # Initialize the Chrome WebDriver with the service and options\n",
    "        driver = webdriver.Chrome(service=service, options=options)\n",
    "\n",
    "        # Start Selenium WebDriver\n",
    "        driver.get(url)\n",
    "\n",
    "        # Wait for JS to load (adjust as needed)\n",
    "        time.sleep(3)\n",
    "\n",
    "        # Fetch the page source after JS execution\n",
    "        page_source = driver.page_source\n",
    "        driver.quit()\n",
    "\n",
    "        # Parse the HTML content with BeautifulSoup\n",
    "        soup = BeautifulSoup(page_source, 'html.parser')\n",
    "\n",
    "        # Extract title\n",
    "        self.title = soup.title.string if soup.title else \"No title found\"\n",
    "\n",
    "        # Remove unnecessary elements\n",
    "        for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
    "            irrelevant.decompose()\n",
    "\n",
    "        # Extract the main text\n",
    "        self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6de38216-6d1c-48c4-877b-86d403f4e0f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.\"\n",
    "\n",
    "system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
    "and provides a short summary, ignoring text that might be navigation related. \\\n",
    "Respond in markdown.\"\n",
    "\n",
    "# A function that writes a User Prompt that asks for summaries of websites:\n",
    "\n",
    "def user_prompt_for(website):\n",
    "    user_prompt = f\"You are looking at a website titled {website.title}\"\n",
    "    user_prompt += \"\\nThe contents of this website is as follows; \\\n",
    "please provide a short summary of this website in markdown. \\\n",
    "If it includes news or announcements, then summarize these too.\\n\\n\"\n",
    "    user_prompt += website.text\n",
    "    return user_prompt\n",
    "\n",
    "def messages_for(website):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
    "    ]\n",
    "\n",
    "# And now: call the OpenAI API. You will get very familiar with this!\n",
    "\n",
    "def summarize(url):\n",
    "    website = ScrapeWebsite(url)\n",
    "    response = ollama_via_openai.chat.completions.create(\n",
    "        model = MODEL,\n",
    "        messages = messages_for(website)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5dbf8d5c-a42a-4a72-b3a4-c75865b841bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = summarize(\"https://edwarddonner.com/2024/11/13/llm-engineering-resources/\")\n",
    "display(Markdown(summary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ddfacdc-b16a-4999-9ff2-93ed19600d24",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}