{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "53211323-6a09-452a-b471-98e22d92bfc2", "metadata": {}, "source": [ "# 🌐 WebPage Summarizer\n", "---\n", "- 🌍 **Task:** Summarizing webpage content using AI. \n", "- 🧠 **Model:** OpenAI's ``gpt-4o-mini`` and ``llama3.2:3b`` for text summarization. \n", "- 🕵️‍♂️ **Data Extraction:** Selenium for handling both static and JavaScript-rendered websites. \n", "- 📌 **Output Format:** Markdown-formatted summaries. \n", "- 🔗 **Scope:** Processes only the given webpage URL (not the entire site). \n", "- 🚀 **Tools:** Python, Requests, Selenium, BeautifulSoup, OpenAI API, Ollama. \n", "- 🧑‍💻 **Skill Level:** Beginner.\n", "\n", "🛠️ Requirements\n", "- ⚙️ Hardware: ✅ CPU is sufficient — no GPU required\n", "- 🔑 OpenAI API Key (for GPT model)\n", "- Install Ollama and pull llama3.2:3b or another lightweight model\n", "- Google Chrome browser installed\n", "\n", "**✨ This script handles both JavaScript and non-JavaScript websites using Selenium with Chrome WebDriver for reliable content extraction from modern web applications.**\n", "\n", "Let's get started and automate website summarization! 🚀\n", "\n", "![](https://github.com/lisekarimi/lexo/blob/main/assets/01_basic_llm_project.jpg?raw=true)\n", "\n", "---\n", "📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)" ] }, { "cell_type": "markdown", "id": "d70aa4b0", "metadata": {}, "source": [ "## 🛠️ Environment Setup & Dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "ebf2fa36", "metadata": {}, "outputs": [], "source": [ "%pip install selenium webdriver-manager" ] }, { "cell_type": "code", "execution_count": null, "id": "1dcf1d9d-c540-4900-b14e-ad36a28fc822", "metadata": {}, "outputs": [], "source": [ "# ===========================\n", "# System & Environment\n", "# ===========================\n", "import os\n", "from dotenv import load_dotenv\n", "\n", "# ===========================\n", "# Web Scraping\n", "# ===========================\n", "import time\n", "from bs4 import BeautifulSoup\n", "from selenium import webdriver\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "# ===========================\n", "# AI-related\n", "# ===========================\n", "from IPython.display import Markdown, display\n", "from openai import OpenAI\n", "import ollama" ] }, { "cell_type": "markdown", "id": "cc20642b", "metadata": {}, "source": [ "## 🔐 Model Configuration & Authentication" ] }, { "cell_type": "code", "execution_count": null, "id": "8598c299-05ca-492e-b085-6bcc2f7dda0d", "metadata": {}, "outputs": [], "source": [ "load_dotenv(override=True)\n", "api_key = os.getenv('OPENAI_API_KEY')\n", "\n", "if not api_key:\n", " raise ValueError(\"OPENAI_API_KEY not found in environment variables\")\n", "\n", "print(\"✅ API key loaded successfully!\")\n", "openai = OpenAI()" ] }, { "cell_type": "code", "execution_count": null, "id": "8098defb", "metadata": {}, "outputs": [], "source": [ "MODEL_OPENAI = \"gpt-4o-mini\"\n", "MODEL_OLLAMA = \"llama3.2:3b\"" ] }, { "cell_type": "markdown", "id": "2bd1d83f", "metadata": {}, "source": [ "## 🌐 Web Scraping Infrastructure" ] }, { "cell_type": "code", "execution_count": null, "id": "c6fe5114", "metadata": {}, "outputs": [], "source": [ "class WebsiteCrawler:\n", " def __init__(self, url):\n", " self.url = 
url\n",
"        self.title = \"\"\n",
"        self.text = \"\"\n",
"        self.scrape()\n",
"\n",
"    def scrape(self):\n",
"        driver = None\n",
"        try:\n",
"            # Chrome options\n",
"            chrome_options = Options()\n",
"            chrome_options.add_argument(\"--headless\")\n",
"            chrome_options.add_argument(\"--no-sandbox\")\n",
"            chrome_options.add_argument(\"--disable-dev-shm-usage\")\n",
"            chrome_options.add_argument(\"--disable-gpu\")\n",
"            chrome_options.add_argument(\"--window-size=1920,1080\")\n",
"            chrome_options.add_argument(\"--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36\")\n",
"\n",
"            # Try to find Chrome in common Windows install locations\n",
"            chrome_paths = [\n",
"                r\"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe\",\n",
"                r\"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe\",\n",
"                r\"C:\\Users\\{}\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe\".format(os.getenv('USERNAME')),\n",
"            ]\n",
"\n",
"            chrome_binary = None\n",
"            for path in chrome_paths:\n",
"                if os.path.exists(path):\n",
"                    chrome_binary = path\n",
"                    break\n",
"\n",
"            if chrome_binary:\n",
"                chrome_options.binary_location = chrome_binary\n",
"\n",
"            # Create driver\n",
"            driver = webdriver.Chrome(options=chrome_options)\n",
"            driver.set_page_load_timeout(30)\n",
"\n",
"            print(f\"🔍 Loading: {self.url}\")\n",
"            driver.get(self.url)\n",
"\n",
"            # Wait for page to load\n",
"            time.sleep(5)\n",
"\n",
"            # Try to wait for main content\n",
"            try:\n",
"                WebDriverWait(driver, 10).until(\n",
"                    EC.presence_of_element_located((By.TAG_NAME, \"main\"))\n",
"                )\n",
"            except Exception:\n",
"                try:\n",
"                    WebDriverWait(driver, 10).until(\n",
"                        EC.presence_of_element_located((By.TAG_NAME, \"body\"))\n",
"                    )\n",
"                except Exception:\n",
"                    pass  # Continue anyway\n",
"\n",
"            # Get title and page source\n",
"            self.title = driver.title\n",
"            page_source = driver.page_source\n",
"\n",
"            print(f\"✅ Page loaded: {self.title}\")\n",
"\n",
"            # Parse with BeautifulSoup\n",
"            soup = BeautifulSoup(page_source, 'html.parser')\n",
"\n",
"            # Remove unwanted elements\n",
"            for element in soup([\"script\", \"style\", \"img\", \"input\", \"button\", \"nav\", \"footer\", \"header\"]):\n",
"                element.decompose()\n",
"\n",
"            # Get main content (find() takes a tag name or class_, not a CSS selector)\n",
"            main = soup.find('main') or soup.find('article') or soup.find(class_='content') or soup.find('body')\n",
"            if main:\n",
"                self.text = main.get_text(separator=\"\\n\", strip=True)\n",
"            else:\n",
"                self.text = soup.get_text(separator=\"\\n\", strip=True)\n",
"\n",
"            # Clean up text\n",
"            lines = [line.strip() for line in self.text.split('\\n') if line.strip() and len(line.strip()) > 2]\n",
"            self.text = '\\n'.join(lines[:200])  # Limit to first 200 lines\n",
"\n",
"            print(f\"📄 Extracted {len(self.text)} characters\")\n",
"\n",
"        except Exception as e:\n",
"            print(f\"❌ Error occurred: {e}\")\n",
"            self.title = \"Error occurred\"\n",
"            self.text = \"Could not scrape website content\"\n",
"\n",
"        finally:\n",
"            # Always close the browser, even if scraping failed part-way through\n",
"            if driver is not None:\n",
"                driver.quit()"
] },
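{ "cell_type": "markdown", "id": "a3f9c2e1", "metadata": {}, "source": [ "Before wiring the crawler into the summarizers, it can help to sanity-check it on its own. The next cell is a minimal, optional example: the URL is only a placeholder, so swap in any page you want to inspect." ] },
{ "cell_type": "code", "execution_count": null, "id": "b7d4e8f2", "metadata": {}, "outputs": [], "source": [
"# Optional sanity check for the WebsiteCrawler class.\n",
"# The URL below is just an example; replace it with any page you like.\n",
"site = WebsiteCrawler(\"https://example.com\")\n",
"print(site.title)\n",
"print(site.text[:500])  # Preview the first 500 extracted characters"
] },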
{ "cell_type": "markdown", "id": "d727feff", "metadata": {}, "source": [ "## 🧠 Prompt Engineering & Templates" ] },
{ "cell_type": "code", "execution_count": null, "id": "02e3a673-a8a1-4101-a441-3816f7ab9e4d", "metadata": {}, "outputs": [], "source": [
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
] },
{ "cell_type": "code", "execution_count": null, "id": "86bb80f9-9e7c-4825-985f-9b83fe50839f", "metadata": {}, "outputs": [], "source": [
"def user_prompt_for(website):\n",
"    user_prompt = f\"You are looking at a website titled {website.title}\"\n",
"    user_prompt += \"\\nThe contents of this website are as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\\n\\n\"\n",
"    user_prompt += website.text\n",
"    return user_prompt"
] },
{ "cell_type": "code", "execution_count": null, "id": "89998b18-77aa-4aaf-a137-f0d078d61f75", "metadata": {}, "outputs": [], "source": [
"def messages_for(website):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
"    ]"
] },
{ "cell_type": "markdown", "id": "cde36d4f", "metadata": {}, "source": [ "## 📝 Summarization" ] },
{ "cell_type": "code", "execution_count": null, "id": "5636affe", "metadata": {}, "outputs": [], "source": [
"def summarize_gpt(url):\n",
"    \"\"\"Scrape website and summarize with GPT\"\"\"\n",
"    site = WebsiteCrawler(url)\n",
"\n",
"    if \"Error occurred\" in site.title or len(site.text) < 50:\n",
"        print(f\"❌ Failed to scrape meaningful content from {url}\")\n",
"        return\n",
"\n",
"    print(\"🤖 Creating summary...\")\n",
"\n",
"    # Create summary, reusing the shared message builder\n",
"    response = openai.chat.completions.create(\n",
"        model=MODEL_OPENAI,\n",
"        messages=messages_for(site)\n",
"    )\n",
"\n",
"    web_summary = response.choices[0].message.content\n",
"    display(Markdown(web_summary))\n",
"\n",
"summarize_gpt('https://openai.com')\n",
"# summarize_gpt('https://stripe.com')\n",
"# summarize_gpt('https://vercel.com')\n",
"# summarize_gpt('https://react.dev')"
] },
{ "cell_type": "code", "execution_count": null, "id": "90b9a8f8-0c1c-40c8-a4b3-e8e1fcd29df5", "metadata": {}, "outputs": [], "source": [
"def summarize_ollama(url):\n",
"    \"\"\"Scrape website and summarize with a local Ollama model\"\"\"\n",
"    website = WebsiteCrawler(url)\n",
"\n",
"    if \"Error occurred\" in website.title or len(website.text) < 50:\n",
"        print(f\"❌ Failed to scrape meaningful content from {url}\")\n",
"        return\n",
"\n",
"    response = ollama.chat(\n",
"        model=MODEL_OLLAMA,\n",
"        messages=messages_for(website))\n",
"    display(Markdown(response['message']['content']))  # Generate and display output\n",
"\n",
"summarize_ollama('https://github.com')\n",
"# summarize_ollama('https://nextjs.org')"
] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }