LLM_Engineering_OLD/week1/community-contributions/day1-selenium-web-summary-es-mx.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2588fbba",
   "metadata": {},
   "source": [
    "# Website Analysis and Summarization with Selenium and OpenAI\n",
    "\n",
    "> This notebook demonstrates how to extract and summarize the main content of any website using Selenium for dynamic extraction and OpenAI for generating concise summaries in Mexican Spanish.\n",
    "\n",
    "## Overview\n",
    "This notebook provides a workflow to automatically analyze websites, extract relevant text, and generate a short summary using a language model. Navigation elements are ignored, focusing on news, announcements, and main content.\n",
    "\n",
    "## Features\n",
    "- Extracts relevant text from web pages using Selenium and BeautifulSoup.\n",
    "- Generates automatic summaries using OpenAI's language models.\n",
    "- Presents results in markdown format.\n",
    "\n",
    "## Requirements\n",
    "- Python 3.8+\n",
    "- Google Chrome browser installed\n",
    "- The following Python packages:\n",
    "  - selenium\n",
    "  - webdriver-manager\n",
    "  - beautifulsoup4\n",
    "  - openai\n",
    "  - python-dotenv\n",
    "  - requests\n",
    "- An OpenAI API key (project key, starting with `sk-proj-`)\n",
    "- Internet connection\n",
    "\n",
    "## How to Use\n",
    "1. Install the required packages:\n",
    "   ```bash\n",
    "   pip install selenium webdriver-manager undetected-chromedriver beautifulsoup4 openai python-dotenv requests\n",
    "   ```\n",
    "2. Add your OpenAI API key to a `.env` file as `OPENAI_API_KEY`.\n",
    "3. Run the notebook cells in order. You can change the target website URL in the code to analyze different sites.\n",
    "4. The summary will be displayed in markdown format below the code cell.\n",
    "\n",
    "**Note:** Some websites may block automated access. The notebook includes options to simulate a real user and avoid bot detection, but results may vary depending on the site's protections.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc7c2ade",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import os\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "from bs4 import BeautifulSoup\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "\n",
    "from selenium import webdriver\n",
    "from selenium.webdriver.chrome.service import Service\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.webdriver.chrome.options import Options\n",
    "from selenium.webdriver.support.ui import WebDriverWait\n",
    "from selenium.webdriver.support import expected_conditions as EC\n",
    "from webdriver_manager.chrome import ChromeDriverManager\n",
    "import undetected_chromedriver as uc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2d21987",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the environment variables from .env\n",
    "load_dotenv(override=True)\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Check the key\n",
    "\n",
    "if not api_key:\n",
    "    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
    "elif not api_key.startswith(\"sk-proj-\"):\n",
    "    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
    "elif api_key.strip() != api_key:\n",
    "    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
    "else:\n",
    "    print(\"API key found and looks good so far!\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbb3a8ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai = OpenAI()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5313aa64",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Website:\n",
    "    def __init__(self, url, headless=True, wait_time=10):\n",
    "        self.url = url  # Website URL to analyze\n",
    "        self.title = None  # Title of the website\n",
    "        self.text = None  # Extracted text from the website\n",
    "        \n",
    "        # Chrome options configuration for Selenium\n",
    "        options = Options()\n",
    "        if headless:\n",
    "            options.add_argument(\"--headless=new\")  # Run Chrome in headless mode (no window)\n",
    "        options.add_argument(\"--disable-gpu\")  # Disable GPU acceleration\n",
    "        options.add_argument(\"--no-sandbox\")  # Disable Chrome sandbox (required for some environments)\n",
    "        options.add_argument(\"--window-size=1920,1080\")  # Set window size to simulate a real user\n",
    "        # Simulate a real user-agent to avoid bot detection\n",
    "        options.add_argument(\"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\")\n",
    "        \n",
    "        # Initialize Chrome WebDriver\n",
    "        self.driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)\n",
    "        self.driver.get(url)  # Open the URL in the browser\n",
    "        \n",
    "        try:\n",
    "            # Wait until the <body> element is present in the page\n",
    "            WebDriverWait(self.driver, wait_time).until(EC.presence_of_element_located((By.TAG_NAME, \"body\")))\n",
    "            html = self.driver.page_source  # Get the full HTML of the page\n",
    "            soup = BeautifulSoup(html, 'html.parser')  # Parse HTML with BeautifulSoup\n",
    "            self.title = soup.title.string if soup.title else 'No title found'  # Extract the title\n",
    "            if soup.body:\n",
    "                # Remove irrelevant elements from the body\n",
    "                for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
    "                    irrelevant.decompose()\n",
    "                # Extract clean text from the body\n",
    "                self.text = soup.body.get_text(separator='\\n', strip=True)\n",
    "            else:\n",
    "                self.text = \"No body found\"  # If no body is found, indicate it\n",
    "        except Exception as e:\n",
    "            print(f\"Error accessing the site: {e}\")  # Print error to console\n",
    "            self.text = \"Error accessing the site\"  # Store error in the attribute\n",
    "        finally:\n",
    "            self.driver.quit()  # Always close the browser, whether or not an error occurred"
   ]
  },
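  {
   "cell_type": "markdown",
   "id": "c4f1a7d2",
   "metadata": {},
   "source": [
    "The cleanup step inside `Website` can be tried on its own, without launching a browser. The cell below is a minimal sketch: `sample_html` is a made-up page used only for illustration, and the tag list mirrors the one in the class above. It should print the page title followed by the heading and paragraph text, with the script, style, image, and input elements removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8b2e6f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Hypothetical sample page, used only for illustration (not taken from a real site)\n",
    "sample_html = '''\n",
    "<html>\n",
    "  <head><title>Sample page</title></head>\n",
    "  <body>\n",
    "    <script>console.log(1);</script>\n",
    "    <style>p { color: red; }</style>\n",
    "    <h1>Main heading</h1>\n",
    "    <p>First paragraph.</p>\n",
    "    <img src='logo.png' alt='logo'>\n",
    "    <input type='text' name='q'>\n",
    "  </body>\n",
    "</html>\n",
    "'''\n",
    "\n",
    "soup = BeautifulSoup(sample_html, 'html.parser')\n",
    "\n",
    "# Drop elements that carry no readable text, as the Website class does\n",
    "for irrelevant in soup.body(['script', 'style', 'img', 'input']):\n",
    "    irrelevant.decompose()\n",
    "\n",
    "print(soup.title.string)\n",
    "print(soup.body.get_text(separator='\\n', strip=True))"
   ]
  },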
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e902c6b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
    "and provides a short summary, ignoring text that might be navigation related. \\\n",
    "Respond in markdown in Mexican Spanish.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eaee8f36",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A function that writes a User Prompt that asks for summaries of websites:\n",
    "\n",
    "def user_prompt_for(website):\n",
    "    user_prompt = f\"You are looking at a website titled {website.title}\"\n",
    "    user_prompt += \"\\nThe contents of this website is as follows; \\\n",
    "please provide a short summary of this website in markdown. \\\n",
    "If it includes news or announcements, then summarize these too.\\n\\n\"\n",
    "    user_prompt += website.text\n",
    "    return user_prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ac4ed8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creates messages for the OpenAI API\n",
    "def messages_for(website):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1536d537",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creates a summary for the given URL\n",
    "def summarize(url):\n",
    "    website = Website(url)\n",
    "    response = openai.chat.completions.create(\n",
    "        model = \"gpt-4o-mini\",\n",
    "        messages = messages_for(website)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe135339",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shows the summary for the given URL\n",
    "def display_summary(url):\n",
    "    summary = summarize(url)\n",
    "    display(Markdown(summary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a301ab4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://openai.com/\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}