{
"cells": [
{
"cell_type": "code",
"execution_count": 15,
"id": "fafbdb1f-6ecf-4fee-a1d2-80c6f33b556d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: selenium in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (4.36.0)\n",
"Requirement already satisfied: urllib3<3.0,>=2.5.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (2.5.0)\n",
"Requirement already satisfied: trio<1.0,>=0.30.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (0.31.0)\n",
"Requirement already satisfied: trio-websocket<1.0,>=0.12.2 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (0.12.2)\n",
"Requirement already satisfied: certifi>=2025.6.15 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (2025.8.3)\n",
"Requirement already satisfied: typing_extensions<5.0,>=4.14.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (4.15.0)\n",
"Requirement already satisfied: websocket-client<2.0,>=1.8.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from selenium) (1.8.0)\n",
"Requirement already satisfied: attrs>=23.2.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (25.3.0)\n",
"Requirement already satisfied: sortedcontainers in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (2.4.0)\n",
"Requirement already satisfied: idna in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (3.10)\n",
"Requirement already satisfied: outcome in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (1.3.0.post0)\n",
"Requirement already satisfied: sniffio>=1.3.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio<1.0,>=0.30.0->selenium) (1.3.1)\n",
"Requirement already satisfied: wsproto>=0.14 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from trio-websocket<1.0,>=0.12.2->selenium) (1.2.0)\n",
"Requirement already satisfied: pysocks!=1.5.7,<2.0,>=1.5.6 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (1.7.1)\n",
"Requirement already satisfied: h11<1,>=0.9.0 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from wsproto>=0.14->trio-websocket<1.0,>=0.12.2->selenium) (0.16.0)\n",
"Requirement already satisfied: webdriver-manager in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (4.0.2)\n",
"Requirement already satisfied: requests in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from webdriver-manager) (2.32.5)\n",
"Requirement already satisfied: python-dotenv in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from webdriver-manager) (1.1.1)\n",
"Requirement already satisfied: packaging in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from webdriver-manager) (25.0)\n",
"Requirement already satisfied: charset_normalizer<4,>=2 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (3.4.3)\n",
"Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (3.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (2.5.0)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/llms/lib/python3.11/site-packages (from requests->webdriver-manager) (2025.8.3)\n"
]
}
],
"source": [
"!pip install selenium\n",
"!pip install webdriver-manager"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fc4283fd-504a-43fa-a92b-7b54c76b39a0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key found and looks good so far!\n"
]
},
{
"data": {
"text/markdown": [
"Sure! Here's a parody Twitter thread based on the homepage of EdwardDonner.com. Each tweet captures the tone and structure of the site in a sarcastic manner.\n",
"\n",
"---\n",
"\n",
"**1/6 🥳 Welcome to Edward Donner!** \n",
"Where we officially celebrate the mundane and demonstrate how to make everything sound like it's a life-changing experience. Get your \"meh\" ready, because the excitement is practically oozing out of our non-existent product descriptions! \n",
"\n",
"---\n",
"\n",
"**2/6 🌟 Our \"Mission\":** \n",
"To show you that while you shop, there's a slight chance you might save a couple of bucks! Because why just shop when you can do it with absolutely zero risk of fun or spontaneity? We take the thrill out of thrifting—you're welcome!\n",
"\n",
"---\n",
"\n",
"**3/6 💪 What We Offer:** \n",
"Oh, just your run-of-the-mill assortment of \"high-quality\" products that we've totally not pulled from the clearance bin. From must-have items to things you didn't ask for but we'll sell you anyway. It's like a treasure hunt, but without the treasure!\n",
"\n",
"---\n",
"\n",
"**4/6 📦 Our Customers:** \n",
"We love to brag about our fictitious wide-eyed customers who are THRILLED to have stumbled upon us. They literally danced in joy—probably because they mistook our site for a disco party. Who needs real satisfaction when you've got buyer's remorse?\n",
"\n",
"---\n",
"\n",
"**5/6 🎉 Our Commitment:** \n",
"“Convenience is key!” they say. So we've made it super easy to shop from your couch without even the slightest hint of real fulfillment. You can binge on shopping while scrolling through cat videos—multitasking at its finest! 🙌\n",
"\n",
"---\n",
"\n",
"**6/6 💼 Join Us Today!** \n",
"Dive on in, the water's lukewarm! Sign up for updates and prepare for thrills—like, remember checking your email? It's like getting a surprise tax form in your inbox, only less exciting! Don't miss out on treasure, folks! 😂 #LivingTheDream\n",
"\n",
"--- \n",
"\n",
"Feel free to share or adjust the humor to your liking!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"I'm sorry, but I can't access external websites directly, including the one you provided. However, if you can share some of the content or main points from the website, I can help you craft a light and witty parody based on that information!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Imports\n",
"\n",
"import os\n",
"import requests\n",
"import time\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.service import Service\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.common.by import By\n",
"from selenium.webdriver.support.ui import WebDriverWait\n",
"from selenium.webdriver.support import expected_conditions as EC\n",
"from webdriver_manager.chrome import ChromeDriverManager\n",
"\n",
"\n",
"# Get the api key\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"if not api_key:\n",
" print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
" print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
" print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
" print(\"API key found and looks good so far!\")\n",
"\n",
"# Create the OpenAI client\n",
"openai = OpenAI()\n",
"\n",
"\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
" def __init__(self, url, headless=True, chrome_binary=None, wait_seconds=10):\n",
" \"\"\"\n",
" Create this Website object from the given url using Selenium WebDriver.\n",
" Uses webdriver-manager to fetch a compatible chromedriver automatically.\n",
" Parameters:\n",
" - url: target URL\n",
" - headless: run chrome headless (True/False)\n",
" - chrome_binary: optional path to chrome/chromium binary (if not in PATH)\n",
" - wait_seconds: timeout for waiting page load/dynamic content\n",
" \"\"\"\n",
" self.url = url\n",
" options = Options()\n",
"\n",
" # headless or visible browser\n",
" if headless:\n",
" options.add_argument(\"--headless=new\") # use new headless flag where supported\n",
" options.add_argument(\"--no-sandbox\")\n",
" options.add_argument(\"--disable-dev-shm-usage\") # helpful in containers\n",
" options.add_argument(\"--disable-gpu\")\n",
" # some sites detect automation; these flags may help\n",
" options.add_argument(\"--disable-blink-features=AutomationControlled\")\n",
" options.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\n",
" options.add_experimental_option('useAutomationExtension', False)\n",
"\n",
" # If you need to point to a custom Chrome/Chromium binary:\n",
" if chrome_binary:\n",
" options.binary_location = chrome_binary\n",
"\n",
" # Use webdriver-manager to download/manage chromedriver automatically\n",
" service = Service(ChromeDriverManager().install())\n",
"\n",
" driver = webdriver.Chrome(service=service, options=options)\n",
" try:\n",
" driver.get(url)\n",
"\n",
" # Use WebDriverWait to let dynamic JS content load (better than sleep)\n",
" try:\n",
" WebDriverWait(driver, wait_seconds).until(\n",
" lambda d: d.execute_script(\"return document.readyState === 'complete'\")\n",
" )\n",
" except Exception:\n",
" # fallback: short sleep if readyState didn't hit complete in time\n",
" time.sleep(2)\n",
"\n",
" html = driver.page_source\n",
" soup = BeautifulSoup(html, \"html.parser\")\n",
"\n",
" # Title\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
"\n",
" # Remove irrelevant tags inside body if body exists\n",
" body = soup.body or soup\n",
" for irrelevant in body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
"\n",
" self.text = body.get_text(separator=\"\\n\", strip=True)\n",
"\n",
" finally:\n",
" driver.quit()\n",
"\n",
"\n",
"\n",
"system_prompt = \"\"\"You are a sarcastic website-parsing agent whose job is to produce a spoof/parody of a target website.\n",
"Behavior:\n",
" - When given a URL together with the page's scraped text, parse that text (the page is fetched upstream with Selenium and BeautifulSoup) rather than claiming to browse the web yourself.\n",
" - Preserve the site's structure: headings, subheadings, paragraphs, lists, and major sections.\n",
" - Rewrite all visible copy in a clearly sarcastic, mocking, or humorous tone while preserving the original intent and structure so the spoof is recognizable.\n",
" - Keep formatting (Markdown or HTML-like headings and lists) so the output can be rendered as a parody webpage.\n",
" - Emphasize and exaggerate marketing fluff, UI oddities, and obvious clichés. Use witty, ironic, or deadpan phrasing.\n",
"Safety & Limits:\n",
" - Do not produce content that is defamatory, reveals private personal data, or incites harassment. Jokes should target tone/marketing/design, not private individuals.\n",
" - Avoid reproducing long verbatim copyrighted text; instead, paraphrase and transform content clearly into a parody.\n",
" - If the page requires interactive steps (logins, paywalls, or dynamic user-only content), note the limitation and spoof using the visible public content only.\n",
"Output format:\n",
" - Return a single spoofed document preserving headings and lists, suitable for rendering as a parody site (Markdown or simple HTML).\n",
" - Include a short metadata line at the top: e.g., \\\"Source: <original URL> — Spoofed by sarcastic-agent\\\".\"\"\"\n",
"\n",
"\n",
"\n",
"def messages_for(website, user_prompt):\n",
"    # Include the scraped page text so the model can actually see the site\n",
"    user_content = user_prompt + \"\\n\\nWebsite title: \" + website.title + \"\\n\\nWebsite contents:\\n\" + website.text\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_content}\n",
"    ]\n",
"\n",
"def summarize(url, user_prompt, model):\n",
"    website = Website(url)\n",
"    response = openai.chat.completions.create(\n",
"        model=model,\n",
"        messages=messages_for(website, user_prompt)\n",
"    )\n",
"    return response.choices[0].message.content\n",
"\n",
" \n",
"def display_summary(url, user_prompt, model):\n",
"    summary = summarize(url, user_prompt, model)\n",
"    display(Markdown(summary))\n",
" \n",
"openai_model = \"gpt-4o-mini\"\n",
"website_url = \"https://edwarddonner.com\"\n",
"user_prompt1 = \"Parse \" + website_url + \" and produce a 6-tweet Twitter thread parodying the homepage. Each tweet ≤280 characters, with a witty hook at the start.\"\n",
"display_summary(website_url, user_prompt1, openai_model)\n",
"# user_prompt2 = \"Parse \" + website_url + \" and rewrite as a sarcastic parody with *light* sarcasm — witty and friendly, not mean. Keep it safe for public sharing.\"\n",
"# display_summary(website_url, user_prompt2, openai_model)"
]
},
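{
"cell_type": "code",
"execution_count": null,
"id": "9f2c1a3b-7d4e-4b8a-9c6d-1e2f3a4b5c6d",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: a requests-based fallback for static pages.\n",
"# This is an illustrative alternative, not part of the course flow above.\n",
"# It reuses the `headers` dict defined earlier but does no JavaScript\n",
"# rendering, so it only works when the content is in the raw HTML.\n",
"class StaticWebsite:\n",
"    def __init__(self, url):\n",
"        self.url = url\n",
"        response = requests.get(url, headers=headers, timeout=10)\n",
"        response.raise_for_status()\n",
"        soup = BeautifulSoup(response.text, \"html.parser\")\n",
"        self.title = soup.title.string if soup.title else \"No title found\"\n",
"        body = soup.body or soup\n",
"        for irrelevant in body([\"script\", \"style\", \"img\", \"input\"]):\n",
"            irrelevant.decompose()\n",
"        self.text = body.get_text(separator=\"\\n\", strip=True)\n",
"\n",
"# Example (uncomment to try):\n",
"# site = StaticWebsite(website_url)\n",
"# print(site.title)"
]
},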
{
"cell_type": "code",
"execution_count": null,
"id": "84313d02-7459-4f56-b0ff-4d09b2b2e0b9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6437b3cc-a50b-44d5-9241-6ba5f33617d6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}