{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "35f59eb3",
   "metadata": {},
   "source": [
    "# Pluggable Web Scraper and Summarizer with Interface-Based Design\n",
    "\n",
    "This system implements a **pluggable architecture** for web scraping and summarization, built on interface-based design using Python’s `Protocol` types. Each stage of the pipeline—content fetching, HTML parsing, and LLM-based summarization—is defined through explicit structural contracts rather than concrete implementations. Components such as `RequestsFetcher`, `RobustSoupParser`, and `OllamaClient` fulfill these protocols and can be swapped independently, enabling flexibility, testing, and future extension without modifying core logic. Immutable data models (`@dataclass(frozen=True)`) enforce data integrity throughout the pipeline, while the design keeps concerns cleanly separated across the pipeline stages, supporting maintainability and incremental growth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "f42e6d21",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "from typing import Protocol, Optional, List, Dict, Tuple\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "import logging\n",
    "import chardet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65c17368",
   "metadata": {},
   "source": [
    "# Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "eb0904d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "logging.basicConfig(level=logging.INFO)\n",
    "logger = logging.getLogger(__name__)\n",
    "\n",
    "HEADERS = {\n",
    "    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\",\n",
    "}\n",
    "DEFAULT_TIMEOUT = 10\n",
    "UNWANTED_TAGS = [\"script\", \"style\", \"nav\", \"header\", \"footer\", \"img\", \"input\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8110aa46",
   "metadata": {},
   "source": [
    "# Data Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "cdb6c990",
   "metadata": {},
   "outputs": [],
   "source": [
    "@dataclass(frozen=True)\n",
    "class RawResponse:\n",
    "    content: bytes\n",
    "    status_code: int\n",
    "    encoding: Optional[str]  # may be None when requests cannot infer an encoding\n",
    "    headers: Dict[str, str]\n",
    "    elapsed: float\n",
    "    final_url: str\n",
    "\n",
    "@dataclass(frozen=True)\n",
    "class WebsiteContent:\n",
    "    url: str\n",
    "    title: str\n",
    "    text: str\n",
    "    status_code: int\n",
    "    response_time: float\n",
    "\n",
    "@dataclass(frozen=True)\n",
    "class LLMResponse:\n",
    "    content: str\n",
    "    model: str\n",
    "    tokens_used: int"
   ]
  },
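  {
   "cell_type": "markdown",
   "id": "demo-immutability-md",
   "metadata": {},
   "source": [
    "*Added illustration (not part of the original pipeline):* a quick check of what `frozen=True` buys us. Reassigning a field on one of the models above raises `dataclasses.FrozenInstanceError`, so records cannot be mutated as they move through the pipeline. The sample field values below are arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "demo-immutability",
   "metadata": {},
   "outputs": [],
   "source": [
    "import dataclasses\n",
    "\n",
    "# Arbitrary sample values, only used to construct an instance\n",
    "sample = LLMResponse(content=\"hello\", model=\"llama3.2\", tokens_used=3)\n",
    "\n",
    "try:\n",
    "    sample.content = \"mutated\"  # frozen dataclasses reject attribute assignment\n",
    "except dataclasses.FrozenInstanceError as err:\n",
    "    print(f\"Immutable as expected: {err!r}\")"
   ]
  },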
  {
   "cell_type": "markdown",
   "id": "87b2a97a",
   "metadata": {},
   "source": [
    "# Protocols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "3070eac2",
   "metadata": {},
   "outputs": [],
   "source": [
    "class ContentFetcher(Protocol):\n",
    "    def fetch(self, url: str) -> RawResponse: ...\n",
    "\n",
    "class ContentParser(Protocol):\n",
    "    def parse(self, response: RawResponse) -> WebsiteContent: ...\n",
    "\n",
    "class LLMClient(Protocol):\n",
    "    def generate(self, messages: List[Dict[str, str]], model: str) -> LLMResponse: ...\n"
   ]
  },
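  {
   "cell_type": "markdown",
   "id": "demo-protocol-md",
   "metadata": {},
   "source": [
    "*Added illustration (not part of the original design):* because these are structural `Protocol` types, any class with a matching `fetch` method satisfies `ContentFetcher` with no inheritance required. The hypothetical `FakeFetcher` below returns canned bytes, so it could stand in for a real fetcher in tests; the HTML snippet and field values are made up for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "demo-protocol-stub",
   "metadata": {},
   "outputs": [],
   "source": [
    "class FakeFetcher:\n",
    "    \"\"\"Hypothetical offline stand-in for RequestsFetcher (useful in tests).\"\"\"\n",
    "\n",
    "    def fetch(self, url: str) -> RawResponse:\n",
    "        # Canned response: no network access, fixed HTML body\n",
    "        return RawResponse(\n",
    "            content=b\"<html><head><title>Stub Page</title></head><body><p>Canned paragraph.</p></body></html>\",\n",
    "            status_code=200,\n",
    "            encoding=\"utf-8\",\n",
    "            headers={},\n",
    "            elapsed=0.0,\n",
    "            final_url=url,\n",
    "        )\n",
    "\n",
    "# Duck-typed: FakeFetcher never mentions ContentFetcher, yet it satisfies the protocol\n",
    "stub: ContentFetcher = FakeFetcher()\n",
    "print(stub.fetch(\"https://example.com\").status_code)"
   ]
  },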
  {
   "cell_type": "markdown",
   "id": "553daa11",
   "metadata": {},
   "source": [
    "# Implementations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "1a42bed9",
   "metadata": {},
   "outputs": [],
   "source": [
    "class RequestsFetcher:\n",
    "    def __init__(self,\n",
    "                 headers: Dict[str, str] = HEADERS,\n",
    "                 timeout: int = DEFAULT_TIMEOUT,\n",
    "                 max_redirects: int = 5):\n",
    "        self.headers = headers\n",
    "        self.timeout = timeout\n",
    "        self.max_redirects = max_redirects  # not applied by requests.get(); enforcing it would require a Session\n",
    "\n",
    "    def fetch(self, url: str) -> RawResponse:\n",
    "        logger.info(f\"Fetching content from {url}\")\n",
    "        try:\n",
    "            response = requests.get(\n",
    "                url,\n",
    "                headers=self.headers,\n",
    "                timeout=self.timeout,\n",
    "                allow_redirects=True,\n",
    "                stream=False  # Prevent partial content issues\n",
    "            )\n",
    "            response.raise_for_status()\n",
    "\n",
    "            return RawResponse(\n",
    "                content=response.content,\n",
    "                status_code=response.status_code,\n",
    "                encoding=response.encoding,\n",
    "                headers=dict(response.headers),\n",
    "                elapsed=response.elapsed.total_seconds(),\n",
    "                final_url=response.url\n",
    "            )\n",
    "        except requests.exceptions.RequestException as e:\n",
    "            logger.error(f\"Failed to fetch {url}: {str(e)}\")\n",
    "            raise\n",
    "\n",
    "class RobustSoupParser:\n",
    "    def __init__(self, unwanted_tags: List[str] = UNWANTED_TAGS):\n",
    "        self.unwanted_tags = unwanted_tags\n",
    "\n",
    "    def parse(self, response: RawResponse) -> WebsiteContent:\n",
    "        logger.info(f\"Parsing content from {response.final_url}\")\n",
    "\n",
    "        # Detect encoding if not provided\n",
    "        encoding = response.encoding or self._detect_encoding(response.content)\n",
    "\n",
    "        try:\n",
    "            decoded_content = response.content.decode(encoding, errors='replace')\n",
    "            soup = BeautifulSoup(decoded_content, 'html.parser')\n",
    "        except Exception as e:\n",
    "            logger.error(f\"Failed to parse content: {str(e)}\")\n",
    "            raise\n",
    "\n",
    "        return WebsiteContent(\n",
    "            url=response.final_url,\n",
    "            title=self._extract_title(soup),\n",
    "            text=self._clean_content(soup),\n",
    "            status_code=response.status_code,\n",
    "            response_time=response.elapsed\n",
    "        )\n",
    "\n",
    "    def _detect_encoding(self, content: bytes) -> str:\n",
    "        result = chardet.detect(content)\n",
    "        return result['encoding'] or 'utf-8'\n",
    "\n",
    "    def _extract_title(self, soup: BeautifulSoup) -> str:\n",
    "        title_tag = soup.find('title')\n",
    "        return title_tag.text.strip() if title_tag else \"Untitled\"\n",
    "\n",
    "    def _clean_content(self, soup: BeautifulSoup) -> str:\n",
    "        # Remove unwanted tags\n",
    "        for tag in self.unwanted_tags:\n",
    "            for element in soup.find_all(tag):\n",
    "                element.decompose()\n",
    "\n",
    "        # Extract text with semantic line breaks\n",
    "        text = '\\n\\n'.join([\n",
    "            element.get_text().strip()\n",
    "            for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'article'])\n",
    "            if element.get_text().strip()\n",
    "        ])\n",
    "\n",
    "        return text or \"No readable content found\"\n",
    "\n",
    "class OllamaClient:\n",
    "    def __init__(self,\n",
    "                 base_url: str = 'http://localhost:11434/v1',\n",
    "                 api_key: str = 'ollama',\n",
    "                 max_retries: int = 3):\n",
    "        self.client = OpenAI(base_url=base_url, api_key=api_key)\n",
    "        self.max_retries = max_retries\n",
    "\n",
    "    def generate(self,\n",
    "                 messages: List[Dict[str, str]],\n",
    "                 model: str = \"llama3.2\") -> LLMResponse:\n",
    "        logger.info(f\"Generating summary with {model}\")\n",
    "\n",
    "        for attempt in range(self.max_retries):\n",
    "            try:\n",
    "                response = self.client.chat.completions.create(\n",
    "                    model=model,\n",
    "                    messages=messages\n",
    "                )\n",
    "                return LLMResponse(\n",
    "                    content=response.choices[0].message.content,\n",
    "                    model=model,\n",
    "                    tokens_used=response.usage.total_tokens\n",
    "                )\n",
    "            except Exception as e:\n",
    "                if attempt == self.max_retries - 1:\n",
    "                    logger.error(f\"Failed after {self.max_retries} attempts: {str(e)}\")\n",
    "                    raise\n",
    "                logger.warning(f\"Retry {attempt + 1}/{self.max_retries}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1805d4f8",
   "metadata": {},
   "source": [
    "# Core Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "a985806a",
   "metadata": {},
   "outputs": [],
   "source": [
    "class SummarizationPipeline:\n",
    "    SYSTEM_PROMPT = \"\"\"You are a professional web content analyst. Provide a structured markdown summary containing:\n",
    "- Key points\n",
    "- Notable statistics\n",
    "- Important names/dates\n",
    "- Actionable insights\n",
    "Avoid navigation content and marketing fluff.\"\"\"\n",
    "\n",
    "    def __init__(self,\n",
    "                 fetcher: ContentFetcher,\n",
    "                 parser: ContentParser,\n",
    "                 llm_client: LLMClient):\n",
    "        self.fetcher = fetcher\n",
    "        self.parser = parser\n",
    "        self.llm_client = llm_client\n",
    "\n",
    "    def summarize(self, url: str, model: str = \"llama3.2\") -> LLMResponse:\n",
    "        raw_response = self.fetcher.fetch(url)\n",
    "        website_content = self.parser.parse(raw_response)\n",
    "        messages = self._build_messages(website_content)\n",
    "        return self.llm_client.generate(messages, model)\n",
    "\n",
    "    def _build_messages(self, content: WebsiteContent) -> List[Dict[str, str]]:\n",
    "        truncated_text = content.text[:8000]  # Truncate to stay within the context window\n",
    "        user_prompt = f\"\"\"**Website Analysis Request**\n",
    "URL: {content.url}\n",
    "Title: {content.title}\n",
    "\n",
    "Content:\n",
    "{truncated_text}\n",
    "\n",
    "Please provide a comprehensive summary following the guidelines above.\"\"\"\n",
    "\n",
    "        return [\n",
    "            {\"role\": \"system\", \"content\": self.SYSTEM_PROMPT},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41832e20",
   "metadata": {},
   "source": [
    "# Factory & Presentation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "656b8dd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_default_pipeline() -> SummarizationPipeline:\n",
    "    return SummarizationPipeline(\n",
    "        fetcher=RequestsFetcher(),\n",
    "        parser=RobustSoupParser(),\n",
    "        llm_client=OllamaClient()\n",
    "    )\n",
    "\n",
    "class JupyterPresenter:\n",
    "    @staticmethod\n",
    "    def display(response: LLMResponse) -> None:\n",
    "        display(Markdown(f\"\"\"\n",
    "## Summary Results\n",
    "**Model**: {response.model}  \n",
    "**Tokens Used**: {response.tokens_used}  \n",
    "**Summary**:\n",
    "{response.content}\n",
    "\"\"\"))"
   ]
  },
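  {
   "cell_type": "markdown",
   "id": "demo-offline-md",
   "metadata": {},
   "source": [
    "*Added illustration (not part of the original notebook):* since the pipeline depends only on the three protocols, it can be wired end-to-end with no network access by pairing the hypothetical `FakeFetcher` from the Protocols section with an equally hypothetical `CannedLLMClient`. The canned reply text and token count are made-up values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "demo-offline-pipeline",
   "metadata": {},
   "outputs": [],
   "source": [
    "class CannedLLMClient:\n",
    "    \"\"\"Hypothetical LLM stand-in that returns a fixed summary.\"\"\"\n",
    "\n",
    "    def generate(self, messages: List[Dict[str, str]], model: str = \"llama3.2\") -> LLMResponse:\n",
    "        # No API call: report how many messages were received and return a fixed body\n",
    "        return LLMResponse(\n",
    "            content=f\"(canned summary of {len(messages)} messages)\",\n",
    "            model=model,\n",
    "            tokens_used=0\n",
    "        )\n",
    "\n",
    "# Same pipeline class, different plugs -- SummarizationPipeline itself is untouched\n",
    "offline_pipeline = SummarizationPipeline(\n",
    "    fetcher=FakeFetcher(),\n",
    "    parser=RobustSoupParser(),\n",
    "    llm_client=CannedLLMClient()\n",
    ")\n",
    "print(offline_pipeline.summarize(\"https://example.com\").content)"
   ]
  },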
  {
   "cell_type": "markdown",
   "id": "76339788",
   "metadata": {},
   "source": [
    "# Execution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "69304964",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:Fetching content from https://edwarddonner.com\n",
      "INFO:__main__:Parsing content from https://edwarddonner.com/\n",
      "INFO:__main__:Generating summary with llama3.2\n",
      "INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "\n",
       "## Summary Results\n",
       "**Model**: llama3.2  \n",
       "**Tokens Used**: 630  \n",
       "**Summary**:\n",
       "**Website Analysis Summary**\n",
       "==========================\n",
       "\n",
       "### Key Points\n",
       "\n",
       "* The website belongs to Edward Donner, a co-founder and CTO of Nebula.io, an AI startup applying LLMs for talent discovery.\n",
       "* The website showcases Donner's interests in code writing, music production, and technology.\n",
       "* It announces the launch of The Complete Agentic AI Engineering Course and provides resources on LLM workshop and mastering AI.\n",
       "\n",
       "### Notable Statistics\n",
       "\n",
       "* None mentioned, as there are no explicit statistics provided on the website.\n",
       "\n",
       "### Important Names/Dates\n",
       "\n",
       "* Edward Donner: Website owner and CTO of Nebula.io.\n",
       "* 2021: Year in which AI startup untapt was acquired by an unknown party (no information about the acquirer is available).\n",
       "\n",
       "### Actionable Insights\n",
       "\n",
       "* The website appears to be a personal page showcasing Donner's expertise in AI, LLMs, and talent discovery. It may serve as a way for him to establish his professional brand and network with potential clients or collaborators.\n",
       "* Offering resources and courses, such as \"The Complete Agentic AI Engineering Course\" and workshops, can help attract visitors and demonstrate the company's capabilities.\n",
       "* Subscribing to the website might offer exclusive access to updates, insights on LLMs and talent discovery, and potentially lucrative career opportunities.\n",
       "        "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "pipeline = create_default_pipeline()\n",
    "try:\n",
    "    response = pipeline.summarize(\"https://edwarddonner.com\")\n",
    "    JupyterPresenter.display(response)\n",
    "except Exception as e:\n",
    "    logger.error(f\"Summarization failed: {str(e)}\")\n",
    "    display(Markdown(\"## Error\\nUnable to generate summary. Please check the URL and try again.\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}