Merge pull request #340 from lakovicb/main
Community contribution: Playwright-based scraper by Bojan
173
community-contributions/playwright-bojan/Playwright_Solution_JupyterAsync.ipynb
Normal file
@@ -0,0 +1,173 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "aa629e55-8f41-41ab-b319-b55dd1cfc76b",
   "metadata": {},
   "source": [
    "# Playwright Scraper Showcase (Async in Jupyter)\n",
    "\n",
    "This notebook demonstrates how to run async Playwright-based scraping code inside JupyterLab using `nest_asyncio`.\n",
    "\n",
    "**Note:** Requires `openai_scraper_playwright.py` to be in the same directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "97469777",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "import asyncio\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6254fa89",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai_scraper_playwright import EnhancedOpenAIScraper, analyze_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "33d2737b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### 1. Overall Summary of the Website:\n",
      "The website appears to be a hub for various applications of AI technology, particularly focusing on the capabilities of ChatGPT and other AI models developed by OpenAI. It offers a range of services from answering queries, assisting in planning trips, explaining technical topics, helping with language translation, and providing educational content. The site also features updates on new AI models, research publications, and business solutions integrating AI.\n",
      "\n",
      "### 2. Key Individuals or Entities:\n",
      "- **OpenAI**: Mentioned as the organization behind the development of AI models and technologies such as ChatGPT, GPT-4.1, and image generation models. OpenAI seems to be focused on advancing and applying AI in various fields.\n",
      "- **Lyndon Barrois & Sora**: Featured in a story, possibly highlighting individual experiences or contributions within the OpenAI ecosystem.\n",
      "\n",
      "### 3. Recent Announcements or Updates:\n",
      "- **Introducing our latest image generation model in the API** (Product, Apr 23, 2025)\n",
      "- **Thinking with images** (Release, Apr 16, 2025)\n",
      "- **OpenAI announces nonprofit commission advisors** (Company, Apr 15, 2025)\n",
      "- **Our updated Preparedness Framework** (Publication, Apr 15, 2025)\n",
      "- **BrowseComp: a benchmark for browsing agents** (Publication, Apr 10, 2025)\n",
      "- **OpenAI Pioneers Program** (Company, Apr 9, 2025)\n",
      "\n",
      "### 4. Main Topics or Themes:\n",
      "- **AI Model Development and Application**: Discusses various AI models like ChatGPT, GPT-4.1, and image generation models.\n",
      "- **Educational and Practical AI Uses**: Offers help in educational topics, practical tasks, and creative endeavors using AI.\n",
      "- **Business Integration**: Focuses on integrating AI into business processes, automating tasks in finance, legal, and other sectors.\n",
      "- **Research and Publications**: Shares updates on the latest research and publications related to AI technology.\n",
      "\n",
      "### 5. Any Noteworthy Features or Projects:\n",
      "- **GPT-4.1 and Image Generation Models**: Introduction of new and advanced AI models for text and image processing.\n",
      "- **OpenAI Pioneers Program**: A significant initiative likely aimed at fostering innovation and practical applications of AI technology.\n",
      "- **BrowseComp and PaperBench**: Research projects or benchmarks designed to evaluate and improve AI capabilities in specific domains.\n"
     ]
    }
   ],
   "source": [
    "result = asyncio.run(analyze_content())\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7450ccf",
   "metadata": {},
   "source": [
    "✅ If you see structured analysis above, the async code ran successfully in Jupyter!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a46716c-6f77-4b2b-b423-cc9fe05014da",
   "metadata": {},
   "source": [
    "# 🧪 Playwright Scraper Output (Formatted)\n",
    "\n",
    "---\n",
    "\n",
    "## 🧭 1. **Overall Summary of the Website**\n",
    "\n",
    "*The website appears to be focused on showcasing various applications and updates related to OpenAI's technology, specifically ChatGPT and other AI tools. It provides information on product releases, company updates, and educational content on how to use AI technologies in different scenarios such as planning trips, learning games, coding, and more.*\n",
    "\n",
    "---\n",
    "\n",
    "## 🧑‍💼 2. **Key Individuals or Entities**\n",
    "\n",
    "- **OpenAI** — Company behind the technologies and updates discussed on the website \n",
    "- **Lyndon Barrois & Sora** — Featured in a story, possibly highlighting user experiences or contributions\n",
    "\n",
    "---\n",
    "\n",
    "## 📰 3. **Recent Announcements or Updates**\n",
    "\n",
    "- 📢 **Introducing GPT-4.1 in the API** — *(no date provided)*\n",
    "- 🖼️ **Introducing 4o Image Generation** — *(no date provided)*\n",
    "- 🐟 **Catching halibut with ChatGPT** — *(no date provided)*\n",
    "- 🧠 **Thinking with images** — *Apr 16, 2025*\n",
    "- 🧑‍⚖️ **Nonprofit commission advisors announced** — *Apr 15, 2025*\n",
    "- ⚙️ **Updated Preparedness Framework** — *Apr 15, 2025*\n",
    "- 🌐 **BrowseComp benchmark for browsing agents** — *Apr 10, 2025*\n",
    "- 🚀 **OpenAI Pioneers Program launched** — *Apr 9, 2025*\n",
    "- 📊 **PaperBench research benchmark published** — *Apr 2, 2025*\n",
    "\n",
    "---\n",
    "\n",
    "## 📚 4. **Main Topics or Themes**\n",
    "\n",
    "- 🤖 **AI Technology Applications** — Using AI for tasks like planning, learning, and troubleshooting \n",
    "- 🧩 **Product and Feature Releases** — Updates on new capabilities \n",
    "- 📘 **Educational Content** — Guides for using AI effectively \n",
    "- 🧪 **Research and Development** — Publications and technical benchmarks\n",
    "\n",
    "---\n",
    "\n",
    "## ⭐ 5. **Noteworthy Features or Projects**\n",
    "\n",
    "- ✅ **GPT-4.1** — A new API-accessible version of the language model \n",
    "- 🖼️ **4o Image Generation** — Feature focused on AI-generated images \n",
    "- 🚀 **OpenAI Pioneers Program** — Initiative likely fostering innovation in AI \n",
    "- 📊 **BrowseComp & PaperBench** — Benchmarks for evaluating AI agents\n",
    "\n",
    "---\n",
    "\n",
    "✅ *If you're reading this inside Jupyter and seeing clean structure — your async notebook setup is working beautifully.*\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95c38374-5daa-487c-8bd9-919bb4037ea3",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
69
community-contributions/playwright-bojan/Playwright_Solution_Showcase_Formatted.ipynb
Normal file
@@ -0,0 +1,69 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3df9df94",
   "metadata": {},
   "source": [
    "# 🧪 Playwright Scraper Output (Formatted)\n",
    "\n",
    "---\n",
    "\n",
    "## 🧭 1. **Overall Summary of the Website**\n",
    "\n",
    "*The website appears to be focused on showcasing various applications and updates related to OpenAI's technology, specifically ChatGPT and other AI tools. It provides information on product releases, company updates, and educational content on how to use AI technologies in different scenarios such as planning trips, learning games, coding, and more.*\n",
    "\n",
    "---\n",
    "\n",
    "## 🧑‍💼 2. **Key Individuals or Entities**\n",
    "\n",
    "- **OpenAI** — Company behind the technologies and updates discussed on the website \n",
    "- **Lyndon Barrois & Sora** — Featured in a story, possibly highlighting user experiences or contributions\n",
    "\n",
    "---\n",
    "\n",
    "## 📰 3. **Recent Announcements or Updates**\n",
    "\n",
    "- 📢 **Introducing GPT-4.1 in the API** — *(no date provided)*\n",
    "- 🖼️ **Introducing 4o Image Generation** — *(no date provided)*\n",
    "- 🐟 **Catching halibut with ChatGPT** — *(no date provided)*\n",
    "- 🧠 **Thinking with images** — *Apr 16, 2025*\n",
    "- 🧑‍⚖️ **Nonprofit commission advisors announced** — *Apr 15, 2025*\n",
    "- ⚙️ **Updated Preparedness Framework** — *Apr 15, 2025*\n",
    "- 🌐 **BrowseComp benchmark for browsing agents** — *Apr 10, 2025*\n",
    "- 🚀 **OpenAI Pioneers Program launched** — *Apr 9, 2025*\n",
    "- 📊 **PaperBench research benchmark published** — *Apr 2, 2025*\n",
    "\n",
    "---\n",
    "\n",
    "## 📚 4. **Main Topics or Themes**\n",
    "\n",
    "- 🤖 **AI Technology Applications** — Using AI for tasks like planning, learning, and troubleshooting \n",
    "- 🧩 **Product and Feature Releases** — Updates on new capabilities \n",
    "- 📘 **Educational Content** — Guides for using AI effectively \n",
    "- 🧪 **Research and Development** — Publications and technical benchmarks\n",
    "\n",
    "---\n",
    "\n",
    "## ⭐ 5. **Noteworthy Features or Projects**\n",
    "\n",
    "- ✅ **GPT-4.1** — A new API-accessible version of the language model \n",
    "- 🖼️ **4o Image Generation** — Feature focused on AI-generated images \n",
    "- 🚀 **OpenAI Pioneers Program** — Initiative likely fostering innovation in AI \n",
    "- 📊 **BrowseComp & PaperBench** — Benchmarks for evaluating AI agents\n",
    "\n",
    "---\n",
    "\n",
    "✅ *If you're reading this inside Jupyter and seeing clean structure — your async notebook setup is working beautifully.*\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
67
community-contributions/playwright-bojan/README.md
Normal file
@@ -0,0 +1,67 @@
# 🧠 Community Contribution: Async Playwright-based OpenAI Scraper

This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright** — an alternative to Selenium.

Developed by: [lakovicb](https://github.com/lakovicb)  
IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)

---

## 📦 Features

- 🧭 Simulates human-like interactions (mouse movement, scrolling)
- 🧠 GPT-based analysis using OpenAI's API
- 🧪 Works inside **JupyterLab** using `nest_asyncio`
- 📊 Prometheus metrics for scraping observability
- ⚡ Smart content caching via `diskcache`
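The caching works by keying `diskcache` entries on a sanitized form of the URL. A tiny stdlib-only sketch of the scheme used in `get_cached_content` (the `cache_key` helper name is illustrative; the module inlines this expression):

```python
def cache_key(url: str) -> str:
    # Mirrors get_cached_content: drop the scheme, flatten path separators.
    return 'cache_' + url.replace('https://', '').replace('/', '_')

print(cache_key("https://openai.com/news"))  # cache_openai.com_news
```

Identical URLs therefore hit the same cache entry until it expires (see `CACHE_EXPIRY` below).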

---

## 🚀 How to Run

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

> Ensure [Playwright is installed & browsers are downloaded](https://playwright.dev/python/docs/intro)

```bash
playwright install
```

### 2. Set environment variables in `.env`

```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```

You can also define optional proxy/login params if needed.
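For reference, `openai_scraper_playwright.py` also reads the following optional variables via `os.getenv()`; the values shown are the defaults hard-coded in the module, so you only need to set the ones you want to change:

```env
RETRY_COUNT=2
CACHE_EXPIRY=3600
MAX_CONTENT_LENGTH=30000
OPENAI_MODEL=gpt-4-turbo
MODEL_TEMPERATURE=0.3
MAX_TOKENS=1500
MODEL_TOP_P=0.9
# Comma-separated proxy list; ignored when empty
PROXY_SERVERS=
```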

---

## 📘 Notebooks Included

| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |

---

## 🔁 Output Example

- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes

✅ *Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*

---

## 🙏 Thanks

Huge thanks to Ed Donner for the amazing course and challenge inspiration!
141
community-contributions/playwright-bojan/openai_scraper_playwright.py
Normal file
@@ -0,0 +1,141 @@
# openai_scraper_playwright.py

import asyncio
from playwright.async_api import async_playwright
from openai import OpenAI
import logging
import random
import time
import os
from prometheus_client import start_http_server, Counter, Histogram
from diskcache import Cache
from dotenv import load_dotenv

load_dotenv()

SCRAPE_ATTEMPTS = Counter('scrape_attempts', 'Total scraping attempts')
SCRAPE_DURATION = Histogram('scrape_duration', 'Scraping duration distribution')
cache = Cache('./scraper_cache')


class ScrapingError(Exception):
    pass


class ContentAnalysisError(Exception):
    pass


class EnhancedOpenAIScraper:
    API_KEY = os.getenv("OPENAI_API_KEY")
    BROWSER_EXECUTABLE = os.getenv("BROWSER_PATH", "/usr/bin/chromium-browser")
    MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", 30000))

    def __init__(self, headless=True):
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
        ]
        self.timeout = 45000
        self.retry_count = int(os.getenv("RETRY_COUNT", 2))
        self.headless = headless
        self.proxy_servers = [x.strip() for x in os.getenv("PROXY_SERVERS", "").split(',') if x.strip()]

    async def human_interaction(self, page):
        for _ in range(random.randint(2, 5)):
            x, y = random.randint(0, 1366), random.randint(0, 768)
            await page.mouse.move(x, y, steps=random.randint(5, 20))
            await page.wait_for_timeout(random.randint(50, 200))

        if random.random() < 0.3:
            await page.keyboard.press('Tab')
            await page.keyboard.type(' ', delay=random.randint(50, 200))

        await page.mouse.wheel(0, random.choice([300, 600, 900]))
        await page.wait_for_timeout(random.randint(500, 2000))

    async def load_page(self, page, url):
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=self.timeout)
            selectors = ['main article', '#main-content', 'section:first-of-type', 'div[class*="content"]', 'body']
            for selector in selectors:
                if await page.query_selector(selector):
                    return True
            await page.wait_for_timeout(5000)
            return True
        except Exception as e:
            logging.error(f"Error loading page {url}: {e}")
            return False

    @SCRAPE_DURATION.time()
    async def scrape_with_retry(self, url):
        SCRAPE_ATTEMPTS.inc()
        last_error = None
        try:
            async with async_playwright() as p:
                args = {
                    "headless": self.headless,
                    "args": ["--disable-blink-features=AutomationControlled", "--no-sandbox"],
                    "executable_path": self.BROWSER_EXECUTABLE
                }
                browser = await p.chromium.launch(**args)
                context = await browser.new_context(user_agent=random.choice(self.user_agents))
                page = await context.new_page()
                await page.add_init_script("""
                    Object.defineProperty(navigator, 'webdriver', { get: () => false });
                """)

                for attempt in range(self.retry_count):
                    try:
                        if not await self.load_page(page, url):
                            raise ScrapingError("Failed to load page")
                        await self.human_interaction(page)
                        content = await page.evaluate("""() => document.body.innerText""")
                        if not content.strip():
                            raise ContentAnalysisError("No content extracted")
                        await browser.close()
                        return content[:self.MAX_CONTENT_LENGTH]
                    except Exception as e:
                        last_error = e
                        if attempt < self.retry_count - 1:
                            await asyncio.sleep(5)
                        else:
                            await browser.close()
                            raise
        except Exception as e:
            raise last_error or e

    async def get_cached_content(self, url):
        key = 'cache_' + url.replace('https://', '').replace('/', '_')
        content = cache.get(key)
        if content is None:
            content = await self.scrape_with_retry(url)
            cache.set(key, content, expire=int(os.getenv("CACHE_EXPIRY", 3600)))
        return content


async def analyze_content(url="https://openai.com", headless=True):
    scraper = EnhancedOpenAIScraper(headless=headless)
    content = await scraper.get_cached_content(url)
    client = OpenAI(api_key=EnhancedOpenAIScraper.API_KEY)
    if not client.api_key:
        raise ContentAnalysisError("OpenAI API key not configured")

    prompt = f"""
    Analyze this page:

    {content}
    """
    model = os.getenv("OPENAI_MODEL", "gpt-4-turbo")
    temperature = float(os.getenv("MODEL_TEMPERATURE", 0.3))
    max_tokens = int(os.getenv("MAX_TOKENS", 1500))
    top_p = float(os.getenv("MODEL_TOP_P", 0.9))

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a content analyst."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p
    )

    if not response.choices:
        raise ContentAnalysisError("Empty response from GPT")

    return response.choices[0].message.content
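The nested try/except in `scrape_with_retry` follows a standard retry-with-delay pattern: keep the last error, sleep between attempts, re-raise once attempts are exhausted. A stripped-down, stdlib-only sketch of that control flow (all names here are illustrative, not part of the module):

```python
import asyncio

async def retry(coro_factory, attempts=2, delay=0.01):
    """Run coro_factory() up to `attempts` times, sleeping between failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                await asyncio.sleep(delay)
    raise last_error

# Demo: a flaky operation that fails once, then succeeds on the retry.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "content"

result = asyncio.run(retry(flaky))
print(result)  # content
```

The real method adds browser setup and teardown around this loop, which is why it closes the browser before the final re-raise.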
6
community-contributions/playwright-bojan/requirements.txt
Normal file
@@ -0,0 +1,6 @@
playwright>=1.43.0
openai>=1.14.2
prometheus-client>=0.19.0
diskcache>=5.6.1
python-dotenv>=1.0.1
nest_asyncio>=1.6.0