Final adjustments and preparation for Ed's review
@@ -0,0 +1,173 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "aa629e55-8f41-41ab-b319-b55dd1cfc76b",
+   "metadata": {},
+   "source": [
+    "# Playwright Scraper Showcase (Async in Jupyter)\n",
+    "\n",
+    "This notebook demonstrates how to run async Playwright-based scraping code inside JupyterLab using `nest_asyncio`.\n",
+    "\n",
+    "**Note:** Requires `openai_scraper_playwright.py` to be in the same directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "97469777",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import nest_asyncio\n",
+    "import asyncio\n",
+    "nest_asyncio.apply()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "6254fa89",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from openai_scraper_playwright import EnhancedOpenAIScraper, analyze_content"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "33d2737b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "### 1. Overall Summary of the Website:\n",
+      "The website appears to be a hub for various applications of AI technology, particularly focusing on the capabilities of ChatGPT and other AI models developed by OpenAI. It offers a range of services from answering queries, assisting in planning trips, explaining technical topics, helping with language translation, and providing educational content. The site also features updates on new AI models, research publications, and business solutions integrating AI.\n",
+      "\n",
+      "### 2. Key Individuals or Entities:\n",
+      "- **OpenAI**: Mentioned as the organization behind the development of AI models and technologies such as ChatGPT, GPT-4.1, and image generation models. OpenAI seems to be focused on advancing and applying AI in various fields.\n",
+      "- **Lyndon Barrois & Sora**: Featured in a story, possibly highlighting individual experiences or contributions within the OpenAI ecosystem.\n",
+      "\n",
+      "### 3. Recent Announcements or Updates:\n",
+      "- **Introducing our latest image generation model in the API** (Product, Apr 23, 2025)\n",
+      "- **Thinking with images** (Release, Apr 16, 2025)\n",
+      "- **OpenAI announces nonprofit commission advisors** (Company, Apr 15, 2025)\n",
+      "- **Our updated Preparedness Framework** (Publication, Apr 15, 2025)\n",
+      "- **BrowseComp: a benchmark for browsing agents** (Publication, Apr 10, 2025)\n",
+      "- **OpenAI Pioneers Program** (Company, Apr 9, 2025)\n",
+      "\n",
+      "### 4. Main Topics or Themes:\n",
+      "- **AI Model Development and Application**: Discusses various AI models like ChatGPT, GPT-4.1, and image generation models.\n",
+      "- **Educational and Practical AI Uses**: Offers help in educational topics, practical tasks, and creative endeavors using AI.\n",
+      "- **Business Integration**: Focuses on integrating AI into business processes, automating tasks in finance, legal, and other sectors.\n",
+      "- **Research and Publications**: Shares updates on the latest research and publications related to AI technology.\n",
+      "\n",
+      "### 5. Any Noteworthy Features or Projects:\n",
+      "- **GPT-4.1 and Image Generation Models**: Introduction of new and advanced AI models for text and image processing.\n",
+      "- **OpenAI Pioneers Program**: A significant initiative likely aimed at fostering innovation and practical applications of AI technology.\n",
+      "- **BrowseComp and PaperBench**: Research projects or benchmarks designed to evaluate and improve AI capabilities in specific domains.\n"
+     ]
+    }
+   ],
+   "source": [
+    "result = asyncio.run(analyze_content())\n",
+    "print(result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d7450ccf",
+   "metadata": {},
+   "source": [
+    "✅ If you see structured analysis above, the async code ran successfully in Jupyter!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a46716c-6f77-4b2b-b423-cc9fe05014da",
+   "metadata": {},
+   "source": [
+    "# 🧪 Playwright Scraper Output (Formatted)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 🧭 1. **Overall Summary of the Website**\n",
+    "\n",
+    "*The website appears to be focused on showcasing various applications and updates related to OpenAI's technology, specifically ChatGPT and other AI tools. It provides information on product releases, company updates, and educational content on how to use AI technologies in different scenarios such as planning trips, learning games, coding, and more.*\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 🧑💼 2. **Key Individuals or Entities**\n",
+    "\n",
+    "- **OpenAI** — Company behind the technologies and updates discussed on the website \n",
+    "- **Lyndon Barrois & Sora** — Featured in a story, possibly highlighting user experiences or contributions\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 📰 3. **Recent Announcements or Updates**\n",
+    "\n",
+    "- 📢 **Introducing GPT-4.1 in the API** — *(no date provided)*\n",
+    "- 🖼️ **Introducing 4o Image Generation** — *(no date provided)*\n",
+    "- 🐟 **Catching halibut with ChatGPT** — *(no date provided)*\n",
+    "- 🧠 **Thinking with images** — *Apr 16, 2025*\n",
+    "- 🧑⚖️ **Nonprofit commission advisors announced** — *Apr 15, 2025*\n",
+    "- ⚙️ **Updated Preparedness Framework** — *Apr 15, 2025*\n",
+    "- 🌐 **BrowseComp benchmark for browsing agents** — *Apr 10, 2025*\n",
+    "- 🚀 **OpenAI Pioneers Program launched** — *Apr 9, 2025*\n",
+    "- 📊 **PaperBench research benchmark published** — *Apr 2, 2025*\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 📚 4. **Main Topics or Themes**\n",
+    "\n",
+    "- 🤖 **AI Technology Applications** — Using AI for tasks like planning, learning, and troubleshooting \n",
+    "- 🧩 **Product and Feature Releases** — Updates on new capabilities \n",
+    "- 📘 **Educational Content** — Guides for using AI effectively \n",
+    "- 🧪 **Research and Development** — Publications and technical benchmarks\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## ⭐ 5. **Noteworthy Features or Projects**\n",
+    "\n",
+    "- ✅ **GPT-4.1** — A new API-accessible version of the language model \n",
+    "- 🖼️ **4o Image Generation** — Feature focused on AI-generated images \n",
+    "- 🚀 **OpenAI Pioneers Program** — Initiative likely fostering innovation in AI \n",
+    "- 📊 **BrowseComp & PaperBench** — Benchmarks for evaluating AI agents\n",
+    "\n",
+    "---\n",
+    "\n",
+    "✅ *If you're reading this inside Jupyter and seeing clean structure — your async notebook setup is working beautifully.*\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95c38374-5daa-487c-8bd9-919bb4037ea3",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
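The three code cells above rely on one detail: Jupyter already runs an asyncio event loop, so a plain `asyncio.run(...)` call would fail with "asyncio.run() cannot be called from a running event loop"; `nest_asyncio.apply()` patches the loop so the call can nest. A minimal sketch of the same pattern, assuming only that `openai_scraper_playwright.py` sits next to the notebook as the note above requires:

import asyncio
import nest_asyncio

nest_asyncio.apply()  # patch Jupyter's running loop so asyncio.run() can be nested

from openai_scraper_playwright import analyze_content

result = asyncio.run(analyze_content())  # blocks the cell until scraping and GPT analysis finish
print(result)

On recent IPython kernels the coroutine can also be awaited directly at the top level of a cell (`result = await analyze_content()`), which avoids `nest_asyncio` altogether; the notebook demonstrates the `asyncio.run()` form.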
@@ -1,3 +1,5 @@
+# openai_scraper_playwright.py
+
 import asyncio
 from playwright.async_api import async_playwright
 from openai import OpenAI
@@ -11,250 +13,122 @@ from dotenv import load_dotenv

 load_dotenv()

-# Setting up Prometheus metrics
 SCRAPE_ATTEMPTS = Counter('scrape_attempts', 'Total scraping attempts')
-SCRAPE_DURATION = Histogram(
-    'scrape_duration', 'Scraping duration distribution')
+SCRAPE_DURATION = Histogram('scrape_duration', 'Scraping duration distribution')

-# Setting up cache
 cache = Cache('./scraper_cache')

-class ScrapingError(Exception):
-    pass
-
-
-class ContentAnalysisError(Exception):
-    pass
+class ScrapingError(Exception): pass
+class ContentAnalysisError(Exception): pass


 class EnhancedOpenAIScraper:
     API_KEY = os.getenv("OPENAI_API_KEY")
-    BROWSER_EXECUTABLE = os.getenv(
-        "BROWSER_PATH", "/usr/bin/chromium-browser")
+    BROWSER_EXECUTABLE = os.getenv("BROWSER_PATH", "/usr/bin/chromium-browser")
     MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", 30000))

     def __init__(self, headless=True):
         self.user_agents = [
-            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
-            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
+            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
         ]
-        self.timeout = 45000  # 45 seconds
+        self.timeout = 45000
         self.retry_count = int(os.getenv("RETRY_COUNT", 2))
         self.headless = headless
-        self.mouse_velocity_range = (100, 500)  # px/ms
-        self.interaction_delays = {
-            'scroll': (int(os.getenv("SCROLL_DELAY_MIN", 500)), int(os.getenv("SCROLL_DELAY_MAX", 2000))),
-            'click': (int(os.getenv("CLICK_DELAY_MIN", 100)), int(os.getenv("CLICK_DELAY_MAX", 300))),
-            'movement': (int(os.getenv("MOVEMENT_DELAY_MIN", 50)), int(os.getenv("MOVEMENT_DELAY_MAX", 200)))
-        }
-        self.proxy_servers = [server.strip() for server in os.getenv(
-            "PROXY_SERVERS", "").split(',') if server.strip()]
+        self.proxy_servers = [x.strip() for x in os.getenv("PROXY_SERVERS", "").split(',') if x.strip()]

     async def human_interaction(self, page):
-        """Advanced simulation of user behavior"""
-        # Random mouse movement path
         for _ in range(random.randint(2, 5)):
-            x = random.randint(0, 1366)
-            y = random.randint(0, 768)
+            x, y = random.randint(0, 1366), random.randint(0, 768)
             await page.mouse.move(x, y, steps=random.randint(5, 20))
-            await page.wait_for_timeout(random.randint(*self.interaction_delays['movement']))
+            await page.wait_for_timeout(random.randint(50, 200))

-        # Simulating typing
         if random.random() < 0.3:
             await page.keyboard.press('Tab')
             await page.keyboard.type(' ', delay=random.randint(50, 200))

-        # More realistic scrolling
-        scroll_distance = random.choice([300, 600, 900])
-        await page.mouse.wheel(0, scroll_distance)
-        await page.wait_for_timeout(random.randint(*self.interaction_delays['scroll']))
+        await page.mouse.wheel(0, random.choice([300, 600, 900]))
+        await page.wait_for_timeout(random.randint(500, 2000))

     async def load_page(self, page, url):
-        """Smarter page loading with dynamic waiting"""
-        start_time = time.time()
         try:
             await page.goto(url, wait_until="domcontentloaded", timeout=self.timeout)
-            # Smarter content extraction selectors
-            selectors = [
-                'main article',
-                '#main-content',
-                'section:first-of-type',
-                'div[class*="content"]',
-                'body'  # Fallback
-            ]
+            selectors = ['main article', '#main-content', 'section:first-of-type', 'div[class*="content"]', 'body']

             for selector in selectors:
-                try:
-                    element = await page.query_selector(selector)
-                    if element:
-                        return True
-                except Exception:
-                    continue
-
-            # Fallback if no selector is found within a certain time
-            if time.time() - start_time < 30:  # If we haven't used the full timeout
-                await page.wait_for_timeout(30000 - int(time.time() - start_time))
-
-            return True  # Page likely loaded
+                if await page.query_selector(selector):
+                    return True
+            await page.wait_for_timeout(5000)
+            return True
         except Exception as e:
             logging.error(f"Error loading page {url}: {e}")
             return False

     @SCRAPE_DURATION.time()
-    async def scrape_with_retry(self):
-        """Main function with retry mechanism and browser reuse"""
+    async def scrape_with_retry(self, url):
         SCRAPE_ATTEMPTS.inc()
         last_error = None
-        browser = None
-        context = None
-        page = None

         try:
             async with async_playwright() as p:
-                launch_args = {
+                args = {
                     "headless": self.headless,
-                    "args": [
-                        "--disable-blink-features=AutomationControlled",
-                        "--single-process",
-                        "--no-sandbox",
-                        f"--user-agent={random.choice(self.user_agents)}"
-                    ],
+                    "args": ["--disable-blink-features=AutomationControlled", "--no-sandbox"],
                     "executable_path": self.BROWSER_EXECUTABLE
                 }
-                if self.proxy_servers:
-                    proxy_url = random.choice(self.proxy_servers)
-                    proxy_config = {"server": proxy_url}
-                    proxy_username = os.getenv('PROXY_USER')
-                    proxy_password = os.getenv('PROXY_PASS')
-                    if proxy_username and proxy_password:
-                        proxy_config['username'] = proxy_username
-                        proxy_config['password'] = proxy_password
-                    launch_args['proxy'] = proxy_config
-
-                browser = await p.chromium.launch(**launch_args)
-                context = await browser.new_context(
-                    user_agent=random.choice(self.user_agents),
-                    viewport={"width": 1366, "height": 768},
-                    locale=os.getenv("BROWSER_LOCALE", "en-US")
-                )
-                await context.route("**/*", lambda route: route.continue_())
+                browser = await p.chromium.launch(**args)
+                context = await browser.new_context(user_agent=random.choice(self.user_agents))
                 page = await context.new_page()
                 await page.add_init_script("""
                     Object.defineProperty(navigator, 'webdriver', { get: () => false });
-                    window.navigator.chrome = { runtime: {}, app: { isInstalled: false } };
                 """)

                 for attempt in range(self.retry_count):
                     try:
-                        logging.info(
-                            f"Attempt {attempt + 1}: Loading OpenAI...")
-                        if not await self.load_page(page, "https://openai.com"):
-                            raise ScrapingError(
-                                "Failed to load key content on OpenAI website.")
+                        if not await self.load_page(page, url):
+                            raise ScrapingError("Failed to load page")
                         await self.human_interaction(page)
-                        await page.screenshot(path=f"openai_debug_{attempt}.png")
-                        content = await page.evaluate("""() => {
-                            const selectors = [
-                                'main article',
-                                '#main-content',
-                                'section:first-of-type',
-                                'div[class*="content"]'
-                            ];
-
-                            let content = '';
-                            for (const selector of selectors) {
-                                const element = document.querySelector(selector);
-                                if (element) {
-                                    content += element.innerText + '\\n\\n';
-                                }
-                            }
-                            return content.trim() || document.body.innerText;
-                        }""")
+                        content = await page.evaluate("""() => document.body.innerText""")
                         if not content.strip():
-                            raise ContentAnalysisError(
-                                "No content extracted from the page.")
+                            raise ContentAnalysisError("No content extracted")
+                        await browser.close()
                         return content[:self.MAX_CONTENT_LENGTH]
-                    except (ScrapingError, ContentAnalysisError) as e:
-                        last_error = e
-                        logging.warning(
-                            f"Attempt {attempt + 1} failed: {str(e)}")
-                        if attempt < self.retry_count - 1:
-                            await asyncio.sleep(5)
-                        else:
-                            if browser:
-                                await browser.close()
-                                browser = None
-                            raise
-                    except Exception as e:
-                        last_error = e
-                        logging.exception(f"Unexpected error on attempt {
-                            attempt + 1}: {str(e)}")
-                        if attempt < self.retry_count - 1:
-                            await asyncio.sleep(5)
-                        else:
-                            if browser:
-                                await browser.close()
-                                browser = None
-                            raise
-
-        except Exception as e:
-            last_error = e
-        finally:
-            if browser:
-                await browser.close()
-
-        raise last_error if last_error else Exception(
-            "All scraping attempts failed.")
-
-    async def get_cached_content(self):
-        key = 'openai_content_cache_key'
-        content = cache.get(key)
-        if content is None:
-            content = await self.scrape_with_retry()
-            cache.set(key, content, expire=int(
-                os.getenv("CACHE_EXPIRY", 3600)))
-        return content
+                    except Exception as e:
+                        last_error = e
+                        if attempt < self.retry_count - 1:
+                            await asyncio.sleep(5)
+                        else:
+                            await browser.close()
+                            raise
+        except Exception as e:
+            raise last_error or e
+
+    async def get_cached_content(self, url):
+        key = 'cache_' + url.replace('https://', '').replace('/', '_')
+        content = cache.get(key)
+        if content is None:
+            content = await self.scrape_with_retry(url)
+            cache.set(key, content, expire=int(os.getenv("CACHE_EXPIRY", 3600)))
+        return content


-async def analyze_content(headless=True):
-    try:
-        scraper = EnhancedOpenAIScraper(headless=headless)
-        content = await scraper.get_cached_content()
-
-        client = OpenAI(api_key=EnhancedOpenAIScraper.API_KEY)
-        if not client.api_key:
-            raise ContentAnalysisError(
-                "OpenAI API key not configured (check environment variables).")
-
-        prompt_template = """
-        Analyze the following website content and extract the following information if present:
-
-        1. **Overall Summary of the Website:** Provide a concise overview of the website's purpose and the main topics discussed.
-        2. **Key Individuals or Entities:** Identify and briefly describe any prominent individuals, companies, or organizations mentioned.
-        3. **Recent Announcements or Updates:** List any recent announcements, news, or updates found on the website, including dates if available.
-        4. **Main Topics or Themes:** Identify the primary subjects or themes explored on the website.
-        5. **Any Noteworthy Features or Projects:** Highlight any significant features, projects, or initiatives mentioned.
-
-        Format the output clearly under each of these headings. If a particular piece of information is not found, indicate that it is not present.
-
-        Content:
-        {content}
-        """
-
-        formatted_prompt = prompt_template.format(content=content)
-        model_name = os.getenv("OPENAI_MODEL", "gpt-4-turbo")
-        temperature = float(os.getenv("MODEL_TEMPERATURE", 0.3))
-        max_tokens = int(os.getenv("MAX_TOKENS", 1500))
-        top_p = float(os.getenv("MODEL_TOP_P", 0.9))
-
-        response = client.chat.completions.create(
-            model=model_name,
-            messages=[
-                {"role": "system", "content": "You are a helpful assistant that analyzes website content and extracts key information in a structured format."},
-                {"role": "user", "content": formatted_prompt}
-            ],
-            temperature=temperature,
-            max_tokens=max_tokens,
+async def analyze_content(url="https://openai.com", headless=True):
+    scraper = EnhancedOpenAIScraper(headless=headless)
+    content = await scraper.get_cached_content(url)
+
+    client = OpenAI(api_key=EnhancedOpenAIScraper.API_KEY)
+    if not client.api_key:
+        raise ContentAnalysisError("OpenAI API key not configured")
+
+    prompt = f"""
+    Analyze this page:
+
+    {content}
+    """
+    model = os.getenv("OPENAI_MODEL", "gpt-4-turbo")
+    temperature = float(os.getenv("MODEL_TEMPERATURE", 0.3))
+    max_tokens = int(os.getenv("MAX_TOKENS", 1500))
+    top_p = float(os.getenv("MODEL_TOP_P", 0.9))
+
+    response = client.chat.completions.create(
+        model=model,
+        messages=[
+            {"role": "system", "content": "You are a content analyst."},
+            {"role": "user", "content": prompt}
+        ],
+        temperature=temperature,
+        max_tokens=max_tokens,
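For reference while reading the hunk above: `Counter`, `Histogram`, and `Cache` come from the `prometheus_client` and `diskcache` packages, and the `@SCRAPE_DURATION.time()` decorator plus the `cache.get`/`cache.set` calls follow their standard usage. A small self-contained sketch of those primitives, with illustrative metric and directory names rather than the module's own:

import time
from prometheus_client import Counter, Histogram, start_http_server
from diskcache import Cache

DEMO_ATTEMPTS = Counter('demo_attempts', 'Total demo attempts')            # monotonically increasing count
DEMO_DURATION = Histogram('demo_duration', 'Demo duration distribution')  # distribution of observed durations
cache = Cache('./demo_cache')                                              # persistent on-disk key/value store

@DEMO_DURATION.time()              # observes how long each call takes
def slow_lookup(key):
    DEMO_ATTEMPTS.inc()            # same pattern as SCRAPE_ATTEMPTS.inc()
    value = cache.get(key)         # None on a cache miss
    if value is None:
        time.sleep(0.1)            # stand-in for the real scrape
        value = f"computed-{key}"
        cache.set(key, value, expire=3600)  # same expire= pattern as CACHE_EXPIRY
    return value

if __name__ == "__main__":
    start_http_server(8000)        # metrics are then served at http://localhost:8000/metrics
    print(slow_lookup("example"))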
@@ -262,39 +136,6 @@ async def analyze_content(headless=True):
     )

     if not response.choices:
-        raise ContentAnalysisError("Empty response from GPT.")
+        raise ContentAnalysisError("Empty response from GPT")

     return response.choices[0].message.content
-
-    except (ScrapingError, ContentAnalysisError) as e:
-        logging.error(f"Analysis failed: {str(e)}")
-        return f"Critical analysis error: {str(e)}"
-    except Exception as e:
-        logging.exception("Unexpected error during analysis.")
-        return f"Unexpected analysis error: {str(e)}"
-
-
-async def main():
-    logging.basicConfig(
-        level=os.getenv("LOG_LEVEL", "INFO").upper(),
-        format='%(asctime)s - %(levelname)s - %(message)s'
-    )
-
-    # Start Prometheus HTTP server for exposing metrics
-    try:
-        prometheus_port = int(os.getenv("PROMETHEUS_PORT", 8000))
-        start_http_server(prometheus_port)
-        logging.info(f"Prometheus metrics server started on port {
-            prometheus_port}")
-    except Exception as e:
-        logging.warning(f"Failed to start Prometheus metrics server: {e}")
-
-    start_time = time.time()
-    result = await analyze_content(headless=True)
-    end_time = time.time()
-
-    print(f"\nAnalysis completed in {end_time - start_time:.2f} seconds\n")
-    print(result)
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
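After this change `analyze_content()` accepts the target URL directly, and the module no longer carries its own `main()` entry point with logging setup, Prometheus HTTP server, and timing, so callers own the event loop. A sketch of a small driver script one might use with the new signature; the wrapper below is illustrative and not part of the commit:

# run_analysis.py (illustrative driver, not part of this commit)
import asyncio
import logging

from openai_scraper_playwright import analyze_content

async def main():
    logging.basicConfig(level="INFO")
    # url and headless are the two parameters exposed by the new analyze_content() signature
    result = await analyze_content(url="https://openai.com", headless=True)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

The same call is what the showcase notebook wraps in `asyncio.run(...)` from inside Jupyter.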