Added Playwright-based scraper solution to community-contributions

2025-04-23 14:45:58 +02:00
parent b8b2f766e5
commit ba7455e38e
2 changed files with 356 additions and 0 deletions
--- a/community-contributions/playwright-bojan/README.md
+++ b/community-contributions/playwright-bojan/README.md
@@ -0,0 +1,56 @@
+# 🧠 Playwright-Based Web Scraper for openai.com  
+### 📚 Community Contribution for Ed Donner's "LLM Engineering: Master AI" Course
+
+> _“An extra exercise for those who enjoy web scraping...  
+> In the community-contributions folder, you'll find an example Selenium solution from a student.”_
+
+---
+
+## 🔍 About This Project
+
+This is a response to Ed Donner’s bonus exercise to scrape `https://openai.com`, which uses dynamic JavaScript rendering.  
+A fellow student contributed a Selenium-based solution — this one goes a step further with **Playwright**.
+
+---
+
+## 🆚 Why Playwright Over Selenium?
+
+| Feature              | Selenium                     | Playwright 🏆               |
+|----------------------|------------------------------|-----------------------------|
+| **Installation**     | More complex setup           | Minimal + faster setup      |
+| **Speed**            | Slower due to architecture   | Faster execution (async)    |
+| **Multi-browser**    | Requires config              | Built-in Chrome, Firefox, WebKit support |
+| **Headless mode**    | Less stable                  | Super stable                |
+| **Async-friendly**   | Not built-in                 | Native support via asyncio  |
+| **Interaction APIs** | Limited                      | Richer simulation (mouse, scroll, etc.) |
+
+---
+
+## ⚙️ Features
+
+- ✅ **Full JavaScript rendering** using Chromium
+- ✅ **Human-like behavior simulation** (mouse movement, scrolling, typing)
+- ✅ **Caching** with `diskcache`
+- ✅ **Prometheus metrics**
+- ✅ **Asynchronous scraping logic**
+- ✅ **Content summarization via OpenAI GPT API**
+
+---
+
+## 🧠 Why not in JupyterLab?
+
+Due to the async nature of Playwright and the use of `asyncio.run()`, running this inside Jupyter causes `RuntimeError` conflicts.
+
+This solution was developed and tested in:
+
+- 💻 WingIDE 10 Pro
+- 🐧 Ubuntu via WSL
+- 🐍 Conda environment with Anaconda Python 3.12
+
+---
+
+## 🚀 How to Run
+
+1. Install dependencies:
+```bash
+pip install -r requirements.txt