Added Playwright-based scraper solution to community-contributions

This commit is contained in:
lakovicb
2025-04-23 14:45:58 +02:00
parent b8b2f766e5
commit ba7455e38e
2 changed files with 356 additions and 0 deletions

View File

@@ -0,0 +1,56 @@
# 🧠 Playwright-Based Web Scraper for openai.com
### 📚 Community Contribution for Ed Donner's "LLM Engineering: Master AI" Course
> _“An extra exercise for those who enjoy web scraping...
> In the community-contributions folder, you'll find an example Selenium solution from a student.”_
---
## 🔍 About This Project
This is a response to Ed Donners bonus exercise to scrape `https://openai.com`, which uses dynamic JavaScript rendering.
A fellow student contributed a Selenium-based solution — this one goes a step further with **Playwright**.
---
## 🆚 Why Playwright Over Selenium?
| Feature | Selenium | Playwright 🏆 |
|----------------------|------------------------------|-----------------------------|
| **Installation** | More complex setup | Minimal + faster setup |
| **Speed** | Slower due to architecture | Faster execution (async) |
| **Multi-browser** | Requires config | Built-in Chrome, Firefox, WebKit support |
| **Headless mode** | Less stable | Super stable |
| **Async-friendly** | Not built-in | Native support via asyncio |
| **Interaction APIs** | Limited | Richer simulation (mouse, scroll, etc.) |
---
## ⚙️ Features
-**Full JavaScript rendering** using Chromium
-**Human-like behavior simulation** (mouse movement, scrolling, typing)
-**Caching** with `diskcache`
-**Prometheus metrics**
-**Asynchronous scraping logic**
-**Content summarization via OpenAI GPT API**
---
## 🧠 Why not in JupyterLab?
Due to the async nature of Playwright and the use of `asyncio.run()`, running this inside Jupyter causes `RuntimeError` conflicts.
This solution was developed and tested in:
- 💻 WingIDE 10 Pro
- 🐧 Ubuntu via WSL
- 🐍 Conda environment with Anaconda Python 3.12
---
## 🚀 How to Run
1. Install dependencies:
```bash
pip install -r requirements.txt