Added detailed README for Playwright-based scraper contribution
This commit is contained in:
@@ -1,56 +1,67 @@
|
||||
# 🧠 Playwright-Based Web Scraper for openai.com
|
||||
### 📚 Community Contribution for Ed Donner's "LLM Engineering: Master AI" Course
|
||||
# 🧠 Community Contribution: Async Playwright-based OpenAI Scraper
|
||||
|
||||
> _“An extra exercise for those who enjoy web scraping...
|
||||
> In the community-contributions folder, you'll find an example Selenium solution from a student.”_
|
||||
This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright** — an alternative to Selenium.
|
||||
|
||||
Developed by: [lakovicb](https://github.com/lakovicb)
|
||||
IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)
|
||||
|
||||
---
|
||||
|
||||
## 🔍 About This Project
|
||||
## 📦 Features
|
||||
|
||||
This is a response to Ed Donner’s bonus exercise to scrape `https://openai.com`, which uses dynamic JavaScript rendering.
|
||||
A fellow student contributed a Selenium-based solution — this one goes a step further with **Playwright**.
|
||||
|
||||
---
|
||||
|
||||
## 🆚 Why Playwright Over Selenium?
|
||||
|
||||
| Feature | Selenium | Playwright 🏆 |
|
||||
|----------------------|------------------------------|-----------------------------|
|
||||
| **Installation** | More complex setup | Minimal + faster setup |
|
||||
| **Speed** | Slower due to architecture | Faster execution (async) |
|
||||
| **Multi-browser** | Requires config | Built-in Chrome, Firefox, WebKit support |
|
||||
| **Headless mode** | Less stable | Super stable |
|
||||
| **Async-friendly** | Not built-in | Native support via asyncio |
|
||||
| **Interaction APIs** | Limited | Richer simulation (mouse, scroll, etc.) |
|
||||
|
||||
---
|
||||
|
||||
## ⚙️ Features
|
||||
|
||||
- ✅ **Full JavaScript rendering** using Chromium
|
||||
- ✅ **Human-like behavior simulation** (mouse movement, scrolling, typing)
|
||||
- ✅ **Caching** with `diskcache`
|
||||
- ✅ **Prometheus metrics**
|
||||
- ✅ **Asynchronous scraping logic**
|
||||
- ✅ **Content summarization via OpenAI GPT API**
|
||||
|
||||
---
|
||||
|
||||
## 🧠 Why not in JupyterLab?
|
||||
|
||||
Due to the async nature of Playwright and the use of `asyncio.run()`, running this inside Jupyter causes `RuntimeError` conflicts.
|
||||
|
||||
This solution was developed and tested in:
|
||||
|
||||
- 💻 WingIDE 10 Pro
|
||||
- 🐧 Ubuntu via WSL
|
||||
- 🐍 Conda environment with Anaconda Python 3.12
|
||||
- 🧭 Simulates human-like interactions (mouse movement, scrolling)
|
||||
- 🧠 GPT-based analysis using OpenAI's API
|
||||
- 🧪 Works inside **JupyterLab** using `nest_asyncio`
|
||||
- 📊 Prometheus metrics for scraping observability
|
||||
- ⚡ Smart content caching via `diskcache`
|
||||
|
||||
---
|
||||
|
||||
## 🚀 How to Run
|
||||
|
||||
1. Install dependencies:
|
||||
### 1. Install dependencies
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
> Ensure [Playwright is installed & browsers are downloaded](https://playwright.dev/python/docs/intro)
|
||||
|
||||
```bash
|
||||
playwright install
|
||||
```
|
||||
|
||||
### 2. Set environment variables in `.env`
|
||||
|
||||
```env
|
||||
OPENAI_API_KEY=your_openai_key
|
||||
BROWSER_PATH=/usr/bin/chromium-browser
|
||||
```
|
||||
|
||||
You can also define optional proxy/login params if needed.
|
||||
|
||||
---
|
||||
|
||||
## 📘 Notebooks Included
|
||||
|
||||
| Notebook | Description |
|
||||
|----------|-------------|
|
||||
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
|
||||
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |
|
||||
|
||||
---
|
||||
|
||||
## 🔁 Output Example
|
||||
|
||||
- GPT-generated summary
|
||||
- Timeline of updates
|
||||
- Entities and projects mentioned
|
||||
- Structured topics & themes
|
||||
|
||||
✅ *Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*
|
||||
|
||||
---
|
||||
|
||||
## 🙏 Thanks
|
||||
|
||||
Huge thanks to Ed Donner for the amazing course and challenge inspiration!
|
||||
|
||||
Reference in New Issue
Block a user