Add Bojan's Playwright asynchronous scraper project
This contribution includes a fully asynchronous scraper using Playwright and OpenAI API, with Python scripts, Jupyter notebooks (outputs cleared), Markdown summaries, and a README. Organized under community-contributions/bojan-playwright-scraper/. Limited content retrieval from Huggingface.co is documented in the README.
144
community-contributions/bojan-playwright-scraper/README.md
Normal file
@@ -0,0 +1,144 @@
# 🧠 Community Contribution: Async Playwright-based AI Scraper

## Overview
This project is a fully asynchronous, headless-browser-based scraper built using Playwright and the OpenAI API.
It scrapes and analyzes content from four AI-related websites, producing structured summaries in Markdown and Jupyter notebook formats.
Playwright was chosen over Selenium for its faster execution, automatic waiting on dynamic content, and simpler setup.

**Developed by:** lakovicb
**IDE used:** WingIDE Pro 10 (Jupyter compatibility via nest_asyncio)
**Python version:** 3.12.9 (developed and tested with Anaconda)

---

## 📦 Features
- 🧭 Simulates human-like interactions (mouse movement, scrolling)
- 🧠 GPT-based analysis using OpenAI's API
- 🧪 Works inside JupyterLab using nest_asyncio
- 📊 Prometheus metrics for scraping observability
- ⚡ Smart content caching via diskcache
- 📝 Generates structured Markdown summaries and Jupyter notebooks

---

## 🚀 How to Run

### 1. Install dependencies
Run these commands in your terminal:
```bash
conda install python-dotenv prometheus_client diskcache nbformat
pip install playwright openai
playwright install
```
> Note: The project was developed and tested on Python 3.12, so use a Python 3.12 environment to match.

---

### 2. Set environment variables
Create a `.env` file in `/home/lakov/projects/llm_engineering/` (the path used on the author's machine) with:
```env
OPENAI_API_KEY=your_openai_key
```
(Optional) Define proxy/login parameters if needed.
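A minimal sketch of how the key might be loaded and validated before any API call is made. The helper names `load_env_file` and `get_openai_key` are hypothetical. `load_dotenv` comes from the `python-dotenv` package installed in step 1, and the sketch assumes the `.env` file can be found from the working directory. Failing early with a clear message avoids a less obvious authentication error later in the run.

```python
import os


def load_env_file():
    """Load variables from a .env file if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package
    except ImportError:
        return False
    return load_dotenv()


def get_openai_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
    return key
```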

---

### 3. Run the scraper
```bash
python playwright_ai_scraper.py
```
This scrapes and analyzes the following URLs:
- https://www.anthropic.com
- https://deepmind.google
- https://huggingface.co
- https://runwayml.com
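One way an async scraper can process the four URLs above is to run them concurrently with a cap on parallelism. The sketch below is an assumption about structure, not the project's actual code: `scrape_all` is a hypothetical helper, and `scrape_one` stands in for whatever coroutine fetches and analyzes a single page. Errors are returned per URL so that one failing site does not abort the others.

```python
import asyncio

URLS = [
    "https://www.anthropic.com",
    "https://deepmind.google",
    "https://huggingface.co",
    "https://runwayml.com",
]


async def scrape_all(urls, scrape_one, max_concurrency=2):
    """Run scrape_one(url) for every URL, limiting how many run at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await scrape_one(url)
            except Exception as exc:  # keep one failing site from aborting the rest
                return url, exc

    # gather preserves input order, so the dict keys follow the order of `urls`.
    return dict(await asyncio.gather(*(guarded(u) for u in urls)))
```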

---

### 4. Generate notebooks
```bash
python notebook_generator.py
```
Enter a URL when prompted to generate a Jupyter notebook in the `notebooks/` directory.
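To illustrate what generating a notebook for a URL involves, here is a sketch that writes a minimal notebook file using only the standard library. The project itself lists `nbformat` as a dependency, so this hand-built JSON is a stand-in technique rather than the project's method. `build_notebook` and `save_notebook` are hypothetical names, and the file-naming rule (host with dots replaced by underscores) is an assumption.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def build_notebook(url, summary_markdown):
    """Return a minimal nbformat-4 notebook dict with a title and a summary cell."""
    def markdown_cell(text):
        return {"cell_type": "markdown", "metadata": {}, "source": text}

    return {
        "cells": [
            markdown_cell(f"# Analysis of {url}"),
            markdown_cell(summary_markdown),
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def save_notebook(url, summary_markdown, out_dir="notebooks"):
    """Write the notebook to out_dir, naming the file after the URL's host."""
    host = urlparse(url).netloc.replace(".", "_") or "page"
    path = Path(out_dir) / f"{host}.ipynb"
    path.parent.mkdir(parents=True, exist_ok=True)
    notebook = build_notebook(url, summary_markdown)
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return path
```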

---

## 📊 Results

### Python Files for Developers
- `playwright_ai_scraper.py`: Core async scraper and analyzer.
- `notebook_generator.py`: Creates Jupyter notebooks for given URLs.

These files enable transparency, reproducibility, and extensibility.

---

### Markdown Summaries
Saved in `outputs/`:
- Structured analyses with sections for Summary, Entities, Updates, Topics, and Features.
- Readable and portable format.
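The section layout described above can be sketched as a small renderer. The section names come from this README; the function name `render_summary`, the dict-shaped input, and the placeholder text for empty sections are assumptions made for illustration.

```python
SECTIONS = ["Summary", "Entities", "Updates", "Topics", "Features"]


def render_summary(url, analysis):
    """Render an analysis dict as Markdown with one heading per section."""
    lines = [f"# Analysis: {url}", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        # Missing or empty sections get a visible placeholder instead of a blank gap.
        content = analysis.get(section) or "_No data retrieved._"
        if isinstance(content, (list, tuple)):
            lines.extend(f"- {item}" for item in content)
        else:
            lines.append(str(content))
        lines.append("")
    return "\n".join(lines)
```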

---

### Jupyter Notebooks
Available in `notebooks/`:
- `Playwright_AI_Scraper_JupyterAsync.ipynb`
- `Playwright_AI_Scraper_Showcase_Formatted.ipynb`

---

## 🔍 Playwright vs. Selenium

| Criteria               | Selenium                              | Playwright                           |
|------------------------|---------------------------------------|--------------------------------------|
| Release Year           | 2004                                  | 2020                                 |
| Supported Browsers     | Chrome, Firefox, Safari, Edge, IE     | Chromium, Firefox, WebKit            |
| Supported Languages    | Many                                  | Python, JS/TS, Java, C#              |
| Setup                  | Complex (WebDrivers)                  | Simple (auto-installs binaries)      |
| Execution Speed        | Slower                                | Faster (WebSocket)                   |
| Dynamic Content        | Good (requires explicit waits)        | Excellent (auto-waits)               |
| Community Support      | Large, mature                         | Growing, modern, Microsoft-backed    |

> **Playwright** was chosen for its speed, simplicity, and modern feature set.

---

## ⚙️ Asynchronous Code and WingIDE Pro 10

- Fully async scraping with `asyncio`.
- Developed using WingIDE Pro 10 for:
  - Robust async support
  - Full Python 3.12 compatibility
  - Integration with JupyterLab via `nest_asyncio`
  - Stability and efficient debugging
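The role of `nest_asyncio` mentioned above is to let async code run where an event loop is already active, as in JupyterLab. A hedged sketch of a launcher that works in both a plain script and a notebook follows. `run_async` is a hypothetical helper, and `nest_asyncio` is imported only on the branch that needs it, so the script path has no extra dependency.

```python
import asyncio


def run_async(coro):
    """Run a coroutine from a plain script or from inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain `python script.py`): asyncio.run is enough.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): patch it to allow nested runs.
    import nest_asyncio

    nest_asyncio.apply()
    return asyncio.get_event_loop().run_until_complete(coro)
```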

---

## 📁 Directory Structure

```bash
playwright_ai_scraper.py  # Main scraper script
notebook_generator.py     # Notebook generator script
outputs/                  # Markdown summaries
notebooks/                # Generated Jupyter notebooks
requirements.txt          # List of dependencies
scraper_cache/            # Cache directory
```
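The `scraper_cache/` directory holds cached content. As a rough illustration of a cache-first fetch, the sketch below accepts any object with `get(key)` and `set(key, value, expire=...)` methods, which matches the `diskcache.Cache` interface. The function name `cached_fetch`, the one-hour expiry, and the use of the URL as the cache key are assumptions, not details confirmed by the project.

```python
import asyncio


async def cached_fetch(url, cache, fetch, ttl_seconds=3600):
    """Return cached content for url, calling fetch(url) only on a cache miss."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    content = await fetch(url)
    # With diskcache, `expire` is the number of seconds until the entry is dropped.
    cache.set(url, content, expire=ttl_seconds)
    return content


# Assumed usage with the real library:
#   import diskcache
#   cache = diskcache.Cache("scraper_cache")
#   html = await cached_fetch("https://runwayml.com", cache, fetch_page)
```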

---

## 📝 Notes

- Scraping metrics are exposed via Prometheus, and scraped content is cached with diskcache.
- A valid OpenAI API key is required.
- Potential extensions: PDF export, LangChain pipeline, vector store ingestion.
- **Note:** The Huggingface.co homepage is largely dynamic and exposes little static text, so the scraper retrieved only minimal content and the AI-generated summary is correspondingly limited. This is a realistic limitation of scraping dynamic websites without interaction-based extraction.
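Interaction-based extraction, which the note above names as the missing piece, could be approached by scrolling until the page stops growing and only then reading the text. This is a sketch of that idea and not something the project implements. `extract_after_scrolling` is a hypothetical helper, `page` is assumed to be an open Playwright async page, and whether this would recover more content from Huggingface.co specifically is untested.

```python
import asyncio


async def extract_after_scrolling(page, max_rounds=10, pause_ms=500):
    """Scroll until the page height stops growing, then return the body text."""
    last_height = await page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        await page.mouse.wheel(0, last_height)
        await page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to render
        height = await page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = height
    return await page.inner_text("body")
```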

---

## 🙏 Thanks

Special thanks to **Ed Donner** for the amazing course and project challenge inspiration!