Files

lakovicb 1a626abba0 Add Bojan's Playwright asynchronous scraper project

This contribution includes a fully asynchronous scraper using Playwright and OpenAI API, with Python scripts, Jupyter notebooks (outputs cleared), Markdown summaries, and a README. Organized under community-contributions/bojan-playwright-scraper/. Limited content retrieval from Huggingface.co is documented in the README.

2025-04-29 10:07:18 +02:00

4.8 KiB

Raw Blame History

🧠 Community Contribution: Async Playwright-based AI Scraper

Overview

This project is a fully asynchronous, headless-browser-based scraper built using Playwright and the OpenAI API.
It scrapes and analyzes content from four AI-related websites, producing structured summaries in Markdown and Jupyter notebook formats.
Playwright was chosen over Selenium for its speed and efficiency, making it ideal for modern web scraping tasks.

Developed by: lakovicb
IDE used: WingIDE Pro 10 (Jupyter compatibility via nest_asyncio)
Python version: 3.12.9 (developed and tested with Anaconda)

📦 Features

🧭 Simulates human-like interactions (mouse movement, scrolling)
🧠 GPT-based analysis using OpenAI's API
🧪 Works inside JupyterLab using nest_asyncio
📊 Prometheus metrics for scraping observability
⚡ Smart content caching via diskcache
📝 Generates structured Markdown summaries and Jupyter notebooks

🚀 How to Run

1. Install dependencies

Run these commands in your terminal:

conda install python-dotenv prometheus_client diskcache nbformat
pip install playwright openai
playwright install

Note: Ensure your environment supports Python 3.12 for optimal performance.

2. Set environment variables

Create a .env file in /home/lakov/projects/llm_engineering/ with:

OPENAI_API_KEY=your_openai_key

(Optional) Define proxy/login parameters if needed.

3. Run the scraper

python playwright_ai_scraper.py

This scrapes and analyzes the following URLs:

4. Generate notebooks

python notebook_generator.py

Enter a URL when prompted to generate a Jupyter notebook in the notebooks/ directory.

📊 Results

Python Files for Developers

playwright_ai_scraper.py: Core async scraper and analyzer.
notebook_generator.py: Creates Jupyter notebooks for given URLs.

These files enable transparency, reproducibility, and extendability.

Markdown Summaries

Saved in outputs/:

Structured analyses with sections for Summary, Entities, Updates, Topics, and Features.
Readable and portable format.

Jupyter Notebooks

Available in notebooks/:

Playwright_AI_Scraper_JupyterAsync.ipynb
Playwright_AI_Scraper_Showcase_Formatted.ipynb

🔍 Playwright vs. Selenium

Criteria	Selenium	Playwright
Release Year	2004	2020
Supported Browsers	Chrome, Firefox, Safari, Edge, IE	Chromium, Firefox, WebKit
Supported Languages	Many	Python, JS/TS, Java, C#
Setup	Complex (WebDrivers)	Simple (auto-installs binaries)
Execution Speed	Slower	Faster (WebSocket)
Dynamic Content	Good (requires explicit waits)	Excellent (auto-waits)
Community Support	Large, mature	Growing, modern, Microsoft-backed

Playwright was chosen for its speed, simplicity, and modern feature set.

⚙️ Asynchronous Code and WingIDE Pro 10

Fully async scraping with asyncio.
Developed using WingIDE Pro 10 for:
- Robust async support
- Full Python 3.12 compatibility
- Integration with JupyterLab via nest_asyncio
- Stability and efficient debugging

📁 Directory Structure

playwright_ai_scraper.py         # Main scraper script
notebook_generator.py            # Notebook generator script
outputs/                         # Markdown summaries
notebooks/                       # Generated Jupyter notebooks
requirements.txt                 # List of dependencies
scraper_cache/                   # Cache directory

📝 Notes

Uses Prometheus metrics and diskcache.
Ensure a valid OpenAI API key.
Potential extensions: PDF export, LangChain pipeline, vector store ingestion.
Note: Due to the dynamic nature and limited static text on the Huggingface.co homepage, the scraper retrieved only minimal information, which resulted in a limited AI-generated summary. This behavior reflects a realistic limitation of scraping dynamic websites without interaction-based extraction.

🙏 Thanks

Special thanks to Ed Donner for the amazing course and project challenge inspiration!

4.8 KiB Raw Blame History