# 🧠 Community Contribution: Async Playwright-based OpenAI Scraper
This contribution is a fully asynchronous scraper for [https://openai.com](https://openai.com) that drives a headless browser with **Playwright**, used here as an alternative to Selenium.

Developed by: [lakovicb](https://github.com/lakovicb)

IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)

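
To give a feel for the asynchronous, headless approach described above, here is a minimal sketch using Playwright's async Python API. It is not the code from the notebooks: the names `plan_mouse_path` and `fetch_page_html`, the viewport size, the pause lengths, and the scroll distance are all assumptions made for illustration.

```python
import asyncio
import random


def plan_mouse_path(steps=5, width=1280, height=720, seed=None):
    """Return `steps` random (x, y) points inside the viewport (hypothetical helper)."""
    rng = random.Random(seed)
    return [(rng.randint(0, width - 1), rng.randint(0, height - 1)) for _ in range(steps)]


async def fetch_page_html(url, executable_path=None):
    """Open `url` in headless Chromium, move the mouse, scroll once, return the HTML."""
    # Imported lazily so the helper above stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, executable_path=executable_path)
        try:
            page = await browser.new_page(viewport={"width": 1280, "height": 720})
            await page.goto(url, wait_until="domcontentloaded")
            # Visit a few random points with short pauses (assumed values).
            for x, y in plan_mouse_path():
                await page.mouse.move(x, y)
                await asyncio.sleep(random.uniform(0.1, 0.4))
            # Scroll down by an assumed 600 pixels.
            await page.mouse.wheel(0, 600)
            return await page.content()
        finally:
            await browser.close()
```

From a plain script this could be called as `asyncio.run(fetch_page_html("https://openai.com"))`, which requires Playwright and its browsers to be installed.
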
---
## 📦 Features
- 🧭 Simulates human-like interactions (mouse movement, scrolling)
- 🧠 GPT-based analysis using OpenAI's API
- 🧪 Works inside **JupyterLab** using `nest_asyncio`
- 📊 Prometheus metrics for scraping observability
- ⚡ Smart content caching via `diskcache`
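
The caching feature listed above could look roughly like the sketch below. It is written against a plain mapping so that it runs without extra packages; `cached_fetch`, its `ttl_seconds` parameter, and the stored record layout are hypothetical and not taken from the notebooks.

```python
import asyncio
import time


async def cached_fetch(url, cache, fetch, ttl_seconds=3600.0):
    """Return content for `url`, awaiting `fetch` only on a miss or a stale entry."""
    entry = cache.get(url)
    now = time.time()
    if entry is not None and now - entry["stored_at"] < ttl_seconds:
        return entry["content"]  # fresh cache hit, no network call
    content = await fetch(url)
    cache[url] = {"content": content, "stored_at": now}
    return content


if __name__ == "__main__":
    async def fake_fetch(url):
        # Stand-in for the real scraper coroutine.
        return f"<html>{url}</html>"

    store = {}
    print(asyncio.run(cached_fetch("https://example.com", store, fake_fetch)))
```

Because the function only uses `cache.get(key)` and `cache[key] = value`, a `diskcache.Cache` instance should be usable in place of the dictionary to persist entries on disk. `diskcache` also accepts an `expire` argument on `Cache.set`, which may be a simpler way to handle staleness.
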
---
## 🚀 How to Run
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
> Then make sure [Playwright is installed and its browsers are downloaded](https://playwright.dev/python/docs/intro):
```bash
playwright install
```
### 2. Set environment variables in `.env`
```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```
Optional proxy and login parameters can be defined in the same file if needed.

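
As an illustration of how the two variables above might be read and validated, here is a standard-library sketch. The project itself may well load `.env` differently (for example with `python-dotenv`); `Settings`, `parse_env_text`, and `load_settings` are hypothetical names, and treating `BROWSER_PATH` as optional is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    browser_path: Optional[str]  # None means "use Playwright's bundled browser"


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if the API key is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise ValueError("OPENAI_API_KEY is not set")
    browser_path = env.get("BROWSER_PATH", "").strip() or None
    return Settings(openai_api_key=key, browser_path=browser_path)
```

Failing early on a missing key keeps the error close to its cause instead of surfacing later as a failed API call.
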
---
## 📘 Notebooks Included
| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |
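
Running async code inside Jupyter is awkward because the kernel already has an event loop running, so a plain `asyncio.run(...)` raises `RuntimeError`. The sketch below shows one way `nest_asyncio` is commonly used to work around this. `run_coroutine` is a hypothetical helper, not something taken from the notebooks.

```python
import asyncio


def run_coroutine(coro):
    """Run `coro` to completion from a plain script or from a Jupyter cell."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): asyncio.run is enough.
        return asyncio.run(coro)
    # A loop is already running (typical in Jupyter): patch it to allow re-entry.
    import nest_asyncio  # third-party package mentioned in this README

    nest_asyncio.apply(loop)
    return loop.run_until_complete(coro)


if __name__ == "__main__":
    async def demo():
        await asyncio.sleep(0)
        return "done"

    print(run_coroutine(demo()))
```

Recent Jupyter versions also allow a top-level `await` in a cell, which may avoid the need for patching in simple cases.
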
---
## 🔁 Output Example
- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes

*Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*

---
## 🙏 Thanks
Huge thanks to Ed Donner for the amazing course and challenge inspiration!