# Web Scraper & Data Analyzer

A modern Python application with a PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. It features a clean, minimalistic design with real-time progress tracking, data filtering, and an integrated AI chat assistant for deeper analysis.

## Features

- **Modern UI**: Clean, minimalistic design with dark theme and smooth animations
- **Web Scraping**: Multi-threaded scraping with configurable depth (up to 100 levels)
- **Data Visualization**: Interactive table with sorting and filtering
- **Content Preview**: Dual preview with both text and visual HTML rendering
- **Data Analysis**: Scraping statistics and domain breakdown
- **AI-Powered Analysis**: Chat-based assistant for website insights, SEO suggestions, and content analysis
- **Export Functionality**: JSON export with full metadata
- **URL Normalization**: Treats www and non-www forms of a domain as the same site
- **Real-time Progress**: Live progress updates during scraping
- **Loop Prevention**: Duplicate detection to prevent infinite loops
- **Smart Limits**: Configurable depth limit and early stopping to prevent runaway scraping

## AI Analysis Tab

The application includes an **AI Analysis** tab:

- **Conversational Chat UI**: Ask questions about your scraped websites in a chat interface similar to ChatGPT
- **Quick Actions**: One-click questions about structure, SEO, content themes, and performance
- **Markdown Responses**: AI replies are rendered as Markdown for readability
- **Context Awareness**: The AI uses your scraped data to tailor its answers
- **Requirements**: Internet connection and the `openai` Python package (see Installation)
- **Fallback**: If `openai` is not installed, a placeholder response is shown

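The fallback behaviour listed above could be wired up roughly as follows. This is a minimal sketch, not the application's actual code: `ask_model` is a hypothetical callable standing in for whatever sends the prompt to the AI service, and the placeholder text is invented for illustration.

```python
import importlib.util

# Illustrative wording; the application's real placeholder text may differ.
PLACEHOLDER = "AI analysis is unavailable because the `openai` package is not installed."


def openai_installed() -> bool:
    """Return True if the `openai` package can be imported."""
    return importlib.util.find_spec("openai") is not None


def answer(question, scraped_summary, ask_model=None, available=None):
    """Return an AI reply, or a placeholder when the AI backend is unavailable.

    `ask_model` is a hypothetical callable (prompt text -> reply text).
    `available` overrides the package check, which is handy for testing.
    """
    if available is None:
        available = openai_installed()
    if not available or ask_model is None:
        return PLACEHOLDER
    # Give the model the scraped data as context, then the user's question.
    prompt = f"Scraped data:\n{scraped_summary}\n\nQuestion: {question}"
    return ask_model(prompt)
```

Passing the model call in as a parameter keeps the chat UI testable without network access.
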
## Loop Prevention & Duplicate Detection

The scraper includes robust protection against infinite loops and circular references:

### 🔄 URL Normalization
- Removes `www.` prefixes for consistent domain handling
- Strips URL fragments (`#section`) so the same page is not scraped twice
- Removes trailing slashes for consistency
- Normalizes query parameters

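The normalization rules above can be sketched with the standard library's `urllib.parse`. This is an illustration under stated assumptions rather than the project's own function: the name `normalize_url` is hypothetical, and "normalizing" query parameters is assumed here to mean sorting them.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url: str) -> str:
    """Normalize a URL so that equivalent addresses compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # Treat www and non-www as the same site.
    path = parts.path.rstrip("/")  # Drop trailing slashes.
    # Assumption: query parameters are normalized by sorting them.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # An empty fragment drops any "#section" suffix.
    return urlunsplit((parts.scheme, host, path, query, ""))
```

For example, `normalize_url("https://www.Example.com/docs/?b=2&a=1#intro")` returns `"https://example.com/docs?a=1&b=2"`.
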
### 🚫 Duplicate Detection
- **Visited URL Tracking**: Maintains a set of all visited URLs
- **Unlimited Crawling**: No limit on pages per domain or on total pages
- **Per-Page Duplicate Filtering**: Removes duplicate links within the same page

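A visited set handles both cross-page tracking and per-page duplicate filtering. The sketch below is one plausible shape for it; `new_links` is a hypothetical helper, and it assumes links have already been normalized and are marked as visited when they are queued.

```python
def new_links(page_links, visited):
    """Return the links from one page that have not been seen before, in order.

    `visited` is updated in place so that later pages skip the same URLs.
    """
    fresh = []
    for link in page_links:
        if link in visited:
            continue  # Already seen on this page or an earlier one.
        visited.add(link)
        fresh.append(link)
    return fresh
```

Because membership tests on a set take constant time on average, the check stays cheap even for large crawls.
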
### 🛡️ Smart Restrictions
- **Depth Control**: Crawls as deep as the configured `max_depth` allows, with no additional depth cap
- **Content Type Filtering**: Only scrapes HTML content
- **File Type Filtering**: Skips non-content files (PDFs, images, etc.)
- **Consecutive Empty Level Detection**: Stops if 3 consecutive levels yield no new content

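The three filters above could look something like this. All names are hypothetical, the extension list is only an example, and the real scraper may decide these cases differently.

```python
from urllib.parse import urlsplit

# Assumption: an illustrative list of extensions treated as non-content files.
SKIPPED_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip")


def is_content_url(url):
    """Return False for URLs whose path ends in a skipped file extension."""
    return not urlsplit(url).path.lower().endswith(SKIPPED_EXTENSIONS)


def is_html_response(content_type):
    """Return True when a Content-Type header value denotes HTML."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def should_stop(new_pages_per_level, limit=3):
    """Return True once the last `limit` crawl levels each found no new pages."""
    recent = new_pages_per_level[-limit:]
    return len(recent) == limit and all(count == 0 for count in recent)
```

Here `new_pages_per_level` is a list with one entry per crawl level so far, so `should_stop([5, 0, 0, 0])` is true while `should_stop([5, 0, 0])` is not.
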
### 📊 Enhanced Tracking
- **Domain Page Counts**: Tracks pages scraped per domain (for statistics)
- **URL Check Counts**: Shows total URLs checked vs. pages scraped
- **Detailed Statistics**: Reports on scraping efficiency
- **Unlimited Processing**: No page-count caps; crawl scope is bounded only by the depth setting and the stop conditions above

## Installation

1. **Clone or download the project files**

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
   - This will install all required packages, including `PyQt5`, `PyQtWebEngine` (for visual preview), and `openai` (for AI features).

3. **Run the application**:
   ```bash
   python web_scraper_app.py
   ```

## Usage

### 1. Scraping Configuration
- Enter a starting URL (with or without `http://` or `https://`)
- Set the maximum crawl depth (1-100)
- Click "Start Scraping" to begin

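Accepting URLs without a scheme means one has to be filled in before the first request. A minimal sketch follows; `ensure_scheme` is a hypothetical helper, and defaulting to `https` is an assumption.

```python
def ensure_scheme(url, default="https"):
    """Prefix `url` with a scheme when the user left it out."""
    url = url.strip()
    if url.lower().startswith(("http://", "https://")):
        return url
    # Assumption: bare hosts such as "example.com" default to HTTPS.
    return f"{default}://{url}"
```

With this, `ensure_scheme("example.com")` gives `"https://example.com"`, while a URL that already has a scheme is returned unchanged.
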
### 2. Data View & Filtering
- View scraped data in an interactive table
- Filter by search terms or specific domains
- Double-click any row to preview content
- Export data to JSON format

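Filtering and JSON export could be sketched as below. The `Page` record and its field names are assumptions made for illustration; the application's own Website objects may carry different metadata.

```python
import json
from dataclasses import dataclass, asdict
from urllib.parse import urlsplit


@dataclass
class Page:
    """Hypothetical record for one scraped page; field names are assumptions."""
    url: str
    title: str
    word_count: int
    load_time: float
    link_count: int


def filter_pages(pages, search="", domain=""):
    """Keep pages matching a search term (in URL or title) and an optional domain."""
    search = search.lower()
    result = []
    for page in pages:
        if domain and urlsplit(page.url).netloc != domain:
            continue
        # An empty search string matches every page.
        if search in page.url.lower() or search in page.title.lower():
            result.append(page)
    return result


def export_json(pages):
    """Serialize pages, with all their fields, to a JSON string."""
    return json.dumps([asdict(page) for page in pages], indent=2)
```

Calling `export_json(filter_pages(pages, domain="example.com"))` would export only the rows currently shown for that domain.
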
### 3. Analysis & Statistics
- View overall scraping statistics
- See domain breakdown and word counts
- Preview content in both text and visual formats
- Analyze load times and link counts
- Compare URLs checked with pages scraped to gauge duplicate detection efficiency

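The statistics listed above reduce to a few aggregations. The following is a rough sketch, assuming pages are available as `(url, word_count)` tuples; `summarize` and the result keys are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize(pages, urls_checked):
    """Compute simple crawl statistics.

    `pages` is a list of (url, word_count) tuples; `urls_checked` is the
    total number of URLs examined, including duplicates that were skipped.
    """
    domains = Counter(urlsplit(url).netloc for url, _ in pages)
    return {
        "total_pages": len(pages),
        "urls_checked": urls_checked,
        "total_words": sum(words for _, words in pages),
        "pages_per_domain": dict(domains),
        # Share of checked URLs that became scraped pages.
        "efficiency": len(pages) / urls_checked if urls_checked else 0.0,
    }
```

A low efficiency value suggests that most discovered links were duplicates or filtered out.
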
### 4. AI Analysis (New!)
- Switch to the **AI Analysis** tab
- Type your question or use quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
- The AI will analyze your scraped data and provide actionable insights
- Requires an internet connection and the `openai` package

## Visual Preview Feature

The application can render scraped web pages in a browser-like view:

- **Requirements**: PyQtWebEngine (installed automatically via `requirements.txt`)
- **Functionality**: Displays HTML content with proper styling and formatting
- **Fallback**: If PyQtWebEngine is not available, shows a text-only preview
- **Error Handling**: Shows an error message when HTML content is invalid

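A common way to implement this kind of fallback is an optional import. The sketch below shows the pattern only; `preview_mode` is a hypothetical helper, and how the application actually switches widgets is not specified here.

```python
try:
    # QWebEngineView is provided by the separate PyQtWebEngine package.
    from PyQt5.QtWebEngineWidgets import QWebEngineView
except ImportError:
    QWebEngineView = None  # Visual preview is unavailable.


def preview_mode():
    """Return "visual" when the web engine imported, otherwise "text"."""
    return "visual" if QWebEngineView is not None else "text"
```

The GUI can then build either a web view or a plain text widget depending on the returned mode.
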
## Technical Details

- **Backend**: Pure Python with `urllib` and `html.parser` (no compilation required)
- **Frontend**: PyQt5 with custom modern styling
- **Threading**: Multi-threaded scraping for better performance
- **Data Storage**: Website objects with full metadata
- **URL Handling**: Normalization and domain filtering
- **Loop Prevention**: Multi-layered duplicate detection system
- **AI Integration**: Uses the `openai` Python package (via OpenRouter) for chat-based analysis

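Link extraction with the standard library's `html.parser` might look like this. It is a minimal sketch rather than the code in `module.py`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned links would still need normalization and duplicate filtering before being queued.
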
## File Structure

```
Testing/
├── web_scraper_app.py    # Main application (with AI and GUI)
├── module.py             # Core scraping logic
├── test.py               # Basic functionality tests
├── requirements.txt      # Dependencies
└── README.md             # This file
```

## Troubleshooting

### Visual Preview Not Working
1. Ensure PyQtWebEngine is installed: `pip install PyQtWebEngine`
2. Check console output for import errors

### AI Analysis Not Working
1. Ensure the `openai` package is installed: `pip install openai`
2. Check your internet connection (AI requires online access)
3. If the package is not installed, the AI tab shows a placeholder response

### Scraping Issues
1. Verify internet connection
2. Check URL format (add https:// if needed)
3. Try with a lower depth setting
4. Check console for error messages

### Loop Prevention
1. The scraper automatically prevents infinite loops
2. Check the analysis tab for detailed statistics
3. Compare "Total URLs Checked" with "Total Pages" to gauge efficiency
4. Use lower depth settings for sites with many internal links

### Performance
- Use lower depth settings for faster scraping
- Filter data to focus on specific domains
- Close other applications to free up resources
- Monitor domain page counts to catch crawls that grow larger than intended

## License

This project is open source and available under the MIT License.