
Web Scraper & Data Analyzer

A modern Python application with a sleek PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. Features a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for advanced analysis.

Features

  • Modern UI: Clean, minimalistic design with dark theme and smooth animations
  • Web Scraping: Multi-threaded scraping with configurable depth (max 100 levels)
  • Data Visualization: Interactive table with sorting and filtering capabilities
  • Content Preview: Dual preview system with both text and visual HTML rendering
  • Data Analysis: Comprehensive statistics and domain breakdown
  • AI-Powered Analysis: Chat-based assistant for website insights, SEO suggestions, and content analysis
  • Export Functionality: JSON export with full metadata
  • URL Normalization: Handles www/non-www domains intelligently
  • Real-time Progress: Live progress updates during scraping operations
  • Loop Prevention: Advanced duplicate detection to prevent infinite loops
  • Smart Limits: Configurable limits to prevent runaway scraping

AI Analysis Tab

The application features an advanced AI Analysis tab:

  • Conversational Chat UI: Ask questions about your scraped websites in a modern chat interface (like ChatGPT)
  • Quick Actions: One-click questions for structure, SEO, content themes, and performance
  • Markdown Responses: AI replies are formatted for clarity and readability
  • Context Awareness: AI uses your scraped data for tailored insights
  • Requirements: Internet connection and the openai Python package (see Installation)
  • Fallback: If openai is not installed, a placeholder response is shown
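The fallback behaviour above can be sketched as follows. This is an illustrative pattern, not the application's actual code; `client` stands in for an initialised OpenAI client, and `None` models the package being absent:

```python
def ai_reply(question: str, client=None) -> str:
    """Return an AI answer, or a placeholder when openai is unavailable.

    'client' is a hypothetical wrapper around an initialised openai
    client; None models the package not being installed.
    """
    if client is None:
        # Placeholder response shown when the openai package is missing
        return "AI analysis unavailable: install the 'openai' package."
    # With a real client, the scraped-data context would be sent along
    # with the question here (exact call details are assumptions).
    return client.ask(question)
```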

Loop Prevention & Duplicate Detection

The scraper includes robust protection against infinite loops and circular references:

🔄 URL Normalization

  • Removes www. prefixes for consistent domain handling
  • Strips URL fragments (#section) to prevent duplicate content
  • Removes trailing slashes for consistency
  • Normalizes query parameters
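The four normalization steps above can be sketched with the standard library's `urllib.parse`. This is a minimal illustration of the described behaviour, not the scraper's actual implementation:

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

def normalize_url(url: str) -> str:
    """Normalize a URL per the rules above (illustrative sketch)."""
    parsed = urlparse(url)
    # Drop the www. prefix for consistent domain handling
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Remove trailing slashes for consistency
    path = parsed.path.rstrip("/")
    # Sort query parameters so equivalent URLs compare equal
    query = urlencode(sorted(parse_qsl(parsed.query)))
    # Rebuild without the fragment (#section)
    return urlunparse((parsed.scheme, host, path, "", query, ""))
```

With this, `https://www.example.com/page/#top` and `https://example.com/page` normalize to the same string, so they are counted as one URL.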

🚫 Duplicate Detection

  • Visited URL Tracking: Maintains a set of all visited URLs
  • Unlimited Crawling: No per-domain or overall page limits
  • Per-Page Duplicate Filtering: Removes duplicate links within the same page
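The two layers of duplicate filtering can be combined in a few lines. A hedged sketch (function and variable names are illustrative), assuming links have already been normalized:

```python
def dedupe_links(links, visited):
    """Drop links already in the global visited set, plus
    duplicates within the same page (illustrative sketch)."""
    seen_on_page = set()
    fresh = []
    for link in links:
        if link in visited or link in seen_on_page:
            continue
        seen_on_page.add(link)
        fresh.append(link)
    return fresh
```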

🛡️ Smart Restrictions

  • Depth Bound Only: Crawling is limited only by the specified max_depth
  • Content Type Filtering: Only scrapes HTML content
  • File Type Filtering: Skips non-content files (PDFs, images, etc.)
  • Consecutive Empty Level Detection: Stops if 3 consecutive levels have no new content
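The consecutive-empty-level stop rule can be sketched as a simple streak counter over the per-depth results. Names here are illustrative; `levels` yields the number of new pages found at each crawl depth:

```python
def crawl_levels(levels, max_empty=3):
    """Stop crawling after max_empty consecutive levels yield no
    new content; return total pages scraped (illustrative sketch)."""
    empty_streak = 0
    scraped = 0
    for new_pages in levels:
        if new_pages == 0:
            empty_streak += 1
            if empty_streak >= max_empty:
                break  # 3 empty levels in a row: stop early
        else:
            empty_streak = 0  # any new content resets the streak
            scraped += new_pages
    return scraped
```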

📊 Enhanced Tracking

  • Domain Page Counts: Tracks pages scraped per domain (for statistics)
  • URL Check Counts: Shows total URLs checked vs. pages scraped
  • Detailed Statistics: Comprehensive reporting on scraping efficiency
  • Unlimited Processing: No artificial limits on crawling scope
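The tracking metrics above boil down to a few counters. A minimal sketch (class and field names are assumptions, not the app's actual API):

```python
from collections import Counter

class ScrapeStats:
    """Illustrative counters for the metrics listed above."""
    def __init__(self):
        self.domain_pages = Counter()  # pages scraped per domain
        self.urls_checked = 0          # every URL considered
        self.pages_scraped = 0         # URLs that yielded content

    def record_check(self):
        self.urls_checked += 1

    def record_page(self, domain):
        self.pages_scraped += 1
        self.domain_pages[domain] += 1

    def efficiency(self):
        # Fraction of checked URLs that produced a scraped page
        return self.pages_scraped / self.urls_checked if self.urls_checked else 0.0
```

Comparing `pages_scraped` against `urls_checked` is what the analysis tab's "Total URLs Checked" vs. "Total Pages" figures express.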

Installation

  1. Clone or download the project files

  2. Install dependencies:

    pip install -r requirements.txt
    
    • This will install all required packages, including PyQt5, PyQtWebEngine (for visual preview), and openai (for AI features).
  3. Run the application:

    python web_scraper_app.py
    

Usage

1. Scraping Configuration

  • Enter a starting URL (with or without http/https)
  • Set maximum crawl depth (1-100)
  • Click "Start Scraping" to begin

2. Data View & Filtering

  • View scraped data in an interactive table
  • Filter by search terms or specific domains
  • Double-click any row to preview content
  • Export data to JSON format
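The JSON export described above can be sketched with the standard `json` module. Field names here are illustrative examples of "full metadata", not the app's exact schema:

```python
import json

def export_to_json(pages, path):
    """Write scraped pages with metadata to a JSON file (sketch)."""
    records = [
        {
            "url": p["url"],
            "title": p["title"],
            "word_count": p["word_count"],
            "load_time": p["load_time"],
        }
        for p in pages
    ]
    with open(path, "w", encoding="utf-8") as f:
        # indent for readability; ensure_ascii=False keeps Unicode intact
        json.dump(records, f, indent=2, ensure_ascii=False)
```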

3. Analysis & Statistics

  • View comprehensive scraping statistics
  • See domain breakdown and word counts
  • Preview content in both text and visual formats
  • Analyze load times and link counts
  • Monitor duplicate detection efficiency

4. AI Analysis (New!)

  • Switch to the AI Analysis tab
  • Type your question or use quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
  • The AI will analyze your scraped data and provide actionable insights
  • Requires an internet connection and the openai package

Visual Preview Feature

The application includes a visual HTML preview feature that renders scraped web pages in a browser-like view:

  • Requirements: PyQtWebEngine (automatically installed with requirements.txt)
  • Functionality: Displays HTML content with proper styling and formatting
  • Fallback: If PyQtWebEngine is not available, shows a text-only preview
  • Error Handling: Graceful error messages for invalid HTML content
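The graceful fallback described above follows the standard optional-import pattern. A hedged sketch (the helper function is illustrative, not the app's actual API):

```python
# Try the optional visual-preview dependency; fall back to text-only.
try:
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    WEB_PREVIEW = True
except ImportError:
    QWebEngineView = None
    WEB_PREVIEW = False

def preview_mode() -> str:
    """Report which preview mode is available (illustrative helper)."""
    return "visual" if WEB_PREVIEW else "text-only"
```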

Technical Details

  • Backend: Pure Python with urllib and html.parser (no compilation required)
  • Frontend: PyQt5 with custom modern styling
  • Threading: Multi-threaded scraping for better performance
  • Data Storage: Website objects with full metadata
  • URL Handling: Intelligent normalization and domain filtering
  • Loop Prevention: Multi-layered duplicate detection system
  • AI Integration: Uses the OpenAI API (via OpenRouter) for chat-based analysis

File Structure

Testing/
├── web_scraper_app.py      # Main application (with AI and GUI)
├── module.py               # Core scraping logic
├── test.py                 # Basic functionality tests
├── requirements.txt        # Dependencies
└── README.md               # This file

Troubleshooting

Visual Preview Not Working

  1. Ensure PyQtWebEngine is installed: pip install PyQtWebEngine
  2. Check console output for import errors

AI Analysis Not Working

  1. Ensure the openai package is installed: pip install openai
  2. Check your internet connection (AI requires online access)
  3. If not installed, the AI tab will show a placeholder response

Scraping Issues

  1. Verify internet connection
  2. Check URL format (add https:// if needed)
  3. Try with a lower depth setting
  4. Check console for error messages

Loop Prevention

  1. The scraper automatically prevents infinite loops
  2. Check the analysis tab for detailed statistics
  3. Monitor "Total URLs Checked" vs "Total Pages" for efficiency
  4. Use lower depth settings for sites with many internal links

Performance

  • Use lower depth settings for faster scraping
  • Filter data to focus on specific domains
  • Close other applications to free up resources
  • Monitor domain page counts in the Analysis tab to spot unexpectedly large sites early

License

This project is open source and available under the MIT License.