
Web Scraper & Data Analyzer

A modern Python application with a sleek PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. Features a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for advanced analysis.

Features

  • Modern UI: Clean, minimalistic design with dark theme and smooth animations
  • Web Scraping: Multi-threaded scraping with configurable depth (max 100 levels)
  • Data Visualization: Interactive table with sorting and filtering capabilities
  • Content Preview: Dual preview system with both text and visual HTML rendering
  • Data Analysis: Comprehensive statistics and domain breakdown
  • AI-Powered Analysis: Chat-based assistant for website insights, SEO suggestions, and content analysis
  • Export Functionality: JSON export with full metadata
  • URL Normalization: Handles www/non-www domains intelligently
  • Real-time Progress: Live progress updates during scraping operations
  • Loop Prevention: Advanced duplicate detection to prevent infinite loops
  • Smart Limits: Configurable limits to prevent runaway scraping

AI Analysis Tab

The application features an advanced AI Analysis tab:

  • Conversational Chat UI: Ask questions about your scraped websites in a modern chat interface (like ChatGPT)
  • Quick Actions: One-click questions for structure, SEO, content themes, and performance
  • Markdown Responses: AI replies are formatted for clarity and readability
  • Context Awareness: AI uses your scraped data for tailored insights
  • Requirements: Internet connection and the openai Python package (see Installation)
  • Fallback: If openai is not installed, a placeholder response is shown
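The fallback behaviour above can be sketched as follows. This is an illustrative pattern, not the application's actual code; `client` stands in for an initialised OpenAI client, and `None` models the package being absent:

```python
def ai_reply(question: str, client=None) -> str:
    """Return an AI answer, or a placeholder when openai is unavailable.

    'client' is a hypothetical wrapper around an initialised openai
    client; None models the package not being installed.
    """
    if client is None:
        # Placeholder response shown when the openai package is missing
        return "AI analysis unavailable: install the 'openai' package."
    # With a real client, the scraped-data context would be sent along
    # with the question here (exact call details are assumptions).
    return client.ask(question)
```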

Loop Prevention & Duplicate Detection

The scraper includes robust protection against infinite loops and circular references:

🔄 URL Normalization

  • Removes www. prefixes for consistent domain handling
  • Strips URL fragments (#section) to prevent duplicate content
  • Removes trailing slashes for consistency
  • Normalizes query parameters
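The four normalization steps above can be sketched with the standard library's `urllib.parse`. This is a minimal illustration of the described behaviour, not the scraper's actual implementation:

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

def normalize_url(url: str) -> str:
    """Normalize a URL per the rules above (illustrative sketch)."""
    parsed = urlparse(url)
    # Drop the www. prefix for consistent domain handling
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Remove trailing slashes for consistency
    path = parsed.path.rstrip("/")
    # Sort query parameters so equivalent URLs compare equal
    query = urlencode(sorted(parse_qsl(parsed.query)))
    # Rebuild without the fragment (#section)
    return urlunparse((parsed.scheme, host, path, "", query, ""))
```

With this, `https://www.example.com/page/#top` and `https://example.com/page` normalize to the same string, so they are counted as one URL.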

🚫 Duplicate Detection

  • Visited URL Tracking: Maintains a set of all visited URLs
  • Unlimited Crawling: No per-domain or overall page limits
  • Per-Page Duplicate Filtering: Removes duplicate links within the same page
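The two layers of duplicate filtering can be combined in a few lines. A hedged sketch (function and variable names are illustrative), assuming links have already been normalized:

```python
def dedupe_links(links, visited):
    """Drop links already in the global visited set, plus
    duplicates within the same page (illustrative sketch)."""
    seen_on_page = set()
    fresh = []
    for link in links:
        if link in visited or link in seen_on_page:
            continue
        seen_on_page.add(link)
        fresh.append(link)
    return fresh
```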

🛡️ Smart Restrictions

  • Depth Bound Only: Crawling is limited only by the specified max_depth
  • Content Type Filtering: Only scrapes HTML content
  • File Type Filtering: Skips non-content files (PDFs, images, etc.)
  • Consecutive Empty Level Detection: Stops if 3 consecutive levels have no new content
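The consecutive-empty-level stop rule can be sketched as a simple streak counter over the per-depth results. Names here are illustrative; `levels` yields the number of new pages found at each crawl depth:

```python
def crawl_levels(levels, max_empty=3):
    """Stop crawling after max_empty consecutive levels yield no
    new content; return total pages scraped (illustrative sketch)."""
    empty_streak = 0
    scraped = 0
    for new_pages in levels:
        if new_pages == 0:
            empty_streak += 1
            if empty_streak >= max_empty:
                break  # 3 empty levels in a row: stop early
        else:
            empty_streak = 0  # any new content resets the streak
            scraped += new_pages
    return scraped
```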

📊 Enhanced Tracking

  • Domain Page Counts: Tracks pages scraped per domain (for statistics)
  • URL Check Counts: Shows total URLs checked vs. pages scraped
  • Detailed Statistics: Comprehensive reporting on scraping efficiency
  • Unlimited Processing: No artificial limits on crawling scope
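The tracking metrics above boil down to a few counters. A minimal sketch (class and field names are assumptions, not the app's actual API):

```python
from collections import Counter

class ScrapeStats:
    """Illustrative counters for the metrics listed above."""
    def __init__(self):
        self.domain_pages = Counter()  # pages scraped per domain
        self.urls_checked = 0          # every URL considered
        self.pages_scraped = 0         # URLs that yielded content

    def record_check(self):
        self.urls_checked += 1

    def record_page(self, domain):
        self.pages_scraped += 1
        self.domain_pages[domain] += 1

    def efficiency(self):
        # Fraction of checked URLs that produced a scraped page
        return self.pages_scraped / self.urls_checked if self.urls_checked else 0.0
```

Comparing `pages_scraped` against `urls_checked` is what the analysis tab's "Total URLs Checked" vs. "Total Pages" figures express.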

Installation

  1. Clone or download the project files

  2. Install dependencies:

    pip install -r requirements.txt
    
    • This will install all required packages, including PyQt5, PyQtWebEngine (for visual preview), and openai (for AI features).
  3. Run the application:

    python web_scraper_app.py
    

Usage

1. Scraping Configuration

  • Enter a starting URL (with or without http/https)
  • Set maximum crawl depth (1-100)
  • Click "Start Scraping" to begin

2. Data View & Filtering

  • View scraped data in an interactive table
  • Filter by search terms or specific domains
  • Double-click any row to preview content
  • Export data to JSON format
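The JSON export described above can be sketched with the standard `json` module. Field names here are illustrative examples of "full metadata", not the app's exact schema:

```python
import json

def export_to_json(pages, path):
    """Write scraped pages with metadata to a JSON file (sketch)."""
    records = [
        {
            "url": p["url"],
            "title": p["title"],
            "word_count": p["word_count"],
            "load_time": p["load_time"],
        }
        for p in pages
    ]
    with open(path, "w", encoding="utf-8") as f:
        # indent for readability; ensure_ascii=False keeps Unicode intact
        json.dump(records, f, indent=2, ensure_ascii=False)
```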

3. Analysis & Statistics

  • View comprehensive scraping statistics
  • See domain breakdown and word counts
  • Preview content in both text and visual formats
  • Analyze load times and link counts
  • Monitor duplicate detection efficiency

4. AI Analysis (New!)

  • Switch to the AI Analysis tab
  • Type your question or use quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
  • The AI will analyze your scraped data and provide actionable insights
  • Requires an internet connection and the openai package

Visual Preview Feature

The application includes a visual HTML preview feature that renders scraped web pages in a browser-like view:

  • Requirements: PyQtWebEngine (automatically installed with requirements.txt)
  • Functionality: Displays HTML content with proper styling and formatting
  • Fallback: If PyQtWebEngine is not available, shows a text-only preview
  • Error Handling: Graceful error messages for invalid HTML content
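The graceful fallback described above follows the standard optional-import pattern. A hedged sketch (the helper function is illustrative, not the app's actual API):

```python
# Try the optional visual-preview dependency; fall back to text-only.
try:
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    WEB_PREVIEW = True
except ImportError:
    QWebEngineView = None
    WEB_PREVIEW = False

def preview_mode() -> str:
    """Report which preview mode is available (illustrative helper)."""
    return "visual" if WEB_PREVIEW else "text-only"
```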

Technical Details

  • Backend: Pure Python with urllib and html.parser (no compilation required)
  • Frontend: PyQt5 with custom modern styling
  • Threading: Multi-threaded scraping for better performance
  • Data Storage: Website objects with full metadata
  • URL Handling: Intelligent normalization and domain filtering
  • Loop Prevention: Multi-layered duplicate detection system
  • AI Integration: Uses the OpenAI API (via OpenRouter) for chat-based analysis

File Structure

Testing/
├── web_scraper_app.py      # Main application (with AI and GUI)
├── module.py               # Core scraping logic
├── test.py                 # Basic functionality tests
├── requirements.txt        # Dependencies
└── README.md               # This file

Troubleshooting

Visual Preview Not Working

  1. Ensure PyQtWebEngine is installed: pip install PyQtWebEngine
  2. Check console output for import errors

AI Analysis Not Working

  1. Ensure the openai package is installed: pip install openai
  2. Check your internet connection (AI requires online access)
  3. If not installed, the AI tab will show a placeholder response

Scraping Issues

  1. Verify internet connection
  2. Check URL format (add https:// if needed)
  3. Try with a lower depth setting
  4. Check console for error messages

Loop Prevention

  1. The scraper automatically prevents infinite loops
  2. Check the analysis tab for detailed statistics
  3. Monitor "Total URLs Checked" vs "Total Pages" for efficiency
  4. Use lower depth settings for sites with many internal links

Performance

  • Use lower depth settings for faster scraping
  • Filter data to focus on specific domains
  • Close other applications to free up resources
  • Monitor domain page counts in the Analysis tab to spot unexpectedly large sites early

License

This project is open source and available under the MIT License.