# Web Scraper & Data Analyzer

A modern Python application with a sleek PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. Features a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for advanced analysis.
## Features

- **Modern UI**: Clean, minimalistic design with a dark theme and smooth animations
- **Web Scraping**: Multi-threaded scraping with configurable depth (max 100 levels)
- **Data Visualization**: Interactive table with sorting and filtering capabilities
- **Content Preview**: Dual preview system with both text and visual HTML rendering
- **Data Analysis**: Comprehensive statistics and domain breakdown
- **AI-Powered Analysis**: Chat-based assistant for website insights, SEO suggestions, and content analysis
- **Export Functionality**: JSON export with full metadata
- **URL Normalization**: Handles www/non-www domains intelligently
- **Real-time Progress**: Live progress updates during scraping operations
- **Loop Prevention**: Advanced duplicate detection to prevent infinite loops
- **Smart Limits**: Configurable limits to prevent runaway scraping
## AI Analysis Tab

The application features an advanced **AI Analysis** tab:

- **Conversational Chat UI**: Ask questions about your scraped websites in a modern chat interface (like ChatGPT)
- **Quick Actions**: One-click questions for structure, SEO, content themes, and performance
- **Markdown Responses**: AI replies are formatted for clarity and readability
- **Context Awareness**: The AI uses your scraped data for tailored insights (a call sketch follows this list)
- **Requirements**: Internet connection and the `openai` Python package (see Installation)
- **Fallback**: If `openai` is not installed, a placeholder response is shown
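Under the hood, the chat tab talks to an OpenAI-compatible endpoint (see Technical Details). A minimal sketch of such a call, assuming the `openai` package (v1+); the model ID, prompt, and helper name here are hypothetical, not the app's actual wiring:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key="YOUR_OPENROUTER_API_KEY",        # placeholder, not a real key
)

def ask_about_site(question: str, scraped_summary: str) -> str:
    """Send a user question plus a summary of the scraped data to the model."""
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",  # any OpenRouter model ID would do
        messages=[
            {"role": "system",
             "content": "You analyze scraped website data and answer in Markdown."},
            {"role": "user",
             "content": f"Scraped data summary:\n{scraped_summary}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```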
## Loop Prevention & Duplicate Detection

The scraper includes robust protection against infinite loops and circular references:
### 🔄 URL Normalization

- Removes `www.` prefixes for consistent domain handling
- Strips URL fragments (`#section`) to prevent duplicate content
- Removes trailing slashes for consistency
- Normalizes query parameters (a sketch of these rules follows this list)
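These rules can be implemented entirely with the standard library. A minimal sketch, assuming the function name `normalize_url` is illustrative rather than the app's actual API:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Illustrative normalizer: lowercases the host, strips www., drops
    the fragment and trailing slash, and sorts query parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable parameter order
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))  # no fragment

# normalize_url("https://WWW.Example.com/docs/?b=2&a=1#intro")
# -> "https://example.com/docs?a=1&b=2"
```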
### 🚫 Duplicate Detection

- **Visited URL Tracking**: Maintains a set of all visited URLs (a crawl sketch follows this list)
- **Unlimited Crawling**: No page limits per domain or total pages
- **Per-Page Duplicate Filtering**: Removes duplicate links within the same page
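A minimal sketch of visited-URL tracking during a breadth-first crawl; `normalize_url` is the sketch above and `extract_links` is an assumed helper (a stdlib version is sketched under Technical Details):

```python
from collections import deque

def crawl(start_url: str, max_depth: int) -> set[str]:
    """Illustrative breadth-first crawl with a visited set; this stands in
    for the app's own logic rather than reproducing it."""
    visited = {normalize_url(start_url)}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for link in extract_links(url):      # assumed helper
            key = normalize_url(link)
            if key not in visited:           # the duplicate check
                visited.add(key)
                queue.append((link, depth + 1))
    return visited
```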
### 🛡️ Smart Restrictions

- **No Depth Limits**: Crawls as deep as the specified `max_depth` allows
- **Content Type Filtering**: Only scrapes HTML content
- **File Type Filtering**: Skips non-content files (PDFs, images, etc.; a filter sketch follows this list)
- **Consecutive Empty Level Detection**: Stops if 3 consecutive levels have no new content
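The file type filter can be as simple as an extension check. A sketch, assuming an illustrative (not exhaustive) skip list:

```python
import os
from urllib.parse import urlsplit

# Illustrative skip list; the app's real list may differ.
SKIP_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp4", ".css", ".js"}

def looks_like_content(url: str) -> bool:
    """Skip URLs whose path ends in a known non-HTML extension."""
    path = urlsplit(url).path.lower()
    return os.path.splitext(path)[1] not in SKIP_EXTENSIONS
```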
### 📊 Enhanced Tracking

- **Domain Page Counts**: Tracks pages scraped per domain (for statistics)
- **URL Check Counts**: Shows total URLs checked vs. pages scraped
- **Detailed Statistics**: Comprehensive reporting on scraping efficiency
- **Unlimited Processing**: No artificial limits on crawling scope
## Installation

1. **Clone or download the project files**

2. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

   This installs all required packages, including `PyQt5`, `PyQtWebEngine` (for the visual preview), and `openai` (for the AI features). A plausible `requirements.txt` is sketched after these steps.

3. **Run the application**:

   ```bash
   python web_scraper_app.py
   ```
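For reference, a plausible `requirements.txt` covering the packages named above; this is an assumption, and the project's actual file may pin versions or list more packages:

```text
PyQt5
PyQtWebEngine
openai
```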
## Usage

### 1. Scraping Configuration

- Enter a starting URL (with or without `http/https`; a scheme helper is sketched after this list)
- Set maximum crawl depth (1-100)
- Click "Start Scraping" to begin
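A minimal sketch of the scheme handling mentioned in the first step; the helper name is hypothetical:

```python
def ensure_scheme(url: str) -> str:
    """Prepend https:// when the user omits a scheme; illustrative only."""
    return url if url.startswith(("http://", "https://")) else "https://" + url
```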
### 2. Data View & Filtering

- View scraped data in an interactive table
- Filter by search terms or specific domains
- Double-click any row to preview content
- Export data to JSON format (a sketch follows this list)
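A minimal sketch of what the JSON export could look like, assuming each scraped page is stored as a dict of metadata (the app's actual record layout may differ):

```python
import json

def export_to_json(pages: list[dict], path: str) -> None:
    """Write scraped page records (url, title, word count, ...) to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(pages, fh, indent=2, ensure_ascii=False)
```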
### 3. Analysis & Statistics

- View comprehensive scraping statistics
- See domain breakdown and word counts (a tally sketch follows this list)
- Preview content in both text and visual formats
- Analyze load times and link counts
- Monitor duplicate detection efficiency
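A sketch of how the domain breakdown could be tallied with the standard library; the function name is illustrative, not the app's API:

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_breakdown(urls: list[str]) -> Counter:
    """Illustrative per-domain page tally for the statistics view."""
    return Counter(urlsplit(u).netloc.lower().removeprefix("www.") for u in urls)

# domain_breakdown(["https://example.com/a", "https://www.example.com/b"])
# -> Counter({'example.com': 2})
```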
### 4. AI Analysis (New!)

- Switch to the **AI Analysis** tab
- Type your question or use quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
- The AI will analyze your scraped data and provide actionable insights
- Requires an internet connection and the `openai` package
## Visual Preview Feature

The application includes a visual HTML preview feature that renders scraped web pages in a browser-like view:

- **Requirements**: PyQtWebEngine (automatically installed with `requirements.txt`)
- **Functionality**: Displays HTML content with proper styling and formatting
- **Fallback**: If PyQtWebEngine is not available, a text-only preview is shown (the import guard is sketched below)
- **Error Handling**: Graceful error messages for invalid HTML content
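The fallback typically hinges on an import guard. A minimal sketch, with a hypothetical flag name:

```python
# Illustrative import-time guard; the app's actual flag name may differ.
try:
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    WEB_ENGINE_AVAILABLE = True
except ImportError:
    QWebEngineView = None
    WEB_ENGINE_AVAILABLE = False  # fall back to the text-only preview
```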
## Technical Details

- **Backend**: Pure Python with `urllib` and `html.parser` (no compilation required; a link-extractor sketch follows this list)
- **Frontend**: PyQt5 with custom modern styling
- **Threading**: Multi-threaded scraping for better performance
- **Data Storage**: Website objects with full metadata
- **URL Handling**: Intelligent normalization and domain filtering
- **Loop Prevention**: Multi-layered duplicate detection system
- **AI Integration**: Uses the OpenAI API (via OpenRouter) for chat-based analysis
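As a taste of the stdlib-only backend, here is a minimal link extractor in the spirit of `urllib` + `html.parser`; the names are illustrative, not the app's actual API, and it pairs with the crawl sketch shown earlier:

```python
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def extract_links(url: str) -> list[str]:
    """Fetch a page with the stdlib and return the links it contains."""
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links
```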
## File Structure

```
Testing/
├── web_scraper_app.py    # Main application (with AI and GUI)
├── module.py             # Core scraping logic
├── test.py               # Basic functionality tests
├── requirements.txt      # Dependencies
└── README.md             # This file
```
## Troubleshooting

### Visual Preview Not Working

1. Ensure PyQtWebEngine is installed: `pip install PyQtWebEngine`
2. Check console output for import errors
### AI Analysis Not Working

1. Ensure the `openai` package is installed: `pip install openai`
2. Check your internet connection (AI analysis requires online access)
3. If the package is not installed, the AI tab shows a placeholder response
### Scraping Issues

1. Verify your internet connection
2. Check the URL format (add `https://` if needed)
3. Try a lower depth setting
4. Check the console for error messages
### Loop Prevention

1. The scraper automatically prevents infinite loops
2. Check the analysis tab for detailed statistics
3. Monitor "Total URLs Checked" vs. "Total Pages" for efficiency
4. Use lower depth settings for sites with many internal links
### Performance

- Use lower depth settings for faster scraping
- Filter data to focus on specific domains
- Close other applications to free up resources
- Monitor domain page counts to avoid hitting limits
## License

This project is open source and available under the MIT License.