# Web Scraper & Data Analyzer

A modern Python application with a sleek PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. Features a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for advanced analysis.
## Features

- **Modern UI**: Clean, minimalistic design with a dark theme and smooth animations
- **Web Scraping**: Multi-threaded scraping with configurable depth (max 100 levels)
- **Data Visualization**: Interactive table with sorting and filtering capabilities
- **Content Preview**: Dual preview system with both text and visual HTML rendering
- **Data Analysis**: Comprehensive statistics and domain breakdown
- **AI-Powered Analysis**: Chat-based assistant for website insights, SEO suggestions, and content analysis
- **Export Functionality**: JSON export with full metadata
- **URL Normalization**: Handles www/non-www domains intelligently
- **Real-time Progress**: Live progress updates during scraping operations
- **Loop Prevention**: Advanced duplicate detection to prevent infinite loops
- **Smart Limits**: Configurable limits to prevent runaway scraping
## AI Analysis Tab

The application features an advanced **AI Analysis** tab:

- **Conversational Chat UI**: Ask questions about your scraped websites in a modern chat interface (like ChatGPT)
- **Quick Actions**: One-click questions for structure, SEO, content themes, and performance
- **Markdown Responses**: AI replies are formatted for clarity and readability
- **Context Awareness**: The AI uses your scraped data for tailored insights (a call sketch follows this list)
- **Requirements**: Internet connection and the `openai` Python package (see Installation)
- **Fallback**: If `openai` is not installed, a placeholder response is shown
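Under the hood, the chat tab talks to an OpenAI-compatible endpoint (see Technical Details). A minimal sketch of such a call, assuming the `openai` package (v1+); the model ID, prompt, and helper name here are hypothetical, not the app's actual wiring:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key="YOUR_OPENROUTER_API_KEY",        # placeholder, not a real key
)

def ask_about_site(question: str, scraped_summary: str) -> str:
    """Send a user question plus a summary of the scraped data to the model."""
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",  # any OpenRouter model ID would do
        messages=[
            {"role": "system",
             "content": "You analyze scraped website data and answer in Markdown."},
            {"role": "user",
             "content": f"Scraped data summary:\n{scraped_summary}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```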
## Loop Prevention & Duplicate Detection

The scraper includes robust protection against infinite loops and circular references:
### 🔄 URL Normalization

- Removes `www.` prefixes for consistent domain handling
- Strips URL fragments (`#section`) to prevent duplicate content
- Removes trailing slashes for consistency
- Normalizes query parameters (a sketch of these rules follows this list)
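These rules can be implemented entirely with the standard library. A minimal sketch, assuming the function name `normalize_url` is illustrative rather than the app's actual API:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Illustrative normalizer: lowercases the host, strips www., drops
    the fragment and trailing slash, and sorts query parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable parameter order
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))  # no fragment

# normalize_url("https://WWW.Example.com/docs/?b=2&a=1#intro")
# -> "https://example.com/docs?a=1&b=2"
```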
### 🚫 Duplicate Detection

- **Visited URL Tracking**: Maintains a set of all visited URLs (a crawl sketch follows this list)
- **Unlimited Crawling**: No page limits per domain or total pages
- **Per-Page Duplicate Filtering**: Removes duplicate links within the same page
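A minimal sketch of visited-URL tracking during a breadth-first crawl; `normalize_url` is the sketch above and `extract_links` is an assumed helper (a stdlib version is sketched under Technical Details):

```python
from collections import deque

def crawl(start_url: str, max_depth: int) -> set[str]:
    """Illustrative breadth-first crawl with a visited set; this stands in
    for the app's own logic rather than reproducing it."""
    visited = {normalize_url(start_url)}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for link in extract_links(url):      # assumed helper
            key = normalize_url(link)
            if key not in visited:           # the duplicate check
                visited.add(key)
                queue.append((link, depth + 1))
    return visited
```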
### 🛡️ Smart Restrictions

- **No Depth Limits**: Crawls as deep as the specified `max_depth` allows
- **Content Type Filtering**: Only scrapes HTML content
- **File Type Filtering**: Skips non-content files (PDFs, images, etc.; a filter sketch follows this list)
- **Consecutive Empty Level Detection**: Stops if 3 consecutive levels have no new content
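The file type filter can be as simple as an extension check. A sketch, assuming an illustrative (not exhaustive) skip list:

```python
import os
from urllib.parse import urlsplit

# Illustrative skip list; the app's real list may differ.
SKIP_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp4", ".css", ".js"}

def looks_like_content(url: str) -> bool:
    """Skip URLs whose path ends in a known non-HTML extension."""
    path = urlsplit(url).path.lower()
    return os.path.splitext(path)[1] not in SKIP_EXTENSIONS
```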
### 📊 Enhanced Tracking

- **Domain Page Counts**: Tracks pages scraped per domain (for statistics)
- **URL Check Counts**: Shows total URLs checked vs. pages scraped
- **Detailed Statistics**: Comprehensive reporting on scraping efficiency
- **Unlimited Processing**: No artificial limits on crawling scope
## Installation

1. **Clone or download the project files**

2. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

   This installs all required packages, including `PyQt5`, `PyQtWebEngine` (for the visual preview), and `openai` (for the AI features). A plausible `requirements.txt` is sketched after these steps.

3. **Run the application**:

   ```bash
   python web_scraper_app.py
   ```
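For reference, a plausible `requirements.txt` covering the packages named above; this is an assumption, and the project's actual file may pin versions or list more packages:

```text
PyQt5
PyQtWebEngine
openai
```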
## Usage

### 1. Scraping Configuration

- Enter a starting URL (with or without `http/https`; a scheme helper is sketched after this list)
- Set maximum crawl depth (1-100)
- Click "Start Scraping" to begin
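A minimal sketch of the scheme handling mentioned in the first step; the helper name is hypothetical:

```python
def ensure_scheme(url: str) -> str:
    """Prepend https:// when the user omits a scheme; illustrative only."""
    return url if url.startswith(("http://", "https://")) else "https://" + url
```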
### 2. Data View & Filtering

- View scraped data in an interactive table
- Filter by search terms or specific domains
- Double-click any row to preview content
- Export data to JSON format (a sketch follows this list)
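A minimal sketch of what the JSON export could look like, assuming each scraped page is stored as a dict of metadata (the app's actual record layout may differ):

```python
import json

def export_to_json(pages: list[dict], path: str) -> None:
    """Write scraped page records (url, title, word count, ...) to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(pages, fh, indent=2, ensure_ascii=False)
```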
### 3. Analysis & Statistics

- View comprehensive scraping statistics
- See domain breakdown and word counts (a tally sketch follows this list)
- Preview content in both text and visual formats
- Analyze load times and link counts
- Monitor duplicate detection efficiency
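A sketch of how the domain breakdown could be tallied with the standard library; the function name is illustrative, not the app's API:

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_breakdown(urls: list[str]) -> Counter:
    """Illustrative per-domain page tally for the statistics view."""
    return Counter(urlsplit(u).netloc.lower().removeprefix("www.") for u in urls)

# domain_breakdown(["https://example.com/a", "https://www.example.com/b"])
# -> Counter({'example.com': 2})
```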
### 4. AI Analysis (New!)

- Switch to the **AI Analysis** tab
- Type your question or use quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
- The AI will analyze your scraped data and provide actionable insights
- Requires an internet connection and the `openai` package
## Visual Preview Feature

The application includes a visual HTML preview feature that renders scraped web pages in a browser-like view:

- **Requirements**: PyQtWebEngine (automatically installed with `requirements.txt`)
- **Functionality**: Displays HTML content with proper styling and formatting
- **Fallback**: If PyQtWebEngine is not available, a text-only preview is shown (the import guard is sketched below)
- **Error Handling**: Graceful error messages for invalid HTML content
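The fallback typically hinges on an import guard. A minimal sketch, with a hypothetical flag name:

```python
# Illustrative import-time guard; the app's actual flag name may differ.
try:
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    WEB_ENGINE_AVAILABLE = True
except ImportError:
    QWebEngineView = None
    WEB_ENGINE_AVAILABLE = False  # fall back to the text-only preview
```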
## Technical Details

- **Backend**: Pure Python with `urllib` and `html.parser` (no compilation required; a link-extractor sketch follows this list)
- **Frontend**: PyQt5 with custom modern styling
- **Threading**: Multi-threaded scraping for better performance
- **Data Storage**: Website objects with full metadata
- **URL Handling**: Intelligent normalization and domain filtering
- **Loop Prevention**: Multi-layered duplicate detection system
- **AI Integration**: Uses the OpenAI API (via OpenRouter) for chat-based analysis
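As a taste of the stdlib-only backend, here is a minimal link extractor in the spirit of `urllib` + `html.parser`; the names are illustrative, not the app's actual API, and it pairs with the crawl sketch shown earlier:

```python
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def extract_links(url: str) -> list[str]:
    """Fetch a page with the stdlib and return the links it contains."""
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links
```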
## File Structure

```
Testing/
├── web_scraper_app.py    # Main application (with AI and GUI)
├── module.py             # Core scraping logic
├── test.py               # Basic functionality tests
├── requirements.txt      # Dependencies
└── README.md             # This file
```
## Troubleshooting

### Visual Preview Not Working

1. Ensure PyQtWebEngine is installed: `pip install PyQtWebEngine`
2. Check console output for import errors
### AI Analysis Not Working

1. Ensure the `openai` package is installed: `pip install openai`
2. Check your internet connection (AI analysis requires online access)
3. If the package is not installed, the AI tab shows a placeholder response
### Scraping Issues

1. Verify your internet connection
2. Check the URL format (add `https://` if needed)
3. Try a lower depth setting
4. Check the console for error messages
### Loop Prevention

1. The scraper automatically prevents infinite loops
2. Check the analysis tab for detailed statistics
3. Monitor "Total URLs Checked" vs. "Total Pages" for efficiency
4. Use lower depth settings for sites with many internal links
### Performance

- Use lower depth settings for faster scraping
- Filter data to focus on specific domains
- Close other applications to free up resources
- Monitor domain page counts to avoid hitting limits
## License

This project is open source and available under the MIT License.