diff --git a/community-contributions/WebScraperApp/README.md b/community-contributions/WebScraperApp/README.md
new file mode 100644
index 0000000..6dfed7b
--- /dev/null
+++ b/community-contributions/WebScraperApp/README.md
@@ -0,0 +1,159 @@
+# Web Scraper & Data Analyzer
+
+A Python application with a modern PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. It combines a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for deeper analysis.
+
+## Features
+
+- **Modern UI**: Clean, minimalistic design with a dark theme and smooth animations
+- **Web Scraping**: Multi-threaded scraping with configurable depth (up to 100 levels)
+- **Data Visualization**: Interactive table with sorting and filtering capabilities
+- **Content Preview**: Dual preview system with both plain-text and rendered HTML views
+- **Data Analysis**: Comprehensive statistics and per-domain breakdown
+- **AI-Powered Analysis**: Chat-based assistant for website insights, SEO suggestions, and content analysis
+- **Export Functionality**: JSON export with full metadata
+- **URL Normalization**: Treats www and non-www forms of a domain as the same site
+- **Real-time Progress**: Live status updates while scraping runs
+- **Loop Prevention**: Duplicate detection that stops circular crawls
+- **Smart Stopping**: Depth caps and empty-level detection keep crawls from running away
+
+## AI Analysis Tab
+
+The application features an advanced **AI Analysis** tab:
+
+- **Conversational Chat UI**: Ask questions about your scraped websites in a modern, ChatGPT-style chat interface
+- **Quick Actions**: One-click questions for structure, SEO, content themes, and performance
+- **Markdown Responses**: AI replies are formatted for clarity and readability
+- **Context Awareness**: The AI uses your scraped data to tailor its insights
+- **Requirements**: An internet connection, the `openai` Python package (see Installation), and an `OPENROUTER_API_KEY` in your environment or a `.env` file
+- **Fallback**: If `openai` is not installed, a placeholder response is shown
+
+## Loop Prevention & Duplicate Detection
+
+The scraper includes robust protection against infinite loops and circular references:
+
+### 🔄 URL Normalization
+- Removes `www.` prefixes for consistent domain handling
+- Strips URL fragments (`#section`) that would otherwise look like distinct pages
+- Removes trailing slashes for consistency
+
+### 🚫 Duplicate Detection
+- **Visited URL Tracking**: Maintains a set of every URL already visited
+- **Unlimited Crawling**: No page caps per domain or in total
+- **Per-Page Duplicate Filtering**: Removes duplicate links within the same page
+
+### 🛡️ Smart Restrictions
+- **Depth-Bounded Crawling**: Crawls exactly as deep as the configured max depth, and no further
+- **Content Type Filtering**: Only scrapes HTML responses
+- **File Type Filtering**: Skips non-content files (PDFs, images, etc.)
+- **Consecutive Empty Level Detection**: Stops after 3 consecutive levels yield no new content
+
+### 📊 Enhanced Tracking
+- **Domain Page Counts**: Tracks pages scraped per domain (for statistics)
+- **URL Check Counts**: Reports total URLs checked vs. pages actually scraped
+- **Detailed Statistics**: Comprehensive reporting on scraping efficiency
+- **Unlimited Processing**: No artificial caps on the number of pages crawled
+
+## Installation
+
+1. **Clone or download the project files**
+
+2. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+   - This installs all required packages, including `PyQt5`, `PyQtWebEngine` (visual preview), `openai` (AI features), `python-dotenv`, and `markdown`.
+
+3. **Run the application**:
+   ```bash
+   python web_scraper_app.py
+   ```
+
+## Usage
+
+### 1. Scraping Configuration
+- Enter a starting URL (with or without http/https)
+- Set the maximum crawl depth (1-100)
+- Click "Start Scraping" to begin
+
+### 2. Data View & Filtering
+- View scraped data in an interactive table
+- Filter by search terms or specific domains
+- Double-click any row to preview its content
+- Export data to JSON format
+
+### 3. Analysis & Statistics
+- View comprehensive scraping statistics
+- See the domain breakdown and word counts
+- Preview content in both text and visual formats
+- Analyze load times and link counts
+- Monitor duplicate-detection efficiency
+
+### 4. AI Analysis (New!)
+- Switch to the **AI Analysis** tab
+- Type your question or use the quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
+- The AI analyzes your scraped data and returns actionable insights
+- Requires an internet connection, the `openai` package, and an `OPENROUTER_API_KEY`
+
+## Visual Preview Feature
+
+The application includes a visual HTML preview that renders scraped pages in a browser-like view:
+
+- **Requirements**: PyQtWebEngine (installed via requirements.txt)
+- **Functionality**: Displays HTML content with proper styling and formatting
+- **Fallback**: If PyQtWebEngine is not available, a text-only preview is shown
+- **Error Handling**: Graceful error messages for invalid HTML content
+
+## Technical Details
+
+- **Backend**: Pure-Python scraping with `urllib` and `html.parser` (no compiled dependencies)
+- **Frontend**: PyQt5 with custom modern styling
+- **Threading**: Multi-threaded scraping for better performance
+- **Data Storage**: `Website` objects with full metadata
+- **URL Handling**: Intelligent normalization and domain filtering
+- **Loop Prevention**: Multi-layered duplicate detection system
+- **AI Integration**: Uses the OpenAI client against the OpenRouter API for chat-based analysis
+
+## File Structure
+
+```
+WebScraperApp/
+├── web_scraper_app.py      # Main application (GUI and AI assistant)
+├── module.py               # Core scraping logic
+├── test.py                 # Basic functionality tests
+├── requirements.txt        # Dependencies
+└── README.md               # This file
+```
+
+## Troubleshooting
+
+### Visual Preview Not Working
+1. Ensure PyQtWebEngine is installed: `pip install PyQtWebEngine`
+2. Check the console output for import errors
+
+### AI Analysis Not Working
+1. Ensure the `openai` package is installed: `pip install openai`
+2. Set `OPENROUTER_API_KEY` in your environment or a `.env` file
+3. Check your internet connection (the AI requires online access)
+4. If `openai` is not installed, the AI tab shows a placeholder response
+
+### Scraping Issues
+1. Verify your internet connection
+2. Check the URL format (add https:// if needed)
+3. Try a lower depth setting
+4. Check the console for error messages
+
+### Loop Prevention
+1. The scraper automatically prevents infinite loops
+2. Check the Analysis tab for detailed statistics
+3. Compare "Total URLs Checked" with "Total Pages" to gauge efficiency
+4. Use lower depth settings for sites with many internal links
+
+### Performance
+- Use lower depth settings for faster scraping
+- Filter data to focus on specific domains
+- Close other applications to free up resources
+- Watch the per-domain page counts to spot unexpectedly broad crawls
+
+## License
+
+This project is open source and available under the MIT License.
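+
+## Using the Scraper Without the GUI
+
+The crawler in `module.py` has no GUI dependencies, so it can also be driven directly from a script. A minimal sketch (the URL and depth here are illustrative, not defaults):
+
+```python
+import module
+
+scraper = module.WebScraper()
+websites = scraper.crawl_website("https://example.com", max_depth=2)
+
+for site in websites:
+    print(f"depth {site.depth} | {site.title} | {site.url} | {site.get_word_count()} words")
+
+stats = scraper.get_statistics()
+print(f"{stats['total_pages']} pages, {stats['total_links']} links, "
+      f"avg load {stats['avg_load_time']:.2f}s")
+```
+
+`crawl_website` blocks until the crawl finishes; pass a `progress_callback` (called with each `Website` as it is scraped) if you need live updates.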
\ No newline at end of file diff --git a/community-contributions/WebScraperApp/module.py b/community-contributions/WebScraperApp/module.py new file mode 100644 index 0000000..20dff0f --- /dev/null +++ b/community-contributions/WebScraperApp/module.py @@ -0,0 +1,473 @@ +import urllib.request +import urllib.parse +import urllib.error +import html.parser +import re +from datetime import datetime +import time +import ssl +from urllib.parse import urljoin, urlparse +from concurrent.futures import ThreadPoolExecutor, as_completed +import threading +from functools import partial + +class HTMLParser(html.parser.HTMLParser): + """Custom HTML parser to extract title, links, and text content""" + + def __init__(self): + super().__init__() + self.title = "" + self.links = [] + self.text_content = [] + self.in_title = False + self.in_body = False + self.current_tag = "" + + def handle_starttag(self, tag, attrs): + self.current_tag = tag.lower() + + if tag.lower() == 'title': + self.in_title = True + elif tag.lower() == 'body': + self.in_body = True + elif tag.lower() == 'a': + # Extract href attribute + for attr, value in attrs: + if attr.lower() == 'href' and value: + self.links.append(value) + + def handle_endtag(self, tag): + if tag.lower() == 'title': + self.in_title = False + elif tag.lower() == 'body': + self.in_body = False + + def handle_data(self, data): + if self.in_title: + self.title += data + elif self.in_body and self.current_tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'div', 'span', 'li']: + # Clean the text data + cleaned_data = re.sub(r'\s+', ' ', data.strip()) + if cleaned_data: + self.text_content.append(cleaned_data) + + def get_text(self): + """Return all extracted text content as a single string""" + return ' '.join(self.text_content) + + def get_clean_text(self, max_length=500): + """Return cleaned text content with length limit""" + text = self.get_text() + # Remove extra whitespace and limit length + text = re.sub(r'\s+', ' ', text.strip()) + if len(text) > max_length: + text = text[:max_length] + "..." + return text + +class Website: + """Class to store website data""" + + def __init__(self, title, url, content, depth, links=None, load_time=None): + self.title = title or "No Title" + self.url = url + self.content = content + self.depth = depth + self.links = links or [] + self.load_time = load_time + self.timestamp = datetime.now() + + def get_word_count(self): + """Get word count from content""" + if not self.content: + return 0 + # Extract text content and count words + text_content = re.sub(r'<[^>]+>', '', self.content) + words = text_content.split() + return len(words) + + def get_domain(self): + """Extract domain from URL""" + try: + parsed = urlparse(self.url) + return parsed.netloc + except: + return "" + + def get_normalized_domain(self): + """Get domain without www prefix for consistent filtering""" + domain = self.get_domain() + if domain.startswith('www.'): + return domain[4:] + return domain + + def search_content(self, query): + """Search for query in content""" + if not self.content or not query: + return False + return query.lower() in self.content.lower() + + def get_text_preview(self, max_length=200): + """Get a text preview of the content""" + if not self.content: + return "No content available" + + # Extract text content + text_content = re.sub(r'<[^>]+>', '', self.content) + text_content = re.sub(r'\s+', ' ', text_content.strip()) + + if len(text_content) > max_length: + return text_content[:max_length] + "..." 
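+        # Content already fits within the limit; return it untruncated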
+ return text_content + +class WebScraper: + """Web scraper with multithreading support and robust duplicate detection""" + + def __init__(self): + self.websites = [] + self.visited_urls = set() + self.visited_domains = set() # Track visited domains + self.start_domain = None # Store the starting domain + self.lock = threading.Lock() + self.max_workers = 10 # Number of concurrent threads + # Removed all page limits - unlimited crawling + self.domain_page_counts = {} # Track page count per domain (for statistics only) + self._stop_requested = False # Flag to stop scraping + + def normalize_url(self, url): + """Normalize URL to handle www prefixes and remove fragments""" + if not url: + return url + + # Remove fragments (#) to prevent duplicate content + if '#' in url: + url = url.split('#')[0] + + # Remove trailing slashes for consistency + url = url.rstrip('/') + + # Remove www prefix for consistent domain handling + if url.startswith('https://www.'): + return url.replace('https://www.', 'https://', 1) + elif url.startswith('http://www.'): + return url.replace('http://www.', 'http://', 1) + return url + + def get_domain_from_url(self, url): + """Extract and normalize domain from URL""" + try: + parsed = urlparse(url) + domain = parsed.netloc + if domain.startswith('www.'): + return domain[4:] + return domain + except: + return "" + + def should_skip_url(self, url, current_depth): + """Check if URL should be skipped based on various criteria""" + normalized_url = self.normalize_url(url) + + # Skip if already visited + if normalized_url in self.visited_urls: + return True, "Already visited" + + # Skip if not a valid HTTP/HTTPS URL + if not normalized_url.startswith(('http://', 'https://')): + return True, "Not HTTP/HTTPS URL" + + # Get domain + domain = self.get_domain_from_url(normalized_url) + if not domain: + return True, "Invalid domain" + + # Removed all domain page limits - unlimited crawling + # Removed external domain depth limits - crawl as deep as needed + + return False, "OK" + + def scrape_url(self, url, depth): + """Scrape a single URL with error handling and rate limiting""" + try: + # Check if stop was requested + if self._stop_requested: + return None + + # Check if URL should be skipped + should_skip, reason = self.should_skip_url(url, depth) + if should_skip: + print(f"Skipping {url}: {reason}") + return None + + # Normalize URL + normalized_url = self.normalize_url(url) + + # Mark as visited and update domain count (for statistics only) + with self.lock: + self.visited_urls.add(normalized_url) + domain = self.get_domain_from_url(normalized_url) + if domain: + self.domain_page_counts[domain] = self.domain_page_counts.get(domain, 0) + 1 + + # Add small delay to prevent overwhelming servers + time.sleep(0.1) + + start_time = time.time() + + # Create request with headers + req = urllib.request.Request( + normalized_url, + headers={ + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', + 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', + 'Accept-Language': 'en-US,en;q=0.5', + 'Accept-Encoding': 'gzip, deflate', + 'Connection': 'keep-alive', + 'Upgrade-Insecure-Requests': '1', + } + ) + + # Fetch the page with timeout + with urllib.request.urlopen(req, timeout=15) as response: + # Check content type + content_type = response.headers.get('content-type', '').lower() + if 'text/html' not in content_type and 'application/xhtml' not in content_type: + 
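+                    # Only HTML is parsed; binary and media responses are skipped early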
print(f"Skipping {url}: Not HTML content ({content_type})") + return None + + html_content = response.read().decode('utf-8', errors='ignore') + + load_time = time.time() - start_time + + # Skip if content is too small (likely error page) + if len(html_content) < 100: + print(f"Skipping {url}: Content too small ({len(html_content)} chars)") + return None + + # Parse HTML + parser = HTMLParser() + parser.feed(html_content) + + # Extract links and normalize them with duplicate detection + links = [] + base_url = normalized_url + seen_links = set() # Track links within this page to avoid duplicates + + for link in parser.links: + try: + absolute_url = urljoin(base_url, link) + normalized_link = self.normalize_url(absolute_url) + + # Skip if already seen in this page or should be skipped + if normalized_link in seen_links: + continue + seen_links.add(normalized_link) + + should_skip, reason = self.should_skip_url(normalized_link, depth + 1) + if should_skip: + continue + + # Only include http/https links and filter out common non-content URLs + if (normalized_link.startswith(('http://', 'https://')) and + not any(skip in normalized_link.lower() for skip in [ + 'mailto:', 'tel:', 'javascript:', 'data:', 'file:', + '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.zip', '.rar', + '.jpg', '.jpeg', '.png', '.gif', '.bmp', '.svg', '.ico', + '.css', '.js', '.xml', '.json', '.txt', '.log' + ])): + links.append(normalized_link) + except: + continue + + # Create Website object + website = Website( + title=parser.title, + url=normalized_url, + content=html_content, + depth=depth, + links=links, + load_time=load_time + ) + + return website + + except urllib.error.HTTPError as e: + print(f"HTTP Error scraping {url}: {e.code} - {e.reason}") + return None + except urllib.error.URLError as e: + print(f"URL Error scraping {url}: {e.reason}") + return None + except Exception as e: + print(f"Error scraping {url}: {str(e)}") + return None + + def crawl_website(self, start_url, max_depth=3, progress_callback=None): + """Crawl website with multithreading support and no page limits""" + if not start_url.startswith(('http://', 'https://')): + start_url = 'https://' + start_url + + # Initialize tracking + self.websites = [] + self.visited_urls = set() + self.visited_domains = set() + self.domain_page_counts = {} + self.start_domain = self.get_domain_from_url(start_url) + self._stop_requested = False # Reset stop flag + + print(f"Starting crawl from: {start_url}") + print(f"Starting domain: {self.start_domain}") + print(f"Max depth: {max_depth}") + print(f"Unlimited crawling - no page limits") + + # Start with the initial URL + urls_to_scrape = [(start_url, 0)] + max_depth_reached = 0 + consecutive_empty_levels = 0 + max_consecutive_empty = 3 # Stop if 3 consecutive levels have no new URLs + total_pages_scraped = 0 + # Removed all page limits - unlimited crawling + + with ThreadPoolExecutor(max_workers=self.max_workers) as executor: + for current_depth in range(max_depth + 1): + # Check if stop was requested + if self._stop_requested: + print("Scraping stopped by user request") + break + + if not urls_to_scrape: + print(f"Stopping at depth {current_depth}: No more URLs to scrape") + break + + # Check if we've reached too many consecutive empty levels + if consecutive_empty_levels >= max_consecutive_empty: + print(f"Stopping at depth {current_depth}: {max_consecutive_empty} consecutive empty levels") + break + + # Removed absolute page limit check - unlimited pages + + print(f"Scraping depth {current_depth} with 
{len(urls_to_scrape)} URLs") + + # Submit all URLs at current depth for concurrent scraping + future_to_url = { + executor.submit(self.scrape_url, url, depth): url + for url, depth in urls_to_scrape + } + + # Collect results and prepare next level + urls_to_scrape = [] + level_results = 0 + + for future in as_completed(future_to_url): + # Check if stop was requested + if self._stop_requested: + print("Stopping processing of current level") + break + + website = future.result() + if website: + with self.lock: + self.websites.append(website) + level_results += 1 + total_pages_scraped += 1 + + # Emit progress if callback provided + if progress_callback: + progress_callback(website) + + # Add links for next depth level (no limits) + if current_depth < max_depth: + for link in website.links: + # Removed URL limit per level - process all URLs + + should_skip, reason = self.should_skip_url(link, current_depth + 1) + if not should_skip: + urls_to_scrape.append((link, current_depth + 1)) + + # Check if stop was requested after processing level + if self._stop_requested: + break + + # Update depth tracking + if level_results > 0: + max_depth_reached = current_depth + consecutive_empty_levels = 0 + else: + consecutive_empty_levels += 1 + + # Only stop if we've reached the actual max depth + if current_depth >= max_depth: + print(f"Reached maximum depth: {max_depth}") + break + + # Print progress summary + print(f"Depth {current_depth} completed: {level_results} pages, Total: {len(self.websites)}") + if self.domain_page_counts: + print(f"Domain breakdown: {dict(self.domain_page_counts)}") + + print(f"Crawling completed. Max depth reached: {max_depth_reached}, Total pages: {len(self.websites)}") + print(f"Visited URLs: {len(self.visited_urls)}") + print(f"Domain breakdown: {dict(self.domain_page_counts)}") + return self.websites + + def reset(self): + """Reset the scraper state for a new crawl""" + self.websites = [] + self.visited_urls = set() + self.visited_domains = set() + self.domain_page_counts = {} + self.start_domain = None + self._stop_requested = False # Reset stop flag + + def get_statistics(self): + """Get scraping statistics with enhanced tracking information""" + if not self.websites: + return { + 'total_pages': 0, + 'total_links': 0, + 'total_words': 0, + 'avg_load_time': 0, + 'max_depth_reached': 0, + 'domains': {}, + 'visited_urls_count': 0, + 'domain_page_counts': {}, + 'start_domain': self.start_domain + } + + total_pages = len(self.websites) + total_links = sum(len(w.links) for w in self.websites) + total_words = sum(w.get_word_count() for w in self.websites) + + load_times = [w.load_time for w in self.websites if w.load_time] + avg_load_time = sum(load_times) / len(load_times) if load_times else 0 + + max_depth_reached = max(w.depth for w in self.websites) + + # Count domains + domains = {} + for website in self.websites: + domain = website.get_normalized_domain() + domains[domain] = domains.get(domain, 0) + 1 + + return { + 'total_pages': total_pages, + 'total_links': total_links, + 'total_words': total_words, + 'avg_load_time': avg_load_time, + 'max_depth_reached': max_depth_reached, + 'domains': domains, + 'visited_urls_count': len(self.visited_urls), + 'domain_page_counts': dict(self.domain_page_counts), + 'start_domain': self.start_domain + } + + def filter_by_domain(self, domain): + """Filter websites by domain""" + normalized_domain = self.normalize_url(domain) + return [w for w in self.websites if w.get_normalized_domain() == normalized_domain] + + def search_websites(self, 
query):
+        """Search websites by query"""
+        return [w for w in self.websites if w.search_content(query)]
+
+    def stop_scraping(self):
+        """Request a graceful stop of the scraping process"""
+        self._stop_requested = True
\ No newline at end of file
diff --git a/community-contributions/WebScraperApp/requirements.txt b/community-contributions/WebScraperApp/requirements.txt
new file mode 100644
index 0000000..a9f1b2a
--- /dev/null
+++ b/community-contributions/WebScraperApp/requirements.txt
@@ -0,0 +1,5 @@
+PyQt5>=5.15.0
+PyQtWebEngine>=5.15.0
+markdown>=3.0
+openai>=1.0.0
+python-dotenv>=1.0.0
\ No newline at end of file
diff --git a/community-contributions/WebScraperApp/test.py b/community-contributions/WebScraperApp/test.py
new file mode 100644
index 0000000..e86a29c
--- /dev/null
+++ b/community-contributions/WebScraperApp/test.py
@@ -0,0 +1,161 @@
+#!/usr/bin/env python3
+"""
+Simple test script to verify the web scraping functionality
+"""
+
+import module
+
+def test_basic_scraping():
+    """Test basic scraping functionality"""
+    print("Testing basic web scraping...")
+
+    # Create a scraper instance
+    scraper = module.WebScraper()
+
+    # Test with a simple website (httpbin.org is a safe test site)
+    test_url = "https://httpbin.org/html"
+
+    print(f"Scraping {test_url} with depth 1...")
+
+    try:
+        # Scrape with depth 1 to keep it fast
+        websites = scraper.crawl_website(test_url, max_depth=1)
+
+        print(f"Successfully scraped {len(websites)} websites")
+
+        if websites:
+            # Show first website details
+            first_site = websites[0]
+            print("\nFirst website:")
+            print(f"  Title: {first_site.title}")
+            print(f"  URL: {first_site.url}")
+            print(f"  Depth: {first_site.depth}")
+            print(f"  Links found: {len(first_site.links)}")
+            print(f"  Word count: {first_site.get_word_count()}")
+
+            # Show statistics
+            stats = scraper.get_statistics()
+            print("\nStatistics:")
+            print(f"  Total pages: {stats['total_pages']}")
+            print(f"  Total links: {stats['total_links']}")
+            print(f"  Total words: {stats['total_words']}")
+            print(f"  Average load time: {stats['avg_load_time']:.2f}s")
+
+            return True
+        else:
+            print("No websites were scraped")
+            return False
+
+    except Exception as e:
+        print(f"Error during scraping: {e}")
+        return False
+
+def test_website_class():
+    """Test the Website class functionality"""
+    print("\nTesting Website class...")
+
+    # Create a test website
+    website = module.Website(
+        title="Test Website",
+        url="https://example.com",
+        content="<html><body><p>This is a test paragraph.</p></body></html>",
+        depth=0,
+        links=["https://example.com/page1", "https://example.com/page2"]
+    )
+
+    # Test methods
+    print(f"Website title: {website.title}")
+    print(f"Website URL: {website.url}")
+    print(f"Word count: {website.get_word_count()}")
+    print(f"Domain: {website.get_domain()}")
+    print(f"Normalized domain: {website.get_normalized_domain()}")
+    print(f"Search for 'test': {website.search_content('test')}")
+    print(f"Search for 'nonexistent': {website.search_content('nonexistent')}")
+
+    return True
+
+def test_html_parser():
+    """Test the HTML parser functionality"""
+    print("\nTesting HTML Parser...")
+
+    parser = module.HTMLParser()
+    test_html = """
+    <html>
+    <head><title>Test Page</title></head>
+    <body>
+        <p><a href="https://example.com">This is a link to example.com</a></p>
+        <p><a href="/relative/path">Here's another relative link</a></p>
+    </body>
+    </html>
+ + + """ + + parser.feed(test_html) + print(f"Title extracted: {parser.title}") + print(f"Links found: {parser.links}") + print(f"Text content length: {len(parser.get_text())}") + + return True + +def test_url_normalization(): + """Test URL normalization to handle www. prefixes""" + print("\nTesting URL Normalization...") + + scraper = module.WebScraper() + + # Test URLs with and without www. + test_urls = [ + "https://www.example.com/page", + "https://example.com/page", + "http://www.test.com/path?param=value#fragment", + "http://test.com/path?param=value#fragment" + ] + + print("URL Normalization Results:") + for url in test_urls: + normalized = scraper.normalize_url(url) + print(f" Original: {url}") + print(f" Normalized: {normalized}") + print() + + # Test domain filtering + print("Domain Filtering Test:") + test_websites = [ + module.Website("Site 1", "https://www.example.com", "content", 0), + module.Website("Site 2", "https://example.com", "content", 0), + module.Website("Site 3", "https://www.test.com", "content", 0) + ] + + scraper.websites = test_websites + + # Test filtering by domain with and without www. + domains_to_test = ["example.com", "www.example.com", "test.com", "www.test.com"] + + for domain in domains_to_test: + filtered = scraper.filter_by_domain(domain) + print(f" Filter '{domain}': {len(filtered)} results") + for site in filtered: + print(f" - {site.title} ({site.url})") + + return True + +if __name__ == "__main__": + print("Web Scraper Test Suite") + print("=" * 50) + + # Test HTML parser + test_html_parser() + + # Test Website class + test_website_class() + + # Test URL normalization + test_url_normalization() + + # Test basic scraping (uncomment to test actual scraping) + # Note: This requires internet connection + # test_basic_scraping() + + print("\nTest completed!") + print("\nTo run the full application:") + print("python web_scraper_app.py") \ No newline at end of file diff --git a/community-contributions/WebScraperApp/web_scraper_app.py b/community-contributions/WebScraperApp/web_scraper_app.py new file mode 100644 index 0000000..ccd5ce2 --- /dev/null +++ b/community-contributions/WebScraperApp/web_scraper_app.py @@ -0,0 +1,1678 @@ +import sys +import json +from urllib.parse import urlparse +from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QVBoxLayout, + QHBoxLayout, QLabel, QLineEdit, QSpinBox, QPushButton, + QTextEdit, QTableWidget, QTableWidgetItem, QTabWidget, + QProgressBar, QComboBox, QMessageBox, QSplitter, + QGroupBox, QGridLayout, QHeaderView, QFrame, QScrollArea, + QSystemTrayIcon, QStyle, QAction, QMenu, QTreeWidget, QTreeWidgetItem, + QListWidget, QListWidgetItem, QSizePolicy, QAbstractItemView) +from PyQt5.QtCore import QThread, pyqtSignal, Qt, QTimer, QUrl +from PyQt5.QtGui import QFont, QIcon, QPalette, QColor, QPixmap +try: + from PyQt5.QtWebEngineWidgets import QWebEngineView + WEB_ENGINE_AVAILABLE = True + print("PyQtWebEngine successfully imported - Visual preview enabled") +except ImportError as e: + WEB_ENGINE_AVAILABLE = False + print(f"PyQtWebEngine not available: {e}") + print("Visual preview will be disabled. 
Install with: pip install PyQtWebEngine") +import module +import re +import webbrowser +import os +try: + from openai import OpenAI + OPENAI_AVAILABLE = True +except ImportError: + OPENAI_AVAILABLE = False +from datetime import datetime +from dotenv import load_dotenv +import markdown + +# Load environment variables from .env file +load_dotenv() + +class ScrapingThread(QThread): + """Thread for running web scraping operations""" + progress_updated = pyqtSignal(str) + scraping_complete = pyqtSignal(list) + error_occurred = pyqtSignal(str) + + def __init__(self, url, max_depth): + super().__init__() + self.url = url + self.max_depth = max_depth + self.scraper = module.WebScraper() + self._stop_requested = False + + def stop(self): + """Request graceful stop of the scraping process""" + self._stop_requested = True + if hasattr(self.scraper, 'stop_scraping'): + self.scraper.stop_scraping() + + def run(self): + try: + self.progress_updated.emit("Starting web scraping...") + + # Reset scraper state for new crawl + self.scraper.reset() + + def progress_callback(website): + if self._stop_requested: + return # Stop processing if requested + if website: + self.progress_updated.emit(f"Scraped: {website.title} (depth {website.depth})") + + # Start scraping with progress callback + websites = self.scraper.crawl_website(self.url, self.max_depth, progress_callback) + + # Check if stop was requested + if self._stop_requested: + self.progress_updated.emit("Scraping stopped by user.") + return + + # Emit final progress + self.progress_updated.emit(f"Scraping complete! Found {len(websites)} websites.") + self.scraping_complete.emit(websites) + + except Exception as e: + if not self._stop_requested: # Only emit error if not stopped by user + self.error_occurred.emit(str(e)) + +class ModernButton(QPushButton): + """Custom modern button with hover effects""" + def __init__(self, text, primary=False): + super().__init__(text) + self.primary = primary + self.setMinimumHeight(40) + self.setFont(QFont("Segoe UI", 10, QFont.Weight.Medium)) + self.setCursor(Qt.CursorShape.PointingHandCursor) + self.update_style() + + def update_style(self): + if self.primary: + self.setStyleSheet(""" + QPushButton { + background: #3b82f6; + border: none; + color: white; + padding: 12px 24px; + border-radius: 6px; + font-weight: 600; + } + QPushButton:hover { + background: #2563eb; + } + QPushButton:pressed { + background: #1d4ed8; + } + QPushButton:disabled { + background: #9ca3af; + color: #f3f4f6; + } + """) + else: + self.setStyleSheet(""" + QPushButton { + background: white; + border: 1px solid #d1d5db; + color: #374151; + padding: 10px 20px; + border-radius: 6px; + font-weight: 500; + } + QPushButton:hover { + border-color: #3b82f6; + color: #3b82f6; + background: #f8fafc; + } + QPushButton:pressed { + background: #f1f5f9; + } + QPushButton:disabled { + background: #f9fafb; + border-color: #e5e7eb; + color: #9ca3af; + } + """) + +class ModernLineEdit(QLineEdit): + """Custom modern input field""" + def __init__(self, placeholder=""): + super().__init__() + self.setPlaceholderText(placeholder) + self.setMinimumHeight(40) + self.setFont(QFont("Segoe UI", 10)) + self.setStyleSheet(""" + QLineEdit { + border: 1px solid #d1d5db; + border-radius: 6px; + padding: 8px 12px; + background: white; + color: #374151; + font-size: 14px; + } + QLineEdit:focus { + border-color: #3b82f6; + outline: none; + } + QLineEdit::placeholder { + color: #9ca3af; + } + """) + +class ModernSpinBox(QSpinBox): + """Custom modern spin box""" + def 
__init__(self): + super().__init__() + self.setMinimumHeight(40) + self.setFont(QFont("Segoe UI", 10)) + self.setStyleSheet(""" + QSpinBox { + border: 1px solid #d1d5db; + border-radius: 6px; + padding: 8px 12px; + background: white; + color: #374151; + font-size: 14px; + } + QSpinBox:focus { + border-color: #3b82f6; + } + QSpinBox::up-button, QSpinBox::down-button { + border: none; + background: #f9fafb; + border-radius: 3px; + margin: 2px; + } + QSpinBox::up-button:hover, QSpinBox::down-button:hover { + background: #f3f4f6; + } + """) + +class ChatBubbleWidget(QWidget): + def __init__(self, message, timestamp, role): + super().__init__() + layout = QVBoxLayout(self) + layout.setContentsMargins(0, 0, 0, 0) + layout.setSpacing(2) + # Bubble + if role == "ai": + html = markdown.markdown(message) + bubble = QLabel(html) + bubble.setTextFormat(Qt.TextFormat.RichText) + else: + bubble = QLabel(message) + bubble.setTextFormat(Qt.TextFormat.PlainText) + bubble.setWordWrap(True) + bubble.setTextInteractionFlags(Qt.TextInteractionFlag.TextSelectableByMouse) + bubble.setFont(QFont("Segoe UI", 11)) + bubble.setSizePolicy(QSizePolicy.Preferred, QSizePolicy.Maximum) + bubble.setMinimumWidth(800) + bubble.setMaximumWidth(1200) + bubble.adjustSize() + # Timestamp + ts = QLabel(("🤖 " if role == "ai" else "") + timestamp) + ts.setFont(QFont("Segoe UI", 8)) + ts.setStyleSheet("color: #9ca3af;") + if role == "user": + bubble.setStyleSheet("background: #2563eb; color: white; border-radius: 16px; padding: 10px 16px; margin-left: 40px;") + layout.setAlignment(Qt.AlignmentFlag.AlignRight) + ts.setAlignment(Qt.AlignmentFlag.AlignRight) + else: + bubble.setStyleSheet("background: #f3f4f6; color: #1e293b; border-radius: 16px; padding: 10px 16px; margin-right: 40px;") + layout.setAlignment(Qt.AlignmentFlag.AlignLeft) + ts.setAlignment(Qt.AlignmentFlag.AlignLeft) + layout.addWidget(bubble) + layout.addWidget(ts) + +class WebScraperApp(QMainWindow): + def __init__(self): + super().__init__() + self.websites = [] + self.scraper = module.WebScraper() + self.init_ui() + + def init_ui(self): + self.setWindowTitle("Web Scraper & Data Analyzer") + self.setGeometry(100, 100, 1400, 900) + self.setMinimumSize(1200, 800) # Set minimum size to prevent geometry issues + + # Set clean, minimal styling + self.setStyleSheet(""" + QMainWindow { + background: #1e293b; + } + QTabWidget::pane { + border: none; + background: white; + border-radius: 8px; + margin: 8px 8px 8px 8px; + padding-top: 8px; + } + QTabBar::tab { + background: #475569; + color: #e2e8f0; + padding: 12px 20px; + margin-right: 4px; + border-top-left-radius: 8px; + border-top-right-radius: 8px; + font-weight: 600; + font-size: 14px; + min-width: 120px; + margin-bottom: 8px; + } + QTabBar::tab:selected { + background: white; + color: #1e293b; + border-bottom: none; + margin-bottom: 8px; + } + QTabBar::tab:hover:!selected { + background: #64748b; + color: #f1f5f9; + } + QTabBar::tab:first { + margin-left: 8px; + } + QTabBar::tab:last { + margin-right: 8px; + } + QGroupBox { + font-weight: 600; + font-size: 14px; + border: 2px solid #e2e8f0; + border-radius: 8px; + margin-top: 16px; + padding-top: 16px; + background: #f8fafc; + } + QGroupBox::title { + subcontrol-origin: margin; + left: 16px; + + color: #1e293b; + background: #f8fafc; + } + QTableWidget { + border: 2px solid #e2e8f0; + border-radius: 8px; + background: white; + gridline-color: #f1f5f9; + alternate-background-color: #f8fafc; + selection-background-color: #dbeafe; + selection-color: #1e293b; + } + 
QTableWidget::item { + padding: 8px 4px; + border: none; + min-height: 20px; + } + QTableWidget::item:selected { + background: #dbeafe; + color: #1e293b; + } + QHeaderView::section { + background: #e2e8f0; + padding: 12px 8px; + border: none; + border-right: 1px solid #cbd5e1; + border-bottom: 1px solid #cbd5e1; + font-weight: 600; + color: #1e293b; + } + QHeaderView::section:vertical { + background: #f8fafc; + padding: 8px 4px; + border: none; + border-bottom: 1px solid #e2e8f0; + font-weight: 500; + color: #64748b; + min-width: 40px; + } + QProgressBar { + border: 2px solid #e2e8f0; + border-radius: 6px; + text-align: center; + background: #f1f5f9; + } + QProgressBar::chunk { + background: #3b82f6; + border-radius: 5px; + } + QTextEdit { + border: 2px solid #e2e8f0; + border-radius: 6px; + padding: 12px; + background: white; + color: #1e293b; + font-family: 'Segoe UI', sans-serif; + } + QComboBox { + border: 2px solid #d1d5db; + border-radius: 6px; + padding: 8px 12px; + background: white; + color: #1e293b; + font-size: 14px; + min-height: 40px; + } + QComboBox:focus { + border-color: #3b82f6; + } + QComboBox::drop-down { + border: none; + width: 30px; + } + QComboBox::down-arrow { + image: none; + border-left: 5px solid transparent; + border-right: 5px solid transparent; + border-top: 5px solid #6b7280; + margin-right: 10px; + } + QLabel { + color: #1e293b; + font-weight: 500; + font-size: 14px; + } + """) + + # System tray icon for notifications + + self.tray_icon = QSystemTrayIcon(self) + self.tray_icon.setIcon(self.style().standardIcon(QStyle.StandardPixmap.SP_ComputerIcon)) + self.tray_icon.setVisible(True) + + # Create central widget and main layout + central_widget = QWidget() + self.setCentralWidget(central_widget) + main_layout = QVBoxLayout(central_widget) + main_layout.setContentsMargins(16, 16, 16, 16) + main_layout.setSpacing(12) + + # Create header + header = self.create_header() + main_layout.addWidget(header) + + # Add proper spacing after header + spacer = QWidget() + spacer.setFixedHeight(12) + main_layout.addWidget(spacer) + + # Create tab widget with proper margins + self.tab_widget = QTabWidget() + self.tab_widget.setStyleSheet(""" + QTabWidget { + margin-top: 0px; + background: transparent; + } + QTabWidget::pane { + border: none; + background: white; + border-radius: 8px; + margin: 4px 8px 8px 8px; + padding-top: 4px; + } + QTabBar { + background: transparent; + spacing: 0px; + } + QTabBar::tab { + background: #475569; + color: #e2e8f0; + padding: 12px 20px; + margin-right: 4px; + border-top-left-radius: 8px; + border-top-right-radius: 8px; + font-weight: 600; + font-size: 14px; + min-width: 120px; + margin-bottom: 4px; + } + QTabBar::tab:selected { + background: white; + color: #1e293b; + border-bottom: none; + margin-bottom: 4px; + } + QTabBar::tab:hover:!selected { + background: #64748b; + color: #f1f5f9; + } + QTabBar::tab:first { + margin-left: 8px; + } + QTabBar::tab:last { + margin-right: 8px; + } + """) + main_layout.addWidget(self.tab_widget) + + # Create tabs + self.create_scraping_tab() + self.create_data_tab() + self.create_analysis_tab() + self.create_sitemap_tab() + self.create_ai_tab() + + def create_header(self): + """Create a clean header with help button only (no theme toggle)""" + header_widget = QWidget() + header_widget.setStyleSheet(""" + QWidget { + background: #0f172a; + border-radius: 12px; + margin: 4px 4px 8px 4px; + } + """) + header_layout = QHBoxLayout(header_widget) + header_layout.setContentsMargins(24, 20, 24, 20) + 
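+        # Header row: title on the left, subtitle in the middle, help button and version badge on the right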
header_layout.setSpacing(16) + + # Title + title_label = QLabel("Web Scraper & Data Analyzer") + title_label.setStyleSheet(""" + QLabel { + color: #f8fafc; + font-size: 28px; + font-weight: 800; + font-family: 'Segoe UI', sans-serif; + } + """) + + # Subtitle + subtitle_label = QLabel("Modern web scraping with intelligent data analysis") + subtitle_label.setStyleSheet(""" + QLabel { + color: #cbd5e1; + font-size: 16px; + font-weight: 500; + font-family: 'Segoe UI', sans-serif; + } + """) + + # Help button + help_button = ModernButton("Help") + help_button.clicked.connect(self.show_help) + + # Right side info + info_widget = QWidget() + info_layout = QVBoxLayout(info_widget) + info_layout.setAlignment(Qt.AlignmentFlag.AlignRight) + info_layout.setSpacing(4) + + version_label = QLabel("v2.0") + version_label.setStyleSheet(""" + QLabel { + color: #94a3b8; + font-size: 14px; + font-weight: 600; + background: #1e293b; + padding: 6px 12px; + border-radius: 6px; + border: 1px solid #334155; + } + """) + + info_layout.addWidget(version_label) + + header_layout.addWidget(title_label) + header_layout.addStretch() + header_layout.addWidget(subtitle_label) + header_layout.addStretch() + header_layout.addWidget(help_button) + header_layout.addWidget(info_widget) + + return header_widget + + def create_scraping_tab(self): + """Create the web scraping configuration tab""" + scraping_widget = QWidget() + main_layout = QVBoxLayout(scraping_widget) + main_layout.setContentsMargins(16, 16, 16, 16) + main_layout.setSpacing(16) + + # Create scroll area + scroll_area = QScrollArea() + scroll_area.setWidgetResizable(True) + scroll_area.setStyleSheet("QScrollArea { border: none; }") + scroll_area.setHorizontalScrollBarPolicy(Qt.ScrollBarPolicy.ScrollBarAsNeeded) + scroll_area.setVerticalScrollBarPolicy(Qt.ScrollBarPolicy.ScrollBarAsNeeded) + + # Create content widget for scrolling + content_widget = QWidget() + layout = QVBoxLayout(content_widget) + layout.setSpacing(16) + layout.setContentsMargins(0, 0, 0, 0) + + # Input group + input_group = QGroupBox("Scraping Configuration") + input_layout = QGridLayout(input_group) + input_layout.setSpacing(12) + + # URL input + input_layout.addWidget(QLabel("Website URL:"), 0, 0) + self.url_input = ModernLineEdit("https://example.com") + input_layout.addWidget(self.url_input, 0, 1) + + # Depth input + input_layout.addWidget(QLabel("Max Depth (1-100):"), 1, 0) + self.depth_input = ModernSpinBox() + self.depth_input.setRange(1, 100) + self.depth_input.setValue(3) + input_layout.addWidget(self.depth_input, 1, 1) + + # Control buttons + button_layout = QHBoxLayout() + button_layout.setSpacing(8) + + self.start_button = ModernButton("Start Scraping", primary=True) + self.start_button.clicked.connect(self.start_scraping) + button_layout.addWidget(self.start_button) + + self.stop_button = ModernButton("Stop") + self.stop_button.clicked.connect(self.stop_scraping) + self.stop_button.setEnabled(False) + button_layout.addWidget(self.stop_button) + + input_layout.addLayout(button_layout, 2, 0, 1, 2) + layout.addWidget(input_group) + + # Progress group + progress_group = QGroupBox("Progress") + progress_layout = QVBoxLayout(progress_group) + progress_layout.setSpacing(8) + + self.progress_bar = QProgressBar() + self.progress_bar.setVisible(False) + self.progress_bar.setMinimumHeight(20) + progress_layout.addWidget(self.progress_bar) + + self.status_label = QLabel("Ready to start scraping...") + self.status_label.setStyleSheet(""" + QLabel { + color: #374151; + font-size: 14px; + 
padding: 8px; + background: #f8fafc; + border-radius: 6px; + border-left: 3px solid #3b82f6; + } + """) + self.status_label.setWordWrap(True) # Enable word wrapping + progress_layout.addWidget(self.status_label) + + layout.addWidget(progress_group) + + # Results preview + results_group = QGroupBox("Scraping Results") + results_layout = QVBoxLayout(results_group) + + self.results_text = QTextEdit() + self.results_text.setReadOnly(True) + self.results_text.setMinimumHeight(80) # Reduced minimum height for more compact output + results_layout.addWidget(self.results_text) + + layout.addWidget(results_group) + + # Set the content widget in the scroll area + scroll_area.setWidget(content_widget) + main_layout.addWidget(scroll_area) + + self.tab_widget.addTab(scraping_widget, "Web Scraping") + + def create_data_tab(self): + """Create the data viewing and filtering tab""" + data_widget = QWidget() + layout = QVBoxLayout(data_widget) + layout.setSpacing(16) + + # Search and filter controls + controls_group = QGroupBox("Search & Filter") + controls_layout = QHBoxLayout(controls_group) + controls_layout.setSpacing(12) + + controls_layout.addWidget(QLabel("Search:")) + self.search_input = ModernLineEdit("Enter search term...") + self.search_input.textChanged.connect(self.filter_data) + controls_layout.addWidget(self.search_input) + + controls_layout.addWidget(QLabel("Domain:")) + self.domain_filter = QComboBox() + self.domain_filter.currentTextChanged.connect(self.filter_data) + controls_layout.addWidget(self.domain_filter) + + self.export_button = ModernButton("Export Data") + self.export_button.clicked.connect(self.export_data) + controls_layout.addWidget(self.export_button) + + # Sitemap button + self.sitemap_button = ModernButton("Generate Sitemap.xml") + self.sitemap_button.clicked.connect(self.generate_sitemap) + controls_layout.addWidget(self.sitemap_button) + + layout.addWidget(controls_group) + + # Data table + self.data_table = QTableWidget() + self.data_table.setColumnCount(6) + self.data_table.setHorizontalHeaderLabels([ + "Title", "URL", "Depth", "Links", "Words", "Load Time" + ]) + + # Set table properties to fill available width + header = self.data_table.horizontalHeader() + header.setStretchLastSection(False) # Don't stretch the last section + + # Set resize modes to make table fill width properly + header.setSectionResizeMode(0, QHeaderView.Stretch) # Title - stretch to fill + header.setSectionResizeMode(1, QHeaderView.Stretch) # URL - stretch to fill + header.setSectionResizeMode(2, QHeaderView.Fixed) # Depth - fixed + header.setSectionResizeMode(3, QHeaderView.Fixed) # Links - fixed + header.setSectionResizeMode(4, QHeaderView.Fixed) # Words - fixed + header.setSectionResizeMode(5, QHeaderView.Fixed) # Load Time - fixed + + # Set fixed column widths for non-stretching columns + self.data_table.setColumnWidth(2, 80) # Depth + self.data_table.setColumnWidth(3, 80) # Links + self.data_table.setColumnWidth(4, 80) # Words + self.data_table.setColumnWidth(5, 100) # Load Time + + # Set row height to prevent index cutoff + self.data_table.verticalHeader().setDefaultSectionSize(40) # Increased row height + self.data_table.verticalHeader().setMinimumSectionSize(35) # Minimum row height + + # Enable word wrapping for title and URL columns + self.data_table.setWordWrap(True) + + # Connect double-click signal + self.data_table.cellDoubleClicked.connect(self.show_content_preview) + + layout.addWidget(self.data_table) + + self.tab_widget.addTab(data_widget, "Data View") + + def 
create_analysis_tab(self): + """Create the data analysis tab""" + analysis_widget = QWidget() + layout = QVBoxLayout(analysis_widget) + layout.setSpacing(16) + + # Create scroll area for better layout + scroll_area = QScrollArea() + scroll_area.setWidgetResizable(True) + scroll_area.setStyleSheet("QScrollArea { border: none; }") + + content_widget = QWidget() + content_layout = QVBoxLayout(content_widget) + content_layout.setSpacing(16) + + # Statistics group + stats_group = QGroupBox("Statistics") + stats_layout = QGridLayout(stats_group) + stats_layout.setSpacing(12) + + self.stats_labels = {} + stats_fields = [ + ("Total Pages", "Total Pages"), + ("Total Links", "Total Links"), + ("Total Words", "Total Words"), + ("Average Load Time", "Average Load Time"), + ("Max Depth Reached", "Max Depth Reached") + ] + + for i, (label_text, field) in enumerate(stats_fields): + stats_layout.addWidget(QLabel(f"{label_text}:"), i, 0) + label = QLabel("0") + label.setStyleSheet(""" + QLabel { + font-weight: 700; + color: #3b82f6; + font-size: 16px; + padding: 8px 12px; + background: #eff6ff; + border-radius: 6px; + border-left: 3px solid #3b82f6; + } + """) + self.stats_labels[field] = label + stats_layout.addWidget(label, i, 1) + + content_layout.addWidget(stats_group) + + # Domain breakdown + domain_group = QGroupBox("Domain Breakdown") + domain_layout = QVBoxLayout(domain_group) + + self.domain_text = QTextEdit() + self.domain_text.setReadOnly(True) + self.domain_text.setMaximumHeight(150) + domain_layout.addWidget(self.domain_text) + + content_layout.addWidget(domain_group) + + # Content preview + content_preview_group = QGroupBox("Content Preview") + content_preview_layout = QVBoxLayout(content_preview_group) + + # Create splitter for text and visual preview + preview_splitter = QSplitter(Qt.Orientation.Horizontal) + + # Text preview + text_preview_widget = QWidget() + text_preview_layout = QVBoxLayout(text_preview_widget) + text_preview_layout.setContentsMargins(0, 0, 0, 0) + + text_label = QLabel("Text Content:") + text_label.setStyleSheet("font-weight: 600; margin-bottom: 8px;") + text_preview_layout.addWidget(text_label) + + self.content_text = QTextEdit() + self.content_text.setReadOnly(True) + self.content_text.setMaximumHeight(400) + self.content_text.setFont(QFont("Segoe UI", 12)) + self.content_text.setStyleSheet(""" + QTextEdit { + font-size: 12px; + line-height: 1.4; + padding: 16px; + } + """) + text_preview_layout.addWidget(self.content_text) + + # Visual HTML preview + visual_preview_widget = QWidget() + visual_preview_layout = QVBoxLayout(visual_preview_widget) + visual_preview_layout.setContentsMargins(0, 0, 0, 0) + + visual_label = QLabel("Visual Preview:") + visual_label.setStyleSheet("font-weight: 600; margin-bottom: 8px;") + visual_preview_layout.addWidget(visual_label) + + if WEB_ENGINE_AVAILABLE: + self.web_view = QWebEngineView() + self.web_view.setMinimumHeight(400) + self.web_view.setMaximumHeight(400) + visual_preview_layout.addWidget(self.web_view) + else: + self.web_view = QLabel("Visual preview not available\nInstall PyQtWebEngine for HTML rendering") + self.web_view.setStyleSheet("color: #6b7280; padding: 20px; text-align: center;") + self.web_view.setMinimumHeight(400) + self.web_view.setMaximumHeight(400) + visual_preview_layout.addWidget(self.web_view) + + # Add widgets to splitter + preview_splitter.addWidget(text_preview_widget) + preview_splitter.addWidget(visual_preview_widget) + preview_splitter.setSizes([400, 600]) # Set initial split ratio + + 
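+        # Splitter pairs the plain-text extract (left) with the rendered HTML view (right)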
content_preview_layout.addWidget(preview_splitter) + + content_layout.addWidget(content_preview_group) + + scroll_area.setWidget(content_widget) + layout.addWidget(scroll_area) + + self.tab_widget.addTab(analysis_widget, "Analysis") + + def create_sitemap_tab(self): + """Create the visual sitemap tab with a tree widget and export button""" + sitemap_widget = QWidget() + layout = QVBoxLayout(sitemap_widget) + layout.setSpacing(16) + + # Export button + self.export_sitemap_button = ModernButton("Export Sitemap (JSON)") + self.export_sitemap_button.clicked.connect(self.export_sitemap_json) + layout.addWidget(self.export_sitemap_button) + + self.sitemap_tree = QTreeWidget() + self.sitemap_tree.setHeaderLabels(["Page Title", "URL"]) + self.sitemap_tree.setColumnWidth(0, 350) + self.sitemap_tree.setColumnWidth(1, 600) + self.sitemap_tree.itemDoubleClicked.connect(self.open_url_in_browser) + layout.addWidget(self.sitemap_tree) + + self.tab_widget.addTab(sitemap_widget, "Sitemap") + + def create_ai_tab(self): + """Create a simplified, modern AI Analysis tab with a chat interface and compact quick actions, using more curves to match the app style.""" + ai_widget = QWidget() + layout = QVBoxLayout(ai_widget) + layout.setSpacing(8) + layout.setContentsMargins(16, 16, 16, 16) + + hint_label = QLabel("💡 Ask questions about your scraped websites below.") + hint_label.setStyleSheet(""" + QLabel { + color: #64748b; + font-size: 13px; + padding: 4px 0 8px 0; + } + """) + layout.addWidget(hint_label) + + # --- Chat area --- + self.ai_chat_history = QListWidget() + self.ai_chat_history.setStyleSheet(""" + QListWidget { + background: #f8fafc; + border: 1.5px solid #e2e8f0; + border-radius: 22px; + font-size: 15px; + color: #1e293b; + padding: 12px; + font-family: 'Segoe UI', sans-serif; + } + """) + self.ai_chat_history.setSpacing(6) + self.ai_chat_history.setMinimumHeight(300) + self.ai_chat_history.setResizeMode(QListWidget.Adjust) + self.ai_chat_history.setVerticalScrollMode(QAbstractItemView.ScrollPerPixel) + layout.addWidget(self.ai_chat_history, stretch=1) + self.chat_messages = [] # Store (role, message, timestamp) tuples + self.render_chat_history() + + # --- Quick action buttons --- + quick_actions_widget = QWidget() + quick_actions_layout = QHBoxLayout(quick_actions_widget) + quick_actions_layout.setSpacing(8) + quick_actions_layout.setContentsMargins(0, 0, 0, 0) + quick_questions = [ + "Analyze the website structure", + "Find key content themes", + "Suggest SEO improvements", + "Compare page performance" + ] + for question in quick_questions: + quick_btn = QPushButton(question) + quick_btn.setFont(QFont("Segoe UI", 10)) + quick_btn.setCursor(Qt.CursorShape.PointingHandCursor) + quick_btn.clicked.connect(lambda _, q=question: self.quick_question(q)) + quick_btn.setStyleSheet(""" + QPushButton { + background: #e0e7ef; + border: none; + color: #374151; + padding: 8px 22px; + border-radius: 22px; + font-weight: 500; + font-size: 13px; + box-shadow: 0 2px 8px rgba(59, 130, 246, 0.04); + } + QPushButton:hover { + background: #3b82f6; + color: white; + } + QPushButton:pressed { + background: #2563eb; + color: white; + } + """) + quick_actions_layout.addWidget(quick_btn) + layout.addWidget(quick_actions_widget) + + # --- Input area --- + input_container = QWidget() + input_layout = QHBoxLayout(input_container) + input_layout.setContentsMargins(0, 0, 0, 0) + input_layout.setSpacing(8) + self.ai_input = QLineEdit() + self.ai_input.setPlaceholderText("Type your question and press Enter...") + 
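+        # Pressing Enter triggers the same handler as the Send button (wired just below)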
self.ai_input.setMinimumHeight(44) + self.ai_input.setFont(QFont("Segoe UI", 12)) + self.ai_input.returnPressed.connect(self.send_ai_message) + self.ai_input.setStyleSheet(""" + QLineEdit { + border: 1.5px solid #e2e8f0; + border-radius: 22px; + padding: 10px 20px; + background: white; + color: #1e293b; + font-size: 14px; + } + QLineEdit:focus { + border-color: #3b82f6; + outline: none; + } + QLineEdit::placeholder { + color: #9ca3af; + } + """) + self.ai_send_button = QPushButton("Send") + self.ai_send_button.setMinimumHeight(44) + self.ai_send_button.setMinimumWidth(80) + self.ai_send_button.setFont(QFont("Segoe UI", 12, QFont.Weight.Medium)) + self.ai_send_button.setCursor(Qt.CursorShape.PointingHandCursor) + self.ai_send_button.clicked.connect(self.send_ai_message) + self.ai_send_button.setStyleSheet(""" + QPushButton { + background: #3b82f6; + border: none; + color: white; + padding: 10px 28px; + border-radius: 22px; + font-weight: 600; + font-size: 15px; + box-shadow: 0 2px 8px rgba(59, 130, 246, 0.08); + } + QPushButton:hover { + background: #2563eb; + } + QPushButton:pressed { + background: #1d4ed8; + } + QPushButton:disabled { + background: #9ca3af; + color: #f3f4f6; + } + """) + input_layout.addWidget(self.ai_input, stretch=1) + input_layout.addWidget(self.ai_send_button) + layout.addWidget(input_container) + + self.tab_widget.addTab(ai_widget, "AI Analysis") + ai_tab_index = self.tab_widget.count() - 1 + self.set_ai_tab_gradient(ai_tab_index) + + def render_chat_history(self): + self.ai_chat_history.clear() + for role, msg, timestamp in self.chat_messages: + item = QListWidgetItem() + bubble = ChatBubbleWidget(msg, timestamp, role) + bubble.adjustSize() + item.setSizeHint(bubble.sizeHint()) + self.ai_chat_history.addItem(item) + self.ai_chat_history.setItemWidget(item, bubble) + self.ai_chat_history.scrollToBottom() + + def send_ai_message(self): + user_msg = self.ai_input.text().strip() + if not user_msg: + return + timestamp = datetime.now().strftime("%H:%M") + self.chat_messages.append(("user", user_msg, timestamp)) + self.render_chat_history() + self.ai_input.clear() + # Show thinking indicator as AI message + self.chat_messages.append(("ai", "🤔 Analyzing your question...", timestamp)) + self.render_chat_history() + ai_context = self.get_ai_context(user_msg) + QTimer.singleShot(100, lambda: self._do_ai_response_openrouter(user_msg, ai_context)) + + def _do_ai_response_openrouter(self, user_msg, ai_context): + if OPENAI_AVAILABLE: + try: + client = OpenAI( + base_url="https://openrouter.ai/api/v1", + api_key=os.environ.get("OPENROUTER_API_KEY"), + ) + system_prompt = """You are an expert website analyst and AI assistant specializing in web scraping analysis. Your role is to:\n\n1. **Analyze website content** - Provide insights about the scraped websites\n2. **Identify patterns** - Find common themes, structures, and content types\n3. **Offer recommendations** - Suggest improvements for SEO, content, or structure\n4. **Answer questions** - Respond to specific queries about the websites\n5. 
**Provide actionable insights** - Give practical advice based on the data\n\n**Response Guidelines:**\n- Be professional yet conversational\n- Use clear, structured responses with bullet points when appropriate\n- Reference specific websites by title when relevant\n- Provide specific examples from the content\n- Suggest actionable next steps when possible\n- Use markdown formatting for better readability\n\n**Context:** You have access to scraped website data including titles, URLs, content previews, and metadata.""" + user_prompt = f"""# Website Analysis Request\n\n## User Question\n{user_msg}\n\n## Available Website Data\n{ai_context}\n\n## Instructions\nPlease provide a comprehensive analysis based on the user's question. Use the website data above to support your response. If the question is about specific aspects (SEO, content, structure, etc.), focus your analysis accordingly.\n\n**Format your response with:**\n- Clear headings and structure\n- Specific examples from the websites\n- Actionable insights and recommendations\n- Professional, helpful tone""" + completion = client.chat.completions.create( + extra_headers={ + "HTTP-Referer": "http://localhost:8000", + "X-Title": "Web Scraper & Data Analyzer - AI Analysis", + }, + extra_body={}, + model="deepseek/deepseek-r1-0528-qwen3-8b:free", + messages=[ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt} + ], + temperature=0.7, + max_tokens=2000 + ) + try: + answer = completion.choices[0].message.content + if answer is not None: + answer = answer.strip() + else: + answer = "❌ **AI Analysis Error**\n\nNo response content received from the AI model." + except (AttributeError, IndexError, KeyError): + answer = "❌ **AI Analysis Error**\n\nUnexpected response format from the AI model." + if hasattr(self, "ai_stats_label"): + self.ai_stats_label.setText(f"Analyzed {len(self.websites)} websites") + except Exception as e: + answer = f"❌ **AI Analysis Error**\n\nI encountered an error while analyzing your request: `{str(e)}`\n\nPlease try again or check your internet connection." + else: + if ai_context == "No data available. Please scrape some websites first.": + answer = "📊 **No Data Available**\n\nPlease scrape some websites first to enable AI analysis." + else: + answer = f"🤖 **AI Analysis Preview**\n\nI have analyzed {len(self.websites)} websites. Your question: '{user_msg}'\n\n*(This is a placeholder response. 
Install the 'openai' package for real AI analysis.)*" + # Remove the last AI thinking message + if self.chat_messages and self.chat_messages[-1][1].startswith("🤔"): + self.chat_messages.pop() + timestamp = datetime.now().strftime("%H:%M") + self.chat_messages.append(("ai", answer, timestamp)) + self.render_chat_history() + + def open_url_in_browser(self, item, column): + url = item.data(1, Qt.ItemDataRole.DisplayRole) + if url: + webbrowser.open(url) + + def get_icon(self, is_root=False): + + if is_root: + return self.style().standardIcon(QStyle.StandardPixmap.SP_DesktopIcon) + else: + return self.style().standardIcon(QStyle.StandardPixmap.SP_DirIcon) + """Build and display the sitemap tree from crawled data, with icons and tooltips""" + self.sitemap_tree.clear() + if not self.websites: + return + url_to_website = {w.url: w for w in self.websites} + children_map = {w.url: [] for w in self.websites} + for w in self.websites: + for link in w.links: + if link in url_to_website: + children_map[w.url].append(link) + root_url = self.websites[0].url + def add_items(parent_item, url, visited, depth): + if url in visited: + return + visited.add(url) + website = url_to_website[url] + item = QTreeWidgetItem([website.title, website.url]) + item.setIcon(0, self.get_icon(is_root=False)) + tooltip = f"Title: {website.title}For more info, see the README or contact support.
" + ) + QMessageBox.information(self, "Help / Info", help_text) + + def scraping_finished(self, websites): + """Handle scraping completion""" + self.websites = websites + self.scraper.websites = websites + + # Update UI + self.start_button.setEnabled(True) + self.stop_button.setEnabled(False) + self.progress_bar.setVisible(False) + self.status_label.setText(f"Scraping complete! Found {len(websites)} websites.") + self.status_label.setStyleSheet(""" + QLabel { + color: #166534; + font-size: 14px; + padding: 8px; + background: #f0fdf4; + border-radius: 6px; + border-left: 3px solid #22c55e; + } + """) + + # Update data view + self.update_data_table() + self.update_analysis() + self.update_sitemap_tree() + + # Switch to data tab + self.tab_widget.setCurrentIndex(1) + + # Show desktop notification + self.tray_icon.showMessage( + "Web Scraper", + f"Scraping complete! Found {len(websites)} websites.", + QSystemTrayIcon.MessageIcon(1), # 1 = Information + 5000 + ) + + def scraping_error(self, error_message): + """Handle scraping errors""" + QMessageBox.critical(self, "Error", f"Scraping failed: {error_message}") + self.start_button.setEnabled(True) + self.stop_button.setEnabled(False) + self.progress_bar.setVisible(False) + self.status_label.setText("Scraping failed.") + self.status_label.setStyleSheet(""" + QLabel { + color: #991b1b; + font-size: 14px; + padding: 8px; + background: #fef2f2; + border-radius: 6px; + border-left: 3px solid #ef4444; + } + """) + + # Show desktop notification + self.tray_icon.showMessage( + "Web Scraper", + f"Scraping failed: {error_message}", + QSystemTrayIcon.MessageIcon(3), + 5000 + ) + + def update_data_table(self): + """Update the data table with scraped websites""" + self.data_table.setRowCount(len(self.websites)) + for row, website in enumerate(self.websites): + self.data_table.setRowHeight(row, 40) + title_item = QTableWidgetItem(website.title) + title_item.setTextAlignment(Qt.AlignmentFlag.AlignTop | Qt.AlignmentFlag.AlignLeft) + url_item = QTableWidgetItem(website.url) + url_item.setTextAlignment(Qt.AlignmentFlag.AlignTop | Qt.AlignmentFlag.AlignLeft) + depth_item = QTableWidgetItem(str(website.depth)) + depth_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + links_item = QTableWidgetItem(str(len(website.links))) + links_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + words_item = QTableWidgetItem(str(website.get_word_count())) + words_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + load_time = f"{website.load_time:.2f}s" if website.load_time else "N/A" + load_time_item = QTableWidgetItem(load_time) + load_time_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + self.data_table.setItem(row, 0, title_item) + self.data_table.setItem(row, 1, url_item) + self.data_table.setItem(row, 2, depth_item) + self.data_table.setItem(row, 3, links_item) + self.data_table.setItem(row, 4, words_item) + self.data_table.setItem(row, 5, load_time_item) + # Update domain filter + domains = list(set(w.get_normalized_domain() for w in self.websites)) + self.domain_filter.clear() + self.domain_filter.addItem("All Domains") + self.domain_filter.addItems(domains) + # Update content preview with first website + if self.websites: + first_website = self.websites[0] + content_preview = first_website.get_text_preview(800) + self.content_text.setText(content_preview) + + # Also update visual preview for first website + if WEB_ENGINE_AVAILABLE and hasattr(self, 'web_view'): + try: + html_content = first_website.content + if html_content and html_content.strip(): + 
full_html = f""" + + + + + +This page doesn't have HTML content to display in the visual preview.
+ + + """) + except Exception as e: + self.web_view.setHtml(f""" + + +Failed to load the visual preview:
+{str(e)}
+This might be due to:
+This page doesn't have HTML content to display in the visual preview.
+Check the text preview tab for the extracted content.
+ + + """) + except Exception as e: + # Show error message in the web view + error_html = f""" + + +Failed to load the visual preview:
+{str(e)}
+This might be due to:
+