Merge pull request #506 from website-deployer/main

Adding my project to repo
Ed Donner
2025-07-10 08:08:36 -04:00
committed by GitHub
5 changed files with 2476 additions and 0 deletions

View File: README.md

@@ -0,0 +1,159 @@
# Web Scraper & Data Analyzer
A modern Python application with a sleek PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. Features a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for advanced analysis.
## Features
- **Modern UI**: Clean, minimalistic design with dark theme and smooth animations
- **Web Scraping**: Multi-threaded scraping with configurable depth (max 100 levels)
- **Data Visualization**: Interactive table with sorting and filtering capabilities
- **Content Preview**: Dual preview system with both text and visual HTML rendering
- **Data Analysis**: Comprehensive statistics and domain breakdown
- **AI-Powered Analysis**: Chat-based assistant for website insights, SEO suggestions, and content analysis
- **Export Functionality**: JSON export with full metadata
- **URL Normalization**: Handles www/non-www domains intelligently
- **Real-time Progress**: Live progress updates during scraping operations
- **Loop Prevention**: Advanced duplicate detection to prevent infinite loops
- **Smart Safeguards**: Depth bounds and empty-level detection to prevent runaway scraping
## AI Analysis Tab
The application features an advanced **AI Analysis** tab:
- **Conversational Chat UI**: Ask questions about your scraped websites in a modern chat interface (like ChatGPT)
- **Quick Actions**: One-click questions for structure, SEO, content themes, and performance
- **Markdown Responses**: AI replies are formatted for clarity and readability
- **Context Awareness**: AI uses your scraped data for tailored insights
- **Requirements**: Internet connection and the `openai` Python package (see Installation)
- **Fallback**: If `openai` is not installed, a placeholder response is shown
## Loop Prevention & Duplicate Detection
The scraper includes robust protection against infinite loops and circular references:
### 🔄 URL Normalization
- Removes `www.` prefixes for consistent domain handling
- Strips URL fragments (`#section`) to prevent duplicate content
- Removes trailing slashes for consistency
- Leaves query strings intact (see the example below)
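
For reference, this is how `normalize_url` in `module.py` (included later in this PR) treats a couple of URLs:

```python
from module import WebScraper

scraper = WebScraper()

scraper.normalize_url("https://www.example.com/docs/#intro")
# -> "https://example.com/docs"  (www. stripped, fragment and trailing slash removed)

scraper.normalize_url("http://example.com/a?page=2")
# -> "http://example.com/a?page=2"  (query string passed through unchanged)
```
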
### 🚫 Duplicate Detection
- **Visited URL Tracking**: Maintains a set of all visited URLs
- **Unlimited Crawling**: No page limits per domain or total pages
- **Per-Page Duplicate Filtering**: Removes duplicate links within the same page
### 🛡️ Smart Restrictions
- **Depth Bound Only**: Crawls only as deep as the user-specified max_depth
- **Content Type Filtering**: Only scrapes HTML content
- **File Type Filtering**: Skips non-content files (PDFs, images, etc.)
- **Consecutive Empty Level Detection**: Stops if 3 consecutive levels have no new content
### 📊 Enhanced Tracking
- **Domain Page Counts**: Tracks pages scraped per domain (for statistics)
- **URL Check Counts**: Shows total URLs checked vs. pages scraped
- **Detailed Statistics**: Comprehensive reporting on scraping efficiency (see the sample output below)
- **Unlimited Processing**: No artificial limits on crawling scope
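
The tracking data is exposed through `WebScraper.get_statistics()` in `module.py`; the dictionary it returns has the shape below (the values here are invented for illustration):

```python
{
    'total_pages': 42,
    'total_links': 618,
    'total_words': 15730,
    'avg_load_time': 0.84,          # seconds, averaged over pages with a recorded load time
    'max_depth_reached': 3,
    'domains': {'example.com': 40, 'blog.example.com': 2},
    'visited_urls_count': 57,       # URLs checked, including skipped duplicates
    'domain_page_counts': {'example.com': 40, 'blog.example.com': 2},
    'start_domain': 'example.com',
}
```
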
## Installation
1. **Clone or download the project files**
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
- This will install all required packages, including `PyQt5`, `PyQtWebEngine` (for visual preview), and `openai` (for AI features).
3. **Run the application**:
```bash
python web_scraper_app.py
```
## Usage
### 1. Scraping Configuration
- Enter a starting URL (with or without http/https)
- Set maximum crawl depth (1-100)
- Click "Start Scraping" to begin
### 2. Data View & Filtering
- View scraped data in an interactive table
- Filter by search terms or specific domains
- Double-click any row to preview content
- Export data to JSON format (an illustrative export sketch follows below)
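
The JSON export itself lives in `web_scraper_app.py` (not shown in this diff), so the exact field layout may differ; a minimal equivalent built on the `Website` objects from `module.py` would look roughly like this:

```python
import json

def export_to_json(websites, path):
    """Dump scraped Website objects to a JSON file (field choice is illustrative)."""
    records = [
        {
            "title": w.title,
            "url": w.url,
            "depth": w.depth,
            "links": w.links,
            "load_time": w.load_time,
            "word_count": w.get_word_count(),
            "scraped_at": w.timestamp.isoformat(),
        }
        for w in websites
    ]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
```
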
### 3. Analysis & Statistics
- View comprehensive scraping statistics
- See domain breakdown and word counts
- Preview content in both text and visual formats
- Analyze load times and link counts
- Monitor duplicate detection efficiency
### 4. AI Analysis (New!)
- Switch to the **AI Analysis** tab
- Type your question or use quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
- The AI will analyze your scraped data and provide actionable insights
- Requires an internet connection and the `openai` package (a sketch of such a call is shown below)
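
The chat code lives in `web_scraper_app.py`, which is not shown in this diff, so the snippet below is only a sketch of how such a call could be wired up with the `openai` package pointed at OpenRouter; the helper name, model id, and key handling are illustrative assumptions:

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the standard client can be reused.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # hypothetical placeholder; keep real keys in a .env file
)

def ask_about_scrape(question, stats):
    """Send a user question plus a short summary of the scraped data; return the reply text."""
    context = f"Scraped {stats['total_pages']} pages across {len(stats['domains'])} domains."
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",  # any OpenRouter model id would work here
        messages=[
            {"role": "system", "content": "You analyze scraped website data. " + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
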
## Visual Preview Feature
The application includes a visual HTML preview feature that renders scraped web pages in a browser-like view:
- **Requirements**: PyQtWebEngine (automatically installed with requirements.txt)
- **Functionality**: Displays HTML content with proper styling and formatting
- **Fallback**: If PyQtWebEngine is not available, shows a text-only preview (sketched below)
- **Error Handling**: Graceful error messages for invalid HTML content
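
Since the GUI file is not included in this diff, the following is only a minimal sketch of the try/fallback pattern described above, with hypothetical widget names:

```python
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QTextEdit

try:
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    HAS_WEBENGINE = True
except ImportError:
    HAS_WEBENGINE = False

def make_preview(html, base_url):
    """Return a widget that renders the scraped HTML visually if possible, as text otherwise."""
    if HAS_WEBENGINE:
        view = QWebEngineView()
        view.setHtml(html, QUrl(base_url))  # base URL lets relative links and assets resolve
        return view
    preview = QTextEdit()
    preview.setReadOnly(True)
    preview.setPlainText(html)  # text-only fallback when PyQtWebEngine is unavailable
    return preview
```
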
## Technical Details
- **Backend**: Pure Python with urllib and html.parser (no compilation required)
- **Frontend**: PyQt5 with custom modern styling
- **Threading**: Multi-threaded scraping via `ThreadPoolExecutor` for better performance (see the sketch after this list)
- **Data Storage**: Website objects with full metadata
- **URL Handling**: Intelligent normalization and domain filtering
- **Loop Prevention**: Multi-layered duplicate detection system
- **AI Integration**: Uses the OpenAI API (via OpenRouter) for chat-based analysis
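
The scraping backend can also be driven without the GUI; here is a minimal sketch using the `module.py` API from this PR (the URL and depth are placeholders):

```python
from module import WebScraper

def on_page(site):
    # Called as each page finishes scraping
    print(f"[depth {site.depth}] {site.title} - {site.url}")

scraper = WebScraper()
pages = scraper.crawl_website("https://example.com", max_depth=2, progress_callback=on_page)

stats = scraper.get_statistics()
print(f"Scraped {stats['total_pages']} pages, checked {stats['visited_urls_count']} URLs, "
      f"average load time {stats['avg_load_time']:.2f}s")
```
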
## File Structure
```
Testing/
├── web_scraper_app.py # Main application (with AI and GUI)
├── module.py # Core scraping logic
├── test.py # Basic functionality tests
├── requirements.txt # Dependencies
└── README.md # This file
```
## Troubleshooting
### Visual Preview Not Working
1. Ensure PyQtWebEngine is installed: `pip install PyQtWebEngine`
2. Check console output for import errors
### AI Analysis Not Working
1. Ensure the `openai` package is installed: `pip install openai`
2. Check your internet connection (AI requires online access)
3. If not installed, the AI tab will show a placeholder response
### Scraping Issues
1. Verify internet connection
2. Check URL format (add https:// if needed)
3. Try with a lower depth setting
4. Check console for error messages
### Loop Prevention
1. The scraper automatically prevents infinite loops
2. Check the analysis tab for detailed statistics
3. Monitor "Total URLs Checked" vs "Total Pages" for efficiency
4. Use lower depth settings for sites with many internal links
### Performance
- Use lower depth settings for faster scraping
- Filter data to focus on specific domains
- Close other applications to free up resources
- Monitor domain page counts to see which domains dominate the crawl
## License
This project is open source and available under the MIT License.

View File: module.py

@@ -0,0 +1,473 @@
import urllib.request
import urllib.parse
import urllib.error
import html.parser
import re
from datetime import datetime
import time
import ssl
from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from functools import partial
class HTMLParser(html.parser.HTMLParser):
"""Custom HTML parser to extract title, links, and text content"""
def __init__(self):
super().__init__()
self.title = ""
self.links = []
self.text_content = []
self.in_title = False
self.in_body = False
self.current_tag = ""
def handle_starttag(self, tag, attrs):
self.current_tag = tag.lower()
if tag.lower() == 'title':
self.in_title = True
elif tag.lower() == 'body':
self.in_body = True
elif tag.lower() == 'a':
# Extract href attribute
for attr, value in attrs:
if attr.lower() == 'href' and value:
self.links.append(value)
def handle_endtag(self, tag):
if tag.lower() == 'title':
self.in_title = False
elif tag.lower() == 'body':
self.in_body = False
def handle_data(self, data):
if self.in_title:
self.title += data
elif self.in_body and self.current_tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'div', 'span', 'li']:
# Clean the text data
cleaned_data = re.sub(r'\s+', ' ', data.strip())
if cleaned_data:
self.text_content.append(cleaned_data)
def get_text(self):
"""Return all extracted text content as a single string"""
return ' '.join(self.text_content)
def get_clean_text(self, max_length=500):
"""Return cleaned text content with length limit"""
text = self.get_text()
# Remove extra whitespace and limit length
text = re.sub(r'\s+', ' ', text.strip())
if len(text) > max_length:
text = text[:max_length] + "..."
return text
class Website:
"""Class to store website data"""
def __init__(self, title, url, content, depth, links=None, load_time=None):
self.title = title or "No Title"
self.url = url
self.content = content
self.depth = depth
self.links = links or []
self.load_time = load_time
self.timestamp = datetime.now()
def get_word_count(self):
"""Get word count from content"""
if not self.content:
return 0
# Extract text content and count words
text_content = re.sub(r'<[^>]+>', '', self.content)
words = text_content.split()
return len(words)
def get_domain(self):
"""Extract domain from URL"""
try:
parsed = urlparse(self.url)
return parsed.netloc
        except Exception:
            return ""
def get_normalized_domain(self):
"""Get domain without www prefix for consistent filtering"""
domain = self.get_domain()
if domain.startswith('www.'):
return domain[4:]
return domain
def search_content(self, query):
"""Search for query in content"""
if not self.content or not query:
return False
return query.lower() in self.content.lower()
def get_text_preview(self, max_length=200):
"""Get a text preview of the content"""
if not self.content:
return "No content available"
# Extract text content
text_content = re.sub(r'<[^>]+>', '', self.content)
text_content = re.sub(r'\s+', ' ', text_content.strip())
if len(text_content) > max_length:
return text_content[:max_length] + "..."
return text_content
class WebScraper:
"""Web scraper with multithreading support and robust duplicate detection"""
def __init__(self):
self.websites = []
self.visited_urls = set()
self.visited_domains = set() # Track visited domains
self.start_domain = None # Store the starting domain
self.lock = threading.Lock()
self.max_workers = 10 # Number of concurrent threads
# Removed all page limits - unlimited crawling
self.domain_page_counts = {} # Track page count per domain (for statistics only)
self._stop_requested = False # Flag to stop scraping
def normalize_url(self, url):
"""Normalize URL to handle www prefixes and remove fragments"""
if not url:
return url
# Remove fragments (#) to prevent duplicate content
if '#' in url:
url = url.split('#')[0]
# Remove trailing slashes for consistency
url = url.rstrip('/')
# Remove www prefix for consistent domain handling
if url.startswith('https://www.'):
return url.replace('https://www.', 'https://', 1)
elif url.startswith('http://www.'):
return url.replace('http://www.', 'http://', 1)
return url
def get_domain_from_url(self, url):
"""Extract and normalize domain from URL"""
try:
parsed = urlparse(url)
domain = parsed.netloc
if domain.startswith('www.'):
return domain[4:]
return domain
        except Exception:
            return ""
def should_skip_url(self, url, current_depth):
"""Check if URL should be skipped based on various criteria"""
normalized_url = self.normalize_url(url)
# Skip if already visited
if normalized_url in self.visited_urls:
return True, "Already visited"
# Skip if not a valid HTTP/HTTPS URL
if not normalized_url.startswith(('http://', 'https://')):
return True, "Not HTTP/HTTPS URL"
# Get domain
domain = self.get_domain_from_url(normalized_url)
if not domain:
return True, "Invalid domain"
# Removed all domain page limits - unlimited crawling
# Removed external domain depth limits - crawl as deep as needed
return False, "OK"
def scrape_url(self, url, depth):
"""Scrape a single URL with error handling and rate limiting"""
try:
# Check if stop was requested
if self._stop_requested:
return None
# Check if URL should be skipped
should_skip, reason = self.should_skip_url(url, depth)
if should_skip:
print(f"Skipping {url}: {reason}")
return None
# Normalize URL
normalized_url = self.normalize_url(url)
# Mark as visited and update domain count (for statistics only)
with self.lock:
self.visited_urls.add(normalized_url)
domain = self.get_domain_from_url(normalized_url)
if domain:
self.domain_page_counts[domain] = self.domain_page_counts.get(domain, 0) + 1
# Add small delay to prevent overwhelming servers
time.sleep(0.1)
start_time = time.time()
# Create request with headers
req = urllib.request.Request(
normalized_url,
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
                    'Accept-Encoding': 'identity',  # urllib does not transparently decompress gzip, so request plain text
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
}
)
# Fetch the page with timeout
with urllib.request.urlopen(req, timeout=15) as response:
# Check content type
content_type = response.headers.get('content-type', '').lower()
if 'text/html' not in content_type and 'application/xhtml' not in content_type:
print(f"Skipping {url}: Not HTML content ({content_type})")
return None
html_content = response.read().decode('utf-8', errors='ignore')
load_time = time.time() - start_time
# Skip if content is too small (likely error page)
if len(html_content) < 100:
print(f"Skipping {url}: Content too small ({len(html_content)} chars)")
return None
# Parse HTML
parser = HTMLParser()
parser.feed(html_content)
# Extract links and normalize them with duplicate detection
links = []
base_url = normalized_url
seen_links = set() # Track links within this page to avoid duplicates
for link in parser.links:
try:
absolute_url = urljoin(base_url, link)
normalized_link = self.normalize_url(absolute_url)
# Skip if already seen in this page or should be skipped
if normalized_link in seen_links:
continue
seen_links.add(normalized_link)
should_skip, reason = self.should_skip_url(normalized_link, depth + 1)
if should_skip:
continue
# Only include http/https links and filter out common non-content URLs
if (normalized_link.startswith(('http://', 'https://')) and
not any(skip in normalized_link.lower() for skip in [
'mailto:', 'tel:', 'javascript:', 'data:', 'file:',
'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.zip', '.rar',
'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.svg', '.ico',
'.css', '.js', '.xml', '.json', '.txt', '.log'
])):
links.append(normalized_link)
except:
continue
# Create Website object
website = Website(
title=parser.title,
url=normalized_url,
content=html_content,
depth=depth,
links=links,
load_time=load_time
)
return website
except urllib.error.HTTPError as e:
print(f"HTTP Error scraping {url}: {e.code} - {e.reason}")
return None
except urllib.error.URLError as e:
print(f"URL Error scraping {url}: {e.reason}")
return None
except Exception as e:
print(f"Error scraping {url}: {str(e)}")
return None
def crawl_website(self, start_url, max_depth=3, progress_callback=None):
"""Crawl website with multithreading support and no page limits"""
if not start_url.startswith(('http://', 'https://')):
start_url = 'https://' + start_url
# Initialize tracking
self.websites = []
self.visited_urls = set()
self.visited_domains = set()
self.domain_page_counts = {}
self.start_domain = self.get_domain_from_url(start_url)
self._stop_requested = False # Reset stop flag
print(f"Starting crawl from: {start_url}")
print(f"Starting domain: {self.start_domain}")
print(f"Max depth: {max_depth}")
print(f"Unlimited crawling - no page limits")
# Start with the initial URL
urls_to_scrape = [(start_url, 0)]
max_depth_reached = 0
consecutive_empty_levels = 0
max_consecutive_empty = 3 # Stop if 3 consecutive levels have no new URLs
total_pages_scraped = 0
# Removed all page limits - unlimited crawling
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
for current_depth in range(max_depth + 1):
# Check if stop was requested
if self._stop_requested:
print("Scraping stopped by user request")
break
if not urls_to_scrape:
print(f"Stopping at depth {current_depth}: No more URLs to scrape")
break
# Check if we've reached too many consecutive empty levels
if consecutive_empty_levels >= max_consecutive_empty:
print(f"Stopping at depth {current_depth}: {max_consecutive_empty} consecutive empty levels")
break
# Removed absolute page limit check - unlimited pages
print(f"Scraping depth {current_depth} with {len(urls_to_scrape)} URLs")
# Submit all URLs at current depth for concurrent scraping
future_to_url = {
executor.submit(self.scrape_url, url, depth): url
for url, depth in urls_to_scrape
}
# Collect results and prepare next level
urls_to_scrape = []
level_results = 0
for future in as_completed(future_to_url):
# Check if stop was requested
if self._stop_requested:
print("Stopping processing of current level")
break
website = future.result()
if website:
with self.lock:
self.websites.append(website)
level_results += 1
total_pages_scraped += 1
# Emit progress if callback provided
if progress_callback:
progress_callback(website)
# Add links for next depth level (no limits)
if current_depth < max_depth:
for link in website.links:
# Removed URL limit per level - process all URLs
should_skip, reason = self.should_skip_url(link, current_depth + 1)
if not should_skip:
urls_to_scrape.append((link, current_depth + 1))
# Check if stop was requested after processing level
if self._stop_requested:
break
# Update depth tracking
if level_results > 0:
max_depth_reached = current_depth
consecutive_empty_levels = 0
else:
consecutive_empty_levels += 1
# Only stop if we've reached the actual max depth
if current_depth >= max_depth:
print(f"Reached maximum depth: {max_depth}")
break
# Print progress summary
print(f"Depth {current_depth} completed: {level_results} pages, Total: {len(self.websites)}")
if self.domain_page_counts:
print(f"Domain breakdown: {dict(self.domain_page_counts)}")
print(f"Crawling completed. Max depth reached: {max_depth_reached}, Total pages: {len(self.websites)}")
print(f"Visited URLs: {len(self.visited_urls)}")
print(f"Domain breakdown: {dict(self.domain_page_counts)}")
return self.websites
def reset(self):
"""Reset the scraper state for a new crawl"""
self.websites = []
self.visited_urls = set()
self.visited_domains = set()
self.domain_page_counts = {}
self.start_domain = None
self._stop_requested = False # Reset stop flag
def get_statistics(self):
"""Get scraping statistics with enhanced tracking information"""
if not self.websites:
return {
'total_pages': 0,
'total_links': 0,
'total_words': 0,
'avg_load_time': 0,
'max_depth_reached': 0,
'domains': {},
'visited_urls_count': 0,
'domain_page_counts': {},
'start_domain': self.start_domain
}
total_pages = len(self.websites)
total_links = sum(len(w.links) for w in self.websites)
total_words = sum(w.get_word_count() for w in self.websites)
load_times = [w.load_time for w in self.websites if w.load_time]
avg_load_time = sum(load_times) / len(load_times) if load_times else 0
max_depth_reached = max(w.depth for w in self.websites)
# Count domains
domains = {}
for website in self.websites:
domain = website.get_normalized_domain()
domains[domain] = domains.get(domain, 0) + 1
return {
'total_pages': total_pages,
'total_links': total_links,
'total_words': total_words,
'avg_load_time': avg_load_time,
'max_depth_reached': max_depth_reached,
'domains': domains,
'visited_urls_count': len(self.visited_urls),
'domain_page_counts': dict(self.domain_page_counts),
'start_domain': self.start_domain
}
    def filter_by_domain(self, domain):
        """Filter websites by domain (the www. prefix is ignored)"""
        normalized_domain = domain[4:] if domain.startswith('www.') else domain
        return [w for w in self.websites if w.get_normalized_domain() == normalized_domain]
def search_websites(self, query):
"""Search websites by query"""
return [w for w in self.websites if w.search_content(query)]
def stop_scraping(self):
"""Request graceful stop of the scraping process"""
self._stop_requested = True

View File: requirements.txt

@@ -0,0 +1,5 @@
PyQt5>=5.15.0
PyQtWebEngine>=5.15.0
urllib3==2.0.7
openai>=1.0.0
python-dotenv>=1.0.0

View File: test.py

@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Simple test script to verify the web scraping functionality
"""
import module
def test_basic_scraping():
"""Test basic scraping functionality"""
print("Testing basic web scraping...")
# Create a scraper instance
scraper = module.WebScraper()
# Test with a simple website (httpbin.org is a safe test site)
test_url = "https://httpbin.org/html"
print(f"Scraping {test_url} with depth 1...")
try:
# Scrape with depth 1 to keep it fast
websites = scraper.crawl_website(test_url, max_depth=1)
print(f"Successfully scraped {len(websites)} websites")
if websites:
# Show first website details
first_site = websites[0]
print(f"\nFirst website:")
print(f" Title: {first_site.title}")
print(f" URL: {first_site.url}")
print(f" Depth: {first_site.depth}")
print(f" Links found: {len(first_site.links)}")
print(f" Word count: {first_site.get_word_count()}")
# Show statistics
stats = scraper.get_statistics()
print(f"\nStatistics:")
print(f" Total pages: {stats['total_pages']}")
print(f" Total links: {stats['total_links']}")
print(f" Total words: {stats['total_words']}")
print(f" Average load time: {stats['avg_load_time']:.2f}s")
return True
else:
print("No websites were scraped")
return False
except Exception as e:
print(f"Error during scraping: {e}")
return False
def test_website_class():
"""Test the Website class functionality"""
print("\nTesting Website class...")
# Create a test website
website = module.Website(
title="Test Website",
url="https://example.com",
content="<html><body><h1>Test Content</h1><p>This is a test paragraph.</p></body></html>",
depth=0,
links=["https://example.com/page1", "https://example.com/page2"]
)
# Test methods
print(f"Website title: {website.title}")
print(f"Website URL: {website.url}")
print(f"Word count: {website.get_word_count()}")
print(f"Domain: {website.get_domain()}")
print(f"Normalized domain: {website.get_normalized_domain()}")
print(f"Search for 'test': {website.search_content('test')}")
print(f"Search for 'nonexistent': {website.search_content('nonexistent')}")
return True
def test_html_parser():
"""Test the HTML parser functionality"""
print("\nTesting HTML Parser...")
parser = module.HTMLParser()
test_html = """
<html>
<head><title>Test Page</title></head>
<body>
<h1>Welcome</h1>
<p>This is a <a href="https://example.com">link</a> to example.com</p>
<p>Here's another <a href="/relative-link">relative link</a></p>
</body>
</html>
"""
parser.feed(test_html)
print(f"Title extracted: {parser.title}")
print(f"Links found: {parser.links}")
print(f"Text content length: {len(parser.get_text())}")
return True
def test_url_normalization():
"""Test URL normalization to handle www. prefixes"""
print("\nTesting URL Normalization...")
scraper = module.WebScraper()
# Test URLs with and without www.
test_urls = [
"https://www.example.com/page",
"https://example.com/page",
"http://www.test.com/path?param=value#fragment",
"http://test.com/path?param=value#fragment"
]
print("URL Normalization Results:")
for url in test_urls:
normalized = scraper.normalize_url(url)
print(f" Original: {url}")
print(f" Normalized: {normalized}")
print()
# Test domain filtering
print("Domain Filtering Test:")
test_websites = [
module.Website("Site 1", "https://www.example.com", "content", 0),
module.Website("Site 2", "https://example.com", "content", 0),
module.Website("Site 3", "https://www.test.com", "content", 0)
]
scraper.websites = test_websites
# Test filtering by domain with and without www.
domains_to_test = ["example.com", "www.example.com", "test.com", "www.test.com"]
for domain in domains_to_test:
filtered = scraper.filter_by_domain(domain)
print(f" Filter '{domain}': {len(filtered)} results")
for site in filtered:
print(f" - {site.title} ({site.url})")
return True
if __name__ == "__main__":
print("Web Scraper Test Suite")
print("=" * 50)
# Test HTML parser
test_html_parser()
# Test Website class
test_website_class()
# Test URL normalization
test_url_normalization()
# Test basic scraping (uncomment to test actual scraping)
# Note: This requires internet connection
# test_basic_scraping()
print("\nTest completed!")
print("\nTo run the full application:")
print("python web_scraper_app.py")

web_scraper_app.py: file diff suppressed because it is too large