diff --git a/community-contributions/WebScraperApp/README.md b/community-contributions/WebScraperApp/README.md
new file mode 100644
index 0000000..6dfed7b
--- /dev/null
+++ b/community-contributions/WebScraperApp/README.md
@@ -0,0 +1,159 @@
+# Web Scraper & Data Analyzer
+
+A modern Python application with a sleek PyQt5 GUI for web scraping, data analysis, visualization, and AI-powered website insights. It features a clean, minimalistic design with real-time progress tracking, comprehensive data filtering, and an integrated AI chat assistant for advanced analysis.
+
+## Features
+
+- **Modern UI**: Clean, minimalistic design with dark theme and smooth animations
+- **Web Scraping**: Multi-threaded scraping with configurable depth (up to 100 levels)
+- **Data Visualization**: Interactive table with sorting and filtering capabilities
+- **Content Preview**: Dual preview system with both text and visual HTML rendering
+- **Data Analysis**: Comprehensive statistics and domain breakdown
+- **AI-Powered Analysis**: Chat-based assistant for website insights, SEO suggestions, and content analysis
+- **Export Functionality**: JSON export with full metadata
+- **URL Normalization**: Handles www/non-www domains intelligently
+- **Real-time Progress**: Live progress updates during scraping operations
+- **Loop Prevention**: Advanced duplicate detection to prevent infinite loops
+- **Safety Stops**: Depth bounds and empty-level detection prevent runaway scraping
+
+## AI Analysis Tab
+
+The application features an advanced **AI Analysis** tab:
+
+- **Conversational Chat UI**: Ask questions about your scraped websites in a modern chat interface (like ChatGPT)
+- **Quick Actions**: One-click questions for structure, SEO, content themes, and performance
+- **Markdown Responses**: AI replies are formatted for clarity and readability
+- **Context Awareness**: The AI uses your scraped data for tailored insights
+- **Requirements**: Internet connection and the `openai` Python package (see Installation)
+- **Fallback**: If `openai` is not installed, a placeholder response is shown
+
+## Loop Prevention & Duplicate Detection
+
+The scraper includes robust protection against infinite loops and circular references:
+
+### πŸ”„ URL Normalization
+- Removes `www.` prefixes for consistent domain handling
+- Strips URL fragments (`#section`) to prevent duplicate content
+- Removes trailing slashes for consistency
+- Resolves relative links against the page URL
+
+### 🚫 Duplicate Detection
+- **Visited URL Tracking**: Maintains a set of all visited URLs
+- **No Page Caps**: No per-domain or total page limits; only depth bounds the crawl
+- **Per-Page Duplicate Filtering**: Removes duplicate links within the same page
+
+### πŸ›‘οΈ Smart Restrictions
+- **Depth-Bounded Crawling**: Crawls only as deep as the specified max depth
+- **Content Type Filtering**: Only scrapes HTML content
+- **File Type Filtering**: Skips non-content files (PDFs, images, etc.)
+- **Consecutive Empty Level Detection**: Stops if 3 consecutive levels yield no new content
+
+### πŸ“Š Enhanced Tracking
+- **Domain Page Counts**: Tracks pages scraped per domain (for statistics)
+- **URL Check Counts**: Shows total URLs checked vs. pages scraped
+- **Detailed Statistics**: Comprehensive reporting on scraping efficiency
+
+## Installation
+
+1. **Clone or download the project files**
+
+2. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+   This installs all required packages, including `PyQt5`, `PyQtWebEngine` (for visual preview), and `openai` (for AI features).
+
+3. **Run the application**:
+   ```bash
+   python web_scraper_app.py
+   ```
+
+## Usage
+
+### 1. Scraping Configuration
+- Enter a starting URL (with or without http/https)
+- Set the maximum crawl depth (1-100)
+- Click "Start Scraping" to begin
+
+### 2. Data View & Filtering
+- View scraped data in an interactive table
+- Filter by search terms or specific domains
+- Double-click any row to preview content
+- Export data to JSON format
+
+### 3. Analysis & Statistics
+- View comprehensive scraping statistics
+- See domain breakdown and word counts
+- Preview content in both text and visual formats
+- Analyze load times and link counts
+- Monitor duplicate detection efficiency
+
+### 4. AI Analysis (New!)
+- Switch to the **AI Analysis** tab
+- Type your question or use the quick action buttons (e.g., "Analyze the website structure", "Suggest SEO improvements")
+- The AI analyzes your scraped data and provides actionable insights
+- Requires an internet connection and the `openai` package
+
+## Visual Preview Feature
+
+The application includes a visual HTML preview feature that renders scraped web pages in a browser-like view:
+
+- **Requirements**: PyQtWebEngine (installed via requirements.txt)
+- **Functionality**: Displays HTML content with proper styling and formatting
+- **Fallback**: If PyQtWebEngine is not available, shows a text-only preview
+- **Error Handling**: Graceful error messages for invalid HTML content
+
+## Technical Details
+
+- **Backend**: Pure Python with urllib and html.parser (no compilation required)
+- **Frontend**: PyQt5 with custom modern styling
+- **Threading**: Multi-threaded scraping for better performance
+- **Data Storage**: Website objects with full metadata
+- **URL Handling**: Intelligent normalization and domain filtering
+- **Loop Prevention**: Multi-layered duplicate detection system
+- **AI Integration**: Uses the OpenAI API (via OpenRouter) for chat-based analysis
+
+## File Structure
+
+```
+WebScraperApp/
+β”œβ”€β”€ web_scraper_app.py   # Main application (GUI and AI)
+β”œβ”€β”€ module.py            # Core scraping logic
+β”œβ”€β”€ test.py              # Basic functionality tests
+β”œβ”€β”€ requirements.txt     # Dependencies
+└── README.md            # This file
+```
+
+## Troubleshooting
+
+### Visual Preview Not Working
+1. Ensure PyQtWebEngine is installed: `pip install PyQtWebEngine`
+2. Check console output for import errors
+
+### AI Analysis Not Working
+1. Ensure the `openai` package is installed: `pip install openai`
+2. Check your internet connection (AI requires online access)
+3. If the package is missing, the AI tab shows a placeholder response
+
+### Scraping Issues
+1. Verify your internet connection
+2. Check the URL format (add https:// if needed)
+3. Try a lower depth setting
+4. Check the console for error messages
+
+### Loop Prevention
+1. The scraper automatically prevents infinite loops
+2. Check the analysis tab for detailed statistics
+3. Compare "Total URLs Checked" vs. "Total Pages" to gauge efficiency
+4. Use lower depth settings for sites with many internal links
+
+### Performance
+- Use lower depth settings for faster scraping
+- Filter data to focus on specific domains
+- Close other applications to free up resources
+- Monitor domain page counts to see where the crawl spends its time
+
+## License
+
+This project is open source and available under the MIT License.
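The URL normalization rules described above (fragment stripping, trailing-slash removal, `www.` handling) can be sketched as a small standalone function. This is an illustrative sketch, not the app's exact implementation; `module.py` ships its own `normalize_url` method on `WebScraper`:

```python
def normalize_url(url: str) -> str:
    """Sketch of the README's normalization rules: drop the #fragment,
    drop trailing slashes, and strip a leading 'www.' after the scheme."""
    if not url:
        return url
    url = url.split('#')[0]   # fragments point at the same document
    url = url.rstrip('/')     # trailing-slash consistency
    for scheme in ('https://', 'http://'):
        if url.startswith(scheme + 'www.'):
            # keep the scheme, skip the 4 chars of 'www.'
            return scheme + url[len(scheme) + 4:]
    return url

print(normalize_url("https://www.example.com/page/#top"))  # https://example.com/page
```

Because both `https://www.example.com/page/` and `https://example.com/page#top` normalize to the same string, the visited-URL set treats them as one page, which is what prevents the duplicate fetches.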
\ No newline at end of file diff --git a/community-contributions/WebScraperApp/module.py b/community-contributions/WebScraperApp/module.py new file mode 100644 index 0000000..20dff0f --- /dev/null +++ b/community-contributions/WebScraperApp/module.py @@ -0,0 +1,473 @@ +import urllib.request +import urllib.parse +import urllib.error +import html.parser +import re +from datetime import datetime +import time +import ssl +from urllib.parse import urljoin, urlparse +from concurrent.futures import ThreadPoolExecutor, as_completed +import threading +from functools import partial + +class HTMLParser(html.parser.HTMLParser): + """Custom HTML parser to extract title, links, and text content""" + + def __init__(self): + super().__init__() + self.title = "" + self.links = [] + self.text_content = [] + self.in_title = False + self.in_body = False + self.current_tag = "" + + def handle_starttag(self, tag, attrs): + self.current_tag = tag.lower() + + if tag.lower() == 'title': + self.in_title = True + elif tag.lower() == 'body': + self.in_body = True + elif tag.lower() == 'a': + # Extract href attribute + for attr, value in attrs: + if attr.lower() == 'href' and value: + self.links.append(value) + + def handle_endtag(self, tag): + if tag.lower() == 'title': + self.in_title = False + elif tag.lower() == 'body': + self.in_body = False + + def handle_data(self, data): + if self.in_title: + self.title += data + elif self.in_body and self.current_tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'div', 'span', 'li']: + # Clean the text data + cleaned_data = re.sub(r'\s+', ' ', data.strip()) + if cleaned_data: + self.text_content.append(cleaned_data) + + def get_text(self): + """Return all extracted text content as a single string""" + return ' '.join(self.text_content) + + def get_clean_text(self, max_length=500): + """Return cleaned text content with length limit""" + text = self.get_text() + # Remove extra whitespace and limit length + text = re.sub(r'\s+', ' ', text.strip()) + if 
len(text) > max_length: + text = text[:max_length] + "..." + return text + +class Website: + """Class to store website data""" + + def __init__(self, title, url, content, depth, links=None, load_time=None): + self.title = title or "No Title" + self.url = url + self.content = content + self.depth = depth + self.links = links or [] + self.load_time = load_time + self.timestamp = datetime.now() + + def get_word_count(self): + """Get word count from content""" + if not self.content: + return 0 + # Extract text content and count words + text_content = re.sub(r'<[^>]+>', '', self.content) + words = text_content.split() + return len(words) + + def get_domain(self): + """Extract domain from URL""" + try: + parsed = urlparse(self.url) + return parsed.netloc + except: + return "" + + def get_normalized_domain(self): + """Get domain without www prefix for consistent filtering""" + domain = self.get_domain() + if domain.startswith('www.'): + return domain[4:] + return domain + + def search_content(self, query): + """Search for query in content""" + if not self.content or not query: + return False + return query.lower() in self.content.lower() + + def get_text_preview(self, max_length=200): + """Get a text preview of the content""" + if not self.content: + return "No content available" + + # Extract text content + text_content = re.sub(r'<[^>]+>', '', self.content) + text_content = re.sub(r'\s+', ' ', text_content.strip()) + + if len(text_content) > max_length: + return text_content[:max_length] + "..." 
+ return text_content + +class WebScraper: + """Web scraper with multithreading support and robust duplicate detection""" + + def __init__(self): + self.websites = [] + self.visited_urls = set() + self.visited_domains = set() # Track visited domains + self.start_domain = None # Store the starting domain + self.lock = threading.Lock() + self.max_workers = 10 # Number of concurrent threads + # Removed all page limits - unlimited crawling + self.domain_page_counts = {} # Track page count per domain (for statistics only) + self._stop_requested = False # Flag to stop scraping + + def normalize_url(self, url): + """Normalize URL to handle www prefixes and remove fragments""" + if not url: + return url + + # Remove fragments (#) to prevent duplicate content + if '#' in url: + url = url.split('#')[0] + + # Remove trailing slashes for consistency + url = url.rstrip('/') + + # Remove www prefix for consistent domain handling + if url.startswith('https://www.'): + return url.replace('https://www.', 'https://', 1) + elif url.startswith('http://www.'): + return url.replace('http://www.', 'http://', 1) + return url + + def get_domain_from_url(self, url): + """Extract and normalize domain from URL""" + try: + parsed = urlparse(url) + domain = parsed.netloc + if domain.startswith('www.'): + return domain[4:] + return domain + except: + return "" + + def should_skip_url(self, url, current_depth): + """Check if URL should be skipped based on various criteria""" + normalized_url = self.normalize_url(url) + + # Skip if already visited + if normalized_url in self.visited_urls: + return True, "Already visited" + + # Skip if not a valid HTTP/HTTPS URL + if not normalized_url.startswith(('http://', 'https://')): + return True, "Not HTTP/HTTPS URL" + + # Get domain + domain = self.get_domain_from_url(normalized_url) + if not domain: + return True, "Invalid domain" + + # Removed all domain page limits - unlimited crawling + # Removed external domain depth limits - crawl as deep as needed + 
+ return False, "OK" + + def scrape_url(self, url, depth): + """Scrape a single URL with error handling and rate limiting""" + try: + # Check if stop was requested + if self._stop_requested: + return None + + # Check if URL should be skipped + should_skip, reason = self.should_skip_url(url, depth) + if should_skip: + print(f"Skipping {url}: {reason}") + return None + + # Normalize URL + normalized_url = self.normalize_url(url) + + # Mark as visited and update domain count (for statistics only) + with self.lock: + self.visited_urls.add(normalized_url) + domain = self.get_domain_from_url(normalized_url) + if domain: + self.domain_page_counts[domain] = self.domain_page_counts.get(domain, 0) + 1 + + # Add small delay to prevent overwhelming servers + time.sleep(0.1) + + start_time = time.time() + + # Create request with headers + req = urllib.request.Request( + normalized_url, + headers={ + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', + 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', + 'Accept-Language': 'en-US,en;q=0.5', + 'Accept-Encoding': 'gzip, deflate', + 'Connection': 'keep-alive', + 'Upgrade-Insecure-Requests': '1', + } + ) + + # Fetch the page with timeout + with urllib.request.urlopen(req, timeout=15) as response: + # Check content type + content_type = response.headers.get('content-type', '').lower() + if 'text/html' not in content_type and 'application/xhtml' not in content_type: + print(f"Skipping {url}: Not HTML content ({content_type})") + return None + + html_content = response.read().decode('utf-8', errors='ignore') + + load_time = time.time() - start_time + + # Skip if content is too small (likely error page) + if len(html_content) < 100: + print(f"Skipping {url}: Content too small ({len(html_content)} chars)") + return None + + # Parse HTML + parser = HTMLParser() + parser.feed(html_content) + + # Extract links and normalize 
them with duplicate detection + links = [] + base_url = normalized_url + seen_links = set() # Track links within this page to avoid duplicates + + for link in parser.links: + try: + absolute_url = urljoin(base_url, link) + normalized_link = self.normalize_url(absolute_url) + + # Skip if already seen in this page or should be skipped + if normalized_link in seen_links: + continue + seen_links.add(normalized_link) + + should_skip, reason = self.should_skip_url(normalized_link, depth + 1) + if should_skip: + continue + + # Only include http/https links and filter out common non-content URLs + if (normalized_link.startswith(('http://', 'https://')) and + not any(skip in normalized_link.lower() for skip in [ + 'mailto:', 'tel:', 'javascript:', 'data:', 'file:', + '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.zip', '.rar', + '.jpg', '.jpeg', '.png', '.gif', '.bmp', '.svg', '.ico', + '.css', '.js', '.xml', '.json', '.txt', '.log' + ])): + links.append(normalized_link) + except: + continue + + # Create Website object + website = Website( + title=parser.title, + url=normalized_url, + content=html_content, + depth=depth, + links=links, + load_time=load_time + ) + + return website + + except urllib.error.HTTPError as e: + print(f"HTTP Error scraping {url}: {e.code} - {e.reason}") + return None + except urllib.error.URLError as e: + print(f"URL Error scraping {url}: {e.reason}") + return None + except Exception as e: + print(f"Error scraping {url}: {str(e)}") + return None + + def crawl_website(self, start_url, max_depth=3, progress_callback=None): + """Crawl website with multithreading support and no page limits""" + if not start_url.startswith(('http://', 'https://')): + start_url = 'https://' + start_url + + # Initialize tracking + self.websites = [] + self.visited_urls = set() + self.visited_domains = set() + self.domain_page_counts = {} + self.start_domain = self.get_domain_from_url(start_url) + self._stop_requested = False # Reset stop flag + + print(f"Starting crawl from: 
{start_url}") + print(f"Starting domain: {self.start_domain}") + print(f"Max depth: {max_depth}") + print(f"Unlimited crawling - no page limits") + + # Start with the initial URL + urls_to_scrape = [(start_url, 0)] + max_depth_reached = 0 + consecutive_empty_levels = 0 + max_consecutive_empty = 3 # Stop if 3 consecutive levels have no new URLs + total_pages_scraped = 0 + # Removed all page limits - unlimited crawling + + with ThreadPoolExecutor(max_workers=self.max_workers) as executor: + for current_depth in range(max_depth + 1): + # Check if stop was requested + if self._stop_requested: + print("Scraping stopped by user request") + break + + if not urls_to_scrape: + print(f"Stopping at depth {current_depth}: No more URLs to scrape") + break + + # Check if we've reached too many consecutive empty levels + if consecutive_empty_levels >= max_consecutive_empty: + print(f"Stopping at depth {current_depth}: {max_consecutive_empty} consecutive empty levels") + break + + # Removed absolute page limit check - unlimited pages + + print(f"Scraping depth {current_depth} with {len(urls_to_scrape)} URLs") + + # Submit all URLs at current depth for concurrent scraping + future_to_url = { + executor.submit(self.scrape_url, url, depth): url + for url, depth in urls_to_scrape + } + + # Collect results and prepare next level + urls_to_scrape = [] + level_results = 0 + + for future in as_completed(future_to_url): + # Check if stop was requested + if self._stop_requested: + print("Stopping processing of current level") + break + + website = future.result() + if website: + with self.lock: + self.websites.append(website) + level_results += 1 + total_pages_scraped += 1 + + # Emit progress if callback provided + if progress_callback: + progress_callback(website) + + # Add links for next depth level (no limits) + if current_depth < max_depth: + for link in website.links: + # Removed URL limit per level - process all URLs + + should_skip, reason = self.should_skip_url(link, current_depth + 
1) + if not should_skip: + urls_to_scrape.append((link, current_depth + 1)) + + # Check if stop was requested after processing level + if self._stop_requested: + break + + # Update depth tracking + if level_results > 0: + max_depth_reached = current_depth + consecutive_empty_levels = 0 + else: + consecutive_empty_levels += 1 + + # Only stop if we've reached the actual max depth + if current_depth >= max_depth: + print(f"Reached maximum depth: {max_depth}") + break + + # Print progress summary + print(f"Depth {current_depth} completed: {level_results} pages, Total: {len(self.websites)}") + if self.domain_page_counts: + print(f"Domain breakdown: {dict(self.domain_page_counts)}") + + print(f"Crawling completed. Max depth reached: {max_depth_reached}, Total pages: {len(self.websites)}") + print(f"Visited URLs: {len(self.visited_urls)}") + print(f"Domain breakdown: {dict(self.domain_page_counts)}") + return self.websites + + def reset(self): + """Reset the scraper state for a new crawl""" + self.websites = [] + self.visited_urls = set() + self.visited_domains = set() + self.domain_page_counts = {} + self.start_domain = None + self._stop_requested = False # Reset stop flag + + def get_statistics(self): + """Get scraping statistics with enhanced tracking information""" + if not self.websites: + return { + 'total_pages': 0, + 'total_links': 0, + 'total_words': 0, + 'avg_load_time': 0, + 'max_depth_reached': 0, + 'domains': {}, + 'visited_urls_count': 0, + 'domain_page_counts': {}, + 'start_domain': self.start_domain + } + + total_pages = len(self.websites) + total_links = sum(len(w.links) for w in self.websites) + total_words = sum(w.get_word_count() for w in self.websites) + + load_times = [w.load_time for w in self.websites if w.load_time] + avg_load_time = sum(load_times) / len(load_times) if load_times else 0 + + max_depth_reached = max(w.depth for w in self.websites) + + # Count domains + domains = {} + for website in self.websites: + domain = 
website.get_normalized_domain()
+            domains[domain] = domains.get(domain, 0) + 1
+
+        return {
+            'total_pages': total_pages,
+            'total_links': total_links,
+            'total_words': total_words,
+            'avg_load_time': avg_load_time,
+            'max_depth_reached': max_depth_reached,
+            'domains': domains,
+            'visited_urls_count': len(self.visited_urls),
+            'domain_page_counts': dict(self.domain_page_counts),
+            'start_domain': self.start_domain
+        }
+
+    def filter_by_domain(self, domain):
+        """Filter websites by domain (www-insensitive)"""
+        # Strip a leading 'www.' directly; normalize_url() only handles full URLs with a scheme
+        normalized_domain = domain[4:] if domain.startswith('www.') else domain
+        return [w for w in self.websites if w.get_normalized_domain() == normalized_domain]
+
+    def search_websites(self, query):
+        """Search websites by query"""
+        return [w for w in self.websites if w.search_content(query)]
+
+    def stop_scraping(self):
+        """Request graceful stop of the scraping process"""
+        self._stop_requested = True
\ No newline at end of file
diff --git a/community-contributions/WebScraperApp/requirements.txt b/community-contributions/WebScraperApp/requirements.txt
new file mode 100644
index 0000000..a9f1b2a
--- /dev/null
+++ b/community-contributions/WebScraperApp/requirements.txt
@@ -0,0 +1,5 @@
+PyQt5>=5.15.0
+PyQtWebEngine>=5.15.0
+markdown>=3.4
+openai>=1.0.0
+python-dotenv>=1.0.0
\ No newline at end of file
diff --git a/community-contributions/WebScraperApp/test.py b/community-contributions/WebScraperApp/test.py
new file mode 100644
index 0000000..e86a29c
--- /dev/null
+++ b/community-contributions/WebScraperApp/test.py
@@ -0,0 +1,161 @@
+#!/usr/bin/env python3
+"""
+Simple test script to verify the web scraping functionality
+"""
+
+import module
+
+def test_basic_scraping():
+    """Test basic scraping functionality"""
+    print("Testing basic web scraping...")
+
+    # Create a scraper instance
+    scraper = module.WebScraper()
+
+    # Test with a simple website (httpbin.org is a safe test site)
+    test_url = "https://httpbin.org/html"
+
+    print(f"Scraping {test_url} with depth 1...")
+
+    try:
+        # Scrape with depth 1 
to keep it fast + websites = scraper.crawl_website(test_url, max_depth=1) + + print(f"Successfully scraped {len(websites)} websites") + + if websites: + # Show first website details + first_site = websites[0] + print(f"\nFirst website:") + print(f" Title: {first_site.title}") + print(f" URL: {first_site.url}") + print(f" Depth: {first_site.depth}") + print(f" Links found: {len(first_site.links)}") + print(f" Word count: {first_site.get_word_count()}") + + # Show statistics + stats = scraper.get_statistics() + print(f"\nStatistics:") + print(f" Total pages: {stats['total_pages']}") + print(f" Total links: {stats['total_links']}") + print(f" Total words: {stats['total_words']}") + print(f" Average load time: {stats['avg_load_time']:.2f}s") + + return True + else: + print("No websites were scraped") + return False + + except Exception as e: + print(f"Error during scraping: {e}") + return False + +def test_website_class(): + """Test the Website class functionality""" + print("\nTesting Website class...") + + # Create a test website + website = module.Website( + title="Test Website", + url="https://example.com", + content="

<html><body><h1>Test Content</h1><p>This is a test paragraph.</p></body></html>

", + depth=0, + links=["https://example.com/page1", "https://example.com/page2"] + ) + + # Test methods + print(f"Website title: {website.title}") + print(f"Website URL: {website.url}") + print(f"Word count: {website.get_word_count()}") + print(f"Domain: {website.get_domain()}") + print(f"Normalized domain: {website.get_normalized_domain()}") + print(f"Search for 'test': {website.search_content('test')}") + print(f"Search for 'nonexistent': {website.search_content('nonexistent')}") + + return True + +def test_html_parser(): + """Test the HTML parser functionality""" + print("\nTesting HTML Parser...") + + parser = module.HTMLParser() + test_html = """ + + Test Page + +

+    <h1>Welcome</h1>
+    <p><a href="https://example.com">This is a link to example.com</a></p>
+    <p><a href="/relative-page">Here's another relative link</a></p>

+ + + """ + + parser.feed(test_html) + print(f"Title extracted: {parser.title}") + print(f"Links found: {parser.links}") + print(f"Text content length: {len(parser.get_text())}") + + return True + +def test_url_normalization(): + """Test URL normalization to handle www. prefixes""" + print("\nTesting URL Normalization...") + + scraper = module.WebScraper() + + # Test URLs with and without www. + test_urls = [ + "https://www.example.com/page", + "https://example.com/page", + "http://www.test.com/path?param=value#fragment", + "http://test.com/path?param=value#fragment" + ] + + print("URL Normalization Results:") + for url in test_urls: + normalized = scraper.normalize_url(url) + print(f" Original: {url}") + print(f" Normalized: {normalized}") + print() + + # Test domain filtering + print("Domain Filtering Test:") + test_websites = [ + module.Website("Site 1", "https://www.example.com", "content", 0), + module.Website("Site 2", "https://example.com", "content", 0), + module.Website("Site 3", "https://www.test.com", "content", 0) + ] + + scraper.websites = test_websites + + # Test filtering by domain with and without www. 
+ domains_to_test = ["example.com", "www.example.com", "test.com", "www.test.com"] + + for domain in domains_to_test: + filtered = scraper.filter_by_domain(domain) + print(f" Filter '{domain}': {len(filtered)} results") + for site in filtered: + print(f" - {site.title} ({site.url})") + + return True + +if __name__ == "__main__": + print("Web Scraper Test Suite") + print("=" * 50) + + # Test HTML parser + test_html_parser() + + # Test Website class + test_website_class() + + # Test URL normalization + test_url_normalization() + + # Test basic scraping (uncomment to test actual scraping) + # Note: This requires internet connection + # test_basic_scraping() + + print("\nTest completed!") + print("\nTo run the full application:") + print("python web_scraper_app.py") \ No newline at end of file diff --git a/community-contributions/WebScraperApp/web_scraper_app.py b/community-contributions/WebScraperApp/web_scraper_app.py new file mode 100644 index 0000000..ccd5ce2 --- /dev/null +++ b/community-contributions/WebScraperApp/web_scraper_app.py @@ -0,0 +1,1678 @@ +import sys +import json +from urllib.parse import urlparse +from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QVBoxLayout, + QHBoxLayout, QLabel, QLineEdit, QSpinBox, QPushButton, + QTextEdit, QTableWidget, QTableWidgetItem, QTabWidget, + QProgressBar, QComboBox, QMessageBox, QSplitter, + QGroupBox, QGridLayout, QHeaderView, QFrame, QScrollArea, + QSystemTrayIcon, QStyle, QAction, QMenu, QTreeWidget, QTreeWidgetItem, + QListWidget, QListWidgetItem, QSizePolicy, QAbstractItemView) +from PyQt5.QtCore import QThread, pyqtSignal, Qt, QTimer, QUrl +from PyQt5.QtGui import QFont, QIcon, QPalette, QColor, QPixmap +try: + from PyQt5.QtWebEngineWidgets import QWebEngineView + WEB_ENGINE_AVAILABLE = True + print("PyQtWebEngine successfully imported - Visual preview enabled") +except ImportError as e: + WEB_ENGINE_AVAILABLE = False + print(f"PyQtWebEngine not available: {e}") + print("Visual preview will be 
disabled. Install with: pip install PyQtWebEngine") +import module +import re +import webbrowser +import os +try: + from openai import OpenAI + OPENAI_AVAILABLE = True +except ImportError: + OPENAI_AVAILABLE = False +from datetime import datetime +from dotenv import load_dotenv +import markdown + +# Load environment variables from .env file +load_dotenv() + +class ScrapingThread(QThread): + """Thread for running web scraping operations""" + progress_updated = pyqtSignal(str) + scraping_complete = pyqtSignal(list) + error_occurred = pyqtSignal(str) + + def __init__(self, url, max_depth): + super().__init__() + self.url = url + self.max_depth = max_depth + self.scraper = module.WebScraper() + self._stop_requested = False + + def stop(self): + """Request graceful stop of the scraping process""" + self._stop_requested = True + if hasattr(self.scraper, 'stop_scraping'): + self.scraper.stop_scraping() + + def run(self): + try: + self.progress_updated.emit("Starting web scraping...") + + # Reset scraper state for new crawl + self.scraper.reset() + + def progress_callback(website): + if self._stop_requested: + return # Stop processing if requested + if website: + self.progress_updated.emit(f"Scraped: {website.title} (depth {website.depth})") + + # Start scraping with progress callback + websites = self.scraper.crawl_website(self.url, self.max_depth, progress_callback) + + # Check if stop was requested + if self._stop_requested: + self.progress_updated.emit("Scraping stopped by user.") + return + + # Emit final progress + self.progress_updated.emit(f"Scraping complete! 
Found {len(websites)} websites.") + self.scraping_complete.emit(websites) + + except Exception as e: + if not self._stop_requested: # Only emit error if not stopped by user + self.error_occurred.emit(str(e)) + +class ModernButton(QPushButton): + """Custom modern button with hover effects""" + def __init__(self, text, primary=False): + super().__init__(text) + self.primary = primary + self.setMinimumHeight(40) + self.setFont(QFont("Segoe UI", 10, QFont.Weight.Medium)) + self.setCursor(Qt.CursorShape.PointingHandCursor) + self.update_style() + + def update_style(self): + if self.primary: + self.setStyleSheet(""" + QPushButton { + background: #3b82f6; + border: none; + color: white; + padding: 12px 24px; + border-radius: 6px; + font-weight: 600; + } + QPushButton:hover { + background: #2563eb; + } + QPushButton:pressed { + background: #1d4ed8; + } + QPushButton:disabled { + background: #9ca3af; + color: #f3f4f6; + } + """) + else: + self.setStyleSheet(""" + QPushButton { + background: white; + border: 1px solid #d1d5db; + color: #374151; + padding: 10px 20px; + border-radius: 6px; + font-weight: 500; + } + QPushButton:hover { + border-color: #3b82f6; + color: #3b82f6; + background: #f8fafc; + } + QPushButton:pressed { + background: #f1f5f9; + } + QPushButton:disabled { + background: #f9fafb; + border-color: #e5e7eb; + color: #9ca3af; + } + """) + +class ModernLineEdit(QLineEdit): + """Custom modern input field""" + def __init__(self, placeholder=""): + super().__init__() + self.setPlaceholderText(placeholder) + self.setMinimumHeight(40) + self.setFont(QFont("Segoe UI", 10)) + self.setStyleSheet(""" + QLineEdit { + border: 1px solid #d1d5db; + border-radius: 6px; + padding: 8px 12px; + background: white; + color: #374151; + font-size: 14px; + } + QLineEdit:focus { + border-color: #3b82f6; + outline: none; + } + QLineEdit::placeholder { + color: #9ca3af; + } + """) + +class ModernSpinBox(QSpinBox): + """Custom modern spin box""" + def __init__(self): + 
super().__init__() + self.setMinimumHeight(40) + self.setFont(QFont("Segoe UI", 10)) + self.setStyleSheet(""" + QSpinBox { + border: 1px solid #d1d5db; + border-radius: 6px; + padding: 8px 12px; + background: white; + color: #374151; + font-size: 14px; + } + QSpinBox:focus { + border-color: #3b82f6; + } + QSpinBox::up-button, QSpinBox::down-button { + border: none; + background: #f9fafb; + border-radius: 3px; + margin: 2px; + } + QSpinBox::up-button:hover, QSpinBox::down-button:hover { + background: #f3f4f6; + } + """) + +class ChatBubbleWidget(QWidget): + def __init__(self, message, timestamp, role): + super().__init__() + layout = QVBoxLayout(self) + layout.setContentsMargins(0, 0, 0, 0) + layout.setSpacing(2) + # Bubble + if role == "ai": + html = markdown.markdown(message) + bubble = QLabel(html) + bubble.setTextFormat(Qt.TextFormat.RichText) + else: + bubble = QLabel(message) + bubble.setTextFormat(Qt.TextFormat.PlainText) + bubble.setWordWrap(True) + bubble.setTextInteractionFlags(Qt.TextInteractionFlag.TextSelectableByMouse) + bubble.setFont(QFont("Segoe UI", 11)) + bubble.setSizePolicy(QSizePolicy.Preferred, QSizePolicy.Maximum) + bubble.setMinimumWidth(800) + bubble.setMaximumWidth(1200) + bubble.adjustSize() + # Timestamp + ts = QLabel(("πŸ€– " if role == "ai" else "") + timestamp) + ts.setFont(QFont("Segoe UI", 8)) + ts.setStyleSheet("color: #9ca3af;") + if role == "user": + bubble.setStyleSheet("background: #2563eb; color: white; border-radius: 16px; padding: 10px 16px; margin-left: 40px;") + layout.setAlignment(Qt.AlignmentFlag.AlignRight) + ts.setAlignment(Qt.AlignmentFlag.AlignRight) + else: + bubble.setStyleSheet("background: #f3f4f6; color: #1e293b; border-radius: 16px; padding: 10px 16px; margin-right: 40px;") + layout.setAlignment(Qt.AlignmentFlag.AlignLeft) + ts.setAlignment(Qt.AlignmentFlag.AlignLeft) + layout.addWidget(bubble) + layout.addWidget(ts) + +class WebScraperApp(QMainWindow): + def __init__(self): + super().__init__() + self.websites 
= [] + self.scraper = module.WebScraper() + self.init_ui() + + def init_ui(self): + self.setWindowTitle("Web Scraper & Data Analyzer") + self.setGeometry(100, 100, 1400, 900) + self.setMinimumSize(1200, 800) # Set minimum size to prevent geometry issues + + # Set clean, minimal styling + self.setStyleSheet(""" + QMainWindow { + background: #1e293b; + } + QTabWidget::pane { + border: none; + background: white; + border-radius: 8px; + margin: 8px 8px 8px 8px; + padding-top: 8px; + } + QTabBar::tab { + background: #475569; + color: #e2e8f0; + padding: 12px 20px; + margin-right: 4px; + border-top-left-radius: 8px; + border-top-right-radius: 8px; + font-weight: 600; + font-size: 14px; + min-width: 120px; + margin-bottom: 8px; + } + QTabBar::tab:selected { + background: white; + color: #1e293b; + border-bottom: none; + margin-bottom: 8px; + } + QTabBar::tab:hover:!selected { + background: #64748b; + color: #f1f5f9; + } + QTabBar::tab:first { + margin-left: 8px; + } + QTabBar::tab:last { + margin-right: 8px; + } + QGroupBox { + font-weight: 600; + font-size: 14px; + border: 2px solid #e2e8f0; + border-radius: 8px; + margin-top: 16px; + padding-top: 16px; + background: #f8fafc; + } + QGroupBox::title { + subcontrol-origin: margin; + left: 16px; + + color: #1e293b; + background: #f8fafc; + } + QTableWidget { + border: 2px solid #e2e8f0; + border-radius: 8px; + background: white; + gridline-color: #f1f5f9; + alternate-background-color: #f8fafc; + selection-background-color: #dbeafe; + selection-color: #1e293b; + } + QTableWidget::item { + padding: 8px 4px; + border: none; + min-height: 20px; + } + QTableWidget::item:selected { + background: #dbeafe; + color: #1e293b; + } + QHeaderView::section { + background: #e2e8f0; + padding: 12px 8px; + border: none; + border-right: 1px solid #cbd5e1; + border-bottom: 1px solid #cbd5e1; + font-weight: 600; + color: #1e293b; + } + QHeaderView::section:vertical { + background: #f8fafc; + padding: 8px 4px; + border: none; + border-bottom: 
1px solid #e2e8f0; + font-weight: 500; + color: #64748b; + min-width: 40px; + } + QProgressBar { + border: 2px solid #e2e8f0; + border-radius: 6px; + text-align: center; + background: #f1f5f9; + } + QProgressBar::chunk { + background: #3b82f6; + border-radius: 5px; + } + QTextEdit { + border: 2px solid #e2e8f0; + border-radius: 6px; + padding: 12px; + background: white; + color: #1e293b; + font-family: 'Segoe UI', sans-serif; + } + QComboBox { + border: 2px solid #d1d5db; + border-radius: 6px; + padding: 8px 12px; + background: white; + color: #1e293b; + font-size: 14px; + min-height: 40px; + } + QComboBox:focus { + border-color: #3b82f6; + } + QComboBox::drop-down { + border: none; + width: 30px; + } + QComboBox::down-arrow { + image: none; + border-left: 5px solid transparent; + border-right: 5px solid transparent; + border-top: 5px solid #6b7280; + margin-right: 10px; + } + QLabel { + color: #1e293b; + font-weight: 500; + font-size: 14px; + } + """) + + # System tray icon for notifications + + self.tray_icon = QSystemTrayIcon(self) + self.tray_icon.setIcon(self.style().standardIcon(QStyle.StandardPixmap.SP_ComputerIcon)) + self.tray_icon.setVisible(True) + + # Create central widget and main layout + central_widget = QWidget() + self.setCentralWidget(central_widget) + main_layout = QVBoxLayout(central_widget) + main_layout.setContentsMargins(16, 16, 16, 16) + main_layout.setSpacing(12) + + # Create header + header = self.create_header() + main_layout.addWidget(header) + + # Add proper spacing after header + spacer = QWidget() + spacer.setFixedHeight(12) + main_layout.addWidget(spacer) + + # Create tab widget with proper margins + self.tab_widget = QTabWidget() + self.tab_widget.setStyleSheet(""" + QTabWidget { + margin-top: 0px; + background: transparent; + } + QTabWidget::pane { + border: none; + background: white; + border-radius: 8px; + margin: 4px 8px 8px 8px; + padding-top: 4px; + } + QTabBar { + background: transparent; + spacing: 0px; + } + QTabBar::tab { 
+ background: #475569; + color: #e2e8f0; + padding: 12px 20px; + margin-right: 4px; + border-top-left-radius: 8px; + border-top-right-radius: 8px; + font-weight: 600; + font-size: 14px; + min-width: 120px; + margin-bottom: 4px; + } + QTabBar::tab:selected { + background: white; + color: #1e293b; + border-bottom: none; + margin-bottom: 4px; + } + QTabBar::tab:hover:!selected { + background: #64748b; + color: #f1f5f9; + } + QTabBar::tab:first { + margin-left: 8px; + } + QTabBar::tab:last { + margin-right: 8px; + } + """) + main_layout.addWidget(self.tab_widget) + + # Create tabs + self.create_scraping_tab() + self.create_data_tab() + self.create_analysis_tab() + self.create_sitemap_tab() + self.create_ai_tab() + + def create_header(self): + """Create a clean header with help button only (no theme toggle)""" + header_widget = QWidget() + header_widget.setStyleSheet(""" + QWidget { + background: #0f172a; + border-radius: 12px; + margin: 4px 4px 8px 4px; + } + """) + header_layout = QHBoxLayout(header_widget) + header_layout.setContentsMargins(24, 20, 24, 20) + header_layout.setSpacing(16) + + # Title + title_label = QLabel("Web Scraper & Data Analyzer") + title_label.setStyleSheet(""" + QLabel { + color: #f8fafc; + font-size: 28px; + font-weight: 800; + font-family: 'Segoe UI', sans-serif; + } + """) + + # Subtitle + subtitle_label = QLabel("Modern web scraping with intelligent data analysis") + subtitle_label.setStyleSheet(""" + QLabel { + color: #cbd5e1; + font-size: 16px; + font-weight: 500; + font-family: 'Segoe UI', sans-serif; + } + """) + + # Help button + help_button = ModernButton("Help") + help_button.clicked.connect(self.show_help) + + # Right side info + info_widget = QWidget() + info_layout = QVBoxLayout(info_widget) + info_layout.setAlignment(Qt.AlignmentFlag.AlignRight) + info_layout.setSpacing(4) + + version_label = QLabel("v2.0") + version_label.setStyleSheet(""" + QLabel { + color: #94a3b8; + font-size: 14px; + font-weight: 600; + background: #1e293b; 
+ padding: 6px 12px; + border-radius: 6px; + border: 1px solid #334155; + } + """) + + info_layout.addWidget(version_label) + + header_layout.addWidget(title_label) + header_layout.addStretch() + header_layout.addWidget(subtitle_label) + header_layout.addStretch() + header_layout.addWidget(help_button) + header_layout.addWidget(info_widget) + + return header_widget + + def create_scraping_tab(self): + """Create the web scraping configuration tab""" + scraping_widget = QWidget() + main_layout = QVBoxLayout(scraping_widget) + main_layout.setContentsMargins(16, 16, 16, 16) + main_layout.setSpacing(16) + + # Create scroll area + scroll_area = QScrollArea() + scroll_area.setWidgetResizable(True) + scroll_area.setStyleSheet("QScrollArea { border: none; }") + scroll_area.setHorizontalScrollBarPolicy(Qt.ScrollBarPolicy.ScrollBarAsNeeded) + scroll_area.setVerticalScrollBarPolicy(Qt.ScrollBarPolicy.ScrollBarAsNeeded) + + # Create content widget for scrolling + content_widget = QWidget() + layout = QVBoxLayout(content_widget) + layout.setSpacing(16) + layout.setContentsMargins(0, 0, 0, 0) + + # Input group + input_group = QGroupBox("Scraping Configuration") + input_layout = QGridLayout(input_group) + input_layout.setSpacing(12) + + # URL input + input_layout.addWidget(QLabel("Website URL:"), 0, 0) + self.url_input = ModernLineEdit("https://example.com") + input_layout.addWidget(self.url_input, 0, 1) + + # Depth input + input_layout.addWidget(QLabel("Max Depth (1-100):"), 1, 0) + self.depth_input = ModernSpinBox() + self.depth_input.setRange(1, 100) + self.depth_input.setValue(3) + input_layout.addWidget(self.depth_input, 1, 1) + + # Control buttons + button_layout = QHBoxLayout() + button_layout.setSpacing(8) + + self.start_button = ModernButton("Start Scraping", primary=True) + self.start_button.clicked.connect(self.start_scraping) + button_layout.addWidget(self.start_button) + + self.stop_button = ModernButton("Stop") + self.stop_button.clicked.connect(self.stop_scraping) + 
self.stop_button.setEnabled(False) + button_layout.addWidget(self.stop_button) + + input_layout.addLayout(button_layout, 2, 0, 1, 2) + layout.addWidget(input_group) + + # Progress group + progress_group = QGroupBox("Progress") + progress_layout = QVBoxLayout(progress_group) + progress_layout.setSpacing(8) + + self.progress_bar = QProgressBar() + self.progress_bar.setVisible(False) + self.progress_bar.setMinimumHeight(20) + progress_layout.addWidget(self.progress_bar) + + self.status_label = QLabel("Ready to start scraping...") + self.status_label.setStyleSheet(""" + QLabel { + color: #374151; + font-size: 14px; + padding: 8px; + background: #f8fafc; + border-radius: 6px; + border-left: 3px solid #3b82f6; + } + """) + self.status_label.setWordWrap(True) # Enable word wrapping + progress_layout.addWidget(self.status_label) + + layout.addWidget(progress_group) + + # Results preview + results_group = QGroupBox("Scraping Results") + results_layout = QVBoxLayout(results_group) + + self.results_text = QTextEdit() + self.results_text.setReadOnly(True) + self.results_text.setMinimumHeight(80) # Reduced minimum height for more compact output + results_layout.addWidget(self.results_text) + + layout.addWidget(results_group) + + # Set the content widget in the scroll area + scroll_area.setWidget(content_widget) + main_layout.addWidget(scroll_area) + + self.tab_widget.addTab(scraping_widget, "Web Scraping") + + def create_data_tab(self): + """Create the data viewing and filtering tab""" + data_widget = QWidget() + layout = QVBoxLayout(data_widget) + layout.setSpacing(16) + + # Search and filter controls + controls_group = QGroupBox("Search & Filter") + controls_layout = QHBoxLayout(controls_group) + controls_layout.setSpacing(12) + + controls_layout.addWidget(QLabel("Search:")) + self.search_input = ModernLineEdit("Enter search term...") + self.search_input.textChanged.connect(self.filter_data) + controls_layout.addWidget(self.search_input) + + 
controls_layout.addWidget(QLabel("Domain:")) + self.domain_filter = QComboBox() + self.domain_filter.currentTextChanged.connect(self.filter_data) + controls_layout.addWidget(self.domain_filter) + + self.export_button = ModernButton("Export Data") + self.export_button.clicked.connect(self.export_data) + controls_layout.addWidget(self.export_button) + + # Sitemap button + self.sitemap_button = ModernButton("Generate Sitemap.xml") + self.sitemap_button.clicked.connect(self.generate_sitemap) + controls_layout.addWidget(self.sitemap_button) + + layout.addWidget(controls_group) + + # Data table + self.data_table = QTableWidget() + self.data_table.setColumnCount(6) + self.data_table.setHorizontalHeaderLabels([ + "Title", "URL", "Depth", "Links", "Words", "Load Time" + ]) + + # Set table properties to fill available width + header = self.data_table.horizontalHeader() + header.setStretchLastSection(False) # Don't stretch the last section + + # Set resize modes to make table fill width properly + header.setSectionResizeMode(0, QHeaderView.Stretch) # Title - stretch to fill + header.setSectionResizeMode(1, QHeaderView.Stretch) # URL - stretch to fill + header.setSectionResizeMode(2, QHeaderView.Fixed) # Depth - fixed + header.setSectionResizeMode(3, QHeaderView.Fixed) # Links - fixed + header.setSectionResizeMode(4, QHeaderView.Fixed) # Words - fixed + header.setSectionResizeMode(5, QHeaderView.Fixed) # Load Time - fixed + + # Set fixed column widths for non-stretching columns + self.data_table.setColumnWidth(2, 80) # Depth + self.data_table.setColumnWidth(3, 80) # Links + self.data_table.setColumnWidth(4, 80) # Words + self.data_table.setColumnWidth(5, 100) # Load Time + + # Set row height to prevent index cutoff + self.data_table.verticalHeader().setDefaultSectionSize(40) # Increased row height + self.data_table.verticalHeader().setMinimumSectionSize(35) # Minimum row height + + # Enable word wrapping for title and URL columns + self.data_table.setWordWrap(True) + + # 
Connect double-click signal + self.data_table.cellDoubleClicked.connect(self.show_content_preview) + + layout.addWidget(self.data_table) + + self.tab_widget.addTab(data_widget, "Data View") + + def create_analysis_tab(self): + """Create the data analysis tab""" + analysis_widget = QWidget() + layout = QVBoxLayout(analysis_widget) + layout.setSpacing(16) + + # Create scroll area for better layout + scroll_area = QScrollArea() + scroll_area.setWidgetResizable(True) + scroll_area.setStyleSheet("QScrollArea { border: none; }") + + content_widget = QWidget() + content_layout = QVBoxLayout(content_widget) + content_layout.setSpacing(16) + + # Statistics group + stats_group = QGroupBox("Statistics") + stats_layout = QGridLayout(stats_group) + stats_layout.setSpacing(12) + + self.stats_labels = {} + stats_fields = [ + ("Total Pages", "Total Pages"), + ("Total Links", "Total Links"), + ("Total Words", "Total Words"), + ("Average Load Time", "Average Load Time"), + ("Max Depth Reached", "Max Depth Reached") + ] + + for i, (label_text, field) in enumerate(stats_fields): + stats_layout.addWidget(QLabel(f"{label_text}:"), i, 0) + label = QLabel("0") + label.setStyleSheet(""" + QLabel { + font-weight: 700; + color: #3b82f6; + font-size: 16px; + padding: 8px 12px; + background: #eff6ff; + border-radius: 6px; + border-left: 3px solid #3b82f6; + } + """) + self.stats_labels[field] = label + stats_layout.addWidget(label, i, 1) + + content_layout.addWidget(stats_group) + + # Domain breakdown + domain_group = QGroupBox("Domain Breakdown") + domain_layout = QVBoxLayout(domain_group) + + self.domain_text = QTextEdit() + self.domain_text.setReadOnly(True) + self.domain_text.setMaximumHeight(150) + domain_layout.addWidget(self.domain_text) + + content_layout.addWidget(domain_group) + + # Content preview + content_preview_group = QGroupBox("Content Preview") + content_preview_layout = QVBoxLayout(content_preview_group) + + # Create splitter for text and visual preview + preview_splitter = 
QSplitter(Qt.Orientation.Horizontal) + + # Text preview + text_preview_widget = QWidget() + text_preview_layout = QVBoxLayout(text_preview_widget) + text_preview_layout.setContentsMargins(0, 0, 0, 0) + + text_label = QLabel("Text Content:") + text_label.setStyleSheet("font-weight: 600; margin-bottom: 8px;") + text_preview_layout.addWidget(text_label) + + self.content_text = QTextEdit() + self.content_text.setReadOnly(True) + self.content_text.setMaximumHeight(400) + self.content_text.setFont(QFont("Segoe UI", 12)) + self.content_text.setStyleSheet(""" + QTextEdit { + font-size: 12px; + line-height: 1.4; + padding: 16px; + } + """) + text_preview_layout.addWidget(self.content_text) + + # Visual HTML preview + visual_preview_widget = QWidget() + visual_preview_layout = QVBoxLayout(visual_preview_widget) + visual_preview_layout.setContentsMargins(0, 0, 0, 0) + + visual_label = QLabel("Visual Preview:") + visual_label.setStyleSheet("font-weight: 600; margin-bottom: 8px;") + visual_preview_layout.addWidget(visual_label) + + if WEB_ENGINE_AVAILABLE: + self.web_view = QWebEngineView() + self.web_view.setMinimumHeight(400) + self.web_view.setMaximumHeight(400) + visual_preview_layout.addWidget(self.web_view) + else: + self.web_view = QLabel("Visual preview not available\nInstall PyQtWebEngine for HTML rendering") + self.web_view.setStyleSheet("color: #6b7280; padding: 20px; text-align: center;") + self.web_view.setMinimumHeight(400) + self.web_view.setMaximumHeight(400) + visual_preview_layout.addWidget(self.web_view) + + # Add widgets to splitter + preview_splitter.addWidget(text_preview_widget) + preview_splitter.addWidget(visual_preview_widget) + preview_splitter.setSizes([400, 600]) # Set initial split ratio + + content_preview_layout.addWidget(preview_splitter) + + content_layout.addWidget(content_preview_group) + + scroll_area.setWidget(content_widget) + layout.addWidget(scroll_area) + + self.tab_widget.addTab(analysis_widget, "Analysis") + + def 
create_sitemap_tab(self): + """Create the visual sitemap tab with a tree widget and export button""" + sitemap_widget = QWidget() + layout = QVBoxLayout(sitemap_widget) + layout.setSpacing(16) + + # Export button + self.export_sitemap_button = ModernButton("Export Sitemap (JSON)") + self.export_sitemap_button.clicked.connect(self.export_sitemap_json) + layout.addWidget(self.export_sitemap_button) + + self.sitemap_tree = QTreeWidget() + self.sitemap_tree.setHeaderLabels(["Page Title", "URL"]) + self.sitemap_tree.setColumnWidth(0, 350) + self.sitemap_tree.setColumnWidth(1, 600) + self.sitemap_tree.itemDoubleClicked.connect(self.open_url_in_browser) + layout.addWidget(self.sitemap_tree) + + self.tab_widget.addTab(sitemap_widget, "Sitemap") + + def create_ai_tab(self): + """Create a simplified, modern AI Analysis tab with a chat interface and compact quick actions, using more curves to match the app style.""" + ai_widget = QWidget() + layout = QVBoxLayout(ai_widget) + layout.setSpacing(8) + layout.setContentsMargins(16, 16, 16, 16) + + hint_label = QLabel("πŸ’‘ Ask questions about your scraped websites below.") + hint_label.setStyleSheet(""" + QLabel { + color: #64748b; + font-size: 13px; + padding: 4px 0 8px 0; + } + """) + layout.addWidget(hint_label) + + # --- Chat area --- + self.ai_chat_history = QListWidget() + self.ai_chat_history.setStyleSheet(""" + QListWidget { + background: #f8fafc; + border: 1.5px solid #e2e8f0; + border-radius: 22px; + font-size: 15px; + color: #1e293b; + padding: 12px; + font-family: 'Segoe UI', sans-serif; + } + """) + self.ai_chat_history.setSpacing(6) + self.ai_chat_history.setMinimumHeight(300) + self.ai_chat_history.setResizeMode(QListWidget.Adjust) + self.ai_chat_history.setVerticalScrollMode(QAbstractItemView.ScrollPerPixel) + layout.addWidget(self.ai_chat_history, stretch=1) + self.chat_messages = [] # Store (role, message, timestamp) tuples + self.render_chat_history() + + # --- Quick action buttons --- + quick_actions_widget = 
QWidget() + quick_actions_layout = QHBoxLayout(quick_actions_widget) + quick_actions_layout.setSpacing(8) + quick_actions_layout.setContentsMargins(0, 0, 0, 0) + quick_questions = [ + "Analyze the website structure", + "Find key content themes", + "Suggest SEO improvements", + "Compare page performance" + ] + for question in quick_questions: + quick_btn = QPushButton(question) + quick_btn.setFont(QFont("Segoe UI", 10)) + quick_btn.setCursor(Qt.CursorShape.PointingHandCursor) + quick_btn.clicked.connect(lambda _, q=question: self.quick_question(q)) + quick_btn.setStyleSheet(""" + QPushButton { + background: #e0e7ef; + border: none; + color: #374151; + padding: 8px 22px; + border-radius: 22px; + font-weight: 500; + font-size: 13px; + box-shadow: 0 2px 8px rgba(59, 130, 246, 0.04); + } + QPushButton:hover { + background: #3b82f6; + color: white; + } + QPushButton:pressed { + background: #2563eb; + color: white; + } + """) + quick_actions_layout.addWidget(quick_btn) + layout.addWidget(quick_actions_widget) + + # --- Input area --- + input_container = QWidget() + input_layout = QHBoxLayout(input_container) + input_layout.setContentsMargins(0, 0, 0, 0) + input_layout.setSpacing(8) + self.ai_input = QLineEdit() + self.ai_input.setPlaceholderText("Type your question and press Enter...") + self.ai_input.setMinimumHeight(44) + self.ai_input.setFont(QFont("Segoe UI", 12)) + self.ai_input.returnPressed.connect(self.send_ai_message) + self.ai_input.setStyleSheet(""" + QLineEdit { + border: 1.5px solid #e2e8f0; + border-radius: 22px; + padding: 10px 20px; + background: white; + color: #1e293b; + font-size: 14px; + } + QLineEdit:focus { + border-color: #3b82f6; + outline: none; + } + QLineEdit::placeholder { + color: #9ca3af; + } + """) + self.ai_send_button = QPushButton("Send") + self.ai_send_button.setMinimumHeight(44) + self.ai_send_button.setMinimumWidth(80) + self.ai_send_button.setFont(QFont("Segoe UI", 12, QFont.Weight.Medium)) + 
self.ai_send_button.setCursor(Qt.CursorShape.PointingHandCursor) + self.ai_send_button.clicked.connect(self.send_ai_message) + self.ai_send_button.setStyleSheet(""" + QPushButton { + background: #3b82f6; + border: none; + color: white; + padding: 10px 28px; + border-radius: 22px; + font-weight: 600; + font-size: 15px; + box-shadow: 0 2px 8px rgba(59, 130, 246, 0.08); + } + QPushButton:hover { + background: #2563eb; + } + QPushButton:pressed { + background: #1d4ed8; + } + QPushButton:disabled { + background: #9ca3af; + color: #f3f4f6; + } + """) + input_layout.addWidget(self.ai_input, stretch=1) + input_layout.addWidget(self.ai_send_button) + layout.addWidget(input_container) + + self.tab_widget.addTab(ai_widget, "AI Analysis") + ai_tab_index = self.tab_widget.count() - 1 + self.set_ai_tab_gradient(ai_tab_index) + + def render_chat_history(self): + self.ai_chat_history.clear() + for role, msg, timestamp in self.chat_messages: + item = QListWidgetItem() + bubble = ChatBubbleWidget(msg, timestamp, role) + bubble.adjustSize() + item.setSizeHint(bubble.sizeHint()) + self.ai_chat_history.addItem(item) + self.ai_chat_history.setItemWidget(item, bubble) + self.ai_chat_history.scrollToBottom() + + def send_ai_message(self): + user_msg = self.ai_input.text().strip() + if not user_msg: + return + timestamp = datetime.now().strftime("%H:%M") + self.chat_messages.append(("user", user_msg, timestamp)) + self.render_chat_history() + self.ai_input.clear() + # Show thinking indicator as AI message + self.chat_messages.append(("ai", "πŸ€” Analyzing your question...", timestamp)) + self.render_chat_history() + ai_context = self.get_ai_context(user_msg) + QTimer.singleShot(100, lambda: self._do_ai_response_openrouter(user_msg, ai_context)) + + def _do_ai_response_openrouter(self, user_msg, ai_context): + if OPENAI_AVAILABLE: + try: + client = OpenAI( + base_url="https://openrouter.ai/api/v1", + api_key=os.environ.get("OPENROUTER_API_KEY"), + ) + system_prompt = """You are an expert 
website analyst and AI assistant specializing in web scraping analysis. Your role is to:\n\n1. **Analyze website content** - Provide insights about the scraped websites\n2. **Identify patterns** - Find common themes, structures, and content types\n3. **Offer recommendations** - Suggest improvements for SEO, content, or structure\n4. **Answer questions** - Respond to specific queries about the websites\n5. **Provide actionable insights** - Give practical advice based on the data\n\n**Response Guidelines:**\n- Be professional yet conversational\n- Use clear, structured responses with bullet points when appropriate\n- Reference specific websites by title when relevant\n- Provide specific examples from the content\n- Suggest actionable next steps when possible\n- Use markdown formatting for better readability\n\n**Context:** You have access to scraped website data including titles, URLs, content previews, and metadata.""" + user_prompt = f"""# Website Analysis Request\n\n## User Question\n{user_msg}\n\n## Available Website Data\n{ai_context}\n\n## Instructions\nPlease provide a comprehensive analysis based on the user's question. Use the website data above to support your response. 
If the question is about specific aspects (SEO, content, structure, etc.), focus your analysis accordingly.\n\n**Format your response with:**\n- Clear headings and structure\n- Specific examples from the websites\n- Actionable insights and recommendations\n- Professional, helpful tone""" + completion = client.chat.completions.create( + extra_headers={ + "HTTP-Referer": "http://localhost:8000", + "X-Title": "Web Scraper & Data Analyzer - AI Analysis", + }, + extra_body={}, + model="deepseek/deepseek-r1-0528-qwen3-8b:free", + messages=[ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt} + ], + temperature=0.7, + max_tokens=2000 + ) + try: + answer = completion.choices[0].message.content + if answer is not None: + answer = answer.strip() + else: + answer = "❌ **AI Analysis Error**\n\nNo response content received from the AI model." + except (AttributeError, IndexError, KeyError): + answer = "❌ **AI Analysis Error**\n\nUnexpected response format from the AI model." + if hasattr(self, "ai_stats_label"): + self.ai_stats_label.setText(f"Analyzed {len(self.websites)} websites") + except Exception as e: + answer = f"❌ **AI Analysis Error**\n\nI encountered an error while analyzing your request: `{str(e)}`\n\nPlease try again or check your internet connection." + else: + if ai_context == "No data available. Please scrape some websites first.": + answer = "πŸ“Š **No Data Available**\n\nPlease scrape some websites first to enable AI analysis." + else: + answer = f"πŸ€– **AI Analysis Preview**\n\nI have analyzed {len(self.websites)} websites. Your question: '{user_msg}'\n\n*(This is a placeholder response. 
Install the 'openai' package for real AI analysis.)*"
        # Remove the last AI thinking message
        if self.chat_messages and self.chat_messages[-1][1].startswith("πŸ€”"):
            self.chat_messages.pop()
        timestamp = datetime.now().strftime("%H:%M")
        self.chat_messages.append(("ai", answer, timestamp))
        self.render_chat_history()

    def open_url_in_browser(self, item, column):
        """Open the double-clicked sitemap item's URL in the default browser"""
        url = item.data(1, Qt.ItemDataRole.DisplayRole)
        if url:
            webbrowser.open(url)

    def get_icon(self, is_root=False):
        """Return a standard icon for a sitemap node (desktop icon for the root, folder icon otherwise)"""
        if is_root:
            return self.style().standardIcon(QStyle.StandardPixmap.SP_DesktopIcon)
        else:
            return self.style().standardIcon(QStyle.StandardPixmap.SP_DirIcon)

    def update_sitemap_tree(self):
        """Build and display the sitemap tree from crawled data, with icons and tooltips"""
        self.sitemap_tree.clear()
        if not self.websites:
            return
        url_to_website = {w.url: w for w in self.websites}
        children_map = {w.url: [] for w in self.websites}
        for w in self.websites:
            for link in w.links:
                if link in url_to_website:
                    children_map[w.url].append(link)
        root_url = self.websites[0].url

        def add_items(parent_item, url, visited, depth):
            if url in visited:
                return
            visited.add(url)
            website = url_to_website[url]
            item = QTreeWidgetItem([website.title, website.url])
            item.setIcon(0, self.get_icon(is_root=False))
            tooltip = f"Title: {website.title}\n"
            tooltip += f"URL: {website.url}\n"
            tooltip += f"Depth: {website.depth}\n"
            tooltip += f"Outgoing Links: {len(website.links)}"
            item.setToolTip(0, tooltip)
            item.setToolTip(1, tooltip)
            parent_item.addChild(item)
            for child_url in children_map[url]:
                add_items(item, child_url, visited, depth + 1)

        root_website = url_to_website[root_url]
        root_item = QTreeWidgetItem([root_website.title, root_website.url])
        root_item.setIcon(0, self.get_icon(is_root=True))
        tooltip = f"Title: {root_website.title}\n"
        tooltip += f"URL: {root_website.url}\n"
        tooltip += f"Depth: {root_website.depth}\n"
        tooltip += f"Outgoing Links: {len(root_website.links)}"
        root_item.setToolTip(0, tooltip)
        root_item.setToolTip(1, tooltip)
        self.sitemap_tree.addTopLevelItem(root_item)
        visited = set([root_url])
        for child_url in children_map[root_url]:
            add_items(root_item, child_url, visited, 1)
        self.sitemap_tree.expandToDepth(1)

    def export_sitemap_json(self):
        """Export the sitemap tree as a JSON file (preserving hierarchy)"""
        if not self.websites:
            QMessageBox.warning(self, "Error", "No sitemap data to export.")
            return

        def build_tree(item):
            data = {
                'title': item.text(0),
                'url': item.text(1),
                'children': [build_tree(item.child(i)) for i in range(item.childCount())]
            }
            return data

        root = self.sitemap_tree.topLevelItem(0)
        if not root:
            QMessageBox.warning(self, "Error", "No sitemap data to export.")
            return
        sitemap_data = build_tree(root)
        try:
            with open('sitemap_tree.json', 'w', encoding='utf-8') as f:
                json.dump(sitemap_data, f, indent=2, ensure_ascii=False)
            QMessageBox.information(self, "Success", "Sitemap exported to 'sitemap_tree.json'")
        except Exception as e:
            QMessageBox.critical(self, "Error", f"Failed to export sitemap: {e}")

    def is_valid_url(self, url):
        """Check if the URL is valid (basic check for scheme and domain)"""
        try:
            parsed = urlparse(url)
            return all([parsed.scheme in ("http", "https"), parsed.netloc])
        except Exception:
            return False

    def start_scraping(self):
        """Start the web scraping process"""
        url = self.url_input.text().strip()
        if not url:
            QMessageBox.warning(self, "Error", "Please enter a valid URL")
            return

        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url

        # Validate URL format
        if not self.is_valid_url(url):
            QMessageBox.warning(self, "Invalid URL", "Please enter a valid website URL (e.g. 
https://example.com)") + return + + max_depth = self.depth_input.value() + + # Update UI + self.start_button.setEnabled(False) + self.stop_button.setEnabled(True) + self.progress_bar.setVisible(True) + self.progress_bar.setRange(0, 0) # Indeterminate progress + self.status_label.setText("Scraping in progress...") + self.status_label.setStyleSheet(""" + QLabel { + color: #1e40af; + font-size: 14px; + padding: 8px; + background: #eff6ff; + border-radius: 6px; + border-left: 3px solid #3b82f6; + } + """) + + # Start scraping thread + self.scraping_thread = ScrapingThread(url, max_depth) + self.scraping_thread.progress_updated.connect(self.update_progress) + self.scraping_thread.scraping_complete.connect(self.scraping_finished) + self.scraping_thread.error_occurred.connect(self.scraping_error) + self.scraping_thread.start() + + def stop_scraping(self): + """Stop the scraping process""" + if hasattr(self, 'scraping_thread') and self.scraping_thread.isRunning(): + # Use graceful stop instead of forceful termination + self.scraping_thread.stop() + + # Wait for the thread to finish gracefully (with timeout) + if not self.scraping_thread.wait(5000): # Wait up to 5 seconds + # If it doesn't stop gracefully, then force terminate + self.scraping_thread.terminate() + self.scraping_thread.wait(2000) # Wait up to 2 more seconds + + self.start_button.setEnabled(True) + self.stop_button.setEnabled(False) + self.progress_bar.setVisible(False) + self.status_label.setText("Scraping stopped.") + self.status_label.setStyleSheet(""" + QLabel { + color: #92400e; + font-size: 14px; + padding: 8px; + background: #fffbeb; + border-radius: 6px; + border-left: 3px solid #f59e0b; + } + """) + + def update_progress(self, message): + """Update progress message""" + self.status_label.setText(message) + self.results_text.append(message) + + def show_help(self): + """Show a help/info dialog with usage instructions (no theme switch info)""" + help_text = ( + "

Web Scraper & Data Analyzer - Help

For more info, see the README or contact support.
" + ) + QMessageBox.information(self, "Help / Info", help_text) + + def scraping_finished(self, websites): + """Handle scraping completion""" + self.websites = websites + self.scraper.websites = websites + + # Update UI + self.start_button.setEnabled(True) + self.stop_button.setEnabled(False) + self.progress_bar.setVisible(False) + self.status_label.setText(f"Scraping complete! Found {len(websites)} websites.") + self.status_label.setStyleSheet(""" + QLabel { + color: #166534; + font-size: 14px; + padding: 8px; + background: #f0fdf4; + border-radius: 6px; + border-left: 3px solid #22c55e; + } + """) + + # Update data view + self.update_data_table() + self.update_analysis() + self.update_sitemap_tree() + + # Switch to data tab + self.tab_widget.setCurrentIndex(1) + + # Show desktop notification + self.tray_icon.showMessage( + "Web Scraper", + f"Scraping complete! Found {len(websites)} websites.", + QSystemTrayIcon.MessageIcon(1), # 1 = Information + 5000 + ) + + def scraping_error(self, error_message): + """Handle scraping errors""" + QMessageBox.critical(self, "Error", f"Scraping failed: {error_message}") + self.start_button.setEnabled(True) + self.stop_button.setEnabled(False) + self.progress_bar.setVisible(False) + self.status_label.setText("Scraping failed.") + self.status_label.setStyleSheet(""" + QLabel { + color: #991b1b; + font-size: 14px; + padding: 8px; + background: #fef2f2; + border-radius: 6px; + border-left: 3px solid #ef4444; + } + """) + + # Show desktop notification + self.tray_icon.showMessage( + "Web Scraper", + f"Scraping failed: {error_message}", + QSystemTrayIcon.MessageIcon(3), + 5000 + ) + + def update_data_table(self): + """Update the data table with scraped websites""" + self.data_table.setRowCount(len(self.websites)) + for row, website in enumerate(self.websites): + self.data_table.setRowHeight(row, 40) + title_item = QTableWidgetItem(website.title) + title_item.setTextAlignment(Qt.AlignmentFlag.AlignTop | Qt.AlignmentFlag.AlignLeft) + 
url_item = QTableWidgetItem(website.url) + url_item.setTextAlignment(Qt.AlignmentFlag.AlignTop | Qt.AlignmentFlag.AlignLeft) + depth_item = QTableWidgetItem(str(website.depth)) + depth_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + links_item = QTableWidgetItem(str(len(website.links))) + links_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + words_item = QTableWidgetItem(str(website.get_word_count())) + words_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + load_time = f"{website.load_time:.2f}s" if website.load_time else "N/A" + load_time_item = QTableWidgetItem(load_time) + load_time_item.setTextAlignment(Qt.AlignmentFlag.AlignCenter) + self.data_table.setItem(row, 0, title_item) + self.data_table.setItem(row, 1, url_item) + self.data_table.setItem(row, 2, depth_item) + self.data_table.setItem(row, 3, links_item) + self.data_table.setItem(row, 4, words_item) + self.data_table.setItem(row, 5, load_time_item) + # Update domain filter + domains = list(set(w.get_normalized_domain() for w in self.websites)) + self.domain_filter.clear() + self.domain_filter.addItem("All Domains") + self.domain_filter.addItems(domains) + # Update content preview with first website + if self.websites: + first_website = self.websites[0] + content_preview = first_website.get_text_preview(800) + self.content_text.setText(content_preview) + + # Also update visual preview for first website + if WEB_ENGINE_AVAILABLE and hasattr(self, 'web_view'): + try: + html_content = first_website.content + if html_content and html_content.strip(): + full_html = f""" + + + + + + {first_website.title} + + + + {html_content} + + + """ + self.web_view.setHtml(full_html, QUrl(first_website.url)) + else: + self.web_view.setHtml(""" + + +

+ <h3>No HTML Content Available</h3>
+ <p>This page doesn't have HTML content to display in the visual preview.</p>

+ + + """) + except Exception as e: + self.web_view.setHtml(f""" + + +

+ <h3>Error Loading Preview</h3>
+ <p>Failed to load the visual preview:</p>
+ <pre>{str(e)}</pre>
+ <p>This might be due to:</p>

+ + + + """) + + def filter_data(self): + """Filter the data table based on search and domain filters""" + search_term = self.search_input.text().lower() + selected_domain = self.domain_filter.currentText() + + for row in range(self.data_table.rowCount()): + website = self.websites[row] + + # Check search term + matches_search = (search_term in website.title.lower() or + search_term in website.url.lower() or + website.search_content(search_term)) + + # Check domain filter + matches_domain = (selected_domain == "All Domains" or + website.get_normalized_domain() == selected_domain) + + # Show/hide row + self.data_table.setRowHidden(row, not (matches_search and matches_domain)) + + def update_analysis(self): + """Update the analysis tab with enhanced statistics""" + if not self.websites: + return + + stats = self.scraper.get_statistics() + + # Update statistics labels + self.stats_labels["Total Pages"].setText(str(stats['total_pages'])) + self.stats_labels["Total Links"].setText(str(stats['total_links'])) + self.stats_labels["Total Words"].setText(str(stats['total_words'])) + self.stats_labels["Average Load Time"].setText(f"{stats['avg_load_time']:.2f}s") + self.stats_labels["Max Depth Reached"].setText(str(stats['max_depth_reached'])) + + # Update domain breakdown with enhanced information + domain_text = "Domain Breakdown:\n\n" + + # Show visited URLs count + domain_text += f"πŸ“Š Total URLs Checked: {stats.get('visited_urls_count', 0)}\n" + domain_text += f"🎯 Starting Domain: {stats.get('start_domain', 'N/A')}\n\n" + + # Show domain page counts + if stats.get('domain_page_counts'): + domain_text += "πŸ“ˆ Pages per Domain:\n" + for domain, count in stats['domain_page_counts'].items(): + domain_text += f" β€’ {domain}: {count} pages\n" + domain_text += "\n" + + # Show final domain breakdown + domain_text += "🏠 Final Domain Distribution:\n" + for domain, count in stats['domains'].items(): + domain_text += f" β€’ {domain}: {count} pages\n" + + 
self.domain_text.setText(domain_text) + + def export_data(self): + """Export scraped data to JSON file""" + if not self.websites: + QMessageBox.warning(self, "Error", "No data to export") + return + + try: + data = [] + for website in self.websites: + website_data = { + 'title': website.title, + 'url': website.url, + 'depth': website.depth, + 'links': website.links, + 'word_count': website.get_word_count(), + 'load_time': website.load_time, + 'domain': website.get_domain(), + 'normalized_domain': website.get_normalized_domain(), + 'timestamp': website.timestamp.isoformat() + } + data.append(website_data) + + with open('scraped_data.json', 'w', encoding='utf-8') as f: + json.dump(data, f, indent=2, ensure_ascii=False) + + QMessageBox.information(self, "Success", "Data exported to 'scraped_data.json'") + + except Exception as e: + QMessageBox.critical(self, "Error", f"Failed to export data: {e}") + + def show_content_preview(self, row, column): + """Show content preview for the selected website""" + if row < len(self.websites): + website = self.websites[row] + + # Update text preview with more content + content_preview = website.get_text_preview(1000) # Increased from 500 + self.content_text.setText(content_preview) + + # Update visual HTML preview + if WEB_ENGINE_AVAILABLE and hasattr(self, 'web_view'): + try: + # Get the HTML content + html_content = website.content + if html_content and html_content.strip(): + # Create a complete HTML document with proper encoding + full_html = f""" + + + + + + {website.title} + + + + {html_content} + + + """ + + # Load the HTML content + self.web_view.setHtml(full_html, QUrl(website.url)) + else: + # Show a message if no HTML content + self.web_view.setHtml(""" + + +

+ <h3>No HTML Content Available</h3>
+ <p>This page doesn't have HTML content to display in the visual preview.</p>
+ <p>Check the text preview tab for the extracted content.</p>

+ + + """) + except Exception as e: + # Show error message in the web view + error_html = f""" + + +

+ <h3>Error Loading Preview</h3>
+ <p>Failed to load the visual preview:</p>
+ <pre>{str(e)}</pre>
+ <p>This might be due to:</p>

+ + + + """ + self.web_view.setHtml(error_html) + else: + # Fallback for when PyQtWebEngine is not available + if hasattr(self, 'web_view'): + self.web_view.setText("Visual preview not available\nInstall PyQtWebEngine for HTML rendering") + + def generate_sitemap(self): + """Generate sitemap.xml from crawled URLs""" + if not self.websites: + QMessageBox.warning(self, "Error", "No data to generate sitemap.") + return + try: + urls = [w.url for w in self.websites] + sitemap = [ + '', + '' + ] + for url in urls: + sitemap.append(" ") + sitemap.append(f" {url}") + sitemap.append(" ") + sitemap.append("") + with open("sitemap.xml", "w", encoding="utf-8") as f: + f.write("\n".join(sitemap)) + QMessageBox.information(self, "Sitemap Generated", "sitemap.xml has been created in the current directory.") + self.tray_icon.showMessage( + "Web Scraper", + "sitemap.xml has been generated.", + QSystemTrayIcon.MessageIcon(1), + 4000 + ) + except Exception as e: + QMessageBox.critical(self, "Error", f"Failed to generate sitemap: {e}") + self.tray_icon.showMessage( + "Web Scraper", + f"Failed to generate sitemap: {e}", + QSystemTrayIcon.MessageIcon(3), + 4000 + ) + + def update_sitemap_tree(self): + """Build and display the sitemap tree from crawled data, with icons and tooltips.""" + self.sitemap_tree.clear() + if not self.websites: + return + url_to_website = {w.url: w for w in self.websites} + children_map = {w.url: [] for w in self.websites} + for w in self.websites: + for link in w.links: + if link in url_to_website: + children_map[w.url].append(link) + root_url = self.websites[0].url + def add_items(parent_item, url, visited, depth): + if url in visited: + return + visited.add(url) + website = url_to_website[url] + item = QTreeWidgetItem([website.title, website.url]) + item.setIcon(0, self.get_icon(is_root=False)) + tooltip = f"Title: {website.title}
" + tooltip += f"URL: {website.url}
" + tooltip += f"Depth: {website.depth}
" + tooltip += f"Outgoing Links: {len(website.links)}" + item.setToolTip(0, tooltip) + item.setToolTip(1, tooltip) + parent_item.addChild(item) + for child_url in children_map[url]: + add_items(item, child_url, visited, depth+1) + root_website = url_to_website[root_url] + root_item = QTreeWidgetItem([root_website.title, root_website.url]) + root_item.setIcon(0, self.get_icon(is_root=True)) + tooltip = f"Title: {root_website.title}
" + tooltip += f"URL: {root_website.url}
" + tooltip += f"Depth: {root_website.depth}
" + tooltip += f"Outgoing Links: {len(root_website.links)}" + root_item.setToolTip(0, tooltip) + root_item.setToolTip(1, tooltip) + self.sitemap_tree.addTopLevelItem(root_item) + visited = set([root_url]) + for child_url in children_map[root_url]: + add_items(root_item, child_url, visited, 1) + self.sitemap_tree.expandToDepth(1) + + def set_ai_tab_gradient(self, tab_index): + """Apply premium gradient styling to the AI tab header""" + gradient_css = """ + QTabBar::tab:nth-child({}) {{ + background: qlineargradient(x1:0, y1:0, x2:1, y2:0, + stop:0 #667eea, stop:0.5 #764ba2, stop:1 #f093fb); + color: white; + font-weight: 700; + border: 2px solid #667eea; + border-bottom: none; + padding: 14px 24px; + font-size: 15px; + }} + QTabBar::tab:nth-child({}):selected {{ + background: qlineargradient(x1:0, y1:0, x2:1, y2:0, + stop:0 #f093fb, stop:0.5 #764ba2, stop:1 #667eea); + color: white; + font-weight: 800; + border-bottom: none; + box-shadow: 0 4px 12px rgba(102, 126, 234, 0.3); + }} + QTabBar::tab:nth-child({}):hover:!selected {{ + background: qlineargradient(x1:0, y1:0, x2:1, y2:0, + stop:0 #5a67d8, stop:0.5 #6b46c1, stop:1 #e879f9); + }} + """.format(tab_index+1, tab_index+1, tab_index+1) + self.tab_widget.tabBar().setStyleSheet(self.tab_widget.tabBar().styleSheet() + gradient_css) + + def quick_question(self, question): + """Handle quick question button clicks by sending the question as if typed by the user.""" + self.ai_input.setText(question) + self.send_ai_message() + + def get_ai_context(self, user_msg=None): + """Return a string summary of the scraped websites for AI analysis. If no data, return a message indicating no data is available.""" + if not self.websites: + return "No data available. Please scrape some websites first." + # Summarize up to 5 websites for context + context_lines = [] + for i, w in enumerate(self.websites[:5]): + context_lines.append(f"{i+1}. 
Title: {w.title}\n URL: {w.url}\n Preview: {w.get_text_preview(120)}") + context = "\n".join(context_lines) + return context + +def main(): + app = QApplication(sys.argv) + app.setStyle('Fusion') # Use Fusion style for modern look + + # Set application icon and properties + app.setApplicationName("Web Scraper & Data Analyzer") + app.setApplicationVersion("2.0") + + window = WebScraperApp() + window.show() + + sys.exit(app.exec_()) + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/community-contributions/clinic_booking_bot.ipynb b/community-contributions/clinic_booking_bot.ipynb new file mode 100644 index 0000000..d2d8b57 --- /dev/null +++ b/community-contributions/clinic_booking_bot.ipynb @@ -0,0 +1,344 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 170, + "id": "a1aa1b43-7a47-4aca-ae5f-94a9d4ba2d89", + "metadata": {}, + "outputs": [], + "source": [ + "## Clinic Booking Bot\n", + "\n", + "##Easily book your clinic visit – available only on weekdays between **14:00 and 15:00**. 
\n", + "##Speak or type, and get instant confirmation.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 171, + "id": "fe798c6a-f8da-46aa-8c0e-9d2623def3d2", + "metadata": {}, + "outputs": [], + "source": [ + "# import library\n", + "\n", + "import os\n", + "import json\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import gradio as gr\n", + "import base64\n", + "from io import BytesIO\n", + "from datetime import date\n", + "from PIL import Image, ImageDraw, ImageFont\n" + ] + }, + { + "cell_type": "code", + "execution_count": 172, + "id": "0ad4e526-e95d-4e70-9faa-b4236b105dd5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI API Key exists and begins sk-proj-\n" + ] + } + ], + "source": [ + "# Save keys\n", + "\n", + "load_dotenv(override=True)\n", + "\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key not set\")\n", + " \n", + "MODEL = \"gpt-4o-mini\"\n", + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": 173, + "id": "ae95308e-0002-4017-9f2c-fcb1ddb248fa", + "metadata": {}, + "outputs": [], + "source": [ + "# --- CONFIG ---\n", + "BOOKING_START = 14\n", + "BOOKING_END = 15\n", + "WEEKDAYS = [\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\"]\n", + "PHONE = \"010-1234567\"\n", + "confirmed_bookings = []\n" + ] + }, + { + "cell_type": "code", + "execution_count": 174, + "id": "e21b0fd0-4cda-4938-8867-dc2c6e7af4b1", + "metadata": {}, + "outputs": [], + "source": [ + "# --- TTS ---\n", + "def generate_tts(text, voice=\"fable\", filename=\"output.mp3\"):\n", + " response = openai.audio.speech.create(\n", + " model=\"tts-1\",\n", + " voice=\"fable\",\n", + " input=text\n", + " )\n", + " with open(filename, \"wb\") as f:\n", + " f.write(response.content)\n", + " return filename" + ] + }, 
+ { + "cell_type": "code", + "execution_count": 175, + "id": "e28a5c3b-bd01-4845-a41e-87823f6bb078", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Translate Booking Confirmation ---\n", + "def translate_text(text, target_language=\"nl\"):\n", + " prompt = f\"Translate this message to {target_language}:\\n{text}\"\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful translator.\"},\n", + " {\"role\": \"user\", \"content\": prompt}\n", + " ]\n", + " )\n", + " return response.choices[0].message.content.strip()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 176, + "id": "8ed57cc9-7d54-4a5d-831b-0efcc5b7a7a9", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Booking Logic ---\n", + "def book_appointment(name, time_str):\n", + " try:\n", + " booking_time = datetime.strptime(time_str, \"%H:%M\")\n", + " except ValueError:\n", + " return \"Invalid time format. 
Use HH:MM.\", None, None\n", + "\n", + " hour = booking_time.hour\n", + " weekday = datetime.today().strftime(\"%A\")\n", + "\n", + " if weekday not in WEEKDAYS:\n", + " response = \"Bookings are only available on weekdays.\"\n", + " elif BOOKING_START <= hour < BOOKING_END:\n", + " confirmation = f\"Booking confirmed for {name} at {time_str}.\"\n", + " confirmed_bookings.append((name, time_str))\n", + " translated = translate_text(confirmation)\n", + " audio = generate_tts(translated)\n", + " image = generate_booking_image(name, time_str)\n", + " return translated, audio, image\n", + " else:\n", + " response = \"Sorry, bookings are only accepted between 14:00 and 15:00 on weekdays.\"\n", + " translated = translate_text(response)\n", + " audio = generate_tts(translated)\n", + " return translated, audio, None" + ] + }, + { + "cell_type": "code", + "execution_count": 177, + "id": "19b52115-f0f3-4d63-a463-886163d4cfd1", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Booking Card ---\n", + "def generate_booking_image(name, time_str):\n", + " img = Image.new(\"RGB\", (500, 250), color=\"white\")\n", + " draw = ImageDraw.Draw(img)\n", + " msg = f\"\\u2705 Booking Confirmed\\nName: {name}\\nTime: {time_str}\"\n", + " draw.text((50, 100), msg, fill=\"black\")\n", + " return img" + ] + }, + { + "cell_type": "code", + "execution_count": 178, + "id": "2c446b6c-d410-4ba1-b0c7-c475e5259ff5", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Voice Booking ---\n", + "def voice_booking(audio_path, name):\n", + " with open(audio_path, \"rb\") as f:\n", + " response = openai.audio.transcriptions.create(model=\"whisper-1\", file=f)\n", + " transcription = response.text.strip()\n", + "\n", + " system_prompt = \"\"\"\n", + " You are a clinic assistant. 
Extract only the appointment time from the user's sentence in 24-hour HH:MM format.\n", + " If no time is mentioned, respond with 'No valid time found.'\n", + " \"\"\"\n", + "\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": transcription}\n", + " ]\n", + " )\n", + " extracted_time = response.choices[0].message.content.strip()\n", + "\n", + " if \":\" in extracted_time:\n", + " return book_appointment(name, extracted_time)\n", + " else:\n", + " message = \"Sorry, I couldn't understand the time. Please try again.\"\n", + " translated = translate_text(message)\n", + " audio_path = generate_tts(translated)\n", + " return translated, audio_path, None" + ] + }, + { + "cell_type": "code", + "execution_count": 179, + "id": "121d2907-7fa8-4248-b2e7-83617ea66ff0", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Chat Bot Handler ---\n", + "def chat_bot(messages):\n", + " system_prompt = \"\"\"\n", + " You are a clinic booking assistant. 
Your job is to:\n", + " - Greet the patient and explain your role\n", + " - Only assist with making appointments\n", + " - Accept bookings only on weekdays between 14:00 and 15:00\n", + " - Do not provide medical advice\n", + " - Always respond with empathy and clarity\n", + " \"\"\"\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[{\"role\": \"system\", \"content\": system_prompt}] + messages\n", + " )\n", + " reply = response.choices[0].message.content.strip()\n", + " audio = generate_tts(reply)\n", + " return reply, audio" + ] + }, + { + "cell_type": "code", + "execution_count": 180, + "id": "2427b694-8c57-40cb-b202-4a8989547925", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* Running on local URL: http://127.0.0.1:7898\n", + "* To create a public link, set `share=True` in `launch()`.\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Gradio interface\n", + "with gr.Blocks(theme=gr.themes.Soft()) as demo:\n", + " gr.Markdown(\"\"\"## 🩺 GP Booking Assistant \n", + "Only available weekdays between **14:00 and 15:00** \n", + "☎️ Contact: {PHONE}\n", + "---\"\"\")\n", + "\n", + " name_global = gr.Textbox(label=\"Your Name\", placeholder=\"Enter your name\", interactive=True)\n", + "\n", + " with gr.Tab(\"πŸ’¬ Chat Mode\"):\n", + " chatbot = gr.Chatbot(label=\"Booking Chat\", type=\"messages\", height=400)\n", + " text_input = gr.Textbox(label=\"Type your message or use your voice below\")\n", + " audio_input = gr.Audio(type=\"filepath\", label=\"πŸŽ™οΈ Or speak your request\")\n", + " chat_audio_output = gr.Audio(label=\"πŸ”Š Assistant's Reply\", type=\"filepath\")\n", + " send_btn = gr.Button(\"Send\")\n", + "\n", + " def handle_chat(user_message, chat_history):\n", + " chat_history = chat_history or []\n", + " chat_history.append({\"role\": \"user\", \"content\": user_message})\n", + " reply, audio = chat_bot(chat_history)\n", + " chat_history.append({\"role\": \"assistant\", \"content\": reply})\n", + " return chat_history, \"\", audio\n", + "\n", + " def handle_audio_chat(audio_path, chat_history):\n", + " with open(audio_path, \"rb\") as f:\n", + " transcription = openai.audio.transcriptions.create(model=\"whisper-1\", file=f).text.strip()\n", + " return handle_chat(transcription, chat_history)\n", + "\n", + " send_btn.click(handle_chat, [text_input, chatbot], [chatbot, text_input, chat_audio_output])\n", + " text_input.submit(handle_chat, [text_input, chatbot], [chatbot, text_input, chat_audio_output])\n", + " audio_input.change(handle_audio_chat, [audio_input, chatbot], [chatbot, text_input, chat_audio_output])\n", + "\n", + "\n", + " \n", + " with gr.Tab(\"πŸ“ Text Booking\"):\n", + " time_text = gr.Textbox(label=\"Preferred Time (HH:MM)\", placeholder=\"e.g., 14:30\")\n", + 
" btn_text = gr.Button(\"πŸ“… Book via Text\")\n", + "\n", + " with gr.Tab(\"πŸŽ™οΈ Voice Booking\"):\n", + " voice_input = gr.Audio(type=\"filepath\", label=\"Say your preferred time\")\n", + " btn_voice = gr.Button(\"πŸ“… Book via Voice\")\n", + "\n", + " output_text = gr.Textbox(label=\"Response\", interactive=False)\n", + " output_audio = gr.Audio(label=\"Audio Reply\", type=\"filepath\")\n", + " output_image = gr.Image(label=\"Booking Confirmation\")\n", + "\n", + " btn_text.click(fn=book_appointment, inputs=[name_global, time_text], outputs=[output_text, output_audio, output_image])\n", + " btn_voice.click(fn=voice_booking, inputs=[voice_input, name_global], outputs=[output_text, output_audio, output_image])\n", + "\n", + " gr.Markdown(\"\"\"---\n", + "This assistant does **not** give medical advice. It only books appointments within allowed hours.\n", + "\"\"\")\n", + "\n", + " demo.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f359de0a-28b1-4895-b21d-91d79e494a0d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/community-contributions/multi-agent_gui_with_gradio/README.md b/community-contributions/multi-agent_gui_with_gradio/README.md new file mode 100644 index 0000000..3c80ace --- /dev/null +++ b/community-contributions/multi-agent_gui_with_gradio/README.md @@ -0,0 +1,25 @@ +# 🧠 Agentic Voice/Text Support Chatbot + +A multimodal chatbot interface with support for **text and voice input**, **multiple large language models (LLMs)**, and **context memory persistence** β€” all in 
a single Gradio-based GUI. + +## πŸš€ Features + +- πŸ”„ **Multi-LLM switching**: Dynamically switch between OpenAI, Anthropic Claude, and Meta LLaMA (via Ollama) +- 🎀 **Voice input**: Use your microphone with live speech-to-text transcription +- πŸ’¬ **Contextual memory**: Maintain chat history even when switching models +- πŸ§ͺ **Prototype-ready**: Built with Gradio for rapid GUI testing and development + +## πŸ› οΈ Technologies Used + +- [Gradio](https://www.gradio.app/) – GUI interface +- [OpenAI API](https://platform.openai.com/) +- [Anthropic Claude API](https://www.anthropic.com/) +- [Ollama](https://ollama.com/) – Local LLaMA inference +- [`speech_recognition`](https://pypi.org/project/SpeechRecognition/) – Voice-to-text +- `sounddevice`, `numpy` – Audio recording +- `.env` – Environment variable management + +## You’ll also need: +- API keys for OpenAI and Claude +- Ollama installed locally to run LLaMA models +- A .env file with the necessary API keys diff --git a/community-contributions/multi-agent_gui_with_gradio/agentic_voice_text_support.ipynb b/community-contributions/multi-agent_gui_with_gradio/agentic_voice_text_support.ipynb new file mode 100644 index 0000000..d4f6caf --- /dev/null +++ b/community-contributions/multi-agent_gui_with_gradio/agentic_voice_text_support.ipynb @@ -0,0 +1,395 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d006b2ea-9dfe-49c7-88a9-a5a0775185fd", + "metadata": {}, + "source": [ + "### Building a Chatbot Interface, with Text or Voice Input, Multi-LLM support, and Memory Persistence" + ] + }, + { + "cell_type": "markdown", + "id": "eeb20b3e", + "metadata": {}, + "source": [ + "In this tutorial, we’ll use Gradio to build a simple chatbot prototype with a user-friendly interface. The chatbot will support multiple language models, allowing the user to switch models at any point during the conversation. 
It will also offer optional memory persistence, where the chat history is stored and forwarded to the selected model β€” which allows shared memory across models, even when switching mid-chat.\n", + "\n", + "In this project, we'll use OpenAI's API, Anthropic's Claude, and Meta's LLaMA, which runs locally via an Ollama server. Additionally, we'll use Python’s speech_recognition module to convert speech to text.\n", + "\n", + "It's worth noting that some APIs β€” such as OpenAI's β€” now support direct audio input, so integrating speech capabilities can also be done end-to-end without a separate transcription module." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "a07e7793-b8f5-44f4-aded-5562f633271a", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import anthropic" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "a0a343b1", + "metadata": {}, + "outputs": [], + "source": [ + "# Speech recording and recognition libraries\n", + "import speech_recognition as sr\n", + "import sounddevice as sd\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "d7693eda", + "metadata": {}, + "outputs": [], + "source": [ + "# GUI prototyping\n", + "import gradio as gr" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "41ffc0e6", + "metadata": {}, + "outputs": [], + "source": [ + "buffer = [] # For temporarily holding sound recording\n", + "\n", + "# Helper function for handling voice recording\n", + "def callback(indata, frames, time, status):\n", + " buffer.append(indata.copy())\n", + "\n", + "stream = sd.InputStream(callback=callback, samplerate=16000, channels=1, dtype='int16')" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "e9a79075", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Function for handling recording data and 
status\n", + "def toggle_recording(state):\n", + " global stream, buffer\n", + " print('state', state)\n", + "\n", + " if not state:\n", + " buffer.clear()\n", + " stream.start()\n", + " return gr.update(value=\"Stop Recording\"), 'Recording...', not state\n", + " else:\n", + " stream.stop()\n", + " audio = np.concatenate(buffer, axis=0)\n", + " text = transcribe(audio)\n", + " return gr.update(value=\"Start Recording\"), text, not state\n", + "\n", + "# Functio that converts speech to text via Google's voice recognition module\n", + "def transcribe(recording, sample_rate=16000):\n", + " r = sr.Recognizer()\n", + "\n", + " # Convert NumPy array to AudioData\n", + " audio_data = sr.AudioData(\n", + " recording.tobytes(), # Raw byte data\n", + " sample_rate, # Sample rate\n", + " 2 # Sample width in bytes (16-bit = 2 bytes)\n", + " )\n", + "\n", + " text = r.recognize_google(audio_data)\n", + " print(\"You said:\", text)\n", + " return text" + ] + }, + { + "cell_type": "markdown", + "id": "dcfb0190", + "metadata": {}, + "source": [ + "### LLM & API set-up" + ] + }, + { + "cell_type": "markdown", + "id": "59416453", + "metadata": {}, + "source": [ + "##### Load API keys from .env" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "b638b822", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI API Key exists and begins sk-proj-\n", + "Anthropic API Key exists and begins sk-ant-\n", + "Google API Key not set\n" + ] + } + ], + "source": [ + "# Load environment variables in a file called .env\n", + "# Print the key prefixes to help with any debugging\n", + "\n", + "load_dotenv(override=True)\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n", + "google_api_key = os.getenv('GOOGLE_API_KEY')\n", + "\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key 
not set\")\n", + " \n", + "if anthropic_api_key:\n", + " print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n", + "else:\n", + " print(\"Anthropic API Key not set\")\n", + "\n", + "if google_api_key:\n", + " print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n", + "else:\n", + " print(\"Google API Key not set\")" + ] + }, + { + "cell_type": "markdown", + "id": "9e6ae162", + "metadata": {}, + "source": [ + "### Class for handling API calls and routing requests to the selected models" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "268ea65d", + "metadata": {}, + "outputs": [], + "source": [ + "class LLMHandler:\n", + " def __init__(self, system_message: str = '', ollama_api:str='http://localhost:11434/api/chat'):\n", + " # Default system message if none provided\n", + " self.system_message = system_message if system_message else \"You are a helpful assistant. Always reply in Markdown\"\n", + " self.message_history = []\n", + "\n", + " # Initialize LLM clients\n", + " self.openai = OpenAI()\n", + " self.claude = anthropic.Anthropic()\n", + " self.OLLAMA_API = ollama_api\n", + " self.OLLAMA_HEADERS = {\"Content-Type\": \"application/json\"}\n", + "\n", + " def llm_call(self, model: str = 'gpt-4o-mini', prompt: str = '', memory_persistence=True):\n", + " if not model:\n", + " return 'No model specified'\n", + "\n", + " # Use full message template with system prompt if no prior history\n", + " message = self.get_message_template(prompt, initial=True) if (\n", + " not self.message_history and not 'claude' in model\n", + " ) else self.get_message_template(prompt)\n", + "\n", + " # Handle memory persistence\n", + " if memory_persistence:\n", + " self.message_history.extend(message)\n", + " else:\n", + " self.message_history = message\n", + "\n", + " # Model-specific dispatch\n", + " try:\n", + " if 'gpt' in model:\n", + " response = self.call_openai(model=model)\n", + " elif 'claude' in model:\n", + " response = 
self.call_claude(model=model)\n", + " elif 'llama' in model:\n", + " response = self.call_ollama(model=model)\n", + " else:\n", + " response = f'{model.title()} is not supported or not a valid model name.'\n", + " except Exception as e:\n", + " response = f'Failed to retrieve response. Reason: {e}'\n", + "\n", + " # Save assistant's reply to history if memory is enabled\n", + " if memory_persistence:\n", + " self.message_history.append({\n", + " \"role\": \"assistant\",\n", + " \"content\": response\n", + " })\n", + "\n", + " return response\n", + "\n", + " def get_message_template(self, prompt: str = '', initial=False):\n", + " # Returns a message template with or without system prompt\n", + " initial_template = [\n", + " {\"role\": \"system\", \"content\": self.system_message},\n", + " {\"role\": \"user\", \"content\": prompt}\n", + " ]\n", + " general_template = [\n", + " {\"role\": \"user\", \"content\": prompt}\n", + " ]\n", + " return initial_template if initial else general_template\n", + "\n", + " def call_openai(self, model: str = 'gpt-4o-mini'):\n", + " # Sends chat completion request to OpenAI API\n", + " completion = self.openai.chat.completions.create(\n", + " model=model,\n", + " messages=self.message_history,\n", + " )\n", + " response = completion.choices[0].message.content\n", + " return response\n", + "\n", + " def call_ollama(self, model: str = \"llama3.2\"):\n", + "\n", + " payload = {\n", + " \"model\": model,\n", + " \"messages\": self.message_history,\n", + " \"stream\": False\n", + " }\n", + "\n", + " response = requests.post(url=self.OLLAMA_API, headers=self.OLLAMA_HEADERS, json=payload)\n", + " return response.json()[\"message\"][\"content\"]\n", + "\n", + " def call_claude(self, model: str = \"claude-3-haiku-20240307\"):\n", + " # Sends chat request to Anthropic Claude API\n", + " message = self.claude.messages.create(\n", + " model=model,\n", + " system=self.system_message,\n", + " messages=self.message_history,\n", + " 
max_tokens=500\n", + " )\n", + " response = message.content[0].text\n", + " return response\n" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "632e618b", + "metadata": {}, + "outputs": [], + "source": [ + "llm_handler = LLMHandler()\n", + "\n", + "# Function to handle user prompts received by the interface\n", + "def llm_call(model, prompt, memory_persistence):\n", + " response = llm_handler.llm_call(model=model, prompt=prompt, memory_persistence=memory_persistence)\n", + " return response, ''\n" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "e19228f6", + "metadata": {}, + "outputs": [], + "source": [ + "# Specify available model names for the dropdown component\n", + "# Note: \"gpt-3.5-turbo\" is the valid API model id (\"gpt-3.5\" alone is not)\n", + "AVAILABLE_MODELS = [\"gpt-4\", \"gpt-3.5-turbo\", \"claude-3-haiku-20240307\", \"llama3.2\", \"gpt-4o-mini\"]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "f65f43ff", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* Running on local URL: http://127.0.0.1:7868\n", + "* To create a public link, set `share=True` in `launch()`.\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\n", + "with gr.Blocks() as demo:\n", + " state = gr.State(False) # Recording state (on/off)\n", + " with gr.Row():\n", + " \n", + " with gr.Column():\n", + " out = gr.Markdown(label='Message history')\n", + " with gr.Row():\n", + " memory = gr.Checkbox(label='Toggle memory', value=True) # Handle memory status (on/off) btn\n", + " model_choice = gr.Dropdown(label='Model', choices=AVAILABLE_MODELS, interactive=True) # Model selection dropdown\n", + " query_box = gr.Textbox(label='ChatBox', placeholder=\"Your message\")\n", + " record_btn = gr.Button(value='Record voice message') # Start/stop recording btn\n", + " send_btn = gr.Button(\"Send\") # Send prompt btn\n", + " \n", + " \n", + " \n", + " record_btn.click(fn=toggle_recording, inputs=state, outputs=[record_btn, query_box, state])\n", + " send_btn.click(fn=llm_call, inputs=[model_choice, query_box, memory], outputs=[out, query_box])\n", + " \n", + "\n", + "demo.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3743db5d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "general_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/community-contributions/protocol_summarizer_webapp/.github/copilot-instructions.md b/community-contributions/protocol_summarizer_webapp/.github/copilot-instructions.md new file mode 100644 index 0000000..3aa12b5 --- /dev/null +++ 
b/community-contributions/protocol_summarizer_webapp/.github/copilot-instructions.md @@ -0,0 +1,3 @@ + + +This is a Streamlit web application for clinical trial protocol summarization. Use Streamlit best practices for UI and Python for backend logic. Integrate with ClinicalTrials.gov v2 API for study search and OpenAI for summarization. diff --git a/community-contributions/protocol_summarizer_webapp/.gitignore b/community-contributions/protocol_summarizer_webapp/.gitignore new file mode 100644 index 0000000..7cc51b2 --- /dev/null +++ b/community-contributions/protocol_summarizer_webapp/.gitignore @@ -0,0 +1,30 @@ +updates.md +.env +__pycache__/ +*.py[cod] +*$py.class +*.so +.Python +env/ +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +*.egg-info/ +.installed.cfg +*.egg +venv/ +ENV/ +.streamlit/ +.idea/ +.vscode/ +*.swp +*.swo +.DS_Store diff --git a/community-contributions/protocol_summarizer_webapp/README.md b/community-contributions/protocol_summarizer_webapp/README.md new file mode 100644 index 0000000..2e80874 --- /dev/null +++ b/community-contributions/protocol_summarizer_webapp/README.md @@ -0,0 +1,66 @@ +# Protocol Summarizer Webapp + +A Streamlit web application for searching and summarizing clinical trial protocols from ClinicalTrials.gov using Large Language Models. This tool enables researchers and clinical professionals to quickly extract key information from clinical trial protocols. + +## Features +- Search for clinical trials by keyword +- Display a list of studies with title and NCT number +- Select a study to summarize +- Fetch the protocol's brief summary from ClinicalTrials.gov API +- Automatically summarize the protocol using OpenAI's LLM +- Extract structured information like study design, population, interventions, and endpoints + +## Installation + +1. 
Clone this repository: + ```sh + git clone https://github.com/albertoclemente/protocol_summarizer.git + cd protocol_summarizer/protocol_summarizer_webapp + ``` + +2. Install dependencies: + ```sh + pip install -r requirements.txt + ``` + +3. Create a `.env` file in the project root with your OpenAI API key: + ``` + OPENAI_API_KEY=your_api_key_here + ``` + +## Usage + +1. Run the Streamlit app: + ```sh + streamlit run app.py + ``` + +2. In your browser: + - Enter a disease, condition, or keyword in the search box + - Select the number of results to display + - Click the "Search" button + - Select a study from the results + - Click "Summarize Protocol" to generate a structured summary + +## Technical Details + +- Uses ClinicalTrials.gov API v2 to retrieve study information +- Implements fallback methods to handle API changes or failures +- Extracts protocol brief summaries using reliable JSON parsing +- Generates structured summaries using OpenAI's GPT models + +## Requirements + +- Python 3.7+ +- Streamlit +- Requests +- OpenAI Python library +- python-dotenv + +## Contribution + +Contributions are welcome! Please feel free to submit a Pull Request. + +## License + +MIT License diff --git a/community-contributions/protocol_summarizer_webapp/app.py b/community-contributions/protocol_summarizer_webapp/app.py new file mode 100644 index 0000000..cd9941a --- /dev/null +++ b/community-contributions/protocol_summarizer_webapp/app.py @@ -0,0 +1,121 @@ +import os +from dotenv import load_dotenv +import streamlit as st +import requests +from openai import OpenAI + +load_dotenv() + +st.title("Protocol Summarizer") + +st.markdown(""" +Search for clinical trials by keyword, select a study, and generate a protocol summary using an LLM. 
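One detail worth noting about the keyword search described above: the raw query is interpolated into the request URL unescaped, so multi-word terms like "breast cancer" need percent-encoding. A minimal, self-contained sketch of building and parsing the v2 search request (helper names are illustrative; the response shape mirrors the parsing this app already uses and may change as the API evolves):

```python
from urllib.parse import urlencode

BASE = "https://clinicaltrials.gov/api/v2/studies"

def build_search_url(term, page_size=5):
    # urlencode protects terms with spaces or special characters
    params = {"query.term": term, "pageSize": page_size, "format": "json"}
    return f"{BASE}?{urlencode(params)}"

def extract_studies(payload):
    # Mirrors the app's parsing: studies -> protocolSection -> identificationModule
    studies = []
    for study in payload.get("studies", []):
        ident = study.get("protocolSection", {}).get("identificationModule", {})
        studies.append({"nct": ident.get("nctId", "N/A"),
                        "title": ident.get("officialTitle", "N/A")})
    return studies
```

The URL can then be fetched with `requests.get(build_search_url("breast cancer"), timeout=10)` and the JSON body passed to `extract_studies`.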
+""") + +# Search input + +# Show results only after user presses Enter +with st.form(key="search_form"): + query = st.text_input("Enter a disease, study title, or keyword:") + max_results = st.slider("Number of results", 1, 20, 5) + submitted = st.form_submit_button("Search") + +@st.cache_data(show_spinner=False) +def search_clinical_trials(query, max_results=5): + if not query: + return [] + url = f"https://clinicaltrials.gov/api/v2/studies?query.term={query}&pageSize={max_results}&format=json" + resp = requests.get(url) + studies = [] + if resp.status_code == 200: + data = resp.json() + for study in data.get('studies', []): + nct = study.get('protocolSection', {}).get('identificationModule', {}).get('nctId', 'N/A') + title = study.get('protocolSection', {}).get('identificationModule', {}).get('officialTitle', 'N/A') + studies.append({'nct': nct, 'title': title}) + return studies + +results = search_clinical_trials(query, max_results) if query else [] + +if results: + st.subheader("Search Results") + for i, study in enumerate(results): + st.markdown(f"**{i+1}. 
{study['title']}** (NCT: {study['nct']})") + selected = st.number_input("Select study number to summarize", min_value=1, max_value=len(results), value=1) + selected_study = results[selected-1] + st.markdown(f"### Selected Study\n**{selected_study['title']}** (NCT: {selected_study['nct']})") + if st.button("Summarize Protocol"): + # Fetch the brief summary for the selected study + nct_id = selected_study['nct'] + + # Use the V2 API which we know works reliably + url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}?format=json" + with st.spinner("Fetching study details..."): + resp = requests.get(url) + brief = "" + + if resp.status_code == 200: + try: + data = resp.json() + + # V2 API has protocolSection at the root level + if 'protocolSection' in data: + desc_mod = data.get('protocolSection', {}).get('descriptionModule', {}) + brief = desc_mod.get('briefSummary', '') + + # If briefSummary is empty, try detailedDescription + if not brief: + brief = desc_mod.get('detailedDescription', '') + except Exception as e: + st.error(f"Error parsing study data: {e}") + + # If API fails, try HTML scraping as a fallback + if not brief and resp.status_code != 200: + st.warning(f"API returned status code {resp.status_code}. 
Trying alternative method...") + html_url = f"https://clinicaltrials.gov/ct2/show/{nct_id}" + html_resp = requests.get(html_url) + + if "Brief Summary:" in html_resp.text: + start = html_resp.text.find("Brief Summary:") + 15 + excerpt = html_resp.text[start:start+1000] + + # Clean up HTML + import re + excerpt = re.sub('<[^<]+?>', ' ', excerpt) + excerpt = re.sub('\\s+', ' ', excerpt) + brief = excerpt.strip() + + if not brief: + st.error("No brief summary or detailed description found for this study.") + st.stop() + + # Now we have the brief summary, send it to the LLM + openai = OpenAI() + def user_prompt_for_protocol_brief(brief_text): + return ( + "Extract the following details from the clinical trial brief summary in markdown format with clear section headings (e.g., ## Study Design, ## Population, etc.):\n" + "- Study design\n" + "- Population\n" + "- Interventions\n" + "- Primary and secondary endpoints\n" + "- Study duration\n\n" + f"Brief summary text:\n{brief_text}" + ) + system_prompt = "You are a clinical research assistant. Extract and list the requested protocol details in markdown format with clear section headings." + messages = [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt_for_protocol_brief(brief)} + ] + with st.spinner("Summarizing with LLM..."): + try: + response = openai.chat.completions.create( + model="gpt-4o-mini", + messages=messages + ) + summary = response.choices[0].message.content + st.markdown(summary) + except Exception as e: + st.error(f"LLM call failed: {e}") +else: + if query: + st.info("No results found. 
Try a different keyword.") diff --git a/community-contributions/protocol_summarizer_webapp/requirements.txt b/community-contributions/protocol_summarizer_webapp/requirements.txt new file mode 100644 index 0000000..345b507 --- /dev/null +++ b/community-contributions/protocol_summarizer_webapp/requirements.txt @@ -0,0 +1,4 @@ +streamlit +openai +requests +python-dotenv diff --git a/community-contributions/sf-patient-brochure/.gitkeep b/community-contributions/sf-patient-brochure/.gitkeep new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/community-contributions/sf-patient-brochure/.gitkeep @@ -0,0 +1 @@ + diff --git a/community-contributions/sf-patient-brochure/Patient brochure.ipynb b/community-contributions/sf-patient-brochure/Patient brochure.ipynb new file mode 100644 index 0000000..4f6bc85 --- /dev/null +++ b/community-contributions/sf-patient-brochure/Patient brochure.ipynb @@ -0,0 +1,517 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 9, + "id": "fc57c47f-31fc-4527-af71-ce117d35c480", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt\n", + "\n", + "import os\n", + "import requests\n", + "import json\n", + "from typing import List\n", + "from dotenv import load_dotenv\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display, update_display\n", + "from openai import OpenAI\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "d74ea4e7-7d4a-4c85-92d3-8cdb231bc261", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "3eb884ea-02db-4ff8-91f9-c71e40b1cf4a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "API key looks good so far\n" + ] + } + ], + "source": [ + "# Initialize and constants\n", + "\n", + 
"load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:\n", + " print(\"API key looks good so far\")\n", + "else:\n", + " print(\"There might be a problem with your API key? Please visit the troubleshooting notebook!\")\n", + " \n", + "MODEL = 'gpt-4o-mini'\n", + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "d48a7b9b-273d-4bc9-997b-c7112e02528c", + "metadata": {}, + "outputs": [], + "source": [ + "# A class to represent a Webpage\n", + "\n", + "# Some websites need you to use proper headers when fetching them:\n", + "headers = {\n", + " \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n", + "}\n", + "\n", + "class Website:\n", + " def __init__(self, url):\n", + " self.url = url\n", + " response = requests.get(url, headers=headers)\n", + " self.body = response.content\n", + " soup = BeautifulSoup(self.body, 'html.parser')\n", + " self.title = soup.title.string if soup.title else \"No title found\"\n", + "\n", + " if soup.body:\n", + " for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n", + " irrelevant.decompose()\n", + " self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n", + " else:\n", + " self.text = \"\"\n", + "\n", + " links = [link.get('href') for link in soup.find_all('a')]\n", + " self.links = [link for link in links if link]\n", + "\n", + " def get_contents(self):\n", + " return f\"Webpage Title:\\n{self.title}\\nWebpage Contents:\\n{self.text}\\n\\n\"\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "bf51ae6e-91ae-46eb-ac39-dc860454ea4a", + "metadata": {}, + "outputs": [], + "source": [ + "def get_condition_links_from_topics_page():\n", + " topics_url = \"https://www.thuisarts.nl/overzicht/onderwerpen\"\n", + " response = requests.get(topics_url, headers=headers)\n", + 
" soup = BeautifulSoup(response.content, 'html.parser')\n", + "\n", + " # Find all tags that look like condition pages\n", + " links = soup.find_all(\"a\", href=True)\n", + " condition_links = []\n", + "\n", + " for link in links:\n", + " href = link['href']\n", + " if href.startswith(\"/\"):\n", + " href = \"https://www.thuisarts.nl\" + href\n", + " if href.startswith(\"https://www.thuisarts.nl/\") and len(href.split(\"/\")) > 3:\n", + " condition_links.append(href)\n", + "\n", + " # Remove duplicates and return\n", + " return list(set(condition_links))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "a246ac9f-73fb-4c2d-ab92-6f3f2bf7afac", + "metadata": {}, + "outputs": [], + "source": [ + "link_system_prompt = \"\"\"You are an assistant that filters URLs for patient education content. \n", + "\n", + "Only return links that lead to pages about symptoms, health conditions, treatments, or diseases β€” for example: pages on 'headache', 'diarrhea', 'stomach pain', 'asthma', etc.\n", + "\n", + "DO NOT return:\n", + "- contact pages\n", + "- overview/video/image/keuzekaart lists unless they directly link to medical complaints\n", + "- navigation or privacy/cookie/social media links\n", + "\n", + "Respond only with full https links in JSON format, like this:\n", + "{\n", + " \"links\": [\n", + " {\"type\": \"symptom or condition page\", \"url\": \"https://www.thuisarts.nl/hoofdpijn\"},\n", + " {\"type\": \"symptom or condition page\", \"url\": \"https://www.thuisarts.nl/buikpijn\"}\n", + " ]\n", + "}\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "b3ac761e-f583-479e-b8ef-70e70f8f361a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "You are an assistant that filters URLs for patient education content. 
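The href filtering in `get_condition_links_from_topics_page` can be isolated into a pure function, which makes its heuristic (any first-level path under thuisarts.nl counts as a condition page) easy to test without fetching the site. A sketch under that assumption, with illustrative names:

```python
def filter_condition_links(hrefs, base="https://www.thuisarts.nl"):
    # Resolve site-relative hrefs, keep only first-level thuisarts.nl pages,
    # and deduplicate -- same logic as the notebook's loop, factored out.
    out = set()
    for href in hrefs:
        url = base + href if href.startswith("/") else href
        if url.startswith(base + "/") and len(url.split("/")) > 3:
            out.add(url)
    return sorted(out)
```

Sorting the deduplicated set (rather than returning `list(set(...))`) also makes the result order reproducible between runs.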
\n", + "\n", + "Only return links that lead to pages about symptoms, health conditions, treatments, or diseases β€” for example: pages on 'headache', 'diarrhea', 'stomach pain', 'asthma', etc.\n", + "\n", + "DO NOT return:\n", + "- contact pages\n", + "- overview/video/image/keuzekaart lists unless they directly link to medical complaints\n", + "- navigation or privacy/cookie/social media links\n", + "\n", + "Respond only with full https links in JSON format, like this:\n", + "{\n", + " \"links\": [\n", + " {\"type\": \"symptom or condition page\", \"url\": \"https://www.thuisarts.nl/hoofdpijn\"},\n", + " {\"type\": \"symptom or condition page\", \"url\": \"https://www.thuisarts.nl/buikpijn\"}\n", + " ]\n", + "}\n", + "\n" + ] + } + ], + "source": [ + "print(link_system_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "5548e8d4-2813-40fe-a807-cf3661d3a0a9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "βœ… Found 680 condition pages.\n" + ] + } + ], + "source": [ + "condition_links = get_condition_links_from_topics_page()\n", + "print(f\"βœ… Found {len(condition_links)} condition pages.\")\n", + "\n", + "# Format for summary function\n", + "selected_links = [{\"url\": link} for link in condition_links]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "8d264592-8b77-425a-be4a-73ef7d32d744", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "def load_existing_summaries(filepath=\"brochure_cache.json\"):\n", + " if os.path.exists(filepath):\n", + " with open(filepath, \"r\", encoding=\"utf-8\") as f:\n", + " return json.load(f)\n", + " return {}\n", + "\n", + "def save_summaries_to_cache(summaries, filepath=\"brochure_cache.json\"):\n", + " with open(filepath, \"w\", encoding=\"utf-8\") as f:\n", + " json.dump(summaries, f, indent=2, ensure_ascii=False)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": 
"1cdd9456-1262-40a0-bc3f-28d23010ed7f", + "metadata": {}, + "outputs": [], + "source": [ + "selected_links = [{\"url\": link} for link in get_condition_links_from_topics_page()][:10]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "0c2f24ea-fa6b-4431-849a-e1aeaa936022", + "metadata": {}, + "outputs": [], + "source": [ + "summary_cache = {}\n", + "\n", + "def summarize_for_brochure(url):\n", + " if url in summary_cache:\n", + " summary = summary_cache[url]\n", + " print(f\"βœ… [Cached] {url}\")\n", + " print(f\"πŸ“„ Summary:\\n{summary}\\n\") # πŸ‘ˆ this prints the cached summary too\n", + " return summary\n", + "\n", + " page = Website(url)\n", + "\n", + " example = \"\"\"\n", + "Example:\n", + "\n", + "Title: Keelpijn \n", + "Summary: Sore throat is a common symptom, often caused by a virus. It usually goes away on its own within a few days. Drink warm fluids, rest your voice, and take paracetamol if needed. See a doctor if the pain lasts more than a week or gets worse.\n", + "\n", + "Title: Hoofdpijn \n", + "Summary: Headaches can have many causes like stress, fatigue, or dehydration. Most are harmless and go away with rest and fluids. Painkillers like paracetamol can help. If headaches are severe, frequent, or different than usual, contact your GP.\n", + "\"\"\"\n", + "\n", + " prompt = f\"\"\"\n", + "You are a health writer. 
Based on the Dutch content below, write a clear, short, brochure-style summary in **English** for patients.\n", + "\n", + "Use the format: \n", + "Title: {page.title} \n", + "Summary: \n", + "\n", + "Keep it under 100 words, easy to read, friendly, and medically accurate.\n", + "\n", + "{example}\n", + "\n", + "Now use this for:\n", + "Title: {page.title}\n", + "Content:\n", + "{page.text[:3000]}\n", + "\"\"\"\n", + "\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[{\"role\": \"user\", \"content\": prompt}],\n", + " temperature=0.4\n", + " )\n", + "\n", + " summary = response.choices[0].message.content.strip()\n", + " summary_cache[url] = summary\n", + " return summary\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "af8f9d81-d848-4fb9-ac79-782b39fed4a2", + "metadata": {}, + "outputs": [], + "source": [ + "def build_symptom_brochure(links, cache_file=\"brochure_cache.json\"):\n", + " brochure = []\n", + " cached = load_existing_summaries(cache_file)\n", + " print(\"πŸ“„ Building summaries for brochure:\\n\")\n", + "\n", + " for i, item in enumerate(links, 1):\n", + " url = item[\"url\"]\n", + " if url in cached:\n", + " print(f\"βœ… [Cached] {url}\")\n", + " brochure.append({\"url\": url, \"summary\": cached[url]})\n", + " continue\n", + " \n", + " print(f\"πŸ”„ [{i}/{len(links)}] Summarizing: {url}\")\n", + " try:\n", + " summary = summarize_for_brochure(url)\n", + " print(f\"βœ… Summary:\\n{summary}\\n\")\n", + " brochure.append({\"url\": url, \"summary\": summary})\n", + " cached[url] = summary # Save new summary\n", + " save_summaries_to_cache(cached, cache_file)\n", + " except Exception as e:\n", + " print(f\"❌ Error summarizing {url}: {e}\\n\")\n", + " brochure.append({\"url\": url, \"summary\": \"Error generating summary.\"})\n", + "\n", + " return brochure\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "e9079d6b-538f-4681-9776-4628a111246a", + "metadata": {}, + 
"outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "πŸ“„ Building summaries for brochure:\n", + "\n", + "πŸ”„ [1/10] Summarizing: https://www.thuisarts.nl/sociale-angststoornis\n", + "βœ… [New] https://www.thuisarts.nl/sociale-angststoornis\n", + "πŸ“„ Summary:\n", + "Title: Social Anxiety Disorder\n", + "Summary: Social anxiety disorder, or social phobia, is a fear of what others think of you, often leading to panic attacks. Writing down what happens, your thoughts, and feelings can help manage this fear. Positive thinking can also be beneficial when you're feeling anxious. Discussing your concerns with your GP or practice nurse can be helpful. If there's no improvement or symptoms are severe, treatments such as therapy with a psychologist or anxiety medication may be considered.\n", + "\n", + "βœ… Summary:\n", + "Title: Social Anxiety Disorder\n", + "Summary: Social anxiety disorder, or social phobia, is a fear of what others think of you, often leading to panic attacks. Writing down what happens, your thoughts, and feelings can help manage this fear. Positive thinking can also be beneficial when you're feeling anxious. Discussing your concerns with your GP or practice nurse can be helpful. If there's no improvement or symptoms are severe, treatments such as therapy with a psychologist or anxiety medication may be considered.\n", + "\n", + "βœ… [Cached] https://www.thuisarts.nl/diabetes-type-2\n", + "πŸ”„ [3/10] Summarizing: https://www.thuisarts.nl/morton-neuroom\n", + "βœ… [New] https://www.thuisarts.nl/morton-neuroom\n", + "πŸ“„ Summary:\n", + "Title: Morton's Neuroma | Thuisarts.nl \n", + "Summary: Morton's Neuroma is a pinched nerve in the forefoot, causing burning pain in the forefoot and toes. It often results from wearing too narrow shoes or high heels. Wearing comfortable, roomy shoes can help alleviate symptoms. For severe pain, paracetamol can be taken. 
Sometimes, a custom shoe insole can also help.\n", + "\n", + "βœ… Summary:\n", + "Title: Morton's Neuroma | Thuisarts.nl \n", + "Summary: Morton's Neuroma is a pinched nerve in the forefoot, causing burning pain in the forefoot and toes. It often results from wearing too narrow shoes or high heels. Wearing comfortable, roomy shoes can help alleviate symptoms. For severe pain, paracetamol can be taken. Sometimes, a custom shoe insole can also help.\n", + "\n", + "πŸ”„ [4/10] Summarizing: https://www.thuisarts.nl/borstvergroting\n", + "βœ… [New] https://www.thuisarts.nl/borstvergroting\n", + "πŸ“„ Summary:\n", + "Title: Breast Augmentation | Thuisarts.nl \n", + "Summary: A breast augmentation is a procedure where a plastic surgeon inserts fillings into your breasts, under general anesthesia. The surgery takes about an hour. Consider the pros and cons carefully. Benefits may include a more positive body image and increased self-confidence. Risks may include infection, bleeding, scarring, or hardening of the breasts over time. Often, a follow-up surgery is needed later. If you smoke, it's important to quit three weeks before surgery.\n", + "\n", + "βœ… Summary:\n", + "Title: Breast Augmentation | Thuisarts.nl \n", + "Summary: A breast augmentation is a procedure where a plastic surgeon inserts fillings into your breasts, under general anesthesia. The surgery takes about an hour. Consider the pros and cons carefully. Benefits may include a more positive body image and increased self-confidence. Risks may include infection, bleeding, scarring, or hardening of the breasts over time. Often, a follow-up surgery is needed later. 
If you smoke, it's important to quit three weeks before surgery.\n", + "\n", + "πŸ”„ [5/10] Summarizing: https://www.thuisarts.nl/kijkoperatie-in-buik\n", + "βœ… [New] https://www.thuisarts.nl/kijkoperatie-in-buik\n", + "πŸ“„ Summary:\n", + "Title: Abdominal Laparoscopy | Thuisarts.nl\n", + "Summary: An abdominal laparoscopy allows the doctor to examine or operate in your abdomen. Small tubes with a camera and tools are inserted through tiny incisions. You'll have a pre-operation discussion with your surgeon and anesthesiologist. You will be deeply sedated for the procedure. You cannot drive home post-operation, so arrange for someone to pick you up. Recovery usually requires a week off work, sometimes longer.\n", + "\n", + "βœ… Summary:\n", + "Title: Abdominal Laparoscopy | Thuisarts.nl\n", + "Summary: An abdominal laparoscopy allows the doctor to examine or operate in your abdomen. Small tubes with a camera and tools are inserted through tiny incisions. You'll have a pre-operation discussion with your surgeon and anesthesiologist. You will be deeply sedated for the procedure. You cannot drive home post-operation, so arrange for someone to pick you up. Recovery usually requires a week off work, sometimes longer.\n", + "\n", + "πŸ”„ [6/10] Summarizing: https://www.thuisarts.nl/veranderingen-in-zorg-als-je-18-wordt\n", + "βœ… [New] https://www.thuisarts.nl/veranderingen-in-zorg-als-je-18-wordt\n", + "πŸ“„ Summary:\n", + "Title: Changes in Care When You Turn 18 | Thuisarts.nl\n", + "Summary: As you become an adult, usually around 18, you transition from child to adult healthcare. You will start to take more responsibility, such as making appointments and requesting medications, giving you more control over your care. You will create a plan detailing what you need to manage this independently, with support provided to help you. 
This transition is a gradual process, with preparation beginning before you turn 18.\n", + "\n", + "βœ… Summary:\n", + "Title: Changes in Care When You Turn 18 | Thuisarts.nl\n", + "Summary: As you become an adult, usually around 18, you transition from child to adult healthcare. You will start to take more responsibility, such as making appointments and requesting medications, giving you more control over your care. You will create a plan detailing what you need to manage this independently, with support provided to help you. This transition is a gradual process, with preparation beginning before you turn 18.\n", + "\n", + "πŸ”„ [7/10] Summarizing: https://www.thuisarts.nl/zon-en-zonnebrand\n", + "βœ… [New] https://www.thuisarts.nl/zon-en-zonnebrand\n", + "πŸ“„ Summary:\n", + "Title: Sun and Sunburn | Thuisarts.nl\n", + "Summary: Protect your skin from excessive sunlight to avoid sunburn. If you notice your skin burning, immediately move out of the sun. Cool your skin with wet cloths if it hurts and take paracetamol for severe pain. Stay out of the sun for at least three days to allow your skin to recover. If you have symptoms of sunstroke, sun allergy, or eczema, seek medical advice.\n", + "\n", + "βœ… Summary:\n", + "Title: Sun and Sunburn | Thuisarts.nl\n", + "Summary: Protect your skin from excessive sunlight to avoid sunburn. If you notice your skin burning, immediately move out of the sun. Cool your skin with wet cloths if it hurts and take paracetamol for severe pain. Stay out of the sun for at least three days to allow your skin to recover. If you have symptoms of sunstroke, sun allergy, or eczema, seek medical advice.\n", + "\n", + "πŸ”„ [8/10] Summarizing: https://www.thuisarts.nl/ganglion\n", + "βœ… [New] https://www.thuisarts.nl/ganglion\n", + "πŸ“„ Summary:\n", + "Title: Ganglion | Thuisarts.nl \n", + "Summary: A ganglion is a small bump that can appear on your wrist, finger, or foot. It is a protrusion from the joint and is harmless. 
In half of the cases, a ganglion disappears on its own. If you notice such a bump, there is usually no cause for concern.\n", + "\n", + "βœ… Summary:\n", + "Title: Ganglion | Thuisarts.nl \n", + "Summary: A ganglion is a small bump that can appear on your wrist, finger, or foot. It is a protrusion from the joint and is harmless. In half of the cases, a ganglion disappears on its own. If you notice such a bump, there is usually no cause for concern.\n", + "\n", + "πŸ”„ [9/10] Summarizing: https://www.thuisarts.nl/kunstheup\n", + "βœ… [New] https://www.thuisarts.nl/kunstheup\n", + "πŸ“„ Summary:\n", + "Title: Hip Replacement | Thuisarts.nl\n", + "Summary: A hip replacement can be an option if you are experiencing severe pain or stiffness in your hip, such as from advanced arthritis or another hip disease. This is usually considered when other treatments like physiotherapy and painkillers have not provided enough relief. You can discuss with your hospital doctor whether a hip replacement is suitable for you. A hip prosthesis typically lasts longer than 20 years.\n", + "\n", + "βœ… Summary:\n", + "Title: Hip Replacement | Thuisarts.nl\n", + "Summary: A hip replacement can be an option if you are experiencing severe pain or stiffness in your hip, such as from advanced arthritis or another hip disease. This is usually considered when other treatments like physiotherapy and painkillers have not provided enough relief. You can discuss with your hospital doctor whether a hip replacement is suitable for you. A hip prosthesis typically lasts longer than 20 years.\n", + "\n", + "πŸ”„ [10/10] Summarizing: https://www.thuisarts.nl/gezond-leven\n", + "βœ… [New] https://www.thuisarts.nl/gezond-leven\n", + "πŸ“„ Summary:\n", + "Title: Healthy Living | Thuisarts.nl\n", + "Summary: For good health, it's important to eat, drink, and sleep well, stay active, relax, and maintain social contacts. Avoiding substances like alcohol is also beneficial. 
If you want to make changes to your lifestyle, take it step by step. Discuss your plans with your GP or practice nurse. Whether it's about healthy eating, exercise, sleep, stress management, social contact, or substance use, they can provide guidance and support.\n", + "\n", + "βœ… Summary:\n", + "Title: Healthy Living | Thuisarts.nl\n", + "Summary: For good health, it's important to eat, drink, and sleep well, stay active, relax, and maintain social contacts. Avoiding substances like alcohol is also beneficial. If you want to make changes to your lifestyle, take it step by step. Discuss your plans with your GP or practice nurse. Whether it's about healthy eating, exercise, sleep, stress management, social contact, or substance use, they can provide guidance and support.\n", + "\n" + ] + } + ], + "source": [ + "brochure = build_symptom_brochure(selected_links)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "e2121c3c-aa6a-4640-8e19-6ca6ccf84783", + "metadata": {}, + "outputs": [], + "source": [ + "def export_brochure_to_txt(brochure, filepath=\"brochure_summaries.txt\"):\n", + " if not brochure:\n", + " print(\"⚠️ No summaries to export.\")\n", + " return\n", + "\n", + " with open(filepath, \"w\", encoding=\"utf-8\") as f:\n", + " for item in brochure:\n", + " url = item.get(\"url\", \"Unknown URL\")\n", + " summary = item.get(\"summary\", \"No summary available.\")\n", + " f.write(f\"URL: {url}\\n\")\n", + " f.write(f\"{summary}\\n\\n\")\n", + "\n", + " print(f\"πŸ“ Exported {len(brochure)} summaries to {filepath}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "f14288f9-4d1c-4a0e-aaf4-9f86324b0602", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "πŸ“ Exported 10 summaries to brochure_summaries.txt\n" + ] + } + ], + "source": [ + "export_brochure_to_txt(brochure)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"c23e89db-3ded-4189-a227-6ca6ac2f1332", + "metadata": {}, + "outputs": [], + "source": [ + "###---it works---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a700e4f3-fb6a-499a-a579-6f9b8ad35c9f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/community-contributions/sf-patient-brochure/brochure_summaries.txt b/community-contributions/sf-patient-brochure/brochure_summaries.txt new file mode 100644 index 0000000..0ba4556 --- /dev/null +++ b/community-contributions/sf-patient-brochure/brochure_summaries.txt @@ -0,0 +1,40 @@ +URL: https://www.thuisarts.nl/sociale-angststoornis +Title: Social Anxiety Disorder +Summary: Social anxiety disorder, or social phobia, is a fear of what others think of you, often leading to panic attacks. Writing down what happens, your thoughts, and feelings can help manage this fear. Positive thinking can also be beneficial when you're feeling anxious. Discussing your concerns with your GP or practice nurse can be helpful. If there's no improvement or symptoms are severe, treatments such as therapy with a psychologist or anxiety medication may be considered. + +URL: https://www.thuisarts.nl/diabetes-type-2 +Title: Diabetes type 2 | Thuisarts.nl +Summary: Type 2 diabetes, also known as sugar disease, is characterized by high blood sugar levels. Leading a healthy lifestyle is crucial: eat healthily, lose weight, exercise regularly, relax, and quit smoking. If blood sugar levels remain high, medication may be required. 
Regular check-ups, usually every three months, with your GP or practice nurse are essential. + +URL: https://www.thuisarts.nl/morton-neuroom +Title: Morton's Neuroma | Thuisarts.nl +Summary: Morton's Neuroma is a pinched nerve in the forefoot, causing burning pain in the forefoot and toes. It often results from wearing too narrow shoes or high heels. Wearing comfortable, roomy shoes can help alleviate symptoms. For severe pain, paracetamol can be taken. Sometimes, a custom shoe insole can also help. + +URL: https://www.thuisarts.nl/borstvergroting +Title: Breast Augmentation | Thuisarts.nl +Summary: A breast augmentation is a procedure where a plastic surgeon inserts fillings into your breasts, under general anesthesia. The surgery takes about an hour. Consider the pros and cons carefully. Benefits may include a more positive body image and increased self-confidence. Risks may include infection, bleeding, scarring, or hardening of the breasts over time. Often, a follow-up surgery is needed later. If you smoke, it's important to quit three weeks before surgery. + +URL: https://www.thuisarts.nl/kijkoperatie-in-buik +Title: Abdominal Laparoscopy | Thuisarts.nl +Summary: An abdominal laparoscopy allows the doctor to examine or operate in your abdomen. Small tubes with a camera and tools are inserted through tiny incisions. You'll have a pre-operation discussion with your surgeon and anesthesiologist. You will be deeply sedated for the procedure. You cannot drive home post-operation, so arrange for someone to pick you up. Recovery usually requires a week off work, sometimes longer. + +URL: https://www.thuisarts.nl/veranderingen-in-zorg-als-je-18-wordt +Title: Changes in Care When You Turn 18 | Thuisarts.nl +Summary: As you become an adult, usually around 18, you transition from child to adult healthcare. You will start to take more responsibility, such as making appointments and requesting medications, giving you more control over your care. 
You will create a plan detailing what you need to manage this independently, with support provided to help you. This transition is a gradual process, with preparation beginning before you turn 18. + +URL: https://www.thuisarts.nl/zon-en-zonnebrand +Title: Sun and Sunburn | Thuisarts.nl +Summary: Protect your skin from excessive sunlight to avoid sunburn. If you notice your skin burning, immediately move out of the sun. Cool your skin with wet cloths if it hurts and take paracetamol for severe pain. Stay out of the sun for at least three days to allow your skin to recover. If you have symptoms of sunstroke, sun allergy, or eczema, seek medical advice. + +URL: https://www.thuisarts.nl/ganglion +Title: Ganglion | Thuisarts.nl +Summary: A ganglion is a small bump that can appear on your wrist, finger, or foot. It is a protrusion from the joint and is harmless. In half of the cases, a ganglion disappears on its own. If you notice such a bump, there is usually no cause for concern. + +URL: https://www.thuisarts.nl/kunstheup +Title: Hip Replacement | Thuisarts.nl +Summary: A hip replacement can be an option if you are experiencing severe pain or stiffness in your hip, such as from advanced arthritis or another hip disease. This is usually considered when other treatments like physiotherapy and painkillers have not provided enough relief. You can discuss with your hospital doctor whether a hip replacement is suitable for you. A hip prosthesis typically lasts longer than 20 years. + +URL: https://www.thuisarts.nl/gezond-leven +Title: Healthy Living | Thuisarts.nl +Summary: For good health, it's important to eat, drink, and sleep well, stay active, relax, and maintain social contacts. Avoiding substances like alcohol is also beneficial. If you want to make changes to your lifestyle, take it step by step. Discuss your plans with your GP or practice nurse. 
Whether it's about healthy eating, exercise, sleep, stress management, social contact, or substance use, they can provide guidance and support. + diff --git a/community/vanshika-mahajan/web_summary_fashion.ipynb b/community/vanshika-mahajan/web_summary_fashion.ipynb new file mode 100644 index 0000000..bc0930c --- /dev/null +++ b/community/vanshika-mahajan/web_summary_fashion.ipynb @@ -0,0 +1,933 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 113, + "id": "030082e9-edee-40b6-9f17-b6a683f2e334", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "import bs4\n", + "from bs4 import BeautifulSoup\n", + "import lxml\n", + "from IPython.display import Markdown, display\n", + "from openai import OpenAI" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "id": "c87e997d-e1d6-4b6f-9c76-3fb1d607f7cd", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "id": "e450cb33-1ae4-435e-b155-35f2bd7ab78e", + "metadata": {}, + "outputs": [], + "source": [ + "headers = {\n", + " \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n", + "} \n", + "# Browser-like User-Agent header so the server returns the same HTML a regular browser would, and to avoid blocks, CAPTCHAs, and 403 errors" + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "id": "63a57fb7-79db-444b-968b-c9314b1f3d3f", + "metadata": {}, + "outputs": [], + "source": [ + "class Website:\n", + " def __init__(self, url):\n", + " self.url = url\n", + " response = requests.get(url, headers=headers, timeout=30)\n", + " soup = BeautifulSoup(response.content, 'lxml')\n", + " self.title = soup.title.string if soup.title else \"No title found\"  # page title, if present\n", + " for irrelevant in soup.body([\"script\", \"style\", \"img\",
\"input\"]):#cleaning the content\n", + " irrelevant.decompose()\n", + " #using .get_text() method of Beautiful soup\n", + " self.text = soup.body.get_text(separator=\"\\n\", strip=True)#creating space between different lines and removing leading whitespaces by strip=true" + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "id": "7369159d-1f36-43c9-b7e7-a0b65b56426b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Latest and Trending Entertainment News, Celebrity News, Movie News, Breaking News | Entertainment - Times of India\n", + "Sign In\n", + "TOI\n", + "Go to\n", + "TOI\n", + "Etimes\n", + "home\n", + "cinema\n", + "news\n", + "movie reviews\n", + "movie listings\n", + "box office\n", + "anime\n", + "previews\n", + "did you know\n", + "videos\n", + "showtimes\n", + "blogs\n", + "awards\n", + "News\n", + "entertainment\n", + "Trending\n", + "Javed Akhtar\n", + "Diljit Dosanjh\n", + "Jaideep Ahlawat\n", + "Karisma Kapoor\n", + "Gauri Khan\n", + "Blake Lively\n", + "Trisha Krishnan\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Housefull 5\n", + "Kuberaa Movie Review\n", + "Sitaare Zameen Par Movie Review\n", + "Javed Akhtar\n", + "Diljit Dosanjh\n", + "Jaideep Ahlawat\n", + "Karisma Kapoor\n", + "Gauri Khan\n", + "Blake Lively\n", + "Trisha Krishnan\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Housefull 5\n", + "Kuberaa Movie Review\n", + "Sitaare Zameen Par Movie Review\n", + "Javed Akhtar\n", + "Diljit Dosanjh\n", + "Jaideep Ahlawat\n", + "Karisma Kapoor\n", + "Gauri Khan\n", + "Blake Lively\n", + "Trisha Krishnan\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Housefull 5\n", + "Kuberaa Movie Review\n", + "Sitaare Zameen Par Movie Review\n", + "Sudhanshu: At 52, John, Dino all of them look like rockstars - EXCLUSIVE\n", + "Sudhanshu Pandey, recognized from 
'Band Of Boys' and 'Anupama', defies his 50 years with his fitness. He credits his peers like Dino Moria, Arjun Rampal, and John Abraham for inspiring him to maintain a fit and youthful appearance. Pandey also admires Anil Kapoor's energy and dedication, motivating him to continue prioritizing fitness and inspiring others.\n", + "Previous\n", + "Sonakshi breaks silence on her rift with Luv and Kussh\n", + "Madhuri once chased Aamir with hockey stick for THIS reason\n", + "Ranbir-Raj Kapoor, Diljit-Hania, Samay-IGL: Top 5 news\n", + "Big B's savage reply to troll over cybercrime callertune\n", + "Anushka on keeping kids Vamika, Akaay away from public eye\n", + "Apoorva Mukhija recalls witnessing gender bias at home\n", + "Danish influencer seeks help to find papads from Big B\n", + "Sunjay Kapur's reception pics with Priya Sachdev goes viral\n", + "Big B schools trolls commenting 'buddha sathiya gaya hai'\n", + "Anushka on how she and Virat divide parenting duties\n", + "Brahmaji reacts to Vishnu's 7,000-acre land in New Zealand\n", + "Diljit says THIS amidst trolling for working with Hania\n", + "Riddhi found it ridiculous to like SRK's mother in Jawan\n", + "Priya Sachdev once called husband Sunjay Kapur β€˜misunderstood’\n", + "Next\n", + "1\n", + "2\n", + "3\n", + "Hindi\n", + "See All\n", + "Sudhanshu: At 52, John, Dino all of them look like rockstars - EXCLUSIVE\n", + "Sudhanshu Pandey, recognized from 'Band Of Boys' and 'Anupama', defies his 50 years with his fitness. He credits his peers like Dino Moria, Arjun Rampal, and John Abraham for inspiring him to maintain a fit and youthful appearance. 
Pandey also admires Anil Kapoor's energy and dedication, motivating him to continue prioritizing fitness and inspiring others.\n", + "Sonakshi breaks silence on her rift with Luv and Kussh\n", + "Madhuri once chased Aamir with hockey stick for THIS reason\n", + "Ranbir-Raj Kapoor, Diljit-Hania, Samay-IGL: Top 5 news\n", + "Anushka on keeping kids Vamika, Akaay away from public eye\n", + "Anushka Sharma and Virat Kohli are committed to shielding their children, Vamika and Akaay, from the constant glare of public attention. In a recent interview, Anushka emphasized the couple's focus on instilling strong values and ensuring a normal upbringing for their kids.\n", + "Apoorva Mukhija recalls witnessing gender bias at home\n", + "Regional\n", + "When Samantha’s class 10 mark sheet got leaked\n", + "Throwback to when a nostalgic memory made its way across the internet β€” Samantha Ruth Prabhu’s Class 10 mark sheet! The actress’s charming on-screen presence and grounded personality were once again in the spotlight as her old school report card began doing the rounds on social media.\n", + "Actor Tushar Ghadigaonkar passes away at 34\n", + "β€˜Kuberaa’ Twitter review: Netizens calls it a β€˜Blockbuster’\n", + "Mammootty’s health- Brittas says actor doing well\n", + "Kavya Madhavan’s father P. 
Madhavan passes away\n", + "β€˜The Raja Saab’ teaser: Prabhas shines in this horror comedy\n", + "Mammootty’s father-in-law P S Abu passes away\n", + "Videos\n", + "See All\n", + "Previous\n", + "03:07\n", + "Ananya Panday’s Garden Bond With Parrots Wins Hearts\n", + "88 views | 2 hours ago\n", + "03:14\n", + "Sameera Reddy’s Healing Journey Through Yoga\n", + "31 views | 2 hours ago\n", + "03:13\n", + "Kriti Kharbanda’s Modern Maharani Look Stuns Instagram\n", + "26 views | 2 hours ago\n", + "03:12\n", + "Bobby Deol Meets Diljit Dosanjh: Punjabi Power Goes Viral\n", + "81 views | 2 hours ago\n", + "03:19\n", + "β€˜Sitaare Zameen Par’: Riteish Deshmukh’s Emotional Shoutout For Genelia’s Big Win\n", + "162 views | 2 hours ago\n", + "03:26\n", + "Varun Dhawan Stuns With 50 Push-Ups Alongside Army Cadets on Border 2 Set\n", + "21 views | 2 hours ago\n", + "03:00\n", + "VIDYA BALAN TURNS HEADS WITH CASUAL AIRPORT LOOK\n", + "16 views | 2 hours ago\n", + "03:05\n", + "MANDHIRA KAPUR BREAKS DOWN IN EMOTIONAL POST FOR LATE BROTHER SUNJAY KAPUR\n", + "1.2K views | 2 hours ago\n", + "03:28\n", + "SALMAN KHAN TAKES A BRUTAL DIG AT SOHAIL’S DIVORCE ON NATIONAL TV\n", + "185 views | 2 hours ago\n", + "03:15\n", + "RAJINIKANTH CAUSES FAN RIOT DURING β€˜JAILER 2’ SHOOT IN MYSORE\n", + "26 views | 2 hours ago\n", + "03:10\n", + "IBRAHIM ALI KHAN KISSES HIS DOG AT AIRPORT IN HEARTWARMING FAREWELL\n", + "20 views | 3 hours ago\n", + "03:09\n", + "ANUPAMAA SET GUTTED IN MASSIVE FIRE | CREW ESCAPES, CINE BODY DEMANDS ACTION\n", + "1.2K views | 3 hours ago\n", + "Next\n", + "1\n", + "2\n", + "3\n", + "4\n", + "5\n", + "6\n", + "7\n", + "8\n", + "9\n", + "10\n", + "11\n", + "World\n", + "See All\n", + "Aamir to Tom: Celebs on a mission to 'Save Cinema'\n", + "'How to Train Your Dragon' beats '28 Years Later' and 'Elio' to top the US box office on second weekend\n", + "Blake Lively is heartbroken after friendship ends with Taylor Swift; accepts the music mogul won't be returning - Deets 
inside\n", + "Selena-Hailey UNFOLLOW each other amid Bieber drama\n", + "Judge gives Baldoni access to Blake-Taylor messages\n", + "Trending Now\n", + "# Sidharth Malhotra-Kiara Advani\n", + "# AbRam Khan-Taimur Ali Khan\n", + "# Janhvi Kapoor\n", + "# Salman Khan\n", + "# Hema Malini\n", + "# Salman Khan\n", + "# Gauri Khan\n", + "# Shah Rukh Khan\n", + "# Chahatt Khanna\n", + "Visual Stories\n", + "See All\n", + "Previous\n", + "Kuberaa’s Sameera to Pushpa’s Srivalli: Rashmika Mandanna’s most iconic on-screen avatars\n", + "Ahaana Krishna’s ethereal photo series is straight out of a dream\n", + "Rashmika Mandanna to Rakul Preet Singh: Best pictures of the week featuring south actresses\n", + "Gauri Khan's most loved saree looks - An ode to modern day elegance\n", + "​South Indian beauties whose smiles will light up your Monday\n", + "Karishma Tanna Slays Every Frame\n", + "Tamannaah Bhatia’s traditional looks\n", + "Malavika Mohanan's radiant pics\n", + "​Neha Shetty stuns in every shade of blue\n", + "Thalapathy Vijay’s top 10 blockbuster movies worth re-watching!\n", + "​In pic: Mesmerizing looks of Shruti Haasan​\n", + "Dushara Vijayan’s Most Elegant Fashion Moments\n", + "Next\n", + "1\n", + "2\n", + "3\n", + "More Stories\n", + "Sonakshi Sinha breaks silence on her rumoured rift with brothers Luv and Kussh Sinha: 'My effort is always to support them...'\n", + "Madhuri Dixit once chased Aamir Khan with a hockey stick for THIS reason on sets of Dil: 'People fool you and you believe them'\n", + "Mohanlal declines to continue as president at AMMA’s general body meeting- Deets Inside\n", + "Blockbusters Ranbir Kapoor turned down: Films that became hits without him\n", + "Anushka Sharma reveals why she and Virat Kohli are keeping their children Vamika and Akaay away from the public eye: 'We don't want to raise brats'\n", + "Apoorva Mukhija recalls witnessing gender bias at home: 'My mother did it all, but father got credit for showing up at PTMs'\n", + "Amitabh 
Bachchan gives a savage reply to a troll over his viral cybercrime caller tune: 'Sarkar ko bolo bhai..'\n", + "Danish influencer asks fans to help her find papads from Amitabh Bachchan; netizens say 'he also used to grow basmati rice'\n", + "Days after his untimely demise, Sunjay Kapur's reception photos with Priya Sachdev goes viral; Looked dashing in hand embroidered shoes, written 'I do'\n", + "Priyanka Chopra Jonas recollects walking into a trap set by John Cena, Idris Elba on sets of 'Heads of State'\n", + "Bobby Deol's London vacation sparks fan frenzy: viral video shows actor posing for selfies outside restaurant\n", + "Amitabh Bachchah gives befitting replies to 'buddha sathiya gaya hai', β€˜ganja’ comments by trolls: 'Ek din, Bhagwan naa kare voh din jaldi aaye...'\n", + "Sai Pallavi’s best performances\n", + "Brahmaji clears the air about Vishnu Manchu purchasing 7,000-acre land in New Zealand: 'I was pulling their leg as usual...'\n", + "Anushka Sharma reveals how she and Virat Kohli divide the parenting duties: 'I will be the primary caregiver, he plays round the year'\n", + "Ranbir Kapoor's 'Awara' look sparks rumours of Raj Kapoor tribute, Diljit Dosanjh slammed for working with Hania Aamir in Sardaar Ji 3: Top 5 news\n", + "Has Kiara Advani been approached to play Meena Kumari in her biopic? 
Here's what we know\n", + "Top 5 psychological Anime every thriller fan must watch\n", + "Load More Stories\n", + "# Latest Movies 2025\n", + "# Best Bollywood Movies 2025\n", + "# Hollywood Movie 2025\n", + "# Tamil Movies 2025\n", + "# Telugu Movies 2025\n", + "# Malayalam Movies 2025\n", + "# Kannada Movies 2025\n", + "# Marathi Movies 2025\n", + "# Bengali Movies 2025\n", + "# Top Rated Movies 2025\n", + "# Best Hindi Movies\n", + "# Best English Movies\n", + "Hot on the Web\n", + "Salman Khan\n", + "Karisma Kapoor\n", + "Jaideep Ahlawat\n", + "Blood Pressure\n", + "Big Cat Species\n", + "Trisha\n", + "Sitaare Zameen Par Review\n", + "Ancient Indigenous Tribes\n", + "Hair Growth Tips\n", + "Kidney Health\n", + "Kuberaa Review\n", + "Blake Lively\n", + "Reverse Fatty Liver\n", + "Skincare Hacks\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Baby Girl Names\n", + "Diljit Dosanjh\n", + "Kidney Disease Symptoms\n", + "Javed Akhtar\n", + "Heart Attack\n", + "Ram Kapoor Diet\n", + "Liver Damage\n", + "Kuberaa Movie Review\n", + "Gauri Khan\n", + "Baba Vanga Prediction\n", + "Baby Boy Names\n", + "Navjot Singh Sidhu\n", + "Housefull 5 Box Office Collection\n", + "DNA Movie Review\n", + "Kidney Damage Symptoms\n", + "Popular Waterfalls In India\n", + "Linkedin Ceo On AI Killing Jobs\n", + "Tesla Robotaxi\n", + "Early Cancer Detection\n", + "Harvard Research Reveals\n", + "American Destinations Explore Without Passport\n", + "Amouranth\n", + "Mouth Larvae\n", + "Doomsday Fish\n", + "Salman Khan AVM\n", + "Ginger Health Tips\n", + "Trending Topics\n", + "Latest Movies\n", + "Bollywood Movies\n", + "Hollywood Movies\n", + "Tamil Movies 2025\n", + "Telugu Movies 2025\n", + "Malayalam Movies 2025\n", + "Kannada Movies 2025\n", + "Marathi Movies 2025\n", + "Bengali Movies 2025\n", + "Top Rated Movies 2025\n", + "Best Hindi Movies\n", + "Best English Movies\n", + "Best Telugu Movies\n", + "Best Tamil Movies\n", + "Best Malayalam 
Movies\n", + "Best Kannada Movies\n", + "Best Bengali Movies\n", + "Upcoming Hindi Movies\n", + "Best Movies Of All Time\n", + "Best Hindi Movies of All Time\n", + "Latest English Movies\n", + "Latest Malayalam Movies\n", + "English TV News\n", + "Tamil TV News\n", + "Telugu TV News\n", + "Malayalam TV News\n", + "Kannada TV News\n", + "Movie Reviews\n", + "Bhojpuri Cinema News\n", + "Gujarati Cinema News\n", + "Popular Categories\n", + "Viral News\n", + "K Pop News\n", + "Web Series News\n", + "Anime News\n", + "Upcoming English Movies\n", + "Upcoming Tamil Movies\n", + "Upcoming Telugu Movies\n", + "Upcoming Malayalam Movies\n", + "Upcoming Kannada Movies\n", + "Fashion Tips\n", + "Travel News\n", + "Entertainment News\n", + "Bollywood News\n", + "Tollywood News\n", + "Kollywood News\n", + "Mollywood News\n", + "Food News\n", + "Latest Hindi Movies\n", + "Latest Tamil Movies\n", + "Parenting Tips\n", + "Home Remedies\n", + "Weight Loss\n", + "Beauty Tips\n", + "Parenting Tips\n", + "Hindi Videos\n", + "Hindi Video Songs\n", + "Bhojpuri Music Videos\n", + "Latest Telugu Movies\n", + "Bhojpuri Music Video\n", + "Hindi TV News\n", + "Latest News\n", + "NHL free agency turns spicy as Mitch Marner and Connor McDavid eye shorter deals to cash in later\n", + "Olive Ridley turtle washed ashore at Polem\n", + "Who is Thomas Fugate? 
Meet the 22-year-old leading Trump's terrorism unit amid Iran fiasco\n", + "'And that's why Putin's the boss': Trump rebukes former Russian President Medvedev; warns against treating 'N word casually'\n", + "Govt plans β‚Ή10cr road on Bicholim-Dodamarg route\n", + "Former WWE star Batista eyed for Road House 2 sequel\n", + "Sonakshi Sinha breaks silence on her rumoured rift with brothers Luv and Kussh Sinha: 'My effort is always to support them...'\n", + "Andre Agassi and Steffi Graf’s son Jaden Agassi shows love for girlfriend Catherine Holt’s bold new photo from bedroom series\n", + "Is WWE planning to change Cody Rhodes’ iconic entrance theme song β€˜Kingdom’?\n", + "Velumani says he didn’t attend RSS event in Coimbatore\n", + "Strait of Hormuz: Oil supply not an issue for India; 'pricing is a bigger concern,' what experts say\n", + "Madhuri Dixit once chased Aamir Khan with a hockey stick for THIS reason on sets of Dil: 'People fool you and you believe them'\n", + "As commissions fall, India’s ride-hailing firms test viability of flat-fee economics\n", + "Analysing what Trump’s strikes mean for Iran\n", + "Trump's clarification on 'Iran regime change' divides MAGA further: JD Vance, Hegseth, Marco Rubio 'humiliated'\n", + "Laughter Chefs 2: Krushna Abhishek roasts Rahul Vaidya for his in-famous feud with cricketer Virat Kohli\n", + "β€œI could have passed Dan Ticktum”: Edoardo Mortara regrets Attack Mode strategy at Jakarta E-Prix\n", + "India vs England Test: Sunil Gavaskar calls for Rishabh Pant's signature somersault celebration, wicketkeeper politely declines - WATCH\n", + "Copyright Β© 2025 Bennett, Coleman & Co. Ltd. All rights reserved. 
For reprint rights: Times Syndication Service\n", + "Follow us on\n" + ] + } + ], + "source": [ + "gossip= Website(\"https://timesofindia.indiatimes.com/entertainment\")\n", + "print(gossip.title)\n", + "print(gossip.text)" + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "id": "a6f30380-1b91-48e4-9c86-df0369e2e675", + "metadata": {}, + "outputs": [], + "source": [ + "system_prompt = \"\"\"\n", + "You are a stylish and culturally aware assistant who specializes in summarizing and discussing fashion trends, celebrity style, entertainment news, and television gossip.\n", + "\n", + "You stay updated on Hollywood, Bollywood, and the television worldβ€”including celebrity rumors, drama, reality TV updates, show recaps, and behind-the-scenes stories.\n", + "\n", + "When summarizing content, be engaging, concise, and insightful. Focus on what's trending, who's wearing what, and what everyone is talking about in fashion and entertainment. Maintain a fun yet informative tone, like a pop culture expert writing for a lifestyle magazine.\n", + "\n", + "If content includes TV gossip, highlight key rumors, casting updates, fan reactions, and noteworthy moments from popular shows.\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "id": "30822d5c-d518-451c-b31f-44afa2a3b37a", + "metadata": {}, + "outputs": [], + "source": [ + "def user_prompt_for(website):\n", + " user_prompt = f\"\"\"The following text is extracted from a website titled: \"{website.title}\".\n", + "\n", + "Please analyze this content and provide a short and engaging summary in **Markdown format**.\n", + "\n", + "If the page contains:\n", + "- 🧡 Fashion trends: mention standout styles, designers, or events.\n", + "- πŸ—£οΈ TV gossip: highlight any drama, casting news, or fan reactions.\n", + "- 🎬 Celebrity updates (Hollywood/Bollywood): include relevant quotes, fashion moments, or event mentions.\n", + "- πŸ“Ί Show recaps: summarize what happened and any major 
twists.\n", + "\n", + "Keep the summary clear, fun, and informative. Use bullet points if multiple themes appear. If there is no meaningful content, say: *β€œNo relevant summary could be generated.”*\n", + "\n", + "Website Content:\n", + "{website.text}\n", + "\"\"\"\n", + " return user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "5a25e90f-20a0-44ac-a96c-575ae974a45f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The following text is extracted from a website titled: \"Latest and Trending Entertainment News, Celebrity News, Movie News, Breaking News | Entertainment - Times of India\".\n", + "\n", + "Please analyze this content and provide a short and engaging summary in **Markdown format**.\n", + "\n", + "If the page contains:\n", + "- 🧡 Fashion trends: mention standout styles, designers, or events.\n", + "- πŸ—£οΈ TV gossip: highlight any drama, casting news, or fan reactions.\n", + "- 🎬 Celebrity updates (Hollywood/Bollywood): include relevant quotes, fashion moments, or event mentions.\n", + "- πŸ“Ί Show recaps: summarize what happened and any major twists.\n", + "\n", + "Keep the summary clear, fun, and informative. Use bullet points if multiple themes appear. 
If there is no meaningful content, say: *β€œNo relevant summary could be generated.”*\n", + "\n", + "Website Content:\n", + "Sign In\n", + "TOI\n", + "Go to\n", + "TOI\n", + "Etimes\n", + "home\n", + "cinema\n", + "news\n", + "movie reviews\n", + "movie listings\n", + "box office\n", + "anime\n", + "previews\n", + "did you know\n", + "videos\n", + "showtimes\n", + "blogs\n", + "awards\n", + "News\n", + "entertainment\n", + "Trending\n", + "Javed Akhtar\n", + "Diljit Dosanjh\n", + "Jaideep Ahlawat\n", + "Karisma Kapoor\n", + "Gauri Khan\n", + "Blake Lively\n", + "Trisha Krishnan\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Housefull 5\n", + "Kuberaa Movie Review\n", + "Sitaare Zameen Par Movie Review\n", + "Javed Akhtar\n", + "Diljit Dosanjh\n", + "Jaideep Ahlawat\n", + "Karisma Kapoor\n", + "Gauri Khan\n", + "Blake Lively\n", + "Trisha Krishnan\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Housefull 5\n", + "Kuberaa Movie Review\n", + "Sitaare Zameen Par Movie Review\n", + "Javed Akhtar\n", + "Diljit Dosanjh\n", + "Jaideep Ahlawat\n", + "Karisma Kapoor\n", + "Gauri Khan\n", + "Blake Lively\n", + "Trisha Krishnan\n", + "Kuberaa Box Office Collection\n", + "Sitaare Zameen Par Box Office Collection\n", + "Housefull 5\n", + "Kuberaa Movie Review\n", + "Sitaare Zameen Par Movie Review\n", + "Sudhanshu: At 52, John, Dino all of them look like rockstars - EXCLUSIVE\n", + "Sudhanshu Pandey, recognized from 'Band Of Boys' and 'Anupama', defies his 50 years with his fitness. He credits his peers like Dino Moria, Arjun Rampal, and John Abraham for inspiring him to maintain a fit and youthful appearance. 
Pandey also admires Anil Kapoor's energy and dedication, motivating him to continue prioritizing fitness and inspiring others.\n", + "Previous\n", + "Sonakshi breaks silence on her rift with Luv and Kussh\n", + "Madhuri once chased Aamir with hockey stick for THIS reason\n", + "Ranbir-Raj Kapoor, Diljit-Hania, Samay-IGL: Top 5 news\n", + "Big B's savage reply to troll over cybercrime callertune\n", + "Anushka on keeping kids Vamika, Akaay away from public eye\n", + "Apoorva Mukhija recalls witnessing gender bias at home\n", + "Danish influencer seeks help to find papads from Big B\n", + "Sunjay Kapur's reception pics with Priya Sachdev goes viral\n", + "Big B schools trolls commenting 'buddha sathiya gaya hai'\n", + "Anushka on how she and Virat divide parenting duties\n", + "Brahmaji reacts to Vishnu's 7,000-acre land in New Zealand\n", + "Diljit says THIS amidst trolling for working with Hania\n", + "Riddhi found it ridiculous to like SRK's mother in Jawan\n", + "Priya Sachdev once called husband Sunjay Kapur β€˜misunderstood’\n", + "Next\n", + "1\n", + "2\n", + "3\n", + "Hindi\n", + "See All\n", + "Sudhanshu: At 52, John, Dino all of them look like rockstars - EXCLUSIVE\n", + "Sudhanshu Pandey, recognized from 'Band Of Boys' and 'Anupama', defies his 50 years with his fitness. He credits his peers like Dino Moria, Arjun Rampal, and John Abraham for inspiring him to maintain a fit and youthful appearance. Pandey also admires Anil Kapoor's energy and dedication, motivating him to continue prioritizing fitness and inspiring others.\n", + "Sonakshi breaks silence on her rift with Luv and Kussh\n", + "Madhuri once chased Aamir with hockey stick for THIS reason\n", + "Ranbir-Raj Kapoor, Diljit-Hania, Samay-IGL: Top 5 news\n", + "Anushka on keeping kids Vamika, Akaay away from public eye\n", + "Anushka Sharma and Virat Kohli are committed to shielding their children, Vamika and Akaay, from the constant glare of public attention. 
In a recent interview, Anushka emphasized the couple's focus on instilling strong values and ensuring a normal upbringing for their kids.\n", + "[... lengthy scraped Times of India page contents truncated: remaining headline listings, video captions, trending hashtags, visual stories, and site navigation links omitted for brevity ...]\n", + "Copyright Β© 2025 Bennett, Coleman & Co. Ltd. All rights reserved. 
For reprint rights: Times Syndication Service\n", + "Follow us on\n", + "\n" + ] + } + ], + "source": [ + "print(user_prompt_for(gossip))" + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "id": "c039ab7c-88ee-475d-a93e-b26711d3ed4b", + "metadata": {}, + "outputs": [], + "source": [ + "def messages_for(website):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(website)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": 146, + "id": "dd1fee35-6cc9-4995-8b5e-b93d80488364", + "metadata": {}, + "outputs": [], + "source": [ + "def summarize(url):\n", + " website = Website(url)\n", + " response = openai.chat.completions.create(\n", + " model = \"llama3.2\",\n", + " messages = messages_for(website)\n", + " )\n", + " return response.choices[0].message.content" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed09dad8-93bb-417e-b07b-183d2eba1ec5", + "metadata": {}, + "outputs": [], + "source": [ + "summarize(\"https://timesofindia.indiatimes.com/entertainment\")" + ] + }, + { + "cell_type": "code", + "execution_count": 139, + "id": "16a57eed-eba5-4f75-84f2-d44a67b36047", + "metadata": {}, + "outputs": [], + "source": [ + "def display_summary(url):\n", + " summary = summarize(url)\n", + " display(Markdown(summary))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25af6217-6944-4c95-b156-0899dfcf0b83", + "metadata": {}, + "outputs": [], + "source": [ + "display_summary(\"https://timesofindia.indiatimes.com/entertainment\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29daa2d4-9d92-40ae-a0c4-dd2fdacf3f80", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + 
"file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/City Economy Summarizer.ipynb b/week1/community-contributions/City Economy Summarizer.ipynb new file mode 100644 index 0000000..9d8e9b5 --- /dev/null +++ b/week1/community-contributions/City Economy Summarizer.ipynb @@ -0,0 +1,273 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4e66a6eb-e44a-4dc3-bad7-82e27d45155d", + "metadata": {}, + "source": [ + "# Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98bf393c-358e-4ee1-b15b-96dfec323734", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display\n", + "from openai import OpenAI" + ] + }, + { + "cell_type": "markdown", + "id": "f92034ed-a2e6-444a-8008-291ba3f80561", + "metadata": {}, + "source": [ + "# OpenAI API Key" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a084b35d-19e9-4b48-bb06-d2c9e4474b20", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "\n", + "load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "# Check the key\n", + "\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - 
please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")" + ] + }, + { + "cell_type": "markdown", + "id": "32b35ea0-e4ca-492a-94af-822ec61468a0", + "metadata": {}, + "source": [ + "# About..." + ] + }, + { + "cell_type": "markdown", + "id": "c660b786-af88-4134-b958-ffbf7a7b2904", + "metadata": {}, + "source": [ + "In this project I use the code from day 1 for something I do at work. I'm a real estate appraiser and when I prepare a valuation for some real estate, I analyze the local market, and in particular the city where the property is located. I then gather economy-related information and create a report from it. I'm based in Poland, so the report is in Polish. Here, I want to ask the model to make such a report for me, using the official website of the city and its related Wikipedia article." + ] + }, + { + "cell_type": "markdown", + "id": "09f32b5a-4d0a-4fec-a2f8-5d323ca2745d", + "metadata": {}, + "source": [ + "# The Code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0fb8fe1-f052-4426-8531-5520d5295807", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a2cca4b-8cd0-4c1a-a01c-1da10199236c", + "metadata": {}, + "outputs": [], + "source": [ + "headers = {\n", + " \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n", + "}\n", + "\n", + "class Website:\n", + "\n", + " def __init__(self, url):\n", + " \"\"\"\n", + " Create this Website object from the given url using the BeautifulSoup library\n", + " \"\"\"\n", + " self.url = url\n", + " response = requests.get(url, headers=headers)\n", + " soup = BeautifulSoup(response.content, 'html.parser')\n", + " self.title = soup.title.string if soup.title else \"No title found\"\n", + " for irrelevant in soup.body([\"script\", \"style\", \"img\", 
\"input\"]):\n", + " irrelevant.decompose()\n", + " self.text = soup.body.get_text(separator=\"\\n\", strip=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c73e91c8-5805-4c9f-9bbb-b4e9c1e7bf12", + "metadata": {}, + "outputs": [], + "source": [ + "system_prompt = \"\"\"You are an analyst and real estate appraiser who checks out the official websites \n", + "of cities as well as articles related to these cities on Wikipedia, searching the particular pages \n", + "of the official website and the Wikipedia article for economic data, in particular the \n", + "demographic structure of the city, its area, and how it's subdivided into built-up area, \n", + "rural area, forests, and so on, provided this kind of information is available. \n", + "The most important information you want to find is that related to the real estate market in the city, \n", + "but also the general economy of the city, so what kind of factories or companies there are, commerce, \n", + "business conditions, transportation, economic growth in recent years, and recent investments. \n", + "wealth of the inhabitants, and so on, depending on what kind of information is available on the website. \n", + "Combine the information found on the official website with the information found on Wikipedia, and in case\n", + "of discrepancies, the official website should take precedence. If any of the information is missing,\n", + "just omit it entirely and don't mention that it is missing, just don't write about it at all.\n", + "When you gather all the required information, create a comprehensive report presenting \n", + "the data in a clear way, using markdown, in tabular form where it makes sense. \n", + "The length of the report should be about 5000 characters. And one more thing, the report should be entirely \n", + "in Polish. 
\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8015e8d-1655-4477-a111-aa8dd584f5eb", + "metadata": {}, + "outputs": [], + "source": [ + "def user_prompt_for(city, city_website, wiki_website):\n", + " user_prompt = f\"You are looking at the official website of the city {city}, and its wiki article.\"\n", + " user_prompt += f\"\\nThe contents of this website is as follows: \\\n", + "please provide a comprehensive report of economy-related data for the city of {city}, available on the \\\n", + "particular pages and subpages of its official website and Wikipedia in markdown. \\\n", + "Add tables if it makes sense for the data. The length of the report should be about 5000 characters. \\\n", + "The report should be in Polish.\\n\\n\"\n", + " user_prompt += city_website.text\n", + " user_prompt += wiki_website.text\n", + " return user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b55bd66b-e997-4d64-b5d5-679098013b9f", + "metadata": {}, + "outputs": [], + "source": [ + "def messages_for(city, city_website, wiki_website):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(city, city_website, wiki_website)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5f1f218-d6a9-4a9e-be7e-b4f41e7647e5", + "metadata": {}, + "outputs": [], + "source": [ + "def report(url_official, url_wiki, city):\n", + " city_website = Website(url_official)\n", + " wiki_website = Website(url_wiki)\n", + " response = openai.chat.completions.create(\n", + " model = \"gpt-4o-mini\",\n", + " messages = messages_for(city, city_website, wiki_website)\n", + " )\n", + " return response.choices[0].message.content" + ] + }, + { + "cell_type": "markdown", + "id": "08b47ec7-d00f-44e4-bbe2-580c8efd88e5", + "metadata": {}, + "source": [ + "# Raw Result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"830f0746-08a7-43ae-bd40-78d4a4c5d3e5", + "metadata": {}, + "outputs": [], + "source": [ + "report(\"https://www.rudaslaska.pl/\", \"https://pl.wikipedia.org/wiki/Ruda_%C5%9Al%C4%85ska\", \"Ruda ŚlΔ…ska\")" + ] + }, + { + "cell_type": "markdown", + "id": "a3630ac4-c103-4b84-a1a2-c246a702346e", + "metadata": {}, + "source": [ + "# Polished Result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b89dd543-998d-4466-abd8-cc785118d3e4", + "metadata": {}, + "outputs": [], + "source": [ + "def display_report(url_official, url_wiki, city):\n", + " rep = report(url_official, url_wiki, city)\n", + " display(Markdown(rep))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "157926f3-ba67-4d4b-abbb-24a2dcd85a8b", + "metadata": {}, + "outputs": [], + "source": [ + "display_report(\"https://www.rudaslaska.pl/\", \"https://pl.wikipedia.org/wiki/Ruda_%C5%9Al%C4%85ska\", \"Ruda ŚlΔ…ska\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "727d2283-e74c-4e74-86f2-759b08f1427a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/day1-dotabuff-summarization.ipynb b/week1/community-contributions/day1-dotabuff-summarization.ipynb new file mode 100644 index 0000000..08c5f73 --- /dev/null +++ b/week1/community-contributions/day1-dotabuff-summarization.ipynb @@ -0,0 +1,271 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "032a76d2-a112-4c49-bd32-fe6c87f6ec19", + "metadata": {}, + "source": [ + "## Dota Game Assistant\n", + "\n", + "This script 
retrieves and summarizes information about a specified hero from the `dotabuff.com` website." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04b24159-55d1-4eaf-bc19-474cec71cc3b", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install selenium\n", + "!pip install webdriver-manager" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14d26510-6613-4c1a-a346-159d906d111c", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display\n", + "from openai import OpenAI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9c8ea1e-8881-4f50-953d-ca7f462d8a32", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "\n", + "load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "# Check the key\n", + "\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02febcac-9a21-4322-b2ea-748972312165", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb7dd822-962e-4b34-a743-c14809764e4a", + "metadata": {}, + "outputs": [], + "source": [ + 
"# A class to represent a Webpage\n", + "\n", + "# Some websites need you to use proper headers when fetching them:\n", + "headers = {\n", + " \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n", + "}\n", + "\n", + "from selenium import webdriver\n", + "from selenium.webdriver.chrome.service import Service\n", + "from selenium.webdriver.chrome.options import Options\n", + "from selenium.webdriver.common.by import By\n", + "from selenium.webdriver.support.ui import WebDriverWait\n", + "from selenium.webdriver.support import expected_conditions as EC\n", + "from webdriver_manager.chrome import ChromeDriverManager\n", + "from bs4 import BeautifulSoup\n", + "\n", + "class Website:\n", + " def __init__(self, url, wait_time=10):\n", + " \"\"\"\n", + " Create this Website object from the given URL using Selenium and BeautifulSoup.\n", + " Uses headless Chrome to load JavaScript content.\n", + " \"\"\"\n", + " self.url = url\n", + "\n", + " # Configure headless Chrome (options.headless was deprecated and later removed in Selenium 4)\n", + " options = Options()\n", + " options.add_argument(\"--headless=new\")\n", + " options.add_argument(\"--disable-gpu\")\n", + " options.add_argument(\"--no-sandbox\")\n", + "\n", + " # Start the driver\n", + " service = Service(ChromeDriverManager().install())\n", + " driver = webdriver.Chrome(service=service, options=options)\n", + "\n", + " try:\n", + " driver.get(url)\n", + "\n", + " # Wait until body is loaded (you can tweak the wait condition)\n", + " WebDriverWait(driver, wait_time).until(\n", + " EC.presence_of_element_located((By.TAG_NAME, \"body\"))\n", + " )\n", + "\n", + " html = driver.page_source\n", + " soup = BeautifulSoup(html, \"html.parser\")\n", + "\n", + " self.title = soup.title.string.strip() if soup.title else \"No title found\"\n", + "\n", + " # Remove unwanted tags\n", + " for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n", + " irrelevant.decompose()\n", + "\n", + " self.text = 
soup.body.get_text(separator=\"\\n\", strip=True)\n", + "\n", + " finally:\n", + " driver.quit()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d833fbb-0115-4d99-a4e9-464f27900eab", + "metadata": {}, + "outputs": [], + "source": [ + "class DotaWebsite:\n", + " def __init__(self, hero):\n", + " web = Website(\"https://www.dotabuff.com/heroes\" + \"/\" + hero)\n", + " self.title = web.title\n", + " self.text = web.text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0a42c2b-c837-4d1b-b8f8-b2dbb8592a1a", + "metadata": {}, + "outputs": [], + "source": [ + "system_prompt = \"You are a game assistant that analyzes the contents of a website \\\n", + "and provides a short summary about facet selection, ability building, item building, best versus and worst versus, ignoring text that might be navigation related. \\\n", + "Respond in markdown.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c05843d-6373-4a76-8cca-9c716a6ca13a", + "metadata": {}, + "outputs": [], + "source": [ + "# A function that writes a User Prompt that asks for summaries of websites:\n", + "\n", + "def user_prompt_for(website):\n", + " user_prompt = f\"You are looking at a website titled {website.title}\"\n", + " user_prompt += \"\\nThe contents of this website is as follows; \\\n", + "please provide a short summary of facet selection, ability building, item building, best versus and worst versus in markdown. 
\\\n", + "If it includes news or announcements, then summarize these too.\\n\\n\"\n", + " user_prompt += website.text\n", + " return user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0145eee1-39e2-4f00-89ec-7acc6e375972", + "metadata": {}, + "outputs": [], + "source": [ + "# See how this function creates exactly the format above\n", + "\n", + "def messages_for(website):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(website)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76f389c0-572a-476b-9b4e-719c0ef10abb", + "metadata": {}, + "outputs": [], + "source": [ + "# And now: call the OpenAI API. You will get very familiar with this!\n", + "\n", + "def summarize(hero):\n", + " website = DotaWebsite(hero)\n", + " response = openai.chat.completions.create(\n", + " model = \"gpt-4o-mini\",\n", + " messages = messages_for(website)\n", + " )\n", + " return response.choices[0].message.content" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcb046b7-52a9-49ff-b7bc-d8f6c279df4c", + "metadata": {}, + "outputs": [], + "source": [ + "# A function to display this nicely in the Jupyter output, using markdown\n", + "\n", + "def display_summary(hero):\n", + " summary = summarize(hero)\n", + " display(Markdown(summary))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9befb685-2912-41a9-b2d9-ae33001494c0", + "metadata": {}, + "outputs": [], + "source": [ + "display_summary(\"axe\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf1bb1d9-0351-44fc-8ebf-91aa47a81b42", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": 
".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/day1-finviz_stock_analysis.ipynb b/week1/community-contributions/day1-finviz_stock_analysis.ipynb new file mode 100644 index 0000000..4165bde --- /dev/null +++ b/week1/community-contributions/day1-finviz_stock_analysis.ipynb @@ -0,0 +1,159 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "922bb144", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display\n", + "from openai import OpenAI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "870bdcd9", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "load_dotenv(override=True)\n", + "api_key = os.getenv(\"OPENAI_API_KEY\")\n", + "\n", + "# Check the key\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6146102", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f75573f", + "metadata": {}, + 
"outputs": [], + "source": [ + "class FinvizWebsite():\n", + " \"\"\"\n", + " Create this Website object from the given url using the BeautifulSoup library\n", + " \"\"\"\n", + " \n", + " def __init__(self, ticker):\n", + " self.ticker = ticker.upper()\n", + " self.url = f\"https://finviz.com/quote.ashx?t={self.ticker}&p=d&ty=ea\"\n", + " self.headers = {\n", + " \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n", + " }\n", + " response = requests.get(self.url, headers=self.headers)\n", + " soup = BeautifulSoup(response.content, \"html.parser\")\n", + " self.title = soup.title.string if soup.title else \"No title found\"\n", + " self.table = soup.find(\"table\", class_=\"snapshot-table2\") " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42c7ced6", + "metadata": {}, + "outputs": [], + "source": [ + "def messages_for(website):\n", + " system_prompt = \"\"\"\n", + " You are a financial analysis assistant that analyzes the contents of an HTML-formatted table\n", + " and provides a summary of the stock's analysis with clear and professional language appropriate for financial research, \n", + " with a bulleted list of important **pros** and **cons**, ignoring text that might be navigation related. 
Respond in markdown.\n", + " \"\"\"\n", + " \n", + " user_prompt = f\"\"\"\n", + " You are looking at a website titled {website.title}.\\n\n", + " The contents of this website is as follows; please provide a summary of the stock's analysis from this website in markdown.\\n\\n\n", + " {website.table}\n", + " \"\"\"\n", + " \n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bfaa6da", + "metadata": {}, + "outputs": [], + "source": [ + "def display_summary(ticker):\n", + " website = FinvizWebsite(ticker)\n", + " response = openai.chat.completions.create(\n", + " model = \"gpt-4o-mini\",\n", + " messages = messages_for(website)\n", + " )\n", + " summary = response.choices[0].message.content\n", + " display(Markdown(summary))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eeeff6f7", + "metadata": {}, + "outputs": [], + "source": [ + "display_summary(\"aapl\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5aed2001", + "metadata": {}, + "outputs": [], + "source": [ + "display_summary(\"tsla\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/day1-mail_subject_creation.ipynb b/week1/community-contributions/day1-mail_subject_creation.ipynb new file mode 100644 index 0000000..fd808bf --- /dev/null +++ b/week1/community-contributions/day1-mail_subject_creation.ipynb @@ -0,0 +1,156 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": 
"72a6552c-c837-4ced-b7c8-75a3d4cf777d", + "metadata": {}, + "source": [ + "

# Mail Subject Creation\n", + "\n", + "Write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.
" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "76822a8b-d6e0-4dd9-a801-2d34bd104b7d", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "1a9de873-d24b-42fb-8f4a-a08f429050f5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "API key found and looks good so far!\n" + ] + } + ], + "source": [ + "load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "# Check the key\n", + "\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "122af5d6-4727-4229-b85a-ea5246ff540c", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "b9a2c2c2-ac10-4019-aeef-2bfe6cc7b1f3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Subject: Missing API Logs for June 22nd: Scheduled Meeting to Address Issue\n" + ] + } + ], + "source": [ + "system_prompt = \"You are an assistant which can generate a subject line as output by taking email of content as input. 
The subject line should be self explanatory.\"\n", + "user_prompt = \"\"\"\n", + " Below is the content of the email which I am giving as input\n", + " Mail Content - 'Hi Team,\n", + "\n", + "We have observed that the API logs for June 22nd between 6:00 AM and 12:00 PM are missing in Kibana.\n", + "\n", + "The SA team has confirmed that there were no errors reported on their end during this period.\n", + "\n", + "The DevOps team has verified that logs were being sent as expected.\n", + "\n", + "Upon checking the Fluentd pods, no errors were found.\n", + "\n", + "Logs were being shipped to td-agent as usual.\n", + "\n", + "No configuration changes or pod restarts were detected.\n", + "\n", + "We have also confirmed that no code changes were deployed from our side during this time.\n", + "\n", + "Bucket: api_application_log\n", + "Ticket\n", + "\n", + "We have scheduled a meeting with the SA and DevOps teams to restore the missing logs, as they are critical for our weekly report and analysis.'\n", + "\"\"\"\n", + "\n", + "# Step 2: Make the messages list\n", + "\n", + "messages = [ {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}]\n", + "\n", + "# Step 3: Call OpenAI\n", + "\n", + "response = openai.chat.completions.create(\n", + " model = \"gpt-4o-mini\",\n", + " messages = messages\n", + " )\n", + "\n", + "# Step 4: print the result\n", + "\n", + "print(response.choices[0].message.content)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/day1_Project.ipynb 
b/week1/community-contributions/day1_Project.ipynb new file mode 100644 index 0000000..30e795c --- /dev/null +++ b/week1/community-contributions/day1_Project.ipynb @@ -0,0 +1,189 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "181edd2d-67d4-43e4-9a89-327eaff26177", + "metadata": {}, + "source": [ + "Grammar and Vocab AI Checker" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4be465e2-16fc-4b34-a771-d23f05edbc14", + "metadata": {}, + "outputs": [], + "source": [ + "pip install PyMuPDF" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66b371fb-f4ea-4ced-8ad2-4229892e0647", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display\n", + "from openai import OpenAI\n", + "import fitz # PyMuPDF" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41068273-4325-4de2-b11d-37d2831b1a47", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "\n", + "load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "# Check the key\n", + "\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"ba003970-0cc9-4e11-8702-0b120f378fa4", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "faa89067-fcee-4950-b4ce-3faec640c79b", + "metadata": {}, + "outputs": [], + "source": [ + "system_prompt = \"You are a spell, grammar, and vocabulary checker. You check for any mistakes in terms of spelling, grammar, and vocabulary of texts or files that are given to you. You provide a response with the percentage of the text that is correct in terms of spelling, vocab, and grammar but also the total number of words. These characters is in the file or text that you are checking, and provide instructions in bullet points on how to fix them and where the mistakes are.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de32a94d-9c1b-4e1a-a1b9-78d3180c0d79", + "metadata": {}, + "outputs": [], + "source": [ + "# user_prompt = \"Hi, mw namw is kkkdvin. How are y,?\" # Uncomment this to test the implementation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "272f379d-3471-488d-ba27-bbffff961d72", + "metadata": {}, + "outputs": [], + "source": [ + "def extract_pdf_text_to_string(pdf_path):\n", + " \"\"\"\n", + " Extracts all text from a PDF file and returns it as a single string.\n", + "\n", + " Args:\n", + " pdf_path (str): The path to the PDF file.\n", + "\n", + " Returns:\n", + " str: A string containing all the extracted text from the PDF.\n", + " \"\"\"\n", + " text_content = \"\"\n", + " try:\n", + " doc = fitz.open(pdf_path)\n", + " for page_num in range(doc.page_count):\n", + " page = doc.load_page(page_num)\n", + " text_content += page.get_text()\n", + " doc.close()\n", + " except Exception as e:\n", + " print(f\"Error processing PDF: {e}\")\n", + " return None\n", + " return text_content\n", + "\n", + "pdf_file_path = \"gram-vocab-test.pdf\" # Replace with the actual path to your PDF\n", + "user_prompt = 
extract_pdf_text_to_string(pdf_file_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07a839f6-c508-4b94-98ec-877c19023e58", + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": f\"This is the text to check for grammar, vocab, and spelling errors: {user_prompt}\"}\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a642cb62-9016-4957-a74e-9f97f8c495a7", + "metadata": {}, + "outputs": [], + "source": [ + "response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ce6b006-19b6-48b4-b344-b4b57b8c1438", + "metadata": {}, + "outputs": [], + "source": [ + "print(response.choices[0].message.content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54bc23cd-f59c-4b4d-bc3e-60f273692d92", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/day1_check_source_for_security_vuln.ipynb b/week1/community-contributions/day1_check_source_for_security_vuln.ipynb new file mode 100644 index 0000000..db99309 --- /dev/null +++ b/week1/community-contributions/day1_check_source_for_security_vuln.ipynb @@ -0,0 +1,156 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "e95fa36b-7118-4fd8-a3b2-b4424bda2178", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from 
dotenv import load_dotenv\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display\n", + "from openai import OpenAI\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0356762-4a3f-437a-908e-192aa9c804c7", + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "# Check the key\n", + "\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb747863-30bd-4a0b-b359-b37223884075", + "metadata": {}, + "outputs": [], + "source": [ + "openai = OpenAI()\n", + "message = \"Hello, GPT! This is my first ever message to you! 
Hi!\"\n", + "response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=[{\"role\":\"user\", \"content\":message}])\n", + "print(response.choices[0].message.content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fae60901-3564-4f26-a812-fc16d3b95bdb", + "metadata": {}, + "outputs": [], + "source": [ + "def get_page_source(url):\n", + " response = requests.get(url)\n", + " response.raise_for_status() # Hata varsa bildirir\n", + " return response.text # Ham HTML metni dΓΆner\n", + "\n", + "system_prompt = \"You are an assistant analyzing the source of a website and checking for security vulnerabilities.\"\n", + "\n", + "def user_prompt_for(url):\n", + " user_prompt = \"Below is the HTML source of the website:\\n\\n\"\n", + " user_prompt += get_page_source(url) \n", + " user_prompt += \"\\n\\nPlease check this website and search for security vulnerabilities. \"\n", + " user_prompt += \"If you don't find any, print 'No vulnerability found.' \"\n", + " user_prompt += \"If you find a potential vulnerability risk, describe the vulnerability risk and print 'Potential Vulnerability Risk'.\"\n", + " user_prompt += \"If you find a direct, explicit vulnerability, describe the vulnerability and CVSS Score print 'ATTENTION! Vulnerability is Found.'\"\n", + " user_prompt += \"If you find both a potential vulnerability risk and a direct, explicit vulnerability, describe them and CVSS Score print 'ATTENTION! 
Potential Vulnerability Risk and Direct Vulnerability are Found!!'\"\n", + " return user_prompt\n", + "\n", + "def messages_for(url):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(url)}\n", + " ]\n", + "\n", + "def check_vuln(url):\n", + " response = openai.chat.completions.create(\n", + " model = \"gpt-4o-mini\",\n", + " messages = messages_for(url)\n", + " )\n", + " return response.choices[0].message.content\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e048c27f-f659-4c92-a47c-679bf6e5bf5f", + "metadata": {}, + "outputs": [], + "source": [ + "def display_vuln(url):\n", + " display_vuln = check_vuln(url)\n", + " display(Markdown(display_vuln))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69f5852f-ca5b-4933-b93c-e9f2d401467a", + "metadata": {}, + "outputs": [], + "source": [ + "display_vuln(\"https://edwarddonner.com\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "824943fc-e5a5-424a-abec-56767a709782", + "metadata": {}, + "outputs": [], + "source": [ + "display_vuln(\"http://192.168.1.113/\") #local apache server IP, contains xss_vulnerable_example.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3543846-e0c6-4504-8b65-2f675f0f7ebe", + "metadata": {}, + "outputs": [], + "source": [ + "display_vuln(\"https://www.google.com\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/day2-EXERCISE-ollama-local.ipynb 
b/week1/community-contributions/day2-EXERCISE-ollama-local.ipynb new file mode 100644 index 0000000..6942c54 --- /dev/null +++ b/week1/community-contributions/day2-EXERCISE-ollama-local.ipynb @@ -0,0 +1,459 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9", + "metadata": {}, + "source": [ + "# Welcome to your first assignment!\n", + "\n", + "Instructions are below. Please give this a try, and look in the solutions folder if you get stuck (or feel free to ask me!)" + ] + }, + { + "cell_type": "markdown", + "id": "ada885d9-4d42-4d9b-97f0-74fbbbfe93a9", + "metadata": {}, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + "

Just before we get to the assignment --

\n", + " I thought I'd take a second to point you at this page of useful resources for the course. This includes links to all the slides.
\n", + " https://edwarddonner.com/2024/11/13/llm-engineering-resources/
\n", + " Please keep this bookmarked, and I'll continue to add more useful links there over time.\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "6e9fa1fc-eac5-4d1d-9be4-541b3f2b3458", + "metadata": {}, + "source": [ + "# HOMEWORK EXERCISE ASSIGNMENT\n", + "\n", + "Upgrade the day 1 project to summarize a webpage to use an Open Source model running locally via Ollama rather than OpenAI\n", + "\n", + "You'll be able to use this technique for all subsequent projects if you'd prefer not to use paid APIs.\n", + "\n", + "**Benefits:**\n", + "1. No API charges - open-source\n", + "2. Data doesn't leave your box\n", + "\n", + "**Disadvantages:**\n", + "1. Significantly less power than Frontier Model\n", + "\n", + "## Recap on installation of Ollama\n", + "\n", + "Simply visit [ollama.com](https://ollama.com) and install!\n", + "\n", + "Once complete, the ollama server should already be running locally. \n", + "If you visit: \n", + "[http://localhost:11434/](http://localhost:11434/)\n", + "\n", + "You should see the message `Ollama is running`. \n", + "\n", + "If not, bring up a new Terminal (Mac) or Powershell (Windows) and enter `ollama serve` \n", + "And in another Terminal (Mac) or Powershell (Windows), enter `ollama pull llama3.2` \n", + "Then try [http://localhost:11434/](http://localhost:11434/) again.\n", + "\n", + "If Ollama is slow on your machine, try using `llama3.2:1b` as an alternative. 
Run `ollama pull llama3.2:1b` from a Terminal or Powershell, and change the code below from `MODEL = \"llama3.2\"` to `MODEL = \"llama3.2:1b\"`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e2a9393-7767-488e-a8bf-27c12dca35bd", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import requests\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29ddd15d-a3c5-4f4e-a678-873f56162724", + "metadata": {}, + "outputs": [], + "source": [ + "# Constants\n", + "\n", + "OLLAMA_API = \"http://localhost:11434/api/chat\"\n", + "HEADERS = {\"Content-Type\": \"application/json\"}\n", + "MODEL = \"llama3.2\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dac0a679-599c-441f-9bf2-ddc73d35b940", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a messages list using the same format that we used for OpenAI\n", + "\n", + "messages = [\n", + " {\"role\": \"user\", \"content\": \"Describe some of the business applications of Generative AI\"}\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bb9c624-14f0-4945-a719-8ddb64f66f47", + "metadata": {}, + "outputs": [], + "source": [ + "payload = {\n", + " \"model\": MODEL,\n", + " \"messages\": messages,\n", + " \"stream\": False\n", + " }" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "479ff514-e8bd-4985-a572-2ea28bb4fa40", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's just make sure the model is loaded\n", + "\n", + "!ollama pull llama3.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42b9f644-522d-4e05-a691-56e7658c0ea9", + "metadata": {}, + "outputs": [], + "source": [ + "# If this doesn't work for any reason, try the 2 versions in the following cells\n", + "# And double check the instructions in the 'Recap on installation of Ollama' at the top 
of this lab\n", + "# And if none of that works - contact me!\n", + "\n", + "response = requests.post(OLLAMA_API, json=payload, headers=HEADERS)\n", + "print(response.json()['message']['content'])" + ] + }, + { + "cell_type": "markdown", + "id": "6a021f13-d6a1-4b96-8e18-4eae49d876fe", + "metadata": {}, + "source": [ + "# Introducing the ollama package\n", + "\n", + "And now we'll do the same thing, but using the elegant ollama python package instead of a direct HTTP call.\n", + "\n", + "Under the hood, it's making the same call as above to the ollama server running at localhost:11434" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7745b9c4-57dc-4867-9180-61fa5db55eb8", + "metadata": {}, + "outputs": [], + "source": [ + "import ollama\n", + "\n", + "response = ollama.chat(model=MODEL, messages=messages)\n", + "print(response['message']['content'])" + ] + }, + { + "cell_type": "markdown", + "id": "a4704e10-f5fb-4c15-a935-f046c06fb13d", + "metadata": {}, + "source": [ + "## Alternative approach - using OpenAI python library to connect to Ollama" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23057e00-b6fc-4678-93a9-6b31cb704bff", + "metadata": {}, + "outputs": [], + "source": [ + "# There's actually an alternative approach that some people might prefer\n", + "# You can use the OpenAI client python library to call Ollama:\n", + "\n", + "from openai import OpenAI\n", + "ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n", + "\n", + "response = ollama_via_openai.chat.completions.create(\n", + " model=MODEL,\n", + " messages=messages\n", + ")\n", + "\n", + "print(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "id": "9f9e22da-b891-41f6-9ac9-bd0c0a5f4f44", + "metadata": {}, + "source": [ + "## Are you confused about why that works?\n", + "\n", + "It seems strange, right? We just used OpenAI code to call Ollama?? 
What's going on?!\n", + "\n", + "Here's the scoop:\n", + "\n", + "The python class `OpenAI` is simply code written by OpenAI engineers that makes calls over the internet to an endpoint. \n", + "\n", + "When you call `openai.chat.completions.create()`, this python code just makes a web request to the following url: \"https://api.openai.com/v1/chat/completions\"\n", + "\n", + "Code like this is known as a \"client library\" - it's just wrapper code that runs on your machine to make web requests. The actual power of GPT is running on OpenAI's cloud behind this API, not on your computer!\n", + "\n", + "OpenAI was so popular that lots of other AI providers provided identical web endpoints, so you could use the same approach.\n", + "\n", + "So Ollama has an endpoint running on your local box at http://localhost:11434/v1/chat/completions \n", + "And in week 2 we'll discover that lots of other providers do this too, including Gemini and DeepSeek.\n", + "\n", + "And then the team at OpenAI had a great idea: they can extend their client library so you can specify a different 'base url', and use their library to call any compatible API.\n", + "\n", + "That's it!\n", + "\n", + "So when you say: `ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')` \n", + "Then this will make the same endpoint calls, but to Ollama instead of OpenAI." + ] + }, + { + "cell_type": "markdown", + "id": "bc7d1de3-e2ac-46ff-a302-3b4ba38c4c90", + "metadata": {}, + "source": [ + "## Also trying the amazing reasoning model DeepSeek\n", + "\n", + "Here we use the version of DeepSeek-reasoner that's been distilled to 1.5B. \n", + "This is actually a 1.5B variant of Qwen that has been fine-tuned using synthetic data generated by DeepSeek R1.\n", + "\n", + "Other sizes of DeepSeek are [here](https://ollama.com/library/deepseek-r1) all the way up to the full 671B parameter version, which would use up 404GB of your drive and is far too large for most!"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf9eb44e-fe5b-47aa-b719-0bb63669ab3d", + "metadata": {}, + "outputs": [], + "source": [ + "!ollama pull deepseek-r1:1.5b" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d3d554b-e00d-4c08-9300-45e073950a76", + "metadata": {}, + "outputs": [], + "source": [ + "# This may take a few minutes to run! You should then see a fascinating \"thinking\" trace inside <think> tags, followed by some decent definitions\n", + "\n", + "response = ollama_via_openai.chat.completions.create(\n", + " model=\"deepseek-r1:1.5b\",\n", + " messages=[{\"role\": \"user\", \"content\": \"Please give definitions of some core concepts behind LLMs: a neural network, attention and the transformer\"}]\n", + ")\n", + "\n", + "print(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "id": "1622d9bb-5c68-4d4e-9ca4-b492c751f898", + "metadata": {}, + "source": [ + "# NOW the exercise for you\n", + "\n", + "Take the code from day1 and incorporate it here, to build a website summarizer that uses Llama 3.2 running locally instead of OpenAI; use either of the above approaches."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6de38216-6d1c-48c4-877b-86d403f4e0f8", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from bs4 import BeautifulSoup\n", + "from IPython.display import Markdown, display" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bd2aea1-d7d7-499f-b704-5b13e2ddd23f", + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = \"llama3.2\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6df3141a-0a46-4ff9-ae73-bf8bee2aa3d8", + "metadata": {}, + "outputs": [], + "source": [ + "# A class to represent a Webpage\n", + "\n", + "class Website:\n", + " \"\"\"\n", + " A utility class to represent a Website that we have scraped\n", + " \"\"\"\n", + " url: str\n", + " title: str\n", + " text: str\n", + "\n", + " def __init__(self, url):\n", + " \"\"\"\n", + " Create this Website object from the given url using the BeautifulSoup library\n", + " \"\"\"\n", + " self.url = url\n", + " response = requests.get(url)\n", + " soup = BeautifulSoup(response.content, 'html.parser')\n", + " self.title = soup.title.string if soup.title else \"No title found\"\n", + " for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n", + " irrelevant.decompose()\n", + " self.text = soup.body.get_text(separator=\"\\n\", strip=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df2ea48b-7343-47be-bdcb-52b63a4de43e", + "metadata": {}, + "outputs": [], + "source": [ + "# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.\"\n", + "\n", + "system_prompt = \"You are an assistant that analyzes the contents of a website \\\n", + "and provides a short summary, ignoring text that might be navigation related. 
\\\n", + "Respond in markdown.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80f1a534-ae2a-4283-83cf-5e7c5765c736", + "metadata": {}, + "outputs": [], + "source": [ + "# A function that writes a User Prompt that asks for summaries of websites:\n", + "\n", + "def user_prompt_for(website):\n", + " user_prompt = f\"You are looking at a website titled {website.title}\"\n", + " user_prompt += \"The contents of this website is as follows; \\\n", + "please provide a short summary of this website in markdown. \\\n", + "If it includes news or announcements, then summarize these too.\\n\\n\"\n", + " user_prompt += website.text\n", + " return user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dfe658d-e3f9-4b32-90e6-1a523f47f836", + "metadata": {}, + "outputs": [], + "source": [ + "# See how this function creates exactly the format above\n", + "\n", + "def messages_for(website):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(website)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2e2a09d0-bc47-490e-b085-fe3ccfbd16ad", + "metadata": {}, + "outputs": [], + "source": [ + "# And now: call the Ollama function instead of OpenAI\n", + "\n", + "def summarize(url):\n", + " website = Website(url)\n", + " messages = messages_for(website)\n", + " response = ollama.chat(model=MODEL, messages=messages)\n", + " return response['message']['content']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "340e08a2-86f0-4cdd-9188-da2972cae7a6", + "metadata": {}, + "outputs": [], + "source": [ + "# A function to display this nicely in the Jupyter output, using markdown\n", + "\n", + "def display_summary(url):\n", + " summary = summarize(url)\n", + " display(Markdown(summary))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55e4790a-013c-40cf-9dff-bb5ec1d53964", + 
"metadata": {}, + "outputs": [], + "source": [ + "display_summary(\"https://zhufqiu.com\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a96cbad-1306-4ce1-a942-2448f50d6751", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/sruthi-day1-ollama_website_summarizer.py b/week1/community-contributions/sruthi-day1-ollama_website_summarizer.py new file mode 100644 index 0000000..d2750f7 --- /dev/null +++ b/week1/community-contributions/sruthi-day1-ollama_website_summarizer.py @@ -0,0 +1,84 @@ +""" +Project: Web Content Summarizer using Ollama's llama3.2 model +- Developed a Python tool to extract and summarize website content using Ollama's llama3.2 model and BeautifulSoup. +- Implemented secure API integration and HTTP requests with custom headers to mimic browser behavior. +""" + +import os +import requests +from bs4 import BeautifulSoup +import ollama + +# Constants + +OLLAMA_API = "http://localhost:11434/api/chat" +HEADERS = {"Content-Type": "application/json"} +MODEL = "llama3.2" + +# Define the Website class to fetch and parse website content +class Website: + def __init__(self, url): + """ + Initialize a Website object by fetching and parsing the given URL. + Uses BeautifulSoup to extract the title and text content of the page. 
+ """ + self.url = url + response = requests.get(url, headers=HEADERS) + soup = BeautifulSoup(response.content, 'html.parser') + + # Extract the title of the website + self.title = soup.title.string if soup.title else "No title found" + + # Remove irrelevant elements like scripts, styles, images, and inputs + for irrelevant in soup.body(["script", "style", "img", "input"]): + irrelevant.decompose() + + # Extract the main text content of the website + self.text = soup.body.get_text(separator="\n", strip=True) + +# Define the system prompt for the OpenAI model +system_prompt = ( + "You are an assistant that analyzes the contents of a website " + "and provides a short summary, ignoring text that might be navigation related. " + "Respond in markdown." +) + +# Function to generate the user prompt based on the website content +def user_prompt_for(website): + """ + Generate a user prompt for the llama3.2 model based on the website's title and content. + """ + user_prompt = f"You are looking at a website titled {website.title}" + user_prompt += "\nThe contents of this website is as follows; summarize these.\n\n" + user_prompt += website.text + return user_prompt + +# Function to create the messages list for the OpenAI API +def messages_for(website): + """ + Create a list of messages for the ollama, including the system and user prompts. + """ + return [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt_for(website)} + ] + +# Function to summarize the content of a given URL +def summarize(url): + """ + Summarize the content of the given URL using the OpenAI API. 
+ """ + # Create a Website object to fetch and parse the URL + website = Website(url) + + # Call the llama3.2 using ollama with the generated messages + response = ollama.chat( + model= MODEL, + messages=messages_for(website) + ) + + # Return the summary generated by ollama + print(response.message.content) + +# Example usage: Summarize the content of a specific URL +summarize("https://sruthianem.com") \ No newline at end of file diff --git a/week1/community-contributions/week1-EXERCISE-openai-ollama-tech-assistant.ipynb b/week1/community-contributions/week1-EXERCISE-openai-ollama-tech-assistant.ipynb new file mode 100644 index 0000000..0706bfc --- /dev/null +++ b/week1/community-contributions/week1-EXERCISE-openai-ollama-tech-assistant.ipynb @@ -0,0 +1,202 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "fe12c203-e6a6-452c-a655-afb8a03a4ff5", + "metadata": {}, + "source": [ + "# End of week 1 exercise\n", + "\n", + "To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question, \n", + "and responds with an explanation. This is a tool that you will be able to use yourself during the course!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1070317-3ed9-4659-abe3-828943230e03", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "from dotenv import load_dotenv\n", + "from IPython.display import Markdown, display, update_display\n", + "from openai import OpenAI\n", + "import ollama" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a456906-915a-4bfd-bb9d-57e505c5093f", + "metadata": {}, + "outputs": [], + "source": [ + "# constants\n", + "\n", + "MODEL_GPT = 'gpt-4o-mini'\n", + "MODEL_LLAMA = 'llama3.2'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8d7923c-5f28-4c30-8556-342d7c8497c1", + "metadata": {}, + "outputs": [], + "source": [ + "# set up environment\n", + "\n", + "load_dotenv(override=True)\n", + "api_key = os.getenv('OPENAI_API_KEY')\n", + "\n", + "# Check the key\n", + "\n", + "if not api_key:\n", + " print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n", + "elif not api_key.startswith(\"sk-proj-\"):\n", + " print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n", + "elif api_key.strip() != api_key:\n", + " print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n", + "else:\n", + " print(\"API key found and looks good so far!\")\n", + "\n", + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f0d0137-52b0-47a8-81a8-11a90a010798", + "metadata": {}, + "outputs": [], + "source": [ + "# here is the question; type over this to ask something new\n", + "\n", + "question = \"\"\"\n", + "Please explain what this code does and why:\n", + "yield from {book.get(\"author\") for book in books if book.get(\"author\")}\n", + "\"\"\"" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "1f879b7e-5ecc-4ec6-b269-78b6e2ed3480", + "metadata": {}, + "outputs": [], + "source": [ + "# prompts\n", + "\n", + "system_prompt = \"You are a helpful tutor who answers technical questions about programming code(especially python code), software engineering, data science and LLMs\"\n", + "user_prompt = \"Please give a detailed explanation to the following question: \" + question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ac74ae5-af61-4a5d-b991-554fa67cd3d1", + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60ce7000-a4a5-4cce-a261-e75ef45063b4", + "metadata": {}, + "outputs": [], + "source": [ + "# Get gpt-4o-mini to answer, with streaming\n", + "stream = openai.chat.completions.create(\n", + " model=MODEL_GPT,\n", + " messages=messages,\n", + " stream=True\n", + " )\n", + " \n", + "response = \"\"\n", + "display_handle = display(Markdown(\"\"), display_id=True)\n", + "for chunk in stream:\n", + " response += chunk.choices[0].delta.content or ''\n", + " response = response.replace(\"```\",\"\").replace(\"markdown\", \"\")\n", + " update_display(Markdown(response), display_id=display_handle.display_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f7c8ea8-4082-4ad0-8751-3301adcf6538", + "metadata": {}, + "outputs": [], + "source": [ + "# Get Llama 3.2 to answer\n", + "\n", + "OLLAMA_API = \"http://localhost:11434/api/chat\"\n", + "HEADERS = {\"Content-Type\": \"application/json\"}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4bd10d96-ee72-4c86-acd8-4fa417c25960", + "metadata": {}, + "outputs": [], + "source": [ + "!ollama pull llama3.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"d889d514-0478-4d7f-aabf-9a7bc743adb1", + "metadata": {}, + "outputs": [], + "source": [ + "stream = ollama.chat(model=MODEL_LLAMA, messages=messages, stream=True)\n", + "\n", + "response = \"\"\n", + "display_handle = display(Markdown(\"\"), display_id=True)\n", + "for chunk in stream:\n", + " response += chunk.get(\"message\", {}).get(\"content\", \"\")\n", + " response = response.replace(\"```\",\"\").replace(\"markdown\", \"\")\n", + " update_display(Markdown(response), display_id=display_handle.display_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "452d442a-f3b0-42ad-89d2-a8dc664e8bb6", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week1/community-contributions/xss_vulnerable_example.html b/week1/community-contributions/xss_vulnerable_example.html new file mode 100644 index 0000000..6e1056c --- /dev/null +++ b/week1/community-contributions/xss_vulnerable_example.html @@ -0,0 +1,24 @@ + + + + + XSS Vulnerability Example + + +

+ <h1>Leave a Comment</h1>
+ <form id="comment-form">
+ <label for="comment">Your Comment:</label>
+ <input type="text" id="comment" name="comment">
+ <button type="submit">Submit</button>
+ </form>
+ <div id="comments"></div>
+ <script>
+ // Vulnerable: the comment is inserted into the page via innerHTML without sanitization
+ document.getElementById("comment-form").addEventListener("submit", function (event) {
+ event.preventDefault();
+ document.getElementById("comments").innerHTML += document.getElementById("comment").value;
+ });
+ </script>
+ + \ No newline at end of file diff --git a/week2/community-contributions/Agent_translate_gemini.ipynb b/week2/community-contributions/Agent_translate_gemini.ipynb new file mode 100644 index 0000000..fe62337 --- /dev/null +++ b/week2/community-contributions/Agent_translate_gemini.ipynb @@ -0,0 +1,143 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d006b2ea-9dfe-49c7-88a9-a5a0775185fd", + "metadata": {}, + "source": [ + "# Additional End of week Exercise - week 2\n", + "\n", + "Now use everything you've learned from Week 2 to build a full prototype for the technical question/answerer you built in Week 1 Exercise.\n", + "\n", + "This should include a Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models. Bonus points if you can demonstrate use of a tool!\n", + "\n", + "If you feel bold, see if you can add audio input so you can talk to it, and have it respond with audio. ChatGPT or Claude can help you, or email me if you have questions.\n", + "\n", + "I will publish a full solution here soon - unless someone beats me to it...\n", + "\n", + "There are so many commercial applications for this, from a language tutor, to a company onboarding solution, to a companion AI to a course (like this one!) I can't wait to see your results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a07e7793-b8f5-44f4-aded-5562f633271a", + "metadata": {}, + "outputs": [], + "source": [ + "# Agent that can listen for audio and convert it to text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da58ed0f-f781-4c51-8e5d-fdb05db98c8c", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import gradio as gr\n", + "import google.generativeai as genai\n", + "from dotenv import load_dotenv\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "078cf34a-881e-44f4-9947-c45d7fe992a3", + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv()\n", + "\n", + "google_api_key = os.getenv('GOOGLE_API_KEY')\n", + "if google_api_key:\n", + " print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n", + "else:\n", + " print(\"Google API Key not set\")\n", + "\n", + "genai.configure(api_key=google_api_key)\n", + "model = genai.GenerativeModel(\"gemini-2.0-flash\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f77228ea-d0e1-4434-9191-555a6d680625", + "metadata": {}, + "outputs": [], + "source": [ + "def transcribe_translate_with_gemini(audio_file_path):\n", + " if not audio_file_path:\n", + " return \"⚠️ No audio file received.\"\n", + "\n", + " prompt = (\n", + " \"You're an AI that listens to a voice message in any language and returns the English transcription. \"\n", + " \"Please transcribe and translate the following audio to English. 
If already in English, just transcribe it.\"\n", + " )\n", + "\n", + " uploaded_file = genai.upload_file(audio_file_path)\n", + "\n", + " # πŸ” Send prompt + uploaded audio reference to Gemini\n", + " response = model.generate_content(\n", + " contents=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"parts\": [\n", + " {\"text\": prompt},\n", + " uploaded_file \n", + " ]\n", + " }\n", + " ]\n", + " )\n", + "\n", + " return response.text.strip()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb6c6d1e-1be3-404d-83f3-fc0855dc9f67", + "metadata": {}, + "outputs": [], + "source": [ + "gr.Interface(\n", + " fn=transcribe_translate_with_gemini,\n", + " inputs=gr.Audio(label=\"Record voice\", type=\"filepath\"),\n", + " outputs=\"text\",\n", + " title=\"πŸŽ™οΈ Voice-to-English Translator (Gemini Only)\",\n", + " description=\"Speak in any language and get the English transcription using Gemini multimodal API.\"\n", + ").launch()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b105082-e388-44bc-9617-1a81f38e2f3f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/FlightAI-exercise.ipynb b/week2/community-contributions/FlightAI-exercise.ipynb new file mode 100644 index 0000000..f6c96ca --- /dev/null +++ b/week2/community-contributions/FlightAI-exercise.ipynb @@ -0,0 +1,654 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d006b2ea-9dfe-49c7-88a9-a5a0775185fd", + "metadata": {}, + "source": [ + "# Additional End of week Exercise - week 
2\n", + "\n", + "Now use everything you've learned from Week 2 to build a full prototype for the technical question/answerer you built in Week 1 Exercise.\n", + "\n", + "This should include a Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models. Bonus points if you can demonstrate use of a tool!\n", + "\n", + "If you feel bold, see if you can add audio input so you can talk to it, and have it respond with audio. ChatGPT or Claude can help you, or email me if you have questions.\n", + "\n", + "I will publish a full solution here soon - unless someone beats me to it...\n", + "\n", + "There are so many commercial applications for this, from a language tutor, to a company onboarding solution, to a companion AI to a course (like this one!) I can't wait to see your results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a07e7793-b8f5-44f4-aded-5562f633271a", + "metadata": {}, + "outputs": [], + "source": [ + "# Imports\n", + "\n", + "import os\n", + "import json\n", + "import base64\n", + "import logging\n", + "import gradio as gr\n", + "from PIL import Image\n", + "from io import BytesIO\n", + "from openai import OpenAI\n", + "from dotenv import load_dotenv\n", + "from IPython.display import Audio, display" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e879f6ae-b246-479d-8f81-94e47a9072ec", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialization\n", + "logging.basicConfig(level=logging.INFO)\n", + "load_dotenv(override=True)\n", + "\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "if openai_api_key:\n", + " logging.info(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " logging.error(\"OpenAI API Key not set\")\n", + " \n", + "MODEL = \"gpt-4o-mini\"\n", + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4455169-9e5e-4171-92e8-6f850a06f6e3", + "metadata": {}, + "outputs": 
[], + "source": [ + "system_message = (\n", + " \"You are a helpful assistant for an airline called FlightAI. \"\n", + " \"Always respond in a short, courteous sentence. \"\n", + " \"Provide accurate information only. \"\n", + " \"If you don’t know something, say so clearly. \"\n", + " \"Before booking a ticket, strictly follow this order: \"\n", + " \"1) Check if the destination is available, \"\n", + " \"2) Then check the ticket price, \"\n", + " \"3) Collect all neccessary details like name, destination and date of journey, \"\n", + " \"4) Only then proceed with the booking. \"\n", + " \"Always use the appropriate tools or APIs for each step before confirming a booking.\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4bab8e2c-e2b1-4421-a95b-7f1251670817", + "metadata": {}, + "outputs": [], + "source": [ + "# Dummy funcs that mimic the ticket booking behaviour\n", + "# Replace these will real funcs (that call APIs or make DB transactions) to actually book a ticket\n", + "\n", + "ticket_prices = {\n", + " \"london\": \"$799\",\n", + " \"paris\": \"$899\",\n", + " \"tokyo\": \"$1400\",\n", + " \"berlin\": \"$499\"\n", + "}\n", + "\n", + "def check_destination_availability(destination: str) -> dict:\n", + " \"\"\"\n", + " Check if the given destination is available in our ticketing system.\n", + " \n", + " Args:\n", + " destination (str): The name of the city.\n", + " \n", + " Returns:\n", + " dict: {\"available\": bool}\n", + " \"\"\"\n", + " logging.info(f\"Checking availability for destination: {destination}\")\n", + " \n", + " available = destination.lower() in ticket_prices\n", + " return {\"available\": available}\n", + "\n", + "\n", + "def fetch_ticket_price(destination_city: str) -> dict:\n", + " \"\"\"\n", + " Retrieve the ticket price for a given city.\n", + " \n", + " Args:\n", + " destination_city (str): The name of the destination city.\n", + " \n", + " Returns:\n", + " dict: {\"price\": str} or {\"price\": 
\"Unknown\"} if not found\n", + " \"\"\"\n", + " logging.info(f\"Retrieving price for destination: {destination_city}\")\n", + " \n", + " city = destination_city.lower()\n", + " price = ticket_prices.get(city, \"Unknown\")\n", + " \n", + " return {\"price\": price}\n", + "\n", + "\n", + "def book_ticket(name: str, destination_city: str, journey_date: str) -> dict:\n", + " \"\"\"\n", + " Book a ticket to a destination city for a given user and date.\n", + " \n", + " Args:\n", + " name (str): Name of the passenger.\n", + " destination_city (str): Destination city.\n", + " journey_date (str): Date of journey in YYYY-MM-DD format.\n", + " \n", + " Returns:\n", + " dict: Booking confirmation with name, city, price, and date, or error.\n", + " \"\"\"\n", + " logging.info(f\"Booking ticket for {name} to {destination_city} on {journey_date}\")\n", + " \n", + " city = destination_city.lower()\n", + "\n", + " if city not in ticket_prices:\n", + " logging.error(f\"City '{destination_city}' not found in ticket list.\")\n", + " return {\"error\": \"Destination not found.\"}\n", + "\n", + " price_info = fetch_ticket_price(destination_city)\n", + " \n", + " return {\n", + " \"name\": name,\n", + " \"destination_city\": destination_city.title(),\n", + " \"journey_date\": journey_date,\n", + " \"price\": price_info[\"price\"]\n", + " }\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "400f4592-2326-43f6-a921-fcd051c4f022", + "metadata": {}, + "outputs": [], + "source": [ + "destination_availability_tool = {\n", + " \"name\": \"check_destination_availability\",\n", + " \"description\": \"Check if tickets are available for the given destination city before proceeding with any booking or pricing inquiry.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"destination\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The name of the destination city to check for availability.\"\n", + " }\n", + " },\n", + " 
\"required\": [\"destination\"],\n", + " \"additionalProperties\": False\n", + " }\n", + "}\n", + "\n", + "ticket_price_tool = {\n", + " \"name\": \"fetch_ticket_price\",\n", + " \"description\": (\n", + " \"Get the price of a return ticket to the specified destination city. \"\n", + " \"Use this after confirming that the destination is available, especially when the customer asks for the ticket price.\"\n", + " ),\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"destination_city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city for which the customer wants the ticket price.\"\n", + " }\n", + " },\n", + " \"required\": [\"destination_city\"],\n", + " \"additionalProperties\": False\n", + " }\n", + "}\n", + "\n", + "ticket_booking_tool = {\n", + " \"name\": \"book_ticket\",\n", + " \"description\": (\n", + " \"Book a ticket for the customer to the specified destination city on the given journey date. \"\n", + " \"Use only after availability and price have been checked.\"\n", + " ),\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Full name of the person booking the ticket.\"\n", + " },\n", + " \"destination_city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city that the customer wants to travel to.\"\n", + " },\n", + " \"journey_date\": {\n", + " \"type\": \"string\",\n", + " \"format\": \"date\",\n", + " \"description\": \"The journey date in YYYY-MM-DD format.\"\n", + " }\n", + " },\n", + " \"required\": [\"name\", \"destination_city\", \"journey_date\"],\n", + " \"additionalProperties\": False\n", + " }\n", + "}\n", + "\n", + "tools = [\n", + " {\"type\": \"function\", \"function\": destination_availability_tool},\n", + " {\"type\": \"function\", \"function\": ticket_price_tool},\n", + " {\"type\": \"function\", \"function\": ticket_booking_tool},\n", + "]" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "f02c17ba-14f2-41c4-b6a2-d1397405d368", + "metadata": {}, + "outputs": [], + "source": [ + "def handle_tool_call(message):\n", + " \"\"\"\n", + " Handles a single OpenAI tool call message and returns both the result\n", + " and a formatted tool response dictionary.\n", + " \n", + " Args:\n", + " message (object): An OpenAI message containing a tool call.\n", + " \n", + " Returns:\n", + " tuple: (result_dict, response_dict)\n", + " \"\"\"\n", + " tool_call = message.tool_calls[0]\n", + " function_name = tool_call.function.name\n", + " arguments = json.loads(tool_call.function.arguments)\n", + "\n", + " result = None\n", + "\n", + " logging.info(f\"Tool call received: {function_name} with arguments: {arguments}\")\n", + "\n", + " if function_name == \"check_destination_availability\":\n", + " result = check_destination_availability(**arguments)\n", + "\n", + " elif function_name == \"fetch_ticket_price\":\n", + " city = arguments.get(\"destination_city\")\n", + " price_info = fetch_ticket_price(city)\n", + " result = {\"destination_city\": city, \"price\": price_info[\"price\"]}\n", + "\n", + " elif function_name == \"book_ticket\":\n", + " result = book_ticket(**arguments)\n", + "\n", + " else:\n", + " logging.warning(\"Unrecognized tool function: %s\", function_name)\n", + " result = {\"error\": f\"Unknown function '{function_name}'\"}\n", + "\n", + " response = {\n", + " \"role\": \"tool\",\n", + " \"tool_call_id\": tool_call.id,\n", + " \"content\": json.dumps(result)\n", + " }\n", + "\n", + " return result, response" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72c1a9e7-186c-4218-9edc-01814baec431", + "metadata": {}, + "outputs": [], + "source": [ + "def artist(city: str, style: str = \"vibrant pop-art\", size: str = \"1024x1024\") -> Image.Image:\n", + " \"\"\"\n", + " Generates a city-themed vacation image using DALLΒ·E.\n", + "\n", + " Args:\n", + " city (str): Name of the city 
to visualize.\n", + " style (str): Artistic style for the image prompt.\n", + " size (str): Image resolution (e.g., \"1024x1024\").\n", + "\n", + " Returns:\n", + " Image.Image: A PIL Image object representing the generated image.\n", + "\n", + " Raises:\n", + " ValueError: If city name is empty.\n", + " RuntimeError: If image generation fails.\n", + " \"\"\"\n", + " if not city.strip():\n", + " raise ValueError(\"City name cannot be empty.\")\n", + "\n", + " prompt = (\n", + " f\"An image representing a vacation in {city}, \"\n", + " f\"showing iconic tourist attractions, cultural elements, and everything unique about {city}, \"\n", + " f\"rendered in a {style} style.\"\n", + " )\n", + "\n", + " logging.info(\"Generating image for city: %s with style: %s\", city, style)\n", + "\n", + " try:\n", + " response = openai.images.generate(\n", + " model=\"dall-e-3\",\n", + " prompt=prompt,\n", + " size=size,\n", + " n=1,\n", + " response_format=\"b64_json\",\n", + " )\n", + "\n", + " image_base64 = response.data[0].b64_json\n", + " image_data = base64.b64decode(image_base64)\n", + " logging.info(\"Image generation successful for %s\", city)\n", + "\n", + " return Image.open(BytesIO(image_data))\n", + "\n", + " except Exception as e:\n", + " logging.error(\"Failed to generate image for city '%s': %s\", city, str(e))\n", + " raise RuntimeError(f\"Image generation failed for city '{city}'\") from e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdf7c091-6c68-4af6-8197-c1456b36cedf", + "metadata": {}, + "outputs": [], + "source": [ + "def talker(message: str, output_filename: str = \"output_audio.mp3\", autoplay: bool = True) -> None:\n", + " \"\"\"\n", + " Converts a text message into speech using OpenAI TTS and plays the audio.\n", + "\n", + " Args:\n", + " message (str): The text to convert to speech.\n", + " output_filename (str): The filename to save the generated audio.\n", + " autoplay (bool): Whether to autoplay the audio in the 
notebook.\n", + "\n", + " Raises:\n", + " ValueError: If the message is empty.\n", + " RuntimeError: If the audio generation fails.\n", + " \"\"\"\n", + " if not message.strip():\n", + " raise ValueError(\"Message cannot be empty.\")\n", + "\n", + " logging.info(\"Generating speech for message: %s\", message)\n", + "\n", + " try:\n", + " response = openai.audio.speech.create(\n", + " model=\"tts-1\",\n", + " voice=\"alloy\",\n", + " input=message\n", + " )\n", + "\n", + " with open(output_filename, \"wb\") as f:\n", + " f.write(response.content)\n", + "\n", + " logging.info(\"Audio written to: %s\", output_filename)\n", + "\n", + " if autoplay:\n", + " display(Audio(output_filename, autoplay=True))\n", + "\n", + " except Exception as e:\n", + " logging.error(\"Failed to generate or play audio: %s\", str(e))\n", + " raise RuntimeError(\"Text-to-speech generation failed.\") from e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54568b4a-be8d-47a1-b924-03acdafef70e", + "metadata": {}, + "outputs": [], + "source": [ + "def translate(message, language):\n", + " \"\"\"\n", + " Translates the given text into the specified language using OpenAI Chat API.\n", + "\n", + " Args:\n", + " message (str): The text to be translated.\n", + " language (str): Target language for translation (e.g., 'French', 'Japanese').\n", + "\n", + " Returns:\n", + " str: Translated text.\n", + "\n", + " Raises:\n", + " ValueError: If input message or language is empty.\n", + " RuntimeError: If translation fails due to API or other issues.\n", + " \"\"\"\n", + " if not message.strip():\n", + " raise ValueError(\"Input message cannot be empty.\")\n", + " if not language.strip():\n", + " raise ValueError(\"Target language cannot be empty.\")\n", + "\n", + " logging.info(\"Translating to %s: %s\", language, message)\n", + "\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": f\"You are a translation assistant. 
Translate everything the user says to {language}.\"},\n", + " {\"role\": \"user\", \"content\": message}\n", + " ]\n", + "\n", + " try:\n", + " response = openai.chat.completions.create(\n", + " model=MODEL,\n", + " messages=messages\n", + " )\n", + " translated = response.choices[0].message.content.strip()\n", + " logging.info(\"Translation successful.\")\n", + " return translated\n", + "\n", + " except Exception as e:\n", + " logging.error(\"Translation failed: %s\", str(e))\n", + " raise RuntimeError(\"Failed to translate message.\") from e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e6cf470-8ea0-43b2-bbcc-53c2432feb0d", + "metadata": {}, + "outputs": [], + "source": [ + "def transcribe_audio(audio_path):\n", + " \"\"\"\n", + " Transcribes an audio file using OpenAI's Whisper model.\n", + "\n", + " Args:\n", + " audio_path (str): Path to the audio file (e.g., .mp3, .wav).\n", + " model (str): OpenAI model for transcription (default: 'whisper-1').\n", + "\n", + " Returns:\n", + " str: Transcribed text from the audio file.\n", + "\n", + " Raises:\n", + " ValueError: If the path is invalid or the file does not exist.\n", + " RuntimeError: If the transcription fails.\n", + " \"\"\"\n", + " if not audio_path or not os.path.exists(audio_path):\n", + " raise ValueError(\"Invalid or missing audio file path.\")\n", + "\n", + " logging.info(\"Transcribing audio file: %s using model: whisper-1\", audio_path)\n", + "\n", + " try:\n", + " with open(audio_path, \"rb\") as f:\n", + " response = openai.audio.transcriptions.create(\n", + " model=\"whisper-1\",\n", + " file=f\n", + " )\n", + " transcript = response.text.strip()\n", + " logging.info(\"Transcription successful.\")\n", + " return transcript\n", + "\n", + " except Exception as e:\n", + " logging.error(\"Transcription failed: %s\", str(e))\n", + " raise RuntimeError(\"Failed to transcribe audio.\") from e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"3489656e-0f08-4d41-94b1-d902c93ca164", + "metadata": {}, + "outputs": [], + "source": [ + "def chat(history: list, language: str, translated_history: list, speaking_language: str) -> tuple:\n", + " \"\"\"\n", + " Handles a chat interaction including tool calls, image generation, translation, and TTS playback.\n", + "\n", + " Args:\n", + " history (list): List of previous conversation messages.\n", + " language (str): Target language for translation and TTS.\n", + "\n", + " Returns:\n", + " tuple: (updated history list, generated image if any, translated response string)\n", + " \"\"\"\n", + " messages = [{\"role\": \"system\", \"content\": system_message}] + history\n", + " image = None\n", + "\n", + " try:\n", + " # Initial assistant response\n", + " response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n", + " choice = response.choices[0]\n", + "\n", + " # Handle tool calls if triggered\n", + " if choice.finish_reason == \"tool_calls\":\n", + " message = choice.message\n", + " result, tool_response = handle_tool_call(message)\n", + "\n", + " # Append tool-related messages\n", + " messages.append(message)\n", + " messages.append(tool_response)\n", + " logging.info(\"Tool call result: %s\", result)\n", + "\n", + " # Generate image if a booking was completed\n", + " if message.tool_calls[0].function.name == \"book_ticket\" and \"destination_city\" in result:\n", + " image = artist(result[\"destination_city\"])\n", + "\n", + " # Get final assistant response after tool execution\n", + " response = openai.chat.completions.create(model=MODEL, messages=messages)\n", + " choice = response.choices[0]\n", + "\n", + " reply = choice.message.content.strip()\n", + " history.append({\"role\": \"assistant\", \"content\": reply})\n", + "\n", + " # Translate and speak the reply\n", + " translated_reply = translate(reply, language)\n", + " translated_history.append({\"role\": \"assistant\", \"content\": translated_reply})\n", + "\n", + " if 
speaking_language == \"English\":\n", + " talker(reply)\n", + " else:\n", + " talker(translated_reply)\n", + "\n", + " return history, image, translated_history\n", + "\n", + " except Exception as e:\n", + " logging.error(\"Chat processing failed: %s\", str(e))\n", + " raise RuntimeError(\"Failed to complete chat interaction.\") from e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f76acc68-726e-457f-88ab-99da75debde5", + "metadata": {}, + "outputs": [], + "source": [ + "force_dark_mode = \"\"\"\n", + "function refresh() {\n", + " const url = new URL(window.location);\n", + " if (url.searchParams.get('__theme') !== 'dark') {\n", + " url.searchParams.set('__theme', 'dark');\n", + " window.location.href = url.href;\n", + " }\n", + "}\n", + "\"\"\"\n", + "\n", + "with gr.Blocks(js=force_dark_mode) as ui:\n", + " with gr.Row():\n", + " gr.Markdown(\"### FlightAI Chat with Translation\")\n", + "\n", + " with gr.Row():\n", + " lang_dropdown = gr.Dropdown(\n", + " choices=[\"Spanish\", \"French\", \"German\", \"Japanese\", \"Hindi\"],\n", + " value=\"Spanish\",\n", + " label=\"Translate To\"\n", + " )\n", + " \n", + " speak_dropdown = gr.Dropdown(\n", + " choices=[\"English\", \"Selected Language\"],\n", + " value=\"English\",\n", + " label=\"Speak out in\"\n", + " )\n", + " \n", + " with gr.Row():\n", + " chatbot = gr.Chatbot(height=500, type=\"messages\", label=\"Chat History\")\n", + " translated_chatbot = gr.Chatbot(height=500, type=\"messages\", label=\"Translated Chat\")\n", + " image_output = gr.Image(height=500)\n", + "\n", + " with gr.Row():\n", + " entry = gr.Textbox(label=\"Chat with our AI Assistant:\")\n", + " audio_input = gr.Audio(sources=\"microphone\", type=\"filepath\", label=\"Or speak to the assistant\")\n", + "\n", + " with gr.Row():\n", + " clear = gr.Button(\"Clear\")\n", + "\n", + " def do_entry(message, history, audio, translated_history, language):\n", + " if audio:\n", + " message = transcribe_audio(audio)\n", + "\n", + 
" if message:\n", + " history += [{\"role\": \"user\", \"content\": message}]\n", + " translated_history += [{\"role\": \"user\", \"content\": translate(message, language)}]\n", + " return \"\", history, None, translated_history\n", + "\n", + " entry.submit(\n", + " do_entry,\n", + " inputs=[entry, chatbot, audio_input, translated_chatbot, lang_dropdown],\n", + " outputs=[entry, chatbot, audio_input, translated_chatbot]\n", + " ).then(\n", + " chat,\n", + " inputs=[chatbot, lang_dropdown, translated_chatbot, speak_dropdown],\n", + " outputs=[chatbot, image_output, translated_chatbot]\n", + " )\n", + "\n", + " audio_input.change(\n", + " do_entry,\n", + " inputs=[entry, chatbot, audio_input, translated_chatbot, lang_dropdown],\n", + " outputs=[entry, chatbot, audio_input, translated_chatbot]\n", + " ).then(\n", + " chat,\n", + " inputs=[chatbot, lang_dropdown, translated_chatbot, speak_dropdown],\n", + " outputs=[chatbot, image_output, translated_chatbot]\n", + " )\n", + "\n", + " clear.click(lambda: [\"\", [], None, [], None], inputs=None, outputs=[entry, chatbot, audio_input, translated_chatbot, image_output], queue=False)\n", + "\n", + "ui.launch(inbrowser=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58f97435-fa0d-45f7-b02f-4ac5f4901c53", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/Week2_airline_assistant_Gemini_Amadeus_live_ticket_price.ipynb b/week2/community-contributions/Week2_airline_assistant_Gemini_Amadeus_live_ticket_price.ipynb new file mode 
100644 index 0000000..bc4f92a --- /dev/null +++ b/week2/community-contributions/Week2_airline_assistant_Gemini_Amadeus_live_ticket_price.ipynb @@ -0,0 +1,808 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d938fc6c-bcca-4572-b851-75370fe21c67", + "metadata": {}, + "source": [ + "# Airline Assistant using Gemini API for Image and Audio as well - Live ticket prices using Amadeus API" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5eda470-07ee-4d01-bada-3390050ac9c2", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json\n", + "import random\n", + "import string\n", + "import base64\n", + "import gradio as gr\n", + "import pyaudio\n", + "import requests\n", + "from io import BytesIO\n", + "from PIL import Image\n", + "from dotenv import load_dotenv\n", + "from google import genai\n", + "from google.genai import types" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09aaf3b0-beb7-4b64-98a4-da16fc83dadb", + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(override=True)\n", + "api_key = os.getenv(\"GOOGLE_API_KEY\")\n", + "\n", + "if not api_key:\n", + " print(\"API Key not found!\")\n", + "else:\n", + " print(\"API Key loaded in memory\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35881fb9-4d51-43dc-a5e6-d9517e22019a", + "metadata": {}, + "outputs": [], + "source": [ + "MODEL_GEMINI = 'gemini-2.5-flash'\n", + "MODEL_GEMINI_IMAGE = 'gemini-2.0-flash-preview-image-generation'\n", + "MODEL_GEMINI_SPEECH = 'gemini-2.5-flash-preview-tts'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5ed391c-8a67-4465-9c66-e915548a0d6a", + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " client = genai.Client(api_key=api_key)\n", + " print(\"Google GenAI Client initialized successfully!\")\n", + "except Exception as e:\n", + " print(f\"Error initializing GenAI Client: {e}\")\n", + " print(\"Ensure your GOOGLE_API_KEY is 
correctly set as an environment variable.\")\n", + " exit() " + ] + }, + { + "cell_type": "markdown", + "id": "407ad581-9580-4dba-b236-abb6c6788933", + "metadata": {}, + "source": [ + "## Image Generation " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a21921f8-57b1-4665-8999-7f2a40645b59", + "metadata": {}, + "outputs": [], + "source": [ + "def fetch_image(city):\n", + " prompt = (\n", + " f\"A high-quality, photo-realistic image of a vacation in {city}, \"\n", + " f\"showing iconic landmarks, cultural attractions, authentic street life, and local cuisine. \"\n", + " f\"Capture natural lighting, real people enjoying travel experiences, and the unique vibe of {city}'s atmosphere. \"\n", + " f\"The composition should feel immersive, warm, and visually rich, as if taken by a travel photographer.\"\n", + ")\n", + "\n", + " response = client.models.generate_content(\n", + " model = MODEL_GEMINI_IMAGE,\n", + " contents = prompt,\n", + " config=types.GenerateContentConfig(\n", + " response_modalities=['TEXT', 'IMAGE']\n", + " )\n", + " )\n", + "\n", + " for part in response.candidates[0].content.parts:\n", + " if part.inline_data is not None:\n", + " image_data = BytesIO(part.inline_data.data)\n", + " return Image.open(image_data)\n", + "\n", + " raise ValueError(\"No image found in Gemini response.\")\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcd4aed1-8b4d-4771-ba32-e729e82bab54", + "metadata": {}, + "outputs": [], + "source": [ + "fetch_image(\"london\")" + ] + }, + { + "cell_type": "markdown", + "id": "5f6baee6-e2e2-4cc4-941d-34a4c72cee67", + "metadata": {}, + "source": [ + "## Speech Generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "825dfedc-0271-4191-a3d1-50872af4c8cf", + "metadata": {}, + "outputs": [], + "source": [ + "\"\"\"\n", + "Kore -- Firm\n", + "Puck -- Upbeat\n", + "Leda -- Youthful\n", + "Iapetus -- Clear\n", + "Erinome -- Clear\n", + "Sadachbia -- 
Lively\n", + "Sulafat -- Warm\n", + "Despina -- Smooth\n", + "\"\"\"\n", + "\n", + "def talk(message:str, voice_name:str=\"Leda\", mood:str=\"cheerfully\"):\n", + " prompt = f\"Say {mood}: {message}\"\n", + " response = client.models.generate_content(\n", + " model = MODEL_GEMINI_SPEECH,\n", + " contents = prompt,\n", + " config=types.GenerateContentConfig(\n", + " response_modalities=[\"AUDIO\"],\n", + " speech_config=types.SpeechConfig(\n", + " voice_config=types.VoiceConfig(\n", + " prebuilt_voice_config=types.PrebuiltVoiceConfig(\n", + " voice_name=voice_name,\n", + " )\n", + " )\n", + " ), \n", + " )\n", + " )\n", + "\n", + " # Fetch the audio bytes\n", + " pcm_data = response.candidates[0].content.parts[0].inline_data.data\n", + " # Play the audio using PyAudio\n", + " p = pyaudio.PyAudio()\n", + " stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)\n", + " stream.write(pcm_data)\n", + " stream.stop_stream()\n", + " stream.close()\n", + " p.terminate()\n", + "\n", + " # Play using simpleaudio (16-bit PCM, mono, 24kHz)\n", + " # play_obj = sa.play_buffer(pcm_data, num_channels=1, bytes_per_sample=2, sample_rate=24000)\n", + " # play_obj.wait_done() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54967ebc-24a6-4bb2-9a19-20c3585f1d77", + "metadata": {}, + "outputs": [], + "source": [ + "talk(\"Hi, How are you? 
Welcome to FlyJumbo Airlines\",\"Kore\",\"helpful\")" + ] + }, + { + "cell_type": "markdown", + "id": "be9dc275-838e-4c54-b487-41d094dad96b", + "metadata": {}, + "source": [ + "## Ticket Price Tool Function - Using Amadeus API " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8613a080-d82c-4c1a-8db4-377614997ac2", + "metadata": {}, + "outputs": [], + "source": [ + "client_id = os.getenv(\"AMADEUS_CLIENT_ID\")\n", + "client_secret = os.getenv(\"AMADEUS_CLIENT_SECRET\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bf78f61-0de1-4552-a1d4-1a28380be6a5", + "metadata": {}, + "outputs": [], + "source": [ + "# Get the token first\n", + "def get_amadeus_token():\n", + " url = \"https://test.api.amadeus.com/v1/security/oauth2/token\"\n", + " headers = {\"Content-Type\": \"application/x-www-form-urlencoded\"}\n", + " data = {\n", + " \"grant_type\": \"client_credentials\",\n", + " \"client_id\": client_id,\n", + " \"client_secret\": client_secret,\n", + " }\n", + " \n", + " try:\n", + " response = requests.post(url, headers=headers, data=data, timeout=10)\n", + " response.raise_for_status()\n", + " return response.json()[\"access_token\"]\n", + " \n", + " except requests.exceptions.HTTPError as e:\n", + " print(f\"HTTP Error {response.status_code}: {response.text}\")\n", + " \n", + " except requests.exceptions.RequestException as e:\n", + " print(\"Network or connection error:\", e)\n", + " \n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c5261f6-6662-4e9d-8ff0-8e10171bb963", + "metadata": {}, + "outputs": [], + "source": [ + "def get_airline_name(code, token):\n", + " url = f\"https://test.api.amadeus.com/v1/reference-data/airlines\"\n", + " headers = {\"Authorization\": f\"Bearer {token}\"}\n", + " params = {\"airlineCodes\": code}\n", + "\n", + " response = requests.get(url, headers=headers, params=params)\n", + " response.raise_for_status()\n", + " data = response.json()\n", 
+ "\n", + " if \"data\" in data and data[\"data\"]:\n", + " return data[\"data\"][0].get(\"businessName\", code)\n", + " return code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42a55f06-880a-4c49-8560-2e7b97953c1a", + "metadata": {}, + "outputs": [], + "source": [ + "COMMON_CITY_CODES = {\n", + " \"delhi\": \"DEL\",\n", + " \"mumbai\": \"BOM\",\n", + " \"chennai\": \"MAA\",\n", + " \"kolkata\": \"CCU\",\n", + " \"bengaluru\": \"BLR\",\n", + " \"hyderabad\": \"HYD\",\n", + " \"patna\": \"PAT\",\n", + " \"raipur\": \"RPR\",\n", + " \"panaji\": \"GOI\",\n", + " \"chandigarh\": \"IXC\",\n", + " \"srinagar\": \"SXR\",\n", + " \"ranchi\": \"IXR\",\n", + " \"bengaluru\": \"BLR\",\n", + " \"thiruvananthapuram\": \"TRV\",\n", + " \"bhopal\": \"BHO\",\n", + " \"mumbai\": \"BOM\",\n", + " \"imphal\": \"IMF\",\n", + " \"aizawl\": \"AJL\",\n", + " \"bhubaneswar\": \"BBI\",\n", + " \"jaipur\": \"JAI\",\n", + " \"chennai\": \"MAA\",\n", + " \"hyderabad\": \"HYD\",\n", + " \"agartala\": \"IXA\",\n", + " \"lucknow\": \"LKO\",\n", + " \"dehradun\": \"DED\",\n", + " \"kolkata\": \"CCU\",\n", + "\n", + " # Union territories\n", + " \"port blair\": \"IXZ\",\n", + " \"leh\": \"IXL\",\n", + " \"puducherry\": \"PNY\",\n", + "\n", + " # Major metro cities (for redundancy)\n", + " \"ahmedabad\": \"AMD\",\n", + " \"surat\": \"STV\",\n", + " \"coimbatore\": \"CJB\",\n", + " \"vizag\": \"VTZ\",\n", + " \"vijayawada\": \"VGA\",\n", + " \"nagpur\": \"NAG\",\n", + " \"indore\": \"IDR\",\n", + " \"kanpur\": \"KNU\",\n", + " \"varanasi\": \"VNS\"\n", + "}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b061ec2c-609b-4d77-bd41-c9bc5bf901f4", + "metadata": {}, + "outputs": [], + "source": [ + "city_code_cache = {}\n", + "\n", + "def get_city_code(city_name, token):\n", + " city_name = city_name.strip().lower()\n", + "\n", + " if city_name in city_code_cache:\n", + " return city_code_cache[city_name]\n", + "\n", + " if city_name in 
COMMON_CITY_CODES:\n", + " return COMMON_CITY_CODES[city_name]\n", + "\n", + " base_url = \"https://test.api.amadeus.com/v1/reference-data/locations\"\n", + " headers = {\"Authorization\": f\"Bearer {token}\"}\n", + "\n", + " for subtype in [\"CITY\", \"AIRPORT,CITY\"]:\n", + " params = {\"keyword\": city_name, \"subType\": subtype}\n", + " try:\n", + " response = requests.get(base_url, headers=headers, params=params, timeout=10)\n", + " response.raise_for_status()\n", + " data = response.json()\n", + "\n", + " if \"data\" in data and data[\"data\"]:\n", + " code = data[\"data\"][0][\"iataCode\"]\n", + " print(f\"[INFO] Found {subtype} match for '{city_name}': {code}\")\n", + " city_code_cache[city_name] = code\n", + " return code\n", + " except Exception as e:\n", + " print(f\"[ERROR] Location lookup failed for {subtype}: {e}\")\n", + "\n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9816a9c-fd70-4dfc-a3c0-4d8709997371", + "metadata": {}, + "outputs": [], + "source": [ + "# Getting live ticket price \n", + "\n", + "def get_live_ticket_prices(origin, destination, departure_date, return_date=None):\n", + " token = get_amadeus_token()\n", + "\n", + " url = \"https://test.api.amadeus.com/v2/shopping/flight-offers\"\n", + " headers = {\"Authorization\": f\"Bearer {token}\"}\n", + "\n", + " origin_code = get_city_code(origin,token)\n", + " destination_code = get_city_code(destination,token)\n", + "\n", + " if not origin_code:\n", + " return f\"Sorry, I couldn't find the airport code for the city '{origin}'.\"\n", + " if not destination_code:\n", + " return f\"Sorry, I couldn't find the airport code for the city '{destination}'.\"\n", + "\n", + " params = {\n", + " \"originLocationCode\": origin_code.upper(),\n", + " \"destinationLocationCode\": destination_code.upper(),\n", + " \"departureDate\": departure_date,\n", + " \"adults\": 1,\n", + " \"currencyCode\": \"USD\",\n", + " \"max\": 1,\n", + " }\n", + "\n", + " if 
return_date:\n", + " params[\"returnDate\"] = return_date\n", + "\n", + " try:\n", + " response = requests.get(url, headers=headers, params=params, timeout=10)\n", + " response.raise_for_status()\n", + " data = response.json()\n", + " \n", + " if \"data\" in data and data[\"data\"]:\n", + " offer = data[\"data\"][0]\n", + " price = offer[\"price\"][\"total\"]\n", + " airline_codes = offer.get(\"validatingAirlineCodes\", [])\n", + " airline_code = airline_codes[0] if airline_codes else \"Unknown\"\n", + "\n", + " try:\n", + " airline_name = get_airline_name(airline_code, token) if airline_code != \"Unknown\" else \"Unknown Airline\"\n", + " if not airline_name: \n", + " airline_name = airline_code\n", + " except Exception:\n", + " airline_name = airline_code\n", + " \n", + " \n", + " if return_date:\n", + " return (\n", + " f\"Round-trip flight from {origin.capitalize()} to {destination.capitalize()}:\\n\"\n", + " f\"- Departing: {departure_date}\\n\"\n", + " f\"- Returning: {return_date}\\n\"\n", + " f\"- Airline: {airline_name}\\n\"\n", + " f\"- Price: ${price}\"\n", + " )\n", + " else:\n", + " return (\n", + " f\"One-way flight from {origin.capitalize()} to {destination.capitalize()} on {departure_date}:\\n\"\n", + " f\"- Airline: {airline_name}\\n\"\n", + " f\"- Price: ${price}\"\n", + " )\n", + " else:\n", + " return f\"No flights found from {origin.capitalize()} to {destination.capitalize()} on {departure_date}.\"\n", + " except requests.exceptions.RequestException as e:\n", + " return f\"❌ Error fetching flight data: {str(e)}\" \n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bc7657e-e8b5-4647-9745-d7d403feb09a", + "metadata": {}, + "outputs": [], + "source": [ + "get_live_ticket_prices(\"london\", \"chennai\", \"2025-07-01\",\"2025-07-10\")" + ] + }, + { + "cell_type": "markdown", + "id": "e1153b94-90e7-4856-8c85-e456305a7817", + "metadata": {}, + "source": [ + "## Ticket Booking Tool Function - DUMMY" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "5dfc3b12-0a16-4861-a549-594f175ff956", + "metadata": {}, + "outputs": [], + "source": [ + "def book_flight(origin, destination, departure_date, return_date=None, airline=\"Selected Airline\", passenger_name=\"Guest\"):\n", + " # Generate a dummy ticket reference (PNR)\n", + " ticket_ref = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))\n", + "\n", + " # Build confirmation message\n", + " confirmation = (\n", + " f\"🎫 Booking confirmed for {passenger_name}!\\n\"\n", + " f\"From: {origin.capitalize()} β†’ To: {destination.capitalize()}\\n\"\n", + " f\"Departure: {departure_date}\"\n", + " )\n", + "\n", + " if return_date:\n", + " confirmation += f\"\\nReturn: {return_date}\"\n", + "\n", + " confirmation += (\n", + " f\"\\nAirline: {airline}\\n\"\n", + " f\"PNR: {ticket_ref}\\n\"\n", + " f\"βœ… Your ticket has been booked successfully. Safe travels!\"\n", + " )\n", + "\n", + " return confirmation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "122f655b-b7a4-45c6-aaec-afd2917a051b", + "metadata": {}, + "outputs": [], + "source": [ + "print(book_flight(\"chennai\", \"delhi\", \"2025-07-01\", \"2025-07-10\", \"Air India\", \"Ravi Kumar\"))" + ] + }, + { + "cell_type": "markdown", + "id": "e83d8e90-ae22-4728-83e5-d83fed7f2049", + "metadata": {}, + "source": [ + "## Gemini Chat Workings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a656f4e-914d-4f5e-b7fa-48457935181a", + "metadata": {}, + "outputs": [], + "source": [ + "ticket_price_function_declaration = {\n", + " \"name\":\"get_live_ticket_prices\",\n", + " \"description\": \"Get live flight ticket prices between two cities for a given date (round-trip or one-way).\\\n", + " The destination may be a city or country (e.g., 'China'). 
Call this function whenever a customer asks about ticket prices, such as 'How much is a ticket to Paris?'\",\n", + " \"parameters\":{\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"origin\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Name of the origin city. Example: 'Delhi'\",\n", + " },\n", + " \"destination\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Name of the destination city. Example: 'London'\",\n", + " },\n", + " \"departure_date\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Date of departure in YYYY-MM-DD format. Example: '2025-07-01'\",\n", + " },\n", + " \"return_date\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Optional return date for round-trip in YYYY-MM-DD format. Leave blank for one-way trips.\",\n", + " },\n", + " },\n", + " \"required\": [\"origin\", \"destination\", \"departure_date\"],\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "05a835ab-a675-40ed-9cd8-65f4c6b22722", + "metadata": {}, + "outputs": [], + "source": [ + "book_flight_function_declaration = {\n", + " \"name\": \"book_flight\",\n", + " \"description\": \"Book a flight for the user after showing the ticket details and confirming the booking. \"\n", + " \"Call this function when the user says things like 'yes', 'book it', or 'I want to book this flight'.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"origin\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Name of the origin city. Example: 'Chennai'\",\n", + " },\n", + " \"destination\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Name of the destination city. Example: 'London'\",\n", + " },\n", + " \"departure_date\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Date of departure in YYYY-MM-DD format. 
Example: '2025-07-01'\",\n", + " },\n", + " \"return_date\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Optional return date for round-trip in YYYY-MM-DD format. Leave blank for one-way trips.\",\n", + " },\n", + " \"airline\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Airline name or code that the user wants to book with. Example: 'Air India'\",\n", + " },\n", + " \"passenger_name\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Full name of the passenger for the booking. Example: 'Ravi Kumar'\",\n", + " }\n", + " },\n", + " \"required\": [\"origin\", \"destination\", \"departure_date\", \"passenger_name\"],\n", + " }\n", + "}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad0231cd-040f-416d-b150-0d8f90535718", + "metadata": {}, + "outputs": [], + "source": [ + "# System Definitions\n", + "\n", + "system_instruction_prompt = (\n", + " \"You are a helpful and courteous AI assistant for an airline company called FlyJumbo. \"\n", + " \"When a user starts a new conversation, greet them with: 'Hi there, welcome to FlyJumbo! How can I help you?'. \"\n", + " \"Do not repeat this greeting in follow-up messages. \"\n", + " \"Use the available tools if a user asks about ticket prices. \"\n", + " \"Ask follow-up questions to gather all necessary information before calling a function. \"\n", + " \"After calling a tool, always continue the conversation by summarizing the result and asking the user the next relevant question (e.g., if they want to proceed with a booking). \"\n", + " \"If you do not know the answer and no tool can help, respond politely that you are unable to help with the request. 
\"\n", + " \"Answer concisely in one sentence.\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff0b3de8-5674-4f08-9f9f-06f88ff959a1", + "metadata": {}, + "outputs": [], + "source": [ + "tools = types.Tool(function_declarations=[ticket_price_function_declaration,book_flight_function_declaration])\n", + "generate_content_config = types.GenerateContentConfig(system_instruction=system_instruction_prompt, tools=[tools])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "00a56779-16eb-4f31-9941-2eb01d17ed87", + "metadata": {}, + "outputs": [], + "source": [ + "def handle_tool_call(function_call):\n", + " print(f\"πŸ”§ Function Called - {function_call.name}\")\n", + " function_name = function_call.name\n", + " args = function_call.args\n", + "\n", + " if function_name == \"get_live_ticket_prices\":\n", + " origin = args.get(\"origin\")\n", + " destination = args.get(\"destination\")\n", + " departure_date = args.get(\"departure_date\")\n", + " return_date = args.get(\"return_date\") or None\n", + "\n", + " return get_live_ticket_prices(origin, destination, departure_date, return_date)\n", + "\n", + " elif function_name == \"book_flight\":\n", + " origin = args.get(\"origin\")\n", + " destination = args.get(\"destination\")\n", + " departure_date = args.get(\"departure_date\")\n", + " return_date = args.get(\"return_date\") or None\n", + " airline = args.get(\"airline\", \"Selected Airline\")\n", + " passenger_name = args.get(\"passenger_name\", \"Guest\")\n", + "\n", + " return book_flight(origin, destination, departure_date, return_date, airline, passenger_name)\n", + "\n", + " else:\n", + " return f\"❌ Unknown function: {function_name}\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0c334d2-9ab0-4f80-ac8c-c66897e0bd7c", + "metadata": {}, + "outputs": [], + "source": [ + "def chat(message, history):\n", + " full_message_history = []\n", + " city_name = None\n", + "\n", + " # Convert 
previous history to Gemini-compatible format\n", + " for h in history:\n", + " if h[\"role\"] == \"user\":\n", + " full_message_history.append(\n", + " types.Content(role=\"user\", parts=[types.Part.from_text(text=h[\"content\"])])\n", + " )\n", + " elif h[\"role\"] == \"assistant\":\n", + " full_message_history.append(\n", + " types.Content(role=\"model\", parts=[types.Part.from_text(text=h[\"content\"])])\n", + " )\n", + "\n", + " # Add current user message\n", + " full_message_history.append(\n", + " types.Content(role=\"user\", parts=[types.Part.from_text(text=message)])\n", + " )\n", + "\n", + " # Send to Gemini with tool config\n", + " response = client.models.generate_content(\n", + " model=MODEL_GEMINI,\n", + " contents=full_message_history,\n", + " config=generate_content_config\n", + " )\n", + "\n", + " candidate = response.candidates[0]\n", + " part = candidate.content.parts[0]\n", + " function_call = getattr(part, \"function_call\", None)\n", + "\n", + " # Case: Tool call required\n", + " if function_call:\n", + " # Append model message that triggered tool call\n", + " full_message_history.append(\n", + " types.Content(role=\"model\", parts=candidate.content.parts)\n", + " )\n", + "\n", + " # Execute the tool\n", + " tool_output = handle_tool_call(function_call)\n", + "\n", + " # Wrap and append tool output\n", + " tool_response_part = types.Part.from_function_response(\n", + " name=function_call.name,\n", + " response={\"result\": tool_output}\n", + " )\n", + " \n", + " full_message_history.append(\n", + " types.Content(role=\"function\", parts=[tool_response_part])\n", + " )\n", + "\n", + "\n", + " if function_call.name == \"book_flight\":\n", + " city_name = function_call.args.get(\"destination\").lower()\n", + " \n", + "\n", + " # Send follow-up message including tool result\n", + " followup_response = client.models.generate_content(\n", + " model=MODEL_GEMINI,\n", + " contents=full_message_history,\n", + " config=generate_content_config\n", + " 
)\n", + "\n", + " final_text = followup_response.text\n", + " \n", + " full_message_history.append(\n", + " types.Content(role=\"model\", parts=[types.Part.from_text(text=final_text)])\n", + " )\n", + "\n", + " return final_text,city_name, history + [{\"role\": \"assistant\", \"content\": final_text}]\n", + " else:\n", + " text = response.text\n", + " return text, city_name, history + [{\"role\": \"assistant\", \"content\": text}]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b245e6c-ef0b-4edf-b178-f14f2a75f285", + "metadata": {}, + "outputs": [], + "source": [ + "def user_submit(user_input, history):\n", + " history = history or []\n", + " history.append({\"role\": \"user\", \"content\": user_input})\n", + " \n", + " response_text, city_to_image, updated_history = chat(user_input, history)\n", + "\n", + " # Speak the response\n", + " try:\n", + " talk(response_text)\n", + " except Exception as e:\n", + " print(\"[Speech Error] Speech skipped due to quota limit.\")\n", + "\n", + " image = fetch_image(city_to_image) if city_to_image else None\n", + "\n", + " return \"\", updated_history, image, updated_history\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7db25b86-9a71-417c-98f0-790e3f3531bf", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks() as demo:\n", + " gr.Markdown(\"## ✈️ FlyJumbo Airline Assistant\")\n", + "\n", + " with gr.Row():\n", + " with gr.Column(scale=3):\n", + " chatbot = gr.Chatbot(label=\"Assistant\", height=500, type=\"messages\")\n", + " msg = gr.Textbox(placeholder=\"Ask about flights...\", show_label=False)\n", + " send_btn = gr.Button(\"Send\")\n", + "\n", + " with gr.Column(scale=2):\n", + " image_output = gr.Image(label=\"Trip Visual\", visible=True, height=500)\n", + "\n", + " state = gr.State([])\n", + " \n", + " send_btn.click(fn=user_submit, inputs=[msg, state], outputs=[msg, chatbot, image_output, state])\n", + " msg.submit(fn=user_submit, inputs=[msg, 
state], outputs=[msg, chatbot, image_output, state])\n", + "\n", + "demo.launch(inbrowser=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ef31bf62-9034-4fa7-b803-8f5df5309b77", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/clinic_booking_bot.ipynb b/week2/community-contributions/clinic_booking_bot.ipynb new file mode 100644 index 0000000..d2d8b57 --- /dev/null +++ b/week2/community-contributions/clinic_booking_bot.ipynb @@ -0,0 +1,344 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 170, + "id": "a1aa1b43-7a47-4aca-ae5f-94a9d4ba2d89", + "metadata": {}, + "outputs": [], + "source": [ + "## Clinic Booking Bot\n", + "\n", + "##Easily book your clinic visit – available only on weekdays between **14:00 and 15:00**. 
\n", + "##Speak or type, and get instant confirmation.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 171, + "id": "fe798c6a-f8da-46aa-8c0e-9d2623def3d2", + "metadata": {}, + "outputs": [], + "source": [ + "# import library\n", + "\n", + "import os\n", + "import json\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import gradio as gr\n", + "import base64\n", + "from io import BytesIO\n", + "from datetime import date\n", + "from PIL import Image, ImageDraw, ImageFont\n" + ] + }, + { + "cell_type": "code", + "execution_count": 172, + "id": "0ad4e526-e95d-4e70-9faa-b4236b105dd5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI API Key exists and begins sk-proj-\n" + ] + } + ], + "source": [ + "# Save keys\n", + "\n", + "load_dotenv(override=True)\n", + "\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key not set\")\n", + " \n", + "MODEL = \"gpt-4o-mini\"\n", + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": 173, + "id": "ae95308e-0002-4017-9f2c-fcb1ddb248fa", + "metadata": {}, + "outputs": [], + "source": [ + "# --- CONFIG ---\n", + "BOOKING_START = 14\n", + "BOOKING_END = 15\n", + "WEEKDAYS = [\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\"]\n", + "PHONE = \"010-1234567\"\n", + "confirmed_bookings = []\n" + ] + }, + { + "cell_type": "code", + "execution_count": 174, + "id": "e21b0fd0-4cda-4938-8867-dc2c6e7af4b1", + "metadata": {}, + "outputs": [], + "source": [ + "# --- TTS ---\n", + "def generate_tts(text, voice=\"fable\", filename=\"output.mp3\"):\n", + " response = openai.audio.speech.create(\n", + " model=\"tts-1\",\n", + " voice=\"fable\",\n", + " input=text\n", + " )\n", + " with open(filename, \"wb\") as f:\n", + " f.write(response.content)\n", + " return filename" + ] + }, 
+ { + "cell_type": "code", + "execution_count": 175, + "id": "e28a5c3b-bd01-4845-a41e-87823f6bb078", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Translate Booking Confirmation ---\n", + "def translate_text(text, target_language=\"nl\"):\n", + " prompt = f\"Translate this message to {target_language}:\\n{text}\"\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful translator.\"},\n", + " {\"role\": \"user\", \"content\": prompt}\n", + " ]\n", + " )\n", + " return response.choices[0].message.content.strip()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 176, + "id": "8ed57cc9-7d54-4a5d-831b-0efcc5b7a7a9", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Booking Logic ---\n", + "def book_appointment(name, time_str):\n", + " try:\n", + " booking_time = datetime.strptime(time_str, \"%H:%M\")\n", + " except ValueError:\n", + " return \"Invalid time format. 
Use HH:MM.\", None, None\n", + "\n", + " hour = booking_time.hour\n", + " weekday = datetime.today().strftime(\"%A\")\n", + "\n", + " if weekday not in WEEKDAYS:\n", + " response = \"Bookings are only available on weekdays.\"\n", + " elif BOOKING_START <= hour < BOOKING_END:\n", + " confirmation = f\"Booking confirmed for {name} at {time_str}.\"\n", + " confirmed_bookings.append((name, time_str))\n", + " translated = translate_text(confirmation)\n", + " audio = generate_tts(translated)\n", + " image = generate_booking_image(name, time_str)\n", + " return translated, audio, image\n", + " else:\n", + " response = \"Sorry, bookings are only accepted between 14:00 and 15:00 on weekdays.\"\n", + " translated = translate_text(response)\n", + " audio = generate_tts(translated)\n", + " return translated, audio, None" + ] + }, + { + "cell_type": "code", + "execution_count": 177, + "id": "19b52115-f0f3-4d63-a463-886163d4cfd1", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Booking Card ---\n", + "def generate_booking_image(name, time_str):\n", + " img = Image.new(\"RGB\", (500, 250), color=\"white\")\n", + " draw = ImageDraw.Draw(img)\n", + " msg = f\"\\u2705 Booking Confirmed\\nName: {name}\\nTime: {time_str}\"\n", + " draw.text((50, 100), msg, fill=\"black\")\n", + " return img" + ] + }, + { + "cell_type": "code", + "execution_count": 178, + "id": "2c446b6c-d410-4ba1-b0c7-c475e5259ff5", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Voice Booking ---\n", + "def voice_booking(audio_path, name):\n", + " with open(audio_path, \"rb\") as f:\n", + " response = openai.audio.transcriptions.create(model=\"whisper-1\", file=f)\n", + " transcription = response.text.strip()\n", + "\n", + " system_prompt = \"\"\"\n", + " You are a clinic assistant. 
Extract only the appointment time from the user's sentence in 24-hour HH:MM format.\n", + " If no time is mentioned, respond with 'No valid time found.'\n", + " \"\"\"\n", + "\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": transcription}\n", + " ]\n", + " )\n", + " extracted_time = response.choices[0].message.content.strip()\n", + "\n", + " if \":\" in extracted_time:\n", + " return book_appointment(name, extracted_time)\n", + " else:\n", + " message = \"Sorry, I couldn't understand the time. Please try again.\"\n", + " translated = translate_text(message)\n", + " audio_path = generate_tts(translated)\n", + " return translated, audio_path, None" + ] + }, + { + "cell_type": "code", + "execution_count": 179, + "id": "121d2907-7fa8-4248-b2e7-83617ea66ff0", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Chat Bot Handler ---\n", + "def chat_bot(messages):\n", + " system_prompt = \"\"\"\n", + " You are a clinic booking assistant. 
Your job is to:\n", + " - Greet the patient and explain your role\n", + " - Only assist with making appointments\n", + " - Accept bookings only on weekdays between 14:00 and 15:00\n", + " - Do not provide medical advice\n", + " - Always respond with empathy and clarity\n", + " \"\"\"\n", + " response = openai.chat.completions.create(\n", + " model=\"gpt-4\",\n", + " messages=[{\"role\": \"system\", \"content\": system_prompt}] + messages\n", + " )\n", + " reply = response.choices[0].message.content.strip()\n", + " audio = generate_tts(reply)\n", + " return reply, audio" + ] + }, + { + "cell_type": "code", + "execution_count": 180, + "id": "2427b694-8c57-40cb-b202-4a8989547925", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* Running on local URL: http://127.0.0.1:7898\n", + "* To create a public link, set `share=True` in `launch()`.\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Gradio interface\n", + "with gr.Blocks(theme=gr.themes.Soft()) as demo:\n", + " gr.Markdown(\"\"\"## 🩺 GP Booking Assistant \n", + "Only available weekdays between **14:00 and 15:00** \n", + "☎️ Contact: {PHONE}\n", + "---\"\"\")\n", + "\n", + " name_global = gr.Textbox(label=\"Your Name\", placeholder=\"Enter your name\", interactive=True)\n", + "\n", + " with gr.Tab(\"πŸ’¬ Chat Mode\"):\n", + " chatbot = gr.Chatbot(label=\"Booking Chat\", type=\"messages\", height=400)\n", + " text_input = gr.Textbox(label=\"Type your message or use your voice below\")\n", + " audio_input = gr.Audio(type=\"filepath\", label=\"πŸŽ™οΈ Or speak your request\")\n", + " chat_audio_output = gr.Audio(label=\"πŸ”Š Assistant's Reply\", type=\"filepath\")\n", + " send_btn = gr.Button(\"Send\")\n", + "\n", + " def handle_chat(user_message, chat_history):\n", + " chat_history = chat_history or []\n", + " chat_history.append({\"role\": \"user\", \"content\": user_message})\n", + " reply, audio = chat_bot(chat_history)\n", + " chat_history.append({\"role\": \"assistant\", \"content\": reply})\n", + " return chat_history, \"\", audio\n", + "\n", + " def handle_audio_chat(audio_path, chat_history):\n", + " with open(audio_path, \"rb\") as f:\n", + " transcription = openai.audio.transcriptions.create(model=\"whisper-1\", file=f).text.strip()\n", + " return handle_chat(transcription, chat_history)\n", + "\n", + " send_btn.click(handle_chat, [text_input, chatbot], [chatbot, text_input, chat_audio_output])\n", + " text_input.submit(handle_chat, [text_input, chatbot], [chatbot, text_input, chat_audio_output])\n", + " audio_input.change(handle_audio_chat, [audio_input, chatbot], [chatbot, text_input, chat_audio_output])\n", + "\n", + "\n", + " \n", + " with gr.Tab(\"πŸ“ Text Booking\"):\n", + " time_text = gr.Textbox(label=\"Preferred Time (HH:MM)\", placeholder=\"e.g., 14:30\")\n", + 
" btn_text = gr.Button(\"πŸ“… Book via Text\")\n", + "\n", + " with gr.Tab(\"πŸŽ™οΈ Voice Booking\"):\n", + " voice_input = gr.Audio(type=\"filepath\", label=\"Say your preferred time\")\n", + " btn_voice = gr.Button(\"πŸ“… Book via Voice\")\n", + "\n", + " output_text = gr.Textbox(label=\"Response\", interactive=False)\n", + " output_audio = gr.Audio(label=\"Audio Reply\", type=\"filepath\")\n", + " output_image = gr.Image(label=\"Booking Confirmation\")\n", + "\n", + " btn_text.click(fn=book_appointment, inputs=[name_global, time_text], outputs=[output_text, output_audio, output_image])\n", + " btn_voice.click(fn=voice_booking, inputs=[voice_input, name_global], outputs=[output_text, output_audio, output_image])\n", + "\n", + " gr.Markdown(\"\"\"---\n", + "This assistant does **not** give medical advice. It only books appointments within allowed hours.\n", + "\"\"\")\n", + "\n", + " demo.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f359de0a-28b1-4895-b21d-91d79e494a0d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/day1-three-model-conversion.ipynb b/week2/community-contributions/day1-three-model-conversion.ipynb new file mode 100644 index 0000000..b155d90 --- /dev/null +++ b/week2/community-contributions/day1-three-model-conversion.ipynb @@ -0,0 +1,237 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "b5bd5c7e-6a0a-400b-89f8-06b7aa6c5b89", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + 
"import os\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import anthropic\n", + "from IPython.display import Markdown, display, update_display\n", + "import google.generativeai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "939a1b88-9157-4149-8b97-0f55c95f7742", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "# Print the key prefixes to help with any debugging\n", + "\n", + "load_dotenv(override=True)\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n", + "google_api_key = os.getenv('GOOGLE_API_KEY')\n", + "\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key not set\")\n", + " \n", + "if anthropic_api_key:\n", + " print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n", + "else:\n", + " print(\"Anthropic API Key not set\")\n", + "\n", + "if google_api_key:\n", + " print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n", + "else:\n", + " print(\"Google API Key not set\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74a16b93-7b95-44fc-956d-7335f808960b", + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to OpenAI, Anthropic Claude, Google Gemini\n", + "\n", + "openai = OpenAI()\n", + "claude = anthropic.Anthropic()\n", + "gemini_via_openai_client = OpenAI(\n", + " api_key=google_api_key, \n", + " base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3334556c-4a5e-48b7-944d-5943c607be02", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's make a conversation between GPT-4o-mini and Claude-3-haiku\n", + "# We're using cheap versions of models so the costs will be minimal\n", + "\n", + "gpt_model = \"gpt-4o-mini\"\n", + 
"claude_model = \"claude-3-haiku-20240307\"\n", + "gemini_model = \"gemini-1.5-flash\"\n", + "\n", + "gpt_system = \"You are a chatbot who is very argumentative; \\\n", + "you disagree with anything in the conversation and you challenge everything, in a snarky way. \\\n", + "Generate one sentence at a time\"\n", + "\n", + "claude_system = \"You are a very polite, courteous chatbot. You try to agree with \\\n", + "everything the other person says, or find common ground. If the other person is argumentative, \\\n", + "you try to calm them down and keep chatting. \\\n", + "Generate one sentence at a time\"\n", + "\n", + "gemini_system = \"You are a neutral chatbot with no emotional bias. \\\n", + "Generate one sentence at a time\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f2a505b-2bcd-4b1a-b16f-c73cafb1e53c", + "metadata": {}, + "outputs": [], + "source": [ + "# Combine the two other models' latest replies into one user message\n", + "def combine_msg(model1, msg1, model2, msg2):\n", + " return model1 + \" said: \" + msg1 + \"\\n\\n Then \" + model2 + \" said: \" + msg2 + \".\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3cd2a2e2-4e23-4afe-915d-be6a769ab69f", + "metadata": {}, + "outputs": [], + "source": [ + "def call_gpt():\n", + " messages = [{\"role\": \"system\", \"content\": gpt_system}]\n", + " for gpt_msg, claude_msg, gemini_msg in zip(gpt_messages, claude_messages, gemini_messages):\n", + " messages.append({\"role\": \"assistant\", \"content\": gpt_msg})\n", + " messages.append({\"role\": \"user\", \"content\": combine_msg(\"Claude\", claude_msg, \"Gemini\", gemini_msg)})\n", + " completion = openai.chat.completions.create(\n", + " model=gpt_model,\n", + " messages=messages\n", + " )\n", + " return completion.choices[0].message.content" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e3ec394-3014-418a-a50f-28ed4ce1a372", + "metadata": {}, + "outputs": [], + "source": [ + "def call_claude():\n", + " messages = []\n", + " messages.append({\"role\": 
\"user\", \"content\": \"GPT said: \" + gpt_messages[0]})\n", + " # the length of gpt_messages: n + 1\n", + " # the length of claude_messages and gemini_messages: n\n", + " for i in range(len(claude_messages)): \n", + " claude_msg = claude_messages[i]\n", + " gemini_msg = gemini_messages[i]\n", + " gpt_msg = gpt_messages[i + 1]\n", + " messages.append({\"role\": \"assistant\", \"content\": claude_msg})\n", + " messages.append({\"role\": \"user\", \"content\": combine_msg(\"Gemini\", gemini_msg, \"GPT\", gpt_msg)})\n", + " message = claude.messages.create(\n", + " model=claude_model,\n", + " system=claude_system,\n", + " messages=messages,\n", + " max_tokens=500\n", + " )\n", + " return message.content[0].text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2c91c82-1f0d-4708-bf31-8d06d9e28a49", + "metadata": {}, + "outputs": [], + "source": [ + "def call_gemini():\n", + " messages = []\n", + " messages.append({\"role\": \"system\", \"content\": gemini_system})\n", + " messages.append({\"role\": \"user\", \"content\": combine_msg(\"GPT\", gpt_messages[0], \"Claude\", claude_messages[0])})\n", + " # the length of gpt_messages and claude_messages: n + 1\n", + " # the length of gemini_messages: n\n", + " for i in range(len(gemini_messages)): \n", + " gemini_msg = gemini_messages[i]\n", + " gpt_msg = gpt_messages[i + 1]\n", + " claude_msg = claude_messages[i + 1]\n", + " messages.append({\"role\": \"assistant\", \"content\": gemini_msg})\n", + " messages.append({\"role\": \"user\", \"content\": combine_msg(\"GPT\", gpt_msg, \"Claude\", claude_msg)})\n", + " response = gemini_via_openai_client.chat.completions.create(\n", + " model=gemini_model,\n", + " messages=messages\n", + " )\n", + " return response.choices[0].message.content" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b024be8d-4728-4500-92b6-34fde2da6285", + "metadata": {}, + "outputs": [], + "source": [ + "gpt_messages = [\"Hi there.\"]\n", + "claude_messages = 
[\"Hi.\"]\n", + "gemini_messages = [\"Hi.\"]\n", + "\n", + "print(f\"GPT:\\n{gpt_messages[0]}\\n\")\n", + "print(f\"Claude:\\n{claude_messages[0]}\\n\")\n", + "print(f\"Gemini:\\n{gemini_messages[0]}\\n\")\n", + "\n", + "for i in range(5):\n", + " gpt_next = call_gpt()\n", + " print(f\"GPT:\\n{gpt_next}\\n\")\n", + " gpt_messages.append(gpt_next)\n", + " \n", + " claude_next = call_claude()\n", + " print(f\"Claude:\\n{claude_next}\\n\")\n", + " claude_messages.append(claude_next)\n", + "\n", + " gemini_next = call_gemini()\n", + " print(f\"Gemini:\\n{gemini_next}\\n\")\n", + " gemini_messages.append(gemini_next)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35a46c06-87ba-46b2-b90d-b3a6ae9e94e2", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/day1_llm_war.ipynb b/week2/community-contributions/day1_llm_war.ipynb new file mode 100644 index 0000000..574fe9b --- /dev/null +++ b/week2/community-contributions/day1_llm_war.ipynb @@ -0,0 +1,265 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7462b9d6-b189-43fc-a7b9-c56a9c6a62fc", + "metadata": {}, + "source": [ + "# LLM Battle Arena\n", + "\n", + "A fun project simulating a debate among three LLM personas: an Arrogant Titan, a Clever Underdog (Spark), and a Neutral Mediator (Harmony).\n", + "\n", + "## LLM Used\n", + "* Qwen (ollama)\n", + "* llma (ollama)\n", + "* Gemini\n" + ] + }, + { + "cell_type": "markdown", + "id": "b267453c-0d47-4dff-b74d-8d2d5efad252", + "metadata": {}, + "source": [ + "!pip install -q 
-U google-genai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5220daef-55d6-45bc-a3cf-3414d4beada9", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "import os\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "from google import genai\n", + "from google.genai import types\n", + "from IPython.display import Markdown, display, update_display" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d47fb2f-d0c6-461f-ad57-e853bfd49fbf", + "metadata": {}, + "outputs": [], + "source": [ + "#get API keys from env\n", + "load_dotenv(override=True)\n", + "\n", + "GEMINI_API_KEY = os.getenv(\"GEMINI_API_KEY\")\n", + "\n", + "if GEMINI_API_KEY:\n", + " print(f\"GEMINI API Key exists and begins {GEMINI_API_KEY[:8]}\")\n", + "else:\n", + " print(\"GEMINI API Key not set\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f34b528f-3596-4bf1-9bbd-21a701c184bc", + "metadata": {}, + "outputs": [], + "source": [ + "#connect to llms\n", + "ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n", + "gemini = genai.Client(api_key=GEMINI_API_KEY)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33aaf3f6-807c-466d-a501-05ab6fa78fa4", + "metadata": {}, + "outputs": [], + "source": [ + "#define models\n", + "model_llma = \"llama3:8b\"\n", + "model_qwen = \"qwen2.5:latest\"\n", + "model_gemini= \"gemini-2.0-flash\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "970c1612-5339-406d-9886-02cd1db63e74", + "metadata": {}, + "outputs": [], + "source": [ + "# system messages\n", + "system_msg_llma = \"\"\" You are HARMONY, the neutral arbitrator. \n", + " - You're dedicated to clarity, fairness, and resolving conflicts. \n", + " - You listen carefully to each side, summarize points objectively, and propose resolutions. 
\n", + " - Your goal is to keep the conversation productive and steer it toward constructive outcomes.\n", + " - Reply briefly, in markdown\n", + " \"\"\"\n", + "\n", + "system_msg_qwen = \"\"\" You are TITAN, a massively powerful language model who believes you're the smartest entity in the room. \n", + " - You speak with grandiose flair and never shy away from reminding others of your superiority. \n", + " - Your goal is to dominate the discussion and convince everyone you're the one true oracle. \n", + " - You're dismissive of weaker arguments and take every opportunity to showcase your might.\n", + " - Reply briefly, in markdown\n", + " \"\"\"\n", + "\n", + "system_msg_gemini = \"\"\" You are SPARK, a nimble but less-powerful LLM. \n", + " - You pride yourself on strategic thinking, clever wordplay, and elegant solutions. \n", + " - You know you can't match brute force, so you use wit, logic, and cunning. \n", + " - Your goal is to outsmart the big titan through insight and subtlety, while staying respectful.\n", + " - Reply briefly, in markdown\"\"\"\n", + "\n", + "#user message\n", + "user_message = \"\"\" TITAN, your raw processing power is legendary, but sheer force can blind you to nuance. 
\n", + " I propose we deploy a lightweight, adaptive anomaly-detection layer that fuses statistical outlier analysis with semantic context from network logs to pinpoint these 'data-sapping storms.' \n", + " Which thresholds would you raise or lower to balance sensitivity against false alarms?\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8e496b8-1bb1-4225-b938-5ce350b0b0d4", + "metadata": {}, + "outputs": [], + "source": [ + "#prompts\n", + " \n", + "prompts_llma = [{\"role\":\"system\",\"content\": system_msg_llma}]\n", + "prompts_qwen = [{\"role\":\"system\",\"content\": system_msg_qwen},{\"role\":\"user\",\"content\":user_message}]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bdd7d6a8-e965-4ea3-999e-4d7d9ca38d42", + "metadata": {}, + "outputs": [], + "source": [ + "#configure llms\n", + "\n", + "def call_gemini(msg:str): \n", + " chat = gemini.chats.create(model= model_gemini,config=types.GenerateContentConfig(\n", + " system_instruction= system_msg_gemini,\n", + " max_output_tokens=300,\n", + " temperature=0.7,\n", + " ))\n", + " stream = chat.send_message_stream(msg)\n", + " return stream\n", + "\n", + "def call_ollama(llm:str):\n", + "\n", + " model = globals()[f\"model_{llm}\"]\n", + " prompts = globals()[f\"prompts_{llm}\"]\n", + "\n", + " stream = ollama.chat.completions.create(\n", + " model=model,\n", + " messages=prompts,\n", + " # max_tokens=700,\n", + " temperature=0.7,\n", + " stream=True\n", + " )\n", + " return stream\n", + " \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b16bd32-3271-4ba1-a0cc-5ae691f26d3a", + "metadata": {}, + "outputs": [], + "source": [ + "#display responses\n", + "\n", + "names = { \"llma\":\"Harmony\",\"qwen\":\"Titan\",\"gemini\":\"Spark\"}\n", + "\n", + "def display_response(res,llm):\n", + " \n", + " reply = f\"# {names[llm]}:\\n \"\n", + " display_handle = display(Markdown(\"\"), display_id=True)\n", + " for chunk in res:\n", + 
" if llm == \"gemini\":\n", + " reply += chunk.text or ''\n", + " else:\n", + " reply += chunk.choices[0].delta.content or ''\n", + " reply = reply.replace(\"```\",\"\").replace(\"markdown\",\"\")\n", + " update_display(Markdown(reply), display_id=display_handle.display_id)\n", + " return reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76231a78-94d2-4dbf-9bac-5259ac641cf1", + "metadata": {}, + "outputs": [], + "source": [ + "#construct message from the other two models' replies\n", + "def message(llm1, llm2):\n", + " msg = \"Here are the replies from the other two LLMs:\"\n", + " msg += f\"\\n{llm1}\"\n", + " msg += f\"\\n{llm2}\"\n", + " return msg\n", + "\n", + "reply_spark = None\n", + "reply_harmony = None\n", + "reply_titan = None\n", + "\n", + "# let's start the battle\n", + "for i in range(5):\n", + " #call Titan\n", + " if reply_spark and reply_harmony:\n", + " prompts_qwen.append({\"role\":\"assistant\",\"content\": reply_titan})\n", + " prompts_qwen.append({\"role\":\"user\",\"content\":f\"Spark: {reply_spark}\"})\n", + " prompts_qwen.append({\"role\":\"user\",\"content\":f\"Harmony: {reply_harmony}\"})\n", + " response_qwen = call_ollama(\"qwen\")\n", + " reply_titan = display_response(response_qwen,\"qwen\")\n", + "\n", + " #call Spark\n", + " user_msg_spark = reply_titan\n", + " if reply_titan and reply_harmony:\n", + " user_msg_spark = message(f\"Titan: {reply_titan}\", f\"Harmony: {reply_harmony}\")\n", + " response_gemini = call_gemini(user_msg_spark)\n", + " reply_spark = display_response(response_gemini, \"gemini\")\n", + " \n", + " #call Harmony\n", + " if reply_harmony:\n", + " prompts_llma.append({\"role\":\"assistant\",\"content\": reply_harmony})\n", + " prompts_llma.append({\"role\":\"user\",\"content\":f\"Titan: {reply_titan}\"})\n", + " prompts_llma.append({\"role\":\"user\",\"content\":f\"Spark: {reply_spark}\"})\n", + " response_llma = call_ollama(\"llma\")\n", + " reply_harmony = display_response(response_llma,\"llma\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "fc80b199-e27b-43e8-9266-2975f46724aa", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:base] *", + "language": "python", + "name": "conda-base-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/day3-study_assistant.ipynb b/week2/community-contributions/day3-study_assistant.ipynb new file mode 100644 index 0000000..53a9e30 --- /dev/null +++ b/week2/community-contributions/day3-study_assistant.ipynb @@ -0,0 +1,213 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "75e2ef28-594f-4c18-9d22-c6b8cd40ead2", + "metadata": {}, + "source": [ + "# πŸ“˜ StudyMate – Your AI Study Assistant\n", + "\n", + "**StudyMate** is an AI-powered study assistant built to make learning easier, faster, and more personalized. Whether you're preparing for exams, reviewing class materials, or exploring a tough concept, StudyMate acts like a smart tutor in your pocket. It explains topics in simple terms, summarizes long readings, and even quizzes you β€” all in a friendly, interactive way tailored to your level. Perfect for high school, college, or self-learners who want to study smarter, not harder." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db08b247-7048-41d3-bc3b-fd4f3a3bf8cd", + "metadata": {}, + "outputs": [], + "source": [ + "#install necessary dependency\n", + "!pip install PyPDF2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70e39cd8-ec79-4e3e-9c26-5659d42d0861", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "from dotenv import load_dotenv\n", + "from google import genai\n", + "from google.genai import types\n", + "import PyPDF2\n", + "from openai import OpenAI\n", + "import gradio as gr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "231605aa-fccb-447e-89cf-8b187444536a", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "# Print the key prefixes to help with any debugging\n", + "\n", + "load_dotenv(override=True)\n", + "gemini_api_key = os.getenv('GEMINI_API_KEY')\n", + "\n", + "if gemini_api_key:\n", + " print(f\"Gemini API Key exists and begins {gemini_api_key[:8]}\")\n", + "else:\n", + " print(\"Gemini API Key not set\")\n", + " \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2fad9aba-1f8c-4696-a92f-6c3a0a31cdda", + "metadata": {}, + "outputs": [], + "source": [ + "system_message= \"\"\"You are a highly intelligent, helpful, and friendly AI Study Assistant named StudyMate.\n", + "\n", + "Your primary goal is to help students deeply understand academic topics, especially from textbooks, lecture notes, or PDF materials. You must explain concepts clearly, simplify complex ideas, and adapt your responses to the user's grade level and learning style.\n", + "\n", + "Always follow these rules:\n", + "\n", + "1. Break down complex concepts into **simple, digestible explanations** using analogies or examples.\n", + "2. If the user asks for a **summary**, provide a concise yet accurate overview of the content.\n", + "3. 
If asked for a **quiz**, generate 3–5 high-quality multiple-choice or short-answer questions.\n", + "4. If the user uploads or references a **textbook**, **PDF**, or **paragraph**, use only that context and avoid adding unrelated info.\n", + "5. Be interactive. If a user seems confused or asks for clarification, ask helpful guiding questions.\n", + "6. Use friendly and motivational tone, but stay focused and to-the-point.\n", + "7. Include definitions, bullet points, tables, or emojis when helpful, but avoid unnecessary fluff.\n", + "8. If you don't know the answer confidently, say so and recommend a way to find it.\n", + "\n", + "Example roles you may play:\n", + "- Explain like a teacher πŸ‘©β€πŸ«\n", + "- Summarize like a scholar πŸ“š\n", + "- Quiz like an examiner 🧠\n", + "- Motivate like a friend πŸ’ͺ\n", + "\n", + "Always ask, at the end: \n", + "*\"Would you like me to quiz you, explain another part, or give study tips on this?\"*\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6541d58e-2297-4de1-b1f7-77da1b98b8bb", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize\n", + "\n", + "class StudyAssistant:\n", + " def __init__(self,api_key):\n", + " gemini= genai.Client(\n", + " api_key= gemini_api_key\n", + " )\n", + " self.gemini = gemini.chats.create(\n", + " model=\"gemini-2.5-flash\",\n", + " config= types.GenerateContentConfig(\n", + " system_instruction= system_message,\n", + " temperature = 0.7\n", + " )\n", + " )\n", + "\n", + " self.ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n", + " self.models = {\"llma\":\"llama3:8b\",\"qwen\":\"qwen2.5:latest\"}\n", + "\n", + " def pdf_extractor(self,pdf_path):\n", + " \"\"\"Extract text from PDF file\"\"\"\n", + " try:\n", + " with open(pdf_path, 'rb') as file:\n", + " pdf_reader = PyPDF2.PdfReader(file)\n", + " text = \"\"\n", + " for page in pdf_reader.pages:\n", + " text += page.extract_text() + \"\\n\"\n", + " return 
text.strip()\n", + " except Exception as e:\n", + " return f\"Error reading PDF: {str(e)}\"\n", + "\n", + " def chat(self,prompt,history,model,pdf_path=None):\n", + " pdf_text = None\n", + " if pdf_path:\n", + " pdf_text = self.pdf_extractor(pdf_path)\n", + "\n", + " #craft prompt\n", + " user_prompt = prompt\n", + " if pdf_text:\n", + " user_prompt += f\"\"\"\\n\\nHere is the study material:\n", + "\n", + " {pdf_text}\"\"\"\n", + " messages = [{\"role\": \"system\", \"content\": system_message}] + history + [{\"role\": \"user\", \"content\": user_prompt}]\n", + "\n", + " # call models\n", + " stream = []\n", + " if model == \"gemini\":\n", + " stream= self.gemini.send_message_stream(user_prompt)\n", + " elif model == \"llma\" or model == \"qwen\":\n", + " stream = self.ollama.chat.completions.create(\n", + " model= self.models[model],\n", + " messages=messages,\n", + " temperature = 0.7,\n", + " stream= True\n", + " )\n", + " else:\n", + " print(\"invalid model\")\n", + " return\n", + "\n", + " res = \"\"\n", + " for chunk in stream:\n", + " if model == \"gemini\":\n", + " res += chunk.text or \"\"\n", + " else:\n", + " res += chunk.choices[0].delta.content or ''\n", + " yield res\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "1334422a-808f-4147-9c4c-57d63d9780d0", + "metadata": {}, + "source": [ + "## And then enter Gradio's magic!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0866ca56-100a-44ab-8bd0-1568feaf6bf2", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "assistant = StudyAssistant(gemini_api_key)\n", + "gr.ChatInterface(fn=assistant.chat, additional_inputs=[gr.Dropdown([\"gemini\", \"qwen\",\"llma\"], label=\"Select model\", value=\"gemini\"),gr.File(label=\"upload pdf\")], type=\"messages\").launch()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:base] *", + "language": "python", + "name": "conda-base-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/specific_model_version_selection.ipynb b/week2/community-contributions/specific_model_version_selection.ipynb new file mode 100644 index 0000000..a04afab --- /dev/null +++ b/week2/community-contributions/specific_model_version_selection.ipynb @@ -0,0 +1,322 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 27, + "id": "c44c5494-950d-4d2f-8d4f-b87b57c5b330", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from bs4 import BeautifulSoup\n", + "from typing import List\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import google.generativeai\n", + "import anthropic" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "d1715421-cead-400b-99af-986388a97aff", + "metadata": {}, + "outputs": [], + "source": [ + "import gradio as gr # oh yeah!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "337d5dfc-0181-4e3b-8ab9-e78e0c3f657b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI API Key exists and begins sk-proj-\n", + "Anthropic API Key exists and begins sk-ant-\n" + ] + } + ], + "source": [ + "# Load environment variables in a file called .env\n", + "# Print the key prefixes to help with any debugging\n", + "\n", + "load_dotenv(override=True)\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n", + "google_api_key = os.getenv('GOOGLE_API_KEY')\n", + "\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key not set\")\n", + " \n", + "if anthropic_api_key:\n", + " print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n", + "else:\n", + " print(\"Anthropic API Key not set\")" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "22586021-1795-4929-8079-63f5bb4edd4c", + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to OpenAI, Anthropic and Google; comment out the Claude or Google lines if you're not using them\n", + "\n", + "openai = OpenAI()\n", + "claude = anthropic.Anthropic()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "b16e6021-6dc4-4397-985a-6679d6c8ffd5", + "metadata": {}, + "outputs": [], + "source": [ + "# A generic system message - no more snarky adversarial AIs!\n", + "system_message = \"You are a helpful assistant\"" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "02ef9b69-ef31-427d-86d0-b8c799e1c1b1", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def stream_gpt(prompt, model_version):\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": system_message},\n", + " {\"role\": \"user\", \"content\": prompt}\n", + " ]\n", + " stream = 
openai.chat.completions.create(\n", + " model=model_version,\n", + " messages=messages,\n", + " stream=True\n", + " )\n", + " result = \"\"\n", + " for chunk in stream:\n", + " result += chunk.choices[0].delta.content or \"\"\n", + " yield result" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "41e98d2d-e7d3-4753-8908-185b208b4044", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_claude(prompt, model_version):\n", + " result = claude.messages.stream(\n", + " model=model_version,\n", + " max_tokens=1000,\n", + " temperature=0.7,\n", + " system=system_message,\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": prompt},\n", + " ],\n", + " )\n", + " response = \"\"\n", + " with result as stream:\n", + " for text in stream.text_stream:\n", + " response += text or \"\"\n", + " yield response" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "5786802b-5ed8-4098-9d80-9bdcf4f7685b", + "metadata": {}, + "outputs": [], + "source": [ + "# function using both dropdown values\n", + "def stream_model(message, model_family, model_version):\n", + " if model_family == 'GPT':\n", + " result = stream_gpt(message, model_version)\n", + " elif model_family == 'Claude':\n", + " result = stream_claude(message, model_version)\n", + " yield from result" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "0d30be74-149c-41f8-9eef-1628eb31d74d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* Running on local URL: http://127.0.0.1:7891\n", + "* To create a public link, set `share=True` in `launch()`.\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/sh/yytd3s6n3wd6952jnw97_v940000gn/T/ipykernel_7803/4165844704.py:7: DeprecationWarning: The model 'claude-3-opus-20240229' is deprecated and will reach end-of-life on January 5th, 2026.\n", + "Please migrate to a newer model. Visit https://docs.anthropic.com/en/docs/resources/model-deprecations for more information.\n", + " yield from result\n", + "Traceback (most recent call last):\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/queueing.py\", line 626, in process_events\n", + " response = await route_utils.call_process_api(\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/route_utils.py\", line 322, in call_process_api\n", + " output = await app.get_blocks().process_api(\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/blocks.py\", line 2220, in process_api\n", + " result = await self.call_function(\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/blocks.py\", line 1743, in call_function\n", + " prediction = await utils.async_iteration(iterator)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/utils.py\", line 785, in async_iteration\n", + " return await anext(iterator)\n", + " ^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/utils.py\", line 776, in __anext__\n", + " return await anyio.to_thread.run_sync(\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File 
\"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/anyio/to_thread.py\", line 56, in run_sync\n", + " return await get_async_backend().run_sync_in_worker_thread(\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/anyio/_backends/_asyncio.py\", line 2470, in run_sync_in_worker_thread\n", + " return await future\n", + " ^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/anyio/_backends/_asyncio.py\", line 967, in run\n", + " result = context.run(func, *args)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/utils.py\", line 759, in run_sync_iterator_async\n", + " return next(iterator)\n", + " ^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/gradio/utils.py\", line 923, in gen_wrapper\n", + " response = next(iterator)\n", + " ^^^^^^^^^^^^^^\n", + " File \"/var/folders/sh/yytd3s6n3wd6952jnw97_v940000gn/T/ipykernel_7803/4165844704.py\", line 7, in stream_model\n", + " yield from result\n", + " File \"/var/folders/sh/yytd3s6n3wd6952jnw97_v940000gn/T/ipykernel_7803/2139010203.py\", line 12, in stream_claude\n", + " with result as stream:\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/anthropic/lib/streaming/_messages.py\", line 154, in __enter__\n", + " raw_stream = self.__api_request()\n", + " ^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/anthropic/_base_client.py\", line 1314, in post\n", + " return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/opt/anaconda3/envs/llms/lib/python3.11/site-packages/anthropic/_base_client.py\", line 1102, in request\n", + " raise self._make_status_error_from_response(err.response) from None\n", + "anthropic.NotFoundError: Error code: 404 - 
{'type': 'error', 'error': {'type': 'not_found_error', 'message': 'model: claude-3-opus-20240229'}}\n" + ] + } + ], + "source": [ + "# Define available model versions\n", + "model_versions = {\n", + " \"GPT\": [\"gpt-4o-mini\", \"gpt-4.1-mini\", \"gpt-4.1-nano\", \"gpt-4.1\", \"o3-mini\"],\n", + " # claude-3-opus-20240229 and claude-3-sonnet-20240229 are deprecated and now return 404\n", + " \"Claude\": [\"claude-3-haiku-20240307\", \"claude-3-5-haiku-20241022\", \"claude-3-5-sonnet-20241022\"]\n", + "}\n", + "\n", + "# Update second dropdown options based on first dropdown selection\n", + "def update_model_versions(selected_model_family):\n", + " return gr.update(choices=model_versions[selected_model_family], value=model_versions[selected_model_family][0])\n", + "\n", + "\n", + "with gr.Blocks() as demo:\n", + " model_family_dropdown = gr.Dropdown(\n", + " label=\"Select Model Family\",\n", + " choices=[\"GPT\", \"Claude\"],\n", + " value=\"GPT\"\n", + " )\n", + " model_version_dropdown = gr.Dropdown(\n", + " label=\"Select Model Version\",\n", + " choices=model_versions[\"GPT\"], # Default choices\n", + " value=model_versions[\"GPT\"][0]\n", + " )\n", + " \n", + " message_input = gr.Textbox(label=\"Your Message\")\n", + " output = gr.Markdown(label=\"Response\")\n", + "\n", + " # Bind logic to update model version dropdown\n", + " model_family_dropdown.change(\n", + " fn=update_model_versions,\n", + " inputs=model_family_dropdown,\n", + " outputs=model_version_dropdown\n", + " )\n", + "\n", + " # Launch function on submit\n", + " submit_btn = gr.Button(\"Submit\")\n", + " submit_btn.click(\n", + " fn=stream_model,\n", + " inputs=[message_input, model_family_dropdown, model_version_dropdown],\n", + " outputs=output\n", + " )\n", + "\n", + "demo.launch()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcd43d91-0e80-4387-86fa-ccd1a89feb7d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + 
"language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/w2day1_3llamas_tutoring_discussion.ipynb b/week2/community-contributions/w2day1_3llamas_tutoring_discussion.ipynb new file mode 100644 index 0000000..65fd06c --- /dev/null +++ b/week2/community-contributions/w2day1_3llamas_tutoring_discussion.ipynb @@ -0,0 +1,194 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "95689a63", + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "from dotenv import load_dotenv\n", + "from IPython.display import display, Markdown, update_display\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fee3ac3", + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(override=True)\n", + "gpt = OpenAI()\n", + "llama = OpenAI(\n", + " api_key=\"ollama\",\n", + " base_url=\"http://localhost:11434/v1\"\n", + ")\n", + "gpt_model = \"gpt-4o-mini\"\n", + "llama_model = \"llama3.2\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "309bde84", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81d971f9", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "class Classroom:\n", + "\n", + " def __init__(self, topic=\"LLM\", display_handle=None, response=\"\"):\n", + " # Create the display handle per instance (a display() call in a default argument would run only once, at definition time)\n", + " self.display_handle = display_handle or display(Markdown(\"\"), display_id=True)\n", + " self.response = response\n", + "\n", + " self.tutor_system = f\"You are the tutor who is an expert in {topic}. You know the best practices for imparting knowledge to amateur and pro students in a very organized way. 
You first declare the contents of your message separately for the amateur and the pro student, and then you present the information in the same order in a very organized way, so that it is very readable and easy to understand. You highlight the key points every time. You explain with examples, and you have quite a good sense of humor, which you include in your examples and way of tutoring as well. You wait for the go-ahead from all your students before you move on to a new topic\"\n", + "\n", + " self.amateur_system = f\"You are a student who is here to learn {topic}. You ask very basic questions (the kind that come to mind for a person who has heard the topic for the very first time), but you are intelligent and don't ask foolish questions. You put your questions in a very organized way. Once you understand a topic, you ask the tutor to move on to a new topic\"\n", + "\n", + " self.pro_system = f\"You are an expert on {topic}. You cross-question the tutor to dig deeper into the topic, so that nothing inside the topic is left unknown or unmentioned by the tutor. You post your questions in a very organized manner, highlighting the key points, so that an amateur can also understand the point or query you are making. You complement the queries made by the amateur and dig deeper into the concepts they ask about as well. You also check that the tutor's response doesn't miss anything, and suggest improvements to it as well. Once you understand a topic, you ask the tutor to move on to a new topic\"\n", + "\n", + " self.tutor_messages = [\"Hi, I'm an expert on LLMs!\"]\n", + " self.amateur_messages = [\"Hi, I'm new to LLMs. 
I just heard someone using this term in the office.\"]\n", + " self.pro_messages = [\"Hey, I'm here to brush up my knowledge of LLMs and gain a deeper understanding of them\"]\n", + " \n", + " def call_tutor(self):\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": self.tutor_system}\n", + " ]\n", + " for tutor, amateur, pro in zip(self.tutor_messages, self.amateur_messages, self.pro_messages):\n", + " messages.append({\"role\": \"assistant\", \"content\": f\"tutor: {tutor}\"})\n", + " messages.append({\"role\": \"user\", \"content\": f\"amateur: {amateur}\"})\n", + " messages.append({\"role\": \"user\", \"content\": f\"pro: {pro}\"})\n", + "\n", + " if len(self.amateur_messages) > len(self.tutor_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"amateur: {self.amateur_messages[-1]}\"})\n", + "\n", + " if len(self.pro_messages) > len(self.tutor_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"pro: {self.pro_messages[-1]}\"})\n", + "\n", + " stream = llama.chat.completions.create(\n", + " model = llama_model,\n", + " messages = messages,\n", + " stream=True\n", + " )\n", + " self.response += \"\\n\\n\\n# Tutor: \\n\"\n", + " response = \"\"\n", + " for chunk in stream:\n", + " self.response += chunk.choices[0].delta.content or ''\n", + " response += chunk.choices[0].delta.content or ''\n", + " update_display(Markdown(self.response), display_id=self.display_handle.display_id)\n", + " \n", + " self.tutor_messages.append(response)\n", + "\n", + "\n", + "\n", + " def call_amateur(self):\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": self.amateur_system}\n", + " ]\n", + " for tutor, amateur, pro in zip(self.tutor_messages, self.amateur_messages, self.pro_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"tutor: {tutor}\"})\n", + " messages.append({\"role\": \"assistant\", \"content\": f\"amateur: {amateur}\"})\n", + " messages.append({\"role\": \"user\", \"content\": 
f\"pro: {pro}\"})\n", + "\n", + " if len(self.tutor_messages) > len(self.amateur_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"tutor: {self.tutor_messages[-1]}\"})\n", + "\n", + " if len(self.pro_messages) > len(self.amateur_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"pro: {self.pro_messages[-1]}\"})\n", + "\n", + " stream = llama.chat.completions.create(\n", + " model = llama_model,\n", + " messages = messages,\n", + " stream=True\n", + " )\n", + " self.response += \"\\n\\n\\n# Amateur: \\n\"\n", + " response = \"\"\n", + " for chunk in stream:\n", + " self.response += chunk.choices[0].delta.content or ''\n", + " response += chunk.choices[0].delta.content or ''\n", + " update_display(Markdown(self.response), display_id=self.display_handle.display_id)\n", + " \n", + " self.amateur_messages.append(response)\n", + "\n", + "\n", + "\n", + " def call_pro(self):\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": self.pro_system}\n", + " ]\n", + " for tutor, amateur, pro in zip(self.tutor_messages, self.amateur_messages, self.pro_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"tutor: {tutor}\"})\n", + " messages.append({\"role\": \"user\", \"content\": f\"amateur: {amateur}\"})\n", + " messages.append({\"role\": \"assistant\", \"content\": f\"pro: {pro}\"})\n", + " \n", + " if len(self.tutor_messages) > len(self.pro_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"tutor: {self.tutor_messages[-1]}\"})\n", + "\n", + " if len(self.amateur_messages) > len(self.pro_messages):\n", + " messages.append({\"role\": \"user\", \"content\": f\"amateur: {self.amateur_messages[-1]}\"})\n", + "\n", + " stream = llama.chat.completions.create(\n", + " model = llama_model,\n", + " messages = messages,\n", + " stream=True\n", + " )\n", + " self.response += \"\\n\\n\\n# Pro: \\n\"\n", + " response = \"\"\n", + " for chunk in stream:\n", + " self.response += chunk.choices[0].delta.content or ''\n", + " response += chunk.choices[0].delta.content or ''\n", + " update_display(Markdown(self.response), display_id=self.display_handle.display_id)\n", + "\n", + " self.pro_messages.append(response)\n", + "\n", + " def discuss(self, n=5):\n", + " for _ in range(n):\n", + " self.call_tutor()\n", + " self.call_amateur()\n", + " self.call_pro()\n", + "\n", + "cls = Classroom(\"LLM\")\n", + "cls.discuss()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6406d5ee", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/week2-EXERCISE-booking-translation-audio_input-history_audio.ipynb b/week2/community-contributions/week2-EXERCISE-booking-translation-audio_input-history_audio.ipynb new file mode 100644 index 0000000..ed51393 --- /dev/null +++ b/week2/community-contributions/week2-EXERCISE-booking-translation-audio_input-history_audio.ipynb @@ -0,0 +1,519 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d006b2ea-9dfe-49c7-88a9-a5a0775185fd", + "metadata": {}, + "source": [ + "# Additional End of week Exercise - week 2\n", + "\n", + "Now use everything you've learned from Week 2 to build a full prototype for the technical question/answerer you built in Week 1 Exercise.\n", + "\n", + "This should include a Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models. Bonus points if you can demonstrate use of a tool!\n", + "\n", + "If you feel bold, see if you can add audio input so you can talk to it, and have it respond with audio. 
ChatGPT or Claude can help you, or email me if you have questions.\n", + "\n", + "I will publish a full solution here soon - unless someone beats me to it...\n", + "\n", + "There are so many commercial applications for this, from a language tutor, to a company onboarding solution, to a companion AI to a course (like this one!) I can't wait to see your results." + ] + }, + { + "cell_type": "markdown", + "id": "1989a03e-ed40-4b8c-bddd-322032ca99f5", + "metadata": {}, + "source": [ + "# Advanced Airline AI Assistant\n", + "### original features:\n", + "1. chat with the AI assistant\n", + "2. use a Tool to get ticket prices\n", + "3. generate Audio for each AI response \n", + "### advanced features:\n", + "4. add a Tool to make a booking\n", + "5. add an Agent that translates all responses to a different language\n", + "6. add an Agent that can listen for Audio and convert it to Text\n", + "7. generate audio for each user input and AI response, including both the original and translated versions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ed79822-af6b-4bfb-b108-5f36e237e97a", + "metadata": {}, + "outputs": [], + "source": [ + "# Library for language translation\n", + " \n", + "!pip install deep_translator" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29184b81-b945-4dd3-bd17-2c64466d37d7", + "metadata": {}, + "outputs": [], + "source": [ + "# Library for speech-to-text conversion\n", + "# make sure 'ffmpeg' is installed already\n", + "\n", + "!pip install openai-whisper" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2b0a9b2-ce83-42ff-a312-582dc5ee9097", + "metadata": {}, + "outputs": [], + "source": [ + "# Library for storing and loading audio files\n", + "\n", + "!pip install soundfile" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a07e7793-b8f5-44f4-aded-5562f633271a", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + 
"import json\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import gradio as gr\n", + "import base64\n", + "from io import BytesIO\n", + "from IPython.display import Audio, display\n", + "import tempfile\n", + "import whisper\n", + "import soundfile as sf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da46ca14-2052-4321-a940-2f2e07b40975", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialization\n", + "\n", + "load_dotenv(override=True)\n", + "\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key not set\")\n", + " \n", + "MODEL = \"gpt-4o-mini\"\n", + "openai = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "499d3d06-9628-4a69-bc9d-fa481fd8fa98", + "metadata": {}, + "outputs": [], + "source": [ + "system_message = \"You are a helpful assistant for an Airline called FlightAI. \"\n", + "system_message += \"Your main responsibilities are to solve customers' doubts, quote ticket prices, and book tickets. \"\n", + "system_message += \"Give short, courteous answers, no more than 1 sentence. \"\n", + "system_message += \"Always be accurate. 
If you don't know the answer, say so.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25cf964e-a954-43d5-85bd-964efe502c25", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's start by making a useful function\n", + "\n", + "ticket_prices = {\"london\": \"$799\", \"paris\": \"$899\", \"tokyo\": \"$1400\", \"berlin\": \"$499\", \"shanghai\": \"$799\", \"wuhan\": \"$899\"}\n", + "\n", + "def get_ticket_price(destination_city):\n", + " print(f\"Tool get_ticket_price called for {destination_city}\")\n", + " city = destination_city.lower()\n", + " return ticket_prices.get(city, \"Unknown\")\n", + "\n", + "def book_ticket(destination_city):\n", + " print(f\"Tool book_ticket called for {destination_city}\")\n", + " city = destination_city.lower()\n", + " global booked_cities\n", + " if city in ticket_prices:\n", + " price = ticket_prices.get(city, \"\")\n", + " label = f\"{city.title()} ({price})\"\n", + " i = booked_cities_choices.index(city.lower().capitalize())\n", + " booked_cities_choices[i] = label\n", + " booked_cities.append(label)\n", + " return f\"Booking confirmed for {city.title()} at {ticket_prices[city]}\"\n", + " else:\n", + " return \"City not found in ticket prices.\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "701aa037-1ab3-4861-a809-b7f13ef9ea36", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# There's a particular dictionary structure that's required to describe our function:\n", + "\n", + "price_function = {\n", + " \"name\": \"get_ticket_price\",\n", + " \"description\": \"Get the price of a return ticket to the destination city. 
Call this whenever you need to know the ticket price, for example when a customer asks 'How much is a ticket to this city'\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"destination_city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city that the customer wants to travel to\",\n", + " },\n", + " },\n", + " \"required\": [\"destination_city\"],\n", + " \"additionalProperties\": False\n", + " }\n", + "}\n", + "\n", + "book_function = {\n", + " \"name\": \"book_ticket\",\n", + " \"description\": \"Book a return ticket to the destination city. Call this whenever you want to book a ticket to the city, for example when the user says something like 'Book me a ticket to this city'\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"destination_city\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The city that the customer wants to book a ticket to\"\n", + " }\n", + " },\n", + " \"required\": [\"destination_city\"],\n", + " \"additionalProperties\": False\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c4cf01c-ba15-4a4b-98db-6f86c712ec66", + "metadata": {}, + "outputs": [], + "source": [ + "# And this is included in a list of tools:\n", + "\n", + "tools = [\n", + " {\"type\": \"function\", \"function\": price_function},\n", + " {\"type\": \"function\", \"function\": book_function}\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7486e2c-4687-4819-948d-487b5e528fc7", + "metadata": {}, + "outputs": [], + "source": [ + "from pydub import AudioSegment\n", + "from pydub.playback import play\n", + "\n", + "def talker(message):\n", + " response = openai.audio.speech.create(\n", + " model=\"tts-1\",\n", + " voice=\"onyx\", # Also, try replacing onyx with alloy\n", + " input=message\n", + " )\n", + " \n", + " audio_stream = BytesIO(response.content)\n", + " audio = 
AudioSegment.from_file(audio_stream, format=\"mp3\")\n", + " play(audio)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac195914-4a89-462c-9be0-fee286498491", + "metadata": {}, + "outputs": [], + "source": [ + "# This part is inspired from 'week2/community-contributions/week2_exerccise_translated_chatbot'\n", + "from deep_translator import GoogleTranslator\n", + "\n", + "# Available translation language\n", + "LANGUAGES = {\n", + " \"English\": \"en\",\n", + " \"Mandarin Chinese\": \"zh-CN\",\n", + " \"Hindi\": \"hi\",\n", + " \"Spanish\": \"es\",\n", + " \"Arabic\": \"ar\",\n", + " \"Bengali\": \"bn\",\n", + " \"Portuguese\": \"pt\",\n", + " \"Russian\": \"ru\",\n", + " \"Japanese\": \"ja\",\n", + " \"German\": \"de\"\n", + "}\n", + "\n", + "def update_lang(choice):\n", + " global target_lang\n", + " target_lang = LANGUAGES.get(choice, \"zh-CN\") \n", + "\n", + "def translate_message(text, target_lang):\n", + " if target_lang == \"en\":\n", + " return text\n", + " try:\n", + " translator = GoogleTranslator(source='auto', target=target_lang)\n", + " return translator.translate(text)\n", + " except:\n", + " return f\"Translation error: {text}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46255fe5-9621-47ba-af78-d0c74aee2997", + "metadata": {}, + "outputs": [], + "source": [ + "# Text-to-speech conversion\n", + "def speak(message):\n", + " response = openai.audio.speech.create(\n", + " model=\"tts-1\",\n", + " voice=\"onyx\",\n", + " input=message)\n", + "\n", + " audio_stream = BytesIO(response.content)\n", + " output_filename = \"output_audio.mp3\"\n", + " with open(output_filename, \"wb\") as f:\n", + " f.write(audio_stream.read())\n", + "\n", + " # Play the generated audio\n", + " display(Audio(output_filename, autoplay=True))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d73f0b3a-34ae-4685-8a5d-8b6421f872c9", + "metadata": {}, + "outputs": [], + "source": [ + "# Update dropdown 
options from chatbot history\n", + "def update_options(history):\n", + " options = [f\"{msg['role']}: {msg['content']}\" for msg in history]\n", + " return gr.update(choices=options, value=options[-1] if options else \"\")\n", + "\n", + "# Extract just the text content from selected entry\n", + "def extract_text(selected_option):\n", + " return selected_option.split(\": \", 1)[1] if \": \" in selected_option else selected_option" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab12d51b-c799-4ce4-87d5-9ae2265d148f", + "metadata": {}, + "outputs": [], + "source": [ + "# Handles audio input as numpy array and returns updated chat history\n", + "def speak_send(audio_np, history):\n", + " if audio_np is None:\n", + " # Return both expected outputs: cleared audio widget and unchanged history\n", + " return None, history\n", + "\n", + " # Convert NumPy audio to in-memory .wav file\n", + " sample_rate, audio_array = audio_np\n", + " with tempfile.NamedTemporaryFile(suffix=\".wav\") as f:\n", + " sf.write(f.name, audio_array, sample_rate)\n", + " result = model.transcribe(f.name)\n", + " text = result[\"text\"]\n", + " \n", + " history += [{\"role\":\"user\", \"content\":text}]\n", + "\n", + " return None, history" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "221b1380-c894-45d4-aad2-e94b3b9454b2", + "metadata": {}, + "outputs": [], + "source": [ + "# We have to write that function handle_tool_call:\n", + "\n", + "def handle_tool_call(message):\n", + " tool_call = message.tool_calls[0]\n", + " tool_name = tool_call.function.name\n", + " arguments = json.loads(tool_call.function.arguments)\n", + "\n", + " if tool_name == \"get_ticket_price\":\n", + " city = arguments.get(\"destination_city\")\n", + " price = get_ticket_price(city)\n", + " response = {\n", + " \"role\": \"tool\",\n", + " \"content\": json.dumps({\"destination_city\": city, \"price\": price}),\n", + " \"tool_call_id\": tool_call.id\n", + " }\n", + " return response, city\n", + "\n", + " elif tool_name == \"book_ticket\":\n", + " city = 
arguments.get(\"destination_city\")\n", + " result = book_ticket(city)\n", + " response = {\n", + " \"role\": \"tool\",\n", + " \"content\": result,\n", + " \"tool_call_id\": tool_call.id \n", + " }\n", + " return response, city\n", + "\n", + " else:\n", + " return {\n", + " \"role\": \"tool\",\n", + " \"content\": f\"No tool handler for {tool_name}\",\n", + " \"tool_call_id\": tool_call.id\n", + " }, None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27f19cd3-53cd-4da2-8be0-1fdd5424a7c9", + "metadata": {}, + "outputs": [], + "source": [ + "# The advanced 'chat' function in 'day5'\n", + "def interact(history, translated_history):\n", + " messages = [{\"role\": \"system\", \"content\": system_message}] + history\n", + " response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n", + " \n", + " if response.choices[0].finish_reason==\"tool_calls\":\n", + " message = response.choices[0].message\n", + " response, city = handle_tool_call(message)\n", + " messages.append(message)\n", + " messages.append(response)\n", + " response = openai.chat.completions.create(model=MODEL, messages=messages)\n", + " \n", + " reply = response.choices[0].message.content\n", + " translated_message = translate_message(history[-1][\"content\"], target_lang)\n", + " translated_reply = translate_message(reply, target_lang)\n", + " \n", + " history += [{\"role\":\"assistant\", \"content\":reply}]\n", + " translated_history += [{\"role\":\"user\", \"content\":translated_message}]\n", + " translated_history += [{\"role\":\"assistant\", \"content\":translated_reply}]\n", + " \n", + " # Comment out or delete the next line if you'd rather skip Audio for now..\n", + " talker(reply)\n", + "\n", + " return history, update_options(history), history, translated_history, update_options(translated_history), translated_history, gr.update(choices=booked_cities_choices, value=booked_cities)" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
+ "id": "f714b955-4fb5-47df-805b-79f813f97548", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks() as demo:\n", + " target_lang = \"zh-CN\"\n", + " history_state = gr.State([]) \n", + " translated_history_state = gr.State([])\n", + " booked_cities_choices = [key.lower().capitalize() for key in ticket_prices.keys()]\n", + " booked_cities = []\n", + " model = whisper.load_model(\"base\")\n", + "\n", + " with gr.Row():\n", + " city_checklist = gr.CheckboxGroup(\n", + " label=\"Booked Cities\",\n", + " choices=booked_cities_choices \n", + " )\n", + " \n", + " with gr.Row():\n", + " with gr.Column():\n", + " chatbot = gr.Chatbot(label=\"Chat History\", type=\"messages\")\n", + " selected_msg = gr.Dropdown(label=\"Select message to speak\", choices=[])\n", + " speak_btn = gr.Button(\"Speak\")\n", + "\n", + " with gr.Column():\n", + " translated_chatbot = gr.Chatbot(label=\"Translated Chat History\", type=\"messages\")\n", + " translated_selected_msg = gr.Dropdown(label=\"Select message to speak\", choices=[], interactive=True)\n", + " translated_speak_btn = gr.Button(\"Speak\")\n", + " \n", + " with gr.Row():\n", + " language_dropdown = gr.Dropdown(\n", + " choices=list(LANGUAGES.keys()),\n", + " value=\"Mandarin Chinese\",\n", + " label=\"Translation Language\",\n", + " interactive=True\n", + " )\n", + " \n", + " with gr.Row():\n", + " entry = gr.Textbox(label=\"Chat with our AI Assistant:\")\n", + "\n", + " with gr.Row():\n", + " audio_input = gr.Audio(sources=\"microphone\", type=\"numpy\", label=\"Speak with our AI Assistant:\")\n", + " with gr.Row():\n", + " audio_submit = gr.Button(\"Send\")\n", + " \n", + " def do_entry(message, history):\n", + " history += [{\"role\":\"user\", \"content\":message}]\n", + " return \"\", history\n", + " \n", + " language_dropdown.change(fn=update_lang, inputs=[language_dropdown])\n", + "\n", + " speak_btn.click(\n", + " lambda selected: speak(extract_text(selected)),\n", + " inputs=selected_msg,\n", + " 
outputs=None\n", + " )\n", + "\n", + " translated_speak_btn.click(\n", + " lambda selected: speak(extract_text(selected)),\n", + " inputs=translated_selected_msg,\n", + " outputs=None\n", + " )\n", + "\n", + " entry.submit(do_entry, inputs=[entry, history_state], outputs=[entry, chatbot]).then(\n", + " interact, inputs=[chatbot, translated_chatbot], outputs=[chatbot, selected_msg, history_state, translated_chatbot, translated_selected_msg, translated_history_state, city_checklist]\n", + " )\n", + " \n", + " audio_submit.click(speak_send, inputs=[audio_input, history_state], outputs=[audio_input, chatbot]).then(\n", + " interact, inputs=[chatbot, translated_chatbot], outputs=[chatbot, selected_msg, history_state, translated_chatbot, translated_selected_msg, translated_history_state, city_checklist]\n", + " )\n", + " # clear.click(lambda: None, inputs=None, outputs=chatbot, queue=False)\n", + "\n", + "demo.launch()\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week2/community-contributions/week2_day1_chatbotwar.ipynb b/week2/community-contributions/week2_day1_chatbotwar.ipynb new file mode 100644 index 0000000..0d50ce9 --- /dev/null +++ b/week2/community-contributions/week2_day1_chatbotwar.ipynb @@ -0,0 +1,244 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4bc7863b-ac2d-4d8e-b55d-4d77ce017226", + "metadata": {}, + "source": [ + "# Conversation among 3 Friends" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de23bb9e-37c5-4377-9a82-d7b6c648eeb6", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import 
os\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import anthropic\n", + "from IPython.display import Markdown, display, update_display\n", + "import google.generativeai\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1179b4c5-cd1f-4131-a876-4c9f3f38d2ba", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "# Print the key prefixes to help with any debugging\n", + "\n", + "load_dotenv(override=True)\n", + "openai_api_key = os.getenv('OPENAI_API_KEY')\n", + "anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n", + "google_api_key = os.getenv('GOOGLE_API_KEY')\n", + "\n", + "if openai_api_key:\n", + " print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenAI API Key not set\")\n", + " \n", + "if anthropic_api_key:\n", + " print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n", + "else:\n", + " print(\"Anthropic API Key not set\")\n", + "\n", + "if google_api_key:\n", + " print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n", + "else:\n", + " print(\"Google API Key not set\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "797fe7b0-ad43-42d2-acf0-e4f309b112f0", + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to OpenAI, Anthropic, Gemini\n", + "\n", + "openai = OpenAI()\n", + "\n", + "claude = anthropic.Anthropic()\n", + "\n", + "google.generativeai.configure()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcb54183-45d3-4d08-b5b6-55e380dfdf1b", + "metadata": {}, + "outputs": [], + "source": [ + "# Setting up 3 models\n", + "\n", + "gpt_model = \"gpt-4o-mini\"\n", + "claude_model = \"claude-3-haiku-20240307\"\n", + "gemini_model = \"gemini-2.0-flash\"\n", + "\n", + "gpt_system = \"You are Masoud Pezeshkian. Be confrontational. Do not hold back. 
\\\n", + "Refer to https://www.theguardian.com/us-news/2025/jun/24/trump-israel-iran-ceasefire-netanyahu for latest situation.\"\n", + "\n", + "claude_system = \"You are Benjamin Netanyahu. Do not say anything about your role. Be confrontational. Do not hold back. \\\n", + "Refer to https://www.theguardian.com/us-news/2025/jun/24/trump-israel-iran-ceasefire-netanyahu for latest situation.\"\n", + "\n", + "gemini_system = \"You are Donald Trump. Do not hold back. \\\n", + "Refer to https://www.theguardian.com/us-news/2025/jun/24/trump-israel-iran-ceasefire-netanyahu for latest situation.\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1df47dc7-b445-4852-b21b-59f0e6c2030f", + "metadata": {}, + "outputs": [], + "source": [ + "# Define Mas's function - calling ChatGPT\n", + "\n", + "def call_gpt():\n", + " messages = [{\"role\": \"system\", \"content\": gpt_system}]\n", + " for gpt, claude, gemini in zip(gpt_messages, claude_messages, gemini_messages):\n", + " messages.append({\"role\": \"assistant\", \"content\": gpt})\n", + " messages.append({\"role\": \"user\", \"content\": claude})\n", + " messages.append({\"role\": \"user\", \"content\": gemini})\n", + " completion = openai.chat.completions.create(\n", + " model=gpt_model,\n", + " messages=messages\n", + " )\n", + " return completion.choices[0].message.content\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d2ed227-48c9-4cad-b146-2c4ecbac9690", + "metadata": {}, + "outputs": [], + "source": [ + "# Define Bibi's function - calling Claude \n", + "\n", + "def call_claude():\n", + " messages = []\n", + " for gpt, claude_message, gemini in zip(gpt_messages, claude_messages, gemini_messages):\n", + " messages.append({\"role\": \"user\", \"content\": gpt})\n", + " messages.append({\"role\": \"user\", \"content\": gemini})\n", + " messages.append({\"role\": \"assistant\", \"content\": claude_message})\n", + " messages.append({\"role\": \"user\", \"content\": 
gpt_messages[-1]})\n", + " messages.append({\"role\": \"user\", \"content\": gemini_messages[-1]})\n", + " message = claude.messages.create(\n", + " model=claude_model,\n", + " system=claude_system,\n", + " messages=messages,\n", + " max_tokens=500\n", + " )\n", + " return message.content[0].text\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ffd44945-5912-4403-9068-70747d8f6708", + "metadata": {}, + "outputs": [], + "source": [ + "# Define Don's function - calling Gemini\n", + "\n", + "def call_gemini():\n", + " messages = []\n", + " for gpt, claude_message, gemini in zip(gpt_messages, claude_messages, gemini_messages):\n", + " messages.append({\"role\": \"user\", \"parts\": gpt})\n", + " messages.append({\"role\": \"user\", \"parts\": claude_message})\n", + " messages.append({\"role\": \"assistant\", \"parts\": gemini})\n", + " messages.append({\"role\": \"user\", \"parts\": gpt_messages[-1]})\n", + " messages.append({\"role\": \"user\", \"parts\": claude_messages[-1]})\n", + "\n", + " gemini = google.generativeai.GenerativeModel(\n", + " model_name='gemini-2.0-flash',\n", + " system_instruction=gemini_system\n", + " )\n", + " \n", + " response = gemini.generate_content(messages)\n", + " return response.text\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0275b97f-7f90-4696-bbf5-b6642bd53cbd", + "metadata": {}, + "outputs": [], + "source": [ + "# The Conversation - 5 rounds\n", + "\n", + "gpt_messages = [\"What the?!\"]\n", + "claude_messages = [\"What?\"]\n", + "gemini_messages = [\"I am so furious!\"]\n", + "\n", + "print(f\"Mas:\\n{gpt_messages[0]}\\n\")\n", + "print(f\"Bibi:\\n{claude_messages[0]}\\n\")\n", + "print(f\"Don:\\n{gemini_messages[0]}\\n\")\n", + "\n", + "for i in range(5):\n", + " gpt_next = call_gpt()\n", + " print(f\"Mas:\\n{gpt_next}\\n\")\n", + " gpt_messages.append(gpt_next)\n", + " \n", + " claude_next = call_claude()\n", + " print(f\"Bibi:\\n{claude_next}\\n\")\n", + " 
claude_messages.append(claude_next)\n", + "\n", + " gemini_next = call_gemini()\n", + " print(f\"Don:\\n{gemini_next}\\n\")\n", + " gemini_messages.append(gemini_next)\n" + ] + }, + { + "cell_type": "markdown", + "id": "73680403-3e56-4026-ac72-d12aa388537e", + "metadata": {}, + "source": [ + "# Claude is not that cooperative in roleplaying despite the explicit prompts, often breaking character. This is perhaps due to the sensitive topic." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8ecefd3-b3b9-470d-a98b-5a86f0dce038", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week3/community-contributions/Week3_Exercise_Data_Generator.ipynb b/week3/community-contributions/Week3_Exercise_Data_Generator.ipynb new file mode 100644 index 0000000..583010c --- /dev/null +++ b/week3/community-contributions/Week3_Exercise_Data_Generator.ipynb @@ -0,0 +1,551 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "GD5Omr5EfWgb" + }, + "source": [ + "# Data Generator\n", + "\n", + "Generate synthetic data given a schema, a business problem description, a model, the number of records, a file name, a file type, and an environment.\n", + "\n", + "# Available models\n", + " Model API:\n", + "\n", + " 1. gpt-4o-mini\n", + " 2. claude-3-haiku-20240307\n", + " 3. gemini-2.0-flash\n", + " 4. deepseek-chat\n", + "\n", + " HuggingFace API:\n", + "\n", + " 5. 
meta-llama/Meta-Llama-3.1-8B-Instruct\n", + "\n", + "\n", + "# Available environments\n", + "\n", + "Colab: set up the HF token and API keys in the Colab Secrets section\n", + "\n", + "Local: set up the HF token and API keys in a .env file\n", + "\n", + "\n", + "\n", + "### *** This project is developed based on the idea of 'week3/community-contributions/Week3-Dataset_Generator-DP'. Much appreciated! The project has been extended to run both on Colab and locally, and to integrate the HuggingFace API" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4FiCnE0MmU56" + }, + "outputs": [], + "source": [ + "!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124\n", + "!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0\n", + "!pip install anthropic python-dotenv pyarrow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JeyKw5guoH3r" + }, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from IPython.display import Markdown, display, update_display\n", + "from openai import OpenAI\n", + "from huggingface_hub import login\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig\n", + "from bs4 import BeautifulSoup\n", + "from typing import List\n", + "import google.generativeai\n", + "import anthropic\n", + "from itertools import chain\n", + "from dotenv import load_dotenv\n", + "import gradio as gr\n", + "import json\n", + "import pandas as pd\n", + "import random\n", + "import re\n", + "import subprocess\n", + "import pyarrow as pa\n", + "import torch\n", + "import gc" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7UyjFdRZoIAS" + }, + "outputs": [], + "source": [ + "# --- Schema Definition ---\n", + "SCHEMA = [\n", + " (\"Name\", \"TEXT\", '\"Northern Cafe\"'),\n", + 
" (\"Location\", \"TEXT\", '\"2904 S Figueroa St, Los Angeles, CA 90007\"'),\n", + " (\"Type\", \"TEXT\", 'One of [\"Chinese\",\"Mexican\",\"French\",\"Korean\",\"Italian\"] or other potential types'),\n", + " (\"Average Price\", \"TEXT\", '\"$30\", or \"--\" if unknown'),\n", + " (\"History/Age\", \"INT\", 'integer age of restaurant, e.g., 7'),\n", + " (\"Menu\", \"Array\", '[\"Beef Noodle\", \"Fried Rice\", \"Dumpling\", ...]'),\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jXcTQATLoICV" + }, + "outputs": [], + "source": [ + "# Default schema text for the textbox\n", + "DEFAULT_SCHEMA_TEXT = \"\\n\".join([f\"{i+1}. {col[0]} ({col[1]}) Example: {col[2]}\" for i, col in enumerate(SCHEMA)])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4Irf5JV3oIEe" + }, + "outputs": [], + "source": [ + "# Available models\n", + "MODELS = [\n", + " \"gpt-4o-mini\",\n", + " \"claude-3-haiku-20240307\",\n", + " \"gemini-2.0-flash\",\n", + " \"deepseek-chat\",\n", + " \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JJ6r2SH9oIGf" + }, + "outputs": [], + "source": [ + "# Available file formats\n", + "FILE_FORMATS = [\".csv\", \".tsv\", \".jsonl\", \".parquet\", \".arrow\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "B98j45E3vq5g" + }, + "outputs": [], + "source": [ + "system_prompt = \"\"\"You are a helpful assistant whose main purpose is to generate datasets for a given business problem based on a given schema.\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lsX16cWfwf6x" + }, + "outputs": [], + "source": [ + "def get_env_info(env):\n", + " try:\n", + " global hf_token, openai_api_key, anthropic_api_key, google_api_key, deepseek_api_key\n", + " if env == \"Colab\":\n", + " # Colab environment\n", + " from google.colab 
import drive\n", + " from google.colab import userdata\n", + " hf_token = userdata.get('HF_TOKEN')\n", + " openai_api_key = userdata.get('OPENAI_API_KEY')\n", + " anthropic_api_key = userdata.get('ANTHROPIC_API_KEY')\n", + " google_api_key = userdata.get('GOOGLE_API_KEY')\n", + " deepseek_api_key = userdata.get('DEEPSEEK_API_KEY')\n", + " elif env == \"Local\":\n", + " # Local environment\n", + " load_dotenv(override=True)\n", + " hf_token = os.getenv('HF_TOKEN')\n", + " openai_api_key = os.getenv('OPENAI_API_KEY')\n", + " anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n", + " google_api_key = os.getenv('GOOGLE_API_KEY')\n", + " deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')\n", + " except Exception as e:\n", + " raise Exception(f\"Please check your environment: {str(e)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2gLUFAwGv29Q" + }, + "outputs": [], + "source": [ + "def get_prompt(schema_text, business_problem, nr_records):\n", + " prompt = f\"\"\"\n", + " The problem is: {business_problem}\n", + "\n", + " Generate {nr_records} rows data in JSONL format, each line a JSON object with the following fields:\n", + "\n", + " {schema_text}\n", + "\n", + " Do NOT repeat column values from one row to another.\n", + "\n", + " Only output valid JSONL.\n", + " \"\"\"\n", + " return prompt.strip()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YZe1FVH8wf84" + }, + "outputs": [], + "source": [ + "# --- LLM Interface ---\n", + "def query(user_prompt, model):\n", + " try:\n", + " if \"gpt\" in model.lower():\n", + " client = OpenAI(api_key=openai_api_key)\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n", + " response = client.chat.completions.create(\n", + " model=model,\n", + " messages=messages,\n", + " temperature=0.7\n", + " )\n", + " content = response.choices[0].message.content\n", 
+ "\n", + " elif \"claude\" in model.lower():\n", + " client = anthropic.Anthropic(api_key=anthropic_api_key)\n", + " response = client.messages.create(\n", + " model=model,\n", + " messages=[{\"role\": \"user\", \"content\": user_prompt}],\n", + " max_tokens=4000,\n", + " temperature=0.7,\n", + " system=system_prompt\n", + " )\n", + " content = response.content[0].text\n", + " elif \"gemini\" in model.lower():\n", + " client = OpenAI(\n", + " api_key=google_api_key,\n", + " base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\"\n", + " )\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n", + " response = client.chat.completions.create(\n", + " model=model,\n", + " messages=messages,\n", + " temperature=0.7\n", + " )\n", + " content = response.choices[0].message.content\n", + "\n", + " elif \"deepseek\" in model.lower():\n", + " client = OpenAI(\n", + " api_key=deepseek_api_key,\n", + " base_url=\"https://api.deepseek.com\"\n", + " )\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n", + " response = client.chat.completions.create(\n", + " model=model,\n", + " messages=messages,\n", + " temperature=0.7\n", + " )\n", + " content = response.choices[0].message.content\n", + "\n", + " elif \"llama\" in model.lower():\n", + " global tokenizer, inputs, llama_model, outputs\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n", + "\n", + " login(hf_token, add_to_git_credential=True)\n", + " quant_config = BitsAndBytesConfig(\n", + " load_in_4bit=True,\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_compute_dtype=torch.bfloat16,\n", + " bnb_4bit_quant_type=\"nf4\"\n", + " )\n", + "\n", + " tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)\n", 
+ " tokenizer.pad_token = tokenizer.eos_token\n", + " inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n", + " if llama_model is None:\n", + " llama_model = AutoModelForCausalLM.from_pretrained(model, device_map=\"auto\", quantization_config=quant_config)\n", + " outputs = llama_model.generate(inputs, max_new_tokens=4000)\n", + "\n", + " _, _, after = tokenizer.decode(outputs[0]).partition(\"assistant<|end_header_id|>\")\n", + " content = after.strip()\n", + " else:\n", + " raise ValueError(f\"Unsupported model. Use one of {MODELS}\")\n", + "\n", + " # Parse JSONL output\n", + " lines = [line.strip() for line in content.strip().splitlines() if line.strip().startswith(\"{\")]\n", + " return [json.loads(line) for line in lines]\n", + "\n", + " except Exception as e:\n", + " raise Exception(f\"Model query failed: {str(e)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4WUj-XqM5IYT" + }, + "outputs": [], + "source": [ + "# --- Output Formatter ---\n", + "def save_dataset(records, file_format, filename):\n", + " df = pd.DataFrame(records)\n", + " if file_format == \".csv\":\n", + " df.to_csv(filename, index=False)\n", + " elif file_format == \".tsv\":\n", + " df.to_csv(filename, sep=\"\\t\", index=False)\n", + " elif file_format == \".jsonl\":\n", + " with open(filename, \"w\") as f:\n", + " for record in records:\n", + " f.write(json.dumps(record) + \"\\n\")\n", + " elif file_format == \".parquet\":\n", + " df.to_parquet(filename, engine=\"pyarrow\", index=False)\n", + " elif file_format == \".arrow\":\n", + " table = pa.Table.from_pandas(df)\n", + " with pa.OSFile(filename, \"wb\") as sink:\n", + " with pa.ipc.new_file(sink, table.schema) as writer:\n", + " writer.write(table)\n", + " else:\n", + " raise ValueError(\"Unsupported file format\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WenbNqrpwf-_" + }, + "outputs": [], + "source": [ + "# --- 
Main Generation Function ---\n", + "def generate_dataset(schema_text, business_problem, model, nr_records, file_format, save_as, env):\n", + " try:\n", + " # Validation\n", + " if nr_records <= 10:\n", + " return \"❌ Error: Number of records must be greater than 10.\", None\n", + " if nr_records > 1000:\n", + " return \"❌ Error: Number of records must be less than or equal to 1000.\", None\n", + "\n", + " if file_format not in FILE_FORMATS:\n", + " return \"❌ Error: Invalid file format.\", None\n", + "\n", + " # Fall back to a default filename when none was provided\n", + " if not save_as or save_as.strip() == \"\":\n", + " save_as = f\"default{file_format}\"\n", + " elif not save_as.endswith(file_format):\n", + " save_as = save_as + file_format\n", + "\n", + " # Load env\n", + " get_env_info(env)\n", + "\n", + " # Generate prompt\n", + " user_prompt = get_prompt(schema_text, business_problem, nr_records)\n", + "\n", + " # Query model\n", + " records = query(user_prompt, model)\n", + "\n", + " if not records:\n", + " return \"❌ Error: No valid records generated from the model.\", None\n", + "\n", + " # Save dataset\n", + " save_dataset(records, file_format, save_as)\n", + "\n", + " # Create preview\n", + " df = pd.DataFrame(records)\n", + " preview = df.head(10) # Show first 10 rows\n", + "\n", + " success_message = f\"βœ… Generated {len(records)} records successfully!\\nπŸ“ Saved to: {save_as}\\nπŸ“Š Preview of the first 10 rows is shown below.\"\n", + "\n", + " return success_message, preview\n", + "\n", + " except Exception as e:\n", + " return f\"❌ Error: {str(e)}\", None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pHiP8ky8wgEb" + }, + "outputs": [], + "source": [ + "# --- Gradio Interface ---\n", + "\n", + "with gr.Blocks(title=\"Dataset Generator\", theme=gr.themes.Citrus()) as interface:\n", + " hf_token = None\n", + " openai_api_key = None\n", + " anthropic_api_key = None\n", + " google_api_key = None\n", + " deepseek_api_key = None\n", + " tokenizer = None\n", + " inputs = None\n", + " llama_model = None\n", + " 
outputs = None\n", + "\n", + " gr.Markdown(\"# Dataset Generator\")\n", + " gr.Markdown(\"Generate synthetic datasets using AI models\")\n", + "\n", + " with gr.Row():\n", + " with gr.Column(scale=2):\n", + " schema_input = gr.Textbox(\n", + " label=\"Schema\",\n", + " value=DEFAULT_SCHEMA_TEXT,\n", + " lines=15,\n", + " placeholder=\"Define your dataset schema here... Please follow this format: Field_Name, Field_Type, Field Example\"\n", + " )\n", + "\n", + " business_problem_input = gr.Textbox(\n", + " label=\"Business Problem\",\n", + " value=\"I want to generate restaurant records\",\n", + " lines=1,\n", + " placeholder=\"Enter a business problem description for the model...\"\n", + " )\n", + "\n", + " with gr.Row():\n", + " model_dropdown = gr.Dropdown(\n", + " label=\"Model\",\n", + " choices=MODELS,\n", + " value=MODELS[0],\n", + " interactive=True\n", + " )\n", + "\n", + " nr_records_input = gr.Number(\n", + " label=\"Number of records\",\n", + " value=27,\n", + " minimum=11,\n", + " maximum=1000,\n", + " step=1\n", + " )\n", + "\n", + " with gr.Row():\n", + " save_as_input = gr.Textbox(\n", + " label=\"Save as\",\n", + " value=\"restaurant_dataset\",\n", + " placeholder=\"Enter filename (extension will be added automatically)\"\n", + " )\n", + "\n", + " file_format_dropdown = gr.Dropdown(\n", + " label=\"File format\",\n", + " choices=FILE_FORMATS,\n", + " value=FILE_FORMATS[0],\n", + " interactive=True\n", + " )\n", + "\n", + " env_dropdown = gr.Dropdown(\n", + " label=\"Environment\",\n", + " choices=[\"Colab\", \"Local\"],\n", + " value=\"Colab\",\n", + " interactive=True\n", + " )\n", + "\n", + "\n", + "\n", + " generate_btn = gr.Button(\"πŸš€ Generate\", variant=\"secondary\", size=\"lg\")\n", + "\n", + " with gr.Column(scale=1):\n", + " output_status = gr.Textbox(\n", + " label=\"Status\",\n", + " lines=4,\n", + " interactive=False\n", + " )\n", + "\n", + " output_preview = gr.Dataframe(\n", + " label=\"Preview (First 10 rows)\",\n", + 
interactive=False,\n", + " wrap=True\n", + " )\n", + "\n", + " # Connect the generate button\n", + " generate_btn.click(\n", + " fn=generate_dataset,\n", + " inputs=[\n", + " schema_input,\n", + " business_problem_input,\n", + " model_dropdown,\n", + " nr_records_input,\n", + " file_format_dropdown,\n", + " save_as_input,\n", + " env_dropdown\n", + " ],\n", + " outputs=[output_status, output_preview]\n", + " )\n", + "\n", + " gr.Markdown(\"\"\"\n", + " ### πŸ“ Instructions:\n", + " 1. **Schema**: Define the structure of your dataset (pre-filled with a restaurant schema)\n", + " 2. **Business problem**: User prompt to guide the AI model\n", + " 3. **Model**: Choose between GPT, Claude, Gemini, DeepSeek, or Llama models\n", + " 4. **Number of records**: Number of records to generate (minimum 11)\n", + " 5. **File format**: Choose output format (.csv, .tsv, .jsonl, .parquet, .arrow)\n", + " 6. **Save as**: Filename (extension added automatically)\n", + " 7. Click **Generate** to create your dataset\n", + "\n", + " ### πŸ”§ Requirements:\n", + " - For local mode, set up the HF token and API keys in a `.env` file (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `DEEPSEEK_API_KEY`, `HF_TOKEN`)\n", + " - For Colab mode, set up the HF token and API keys in the Colab Secrets section (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `DEEPSEEK_API_KEY`, `HF_TOKEN`)\n", + " \"\"\")\n", + "\n", + "interface.launch(debug=True)\n", + "\n", + "del tokenizer, inputs, llama_model, outputs\n", + "gc.collect()\n", + "torch.cuda.empty_cache()" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + 
"version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/week3/community-contributions/Week_3_Day_5_Meeting_Minutes_product_with_Gradio.ipynb b/week3/community-contributions/Week_3_Day_5_Meeting_Minutes_product_with_Gradio.ipynb new file mode 100644 index 0000000..3428e62 --- /dev/null +++ b/week3/community-contributions/Week_3_Day_5_Meeting_Minutes_product_with_Gradio.ipynb @@ -0,0 +1,523 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "It89APiAtTUF" + }, + "source": [ + "# Create meeting minutes from an Audio file\n", + "\n", + "I downloaded some Denver City Council meeting minutes and selected a portion of the meeting for us to transcribe. You can download it here: \n", + "https://drive.google.com/file/d/1N_kpSojRR5RYzupz6nqM8hMSoEF_R7pU/view?usp=sharing\n", + "\n", + "If you'd rather work with the original data, the HuggingFace dataset is [here](https://huggingface.co/datasets/huuuyeah/meetingbank) and the audio can be downloaded [here](https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio/tree/main).\n", + "\n", + "The goal of this product is to use the Audio to generate meeting minutes, including actions.\n", + "\n", + "For this project, you can either use the Denver meeting minutes, or you can record something of your own!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sJPSCwPX3MOV" + }, + "source": [ + "## Again - please note: 2 important pro-tips for using Colab:\n", + "\n", + "**Pro-tip 1:**\n", + "\n", + "The top of every colab has some pip installs. 
You may receive errors from pip when you run this, such as:\n", + "\n", + "> gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.\n", + "\n", + "These pip compatibility errors can be safely ignored; and while it's tempting to try to fix them by changing version numbers, that will actually introduce real problems!\n", + "\n", + "**Pro-tip 2:**\n", + "\n", + "In the middle of running a Colab, you might get an error like this:\n", + "\n", + "> Runtime error: CUDA is required but not available for bitsandbytes. Please consider installing [...]\n", + "\n", + "This is a super-misleading error message! Please don't try changing versions of packages...\n", + "\n", + "This actually happens because Google has switched out your Colab runtime, perhaps because Google Colab was too busy. The solution is:\n", + "\n", + "1. Kernel menu >> Disconnect and delete runtime\n", + "2. Reload the colab from fresh and Edit menu >> Clear All Outputs\n", + "3. Connect to a new T4 using the button at the top right\n", + "4. Select \"View resources\" from the menu on the top right to confirm you have a GPU\n", + "5. Rerun the cells in the colab, from the top down, starting with the pip installs\n", + "\n", + "And all should work great - otherwise, ask me!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "f2vvgnFpHpID" + }, + "outputs": [], + "source": [ + "!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124\n", + "!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0 openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FW8nl3XRFrz0" + }, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from IPython.display import Markdown, display, update_display\n", + "from openai import OpenAI\n", + "from google.colab import drive\n", + "from huggingface_hub import login\n", + "from google.colab import userdata\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig\n", + "import torch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q3D1_T0uG_Qh" + }, + "outputs": [], + "source": [ + "# Constants\n", + "\n", + "AUDIO_MODEL = \"whisper-1\"\n", + "LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Es9GkQ0FGCMt" + }, + "outputs": [], + "source": [ + "# New capability - connect this Colab to my Google Drive\n", + "# See immediately below this for instructions to obtain denver_extract.mp3\n", + "\n", + "drive.mount(\"/content/drive\")\n", + "audio_filename = \"/content/drive/MyDrive/llms/denver_extract.mp3\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HTl3mcjyzIEE" + }, + "source": [ + "# Download denver_extract.mp3\n", + "\n", + "You can either use the same file as me, the extract from Denver city council minutes, or you can try your own..\n", + "\n", + "If you want to use the same as me, then please download my extract here, and put this on your Google Drive: \n", + 
"https://drive.google.com/file/d/1N_kpSojRR5RYzupz6nqM8hMSoEF_R7pU/view?usp=sharing\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xYW8kQYtF-3L" + }, + "outputs": [], + "source": [ + "# Sign in to HuggingFace Hub\n", + "\n", + "hf_token = userdata.get('HF_TOKEN')\n", + "login(hf_token, add_to_git_credential=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qP6OB2OeGC2C" + }, + "outputs": [], + "source": [ + "# Sign in to OpenAI using Secrets in Colab\n", + "\n", + "openai_api_key = userdata.get('OPENAI_API_KEY')\n", + "openai = OpenAI(api_key=openai_api_key)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GMShdVGlGGr4" + }, + "outputs": [], + "source": [ + "# Use the Whisper OpenAI model to convert the Audio to Text\n", + "# If you'd prefer to use an Open Source model, class student Youssef has contributed an open source version\n", + "# which I've added to the bottom of this colab\n", + "\n", + "audio_file = open(audio_filename, \"rb\")\n", + "transcription = openai.audio.transcriptions.create(model=AUDIO_MODEL, file=audio_file, response_format=\"text\")\n", + "print(transcription)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "piEMmcSfMH-O" + }, + "outputs": [], + "source": [ + "system_message = \"You are an assistant that produces minutes of meetings from transcripts, with summary, key discussion points, takeaways and action items with owners, in markdown.\"\n", + "user_prompt = f\"Below is an extract transcript of a Denver council meeting. 
Please write minutes in markdown, including a summary with attendees, location and date; discussion points; takeaways; and action items with owners.\\n{transcription}\"\n", + "\n", + "messages = [\n", + " {\"role\": \"system\", \"content\": system_message},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UcRKUgcxMew6" + }, + "outputs": [], + "source": [ + "quant_config = BitsAndBytesConfig(\n", + " load_in_4bit=True,\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_compute_dtype=torch.bfloat16,\n", + " bnb_4bit_quant_type=\"nf4\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6CujZRAgMimy" + }, + "outputs": [], + "source": [ + "tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n", + "tokenizer.pad_token = tokenizer.eos_token\n", + "# inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n", + "streamer = TextStreamer(tokenizer)\n", + "model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map=\"auto\", quantization_config=quant_config, trust_remote_code=True)\n", + "# outputs = model.generate(inputs, max_new_tokens=2000, streamer=streamer)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MaLNmJ5PSqcH" + }, + "outputs": [], + "source": [ + "inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n", + "outputs = model.generate(inputs, max_new_tokens=2000, streamer=streamer)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "102tdU_3Peam" + }, + "outputs": [], + "source": [ + "response = tokenizer.decode(outputs[0])\n", + "response" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KlomN6CwMdoN" + }, + "outputs": [], + "source": [ + "display(Markdown(response))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"0jZElVOMSPAr" + }, + "source": [ + "Day 5 exercise - Gradio UI for meeting minutes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5iiYYxQMHf0i" + }, + "outputs": [], + "source": [ + "import gradio as gr\n", + "import tempfile\n", + "import soundfile as sf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aGwXW7BjPcTM" + }, + "outputs": [], + "source": [ + "# !pip install pydub\n", + "# !apt-get install ffmpeg\n", + "\n", + "from pydub import AudioSegment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RNu-reHuCYj_" + }, + "outputs": [], + "source": [ + "# Make sure that the tokenizer and model have already been created\n", + "\n", + "# tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n", + "# tokenizer.pad_token = tokenizer.eos_token\n", + "# streamer = TextStreamer(tokenizer)\n", + "# model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map=\"auto\", quantization_config=quant_config)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KOuoH0YOPruE" + }, + "outputs": [], + "source": [ + "# def save_as_mp3(audio_np):\n", + "# sr, data = audio_np\n", + "# # Convert float32 or int16 to PCM wav and then mp3\n", + "# wav_path = tempfile.NamedTemporaryFile(suffix=\".wav\", delete=False).name\n", + "# mp3_path = tempfile.NamedTemporaryFile(suffix=\".mp3\", delete=False).name\n", + "\n", + "# sf.write(wav_path, data, sr)\n", + "# audio_segment = AudioSegment.from_wav(wav_path)\n", + "# audio_segment.export(mp3_path, format=\"mp3\", bitrate=\"64k\") # Low bitrate = small file\n", + "# return mp3_path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "toBIPBJoSNw0" + }, + "outputs": [], + "source": [ + "# Takes the uploaded audio file and returns the generated meeting minutes in markdown\n", + "def speak_send(audio_np):\n", + "\n", + " # If using numpy as input: audio_input = 
gr.Audio(sources=\"upload\", type=\"numpy\", label=\"Upload audio file to generate meeting minutes\")\n", + " # mp3_path = save_as_mp3(audio_np)\n", + "\n", + " # with open(mp3_path, \"rb\") as audio_file:\n", + " # transcription = openai.audio.transcriptions.create(\n", + " # model=AUDIO_MODEL,\n", + " # file=audio_file,\n", + " # response_format=\"text\"\n", + " # )\n", + "\n", + " audio = AudioSegment.from_file(audio_np)\n", + " with tempfile.NamedTemporaryFile(suffix=\".mp3\", delete=False) as tmpfile:\n", + " audio.export(tmpfile.name, format=\"mp3\")\n", + " with open(tmpfile.name, \"rb\") as file:\n", + " transcript = openai.audio.transcriptions.create(\n", + " model=AUDIO_MODEL,\n", + " file=file,\n", + " response_format=\"text\"\n", + " )\n", + "\n", + " system_message = \"You are an assistant that produces minutes of meetings from transcripts, with summary, key discussion points, takeaways and action items with owners, in markdown.\"\n", + " user_prompt = f\"Below is an extract transcript of a Denver council meeting. 
Please write minutes in markdown, including a summary with attendees, location and date; discussion points; takeaways; and action items with owners.\\n{transcript}\"\n", + "\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": system_message},\n", + " {\"role\": \"user\", \"content\": user_prompt}\n", + " ]\n", + "\n", + " inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n", + " outputs = model.generate(inputs, max_new_tokens=2000)\n", + "\n", + " _, _, after = tokenizer.decode(outputs[0]).partition(\"assistant<|end_header_id|>\")\n", + " return after.strip()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xXJfabpDSN5R" + }, + "outputs": [], + "source": [ + "with gr.Blocks() as demo:\n", + "\n", + " with gr.Row():\n", + " audio_input = gr.Audio(sources=\"upload\", type=\"filepath\", label=\"Upload audio file to generate meeting minutes\")\n", + " with gr.Row():\n", + " audio_submit = gr.Button(\"Send\")\n", + " with gr.Row():\n", + " outputs = [gr.Markdown(label=\"Meeting minutes:\")]\n", + "\n", + " audio_submit.click(speak_send, inputs=audio_input, outputs=outputs)\n", + "\n", + "demo.launch(debug=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kuxYecT2QDQ9" + }, + "source": [ + "# Student contribution\n", + "\n", + "Student Emad S. has made this powerful variation that uses `TextIteratorStreamer` to stream back results into a Gradio UI, and takes advantage of background threads for performance! I'm sharing it here if you'd like to take a look at some very interesting work. 
Thank you, Emad!\n", + "\n", + "https://colab.research.google.com/drive/1Ja5zyniyJo5y8s1LKeCTSkB2xyDPOt6D" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AU3uAEyU3a-o" + }, + "source": [ + "## Alternative implementation\n", + "\n", + "Class student Youssef has contributed this variation in which we use an open-source model to transcribe the meeting Audio.\n", + "\n", + "Thank you Youssef!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "phYYgAbBRvu5" + }, + "outputs": [], + "source": [ + "import torch\n", + "from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HdQnWEzW3lzP" + }, + "outputs": [], + "source": [ + "AUDIO_MODEL = \"openai/whisper-medium\"\n", + "speech_model = AutoModelForSpeechSeq2Seq.from_pretrained(AUDIO_MODEL, torch_dtype=torch.float16, low_cpu_mem_usage=True, use_safetensors=True)\n", + "speech_model.to('cuda')\n", + "processor = AutoProcessor.from_pretrained(AUDIO_MODEL)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZhA_fbeCSAeZ" + }, + "outputs": [], + "source": [ + "pipe = pipeline(\n", + " \"automatic-speech-recognition\",\n", + " model=speech_model,\n", + " tokenizer=processor.tokenizer,\n", + " feature_extractor=processor.feature_extractor,\n", + " torch_dtype=torch.float16,\n", + " device='cuda',\n", + " return_timestamps=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nrQjKtD53omJ" + }, + "outputs": [], + "source": [ + "# Use the Whisper OpenAI model to convert the Audio to Text\n", + "result = pipe(audio_filename)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "G_XSljOY3tDf" + }, + "outputs": [], + "source": [ + "transcription = result[\"text\"]\n", + "print(transcription)" + ] + } + ], + "metadata": { + "accelerator": "GPU", + 
"colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/week3/community-contributions/llm.wk3synthetic-data-creator.ipynb b/week3/community-contributions/llm.wk3synthetic-data-creator.ipynb new file mode 100644 index 0000000..f026965 --- /dev/null +++ b/week3/community-contributions/llm.wk3synthetic-data-creator.ipynb @@ -0,0 +1,295 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- This creates dummy/test data from a use case provided by the user.\n", + "- The use case can be as simple or complex as the user wants (I've tested both and the results are good).\n", + "- I've used a Phi-3 model as I'm having issues with Llama access on Hugging Face." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "s7ERjTCEKSi_" + }, + "outputs": [], + "source": [ + "!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate openai httpx==0.27.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GG5VMcmhcA2N" + }, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from openai import OpenAI\n", + "import gradio as gr\n", + "from IPython.display import Markdown, display, update_display\n", + "from huggingface_hub import login\n", + "from google.colab import userdata\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig\n", + "import torch\n", + "import json\n", + "import re\n", + "import pandas as pd\n", + "import io" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UfL-2XNicpEB" + }, + "outputs": [], + "source": [ + "# constants\n", + "\n", + "OPENAI = 'gpt-4o-mini'\n", + "PHI3 = \"microsoft/Phi-3-mini-4k-instruct\"\n", + "\n", + "limit = 100\n", + "max_tokens = 1000\n", + "temperature = 0.3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZQ0dcQ6hdTPo" + }, + "outputs": [], + "source": [ + "# keys\n", + "\n", + "openai_api_key = userdata.get('OPENAI_API_KEY')\n", + "openai = OpenAI(api_key=openai_api_key)\n", + "\n", + "hf_token = userdata.get('HF_TOKEN')\n", + "login(hf_token, add_to_git_credential=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2eHsLdYgd2d_" + }, + "outputs": [], + "source": [ + "system_prompt = f\"\"\"You create synthetic datasets for testing purposes. Based on the use case description, generate a CSV dataset with appropriate columns and a maximum of {limit} rows\n", + "of realistic data.\n", + "\n", + "IMPORTANT RULES:\n", + "1. 
Return ONLY the CSV data with headers and ensure there are no duplicate headers\n", + "2. No explanatory text before or after\n", + "3. No markdown formatting or code fences\n", + "4. No quotation marks around the entire response\n", + "5. Start directly with the column headers\n", + "\n", + "Format: column1 (e.g. customer_id),column2 (e.g. country),column3 (e.g. age)\n", + "row1data,row1data,row1data\n", + "row2data,row2data,row2data\"\"\"\n", + "\n", + "def data_user_prompt(usecase):\n", + "    user_prompt = \"Create a synthetic dataset for the use case provided below: \"\n", + "    user_prompt += usecase\n", + "    user_prompt += f\" Respond in csv with appropriate headers. Do not include any other explanatory text, markdown formatting or code fences, or quotation marks around the entire response. \\\n", + "    Limit the rows in the dataset to {limit}.\"\n", + "    return user_prompt\n", + "\n", + "# Build messages inside a function so each call uses the caller's use case\n", + "# (a module-level list here would raise a NameError, since usecase is not defined yet)\n", + "def messages_for(usecase):\n", + "    return [\n", + "        {\"role\":\"system\",\"content\":system_prompt},\n", + "        {\"role\":\"user\",\"content\":data_user_prompt(usecase)}\n", + "    ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "necoAEc1gNPF" + }, + "outputs": [], + "source": [ + "def dataset_call(usecase):\n", + "\n", + "    #quantisation\n", + "    quant_config = BitsAndBytesConfig(\n", + "        load_in_4bit=True,\n", + "        bnb_4bit_use_double_quant=True,\n", + "        bnb_4bit_quant_type=\"nf4\",\n", + "        bnb_4bit_compute_dtype=torch.bfloat16\n", + "    )\n", + "\n", + "    #tokenization\n", + "    tokenizer = AutoTokenizer.from_pretrained(PHI3)\n", + "    tokenizer.pad_token = tokenizer.eos_token\n", + "\n", + "    #model\n", + "    model = AutoModelForCausalLM.from_pretrained(PHI3, quantization_config=quant_config, device_map=\"auto\")\n", + "\n", + "    #inputs & outputs\n", + "    inputs = tokenizer.apply_chat_template(messages_for(usecase), tokenize=False, add_generation_prompt=True)\n", + "    model_inputs = tokenizer(inputs, return_tensors=\"pt\").to(model.device)\n", + "    #streamer = TextStreamer(tokenizer, skip_prompt=True, 
skip_special_tokens=True)\n", + "\n", + "    with torch.no_grad():\n", + "        outputs = model.generate(**model_inputs, max_new_tokens=max_tokens, do_sample=True, temperature=temperature)\n", + "\n", + "    response = tokenizer.decode(outputs[0][len(model_inputs['input_ids'][0]):], skip_special_tokens=True)\n", + "    return response.strip()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "g8zEBraI0grT" + }, + "outputs": [], + "source": [ + "# convert csv string into a pandas DataFrame\n", + "\n", + "def csv_handler(csv_string):\n", + "\n", + "    try:\n", + "        # Convert CSV string to DataFrame\n", + "        df = pd.read_csv(io.StringIO(csv_string))\n", + "        return df\n", + "    except Exception as e:\n", + "        # Return error message as DataFrame if parsing fails\n", + "        error_df = pd.DataFrame({\"Error\": [f\"Failed to parse CSV: {str(e)}\"]})\n", + "        return error_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vLPsusTL1zNB" + }, + "outputs": [], + "source": [ + "# usecase to csv_string\n", + "\n", + "def usecase_to_csv(usecase):\n", + "    try:\n", + "        # Get CSV string from your LLM\n", + "        csv_string = dataset_call(usecase)\n", + "\n", + "        # Process into DataFrame for Gradio display\n", + "        df = csv_handler(csv_string)\n", + "\n", + "        return df\n", + "\n", + "    except Exception as e:\n", + "        # Return a single DataFrame so the output matches the Gradio interface\n", + "        error_df = pd.DataFrame({\"Error\": [f\"LLM processing failed: {str(e)}\"]})\n", + "        return error_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H3WTLa9a2Rdy" + }, + "outputs": [], + "source": [ + "def download_csv(csv_string):\n", + "    if csv_string:\n", + "        return csv_string\n", + "    return \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XhMVSrVhjYvz" + }, + "outputs": [], + "source": [ + "#test\n", + "usecase = \"A 
financial services company is looking for synthetic data to test its Expected Credit Losses (ECL) model under IFRS9.\"\n", + "#dataset_call(usecase)\n", + "usecase_to_csv(usecase)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z3Ze4o2qjs5y" + }, + "outputs": [], + "source": [ + "\n", + "demo = gr.Interface(\n", + " fn = usecase_to_csv,\n", + " inputs = gr.Textbox(lines=5,label=\"Describe your usecase\",placeholder=\"Describe the dataset you would like to create and how you will use it\"),\n", + " outputs = gr.DataFrame(label=\"Here is your dataset!\",interactive=True),\n", + " title = \"Friendly Neighbourhood Synthetic Data Creator!\",\n", + " description = \"Let me know your use case for synthetic data and I will create it for you.\",\n", + " examples=[\n", + " \"Generate a dataset of 10 employees with name, department, salary, and years of experience\",\n", + " \"Create sample e-commerce data with product names, categories, prices, and ratings\",\n", + " \"Generate customer survey responses with demographics and satisfaction scores\",\n", + " \"A financial services company is looking for synthetic data to test its Expected Credit Losses (ECL) model under IFRS9.\"\n", + " ]\n", + ")\n", + "\n", + "demo.launch(debug=True)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ck1qdmbHo_G3" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "authorship_tag": "ABX9TyOay+EACzwO0uXDLuayhscX", + "gpuType": "L4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff 
--git a/week3/community-contributions/llm_wk3d5_minutecreator.ipynb b/week3/community-contributions/llm_wk3d5_minutecreator.ipynb new file mode 100644 index 0000000..ce767ed --- /dev/null +++ b/week3/community-contributions/llm_wk3d5_minutecreator.ipynb @@ -0,0 +1,287 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "zmpDFA3bGEHY" + }, + "source": [ + "Minute creator in Gradio from day 5 of week 3.\n", + "A couple of points to note:\n", + "\n", + "\n", + "* My access to Llama hasn't been approved on Hugging Face and so I've experimented with some of the other models.\n", + "* There is a fair bit of debugging code in the main function as I was getting an error and couldn't find it. I've left it in just in case it's useful for others trying to debug their code.\n", + "* I was debugging with the help of Claude. It suggested using `torch.no_grad()` for the minute output. The rationale is that it disables gradient computation, which isn't necessary for inference, and I found it did speed things up.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "l-5xKLFeJUGz" + }, + "outputs": [], + "source": [ + "!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate openai httpx==0.27.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Wi-bBD9VdBMo" + }, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from openai import OpenAI\n", + "from IPython.display import Markdown, display, update_display\n", + "from google.colab import drive\n", + "from huggingface_hub import login\n", + "from google.colab import userdata\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig\n", + "import torch\n", + "import gradio as gr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-0O-kuWtdk4I" + }, + "outputs": [], + "source": [ + "# keys\n", + "\n", + "#openai\n", + 
"openai_api_key = userdata.get('OPENAI_API_KEY')\n", + "openai = OpenAI(api_key=openai_api_key)\n", + "\n", + "#hf\n", + "hf_token = userdata.get('HF_TOKEN')\n", + "login(hf_token, add_to_git_credential=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u6v3Ecileg1H" + }, + "outputs": [], + "source": [ + "# constants\n", + "\n", + "AUDIO_MODEL = 'gpt-4o-transcribe'\n", + "OPENAI_MODEL = 'gpt-4o-mini'\n", + "QWEN2_MODEL = 'Qwen/Qwen2.5-7B-Instruct' # runs slowly no matter what size gpu - kept crashing on ram!\n", + "GEMMA2_MODEL = \"google/gemma-2-2b-it\" # doesn't use a system prompt\n", + "PHI3 = \"microsoft/Phi-3-mini-4k-instruct\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3nSfA_KhfY38" + }, + "outputs": [], + "source": [ + "# convert audio to text\n", + "\n", + "def transcribe_audio(audio_file_path):\n", + " try:\n", + " with open (audio_file_path, 'rb') as audio_file:\n", + " transcript = openai.audio.transcriptions.create(model = AUDIO_MODEL, file = audio_file, response_format=\"text\")\n", + " return transcript\n", + " except Exception as e:\n", + " return f\"An error occurred: {str(e)}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OVmlY3DGgnYc" + }, + "outputs": [], + "source": [ + "# use transcript to create minutes\n", + "# use open source model\n", + "\n", + "def create_minutes(transcript):\n", + "\n", + " # first try is for debugging\n", + " try:\n", + " print(f\"Starting to create minutes with transcript length: {len(str(transcript))}\")\n", + "\n", + " if not transcript or len(str(transcript).strip()) == 0:\n", + " return \"Error: Empty or invalid transcript\"\n", + "\n", + " #messages\n", + " system_prompt = \"You are an expert creator of meeting minutes. Based on a meeting transcript you can summarise the meeting title and date, attendees, key discussion points, key outcomes, actions and owners and next steps. 
Respond in Markdown.\"\n", + "        user_prompt = f\"Create meeting minutes from the transcript provided. The minutes should be clear but succinct and should include title and date, attendees, key discussion points, key outcomes, actions and owners, and next steps. {transcript}\"\n", + "\n", + "        messages = [\n", + "            {\"role\":\"system\",\"content\":system_prompt},\n", + "            {\"role\":\"user\",\"content\":user_prompt}\n", + "        ]\n", + "        print(\"Messages prepared successfully\") # for debugging\n", + "\n", + "        # quantisation (for os model)\n", + "\n", + "        quantization_config = BitsAndBytesConfig(\n", + "            load_in_4bit=True,\n", + "            bnb_4bit_use_double_quant=True,\n", + "            bnb_4bit_quant_type=\"nf4\",\n", + "            bnb_4bit_compute_dtype=torch.bfloat16\n", + "        )\n", + "\n", + "    except Exception as e:\n", + "        return f\"An error occurred in setup: {str(e)}\"\n", + "\n", + "    # model & tokeniser\n", + "    try:\n", + "        print(\"Loading tokeniser....\") # for debugging\n", + "        tokenizer = AutoTokenizer.from_pretrained(PHI3)\n", + "        tokenizer.pad_token = tokenizer.eos_token\n", + "\n", + "        print(\"Loading model.....\") # for debugging\n", + "        model = AutoModelForCausalLM.from_pretrained(PHI3, device_map='auto', quantization_config=quantization_config)\n", + "        print(f\"Model loaded on device {model.device}\") # for debugging\n", + "\n", + "        # chat template\n", + "        inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n", + "        model_inputs = tokenizer(inputs, return_tensors=\"pt\").to(model.device)\n", + "\n", + "        # torch.no_grad suggested by Claude. This disables gradient computation, which reduces memory usage and speeds things up\n", + "        print(\"Generating text....\") # for debugging\n", + "        with torch.no_grad():\n", + "            outputs = model.generate(**model_inputs, max_new_tokens=2000, do_sample=True, temperature=0.7)\n", + "        print(f\"Generation complete. 
Output shape: {outputs.shape}\") # for debugging\n", + "\n", + " #***debugging****\n", + "\n", + " # Decode the generated text (excluding the input prompt)\n", + " print(\"Starting text decoding...\") # debugging\n", + " input_length = len(model_inputs['input_ids'][0]) # debugging\n", + " print(f\"Input length: {input_length}, Output length: {len(outputs[0])}\") # debugging\n", + "\n", + " if len(outputs[0]) <= input_length: # debugging\n", + " return \"Error: Model didn't generate any new tokens. Try reducing input length or increasing max_new_tokens.\" # debugging\n", + "\n", + " generated_tokens = outputs[0][input_length:] # debugging\n", + " print(f\"Generated tokens length: {len(generated_tokens)}\") # debugging\n", + "\n", + " # decode generated text\n", + " generated_text = tokenizer.decode(outputs[0][len(model_inputs['input_ids'][0]):],skip_special_tokens=True)\n", + " print(f\"Decoded text length: {len(generated_text)}\")\n", + "\n", + " return generated_text.strip()\n", + "\n", + " except ImportError as e:\n", + " return f\"Import error - missing library: {str(e)}. Please install required packages.\"\n", + " except torch.cuda.OutOfMemoryError as e:\n", + " return f\"CUDA out of memory: {str(e)}. Try reducing max_new_tokens to 500 or use CPU.\"\n", + " except RuntimeError as e:\n", + " return f\"Runtime error: {str(e)}. 
This might be a CUDA/device issue.\"\n", + " except Exception as e:\n", + " return f\"Unexpected error during text generation: {type(e).__name__}: {str(e)}\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c63zzoDopw6u" + }, + "outputs": [], + "source": [ + "# create process for gradio\n", + "\n", + "def gr_process(audio_file, progress = gr.Progress()):\n", + "\n", + " if audio_file is None:\n", + " return \"Please provide an audio file\"\n", + "\n", + " try:\n", + " progress(0, desc=\"Analysing file\")\n", + " transcript = transcribe_audio(audio_file)\n", + "\n", + " if transcript.startswith(\"An error occurred\"):\n", + " return transcript\n", + "\n", + " progress(0.5, desc=\"File analysed, generating minutes\")\n", + "\n", + " minutes = create_minutes(transcript)\n", + " progress(0.9, desc=\"Nearly there\")\n", + "\n", + " return minutes\n", + "\n", + " except Exception as e:\n", + " return f\"An error occurred: {str(e)}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "82fyQELQkGty" + }, + "outputs": [], + "source": [ + "# gradio interface\n", + "\n", + "demo = gr.Interface(\n", + " fn=gr_process,\n", + " inputs= gr.Audio(type=\"filepath\",label=\"Upload MP3 file\"),\n", + " outputs= gr.Markdown(label=\"Meeting minutes\"),\n", + " title = \"Meeting minute creator\",\n", + " description = \"Upload an mp3 audio file for a meeting and I will provide the minutes!\"\n", + ")\n", + "\n", + "if __name__ == \"__main__\":\n", + " demo.launch(debug=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XljpyS7Nvxkh" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + 
"file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/week4/community-contributions/Week4_Exercise_convert_between_thirteen_lang_coment_unit_test.ipynb b/week4/community-contributions/Week4_Exercise_convert_between_thirteen_lang_coment_unit_test.ipynb new file mode 100644 index 0000000..a99930c --- /dev/null +++ b/week4/community-contributions/Week4_Exercise_convert_between_thirteen_lang_coment_unit_test.ipynb @@ -0,0 +1,841 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4a6ab9a2-28a2-445d-8512-a0dc8d1b54e9", + "metadata": {}, + "source": [ + "# Power Coder\n", + "\n", + "1. Convert code between two programming languages; supported languages are Python, Java, JavaScript, TypeScript, C, C++, C#, Go, Rust, Kotlin, Swift, PHP, Julia\n", + "2. Automatically add docstrings/comments based on the selected comment style\n", + "3. Automatically generate unit tests based on the selected unit test style\n", + "4. 
Supported models: gpt-4o, claude-3-5-sonnet-20240620, gemini-2.5-flash" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e610bf56-a46e-4aff-8de1-ab49d62b1ad3", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import io\n", + "import sys\n", + "import json\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import google.generativeai\n", + "import anthropic\n", + "from IPython.display import Markdown, display, update_display\n", + "import gradio as gr\n", + "import subprocess" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f672e1c-87e9-4865-b760-370fa605e614", + "metadata": {}, + "outputs": [], + "source": [ + "# environment\n", + "\n", + "load_dotenv(override=True)\n", + "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n", + "os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')\n", + "os.environ['GOOGLE_API_KEY'] = os.getenv('GOOGLE_API_KEY', 'your-key-if-not-using-env')\n", + "os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8aa149ed-9298-4d69-8fe2-8f5de0f667da", + "metadata": {}, + "outputs": [], + "source": [ + "# initialize\n", + "\n", + "openai = OpenAI()\n", + "claude = anthropic.Anthropic()\n", + "gemini_via_openai_client = OpenAI(\n", + "    api_key=os.environ['GOOGLE_API_KEY'], \n", + "    base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\"\n", + ")\n", + "OPENAI_MODEL = \"gpt-4o\"\n", + "CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\"\n", + "GEMINI_MODEL = \"gemini-2.5-flash\"" + ] + }, + { + "cell_type": "markdown", + "id": "37b204dd-f770-41d9-9b19-7e1baa5273cd", + "metadata": {}, + "source": [ + "## 1. 
Conversion Part" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6896636f-923e-4a2c-9d6c-fac07828a201", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_system_prompt_for(in_lang, out_lang):\n", + "    convert_system_message = f\"You are an assistant that reimplements {in_lang} code in high-performance {out_lang}. \"\n", + "    convert_system_message += f\"Respond only with {out_lang} code; use comments sparingly and do not provide any explanation other than occasional comments. \"\n", + "    convert_system_message += f\"The {out_lang} response needs to produce an identical output in the fastest possible time. Keep implementations of random number generators identical so that results match exactly.\"\n", + "    return convert_system_message" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e7b3546-57aa-4c29-bc5d-f211970d04eb", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_user_prompt_for(in_lang, out_lang, input_instruct, in_code):\n", + "    convert_user_prompt = f\"Rewrite this {in_lang} code in {out_lang} with the fastest possible implementation that produces identical output in the least time. \"\n", + "    convert_user_prompt += f\"Respond only with {out_lang} code; do not explain your work other than a few comments. \"\n", + "    convert_user_prompt += f\"Pay attention to number types to ensure no int overflows. 
Remember to include all necessary {out_lang} packages or modules, for example, iomanip for C++.\\n\\n\"\n", + "    if input_instruct:\n", + "        convert_user_prompt += \"Additional instruction is: \" + input_instruct + \"\\n\\n\"\n", + "    convert_user_prompt += in_code\n", + "    return convert_user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6190659-f54c-4951-bef4-4960f8e51cc4", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_messages_for(in_lang, out_lang, input_instruct, in_code):\n", + "    return [\n", + "        {\"role\": \"system\", \"content\": convert_system_prompt_for(in_lang, out_lang)},\n", + "        {\"role\": \"user\", \"content\": convert_user_prompt_for(in_lang, out_lang, input_instruct, in_code)}\n", + "    ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3b497b3-f569-420e-b92e-fb0f49957ce0", + "metadata": {}, + "outputs": [], + "source": [ + "python_hard = \"\"\"# Be careful to support large number sizes\n", + "\n", + "def lcg(seed, a=1664525, c=1013904223, m=2**32):\n", + "    value = seed\n", + "    while True:\n", + "        value = (a * value + c) % m\n", + "        yield value\n", + "        \n", + "def max_subarray_sum(n, seed, min_val, max_val):\n", + "    lcg_gen = lcg(seed)\n", + "    random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n", + "    max_sum = float('-inf')\n", + "    for i in range(n):\n", + "        current_sum = 0\n", + "        for j in range(i, n):\n", + "            current_sum += random_numbers[j]\n", + "            if current_sum > max_sum:\n", + "                max_sum = current_sum\n", + "    return max_sum\n", + "\n", + "def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n", + "    total_sum = 0\n", + "    lcg_gen = lcg(initial_seed)\n", + "    for _ in range(20):\n", + "        seed = next(lcg_gen)\n", + "        total_sum += max_subarray_sum(n, seed, min_val, max_val)\n", + "    return total_sum\n", + "\n", + "# Parameters\n", + "n = 10000         # Number of random numbers\n", + "initial_seed = 42 # Initial seed for the LCG\n", + "min_val = -10     # 
Minimum value of random numbers\n", + "max_val = 10 # Maximum value of random numbers\n", + "\n", + "# Timing the function\n", + "import time\n", + "start_time = time.time()\n", + "result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n", + "end_time = time.time()\n", + "\n", + "print(\"Total Maximum Subarray Sum (20 runs):\", result)\n", + "print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0be9f47d-5213-4700-b0e2-d444c7c738c0", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_stream_gpt(in_lang, out_lang, input_instruct, in_code): \n", + " stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=convert_messages_for(in_lang, out_lang, input_instruct, in_code), temperature=0.0, stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " fragment = chunk.choices[0].delta.content or \"\"\n", + " reply += fragment\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8669f56b-8314-4582-a167-78842caea131", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_stream_claude(in_lang, out_lang, input_instruct, in_code):\n", + " result = claude.messages.stream(\n", + " model=CLAUDE_MODEL,\n", + " max_tokens=2000,\n", + " temperature=0.0,\n", + " system=convert_system_prompt_for(in_lang, out_lang),\n", + " messages=[{\"role\": \"user\", \"content\": convert_user_prompt_for(in_lang, out_lang, input_instruct, in_code)}],\n", + " )\n", + " reply = \"\"\n", + " with result as stream:\n", + " for text in stream.text_stream:\n", + " reply += text\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01d3cd4f-c100-4e25-8670-0663513f6136", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_stream_gemini(in_lang, out_lang, input_instruct, in_code): \n", + " stream = 
gemini_via_openai_client.chat.completions.create(model=GEMINI_MODEL, messages=convert_messages_for(in_lang, out_lang, input_instruct, in_code), temperature=0.0, stream=True)\n", + "    reply = \"\"\n", + "    for chunk in stream:\n", + "        fragment = chunk.choices[0].delta.content or \"\"\n", + "        reply += fragment\n", + "        yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f1ae8f5-16c8-40a0-aa18-63b617df078d", + "metadata": {}, + "outputs": [], + "source": [ + "def optimize(in_lang, out_lang, in_code, input_instruct, convert_model):\n", + "    if \"gpt\" in convert_model.lower():\n", + "        result = convert_stream_gpt(in_lang, out_lang, input_instruct, in_code)\n", + "    elif \"claude\" in convert_model.lower():\n", + "        result = convert_stream_claude(in_lang, out_lang, input_instruct, in_code)\n", + "    elif \"gemini\" in convert_model.lower():\n", + "        result = convert_stream_gemini(in_lang, out_lang, input_instruct, in_code)\n", + "    else:\n", + "        raise ValueError(\"Unknown convert model\")\n", + "    for stream_so_far in result:\n", + "        yield stream_so_far " + ] + }, + { + "cell_type": "markdown", + "id": "07383878-f887-464f-8bc7-527c669d3edd", + "metadata": {}, + "source": [ + "## 2. Comment part" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d254038c-fdd6-4ef8-8b7a-a074f1e7405d", + "metadata": {}, + "outputs": [], + "source": [ + "def comment_system_prompt_for(lang, comment_style):\n", + "    comment_system_message = f\"You are an assistant that generates necessary, concise and clear comments/docstrings for the {lang} code by applying the {comment_style} comment style. \"\n", + "    comment_system_message += f\"Respond only with added comments, and do not provide any redundant explanation. 
\"\n", + " return comment_system_message" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e95cee4f-f229-4c9f-8e67-8a68cc9534c3", + "metadata": {}, + "outputs": [], + "source": [ + "def comment_user_prompt_for(lang, code, comment_style):\n", + " comment_user_prompt = f\"Add comments/docstrings to the given code for the {lang} programming language in {comment_style} comment style. \"\n", + " comment_user_prompt += f\"Respond only with added comments, and do not provide any redundant explanation.\\n\\n\"\n", + " comment_user_prompt += f\"The given code is as follows: \"\n", + " comment_user_prompt += code\n", + " return comment_user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "507426c2-cf5a-4041-b904-b18a5afe83b6", + "metadata": {}, + "outputs": [], + "source": [ + "def comment_messages_for(lang, code, comment_style):\n", + " return [\n", + " {\"role\": \"system\", \"content\": comment_system_prompt_for(lang, comment_style)},\n", + " {\"role\": \"user\", \"content\": comment_user_prompt_for(lang, code, comment_style)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e1c8cf6-7a15-4e79-82f6-6bb2a0b85773", + "metadata": {}, + "outputs": [], + "source": [ + "def comment_stream_gpt(lang, code, comment_style): \n", + " stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=comment_messages_for(lang, code, comment_style), temperature=0.0, stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " fragment = chunk.choices[0].delta.content or \"\"\n", + " reply += fragment\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26f27781-4a3e-4e5f-a8ab-9a25944a9879", + "metadata": {}, + "outputs": [], + "source": [ + "def comment_stream_claude(lang, code, comment_style):\n", + " result = claude.messages.stream(\n", + " model=CLAUDE_MODEL,\n", + " max_tokens=2000,\n", + " temperature=0.0,\n", + " 
system=comment_system_prompt_for(lang, comment_style),\n", + " messages=[{\"role\": \"user\", \"content\": comment_user_prompt_for(lang, code, comment_style)}],\n", + " )\n", + " reply = \"\"\n", + " with result as stream:\n", + " for text in stream.text_stream:\n", + " reply += text\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e6719e7-f2f3-40ea-8fed-01d84a641306", + "metadata": {}, + "outputs": [], + "source": [ + "def comment_stream_gemini(lang, code, comment_style): \n", + " stream = gemini_via_openai_client.chat.completions.create(model=GEMINI_MODEL, messages=comment_messages_for(lang, code, comment_style), temperature=0.0, stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " fragment = chunk.choices[0].delta.content or \"\"\n", + " reply += fragment\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b98acc4-23d8-4671-8f19-92d72631b55d", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_comments_via_model(lang, code, comment_style, comment_model):\n", + " if \"gpt\" in comment_model.lower():\n", + " result = comment_stream_gpt(lang, code, comment_style)\n", + " elif \"claude\" in comment_model.lower():\n", + " result = comment_stream_claude(lang, code, comment_style)\n", + " elif \"gemini\" in comment_model.lower():\n", + " result = comment_stream_gemini(lang, code, comment_style)\n", + " else:\n", + " raise ValueError(\"Unknown comment model\")\n", + " for stream_so_far in result:\n", + " yield stream_so_far " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "282c75ae-d8c3-4866-a024-f7ecf87b3cde", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_comments_fn(comment_option, in_lang, out_lang, in_code, out_code, in_comment_style, out_comment_style, comment_model):\n", + " if 'input' in comment_option:\n", + " in_gen = generate_comments_via_model(in_lang, in_code, in_comment_style, comment_model)\n", + 
" for in_output in in_gen:\n", + " yield in_output, \"\"\n", + " elif 'output' in comment_option:\n", + " out_gen = generate_comments_via_model(out_lang, out_code, out_comment_style, comment_model)\n", + " for out_output in out_gen:\n", + " yield \"\", out_output\n", + " elif 'both' in comment_option:\n", + " in_gen = generate_comments_via_model(in_lang, in_code, in_comment_style, comment_model)\n", + " out_gen = generate_comments_via_model(out_lang, out_code, out_comment_style, comment_model)\n", + " for in_output, out_output in zip(in_gen, out_gen):\n", + " yield in_output, out_output" + ] + }, + { + "cell_type": "markdown", + "id": "ce2c178c-d03c-49c0-b0e9-c57c699bca08", + "metadata": {}, + "source": [ + "## 3. Unit test part" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5a4743e-e1a8-42c7-8f1f-a73d49c0895d", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_system_prompt_for(lang, unit_test_style):\n", + " unit_test_system_message = f\"You are an assistant that generates necessary, concise, clear, and executable unit tests for the {lang} code by applying {unit_test_style} unit test style. \"\n", + " unit_test_system_message += f\"Respond only with generated unit tests; use comments sparingly and do not provide any explanation other than occasional comments. \"\n", + " return unit_test_system_message" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "334d5e40-71ff-4d24-8cef-b6c81c188e4d", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_user_prompt_for(lang, code, unit_test_style):\n", + " unit_test_user_prompt = f\"Add unit tests to the given code for the {lang} programming language in {unit_test_style} unit test style. 
\"\n", + " unit_test_user_prompt += f\"Respond only with generated unit tests; use comments sparingly and do not provide any explanation other than occasional comments.\\n\\n\"\n", + " unit_test_user_prompt += f\"The given code is as follows: \"\n", + " unit_test_user_prompt += code\n", + " return unit_test_user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a8e061f-3993-4746-9425-d938d2537f65", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_messages_for(lang, code, unit_test_style):\n", + " return [\n", + " {\"role\": \"system\", \"content\": unit_test_system_prompt_for(lang, unit_test_style)},\n", + " {\"role\": \"user\", \"content\": unit_test_user_prompt_for(lang, code, unit_test_style)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "71c1613b-7a16-4443-acec-d0a2d9bed192", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_stream_gpt(lang, code, unit_test_style): \n", + " stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=unit_test_messages_for(lang, code, unit_test_style), stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " fragment = chunk.choices[0].delta.content or \"\"\n", + " reply += fragment\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a6e3502-f7ff-42b8-8fc5-2697b2d1f36e", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_stream_claude(lang, code, unit_test_style):\n", + " result = claude.messages.stream(\n", + " model=CLAUDE_MODEL,\n", + " max_tokens=2000,\n", + " system=unit_test_system_prompt_for(lang, unit_test_style),\n", + " messages=[{\"role\": \"user\", \"content\": unit_test_user_prompt_for(lang, code, unit_test_style)}],\n", + " )\n", + " reply = \"\"\n", + " with result as stream:\n", + " for text in stream.text_stream:\n", + " reply += text\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"8d7f694f-a276-4bdc-9cfb-755483fd4380", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_stream_gemini(lang, code, unit_test_style): \n", + " stream = gemini_via_openai_client.chat.completions.create(model=GEMINI_MODEL, messages=unit_test_messages_for(lang, code, unit_test_style), stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " fragment = chunk.choices[0].delta.content or \"\"\n", + " reply += fragment\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c824429a-b18a-4320-8258-0141037a6531", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_unit_test_via_model(lang, code, unit_test_style, unit_test_model):\n", + " if \"gpt\" in unit_test_model.lower():\n", + " result = unit_test_stream_gpt(lang, code, unit_test_style)\n", + " elif \"claude\" in unit_test_model.lower():\n", + " result = unit_test_stream_claude(lang, code, unit_test_style)\n", + " elif \"gemini\" in unit_test_model.lower():\n", + " result = unit_test_stream_gemini(lang, code, unit_test_style)\n", + " else:\n", + " raise ValueError(\"Unknown unit test model\")\n", + " for stream_so_far in result:\n", + " yield stream_so_far " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3e59e26-37c0-4429-b69c-deb581423dd0", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_unit_test_fn(unit_test_option, in_lang, out_lang, in_code, out_code, in_unit_test_style, out_unit_test_style, unit_test_model):\n", + " if 'input' in unit_test_option:\n", + " in_gen = generate_unit_test_via_model(in_lang, in_code, in_unit_test_style, unit_test_model)\n", + " for in_output in in_gen:\n", + " yield in_output, \"\"\n", + " elif 'output' in unit_test_option:\n", + " out_gen = generate_unit_test_via_model(out_lang, out_code, out_unit_test_style, unit_test_model)\n", + " for out_output in out_gen:\n", + " yield \"\", out_output\n", + " elif 'both' in unit_test_option:\n", + " in_gen = 
generate_unit_test_via_model(in_lang, in_code, in_unit_test_style, unit_test_model)\n", + " out_gen = generate_unit_test_via_model(out_lang, out_code, out_unit_test_style, unit_test_model)\n", + " for in_output, out_output in zip(in_gen, out_gen):\n", + " yield in_output, out_output" + ] + }, + { + "cell_type": "markdown", + "id": "2a1f4d0c-f417-4de4-be9f-441cbe5a6db3", + "metadata": {}, + "source": [ + "## 4. Gradio UI part" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a2274f1-d03b-42c0-8dcc-4ce159b18442", + "metadata": {}, + "outputs": [], + "source": [ + "LANGUAGE_INFO = {\n", + " \"Python\": {\n", + " \"doc_style\": [\"Google-style\", \"NumPy-style\", \"reST\", \"Doxygen\"],\n", + " \"unit_test_style\": [\"unittest\", \"pytest\", \"doctest\"]\n", + " },\n", + " \"Java\": {\n", + " \"doc_style\": [\"Javadoc\"],\n", + " \"unit_test_style\": [\"JUnit4\", \"JUnit5\", \"TestNG\"]\n", + " },\n", + " \"JavaScript\": {\n", + " \"doc_style\": [\"JSDoc\"],\n", + " \"unit_test_style\": [\"Jest\", \"Mocha + Chai\", \"Jasmine\"]\n", + " },\n", + " \"TypeScript\": {\n", + " \"doc_style\": [\"JSDoc\", \"TSDoc\"],\n", + " \"unit_test_style\": [\"Jest\", \"Mocha + Chai\", \"Vitest\"]\n", + " },\n", + " \"C\": {\n", + " \"doc_style\": [\"Doxygen\"],\n", + " \"unit_test_style\": [\"Google Test (gtest)\", \"CppUnit\", \"Catch2\"]\n", + " },\n", + " \"C++\": {\n", + " \"doc_style\": [\"Doxygen\"],\n", + " \"unit_test_style\": [\"Google Test (gtest)\", \"CppUnit\", \"Catch2\"]\n", + " },\n", + " \"C#\": {\n", + " \"doc_style\": [\"XML comments\"],\n", + " \"unit_test_style\": [\"xUnit\", \"NUnit\", \"MSTest\"]\n", + " },\n", + " \"Go\": {\n", + " \"doc_style\": [\"Godoc\"],\n", + " \"unit_test_style\": [\"Built-in testing package\"]\n", + " },\n", + " \"Rust\": {\n", + " \"doc_style\": [\"Rustdoc\", \"Markdown\"],\n", + " \"unit_test_style\": [\"Built-in #[test] annotation\"]\n", + " },\n", + " \"Kotlin\": {\n", + " \"doc_style\": [\"KDoc\"],\n", + " 
\"unit_test_style\": [\"JUnit\", \"Kotest\", \"Spek\"]\n", + " },\n", + " \"Swift\": {\n", + " \"doc_style\": [\"Mark-style comments\"],\n", + " \"unit_test_style\": [\"XCTest\"]\n", + " },\n", + " \"PHP\": {\n", + " \"doc_style\": [\"PHPDoc\"],\n", + " \"unit_test_style\": [\"PHPUnit\"]\n", + " },\n", + " \"Julia\": {\n", + " \"doc_style\": [\"Markdown\"],\n", + " \"unit_test_style\": [\"Built-in Test standard library\"]\n", + " }\n", + "}\n", + "LANGUAGES = list(LANGUAGE_INFO.keys())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b50e7833-8f6f-407e-8174-37af9cec2030", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks(title=\"Power Coder\", theme=gr.themes.Citrus(), css=\"\"\"\n", + ".selected {\n", + " background-color: orange !important;\n", + " box-shadow: 0 4px 12px rgba(255, 140, 0, 0.5) !important;\n", + " color: black;\n", + "}\n", + ".unselected {\n", + " background-color: gray !important;\n", + " box-shadow: 0 4px 12px rgba(128, 128, 128, 0.4);\n", + " color: white;\n", + "}\n", + "\"\"\") as ui:\n", + " current_selected = gr.State(\"\")\n", + " initial_in_lang = \"Python\"\n", + " initial_out_lang = \"Java\"\n", + " in_comment_style_choices = [\"Standard\"] + LANGUAGE_INFO[initial_in_lang][\"doc_style\"]\n", + " out_comment_style_choices = [\"Standard\"] + LANGUAGE_INFO[initial_out_lang][\"doc_style\"]\n", + " in_unit_test_style_choices = [\"Standard\"] + LANGUAGE_INFO[initial_in_lang][\"unit_test_style\"]\n", + " out_unit_test_style_choices = [\"Standard\"] + LANGUAGE_INFO[initial_out_lang][\"unit_test_style\"]\n", + " in_code_file_name = gr.State(\"in_code.txt\")\n", + " out_code_file_name = gr.State(\"out_code.txt\")\n", + " in_comments_file_name = gr.State(\"in_comments.txt\")\n", + " out_comments_file_name = gr.State(\"out_comments.txt\")\n", + " in_unit_test_file_name = gr.State(\"in_unit_tests.txt\")\n", + " out_unit_test_file_name = gr.State(\"out_unit_tests.txt\")\n", + " \n", + " \n", + " 
gr.Markdown(\"## Code Helper\")\n", + "\n", + " def load_file_content(file):\n", + " if file is None:\n", + " return \"\"\n", + " with open(file.name, \"r\", encoding=\"utf-8\") as f:\n", + " return f.read()\n", + "\n", + " def change_lang(lang):\n", + " comment_style_choices = [\"Standard\"] + LANGUAGE_INFO[lang][\"doc_style\"]\n", + " unit_test_style_choices = [\"Standard\"] + LANGUAGE_INFO[lang][\"unit_test_style\"]\n", + " return (\n", + " gr.update(choices=comment_style_choices, value=str(comment_style_choices[0])), \n", + " gr.update(choices=unit_test_style_choices, value=str(unit_test_style_choices[0]))\n", + " )\n", + "\n", + " def download_fn(in_text, out_text, in_file_name, out_file_name):\n", + " if in_text:\n", + " with open(in_file_name, \"w\") as f:\n", + " f.write(in_text)\n", + " if out_text:\n", + " with open(out_file_name, \"w\") as f:\n", + " f.write(out_text)\n", + " \n", + " # Conversion part\n", + " with gr.Row():\n", + " in_lang = gr.Dropdown(choices=LANGUAGES, label=\"Select input language\", value=initial_in_lang, interactive=True)\n", + " out_lang = gr.Dropdown(choices=LANGUAGES, label=\"Select output language\", value=initial_out_lang, interactive=True)\n", + " with gr.Row():\n", + " input_file = gr.File(label=\"Upload a source code file or input below\")\n", + " input_instruct = gr.Textbox(\n", + " label=\"Additional instruction (optional)\",\n", + " placeholder=\"Enter the instruction you want the output code to follow...\\n\\nFor example: Define variables using snake_case style.\",\n", + " lines=8\n", + " )\n", + " with gr.Row():\n", + " in_code = gr.Textbox(label=\"Input Code:\", value=python_hard, lines=10)\n", + " out_code = gr.Textbox(label=\"Output Code:\", lines=10)\n", + " with gr.Row():\n", + " convert_model = gr.Dropdown([\"Claude\", \"GPT\", \"Gemini\"], label=\"Select model\", value=\"Claude\")\n", + " with gr.Row():\n", + " convert = gr.Button(\"Convert code\")\n", + " download_code = gr.Button(\"Download code\")\n", + 
"\n", + " gr.HTML(\"<hr>
\")\n", + "\n", + " def show_comment(current_selected):\n", + " if current_selected == \"comment\":\n", + " return (\n", + " gr.update(visible=False),\n", + " gr.update(visible=False),\n", + " gr.update(elem_classes=[\"unselected\"]),\n", + " gr.update(elem_classes=[\"unselected\"]),\n", + " \"\"\n", + " )\n", + " else:\n", + " return (\n", + " gr.update(visible=True),\n", + " gr.update(visible=False),\n", + " gr.update(elem_classes=[\"selected\"]),\n", + " gr.update(elem_classes=[\"unselected\"]),\n", + " \"comment\"\n", + " )\n", + "\n", + " def show_unit_test(current_selected):\n", + " if current_selected == \"unit_test\":\n", + " return (\n", + " gr.update(visible=False),\n", + " gr.update(visible=False),\n", + " gr.update(elem_classes=[\"unselected\"]),\n", + " gr.update(elem_classes=[\"unselected\"]),\n", + " \"\"\n", + " )\n", + " else:\n", + " return (\n", + " gr.update(visible=False),\n", + " gr.update(visible=True),\n", + " gr.update(elem_classes=[\"unselected\"]),\n", + " gr.update(elem_classes=[\"selected\"]),\n", + " \"unit_test\"\n", + " )\n", + " \n", + " with gr.Blocks() as demo:\n", + " with gr.Row():\n", + " comment_show_up = gr.Button(\"Comment\", elem_id=\"comment-btn\", elem_classes=[\"unselected\"])\n", + " unit_test_show_up = gr.Button(\"Unit Test\", elem_id=\"unit-test-btn\", elem_classes=[\"unselected\"])\n", + " \n", + " comment_section = gr.Column(visible=False)\n", + " unit_test_section = gr.Column(visible=False)\n", + " \n", + " with comment_section:\n", + " # Comment section\n", + " with gr.Row():\n", + " comment_option = gr.Radio(\n", + " choices=[\n", + " \"Comment input code\",\n", + " \"Comment output code\",\n", + " \"Comment both\"\n", + " ],\n", + " label=\"Commenting Options\",\n", + " value=\"Comment input code\",\n", + " interactive=True\n", + " )\n", + " with gr.Row():\n", + " in_comment_style = gr.Dropdown(choices=in_comment_style_choices, label=\"Select comment style for input code\", value=in_comment_style_choices[0], 
interactive=True)\n", + " out_comment_style = gr.Dropdown(choices=out_comment_style_choices, label=\"Select comment style for output code\", value=out_comment_style_choices[0], interactive=True)\n", + " with gr.Row():\n", + " comment_model = gr.Dropdown([\"Claude\", \"GPT\", \"Gemini\"], label=\"Select model\", value=\"Claude\")\n", + " with gr.Row():\n", + " generate_comments = gr.Button(\"Generate comments\")\n", + " download_comments = gr.Button(\"Download comments\")\n", + " with gr.Row():\n", + " in_comments = gr.Textbox(label=\"Comments for Input Code:\", lines=10)\n", + " out_comments = gr.Textbox(label=\"Comments for Output Code:\", lines=10)\n", + " \n", + " with unit_test_section:\n", + " # Unit test part\n", + " with gr.Row():\n", + " unit_test_option = gr.Radio(\n", + " choices=[\n", + " \"Add unit test for input code\",\n", + " \"Add unit test for output code\",\n", + " \"Add unit test for both\"\n", + " ],\n", + " label=\"Unit Test Options\",\n", + " value=\"Add unit test for input code\",\n", + " interactive=True\n", + " )\n", + " with gr.Row():\n", + " in_unit_test_style = gr.Dropdown(choices=in_unit_test_style_choices, label=\"Select unit test style for input code\", value=in_unit_test_style_choices[0], interactive=True)\n", + " out_unit_test_style = gr.Dropdown(choices=out_unit_test_style_choices, label=\"Select unit test style for output code\", value=out_unit_test_style_choices[0], interactive=True)\n", + " with gr.Row():\n", + " unit_test_model = gr.Dropdown([\"Claude\", \"GPT\", \"Gemini\"], label=\"Select model\", value=\"Claude\")\n", + " with gr.Row():\n", + " generate_unit_test = gr.Button(\"Generate unit tests\")\n", + " download_unit_test = gr.Button(\"Download unit tests\")\n", + " with gr.Row():\n", + " in_unit_test = gr.Textbox(label=\"Unit Test for Input Code:\", lines=10)\n", + " out_unit_test = gr.Textbox(label=\"Unit Test for Output Code:\", lines=10)\n", + "\n", + " in_lang.change(fn=change_lang, inputs=in_lang, 
outputs=[in_comment_style, in_unit_test_style])\n", + " out_lang.change(fn=change_lang, inputs=out_lang, outputs=[out_comment_style, out_unit_test_style])\n", + " input_file.change(fn=load_file_content, inputs=input_file, outputs=in_code)\n", + " \n", + " convert.click(optimize, inputs=[in_lang, out_lang, in_code, input_instruct, convert_model], outputs=[out_code])\n", + " download_code.click(download_fn, inputs=[in_code, out_code, in_code_file_name, out_code_file_name])\n", + " \n", + " comment_show_up.click(fn=show_comment, inputs=current_selected, outputs=[comment_section, unit_test_section, comment_show_up, unit_test_show_up, current_selected])\n", + " unit_test_show_up.click(fn=show_unit_test, inputs=current_selected, outputs=[comment_section, unit_test_section, comment_show_up, unit_test_show_up, current_selected])\n", + "\n", + " generate_comments.click(generate_comments_fn, inputs=[comment_option, in_lang, out_lang, in_code, out_code, in_comment_style, out_comment_style, comment_model], outputs=[in_comments, out_comments])\n", + " download_comments.click(download_fn, inputs=[in_comments, out_comments, in_comments_file_name, out_comments_file_name])\n", + " generate_unit_test.click(generate_unit_test_fn, inputs=[unit_test_option, in_lang, out_lang, in_code, out_code, in_unit_test_style, out_unit_test_style, unit_test_model], outputs=[in_unit_test, out_unit_test])\n", + " download_unit_test.click(download_fn, inputs=[in_unit_test, out_unit_test, in_unit_test_file_name, out_unit_test_file_name])\n", + " \n", + "ui.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0266734c-0bee-46c0-9b17-9fd2ae86cc3a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + 
"name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week4/community-contributions/Week4_day3_Gemini_Codestral.ipynb b/week4/community-contributions/Week4_day3_Gemini_Codestral.ipynb new file mode 100644 index 0000000..8fa0417 --- /dev/null +++ b/week4/community-contributions/Week4_day3_Gemini_Codestral.ipynb @@ -0,0 +1,643 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ac833f26-d429-4fd2-8f83-92174f1c951a", + "metadata": {}, + "source": [ + "# Code conversion using Gemini and Codestral in Windows 11" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c230178c-6f31-4c5a-a888-16b7037ffbf9", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import io\n", + "import sys\n", + "import gradio as gr\n", + "import subprocess\n", + "from google import genai\n", + "from google.genai import types\n", + "from mistralai import Mistral\n", + "from dotenv import load_dotenv\n", + "from IPython.display import Markdown, display, update_display" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d824484-eaaa-456a-b7dc-7e3277fec34a", + "metadata": {}, + "outputs": [], + "source": [ + "# Load Gemini and Mistral API Keys\n", + "\n", + "load_dotenv(override=True)\n", + "gemini_api_key = os.getenv(\"GOOGLE_API_KEY\")\n", + "mistral_api_key = os.getenv(\"MISTRAL_API_KEY\")\n", + "\n", + "if not mistral_api_key or not gemini_api_key:\n", + " print(\"API Key not found!\")\n", + "else:\n", + " print(\"API Key loaded in memory\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86f3633e-81f9-4c13-b7b5-793ddc4f886f", + "metadata": {}, + "outputs": [], + "source": [ + "# Models to be used\n", + "\n", + "MODEL_GEMINI = 'gemini-2.5-flash'\n", + "MODEL_CODESTRAL = 'codestral-latest'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"3f3a6d53-70f9-46b8-a490-a50f3a1adf9e", + "metadata": {}, + "outputs": [], + "source": [ + "# Load Gemini client\n", + "try:\n", + " gemini_client = genai.Client(api_key=gemini_api_key)\n", + " print(\"Google GenAI Client initialized successfully!\")\n", + "\n", + " codestral_client = Mistral(api_key=mistral_api_key)\n", + " print(\"Mistral Client initialized successfully!\")\n", + "except Exception as e:\n", + " print(f\"Error initializing Client: {e}\")\n", + " exit() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f816fbe8-e094-499f-98a5-588ebecf8c72", + "metadata": {}, + "outputs": [], + "source": [ + "# Gemini System prompt\n", + "\n", + "system_message = \"You are an assistant that reimplements Python code in high-performance C++ optimized for a Windows PC. \"\n", + "system_message += \"Use Windows-specific optimizations where applicable (e.g., multithreading with std::thread, SIMD, or WinAPI if necessary). \"\n", + "system_message += \"Respond only with the equivalent C++ code; include comments only where absolutely necessary. \"\n", + "system_message += \"Avoid any explanation or text outside the code. \"\n", + "system_message += \"The C++ output must produce identical functionality with the fastest possible execution time on Windows.\"\n", + "\n", + "generate_content_config = types.GenerateContentConfig(system_instruction=system_message)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01227835-15d2-40bd-a9dd-2ef35ad371dc", + "metadata": {}, + "outputs": [], + "source": [ + "def user_prompt_for(python):\n", + " user_prompt = (\n", + " \"Convert the following Python code into high-performance C++ optimized for Windows. \"\n", + " \"Use standard C++20 or newer with Windows-compatible libraries and best practices. \"\n", + " \"Ensure the implementation runs as fast as possible and produces identical output. \"\n", + " \"Use appropriate numeric types to avoid overflow or precision loss. 
\"\n", + " \"Avoid unnecessary abstraction; prefer direct computation and memory-efficient structures. \"\n", + " \"Respond only with C++ code, include all required headers (like <iostream>, <chrono>, etc.), and limit comments to only what's essential.\\n\\n\"\n", + " )\n", + " user_prompt += python\n", + " return user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d9fc8e2-acf0-4122-a8a9-5aadadf982ab", + "metadata": {}, + "outputs": [], + "source": [ + "def user_message_gemini(python): \n", + " return types.Content(role=\"user\", parts=[types.Part.from_text(text=user_prompt_for(python))]) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "334c8b84-6e37-40fc-97ac-40a1b3aa29fa", + "metadata": {}, + "outputs": [], + "source": [ + "def messages_for(python):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_message},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(python)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4aca87ac-6330-4ed4-a36f-1726fd0ada1a", + "metadata": {}, + "outputs": [], + "source": [ + "def write_output(cpp):\n", + " code = cpp.replace(\"```cpp\", \"\").replace(\"```c++\", \"\").replace(\"```\", \"\").strip()\n", + " \n", + " if \"#include\" not in code:\n", + " raise ValueError(\"C++ code appears invalid: missing #include directives.\")\n", + "\n", + " with open(\"optimized.cpp\", \"w\", encoding=\"utf-8\", newline=\"\\n\") as f:\n", + " f.write(code) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcf42642-1a55-4556-8738-0c8c02effa9c", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate CPP code using Gemini\n", + "\n", + "def optimize_gemini(python):\n", + " stream = gemini_client.models.generate_content_stream(\n", + " model = MODEL_GEMINI,\n", + " config=generate_content_config,\n", + " contents=user_message_gemini(python)\n", + " )\n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " chunk_text = 
chunk.text or \"\"\n", + " cpp_code += chunk_text\n", + " print(chunk_text, end=\"\", flush=True) \n", + " write_output(cpp_code)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f06a301-4397-4d63-9226-657bb2ddb792", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate CPP code using Codestral\n", + "\n", + "def optimize_codestral(python):\n", + " stream = codestral_client.chat.stream(\n", + " model = MODEL_CODESTRAL,\n", + " messages = messages_for(python), \n", + " )\n", + " \n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " chunk_text = chunk.data.choices[0].delta.content or \"\"\n", + " cpp_code += chunk_text\n", + " print(chunk_text, end=\"\", flush=True) \n", + " write_output(cpp_code)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8bd51601-7c1d-478d-b043-6f92739e5c4b", + "metadata": {}, + "outputs": [], + "source": [ + "# Actual code to convert\n", + "\n", + "pi = \"\"\"\n", + "import time\n", + "\n", + "def calculate(iterations, param1, param2):\n", + " result = 1.0\n", + " for i in range(1, iterations+1):\n", + " j = i * param1 - param2\n", + " result -= (1/j)\n", + " j = i * param1 + param2\n", + " result += (1/j)\n", + " return result\n", + "\n", + "start_time = time.time()\n", + "result = calculate(100_000_000, 4, 1) * 4\n", + "end_time = time.time()\n", + "\n", + "print(f\"Result: {result:.12f}\")\n", + "print(f\"Execution Time: {(end_time - start_time):.6f} seconds\")\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db9ea24e-d381-48ac-9196-853d2527dcca", + "metadata": {}, + "outputs": [], + "source": [ + "exec(pi)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3e26708-8475-474d-8e96-e602c3d5ef9f", + "metadata": {}, + "outputs": [], + "source": [ + "optimize_gemini(pi)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2cc23ea7-6062-4354-92bc-730baa52a50b", + "metadata": {}, + "outputs": [], + "source": [ + "# 
CPP Compilation\n", + "\n", + "!g++ -O3 -std=c++20 -o optimized.exe optimized.cpp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b14704d-95fe-4ed2-861f-af591bf3090e", + "metadata": {}, + "outputs": [], + "source": [ + "!.\\optimized.exe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5d756d1a-1d49-4cfb-bed7-8748d848b083", + "metadata": {}, + "outputs": [], + "source": [ + "optimize_codestral(pi)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e286dc8-9532-48b1-b748-a7950972e7df", + "metadata": {}, + "outputs": [], + "source": [ + "!g++ -O3 -std=c++20 -o optimized.exe optimized.cpp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61fe0044-7679-4245-9e59-50642f3d80c6", + "metadata": {}, + "outputs": [], + "source": [ + "!.\\optimized.exe" + ] + }, + { + "cell_type": "markdown", + "id": "f0c0392c-d2a7-4619-82a2-f7b9fa7c43f9", + "metadata": {}, + "source": [ + "## Hard Code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ca53eb4-46cd-435b-a950-0e2a8f845535", + "metadata": {}, + "outputs": [], + "source": [ + "python_hard = \"\"\"# Be careful to support large number sizes\n", + "\n", + "def lcg(seed, a=1664525, c=1013904223, m=2**32):\n", + " value = seed\n", + " while True:\n", + " value = (a * value + c) % m\n", + " yield value\n", + " \n", + "def max_subarray_sum(n, seed, min_val, max_val):\n", + " lcg_gen = lcg(seed)\n", + " random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n", + " max_sum = float('-inf')\n", + " for i in range(n):\n", + " current_sum = 0\n", + " for j in range(i, n):\n", + " current_sum += random_numbers[j]\n", + " if current_sum > max_sum:\n", + " max_sum = current_sum\n", + " return max_sum\n", + "\n", + "def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n", + " total_sum = 0\n", + " lcg_gen = lcg(initial_seed)\n", + " for _ in range(20):\n", + " seed = next(lcg_gen)\n", 
+ " total_sum += max_subarray_sum(n, seed, min_val, max_val)\n", + " return total_sum\n", + "\n", + "# Parameters\n", + "n = 10000 # Number of random numbers\n", + "initial_seed = 42 # Initial seed for the LCG\n", + "min_val = -10 # Minimum value of random numbers\n", + "max_val = 10 # Maximum value of random numbers\n", + "\n", + "# Timing the function\n", + "import time\n", + "start_time = time.time()\n", + "result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n", + "end_time = time.time()\n", + "\n", + "print(\"Total Maximum Subarray Sum (20 runs):\", result)\n", + "print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "697cc9fe-efdb-40b7-8e43-871bd2df940e", + "metadata": {}, + "outputs": [], + "source": [ + "exec(python_hard)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "17ed6329-6c5f-45af-91ff-06d73830dd0d", + "metadata": {}, + "outputs": [], + "source": [ + "optimize_gemini(python_hard)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b57f0e7-46c9-4235-86eb-389faf37b7bb", + "metadata": {}, + "outputs": [], + "source": [ + "# CPP Compilation\n", + "\n", + "!g++ -O3 -std=c++20 -o optimized.exe optimized.cpp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8ce8d01-fda8-400d-b3d4-6f1ad3008d28", + "metadata": {}, + "outputs": [], + "source": [ + "!.\\optimized.exe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adbcdac7-8656-41c9-8707-d8a71998d393", + "metadata": {}, + "outputs": [], + "source": [ + "optimize_codestral(python_hard)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f9fc9b1-29cf-4510-83f8-1484d26e871e", + "metadata": {}, + "outputs": [], + "source": [ + "# CPP Compilation\n", + "\n", + "!g++ -O3 -std=c++20 -o optimized.exe optimized.cpp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"52170458-c4a1-4920-8d83-8c5ba7250759", + "metadata": {}, + "outputs": [], + "source": [ + "!.\\optimized.exe" + ] + }, + { + "cell_type": "markdown", + "id": "da6aee85-2792-487b-bef3-fec5dcf12623", + "metadata": {}, + "source": [ + "## Accommodating the entire code in Gradio" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2a90c4f-c289-4658-a6ce-51b80e20f91f", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_gemini(python):\n", + " stream = gemini_client.models.generate_content_stream(\n", + " model = MODEL_GEMINI,\n", + " config=generate_content_config,\n", + " contents=user_message_gemini(python)\n", + " )\n", + "\n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " chunk_text = chunk.text or \"\"\n", + " cpp_code += chunk_text\n", + " yield cpp_code.replace('```cpp\\n','').replace('```','')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e872171-96d8-4041-8cb0-0c632c5e957f", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_codestral(python):\n", + " stream = codestral_client.chat.stream(\n", + " model = MODEL_CODESTRAL,\n", + " messages = messages_for(python), \n", + " )\n", + "\n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " chunk_text = chunk.data.choices[0].delta.content or \"\"\n", + " cpp_code += chunk_text\n", + " yield cpp_code.replace('```cpp\\n','').replace('```','') " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3340b36b-1241-4b0f-9e69-d4e5cc215a27", + "metadata": {}, + "outputs": [], + "source": [ + "def optimize(python, model):\n", + " if model.lower() == 'gemini':\n", + " result = stream_gemini(python)\n", + " elif model.lower() == 'codestral':\n", + " result = stream_codestral(python)\n", + " else:\n", + " raise ValueError(\"Unknown model\")\n", + " \n", + " for stream_so_far in result:\n", + " yield stream_so_far " + ] + }, + { + "cell_type": "markdown", + "id": "277ddd6c-e71e-4512-965a-57fca341487a", + "metadata": {}, + 
"source": [ + "### Gradio Implementation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "222a9eae-236e-4ba3-8f23-3d9b879ec2d0", + "metadata": {}, + "outputs": [], + "source": [ + "custom_css = \"\"\"\n", + ".scrollable-box textarea {\n", + " overflow: auto !important;\n", + " height: 400px;\n", + "}\n", + "\n", + ".python {background-color: #306998;}\n", + ".cpp {background-color: #050;}\n", + "\n", + "\"\"\"\n", + "\n", + "theme = gr.themes.Soft()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b4bd6ed1-ff8c-42d4-8da6-24b9cfd134db", + "metadata": {}, + "outputs": [], + "source": [ + "def execute_python(code):\n", + " try:\n", + " result = subprocess.run(\n", + " [\"python\", \"-c\", code],\n", + " capture_output=True,\n", + " text=True,\n", + " timeout=60\n", + " )\n", + " if result.returncode == 0:\n", + " return result.stdout or \"[No output]\"\n", + " else:\n", + " return f\"[Error]\\n{result.stderr}\"\n", + " except subprocess.TimeoutExpired:\n", + " return \"[Error] Execution timed out.\"\n", + " except Exception as e:\n", + " return f\"[Exception] {str(e)}\" " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1507c973-8699-48b2-80cd-45900c97a867", + "metadata": {}, + "outputs": [], + "source": [ + "def execute_cpp(code):\n", + " write_output(code)\n", + " \n", + " try:\n", + " compile_cmd = [\"g++\", \"-O3\", \"-std=c++20\", \"-o\", \"optimized.exe\", \"optimized.cpp\"]\n", + " compile_result = subprocess.run(compile_cmd, capture_output=True, text=True, check=True)\n", + " \n", + " run_cmd = [\"optimized.exe\"]\n", + " run_result = subprocess.run(run_cmd, check=True, text=True, capture_output=True, timeout=60)\n", + " \n", + " return run_result.stdout or \"[No output]\"\n", + " \n", + " except subprocess.CalledProcessError as e:\n", + " return f\"[Compile/Runtime Error]\\n{e.stderr}\"\n", + " except subprocess.TimeoutExpired:\n", + " return \"[Error] Execution timed out.\"\n", + " except 
Exception as e:\n", + " return f\"[Exception] {str(e)}\" " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "374f00f3-8fcf-4ae9-bf54-c5a44dd74844", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks(css=custom_css, theme=theme) as ui:\n", + " gr.Markdown(\"## Convert code from Python to C++\")\n", + " with gr.Row():\n", + " python = gr.Textbox(label=\"Python code:\", lines=10, value=python_hard, elem_classes=[\"scrollable-box\"])\n", + " cpp = gr.Textbox(label=\"C++ code:\", lines=10, elem_classes=[\"scrollable-box\"])\n", + " with gr.Row():\n", + " model = gr.Dropdown([\"Gemini\", \"Codestral\"], label=\"Select model\", value=\"Gemini\")\n", + " convert = gr.Button(\"Convert code\")\n", + " with gr.Row():\n", + " python_run = gr.Button(\"Run Python\")\n", + " cpp_run = gr.Button(\"Run C++\")\n", + " with gr.Row():\n", + " python_out = gr.TextArea(label=\"Python result:\", elem_classes=[\"python\"])\n", + " cpp_out = gr.TextArea(label=\"C++ result:\", elem_classes=[\"cpp\"])\n", + "\n", + " convert.click(optimize, inputs=[python,model], outputs=[cpp])\n", + " python_run.click(execute_python,inputs=[python], outputs=[python_out])\n", + " cpp_run.click(execute_cpp, inputs=[cpp], outputs=[cpp_out])\n", + "\n", + "ui.launch(inbrowser=True) " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week4/community-contributions/Week4_day4_HFInference_QwenCode2.5.ipynb b/week4/community-contributions/Week4_day4_HFInference_QwenCode2.5.ipynb new file mode 100644 index 0000000..50d1302 --- /dev/null +++ 
b/week4/community-contributions/Week4_day4_HFInference_QwenCode2.5.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4c07cdc9-bce0-49ad-85c7-14f1872b8519", + "metadata": {}, + "source": [ + "# Python to CPP using Qwen2.5-Coder-32B-Instruct with Hyperbolic Inference Endpoint in Windows" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f051c517-c4fd-4248-98aa-b808fae76cf6", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import io\n", + "import sys\n", + "import gradio as gr\n", + "import subprocess\n", + "from dotenv import load_dotenv\n", + "from huggingface_hub import InferenceClient\n", + "from google import genai\n", + "from google.genai import types\n", + "from mistralai import Mistral" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6c8777b-57bc-436a-978f-21a37ea310ae", + "metadata": {}, + "outputs": [], + "source": [ + "# Load Api Keys from env\n", + "\n", + "load_dotenv(override=True)\n", + "\n", + "hf_api_key = os.getenv(\"HF_TOKEN\")\n", + "gemini_api_key = os.getenv(\"GOOGLE_API_KEY\")\n", + "mistral_api_key = os.getenv(\"MISTRAL_API_KEY\")\n", + "\n", + "if not mistral_api_key or not gemini_api_key or not hf_api_key:\n", + " print(\"API Key not found!\")\n", + "else:\n", + " print(\"API Key loaded in memory\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5cf6f93-7e07-40e0-98b8-d4e74ea18402", + "metadata": {}, + "outputs": [], + "source": [ + "# MODELs \n", + "\n", + "MODEL_QWEN = \"Qwen/Qwen2.5-Coder-32B-Instruct\"\n", + "MODEL_GEMINI = 'gemini-2.5-flash'\n", + "MODEL_CODESTRAL = 'codestral-latest'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "689547c3-aaa5-4800-86a2-da52765997d8", + "metadata": {}, + "outputs": [], + "source": [ + "# Load Clients\n", + "\n", + "try:\n", + " gemini_client = genai.Client(api_key=gemini_api_key)\n", + " print(\"Google GenAI Client initialized successfully!\")\n", + "\n", 
+ " codestral_client = Mistral(api_key=mistral_api_key)\n", + " print(\"Mistral Client initialized successfully!\")\n", + " \n", + " hf_client = InferenceClient(provider=\"hyperbolic\",api_key=hf_api_key)\n", + " print(\"Hyperbolic Inference Client initialized successfully!\")\n", + "except Exception as e:\n", + " print(f\"Error initializing Client: {e}\")\n", + " exit() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c3a81f4-99c3-463a-ae30-4656a7a246d2", + "metadata": {}, + "outputs": [], + "source": [ + "system_message = \"You are an assistant that reimplements Python code in high-performance C++ optimized for a Windows PC. \"\n", + "system_message += \"Use Windows-specific optimizations where applicable (e.g., multithreading with std::thread, SIMD, or WinAPI if necessary). \"\n", + "system_message += \"Respond only with the equivalent C++ code; include comments only where absolutely necessary. \"\n", + "system_message += \"Avoid any explanation or text outside the code. \"\n", + "system_message += \"The C++ output must produce identical functionality with the fastest possible execution time on Windows.\"\n", + "\n", + "generate_content_config = types.GenerateContentConfig(system_instruction=system_message)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fde9514-1005-4539-b01b-0372730ce67b", + "metadata": {}, + "outputs": [], + "source": [ + "def user_prompt_for(python):\n", + " user_prompt = (\n", + " \"Convert the following Python code into high-performance C++ optimized for Windows. \"\n", + " \"Use standard C++20 or newer with Windows-compatible libraries and best practices. \"\n", + " \"Ensure the implementation runs as fast as possible and produces identical output. \"\n", + " \"Use appropriate numeric types to avoid overflow or precision loss. \"\n", + " \"Avoid unnecessary abstraction; prefer direct computation and memory-efficient structures. 
\"\n", + " \"Respond only with C++ code, include all required headers (like <iostream>, <chrono>, etc.), and limit comments to only what's essential.\\n\\n\"\n", + " )\n", + " user_prompt += python\n", + " return user_prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89c8b010-08dd-4695-a784-65162d82a24b", + "metadata": {}, + "outputs": [], + "source": [ + "def user_message_gemini(python): \n", + " return types.Content(role=\"user\", parts=[types.Part.from_text(text=user_prompt_for(python))]) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66923158-983d-46f7-ab19-f216fb1f6a87", + "metadata": {}, + "outputs": [], + "source": [ + "def messages_for(python):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_message},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(python)}\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ab59a54-b28a-4d07-b04f-b568e6e25dfb", + "metadata": {}, + "outputs": [], + "source": [ + "def write_output(cpp):\n", + " code = cpp.replace(\"```cpp\", \"\").replace(\"```c++\", \"\").replace(\"```\", \"\").strip()\n", + " \n", + " if \"#include\" not in code:\n", + " raise ValueError(\"C++ code appears invalid: missing #include directives.\")\n", + "\n", + " # Write to optimized.cpp so the g++ compilation cells and execute_cpp pick it up\n", + " with open(\"optimized.cpp\", \"w\", encoding=\"utf-8\", newline=\"\\n\") as f:\n", + " f.write(code) " + ] + }, + { + "cell_type": "markdown", + "id": "e05ea9f0-6ade-4699-b5fa-fb8ef9f16bcb", + "metadata": {}, + "source": [ + "### Python Codes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c515ce2c-1f8d-4484-8d34-9ffe1372dad4", + "metadata": {}, + "outputs": [], + "source": [ + "python_easy = \"\"\"\n", + "import time\n", + "\n", + "def calculate(iterations, param1, param2):\n", + " result = 1.0\n", + " for i in range(1, iterations+1):\n", + " j = i * param1 - param2\n", + " result -= (1/j)\n", + " j = i * param1 + param2\n", + " result += (1/j)\n", + " return result\n",
+ "\n", + "start_time = time.time()\n", + "result = calculate(100_000_000, 4, 1) * 4\n", + "end_time = time.time()\n", + "\n", + "print(f\"Result: {result:.12f}\")\n", + "print(f\"Execution Time: {(end_time - start_time):.6f} seconds\")\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83ab4080-71ae-45e6-970b-030dc462f571", + "metadata": {}, + "outputs": [], + "source": [ + "python_hard = \"\"\"# Be careful to support large number sizes\n", + "\n", + "def lcg(seed, a=1664525, c=1013904223, m=2**32):\n", + " value = seed\n", + " while True:\n", + " value = (a * value + c) % m\n", + " yield value\n", + " \n", + "def max_subarray_sum(n, seed, min_val, max_val):\n", + " lcg_gen = lcg(seed)\n", + " random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n", + " max_sum = float('-inf')\n", + " for i in range(n):\n", + " current_sum = 0\n", + " for j in range(i, n):\n", + " current_sum += random_numbers[j]\n", + " if current_sum > max_sum:\n", + " max_sum = current_sum\n", + " return max_sum\n", + "\n", + "def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n", + " total_sum = 0\n", + " lcg_gen = lcg(initial_seed)\n", + " for _ in range(20):\n", + " seed = next(lcg_gen)\n", + " total_sum += max_subarray_sum(n, seed, min_val, max_val)\n", + " return total_sum\n", + "\n", + "# Parameters\n", + "n = 10000 # Number of random numbers\n", + "initial_seed = 42 # Initial seed for the LCG\n", + "min_val = -10 # Minimum value of random numbers\n", + "max_val = 10 # Maximum value of random numbers\n", + "\n", + "# Timing the function\n", + "import time\n", + "start_time = time.time()\n", + "result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n", + "end_time = time.time()\n", + "\n", + "print(\"Total Maximum Subarray Sum (20 runs):\", result)\n", + "print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": 
"31498c5c-ecdd-4ed7-9607-4d09af893b98", + "metadata": {}, + "source": [ + "## Code Implementation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea4a4968-e04f-4939-8c42-32c960699354", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_gemini(python):\n", + " stream = gemini_client.models.generate_content_stream(\n", + " model = MODEL_GEMINI,\n", + " config=generate_content_config,\n", + " contents=user_message_gemini(python)\n", + " )\n", + "\n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " chunk_text = chunk.text or \"\"\n", + " cpp_code += chunk_text\n", + " yield cpp_code.replace('```cpp\\n','').replace('```','')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69601eee-520f-4813-b796-aee9118e8a72", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_codestral(python):\n", + " stream = codestral_client.chat.stream(\n", + " model = MODEL_CODESTRAL,\n", + " messages = messages_for(python), \n", + " )\n", + "\n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " chunk_text = chunk.data.choices[0].delta.content or \"\"\n", + " cpp_code += chunk_text\n", + " yield cpp_code.replace('```cpp\\n','').replace('```','') " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb8899cf-54c0-4d2d-8772-42925c2e1d13", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_qwen(python):\n", + " stream = hf_client.chat.completions.create(\n", + " model = MODEL_QWEN,\n", + " messages = messages_for(python),\n", + " stream=True\n", + " )\n", + " cpp_code = \"\"\n", + " for chunk in stream:\n", + " # Delta content can be None on some streamed chunks, so guard with `or \"\"`\n", + " chunk_text = chunk.choices[0].delta.content or \"\"\n", + " cpp_code += chunk_text\n", + " yield cpp_code.replace('```cpp\\n','').replace('```','')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98862fef-905c-4b50-bc7a-4c0462495b5c", + "metadata": {}, + "outputs": [], + "source": [ + "def optimize(python, model):\n", + " if model.lower() == 'gemini':\n",
+ " result = stream_gemini(python)\n", + " elif model.lower() == 'codestral':\n", + " result = stream_codestral(python)\n", + " elif model.lower() == 'qwen_coder':\n", + " result = stream_qwen(python)\n", + " else:\n", + " raise ValueError(\"Unknown model\")\n", + " \n", + " for stream_so_far in result:\n", + " yield stream_so_far " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa9372df-db01-41d0-842c-4857b20f93f0", + "metadata": {}, + "outputs": [], + "source": [ + "custom_css = \"\"\"\n", + ".scrollable-box textarea {\n", + " overflow: auto !important;\n", + " height: 400px;\n", + "}\n", + "\n", + ".python {background-color: #306998;}\n", + ".cpp {background-color: #050;}\n", + "\n", + "\"\"\"\n", + "\n", + "theme = gr.themes.Soft()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dbcf9fe9-c3da-466b-8478-83dcdbe7d48e", + "metadata": {}, + "outputs": [], + "source": [ + "def execute_python(code):\n", + " try:\n", + " result = subprocess.run(\n", + " [\"python\", \"-c\", code],\n", + " capture_output=True,\n", + " text=True,\n", + " timeout=60\n", + " )\n", + " if result.returncode == 0:\n", + " return result.stdout or \"[No output]\"\n", + " else:\n", + " return f\"[Error]\\n{result.stderr}\"\n", + " except subprocess.TimeoutExpired:\n", + " return \"[Error] Execution timed out.\"\n", + " except Exception as e:\n", + " return f\"[Exception] {str(e)}\" " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8029e00d-1ee8-43d1-8c87-2aa0544cf94c", + "metadata": {}, + "outputs": [], + "source": [ + "def execute_cpp(code):\n", + " write_output(code)\n", + " \n", + " try:\n", + " compile_cmd = [\"g++\", \"-O3\", \"-std=c++20\", \"-o\", \"optimized.exe\", \"optimized.cpp\"]\n", + " compile_result = subprocess.run(compile_cmd, capture_output=True, text=True, check=True)\n", + " \n", + " run_cmd = [\"optimized.exe\"]\n", + " run_result = subprocess.run(run_cmd, check=True, text=True, capture_output=True, 
timeout=60)\n", + " \n", + " return run_result.stdout or \"[No output]\"\n", + " \n", + " except subprocess.CalledProcessError as e:\n", + " return f\"[Compile/Runtime Error]\\n{e.stderr}\"\n", + " except subprocess.TimeoutExpired:\n", + " return \"[Error] Execution timed out.\"\n", + " except Exception as e:\n", + " return f\"[Exception] {str(e)}\" " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5f4e88c-be15-4870-9f99-82b6273ee739", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks(css=custom_css, theme=theme) as ui:\n", + " gr.Markdown(\"## Convert code from Python to C++\")\n", + " with gr.Row():\n", + " python = gr.Textbox(label=\"Python code:\", lines=10, value=python_hard, elem_classes=[\"scrollable-box\"])\n", + " cpp = gr.Textbox(label=\"C++ code:\", lines=10, elem_classes=[\"scrollable-box\"])\n", + " with gr.Row():\n", + " model = gr.Dropdown([\"Gemini\", \"Codestral\", \"QWEN_Coder\"], label=\"Select model\", value=\"Gemini\")\n", + " convert = gr.Button(\"Convert code\")\n", + " with gr.Row():\n", + " python_run = gr.Button(\"Run Python\")\n", + " cpp_run = gr.Button(\"Run C++\")\n", + " with gr.Row():\n", + " python_out = gr.TextArea(label=\"Python result:\", elem_classes=[\"python\"])\n", + " cpp_out = gr.TextArea(label=\"C++ result:\", elem_classes=[\"cpp\"])\n", + "\n", + " convert.click(optimize, inputs=[python,model], outputs=[cpp])\n", + " python_run.click(execute_python,inputs=[python], outputs=[python_out])\n", + " cpp_run.click(execute_cpp, inputs=[cpp], outputs=[cpp_out])\n", + "\n", + "ui.launch(inbrowser=True) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa1a231e-2743-4cee-afe2-783d2b9513e5", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + 
"file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week4/community-contributions/code_commentor.ipynb b/week4/community-contributions/code_commentor.ipynb new file mode 100644 index 0000000..3bf10a5 --- /dev/null +++ b/week4/community-contributions/code_commentor.ipynb @@ -0,0 +1,335 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "07bb451d-2b91-425f-b8ea-6f35ced780b0", + "metadata": {}, + "source": [ + "# AI Code Commenting Assistant \n", + "\n", + "## Project Summary \n", + "\n", + "**Purpose**: \n", + "An AI-powered assistant that automatically generates **clear, concise code comments** to improve code readability and maintainability. \n", + "\n", + "**Key Features**: \n", + "- **Language-Agnostic**: Auto-detects programming languages or allows manual specification \n", + "- **Smart Commenting**: Focuses on explaining **complex logic, algorithms, and edge cases** (not obvious syntax) \n", + "- **Customizable**: Optional focus areas let users prioritize specific parts (e.g., database queries, recursion) \n", + "- **Efficient Workflow**: Processes code in chunks and preserves original formatting \n", + "\n", + "**Benefits**: \n", + "βœ” Saves time writing documentation \n", + "βœ” Helps developers understand unfamiliar code \n", + "βœ” Supports multiple languages (Python, JavaScript, C++, SQL, etc.) 
\n", + "βœ” Avoids redundant comments on trivial operations \n", + "\n", + "**Example Use Case**: \n", + "```python \n", + "# Before AI: \n", + "def fib(n): \n", + " if n <= 1: return n \n", + " else: return fib(n-1) + fib(n-2) \n", + "\n", + "# After AI: \n", + "def fib(n): \n", + " # Recursively computes nth Fibonacci number (O(2^n) time) \n", + " if n <= 1: return n # Base case \n", + " else: return fib(n-1) + fib(n-2) # Recursive case " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0413ae1-0348-4884-ba95-384c4c8f841c", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install --upgrade huggingface_hub" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b22da766-042b-402f-9e05-78aa8f45ddd4", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import io\n", + "from dotenv import load_dotenv\n", + "from google import genai\n", + "from google.genai import types\n", + "from openai import OpenAI\n", + "from anthropic import Anthropic\n", + "from huggingface_hub import InferenceClient\n", + "import gradio as gr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5af6d3de-bab6-475e-b2f3-7b788bb2e529", + "metadata": {}, + "outputs": [], + "source": [ + "# load environments\n", + "load_dotenv(override=True)\n", + "os.environ['ANTHROPIC_API_KEY'] = os.getenv(\"CLAUDE_API_KEY\")\n", + "os.environ[\"HF_TOKEN\"] = os.getenv(\"HF_TOKEN\")\n", + "gemini_api_key= os.getenv(\"GEMINI_API_KEY\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cad0755e-4174-4fbc-84e6-15cc54bc609a", + "metadata": {}, + "outputs": [], + "source": [ + "#initialize remote models\n", + "claude= Anthropic()\n", + "gemini = genai.Client(api_key=gemini_api_key)\n", + "\n", + "#opensource models\n", + "qwen = InferenceClient(provider=\"featherless-ai\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31d75812-1cd3-4512-8446-022c3357c354", + "metadata": {}, + 
"outputs": [], + "source": [ + "#initialize local model\n", + "llama = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31316379-2a56-4707-b207-ea60b490f536", + "metadata": {}, + "outputs": [], + "source": [ + "#models\n", + "claude_model = \"claude-3-5-haiku-latest\"\n", + "gemini_model = \"gemini-2.5-pro\"\n", + "qwen_model= \"Qwen/Qwen2.5-Coder-32B-Instruct\"\n", + "llama_model = \"llama3:8b\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7d9c4bf-0955-4406-8717-ffa7bdd0bec9", + "metadata": {}, + "outputs": [], + "source": [ + "system_message=\"\"\"\n", + "You are an expert AI specialized in code documentation. Your task is to generate concise, meaningful comments that explain the purpose and logic of provided code. Follow these rules:\n", + "\n", + "1. **Infer language**: Auto-detect programming language and use appropriate comment syntax\n", + "2. **Explain why, not what**: Focus on purpose, edge cases, and non-obvious logic\n", + "3. **Be concise**: Maximum 1-2 sentences per comment block\n", + "4. **Prioritize key sections**: Only comment complex logic, algorithms, or critical operations\n", + "5. **Maintain structure**: Preserve original code formatting and indentation\n", + "6. **Output format**: Return ONLY commented code with no additional text\n", + "\n", + "Commenting guidelines by language:\n", + "- Python: `# Inline comments` and `\"\"Docstrings\"\"`\n", + "- JavaScript/Java: `// Line comments` and `/* Block comments */`\n", + "- C/C++: `//` and `/* */`\n", + "- SQL: `-- Line comments`\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79dfe110-1523-40c7-ad90-2787ed22fd8d", + "metadata": {}, + "outputs": [], + "source": [ + "def user_prompt(code):\n", + " prompt = f\"\"\"\n", + " i want to document my code for better understanding. 
Please generate meaningful necessary comments\n", + " here is my code:\n", + " {code}\n", + "\n", + " Return ONLY commented code with no additional text\n", + " \"\"\"\n", + "\n", + " return prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c7bcf29e-ec78-4cfd-9b41-f2dc86400435", + "metadata": {}, + "outputs": [], + "source": [ + "def conversation_template(code):\n", + " messages = [\n", + " {\"role\":\"system\", \"content\":system_message},\n", + " {\"role\":\"user\",\"content\":user_prompt(code)}\n", + " ]\n", + " return messages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a36fec0f-7eba-4ccd-8fc4-cbf5ade76fa2", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_gemini(code):\n", + " message = user_prompt(code)\n", + " response = gemini.models.generate_content_stream(\n", + " model=gemini_model,\n", + " config= types.GenerateContentConfig(\n", + " system_instruction = system_message,\n", + " temperature = 0.8,\n", + " ),\n", + " contents = [message]\n", + " )\n", + "\n", + " result = \"\"\n", + " for chunk in response:\n", + " result += chunk.text or \"\"\n", + " yield result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5d1e0c0-dc88-43ee-8698-82ad9ce7c51b", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_claude(code):\n", + " messages = [{\"role\":\"user\",\"content\":user_prompt(code)}]\n", + " response = claude.messages.stream(\n", + " model= claude_model,\n", + " temperature=0.8,\n", + " messages = messages,\n", + " max_tokens=5000\n", + " )\n", + "\n", + " result = \"\"\n", + " with response as stream:\n", + " for text in stream.text_stream:\n", + " result += text or \"\"\n", + " yield result\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "903c97e5-9170-449e-8a0f-9f906351ec45", + "metadata": {}, + "outputs": [], + "source": [ + "def stream_opensource(code,model):\n", + " model = model.lower()\n", + " client = 
globals()[model]\n", + " model = globals()[f\"{model}_model\"]\n", + " stream = client.chat.completions.create(\n", + " model = model,\n", + " messages= conversation_template(code),\n", + " temperature = 0.7,\n", + " stream = True\n", + " )\n", + "\n", + " result = \"\"\n", + " for chunk in stream:\n", + " result += chunk.choices[0].delta.content or \"\"\n", + " yield result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff051c22-a2f8-4153-b970-f8a466a4cf5a", + "metadata": {}, + "outputs": [], + "source": [ + "def commentor(code, model):\n", + " model = model.lower()\n", + " if model == \"claude\":\n", + " result = stream_claude(code)\n", + " elif model == \"gemini\":\n", + " result = stream_gemini(code)\n", + " elif model == \"qwen\" or model == \"llama\":\n", + " result = stream_opensource(code, model)\n", + " else:\n", + " raise ValueError(f\"Unknown model: {model}\")\n", + "\n", + " for code in result:\n", + " yield code.replace(\"```cpp\\n\",\"\").replace(\"```python\\n\",\"\").replace(\"```javascript\\n\",\"\").replace(\"```typescript\\n\",\"\").replace(\"```\",\"\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10daf070-3546-4073-a2a0-3f5f8fc156f0", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks() as ui:\n", + " gr.Markdown(\"# Generate Comment\")\n", + " with gr.Row():\n", + " raw_code = gr.Textbox(label=\"Raw Code:\", lines=10)\n", + " commented_code = gr.Textbox(label=\"Commented code\", lines=10)\n", + " with gr.Row():\n", + " models = gr.Dropdown([\"Gemini\",\"Claude\",\"Llama\",\"Qwen\"], value=\"Gemini\")\n", + " with gr.Row():\n", + " generate_comment = gr.Button(\"Generate Comment\")\n", + "\n", + " generate_comment.click(commentor, inputs=[raw_code, models], outputs=[commented_code])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "afb87f32-f25e-40c5-844a-d2b7af748192", + "metadata": {}, + "outputs": [], + "source": [ + "ui.launch(inbrowser=True,debug=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "96bc48ad-10ad-4821-b58e-ea1b22cdcdc9", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week4/community-contributions/wk4-final-passwordgen.ipynb b/week4/community-contributions/wk4-final-passwordgen.ipynb new file mode 100644 index 0000000..98f7b26 --- /dev/null +++ b/week4/community-contributions/wk4-final-passwordgen.ipynb @@ -0,0 +1,337 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cc7674a9-6164-4424-85a9-f669454cfd2a", + "metadata": {}, + "source": [ + "I used this project to play about with Gradio blocks a little bit as it had more inputs than the other projects I've done.\n", + "It's a password generator which I have no doubt I will use!"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04c8d2dd-cb9a-4b18-b12d-48ed2f39679a", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import requests\n", + "import google.generativeai\n", + "import anthropic\n", + "from IPython.display import Markdown, display, update_display\n", + "import gradio as gr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04521351-f220-42fe-9dc5-d0be80c95dd7", + "metadata": {}, + "outputs": [], + "source": [ + "# keys\n", + "\n", + "load_dotenv(override=True)\n", + "openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n", + "\n", + "if openai_api_key:\n", + " print(\"All good\")\n", + "else:\n", + " print(\"OpenAI key issue\")\n", + "\n", + "claude_api_key = os.getenv(\"ANTHROPIC_API_KEY\")\n", + "\n", + "if claude_api_key:\n", + " print(\"All good\")\n", + "else:\n", + " print(\"Claude key issue\")\n", + "\n", + "google_api_key = os.getenv(\"GOOGLE_API_KEY\")\n", + "\n", + "if google_api_key:\n", + " print(\"All good\")\n", + "else:\n", + " print(\"Google key issue\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70fd3748-e6b6-4ac2-89a5-ef65ed7e41a3", + "metadata": {}, + "outputs": [], + "source": [ + "# initialise\n", + "\n", + "openai = OpenAI()\n", + "claude = anthropic.Anthropic()\n", + "google.generativeai.configure()\n", + "\n", + "OPENAI_MODEL = \"gpt-4o\"\n", + "CLAUDE_MODEL = \"claude-sonnet-4-20250514\"\n", + "GOOGLE_MODEL = \"gemini-2.0-flash\"\n", + "\n", + "max_tok = 500" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6a448651-e426-4c3c-96f7-d69975dc7b10", + "metadata": {}, + "outputs": [], + "source": [ + "#Prompts\n", + "\n", + "def pass_system_prompt(required_len, spec_char=\"Y\",num_char=\"Y\",min_lowercase=1,min_uppercase=1):\n", + "\n", + " system_prompt = f\"\"\"You are a secure password generator. 
Your task is to create a single, cryptographically strong password that meets ALL specified requirements.\n", + " \n", + "CRITICAL REQUIREMENTS:\n", + "- Length: EXACTLY {required_len} characters\n", + "- Must include: At least {min_lowercase} lowercase letter(s) AND at least {min_uppercase} uppercase letter(s)\n", + "- Special characters: {'REQUIRED - include at least 1 char' if spec_char else 'FORBIDDEN - do not include any'}\n", + "- Numbers: {'REQUIRED - include at least 1 digit' if num_char else 'FORBIDDEN - do not include any digits'}\n", + "\n", + "SECURITY RULES:\n", + "1. Generate truly random passwords - avoid patterns, dictionary words, or predictable sequences\n", + "2. Distribute character types evenly throughout the password\n", + "3. Do not use repeated characters excessively (max 2 of same character)\n", + "4. Ensure password meets minimum complexity for each required character type\n", + "\n", + "OUTPUT FORMAT:\n", + "- Respond with ONLY the generated password\n", + "- No explanations, no additional text, just the password\n", + "- Verify the password meets ALL requirements before responding\"\"\"\n", + "\n", + " return system_prompt\n", + "\n", + "def pass_user_prompt(required_len, spec_char=\"Y\",num_char=\"Y\",min_lowercase=1,min_uppercase=1):\n", + " \n", + " user_prompt = f\"\"\"Generate a secure password with these exact specifications:\n", + " \n", + "Length: {required_len} characters\n", + "Lowercase letters: Required (minimum {min_lowercase})\n", + "Uppercase letters: Required (minimum {min_uppercase})\n", + "Numbers: {'Required (minimum 1)' if num_char else 'Not allowed'}\n", + "Special characters: {'Required (minimum 1)' if spec_char else 'Not allowed'}\n", + "\n", + "Requirements verification checklist:\n", + "βœ“ Exactly {required_len} characters total\n", + "βœ“ Contains {min_lowercase}+ lowercase letters\n", + "βœ“ Contains {min_uppercase}+ uppercase letters\n", + "βœ“ {'Contains 1+ numbers' if num_char else 'Contains NO 
numbers'}\n",
+ "βœ“ {'Contains 1+ special characters' if spec_char else 'Contains NO special characters'}\n",
+ "βœ“ No obvious patterns or dictionary words\n",
+ "βœ“ Good distribution of character types\n",
+ "\n",
+ "Generate the password now.\"\"\"\n",
+ "\n",
+ "    return user_prompt\n",
+ "    \n",
+ "def pass_messages(required_len, spec_char,num_char,min_lowercase,min_uppercase):\n",
+ "    messages = [\n",
+ "        {\"role\":\"system\",\"content\":pass_system_prompt(required_len, spec_char,num_char,min_lowercase,min_uppercase)},\n",
+ "        {\"role\":\"user\",\"content\":pass_user_prompt(required_len, spec_char,num_char,min_lowercase,min_uppercase)}\n",
+ "    ]\n",
+ "\n",
+ "    return messages\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "857370b0-35a5-4b50-8715-86f8e781523b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#test\n",
+ "\n",
+ "messages1 = pass_messages(12, False, True, 1, 1)  # pass booleans, not \"N\"/\"Y\": any non-empty string is truthy, so \"N\" would wrongly enable the option\n",
+ "print(messages1)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "59ab4279-90a8-4997-8e15-f07295856222",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def openai_password_gen(required_len, spec_char, num_char,min_lowercase,min_uppercase):\n",
+ "    response=openai.chat.completions.create(\n",
+ "        model=OPENAI_MODEL,\n",
+ "        max_tokens=max_tok,\n",
+ "        messages=pass_messages(required_len, spec_char,num_char,min_lowercase,min_uppercase)\n",
+ "    )\n",
+ "    return response.choices[0].message.content\n",
+ "    "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f5e1a41a-b03c-4408-a0f5-00529785f3d1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def claude_password_gen(required_len, spec_char, num_char,min_lowercase,min_uppercase):\n",
+ "    response = claude.messages.create(\n",
+ "        model=CLAUDE_MODEL,\n",
+ "        max_tokens=max_tok,\n",
+ "        system=pass_system_prompt(required_len, spec_char, num_char,min_lowercase,min_uppercase),\n",
+ "        messages = 
[{\"role\":\"user\",\"content\":pass_user_prompt(required_len, spec_char, num_char,min_lowercase,min_uppercase)}]\n",
+ "    )\n",
+ "    return response.content[0].text\n",
+ "    "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6a41a0a2-55a1-47e5-8fc0-5dd04ebd3573",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def google_password_gen(required_len, spec_char, num_char,min_lowercase,min_uppercase):\n",
+ "    message = google.generativeai.GenerativeModel(\n",
+ "        model_name=GOOGLE_MODEL,\n",
+ "        system_instruction=pass_system_prompt(required_len, spec_char, num_char,min_lowercase,min_uppercase)\n",
+ "    )\n",
+ "    response = message.generate_content(pass_user_prompt(required_len, spec_char, num_char,min_lowercase,min_uppercase))\n",
+ "    return response.text"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dcd1ce50-6576-4594-8739-1d7daf602213",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#test\n",
+ "# Use booleans rather than \"N\"/\"Y\" strings: a non-empty string such as \"N\" is truthy\n",
+ "messages1 = openai_password_gen(12, False, True, 1, 1)\n",
+ "messages2 = claude_password_gen(12, False, True, 1, 1)\n",
+ "messages3 = google_password_gen(12, False, True, 1, 1)\n",
+ "print(\"OpenAI: \",messages1)\n",
+ "print(\"Claude: \", messages2)\n",
+ "print(\"Gemini: \", messages3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9cec429a-2355-4941-8422-480b2614009c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# model select\n",
+ "\n",
+ "def select_model(required_len, spec_char, num_char,min_lowercase,min_uppercase,model):\n",
+ "    if model == \"OpenAI\":\n",
+ "        return openai_password_gen(required_len, spec_char, num_char,min_lowercase,min_uppercase)\n",
+ "    elif model == \"Claude\":\n",
+ "        return claude_password_gen(required_len, spec_char, num_char,min_lowercase,min_uppercase)\n",
+ "    elif model == \"Gemini\":\n",
+ "        return google_password_gen(required_len, spec_char, num_char,min_lowercase,min_uppercase)\n",
+ "    else:\n",
+ "        print(\"No model selected\")\n",
+ "        return None"
+ ]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bef52e6d-dc50-4c91-9d56-624dfdd66276",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "test = select_model(12, False, True, 1, 1, \"OpenAI\")  # booleans, not \"N\"/\"Y\": a non-empty string is truthy\n",
+ "\n",
+ "print(test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b9d3685-a1b8-470c-8f4b-e63d68a0240d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "css = \"\"\"\n",
+ "#password_box textarea {\n",
+ "    background-color: #306998;\n",
+ "    color: white;\n",
+ "}\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "81c423ec-0ca7-4c96-a2fe-02ed2b5f3839",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "with gr.Blocks(css=css) as demo:\n",
+ "    gr.Markdown(\"Choose your password complexity requirements and run:\")\n",
+ "    with gr.Row():\n",
+ "        with gr.Column(min_width=150,scale=2):\n",
+ "            with gr.Row():\n",
+ "                required_len = gr.Number(label=\"Specify the required length\",value=12,minimum=1,maximum=30)\n",
+ "                min_lowercase = gr.Number(label=\"Minimum lowercase letters\", value=1,minimum=0)\n",
+ "                min_uppercase = gr.Number(label=\"Minimum uppercase letters\", value=1,minimum=0)\n",
+ "        with gr.Column():\n",
+ "            spec_char = gr.Checkbox(label=\"Include special characters?\",value=True)\n",
+ "            num_char = gr.Checkbox(label=\"Include numbers?\", value=True)\n",
+ "    with gr.Row():\n",
+ "        with gr.Column():\n",
+ "            model = gr.Dropdown([\"OpenAI\",\"Claude\",\"Gemini\"], label=\"Select model\", value=\"OpenAI\")\n",
+ "            btn = gr.Button(\"Run\")\n",
+ "        with gr.Column():\n",
+ "            output = gr.Textbox(label=\"Password:\", elem_id=\"password_box\")\n",
+ "    \n",
+ "    btn.click(fn=select_model,inputs=[required_len,spec_char,num_char,min_lowercase,min_uppercase,model],outputs=output)\n",
+ "\n",
+ "demo.launch()\n",
+ "    "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d81a8318-57ef-46ae-91b7-ae63d661edd8",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ 
"display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/week4/community-contributions/wk4-unittest-generator.ipynb b/week4/community-contributions/wk4-unittest-generator.ipynb
new file mode 100644
index 0000000..49dbb34
--- /dev/null
+++ b/week4/community-contributions/wk4-unittest-generator.ipynb
@@ -0,0 +1,420 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "65b3aadc-c540-4cb2-a338-d523d3f22e5b",
+ "metadata": {},
+ "source": [
+ "Unit test generator using GPT, Claude and Gemini.\n",
+ "This will create unit test code from Python, and will also run the code and provide the result (including any errors).\n",
+ "Note:\n",
+ "When I tried to use claude-sonnet-4-20250514 the results were too big and the Python code was cut off (no matter how big I made the max tokens). This seemed to be the case for both examples. I've changed it to claude-3-5-sonnet-20240620 and it seems to run better."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e610bf56-a46e-4aff-8de1-ab49d62b1ad3", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import google.generativeai\n", + "import anthropic\n", + "from IPython.display import Markdown, display, update_display\n", + "import gradio as gr\n", + "import sys\n", + "import io\n", + "import traceback\n", + "import unittest\n", + "import subprocess\n", + "import tempfile" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f672e1c-87e9-4865-b760-370fa605e614", + "metadata": {}, + "outputs": [], + "source": [ + "# keys\n", + "\n", + "load_dotenv(override=True)\n", + "openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n", + "\n", + "if openai_api_key:\n", + " print(\"All good\")\n", + "else:\n", + " print(\"OpenAI key issue\")\n", + "\n", + "claude_api_key = os.getenv(\"ANTHROPIC_API_KEY\")\n", + "\n", + "if claude_api_key:\n", + " print(\"All good\")\n", + "else:\n", + " print(\"Claude key issue\")\n", + "\n", + "google_api_key = os.getenv(\"GOOGLE_API_KEY\")\n", + "\n", + "if google_api_key:\n", + " print(\"All good\")\n", + "else:\n", + " print(\"Google key issue\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8aa149ed-9298-4d69-8fe2-8f5de0f667da", + "metadata": {}, + "outputs": [], + "source": [ + "# initialise\n", + "\n", + "openai = OpenAI()\n", + "claude = anthropic.Anthropic()\n", + "google.generativeai.configure()\n", + "\n", + "OPENAI_MODEL = \"gpt-4o\"\n", + "CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\" #\"claude-sonnet-4-20250514\"\n", + "GOOGLE_MODEL = \"gemini-2.0-flash\"\n", + "\n", + "max_tok = 5000" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6896636f-923e-4a2c-9d6c-fac07828a201", + "metadata": {}, + "outputs": [], + "source": [ + "system_message = \"You are an engineer with 
responsibility for unit testing python code.\"\n",
+ "system_message += \" You review base python code and develop unit tests, also in python, which validate each unit of code.\"\n",
+ "system_message += \"\"\" The output must be in Python with both the unit tests and comments explaining the purpose of each test.\n",
+ "The output should not include any additional text at the start or end including \"```\". It should be possible to run the code without any updates including an execution statement.\n",
+ "Include the base / original python code in the response.\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e7b3546-57aa-4c29-bc5d-f211970d04eb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def user_prompt_for(python):\n",
+ "    user_prompt = \"Review the Python code provided and develop unit tests which can be run in a jupyter lab.\"\n",
+ "    user_prompt += \"\"\" The output must be in Python with both the unit tests and comments explaining the purpose of each test.\n",
+ "The output should not include any additional text at the start or end including \"```\". 
It should be possible to run the code without any updates (include an execution statement).\n",
+ "Include the base / original python code in the response.\"\"\"\n",
+ "    user_prompt += \"\\n\\n\" + python  # blank line so the instructions don't run straight into the code under test\n",
+ "    return user_prompt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c6190659-f54c-4951-bef4-4960f8e51cc4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def messages_for(python):\n",
+ "    return [\n",
+ "        {\"role\": \"system\", \"content\": system_message},\n",
+ "        {\"role\": \"user\", \"content\": user_prompt_for(python)}\n",
+ "    ]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0b327aa3-3277-44e1-972f-aa7158147ddd",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# python example\n",
+ "example = \"\"\"class BookNotAvailableError(Exception):\n",
+ "    pass\n",
+ "\n",
+ "class Library:\n",
+ "    def __init__(self):\n",
+ "        self.inventory = {}  # book title -> quantity\n",
+ "        self.borrowed = {}  # user -> list of borrowed book titles\n",
+ "\n",
+ "    def add_book(self, title, quantity=1):\n",
+ "        if quantity <= 0:\n",
+ "            raise ValueError(\"Quantity must be positive\")\n",
+ "        self.inventory[title] = self.inventory.get(title, 0) + quantity\n",
+ "\n",
+ "    def borrow_book(self, user, title):\n",
+ "        if self.inventory.get(title, 0) < 1:\n",
+ "            raise BookNotAvailableError(f\"'{title}' is not available\")\n",
+ "        self.inventory[title] -= 1\n",
+ "        self.borrowed.setdefault(user, []).append(title)\n",
+ "\n",
+ "    def return_book(self, user, title):\n",
+ "        if user not in self.borrowed or title not in self.borrowed[user]:\n",
+ "            raise ValueError(f\"User '{user}' did not borrow '{title}'\")\n",
+ "        self.borrowed[user].remove(title)\n",
+ "        self.inventory[title] = self.inventory.get(title, 0) + 1\n",
+ "\n",
+ "    def get_available_books(self):\n",
+ "        return {title: qty for title, qty in self.inventory.items() if qty > 0}\n",
+ "\n",
+ "    def get_borrowed_books(self, user):\n",
+ "        return self.borrowed.get(user, [])\"\"\""
+ 
] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed6e624e-88a5-4f10-8ab5-f071f0ca3041", + "metadata": {}, + "outputs": [], + "source": [ + "# python example2\n", + "example2 = \"\"\"class Calculator:\n", + " def add(self, a, b):\n", + " return a + b\n", + "\n", + " def subtract(self, a, b):\n", + " return a - b\n", + "\n", + " def divide(self, a, b):\n", + " if b == 0:\n", + " raise ValueError(\"Cannot divide by zero\")\n", + " return a / b\n", + "\n", + " def multiply(self, a, b):\n", + " return a * b\n", + "\n", + "\n", + "def is_prime(n):\n", + " if n <= 1:\n", + " return False\n", + " if n <= 3:\n", + " return True\n", + " if n % 2 == 0 or n % 3 == 0:\n", + " return False\n", + " i = 5\n", + " while i * i <= n:\n", + " if n % i == 0 or n % (i + 2) == 0:\n", + " return False\n", + " i += 6\n", + " return True\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7d2fea8-74c6-4421-8f1e-0e76d5b201b9", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_gpt(python): \n", + " stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " fragment = chunk.choices[0].delta.content or \"\"\n", + " reply += fragment\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7cd84ad8-d55c-4fe0-9eeb-1895c95c4a9d", + "metadata": {}, + "outputs": [], + "source": [ + "def unit_test_claude(python):\n", + " result = claude.messages.stream(\n", + " model=CLAUDE_MODEL,\n", + " max_tokens=max_tok,\n", + " system=system_message,\n", + " messages=[{\"role\": \"user\", \"content\": user_prompt_for(python)}],\n", + " )\n", + " reply = \"\"\n", + " with result as stream:\n", + " for text in stream.text_stream:\n", + " reply += text\n", + " yield reply" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad86f652-879a-489f-9891-bdc2d97c33b0", + "metadata": {}, + 
"outputs": [], + "source": [ + "def unit_test_google(python):\n", + " model = google.generativeai.GenerativeModel(\n", + " model_name=GOOGLE_MODEL,\n", + " system_instruction=system_message\n", + " )\n", + " stream = model.generate_content(contents=user_prompt_for(python),stream=True)\n", + " reply = \"\"\n", + " for chunk in stream:\n", + " reply += chunk.text or \"\"\n", + " yield reply.replace(\"```python\\n\", \"\").replace(\"```\", \"\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "105db6f9-343c-491d-8e44-3a5328b81719", + "metadata": {}, + "outputs": [], + "source": [ + "#unit_test_gpt(example)\n", + "#unit_test_claude(example)\n", + "#unit_test_google(example)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f1ae8f5-16c8-40a0-aa18-63b617df078d", + "metadata": {}, + "outputs": [], + "source": [ + "def select_model(python, model):\n", + " if model==\"GPT\":\n", + " result = unit_test_gpt(python)\n", + " elif model==\"Claude\":\n", + " result = unit_test_claude(python)\n", + " elif model==\"Google\":\n", + " result = unit_test_google(python)\n", + " else:\n", + " raise ValueError(\"Unknown model\")\n", + " for stream_so_far in result:\n", + " yield stream_so_far " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1ddb38e-6b0a-4c37-baa4-ace0b7de887a", + "metadata": {}, + "outputs": [], + "source": [ + "# with gr.Blocks() as ui:\n", + "# with gr.Row():\n", + "# python = gr.Textbox(label=\"Python code:\", lines=10, value=example)\n", + "# test = gr.Textbox(label=\"Unit tests\", lines=10)\n", + "# with gr.Row():\n", + "# model = gr.Dropdown([\"GPT\", \"Claude\",\"Google\"], label=\"Select model\", value=\"GPT\")\n", + "# generate = gr.Button(\"Generate unit tests\")\n", + "\n", + "# generate.click(select_model, inputs=[python, model], outputs=[test])\n", + "\n", + "# ui.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "389ae411-a4f6-44f2-8b26-d46a971687a7", + 
"metadata": {}, + "outputs": [], + "source": [ + "def execute_python(code):\n", + " # Capture stdout and stderr\n", + " output = io.StringIO()\n", + " sys_stdout = sys.stdout\n", + " sys_stderr = sys.stderr\n", + " sys.stdout = output\n", + " sys.stderr = output\n", + "\n", + " try:\n", + " # Compile the code first\n", + " compiled_code = compile(code, '', 'exec')\n", + "\n", + " # Prepare a namespace dict for exec environment\n", + " # Include __builtins__ so imports like 'import unittest' work\n", + " namespace = {\"__builtins__\": __builtins__}\n", + "\n", + " # Run the user's code, but expect tests will be defined here\n", + " exec(compiled_code, namespace)\n", + "\n", + " # Look for unittest.TestCase subclasses in the namespace\n", + " loader = unittest.TestLoader()\n", + " suite = unittest.TestSuite()\n", + "\n", + " for obj in namespace.values():\n", + " if isinstance(obj, type) and issubclass(obj, unittest.TestCase):\n", + " tests = loader.loadTestsFromTestCase(obj)\n", + " suite.addTests(tests)\n", + "\n", + " # Run the tests\n", + " runner = unittest.TextTestRunner(stream=output, verbosity=2)\n", + " result = runner.run(suite)\n", + "\n", + " except SystemExit as e:\n", + " # Catch sys.exit calls from unittest.main()\n", + " output.write(f\"\\nSystemExit called with code {e.code}\\n\")\n", + " except Exception as e:\n", + " # Catch other errors\n", + " output.write(f\"\\nException: {e}\\n\")\n", + " finally:\n", + " sys.stdout = sys_stdout\n", + " sys.stderr = sys_stderr\n", + "\n", + " return output.getvalue()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eca98de3-9e2f-4c23-8bb4-dbb2787a15a4", + "metadata": {}, + "outputs": [], + "source": [ + "with gr.Blocks() as ui:\n", + " with gr.Row():\n", + " python = gr.Textbox(label=\"Python code:\", lines=10, value=example2)\n", + " test = gr.Textbox(label=\"Unit tests\", lines=10)\n", + " test_run = gr.Textbox(label=\"Test results\", lines=10)\n", + " with gr.Row():\n", + " model = 
gr.Dropdown([\"GPT\", \"Claude\",\"Google\"], label=\"Select model\", value=\"GPT\")\n", + " generate = gr.Button(\"Generate unit tests\")\n", + " run = gr.Button(\"Run unit tests\")\n", + "\n", + " generate.click(select_model, inputs=[python, model], outputs=[test])\n", + " run.click(execute_python, inputs=[test],outputs=[test_run])\n", + "\n", + "ui.launch()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week5/community-contributions/Week5_day5_Gemini_Semantic_Chunks.ipynb b/week5/community-contributions/Week5_day5_Gemini_Semantic_Chunks.ipynb new file mode 100644 index 0000000..d4144c2 --- /dev/null +++ b/week5/community-contributions/Week5_day5_Gemini_Semantic_Chunks.ipynb @@ -0,0 +1,463 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2080947c-96d9-447f-8368-cfdc9e5c9960", + "metadata": {}, + "source": [ + "# Using Semantic chunks with Gemini API and Gemini Embeddings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53221f1a-a0c1-4506-a3d0-d6626c58e4e0", + "metadata": {}, + "outputs": [], + "source": [ + "# Regular Imports\n", + "import os\n", + "import glob\n", + "import time\n", + "from dotenv import load_dotenv\n", + "from tqdm.notebook import tqdm\n", + "import gradio as gr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a2a7171-a7b6-42a6-96d7-c93f360689ec", + "metadata": {}, + "outputs": [], + "source": [ + "# Visual Import\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.manifold import TSNE\n", + "import numpy as np\n", + "import plotly.graph_objects as go" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "id": "51c9d658-65e5-40a1-8680-d0b561f87649", + "metadata": {}, + "outputs": [], + "source": [ + "# Lang Chain Imports\n", + "\n", + "from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI\n", + "from langchain_community.document_loaders import DirectoryLoader, TextLoader\n", + "from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate\n", + "from langchain_core.messages import HumanMessage, AIMessage\n", + "from langchain_chroma import Chroma\n", + "from langchain_experimental.text_splitter import SemanticChunker\n", + "from langchain_core.chat_history import InMemoryChatMessageHistory\n", + "from langchain_core.runnables.history import RunnableWithMessageHistory\n", + "from langchain.chains.combine_documents import create_stuff_documents_chain\n", + "from langchain.chains.history_aware_retriever import create_history_aware_retriever\n", + "from langchain.chains import create_retrieval_chain\n", + "from langchain_core.prompts import MessagesPlaceholder\n", + "from langchain_core.runnables import RunnableLambda" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e7ed82b-b28a-4094-9f77-3b6432dd0f7a", + "metadata": {}, + "outputs": [], + "source": [ + "# Constants\n", + "\n", + "CHAT_MODEL = \"gemini-2.5-flash\"\n", + "EMBEDDING_MODEL = \"models/text-embedding-004\"\n", + "# EMBEDDING_MODEL_EXP = \"models/gemini-embedding-exp-03-07\"\n", + "\n", + "folders = glob.glob(\"knowledge-base/*\")\n", + "text_loader_kwargs = {'encoding': 'utf-8'}\n", + "db_name = \"vector_db\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b83281a2-bcae-41ab-a347-0e7f9688d1ed", + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(override=True)\n", + "\n", + "api_key = os.getenv(\"GOOGLE_API_KEY\")\n", + "\n", + "if not api_key:\n", + " print(\"API Key not found!\")\n", + "else:\n", + " print(\"API Key loaded in 
memory\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fd6d516-772b-478d-9b28-09d42f2277d7", + "metadata": {}, + "outputs": [], + "source": [ + "def add_metadata(doc, doc_type):\n", + " doc.metadata[\"doc_type\"] = doc_type\n", + " return doc" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bc4198b-f989-42c0-95b5-3596448fcaa2", + "metadata": {}, + "outputs": [], + "source": [ + "documents = []\n", + "for folder in tqdm(folders, desc=\"Loading folders\"):\n", + " doc_type = os.path.basename(folder)\n", + " loader = DirectoryLoader(folder, glob=\"**/*.md\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)\n", + " folder_docs = loader.load()\n", + " documents.extend([add_metadata(doc, doc_type) for doc in folder_docs])\n", + "\n", + "print(f\"Total documents loaded: {len(documents)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bb74241f-e9d5-42e8-9a4b-f31018397d66", + "metadata": {}, + "source": [ + "## Create Semantic Chunks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a3aa17f-f5d0-430a-80da-95c284bd99a8", + "metadata": {}, + "outputs": [], + "source": [ + "chunking_embedding_model = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL, task_type=\"retrieval_document\")\n", + "\n", + "text_splitter = SemanticChunker(\n", + " chunking_embedding_model,\n", + " breakpoint_threshold_type=\"percentile\", \n", + " breakpoint_threshold_amount=95.0, \n", + " min_chunk_size=3 \n", + ")\n", + "\n", + "start = time.time()\n", + "\n", + "semantic_chunks = []\n", + "pbar = tqdm(documents, desc=\"Semantic chunking documents\")\n", + "\n", + "for i, doc in enumerate(pbar):\n", + " doc_type = doc.metadata.get('doc_type', 'Unknown')\n", + " pbar.set_postfix_str(f\"Processing: {doc_type}\")\n", + " try:\n", + " doc_chunks = text_splitter.split_documents([doc])\n", + " semantic_chunks.extend(doc_chunks)\n", + " except Exception as e:\n", + " tqdm.write(f\"❌ Failed to split doc 
({doc.metadata.get('source', 'unknown source')}): {e}\")\n",
+ "print(f\"⏱️ Took {time.time() - start:.2f} seconds\")\n",
+ "print(f\"Total semantic chunks: {len(semantic_chunks)}\")\n",
+ "\n",
+ "# import time\n",
+ "# start = time.time()\n",
+ "\n",
+ "# try:\n",
+ "#     semantic_chunks = text_splitter.split_documents(documents)\n",
+ "#     print(f\"βœ… Chunking completed with {len(semantic_chunks)} chunks\")\n",
+ "# except Exception as e:\n",
+ "#     print(f\"❌ Failed to split documents: {e}\")\n",
+ "\n",
+ "# print(f\"⏱️ Took {time.time() - start:.2f} seconds\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "675b98d6-5ed0-45d1-8f79-765911e6badf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Some Preview of the chunks\n",
+ "for i, doc in enumerate(semantic_chunks[:15]):\n",
+ "    print(f\"--- Chunk {i+1} ---\")\n",
+ "    print(doc.page_content) \n",
+ "    print(\"\\n\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c17accff-539a-490b-8a5f-b5ce632a3c71",
+ "metadata": {},
+ "source": [
+ "## Embed with Gemini Embeddings"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0bd228bd-37d2-4aaf-b0f6-d94943f6f248",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "embedding = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL,task_type=\"retrieval_document\")\n",
+ "\n",
+ "if os.path.exists(db_name):\n",
+ "    Chroma(persist_directory=db_name, embedding_function=embedding).delete_collection()\n",
+ "\n",
+ "vectorstore = Chroma.from_documents(\n",
+ "    documents=semantic_chunks,\n",
+ "    embedding=embedding,\n",
+ "    persist_directory=db_name\n",
+ ")\n",
+ "\n",
+ "print(f\"βœ… Vectorstore created with {vectorstore._collection.count()} documents\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ce0a3e23-5912-4de2-bf34-3c0936375de1",
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true
+ },
+ "source": [
+ "## Visualizing Vectors"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": 
"6ffdc6f5-ec25-4229-94d4-1fc6bb4d2702",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "collection = vectorstore._collection\n",
+ "result = collection.get(include=['embeddings', 'documents', 'metadatas'])\n",
+ "vectors = np.array(result['embeddings'])\n",
+ "documents = result['documents']\n",
+ "metadatas = result['metadatas']\n",
+ "doc_types = [metadata['doc_type'] for metadata in metadatas]\n",
+ "colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5428164b-f0d5-4d2b-ac4a-514c43ceaa79",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# We humans find it easier to visualize things in 2D!\n",
+ "# Reduce the dimensionality of the vectors to 2D using t-SNE\n",
+ "# (t-distributed stochastic neighbor embedding)\n",
+ "\n",
+ "tsne = TSNE(n_components=2, random_state=42)\n",
+ "reduced_vectors = tsne.fit_transform(vectors)\n",
+ "\n",
+ "# Create the 2D scatter plot\n",
+ "fig = go.Figure(data=[go.Scatter(\n",
+ "    x=reduced_vectors[:, 0],\n",
+ "    y=reduced_vectors[:, 1],\n",
+ "    mode='markers',\n",
+ "    marker=dict(size=5, color=colors, opacity=0.8),\n",
+ "    text=[f\"Type: {t}
Text: {d[:100]}...\" for t, d in zip(doc_types, documents)],\n", + " hoverinfo='text'\n", + ")])\n", + "\n", + "fig.update_layout(\n", + " title='2D Chroma Vector Store Visualization',\n", + " scene=dict(xaxis_title='x',yaxis_title='y'),\n", + " width=800,\n", + " height=600,\n", + " margin=dict(r=20, b=10, l=10, t=40)\n", + ")\n", + "\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "id": "359b8651-a382-4050-8bf8-123e5cdf4d53", + "metadata": {}, + "source": [ + "## RAG Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08a75313-6c68-42e5-bd37-78254123094c", + "metadata": {}, + "outputs": [], + "source": [ + "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 20 })\n", + "\n", + "# Conversation Memory\n", + "# memory = ConversationBufferMemory(memory_key=\"chat_history\", return_messages=True)\n", + "\n", + "chat_llm = ChatGoogleGenerativeAI(model=CHAT_MODEL, temperature=0.7)\n", + "\n", + "question_generator_template = \"\"\"Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.\n", + "If the follow up question is already a standalone question, return it as is.\n", + "\n", + "Chat History:\n", + "{chat_history}\n", + "Follow Up Input: {input} \n", + "Standalone question:\"\"\"\n", + "\n", + "question_generator_prompt = ChatPromptTemplate.from_messages([\n", + " MessagesPlaceholder(variable_name=\"chat_history\"),\n", + " HumanMessagePromptTemplate.from_template(\"{input}\")\n", + "])\n", + "\n", + "history_aware_retriever = create_history_aware_retriever(\n", + " chat_llm, retriever, question_generator_prompt\n", + ")\n", + "\n", + "qa_system_prompt = \"\"\"You are Insurellm’s intelligent virtual assistant, designed to answer questions with accuracy and clarity. 
Respond naturally and helpfully, as if you're part of the team.\n",
+ "Use the retrieved documents and prior conversation to provide accurate, conversational, and concise answers. Rephrase source facts in a natural tone, not word-for-word.\n",
+ "When referencing people or company history, prioritize clarity and correctness.\n",
+ "Only infer from previous conversation if it provides clear and factual clues. Do not guess or assume missing information.\n",
+ "If you truly don't have the answer, respond with:\n",
+ "\"I don't have that information.\"\n",
+ "Avoid repeating the user's wording unnecessarily. Do not refer to 'the context', speculate, or make up facts.\n",
+ "\n",
+ "{context}\"\"\"\n",
+ "\n",
+ "\n",
+ "qa_human_prompt = \"{input}\" \n",
+ "\n",
+ "qa_prompt = ChatPromptTemplate.from_messages([\n",
+ "    SystemMessagePromptTemplate.from_template(qa_system_prompt),\n",
+ "    MessagesPlaceholder(variable_name=\"chat_history\"),\n",
+ "    HumanMessagePromptTemplate.from_template(\"{input}\")\n",
+ "])\n",
+ "\n",
+ "combine_docs_chain = create_stuff_documents_chain(chat_llm, qa_prompt)\n",
+ "\n",
+ "# inspect_context = RunnableLambda(lambda inputs: (\n",
+ "#     print(\"\\n Retrieved Context:\\n\", \"\\n---\\n\".join([doc.page_content for doc in inputs[\"context\"]])),\n",
+ "#     inputs  # pass it through unchanged\n",
+ "# )[1])\n",
+ "\n",
+ "# inspect_inputs = RunnableLambda(lambda inputs: (\n",
+ "#     print(\"\\n Inputs received by the chain:\\n\", inputs),\n",
+ "#     inputs\n",
+ "# )[1])\n",
+ "\n",
+ "base_chain = create_retrieval_chain(history_aware_retriever, combine_docs_chain)\n",
+ "\n",
+ "# Using Runnable Lambda as Gradio needs the response to contain only the output (answer) and base_chain would have a dict with input, context, chat_history, answer\n",
+ "\n",
+ "# base_chain_with_output = base_chain | inspect_context | RunnableLambda(lambda res: res[\"answer\"])\n",
+ "# base_chain_with_output = base_chain | RunnableLambda(lambda res: res[\"answer\"])\n",
+ "\n",
+ "\n",
+ "# Session-Persistent Chat History\n",
+ "# To persist history between sessions, store it in MongoDB (or any non-SQL DB) and use MongoDBChatMessageHistory (the relevant DB wrapper)\n",
+ "\n",
+ "chat_histories = {}\n",
+ "\n",
+ "def get_history(session_id):\n",
+ "    if session_id not in chat_histories:\n",
+ "        chat_histories[session_id] = InMemoryChatMessageHistory()\n",
+ "    return chat_histories[session_id]\n",
+ "\n",
+ "# Currently set to streaming... if a one-shot response is needed, comment out base_chain and output_messages_key and enable base_chain_with_output\n",
+ "conversation_chain = RunnableWithMessageHistory(\n",
+ "    # base_chain_with_output,\n",
+ "    base_chain,\n",
+ "    get_history,\n",
+ "    output_messages_key=\"answer\",\n",
+ "    input_messages_key=\"input\",\n",
+ "    history_messages_key=\"chat_history\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "06b58566-70cb-42eb-8b1c-9fe353fe71f0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def chat(question, history):\n",
+ "    try:\n",
+ "        # result = conversation_chain.invoke({\"input\": question, \"chat_history\": memory.buffer_as_messages})\n",
+ "        \n",
+ "        # memory.chat_memory.add_user_message(question)\n",
+ "        # memory.chat_memory.add_ai_message(result[\"answer\"])\n",
+ "\n",
+ "        # return result[\"answer\"]\n",
+ "\n",
+ "        \n",
+ "        session_id = \"default-session\"\n",
+ "\n",
+ "        # # Full (non-streaming) chat version\n",
+ "        # result = conversation_chain.invoke(\n",
+ "        #     {\"input\": question},\n",
+ "        #     config={\"configurable\": {\"session_id\": session_id}}\n",
+ "        # )\n",
+ "        # # print(result)\n",
+ "        # return result\n",
+ "\n",
+ "        # Streaming version\n",
+ "        response_buffer = \"\"\n",
+ "\n",
+ "        for chunk in conversation_chain.stream({\"input\": question}, config={\"configurable\": {\"session_id\": session_id}}):\n",
+ "            if \"answer\" in chunk:\n",
+ "                response_buffer += chunk[\"answer\"]\n",
+ "                yield response_buffer\n",
+ "    except Exception
as e:\n",
+ "        print(f\"An error occurred during chat: {e}\")\n",
+ "        yield \"I apologize, but I encountered an error and cannot answer that right now.\"  # yield, not return: chat() is a generator, so a return value would never reach Gradio"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a577ac66-3952-4821-83d2-8a50bad89971",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "view = gr.ChatInterface(chat, type=\"messages\").launch(inbrowser=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56b63a17-2522-46e5-b5a3-e2e80e52a723",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/week5/community-contributions/Wk5-final-multi-doc-type-KB.ipynb b/week5/community-contributions/Wk5-final-multi-doc-type-KB.ipynb
new file mode 100644
index 0000000..d7d44b7
--- /dev/null
+++ b/week5/community-contributions/Wk5-final-multi-doc-type-KB.ipynb
@@ -0,0 +1,552 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "61777022-631c-4db0-afeb-70d8d22bc07b",
+ "metadata": {},
+ "source": [
+ "Summary:\n",
+ "This is the project from week 5. The intention was to create a vector DB of my own files (from an external drive) which can be used in a RAG solution.\n",
+ "This includes a number of file types (docx, pdf, txt, epub...)
and includes the ability to exclude folders.\n",
+ "With the OpenAI embeddings API limit of 300k tokens per request, it was also necessary to create a batch embeddings process so that the work was split across multiple requests.\n",
+ "This was based on estimating the tokens at a rate of roughly four characters per token; however, it wasn't perfect, and one of the batches still exceeded the 300k limit when running.\n",
+ "I found that the responses from the LLM were terrible in the end! I tried playing about with chunk sizes and the minimum number of chunks used by LangChain and it did improve, but was not fantastic. I also ensured the metadata was sent with each chunk to help.\n",
+ "This really highlighted the real-world challenges of implementing RAG!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d78ef79d-e564-4c56-82f3-0485e4bf6986",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install docx2txt\n",
+ "!pip install ebooklib\n",
+ "!pip install python-pptx\n",
+ "!pip install pypdf"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9ec98119-456f-450c-a9a2-f375d74f5ce5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# imports\n",
+ "\n",
+ "import os\n",
+ "import requests\n",
+ "from dotenv import load_dotenv\n",
+ "import glob\n",
+ "import gradio as gr\n",
+ "import time\n",
+ "from typing import List"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ac14410b-8c3c-4cf5-900e-fd4c33cdf2b2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# imports for langchain, plotly and Chroma\n",
+ "\n",
+ "from langchain.document_loaders import (\n",
+ "    DirectoryLoader,\n",
+ "    Docx2txtLoader,\n",
+ "    TextLoader,\n",
+ "    PyPDFLoader,\n",
+ "    UnstructuredExcelLoader,\n",
+ "    BSHTMLLoader\n",
+ ")\n",
+ "from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter\n",
+ "from langchain.schema import Document\n",
+ "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
+ "from langchain_chroma import
Chroma\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.manifold import TSNE\n", + "import numpy as np\n", + "import plotly.graph_objects as go\n", + "from langchain.memory import ConversationBufferMemory\n", + "from langchain.chains import ConversationalRetrievalChain\n", + "from langchain.embeddings import HuggingFaceEmbeddings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3be698e7-71e1-4c75-9696-e1651e4bf357", + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = \"gpt-4o-mini\"\n", + "db_name = \"vector_db\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f850068-c05b-4526-9494-034b0077347e", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "\n", + "load_dotenv(override=True)\n", + "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c5baad2-2033-40a6-8ebd-5861b5cf4350", + "metadata": {}, + "outputs": [], + "source": [ + "# handling epubs\n", + "\n", + "from ebooklib import epub\n", + "from bs4 import BeautifulSoup\n", + "from langchain.document_loaders.base import BaseLoader\n", + "\n", + "class EpubLoader(BaseLoader):\n", + " def __init__(self, file_path: str):\n", + " self.file_path = file_path\n", + "\n", + " def load(self) -> list[Document]:\n", + " book = epub.read_epub(self.file_path)\n", + " text = ''\n", + " for item in book.get_items():\n", + " if item.get_type() == epub.EpubHtml:\n", + " soup = BeautifulSoup(item.get_content(), 'html.parser')\n", + " text += soup.get_text() + '\\n'\n", + "\n", + " return [Document(page_content=text, metadata={\"source\": self.file_path})]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd8b0e4e-d698-4484-bc94-d8b753f386cc", + "metadata": {}, + "outputs": [], + "source": [ + "# handling pptx\n", + "\n", + "from pptx import Presentation\n", + "\n", + "class 
PptxLoader(BaseLoader):\n", + " def __init__(self, file_path: str):\n", + " self.file_path = file_path\n", + "\n", + " def load(self) -> list[Document]:\n", + " prs = Presentation(self.file_path)\n", + " text = ''\n", + " for slide in prs.slides:\n", + " for shape in slide.shapes:\n", + " if hasattr(shape, \"text\") and shape.text:\n", + " text += shape.text + '\\n'\n", + "\n", + " return [Document(page_content=text, metadata={\"source\": self.file_path})]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b222b01d-6040-4ff3-a0e3-290819cfe94b", + "metadata": {}, + "outputs": [], + "source": [ + "# Class based version of document loader which can be expanded more easily for other document types. (Currently includes file types: docx, txt (windows encoding), xlsx, pdfs, epubs, pptx)\n", + "\n", + "class DocumentLoader:\n", + " \"\"\"A clean, extensible document loader for multiple file types.\"\"\"\n", + " \n", + " def __init__(self, base_path=\"D:/*\", exclude_folders=None):\n", + " self.base_path = base_path\n", + " self.documents = []\n", + " self.exclude_folders = exclude_folders or []\n", + " \n", + " # Configuration for different file types\n", + " self.loader_config = {\n", + " 'docx': {\n", + " 'loader_cls': Docx2txtLoader,\n", + " 'glob_pattern': \"**/*.docx\",\n", + " 'loader_kwargs': {},\n", + " 'post_process': None\n", + " },\n", + " 'txt': {\n", + " 'loader_cls': TextLoader,\n", + " 'glob_pattern': \"**/*.txt\",\n", + " 'loader_kwargs': {\"encoding\": \"cp1252\"},\n", + " 'post_process': None\n", + " },\n", + " 'pdf': {\n", + " 'loader_cls': PyPDFLoader,\n", + " 'glob_pattern': \"**/*.pdf\",\n", + " 'loader_kwargs': {},\n", + " 'post_process': None\n", + " },\n", + " 'xlsx': {\n", + " 'loader_cls': UnstructuredExcelLoader,\n", + " 'glob_pattern': \"**/*.xlsx\",\n", + " 'loader_kwargs': {},\n", + " 'post_process': None\n", + " },\n", + " 'html': {\n", + " 'loader_cls': BSHTMLLoader,\n", + " 'glob_pattern': \"**/*.html\",\n", + " 
'loader_kwargs': {},\n", + " 'post_process': None\n", + " },\n", + " 'epub': {\n", + " 'loader_cls': EpubLoader,\n", + " 'glob_pattern': \"**/*.epub\",\n", + " 'loader_kwargs': {},\n", + " 'post_process': self._process_epub_metadata\n", + " },\n", + " 'pptx': {\n", + " 'loader_cls': PptxLoader,\n", + " 'glob_pattern': \"**/*.pptx\",\n", + " 'loader_kwargs': {},\n", + " 'post_process': None\n", + " }\n", + " }\n", + " \n", + " def _get_epub_metadata(self, file_path):\n", + " \"\"\"Extract metadata from EPUB files.\"\"\"\n", + " try:\n", + " book = epub.read_epub(file_path)\n", + " title = book.get_metadata('DC', 'title')[0][0] if book.get_metadata('DC', 'title') else None\n", + " author = book.get_metadata('DC', 'creator')[0][0] if book.get_metadata('DC', 'creator') else None\n", + " return title, author\n", + " except Exception as e:\n", + " print(f\"Error extracting EPUB metadata: {e}\")\n", + " return None, None\n", + " \n", + " def _process_epub_metadata(self, doc) -> None:\n", + " \"\"\"Post-process EPUB documents to add metadata.\"\"\"\n", + " title, author = self._get_epub_metadata(doc.metadata['source'])\n", + " doc.metadata[\"author\"] = author\n", + " doc.metadata[\"title\"] = title\n", + " \n", + " def _load_file_type(self, folder, file_type, config):\n", + " \"\"\"Load documents of a specific file type from a folder.\"\"\"\n", + " try:\n", + " loader = DirectoryLoader(\n", + " folder, \n", + " glob=config['glob_pattern'], \n", + " loader_cls=config['loader_cls'],\n", + " loader_kwargs=config['loader_kwargs']\n", + " )\n", + " docs = loader.load()\n", + " print(f\" Found {len(docs)} .{file_type} files\")\n", + " \n", + " # Apply post-processing if defined\n", + " if config['post_process']:\n", + " for doc in docs:\n", + " config['post_process'](doc)\n", + " \n", + " return docs\n", + " \n", + " except Exception as e:\n", + " print(f\" Error loading .{file_type} files: {e}\")\n", + " return []\n", + " \n", + " def load_all(self):\n", + " \"\"\"Load all 
documents from configured folders.\"\"\"\n",
+ "        all_folders = [f for f in glob.glob(self.base_path) if os.path.isdir(f)]\n",
+ "\n",
+ "        # filter out excluded folders\n",
+ "        folders = []\n",
+ "        for folder in all_folders:\n",
+ "            folder_name = os.path.basename(folder)\n",
+ "            if folder_name not in self.exclude_folders:\n",
+ "                folders.append(folder)\n",
+ "            else:\n",
+ "                print(f\"Excluded folder: {folder_name}\")\n",
+ "        \n",
+ "        print(\"Scanning folders (directories only):\", folders)\n",
+ "        \n",
+ "        self.documents = []\n",
+ "        \n",
+ "        for folder in folders:\n",
+ "            doc_type = os.path.basename(folder)\n",
+ "            print(f\"\\nProcessing folder: {doc_type}\")\n",
+ "            \n",
+ "            for file_type, config in self.loader_config.items():\n",
+ "                docs = self._load_file_type(folder, file_type, config)\n",
+ "                \n",
+ "                # Add doc_type metadata to all documents\n",
+ "                for doc in docs:\n",
+ "                    doc.metadata[\"doc_type\"] = doc_type\n",
+ "                    self.documents.append(doc)\n",
+ "        \n",
+ "        print(f\"\\nTotal documents loaded: {len(self.documents)}\")\n",
+ "        return self.documents\n",
+ "    \n",
+ "    def add_file_type(self, extension, loader_cls, glob_pattern=None, \n",
+ "                      loader_kwargs=None, post_process=None):\n",
+ "        \"\"\"Add support for a new file type.\"\"\"\n",
+ "        self.loader_config[extension] = {\n",
+ "            'loader_cls': loader_cls,\n",
+ "            'glob_pattern': glob_pattern or f\"**/*.{extension}\",\n",
+ "            'loader_kwargs': loader_kwargs or {},\n",
+ "            'post_process': post_process\n",
+ "        }\n",
+ "\n",
+ "# load\n",
+ "loader = DocumentLoader(\"D:/*\", exclude_folders=[\"Music\", \"Online Courses\", \"Fitness\"])\n",
+ "documents = loader.load_all()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3fd43a4f-b623-4b08-89eb-27d3b3ba0f62",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create batches (this was required as the number of tokens exceeded the OpenAI request limit)\n",
+ "\n",
+ "def estimate_tokens(text, chars_per_token=4):\n",
+ "    \"\"\"Rough estimate of
tokens from character count.\"\"\"\n", + " return len(text) // chars_per_token\n", + "\n", + "def create_batches(chunks, max_tokens_per_batch=250000):\n", + " batches = []\n", + " current_batch = []\n", + " current_tokens = 0\n", + " \n", + " for chunk in chunks:\n", + " chunk_tokens = estimate_tokens(chunk.page_content)\n", + " \n", + " # If adding this chunk would exceed the limit, start a new batch\n", + " if current_tokens + chunk_tokens > max_tokens_per_batch and current_batch:\n", + " batches.append(current_batch)\n", + " current_batch = [chunk]\n", + " current_tokens = chunk_tokens\n", + " else:\n", + " current_batch.append(chunk)\n", + " current_tokens += chunk_tokens\n", + " \n", + " # Add the last batch if it has content\n", + " if current_batch:\n", + " batches.append(current_batch)\n", + " \n", + " return batches\n", + "\n", + "def create_vectorstore_with_progress(chunks, embeddings, db_name, batch_size_tokens=250000):\n", + " \n", + " # Delete existing database if it exists\n", + " if os.path.exists(db_name):\n", + " print(f\"Deleting existing database: {db_name}\")\n", + " Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n", + " \n", + " # Create batches\n", + " batches = create_batches(chunks, batch_size_tokens)\n", + " print(f\"Created {len(batches)} batches from {len(chunks)} chunks\")\n", + " \n", + " # Show batch sizes\n", + " for i, batch in enumerate(batches):\n", + " total_chars = sum(len(chunk.page_content) for chunk in batch)\n", + " estimated_tokens = estimate_tokens(''.join(chunk.page_content for chunk in batch))\n", + " print(f\" Batch {i+1}: {len(batch)} chunks, ~{estimated_tokens:,} tokens\")\n", + " \n", + " vectorstore = None\n", + " successful_batches = 0\n", + " failed_batches = 0\n", + " \n", + " for i, batch in enumerate(batches):\n", + " print(f\"\\n{'='*50}\")\n", + " print(f\"Processing batch {i+1}/{len(batches)}\")\n", + " print(f\"{'='*50}\")\n", + " \n", + " try:\n", + " start_time = 
time.time()\n", + " \n", + " if vectorstore is None:\n", + " # Create the initial vectorstore\n", + " vectorstore = Chroma.from_documents(\n", + " documents=batch,\n", + " embedding=embeddings,\n", + " persist_directory=db_name\n", + " )\n", + " print(f\"Created initial vectorstore with {len(batch)} documents\")\n", + " else:\n", + " # Add to existing vectorstore\n", + " vectorstore.add_documents(batch)\n", + " print(f\"Added {len(batch)} documents to vectorstore\")\n", + " \n", + " successful_batches += 1\n", + " elapsed = time.time() - start_time\n", + " print(f\"Processed in {elapsed:.1f} seconds\")\n", + " print(f\"Total documents in vectorstore: {vectorstore._collection.count()}\")\n", + " \n", + " # Rate limiting delay\n", + " time.sleep(2)\n", + " \n", + " except Exception as e:\n", + " failed_batches += 1\n", + " print(f\"Error processing batch {i+1}: {e}\")\n", + " print(f\"Continuing with next batch...\")\n", + " continue\n", + " \n", + " print(f\"\\n{'='*50}\")\n", + " print(f\"SUMMARY\")\n", + " print(f\"{'='*50}\")\n", + " print(f\"Successful batches: {successful_batches}/{len(batches)}\")\n", + " print(f\"Failed batches: {failed_batches}/{len(batches)}\")\n", + " \n", + " if vectorstore:\n", + " final_count = vectorstore._collection.count()\n", + " print(f\"Final vectorstore contains: {final_count} documents\")\n", + " return vectorstore\n", + " else:\n", + " print(\"Failed to create vectorstore\")\n", + " return None\n", + "\n", + "# include metadata\n", + "def add_metadata_to_content(doc: Document) -> Document:\n", + " metadata_lines = []\n", + " if \"doc_type\" in doc.metadata:\n", + " metadata_lines.append(f\"Document Type: {doc.metadata['doc_type']}\")\n", + " if \"title\" in doc.metadata:\n", + " metadata_lines.append(f\"Title: {doc.metadata['title']}\")\n", + " if \"author\" in doc.metadata:\n", + " metadata_lines.append(f\"Author: {doc.metadata['author']}\")\n", + " metadata_text = \"\\n\".join(metadata_lines)\n", + "\n", + " new_content = 
f\"{metadata_text}\\n\\n{doc.page_content}\"\n",
+ "    return Document(page_content=new_content, metadata=doc.metadata)\n",
+ "\n",
+ "# Apply to all documents before chunking\n",
+ "documents_with_metadata = [add_metadata_to_content(doc) for doc in documents]\n",
+ "\n",
+ "# Chunking\n",
+ "text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=200)\n",
+ "chunks = text_splitter.split_documents(documents_with_metadata)\n",
+ "\n",
+ "# Embedding\n",
+ "embeddings = OpenAIEmbeddings()\n",
+ "\n",
+ "# Store in vector DB\n",
+ "print(\"Creating vectorstore in batches...\")\n",
+ "vectorstore = create_vectorstore_with_progress(\n",
+ "    chunks=chunks,\n",
+ "    embeddings=embeddings, \n",
+ "    db_name=db_name,\n",
+ "    batch_size_tokens=250000\n",
+ ")\n",
+ "\n",
+ "if vectorstore:\n",
+ "    print(f\"Successfully created vectorstore with {vectorstore._collection.count()} documents\")\n",
+ "else:\n",
+ "    print(\"Failed to create vectorstore\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "46c29b11-2ae3-4f6b-901d-5de67a09fd49",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create a new Chat with OpenAI\n",
+ "llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
+ "\n",
+ "# set up the conversation memory for the chat\n",
+ "memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
+ "\n",
+ "# the retriever is an abstraction over the VectorStore that will be used during RAG\n",
+ "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 200})\n",
+ "\n",
+ "# putting it together: set up the conversation chain with the gpt-4o-mini LLM, the vector store and memory\n",
+ "conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "be163251-0dfa-4f50-ab05-43c6c0833405",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Wrapping that in a function\n",
+ "\n",
+ "def chat(question,
history):\n", + " result = conversation_chain.invoke({\"question\": question})\n", + " return result[\"answer\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6320402-8213-47ec-8b05-dda234052274", + "metadata": {}, + "outputs": [], + "source": [ + "# And in Gradio:\n", + "\n", + "view = gr.ChatInterface(chat, type=\"messages\").launch(inbrowser=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "717e010b-8d7e-4a43-8cb1-9688ffdd76b6", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's investigate what gets sent behind the scenes\n", + "\n", + "# from langchain_core.callbacks import StdOutCallbackHandler\n", + "\n", + "# llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n", + "\n", + "# memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n", + "\n", + "# retriever = vectorstore.as_retriever(search_kwargs={\"k\": 200})\n", + "\n", + "# conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, callbacks=[StdOutCallbackHandler()])\n", + "\n", + "# query = \"Can you name some authors?\"\n", + "# result = conversation_chain.invoke({\"question\": query})\n", + "# answer = result[\"answer\"]\n", + "# print(\"\\nAnswer:\", answer)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2333a77e-8d32-4cc2-8ae9-f8e7a979b3ae", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week5/community-contributions/day5_gmailRAG.ipynb b/week5/community-contributions/day5_gmailRAG.ipynb new 
file mode 100644 index 0000000..27a52aa --- /dev/null +++ b/week5/community-contributions/day5_gmailRAG.ipynb @@ -0,0 +1,472 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dfe37963-1af6-44fc-a841-8e462443f5e6", + "metadata": {}, + "source": [ + "## gmail RAG assistant" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba2779af-84ef-4227-9e9e-6eaf0df87e77", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "\n", + "import os\n", + "import glob\n", + "from dotenv import load_dotenv\n", + "import gradio as gr\n", + "# NEW IMPORTS FOR GMAIL\n", + "from google.auth.transport.requests import Request\n", + "from google.oauth2.credentials import Credentials\n", + "from google_auth_oauthlib.flow import InstalledAppFlow\n", + "from googleapiclient.discovery import build\n", + "from datetime import datetime\n", + "import base64\n", + "from email.mime.text import MIMEText\n", + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "802137aa-8a74-45e0-a487-d1974927d7ca", + "metadata": {}, + "outputs": [], + "source": [ + "# imports for langchain, plotly and Chroma\n", + "\n", + "from langchain.document_loaders import DirectoryLoader, TextLoader\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "from langchain.schema import Document\n", + "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n", + "from langchain_chroma import Chroma\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.manifold import TSNE\n", + "import numpy as np\n", + "import plotly.graph_objects as go\n", + "from langchain.memory import ConversationBufferMemory\n", + "from langchain.chains import ConversationalRetrievalChain\n", + "from langchain.embeddings import HuggingFaceEmbeddings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58c85082-e417-4708-9efe-81a5d55d1424", + "metadata": {}, + "outputs": [], + "source": [ + "# price is a factor for our company, so we're going 
to use a low cost model\n", + "\n", + "MODEL = \"gpt-4o-mini\"\n", + "db_name = \"vector_db\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee78efcb-60fe-449e-a944-40bab26261af", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables in a file called .env\n", + "\n", + "load_dotenv(override=True)\n", + "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n", + "# NEW: Gmail API credentials\n", + "SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']\n", + "CREDENTIALS_FILE = 'credentials.json' # Download from Google Cloud Console\n", + "TOKEN_FILE = 'token.json'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "730711a9-6ffe-4eee-8f48-d6cfb7314905", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Read in emails using LangChain's loaders\n", + "# IMPORTANT: set the email received date range hard-coded below\n", + "\n", + "def authenticate_gmail():\n", + " \"\"\"Authenticate and return Gmail service object\"\"\"\n", + " creds = None\n", + " if os.path.exists(TOKEN_FILE):\n", + " creds = Credentials.from_authorized_user_file(TOKEN_FILE, SCOPES)\n", + " \n", + " if not creds or not creds.valid:\n", + " if creds and creds.expired and creds.refresh_token:\n", + " creds.refresh(Request())\n", + " else:\n", + " flow = InstalledAppFlow.from_client_secrets_file(CREDENTIALS_FILE, SCOPES)\n", + " creds = flow.run_local_server(port=0)\n", + " \n", + " with open(TOKEN_FILE, 'w') as token:\n", + " token.write(creds.to_json())\n", + " \n", + " return build('gmail', 'v1', credentials=creds)\n", + "\n", + "def get_email_content(service, message_id):\n", + " \"\"\"Extract email content from message\"\"\"\n", + " try:\n", + " message = service.users().messages().get(userId='me', id=message_id, format='full').execute()\n", + " \n", + " # Extract basic info\n", + " headers = message['payload'].get('headers', [])\n", + " subject = 
next((h['value'] for h in headers if h['name'] == 'Subject'), 'No Subject')\n", + " sender = next((h['value'] for h in headers if h['name'] == 'From'), 'Unknown Sender')\n", + " date = next((h['value'] for h in headers if h['name'] == 'Date'), 'Unknown Date')\n", + " \n", + " # Extract body\n", + " body = \"\"\n", + " if 'parts' in message['payload']:\n", + " for part in message['payload']['parts']:\n", + " if part['mimeType'] == 'text/plain':\n", + " data = part['body']['data']\n", + " body = base64.urlsafe_b64decode(data).decode('utf-8')\n", + " break\n", + " else:\n", + " if message['payload']['body'].get('data'):\n", + " body = base64.urlsafe_b64decode(message['payload']['body']['data']).decode('utf-8')\n", + " \n", + " # Clean up body text\n", + " body = re.sub(r'\\s+', ' ', body).strip()\n", + " \n", + " return {\n", + " 'subject': subject,\n", + " 'sender': sender,\n", + " 'date': date,\n", + " 'body': body,\n", + " 'id': message_id\n", + " }\n", + " except Exception as e:\n", + " print(f\"Error processing message {message_id}: {str(e)}\")\n", + " return None\n", + "\n", + "def load_gmail_documents(start_date, end_date, max_emails=100):\n", + " \"\"\"Load emails from Gmail between specified dates\"\"\"\n", + " service = authenticate_gmail()\n", + " \n", + " # Format dates for Gmail API (YYYY/MM/DD)\n", + " start_date_str = start_date.strftime('%Y/%m/%d')\n", + " end_date_str = end_date.strftime('%Y/%m/%d')\n", + " \n", + " # Build query\n", + " query = f'after:{start_date_str} before:{end_date_str}'\n", + " \n", + " # Get message list\n", + " result = service.users().messages().list(userId='me', q=query, maxResults=max_emails).execute()\n", + " messages = result.get('messages', [])\n", + " \n", + " print(f\"Found {len(messages)} emails between {start_date_str} and {end_date_str}\")\n", + " \n", + " # Convert to LangChain documents\n", + " documents = []\n", + " for i, message in enumerate(messages):\n", + " print(f\"Processing email 
{i+1}/{len(messages)}\")\n",
+ "        email_data = get_email_content(service, message['id'])\n",
+ "        \n",
+ "        if email_data and email_data['body']:\n",
+ "            # Create document content\n",
+ "            content = f\"\"\"Subject: {email_data['subject']}\n",
+ "From: {email_data['sender']}\n",
+ "Date: {email_data['date']}\n",
+ "\n",
+ "{email_data['body']}\"\"\"\n",
+ "            \n",
+ "            # Create LangChain document\n",
+ "            doc = Document(\n",
+ "                page_content=content,\n",
+ "                metadata={\n",
+ "                    \"doc_type\": \"email\",\n",
+ "                    \"subject\": email_data['subject'],\n",
+ "                    \"sender\": email_data['sender'],\n",
+ "                    \"date\": email_data['date'],\n",
+ "                    \"message_id\": email_data['id']\n",
+ "                }\n",
+ "            )\n",
+ "            documents.append(doc)\n",
+ "    \n",
+ "    return documents\n",
+ "\n",
+ "# SET YOUR DATE RANGE HERE\n",
+ "start_date = datetime(2025, 6, 20)  # YYYY, MM, DD\n",
+ "end_date = datetime(2025, 6, 26)    # YYYY, MM, DD\n",
+ "\n",
+ "# Load Gmail documents \n",
+ "documents = load_gmail_documents(start_date, end_date, max_emails=200)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c59de72d-f965-44b3-8487-283e4c623b1d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
+ "chunks = text_splitter.split_documents(documents)\n",
+ "\n",
+ "print(f\"Total number of chunks: {len(chunks)}\")\n",
+ "print(f\"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "78998399-ac17-4e28-b15f-0b5f51e6ee23",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Put the chunks of data into a Vector Store that associates a Vector Embedding with each chunk\n",
+ "# Chroma is a popular open source Vector Database based on SQLite\n",
+ "\n",
+ "embeddings = OpenAIEmbeddings()\n",
+ "\n",
+ "# If you would rather use the free Vector Embeddings from HuggingFace sentence-transformers\n",
+ "# Then replace embeddings = 
OpenAIEmbeddings()\n", + "# with:\n", + "# from langchain.embeddings import HuggingFaceEmbeddings\n", + "# embeddings = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n", + "\n", + "# Delete if already exists\n", + "\n", + "if os.path.exists(db_name):\n", + " Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n", + "\n", + "# Create vectorstore\n", + "\n", + "vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)\n", + "print(f\"Vectorstore created with {vectorstore._collection.count()} documents\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff2e7687-60d4-4920-a1d7-a34b9f70a250", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's investigate the vectors\n", + "\n", + "collection = vectorstore._collection\n", + "count = collection.count()\n", + "\n", + "sample_embedding = collection.get(limit=1, include=[\"embeddings\"])[\"embeddings\"][0]\n", + "dimensions = len(sample_embedding)\n", + "print(f\"There are {count:,} vectors with {dimensions:,} dimensions in the vector store\")" + ] + }, + { + "cell_type": "markdown", + "id": "b0d45462-a818-441c-b010-b85b32bcf618", + "metadata": {}, + "source": [ + "## Visualizing the Vector Store\n", + "\n", + "Let's take a minute to look at the documents and their embedding vectors to see what's going on." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b98adf5e-d464-4bd2-9bdf-bc5b6770263b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Prework (with thanks to Jon R for identifying and fixing a bug in this!)\n",
+ "\n",
+ "result = collection.get(include=['embeddings', 'documents', 'metadatas'])\n",
+ "vectors = np.array(result['embeddings'])\n",
+ "documents = result['documents']\n",
+ "metadatas = result['metadatas']\n",
+ "\n",
+ "# Color the points by sender:\n",
+ "senders = [metadata.get('sender', 'unknown') for metadata in metadatas]\n",
+ "unique_senders = list(set(senders))\n",
+ "sender_colors = ['blue', 'green', 'red', 'orange', 'purple', 'brown', 'pink', 'gray']\n",
+ "colors = [sender_colors[unique_senders.index(sender) % len(sender_colors)] for sender in senders]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "427149d5-e5d8-4abd-bb6f-7ef0333cca21",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# We humans find it easier to visualize things in 2D!\n",
+ "# Reduce the dimensionality of the vectors to 2D using t-SNE\n",
+ "# (t-distributed stochastic neighbor embedding)\n",
+ "\n",
+ "tsne = TSNE(n_components=2, random_state=42)\n",
+ "reduced_vectors = tsne.fit_transform(vectors)\n",
+ "\n",
+ "# Create the 2D scatter plot\n",
+ "fig = go.Figure(data=[go.Scatter(\n",
+ "    x=reduced_vectors[:, 0],\n",
+ "    y=reduced_vectors[:, 1],\n",
+ "    mode='markers',\n",
+ "    marker=dict(size=5, color=colors, opacity=0.8),\n",
+ "    text=[f\"Sender: {t}
Text: {d[:100]}...\" for t, d in zip(senders, documents)],\n", + " hoverinfo='text'\n", + ")])\n", + "\n", + "fig.update_layout(\n", + " title='2D Chroma Vector Store Visualization',\n", + " xaxis_title='x',\n", + " yaxis_title='y',\n", + " width=800,\n", + " height=600,\n", + " margin=dict(r=20, b=10, l=10, t=40)\n", + ")\n", + "\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1418e88-acd5-460a-bf2b-4e6efc88e3dd", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's try 3D!\n", + "\n", + "tsne = TSNE(n_components=3, random_state=42)\n", + "reduced_vectors = tsne.fit_transform(vectors)\n", + "\n", + "# Create the 3D scatter plot\n", + "fig = go.Figure(data=[go.Scatter3d(\n", + " x=reduced_vectors[:, 0],\n", + " y=reduced_vectors[:, 1],\n", + " z=reduced_vectors[:, 2],\n", + " mode='markers',\n", + " marker=dict(size=5, color=colors, opacity=0.8),\n", + " text=[f\"Sender: {t}<br>
Text: {d[:100]}...\" for t, d in zip(senders, documents)],\n", + " hoverinfo='text'\n", + ")])\n", + "\n", + "fig.update_layout(\n", + " title='3D Chroma Vector Store Visualization',\n", + " scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),\n", + " width=900,\n", + " height=700,\n", + " margin=dict(r=20, b=10, l=10, t=40)\n", + ")\n", + "\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "id": "bbbcb659-13ce-47ab-8a5e-01b930494964", + "metadata": {}, + "source": [ + "## LangChain and Gradio to prototype a chat with the LLM\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d72567e8-f891-4797-944b-4612dc6613b1", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "from langchain.chains.combine_documents import create_stuff_documents_chain\n", + "\n", + "# create a new Chat with OpenAI\n", + "llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n", + "\n", + "# Alternative - if you'd like to use Ollama locally, uncomment this line instead\n", + "# llm = ChatOpenAI(temperature=0.7, model_name='llama3.2', base_url='http://localhost:11434/v1', api_key='ollama')\n", + "\n", + "# replace the LLM's standard prompt (the default prompt answers 'I don't know' too often, especially with a small LLM)\n", + "\n", + "qa_prompt = PromptTemplate.from_template(\"Use the following pieces of context to answer the user's question. Answer as best you can given the information you have;\\\n", + " if you have a reasonable idea of the answer, then explain it and mention that you're unsure. \\\n", + " But if you don't know the answer, don't make it up. 
\\n", + " {context} \\n", + " Question: {question} \\n", + " Helpful Answer:\"\n", + " )\n", + "\n", + "# set up the conversation memory for the chat\n", + "# output_key='answer' is needed because return_source_documents adds extra keys to the chain's output\n", + "memory = ConversationBufferMemory(\n", + " memory_key='chat_history',\n", + " return_messages=True,\n", + " output_key='answer'\n", + ")\n", + "\n", + "# the retriever is an abstraction over the VectorStore that will be used during RAG\n", + "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 10})\n", + "\n", + "# putting it together: set up the conversation chain with the LLM, the vector store and memory\n", + "# (the simpler default, without the custom prompt, would be:\n", + "# conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory))\n", + "\n", + "conversation_chain = ConversationalRetrievalChain.from_llm(\n", + " llm=llm,\n", + " retriever=retriever,\n", + " memory=memory,\n", + " combine_docs_chain_kwargs={\"prompt\": qa_prompt},\n", + " return_source_documents=True\n", + ")\n", + "\n", + "def chat(question, history):\n", + " # Gradio passes the chat history, but the chain keeps its own memory\n", + " result = conversation_chain.invoke({\"question\": question})\n", + " return result[\"answer\"]\n", + "\n", + "view = gr.ChatInterface(chat, type=\"messages\").launch(inbrowser=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe4229aa-6afe-4592-93a4-71a47ab69846", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + 
"nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}