sach91 bootcamp week8 exercise
259
community-contributions/sach91-bootcamp/week8/README.md
Normal file
@@ -0,0 +1,259 @@
# 🧠 KnowledgeHub - Personal Knowledge Management & Research Assistant

An elegant, fully local AI-powered knowledge management system that helps you organize, search, and understand your documents using state-of-the-art LLM technology.

## ✨ Features

### 🎯 Core Capabilities
- **📤 Document Ingestion**: Upload PDF, DOCX, TXT, MD, and HTML files
- **❓ Intelligent Q&A**: Ask questions and get answers from your documents using RAG
- **📝 Smart Summarization**: Generate concise summaries with key points
- **🔗 Connection Discovery**: Find relationships between documents
- **💾 Multi-format Export**: Export as Markdown, HTML, or plain text
- **📊 Statistics Dashboard**: Track your knowledge base growth

### 🔒 Privacy-First
- **100% Local Processing**: All data stays on your machine
- **No Cloud Dependencies**: Uses Ollama for local LLM inference
- **Open Source**: Full transparency and control

### ⚡ Technology Stack
- **LLM**: Ollama with Llama 3.2 (3B) or Llama 3.1 (8B)
- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2)
- **Vector Database**: ChromaDB
- **UI**: Gradio
- **Document Processing**: pypdf, python-docx, beautifulsoup4

## 🚀 Quick Start

### Prerequisites

1. **Python 3.8+** installed
2. **Ollama** installed and running

#### Installing Ollama

**macOS/Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download from [ollama.com/download](https://ollama.com/download)

### Installation

1. **Clone or download this repository**

2. **Install Python dependencies:**
```bash
pip install -r requirements.txt
```

3. **Pull a Llama model using Ollama:**
```bash
# For faster inference (recommended for most users)
ollama pull llama3.2

# OR for better quality (requires more RAM)
ollama pull llama3.1
```

4. **Start the Ollama server** (if not already running):
```bash
ollama serve
```

5. **Launch KnowledgeHub:**
```bash
python app.py
```

The application will open in your browser at `http://127.0.0.1:7860`.

## 📖 Usage Guide

### 1. Upload Documents
- Go to the "Upload Documents" tab
- Select a file (PDF, DOCX, TXT, MD, or HTML)
- Click "Upload & Process"
- The document will be chunked and stored in your local vector database

### 2. Ask Questions
- Go to the "Ask Questions" tab
- Type your question in natural language
- Adjust the number of sources to retrieve (default: 5)
- Click "Ask" to get an AI-generated answer with sources

### 3. Summarize Documents
- Go to the "Summarize" tab
- Select a document from the dropdown
- Click "Generate Summary"
- Get a concise summary with key points

### 4. Find Connections
- Go to the "Find Connections" tab
- Select a document to analyze
- Adjust how many related documents to find
- See documents that are semantically similar

### 5. Export Knowledge
- Go to the "Export" tab
- Choose your format (Markdown, HTML, or Text)
- Click "Export" to download your knowledge base

### 6. View Statistics
- Go to the "Statistics" tab
- See an overview of your knowledge base
- Track total documents, chunks, and characters

## 🏗️ Architecture

```
KnowledgeHub/
├── agents/                  # Specialized AI agents
│   ├── base_agent.py        # Base class for all agents
│   ├── ingestion_agent.py   # Document processing
│   ├── question_agent.py    # RAG-based Q&A
│   ├── summary_agent.py     # Summarization
│   ├── connection_agent.py  # Finding relationships
│   └── export_agent.py      # Exporting data
├── models/                  # Data models
│   ├── document.py          # Document structures
│   └── knowledge_graph.py   # Graph structures
├── utils/                   # Utilities
│   ├── ollama_client.py     # Ollama API wrapper
│   ├── embeddings.py        # Embedding generation
│   └── document_parser.py   # File parsing
├── vectorstore/             # ChromaDB storage (auto-created)
├── temp_uploads/            # Temporary file storage (auto-created)
├── app.py                   # Main Gradio application
└── requirements.txt         # Python dependencies
```

## 🎯 Multi-Agent Framework

KnowledgeHub uses a multi-agent architecture:

1. **Ingestion Agent**: Parses documents, creates chunks, generates embeddings
2. **Question Agent**: Retrieves relevant context and answers questions
3. **Summary Agent**: Creates concise summaries and extracts key points
4. **Connection Agent**: Finds semantic relationships between documents
5. **Export Agent**: Formats and exports knowledge in multiple formats

Each agent is independent, reusable, and focused on a specific task, following best practices in agentic AI development.
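
The same agents can also be driven programmatically, without the Gradio UI. A minimal sketch, assuming the module layout shown above and a running local Ollama server (the constructor arguments mirror the ones used in `app.py`; the PDF path is a hypothetical example):

```python
# Minimal sketch of composing two agents outside the Gradio UI.
import chromadb
from utils import OllamaClient, EmbeddingModel
from agents import IngestionAgent, QuestionAgent

client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection(name="knowledge_base")

llm = OllamaClient(model="llama3.2")   # shared local LLM client
embedder = EmbeddingModel()            # all-MiniLM-L6-v2 by default

ingestion = IngestionAgent(collection=collection, embedding_model=embedder, llm_client=llm)
qa = QuestionAgent(collection=collection, embedding_model=embedder, llm_client=llm)

document = ingestion.process("notes/my_paper.pdf")   # hypothetical path: chunk, embed, store
result = qa.process("What are the main findings?", top_k=5)
print(result['answer'])
```

Because every agent takes the same shared `llm_client` and ChromaDB collection, new agents can be added without touching the existing ones.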

## ⚙️ Configuration

### Changing Models

Edit `app.py` to use different models:

```python
# For Llama 3.1 8B (better quality, more RAM)
self.llm_client = OllamaClient(model="llama3.1")

# For Llama 3.2 3B (faster, less RAM)
self.llm_client = OllamaClient(model="llama3.2")
```

### Adjusting Chunk Size

Edit `agents/ingestion_agent.py`:

```python
self.parser = DocumentParser(
    chunk_size=1000,    # Characters per chunk
    chunk_overlap=200   # Overlap between chunks
)
```

### Changing Embedding Model

Edit `app.py`:

```python
self.embedding_model = EmbeddingModel(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```

## 🔧 Troubleshooting

### "Cannot connect to Ollama"
- Ensure Ollama is installed: `ollama --version`
- Start the Ollama service: `ollama serve`
- Verify the model is pulled: `ollama list` (see the quick check below)
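
You can also confirm connectivity from Python using the project's own wrapper. A quick sketch, assuming Ollama is listening on its default local address:

```python
# Quick connectivity check using the bundled client wrapper.
from utils.ollama_client import OllamaClient

client = OllamaClient(model="llama3.2")
print("Ollama reachable:", client.check_connection())  # the same check app.py runs at startup
```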

### "Module not found" errors
- Ensure all dependencies are installed: `pip install -r requirements.txt`
- Try upgrading pip: `pip install --upgrade pip`

### "Out of memory" errors
- Use Llama 3.2 (3B) instead of Llama 3.1 (8B)
- Reduce `chunk_size` in the document parser
- Process fewer documents at once

### Slow response times
- Ensure you're using a CUDA-enabled GPU (if available)
- Reduce the number of retrieved chunks (the `top_k` parameter; see the sketch below)
- Use a smaller model (llama3.2)
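
For example, retrieval depth can be lowered programmatically through the Question Agent. A sketch, where `hub` stands in for an existing `KnowledgeHub` instance from `app.py` and the question text is illustrative:

```python
# Sketch: retrieve fewer chunks per question to cut prompt size and latency.
hub.question_agent.set_top_k(3)          # default is 5
result = hub.question_agent.process("What does chapter 2 cover?")
```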

## 🎓 Learning Resources

This project demonstrates key concepts in LLM engineering:

- **RAG (Retrieval-Augmented Generation)**: Combining retrieval with generation
- **Vector Databases**: Using ChromaDB for semantic search
- **Multi-Agent Systems**: Specialized agents working together
- **Embeddings**: Semantic representation of text
- **Local LLM Deployment**: Using Ollama for privacy-focused AI

## 📊 Performance

**Hardware Requirements:**
- Minimum: 8 GB RAM, CPU only
- Recommended: 16 GB RAM, NVIDIA GPU with CUDA
- Optimal: 32 GB RAM, GPU (RTX 3060 or better)

**Processing Speed** (Llama 3.2 on an M1 Mac):
- Document ingestion: ~2-5 seconds per page
- Question answering: ~5-15 seconds
- Summarization: ~10-20 seconds

## 🤝 Contributing

This is a learning project showcasing LLM engineering principles. Feel free to:
- Experiment with different models
- Add new agents for specialized tasks
- Improve the UI
- Optimize performance

## 📄 License

This project is open source and available for educational purposes.

## 🙏 Acknowledgments

Built with:
- [Ollama](https://ollama.com/) - Local LLM runtime
- [Gradio](https://gradio.app/) - UI framework
- [ChromaDB](https://www.trychroma.com/) - Vector database
- [Sentence Transformers](https://www.sbert.net/) - Embeddings
- [Llama](https://ai.meta.com/llama/) - Meta's open-source LLMs

## 🎯 Next Steps

Potential enhancements:
1. Add support for images and diagrams
2. Implement multi-document chat history
3. Build a visual knowledge graph
4. Add collaborative features
5. Create a mobile app interface
6. Implement advanced filters and search
7. Add citation tracking
8. Create automated study guides

---

**Made with ❤️ for the LLM Engineering Community**
@@ -0,0 +1,18 @@
"""
KnowledgeHub Agents
"""
from .base_agent import BaseAgent
from .ingestion_agent import IngestionAgent
from .question_agent import QuestionAgent
from .summary_agent import SummaryAgent
from .connection_agent import ConnectionAgent
from .export_agent import ExportAgent

__all__ = [
    'BaseAgent',
    'IngestionAgent',
    'QuestionAgent',
    'SummaryAgent',
    'ConnectionAgent',
    'ExportAgent'
]
@@ -0,0 +1,91 @@
"""
Base Agent class - Foundation for all specialized agents
"""
from abc import ABC, abstractmethod
import logging
from typing import Optional, Dict, Any
from utils.ollama_client import OllamaClient

logger = logging.getLogger(__name__)


class BaseAgent(ABC):
    """Abstract base class for all agents"""

    def __init__(self, name: str, llm_client: Optional[OllamaClient] = None,
                 model: str = "llama3.2"):
        """
        Initialize base agent

        Args:
            name: Agent name for logging
            llm_client: Shared Ollama client (creates new one if None)
            model: Ollama model to use
        """
        self.name = name
        self.model = model

        # Use shared client or create new one
        if llm_client is None:
            self.llm = OllamaClient(model=model)
            logger.info(f"{self.name} initialized with new LLM client (model: {model})")
        else:
            self.llm = llm_client
            logger.info(f"{self.name} initialized with shared LLM client (model: {model})")

    def generate(self, prompt: str, system: Optional[str] = None,
                 temperature: float = 0.7, max_tokens: int = 2048) -> str:
        """
        Generate text using the LLM

        Args:
            prompt: User prompt
            system: System message (optional)
            temperature: Sampling temperature
            max_tokens: Maximum tokens to generate

        Returns:
            Generated text
        """
        logger.info(f"{self.name} generating response")
        response = self.llm.generate(
            prompt=prompt,
            system=system,
            temperature=temperature,
            max_tokens=max_tokens
        )
        logger.debug(f"{self.name} generated {len(response)} characters")
        return response

    def chat(self, messages: list, temperature: float = 0.7,
             max_tokens: int = 2048) -> str:
        """
        Chat completion with message history

        Args:
            messages: List of message dicts with 'role' and 'content'
            temperature: Sampling temperature
            max_tokens: Maximum tokens to generate

        Returns:
            Generated text
        """
        logger.info(f"{self.name} processing chat with {len(messages)} messages")
        response = self.llm.chat(
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        logger.debug(f"{self.name} generated {len(response)} characters")
        return response

    @abstractmethod
    def process(self, *args, **kwargs) -> Any:
        """
        Main processing method - must be implemented by subclasses

        Each agent implements its specialized logic here
        """
        pass

    def __str__(self):
        return f"{self.name} (model: {self.model})"
@@ -0,0 +1,289 @@
"""
Connection Agent - Finds relationships and connections between documents
"""
import logging
from typing import List, Dict, Tuple
from agents.base_agent import BaseAgent
from models.knowledge_graph import KnowledgeNode, KnowledgeEdge, KnowledgeGraph
from utils.embeddings import EmbeddingModel
import chromadb
import numpy as np

logger = logging.getLogger(__name__)


class ConnectionAgent(BaseAgent):
    """Agent that discovers connections between documents and concepts"""

    def __init__(self, collection: chromadb.Collection,
                 embedding_model: EmbeddingModel,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize connection agent

        Args:
            collection: ChromaDB collection with documents
            embedding_model: Model for computing similarities
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="ConnectionAgent", llm_client=llm_client, model=model)

        self.collection = collection
        self.embedding_model = embedding_model

        logger.info(f"{self.name} initialized")

    def process(self, document_id: str = None, query: str = None,
                top_k: int = 5) -> Dict:
        """
        Find documents related to a document or query

        Args:
            document_id: ID of reference document
            query: Search query (used if document_id not provided)
            top_k: Number of related documents to find

        Returns:
            Dictionary with related documents and connections
        """
        if document_id:
            logger.info(f"{self.name} finding connections for document: {document_id}")
            return self._find_related_to_document(document_id, top_k)
        elif query:
            logger.info(f"{self.name} finding connections for query: {query[:100]}")
            return self._find_related_to_query(query, top_k)
        else:
            return {'related': [], 'error': 'No document_id or query provided'}

    def _find_related_to_document(self, document_id: str, top_k: int) -> Dict:
        """Find documents related to a specific document"""
        try:
            # Get chunks from the document
            results = self.collection.get(
                where={"document_id": document_id},
                include=['embeddings', 'documents', 'metadatas']
            )

            if not results['ids']:
                return {'related': [], 'error': 'Document not found'}

            # Use the first chunk's embedding as representative
            query_embedding = results['embeddings'][0]
            document_name = results['metadatas'][0].get('filename', 'Unknown')

            # Search for similar chunks from OTHER documents
            search_results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=top_k * 3,  # Get more to filter out same document
                include=['documents', 'metadatas', 'distances']
            )

            # Filter out chunks from the same document
            related = []
            seen_docs = set([document_id])

            if search_results['ids']:
                for i in range(len(search_results['ids'][0])):
                    related_doc_id = search_results['metadatas'][0][i].get('document_id')

                    if related_doc_id not in seen_docs:
                        seen_docs.add(related_doc_id)

                        similarity = 1.0 - search_results['distances'][0][i]

                        related.append({
                            'document_id': related_doc_id,
                            'document_name': search_results['metadatas'][0][i].get('filename', 'Unknown'),
                            'similarity': float(similarity),
                            'preview': search_results['documents'][0][i][:150] + "..."
                        })

                        if len(related) >= top_k:
                            break

            return {
                'source_document': document_name,
                'source_id': document_id,
                'related': related,
                'num_related': len(related)
            }

        except Exception as e:
            logger.error(f"Error finding related documents: {e}")
            return {'related': [], 'error': str(e)}

    def _find_related_to_query(self, query: str, top_k: int) -> Dict:
        """Find documents related to a query"""
        try:
            # Generate query embedding
            query_embedding = self.embedding_model.embed_query(query)

            # Search
            results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=top_k * 2,  # Get more to deduplicate by document
                include=['documents', 'metadatas', 'distances']
            )

            # Deduplicate by document
            related = []
            seen_docs = set()

            if results['ids']:
                for i in range(len(results['ids'][0])):
                    doc_id = results['metadatas'][0][i].get('document_id')

                    if doc_id not in seen_docs:
                        seen_docs.add(doc_id)

                        similarity = 1.0 - results['distances'][0][i]

                        related.append({
                            'document_id': doc_id,
                            'document_name': results['metadatas'][0][i].get('filename', 'Unknown'),
                            'similarity': float(similarity),
                            'preview': results['documents'][0][i][:150] + "..."
                        })

                        if len(related) >= top_k:
                            break

            return {
                'query': query,
                'related': related,
                'num_related': len(related)
            }

        except Exception as e:
            logger.error(f"Error finding related documents: {e}")
            return {'related': [], 'error': str(e)}

    def build_knowledge_graph(self, similarity_threshold: float = 0.7) -> KnowledgeGraph:
        """
        Build a knowledge graph showing document relationships

        Args:
            similarity_threshold: Minimum similarity to create an edge

        Returns:
            KnowledgeGraph object
        """
        logger.info(f"{self.name} building knowledge graph")

        graph = KnowledgeGraph()

        try:
            # Get all documents
            all_results = self.collection.get(
                include=['embeddings', 'metadatas']
            )

            if not all_results['ids']:
                return graph

            # Group by document
            documents = {}
            for i, metadata in enumerate(all_results['metadatas']):
                doc_id = metadata.get('document_id')
                if doc_id not in documents:
                    documents[doc_id] = {
                        'name': metadata.get('filename', 'Unknown'),
                        'embedding': all_results['embeddings'][i]
                    }

            # Create nodes
            for doc_id, doc_data in documents.items():
                node = KnowledgeNode(
                    id=doc_id,
                    name=doc_data['name'],
                    node_type='document',
                    description=f"Document: {doc_data['name']}"
                )
                graph.add_node(node)

            # Create edges based on similarity
            doc_ids = list(documents.keys())
            for i, doc_id1 in enumerate(doc_ids):
                emb1 = np.array(documents[doc_id1]['embedding'])

                for doc_id2 in doc_ids[i+1:]:
                    emb2 = np.array(documents[doc_id2]['embedding'])

                    # Calculate similarity
                    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

                    if similarity >= similarity_threshold:
                        edge = KnowledgeEdge(
                            source_id=doc_id1,
                            target_id=doc_id2,
                            relationship='similar_to',
                            weight=float(similarity)
                        )
                        graph.add_edge(edge)

            logger.info(f"{self.name} built graph with {len(graph.nodes)} nodes and {len(graph.edges)} edges")
            return graph

        except Exception as e:
            logger.error(f"Error building knowledge graph: {e}")
            return graph

    def explain_connection(self, doc_id1: str, doc_id2: str) -> str:
        """
        Use LLM to explain why two documents are related

        Args:
            doc_id1: First document ID
            doc_id2: Second document ID

        Returns:
            Explanation text
        """
        try:
            # Get sample chunks from each document
            results1 = self.collection.get(
                where={"document_id": doc_id1},
                limit=2,
                include=['documents', 'metadatas']
            )

            results2 = self.collection.get(
                where={"document_id": doc_id2},
                limit=2,
                include=['documents', 'metadatas']
            )

            if not results1['ids'] or not results2['ids']:
                return "Could not retrieve documents"

            doc1_name = results1['metadatas'][0].get('filename', 'Document 1')
            doc2_name = results2['metadatas'][0].get('filename', 'Document 2')

            doc1_text = " ".join(results1['documents'][:2])[:1000]
            doc2_text = " ".join(results2['documents'][:2])[:1000]

            system_prompt = """You analyze documents and explain their relationships.
Provide a brief, clear explanation of how two documents are related."""

            user_prompt = f"""Analyze these two documents and explain how they are related:

Document 1 ({doc1_name}):
{doc1_text}

Document 2 ({doc2_name}):
{doc2_text}

How are these documents related? Provide a concise explanation:"""

            explanation = self.generate(
                prompt=user_prompt,
                system=system_prompt,
                temperature=0.3,
                max_tokens=256
            )

            return explanation

        except Exception as e:
            logger.error(f"Error explaining connection: {e}")
            return f"Error: {str(e)}"
@@ -0,0 +1,233 @@
"""
Export Agent - Generates formatted reports and exports
"""
import logging
from typing import List, Dict
from datetime import datetime
from agents.base_agent import BaseAgent
from models.document import Summary

logger = logging.getLogger(__name__)


class ExportAgent(BaseAgent):
    """Agent that exports summaries and reports in various formats"""

    def __init__(self, llm_client=None, model: str = "llama3.2"):
        """
        Initialize export agent

        Args:
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="ExportAgent", llm_client=llm_client, model=model)

        logger.info(f"{self.name} initialized")

    def process(self, content: Dict, format: str = "markdown") -> str:
        """
        Export content in specified format

        Args:
            content: Content dictionary to export
            format: Export format ('markdown', 'text', 'html')

        Returns:
            Formatted content string
        """
        logger.info(f"{self.name} exporting as {format}")

        if format == "markdown":
            return self._export_markdown(content)
        elif format == "text":
            return self._export_text(content)
        elif format == "html":
            return self._export_html(content)
        else:
            return str(content)

    def _export_markdown(self, content: Dict) -> str:
        """Export as Markdown"""
        md = []
        md.append("# Knowledge Report")
        md.append(f"\n*Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}*\n")

        if 'title' in content:
            md.append(f"## {content['title']}\n")

        if 'summary' in content:
            md.append("### Summary\n")
            md.append(f"{content['summary']}\n")

        if 'key_points' in content and content['key_points']:
            md.append("### Key Points\n")
            for point in content['key_points']:
                md.append(f"- {point}")
            md.append("")

        if 'sections' in content:
            for section in content['sections']:
                md.append(f"### {section['title']}\n")
                md.append(f"{section['content']}\n")

        if 'sources' in content and content['sources']:
            md.append("### Sources\n")
            for i, source in enumerate(content['sources'], 1):
                md.append(f"{i}. {source}")
            md.append("")

        return "\n".join(md)

    def _export_text(self, content: Dict) -> str:
        """Export as plain text"""
        lines = []
        lines.append("=" * 60)
        lines.append("KNOWLEDGE REPORT")
        lines.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
        lines.append("=" * 60)
        lines.append("")

        if 'title' in content:
            lines.append(content['title'])
            lines.append("-" * len(content['title']))
            lines.append("")

        if 'summary' in content:
            lines.append("SUMMARY:")
            lines.append(content['summary'])
            lines.append("")

        if 'key_points' in content and content['key_points']:
            lines.append("KEY POINTS:")
            for i, point in enumerate(content['key_points'], 1):
                lines.append(f"  {i}. {point}")
            lines.append("")

        if 'sections' in content:
            for section in content['sections']:
                lines.append(section['title'].upper())
                lines.append("-" * 40)
                lines.append(section['content'])
                lines.append("")

        if 'sources' in content and content['sources']:
            lines.append("SOURCES:")
            for i, source in enumerate(content['sources'], 1):
                lines.append(f"  {i}. {source}")

        lines.append("")
        lines.append("=" * 60)

        return "\n".join(lines)

    def _export_html(self, content: Dict) -> str:
        """Export as HTML"""
        html = []
        html.append("<!DOCTYPE html>")
        html.append("<html>")
        html.append("<head>")
        html.append("  <meta charset='utf-8'>")
        html.append("  <title>Knowledge Report</title>")
        html.append("  <style>")
        html.append("    body { font-family: Arial, sans-serif; max-width: 800px; margin: 40px auto; padding: 20px; }")
        html.append("    h1 { color: #333; border-bottom: 3px solid #007bff; padding-bottom: 10px; }")
        html.append("    h2 { color: #555; margin-top: 30px; }")
        html.append("    .meta { color: #888; font-style: italic; }")
        html.append("    .key-points { background: #f8f9fa; padding: 15px; border-left: 4px solid #007bff; }")
        html.append("    .source { color: #666; font-size: 0.9em; }")
        html.append("  </style>")
        html.append("</head>")
        html.append("<body>")

        html.append("  <h1>Knowledge Report</h1>")
        html.append(f"  <p class='meta'>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>")

        if 'title' in content:
            html.append(f"  <h2>{content['title']}</h2>")

        if 'summary' in content:
            html.append("  <h3>Summary</h3>")
            html.append(f"  <p>{content['summary']}</p>")

        if 'key_points' in content and content['key_points']:
            html.append("  <h3>Key Points</h3>")
            html.append("  <div class='key-points'>")
            html.append("    <ul>")
            for point in content['key_points']:
                html.append(f"      <li>{point}</li>")
            html.append("    </ul>")
            html.append("  </div>")

        if 'sections' in content:
            for section in content['sections']:
                html.append(f"  <h3>{section['title']}</h3>")
                html.append(f"  <p>{section['content']}</p>")

        if 'sources' in content and content['sources']:
            html.append("  <h3>Sources</h3>")
            html.append("  <ol class='source'>")
            for source in content['sources']:
                html.append(f"    <li>{source}</li>")
            html.append("  </ol>")

        html.append("</body>")
        html.append("</html>")

        return "\n".join(html)

    def create_study_guide(self, summaries: List[Summary]) -> str:
        """
        Create a study guide from multiple summaries

        Args:
            summaries: List of Summary objects

        Returns:
            Formatted study guide
        """
        logger.info(f"{self.name} creating study guide from {len(summaries)} summaries")

        # Compile all content
        all_summaries = "\n\n".join([
            f"{s.document_name}:\n{s.summary_text}"
            for s in summaries
        ])

        all_key_points = []
        for s in summaries:
            all_key_points.extend(s.key_points)

        # Use LLM to create cohesive study guide
        system_prompt = """You create excellent study guides that synthesize information from multiple sources.
Create a well-organized study guide with clear sections, key concepts, and important points."""

        user_prompt = f"""Create a comprehensive study guide based on these document summaries:

{all_summaries}

Create a well-structured study guide with:
1. An overview
2. Key concepts
3. Important details
4. Study tips

Study Guide:"""

        study_guide = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.5,
            max_tokens=2048
        )

        # Format as markdown
        content = {
            'title': 'Study Guide',
            'sections': [
                {'title': 'Overview', 'content': study_guide},
                {'title': 'Key Points from All Documents', 'content': '\n'.join([f"• {p}" for p in all_key_points[:15]])}
            ],
            'sources': [s.document_name for s in summaries]
        }

        return self._export_markdown(content)
@@ -0,0 +1,157 @@
"""
Ingestion Agent - Processes and stores documents in the vector database
"""
import logging
from typing import Dict, List
import uuid
from datetime import datetime

from agents.base_agent import BaseAgent
from models.document import Document, DocumentChunk
from utils.document_parser import DocumentParser
from utils.embeddings import EmbeddingModel
import chromadb

logger = logging.getLogger(__name__)


class IngestionAgent(BaseAgent):
    """Agent responsible for ingesting and storing documents"""

    def __init__(self, collection: chromadb.Collection,
                 embedding_model: EmbeddingModel,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize ingestion agent

        Args:
            collection: ChromaDB collection for storage
            embedding_model: Model for generating embeddings
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="IngestionAgent", llm_client=llm_client, model=model)

        self.collection = collection
        self.embedding_model = embedding_model
        self.parser = DocumentParser(chunk_size=1000, chunk_overlap=200)

        logger.info(f"{self.name} ready with ChromaDB collection")

    def process(self, file_path: str) -> Document:
        """
        Process and ingest a document

        Args:
            file_path: Path to the document file

        Returns:
            Document object with metadata
        """
        logger.info(f"{self.name} processing: {file_path}")

        # Parse the document
        parsed = self.parser.parse_file(file_path)

        # Generate document ID
        doc_id = str(uuid.uuid4())

        # Create document chunks
        chunks = []
        chunk_texts = []
        chunk_ids = []
        chunk_metadatas = []

        for i, chunk_text in enumerate(parsed['chunks']):
            chunk_id = f"{doc_id}_chunk_{i}"

            chunk = DocumentChunk(
                id=chunk_id,
                document_id=doc_id,
                content=chunk_text,
                chunk_index=i,
                metadata={
                    'filename': parsed['filename'],
                    'extension': parsed['extension'],
                    'total_chunks': len(parsed['chunks'])
                }
            )

            chunks.append(chunk)
            chunk_texts.append(chunk_text)
            chunk_ids.append(chunk_id)
            chunk_metadatas.append({
                'document_id': doc_id,
                'filename': parsed['filename'],
                'chunk_index': i,
                'extension': parsed['extension']
            })

        # Generate embeddings
        logger.info(f"{self.name} generating embeddings for {len(chunks)} chunks")
        embeddings = self.embedding_model.embed_documents(chunk_texts)

        # Store in ChromaDB
        logger.info(f"{self.name} storing in ChromaDB")
        self.collection.add(
            ids=chunk_ids,
            documents=chunk_texts,
            embeddings=embeddings,
            metadatas=chunk_metadatas
        )

        # Create document object
        document = Document(
            id=doc_id,
            filename=parsed['filename'],
            filepath=parsed['filepath'],
            content=parsed['text'],
            chunks=chunks,
            metadata={
                'extension': parsed['extension'],
                'num_chunks': len(chunks),
                'total_chars': parsed['total_chars']
            },
            created_at=datetime.now()
        )

        logger.info(f"{self.name} successfully ingested: {document}")
        return document

    def get_statistics(self) -> Dict:
        """Get statistics about stored documents"""
        try:
            count = self.collection.count()
            return {
                'total_chunks': count,
                'collection_name': self.collection.name
            }
        except Exception as e:
            logger.error(f"Error getting statistics: {e}")
            return {'total_chunks': 0, 'error': str(e)}

    def delete_document(self, document_id: str) -> bool:
        """
        Delete all chunks of a document

        Args:
            document_id: ID of document to delete

        Returns:
            True if successful
        """
        try:
            # Get all chunk IDs for this document
            results = self.collection.get(
                where={"document_id": document_id}
            )

            if results['ids']:
                self.collection.delete(ids=results['ids'])
                logger.info(f"{self.name} deleted document {document_id}")
                return True

            return False

        except Exception as e:
            logger.error(f"Error deleting document: {e}")
            return False
@@ -0,0 +1,156 @@
"""
Question Agent - Answers questions using RAG (Retrieval Augmented Generation)
"""
import logging
from typing import List, Dict, Any
from agents.base_agent import BaseAgent
from models.document import SearchResult, DocumentChunk
from utils.embeddings import EmbeddingModel
import chromadb

logger = logging.getLogger(__name__)


class QuestionAgent(BaseAgent):
    """Agent that answers questions using retrieved context"""

    def __init__(self, collection: chromadb.Collection,
                 embedding_model: EmbeddingModel,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize question agent

        Args:
            collection: ChromaDB collection with documents
            embedding_model: Model for query embeddings
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="QuestionAgent", llm_client=llm_client, model=model)

        self.collection = collection
        self.embedding_model = embedding_model
        self.top_k = 5  # Number of chunks to retrieve

        logger.info(f"{self.name} initialized")

    def retrieve(self, query: str, top_k: int = None) -> List[SearchResult]:
        """
        Retrieve relevant document chunks for a query

        Args:
            query: Search query
            top_k: Number of results to return (uses self.top_k if None)

        Returns:
            List of SearchResult objects
        """
        if top_k is None:
            top_k = self.top_k

        logger.info(f"{self.name} retrieving top {top_k} chunks for query")

        # Generate query embedding
        query_embedding = self.embedding_model.embed_query(query)

        # Search ChromaDB
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        # Convert to SearchResult objects
        search_results = []

        if results['ids'] and len(results['ids']) > 0:
            for i in range(len(results['ids'][0])):
                chunk = DocumentChunk(
                    id=results['ids'][0][i],
                    document_id=results['metadatas'][0][i].get('document_id', ''),
                    content=results['documents'][0][i],
                    chunk_index=results['metadatas'][0][i].get('chunk_index', 0),
                    metadata=results['metadatas'][0][i]
                )

                result = SearchResult(
                    chunk=chunk,
                    score=1.0 - results['distances'][0][i],  # Convert distance to similarity
                    document_id=results['metadatas'][0][i].get('document_id', ''),
                    document_name=results['metadatas'][0][i].get('filename', 'Unknown')
                )

                search_results.append(result)

        logger.info(f"{self.name} retrieved {len(search_results)} results")
        return search_results

    def process(self, question: str, top_k: int = None) -> Dict[str, Any]:
        """
        Answer a question using RAG

        Args:
            question: User's question
            top_k: Number of chunks to retrieve

        Returns:
            Dictionary with answer and sources
        """
        logger.info(f"{self.name} processing question: {question[:100]}...")

        # Retrieve relevant chunks
        search_results = self.retrieve(question, top_k)

        if not search_results:
            return {
                'answer': "I don't have any relevant information in my knowledge base to answer this question.",
                'sources': [],
                'context_used': ""
            }

        # Build context from retrieved chunks
        context_parts = []
        sources = []

        for i, result in enumerate(search_results, 1):
            context_parts.append(f"[Source {i}] {result.chunk.content}")
            sources.append({
                'document': result.document_name,
                'score': result.score,
                'preview': result.chunk.content[:150] + "..."
            })

        context = "\n\n".join(context_parts)

        # Create prompt for LLM
        system_prompt = """You are a helpful research assistant. Answer questions based on the provided context.
Be accurate and cite sources when possible. If the context doesn't contain enough information to answer fully, say so.
Keep your answer concise and relevant."""

        user_prompt = f"""Context from my knowledge base:

{context}

Question: {question}

Answer based on the context above. If you reference specific information, mention which source(s) you're using."""

        # Generate answer
        answer = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.3,  # Lower temperature for more factual responses
            max_tokens=1024
        )

        logger.info(f"{self.name} generated answer ({len(answer)} chars)")

        return {
            'answer': answer,
            'sources': sources,
            'context_used': context,
            'num_sources': len(sources)
        }

    def set_top_k(self, k: int):
        """Set the number of chunks to retrieve"""
        self.top_k = k
        logger.info(f"{self.name} top_k set to {k}")
@@ -0,0 +1,181 @@
"""
Summary Agent - Creates summaries and extracts key points from documents
"""
import logging
from typing import Dict, List
from agents.base_agent import BaseAgent
from models.document import Summary
import chromadb

logger = logging.getLogger(__name__)


class SummaryAgent(BaseAgent):
    """Agent that creates summaries of documents"""

    def __init__(self, collection: chromadb.Collection,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize summary agent

        Args:
            collection: ChromaDB collection with documents
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="SummaryAgent", llm_client=llm_client, model=model)
        self.collection = collection

        logger.info(f"{self.name} initialized")

    def process(self, document_id: str = None, document_text: str = None,
                document_name: str = "Unknown") -> Summary:
        """
        Create a summary of a document

        Args:
            document_id: ID of document in ChromaDB (retrieves chunks if provided)
            document_text: Full document text (used if document_id not provided)
            document_name: Name of the document

        Returns:
            Summary object
        """
        logger.info(f"{self.name} creating summary for: {document_name}")

        # Get document text
        if document_id:
            text = self._get_document_text(document_id)
            if not text:
                return Summary(
                    document_id=document_id,
                    document_name=document_name,
                    summary_text="Error: Could not retrieve document",
                    key_points=[]
                )
        elif document_text:
            text = document_text
        else:
            return Summary(
                document_id="",
                document_name=document_name,
                summary_text="Error: No document provided",
                key_points=[]
            )

        # Truncate if too long (to fit in context)
        max_chars = 8000
        if len(text) > max_chars:
            logger.warning(f"{self.name} truncating document from {len(text)} to {max_chars} chars")
            text = text[:max_chars] + "\n\n[Document truncated...]"

        # Generate summary
        summary_text = self._generate_summary(text)

        # Extract key points
        key_points = self._extract_key_points(text)

        summary = Summary(
            document_id=document_id or "",
            document_name=document_name,
            summary_text=summary_text,
            key_points=key_points
        )

        logger.info(f"{self.name} completed summary with {len(key_points)} key points")
        return summary

    def _get_document_text(self, document_id: str) -> str:
        """Retrieve and reconstruct document text from chunks"""
        try:
            results = self.collection.get(
                where={"document_id": document_id}
            )

            if not results['ids']:
                return ""

            # Sort by chunk index
            chunks_data = list(zip(
                results['documents'],
                results['metadatas']
            ))

            chunks_data.sort(key=lambda x: x[1].get('chunk_index', 0))

            # Combine chunks
            text = "\n\n".join([chunk[0] for chunk in chunks_data])
            return text

        except Exception as e:
            logger.error(f"Error retrieving document: {e}")
            return ""

    def _generate_summary(self, text: str) -> str:
        """Generate a concise summary of the text"""
        system_prompt = """You are an expert at creating concise, informative summaries.
Your summaries capture the main ideas and key information in clear, accessible language.
Keep summaries to 3-5 sentences unless the document is very long."""

        user_prompt = f"""Please create a concise summary of the following document:

{text}

Summary:"""

        summary = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.3,
            max_tokens=512
        )

        return summary.strip()

    def _extract_key_points(self, text: str) -> List[str]:
        """Extract key points from the text"""
        system_prompt = """You extract the most important key points from documents.
List 3-7 key points as concise bullet points. Each point should be a complete, standalone statement."""

        user_prompt = f"""Please extract the key points from the following document:

{text}

List the key points (one per line, without bullets or numbers):"""

        response = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.3,
            max_tokens=512
        )

        # Parse the response into a list
        key_points = []
        for line in response.split('\n'):
            line = line.strip()
            # Remove common list markers
            line = line.lstrip('•-*0123456789.)')
            line = line.strip()

            if line and len(line) > 10:  # Filter out very short lines
                key_points.append(line)

        return key_points[:7]  # Limit to 7 points

    def summarize_multiple(self, document_ids: List[str]) -> List[Summary]:
        """
        Create summaries for multiple documents

        Args:
            document_ids: List of document IDs

        Returns:
            List of Summary objects
        """
        summaries = []

        for doc_id in document_ids:
            summary = self.process(document_id=doc_id)
            summaries.append(summary)

        return summaries
846
community-contributions/sach91-bootcamp/week8/app.py
Normal file
@@ -0,0 +1,846 @@
"""
KnowledgeHub - Personal Knowledge Management & Research Assistant
Main Gradio Application
"""
import os
import logging
import json
import gradio as gr
from pathlib import Path
import chromadb
from datetime import datetime

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Import utilities and agents
from utils import OllamaClient, EmbeddingModel, DocumentParser
from agents import (
    IngestionAgent, QuestionAgent, SummaryAgent,
    ConnectionAgent, ExportAgent
)
from models import Document

# Constants
VECTORSTORE_PATH = "./vectorstore"
TEMP_UPLOAD_PATH = "./temp_uploads"
DOCUMENTS_METADATA_PATH = "./vectorstore/documents_metadata.json"

# Ensure directories exist
os.makedirs(VECTORSTORE_PATH, exist_ok=True)
os.makedirs(TEMP_UPLOAD_PATH, exist_ok=True)


class KnowledgeHub:
    """Main application class managing all agents"""

    def __init__(self):
        logger.info("Initializing KnowledgeHub...")

        # Initialize ChromaDB
        self.client = chromadb.PersistentClient(path=VECTORSTORE_PATH)
        self.collection = self.client.get_or_create_collection(
            name="knowledge_base",
            metadata={"description": "Personal knowledge management collection"}
        )

        # Initialize embedding model
        self.embedding_model = EmbeddingModel()

        # Initialize shared LLM client
        self.llm_client = OllamaClient(model="llama3.2")

        # Check Ollama connection
        if not self.llm_client.check_connection():
            logger.warning("⚠️ Cannot connect to Ollama. Please ensure Ollama is running.")
            logger.warning("Start Ollama with: ollama serve")
        else:
            logger.info("✓ Connected to Ollama")

        # Initialize agents
        self.ingestion_agent = IngestionAgent(
            collection=self.collection,
            embedding_model=self.embedding_model,
            llm_client=self.llm_client
        )

        self.question_agent = QuestionAgent(
            collection=self.collection,
            embedding_model=self.embedding_model,
            llm_client=self.llm_client
        )

        self.summary_agent = SummaryAgent(
            collection=self.collection,
            llm_client=self.llm_client
        )

        self.connection_agent = ConnectionAgent(
            collection=self.collection,
            embedding_model=self.embedding_model,
            llm_client=self.llm_client
        )

        self.export_agent = ExportAgent(
            llm_client=self.llm_client
        )

        # Track uploaded documents
        self.documents = {}

        # Load existing documents from metadata file
        self._load_documents_metadata()

        logger.info("✓ KnowledgeHub initialized successfully")

    def _save_documents_metadata(self):
        """Save document metadata to JSON file"""
        try:
            metadata = {
                doc_id: doc.to_dict()
                for doc_id, doc in self.documents.items()
            }

            with open(DOCUMENTS_METADATA_PATH, 'w') as f:
                json.dump(metadata, f, indent=2)

            logger.debug(f"Saved metadata for {len(metadata)} documents")
        except Exception as e:
            logger.error(f"Error saving document metadata: {e}")

    def _load_documents_metadata(self):
        """Load document metadata from JSON file"""
        try:
            if os.path.exists(DOCUMENTS_METADATA_PATH):
                with open(DOCUMENTS_METADATA_PATH, 'r') as f:
                    metadata = json.load(f)

                # Reconstruct Document objects (simplified - without chunks)
                for doc_id, doc_data in metadata.items():
                    # Create a minimal Document object for UI purposes
                    # Full chunks are still in ChromaDB
                    doc = Document(
                        id=doc_id,
                        filename=doc_data['filename'],
                        filepath=doc_data.get('filepath', ''),
                        content=doc_data.get('content', ''),
                        chunks=[],  # Chunks are in ChromaDB
                        metadata=doc_data.get('metadata', {}),
                        created_at=datetime.fromisoformat(doc_data['created_at'])
                    )
                    self.documents[doc_id] = doc

                logger.info(f"✓ Loaded {len(self.documents)} existing documents from storage")
            else:
                logger.info("No existing documents found (starting fresh)")

        except Exception as e:
            logger.error(f"Error loading document metadata: {e}")
            logger.info("Starting with empty document list")

    def upload_document(self, files, progress=gr.Progress()):
        """Handle document upload - supports single or multiple files with progress tracking"""
        if files is None or len(files) == 0:
            return "⚠️ Please select file(s) to upload", "", []

        # Convert single file to list for consistent handling
        if not isinstance(files, list):
            files = [files]

        results = []
        successful = 0
        failed = 0
        total_chunks = 0

        # Initialize progress tracking
        progress(0, desc="Starting upload...")

        for file_idx, file in enumerate(files, 1):
            # Update progress
            progress_pct = (file_idx - 1) / len(files)
            progress(progress_pct, desc=f"Processing {file_idx}/{len(files)}: {Path(file.name).name}")

            try:
                logger.info(f"Processing file {file_idx}/{len(files)}: {file.name}")

                # Save uploaded file temporarily
                temp_path = os.path.join(TEMP_UPLOAD_PATH, Path(file.name).name)

                # Copy file content
                with open(temp_path, 'wb') as f:
                    f.write(file.read() if hasattr(file, 'read') else open(file.name, 'rb').read())

                # Process document
                document = self.ingestion_agent.process(temp_path)

                # Store document reference
                self.documents[document.id] = document

                # Track stats
                successful += 1
                total_chunks += document.num_chunks

                # Add to results
                results.append({
                    'status': '✅',
                    'filename': document.filename,
                    'chunks': document.num_chunks,
                    'size': f"{document.total_chars:,} chars"
                })

                # Clean up temp file
                os.remove(temp_path)

            except Exception as e:
                logger.error(f"Error processing {file.name}: {e}")
                failed += 1
                results.append({
                    'status': '❌',
                    'filename': Path(file.name).name,
                    'chunks': 0,
                    'size': f"Error: {str(e)[:50]}"
                })

        # Final progress update
        progress(1.0, desc="Upload complete!")

        # Save metadata once after all uploads
        if successful > 0:
            self._save_documents_metadata()

        # Create summary
        summary = f"""## Upload Complete! 🎉

**Total Files:** {len(files)}
**✅ Successful:** {successful}
**❌ Failed:** {failed}
**Total Chunks Created:** {total_chunks:,}

{f"⚠️ **{failed} file(s) failed** - Check results table below for details" if failed > 0 else "All files processed successfully!"}
"""

        # Create detailed results table
        results_table = [[r['status'], r['filename'], r['chunks'], r['size']] for r in results]

        # Create preview of first successful document
        preview = ""
        for doc in self.documents.values():
            if doc.filename in [r['filename'] for r in results if r['status'] == '✅']:
                preview = doc.content[:500] + "..." if len(doc.content) > 500 else doc.content
                break

        return summary, preview, results_table

    def ask_question(self, question, top_k, progress=gr.Progress()):
        """Handle question answering with progress tracking"""
        if not question.strip():
            return "⚠️ Please enter a question", [], ""

        try:
            # Initial status
            progress(0, desc="Processing your question...")
            status = "🔄 **Searching knowledge base...**\n\nRetrieving relevant documents..."
|
|
||||||
|
logger.info(f"Answering question: {question[:100]}")
|
||||||
|
|
||||||
|
# Update progress
|
||||||
|
progress(0.3, desc="Finding relevant documents...")
|
||||||
|
|
||||||
|
result = self.question_agent.process(question, top_k=top_k)
|
||||||
|
|
||||||
|
# Update progress
|
||||||
|
progress(0.7, desc="Generating answer with LLM...")
|
||||||
|
|
||||||
|
# Format answer
|
||||||
|
answer = f"""### Answer\n\n{result['answer']}\n\n"""
|
||||||
|
|
||||||
|
if result['sources']:
|
||||||
|
answer += f"**Sources:** {result['num_sources']} documents referenced\n\n"
|
||||||
|
|
||||||
|
# Format sources for display
|
||||||
|
sources_data = []
|
||||||
|
for i, source in enumerate(result['sources'], 1):
|
||||||
|
sources_data.append([
|
||||||
|
i,
|
||||||
|
source['document'],
|
||||||
|
f"{source['score']:.2%}",
|
||||||
|
source['preview']
|
||||||
|
])
|
||||||
|
|
||||||
|
progress(1.0, desc="Answer ready!")
|
||||||
|
|
||||||
|
return answer, sources_data, "✅ Answer generated successfully!"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error answering question: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", [], f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def create_summary(self, doc_selector, progress=gr.Progress()):
|
||||||
|
"""Create document summary with progress tracking"""
|
||||||
|
if not doc_selector:
|
||||||
|
return "⚠️ Please select a document to summarize", ""
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Initial status
|
||||||
|
progress(0, desc="Preparing to summarize...")
|
||||||
|
|
||||||
|
logger.info(f'doc_selector : {doc_selector}')
|
||||||
|
doc_id = doc_selector.split(" -|- ")[1]
|
||||||
|
document = self.documents.get(doc_id)
|
||||||
|
|
||||||
|
if not document:
|
||||||
|
return "", "❌ Document not found"
|
||||||
|
|
||||||
|
# Update status
|
||||||
|
status_msg = f"🔄 **Generating summary for:** {document.filename}\n\nPlease wait, this may take 10-20 seconds..."
|
||||||
|
progress(0.3, desc=f"Analyzing {document.filename}...")
|
||||||
|
|
||||||
|
logger.info(f"Creating summary for: {document.filename}")
|
||||||
|
|
||||||
|
# Generate summary
|
||||||
|
summary = self.summary_agent.process(
|
||||||
|
document_id=doc_id,
|
||||||
|
document_name=document.filename
|
||||||
|
)
|
||||||
|
|
||||||
|
progress(1.0, desc="Summary complete!")
|
||||||
|
|
||||||
|
# Format result
|
||||||
|
result = f"""## Summary of {summary.document_name}\n\n{summary.summary_text}\n\n"""
|
||||||
|
|
||||||
|
if summary.key_points:
|
||||||
|
result += "### Key Points\n\n"
|
||||||
|
for point in summary.key_points:
|
||||||
|
result += f"- {point}\n"
|
||||||
|
|
||||||
|
return result, "✅ Summary generated successfully!"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error creating summary: {e}")
|
||||||
|
return "", f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def find_connections(self, doc_selector, top_k, progress=gr.Progress()):
|
||||||
|
"""Find related documents with progress tracking"""
|
||||||
|
if not doc_selector:
|
||||||
|
return "⚠️ Please select a document", [], ""
|
||||||
|
|
||||||
|
try:
|
||||||
|
progress(0, desc="Preparing to find connections...")
|
||||||
|
|
||||||
|
doc_id = doc_selector.split(" -|- ")[1]
|
||||||
|
document = self.documents.get(doc_id)
|
||||||
|
|
||||||
|
if not document:
|
||||||
|
return "❌ Document not found", [], "❌ Document not found"
|
||||||
|
|
||||||
|
status = f"🔄 **Finding documents related to:** {document.filename}\n\nSearching knowledge base..."
|
||||||
|
progress(0.3, desc=f"Analyzing {document.filename}...")
|
||||||
|
|
||||||
|
logger.info(f"Finding connections for: {document.filename}")
|
||||||
|
|
||||||
|
result = self.connection_agent.process(document_id=doc_id, top_k=top_k)
|
||||||
|
|
||||||
|
progress(0.8, desc="Calculating similarity scores...")
|
||||||
|
|
||||||
|
if 'error' in result:
|
||||||
|
return f"❌ Error: {result['error']}", [], f"❌ Error: {result['error']}"
|
||||||
|
|
||||||
|
message = f"""## Related Documents\n\n**Source:** {result['source_document']}\n\n"""
|
||||||
|
message += f"**Found {result['num_related']} related documents:**\n\n"""
|
||||||
|
|
||||||
|
# Format for table
|
||||||
|
table_data = []
|
||||||
|
for i, rel in enumerate(result['related'], 1):
|
||||||
|
table_data.append([
|
||||||
|
i,
|
||||||
|
rel['document_name'],
|
||||||
|
f"{rel['similarity']:.2%}",
|
||||||
|
rel['preview']
|
||||||
|
])
|
||||||
|
|
||||||
|
progress(1.0, desc="Connections found!")
|
||||||
|
|
||||||
|
return message, table_data, "✅ Related documents found!"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error finding connections: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", [], f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def export_knowledge(self, format_choice):
|
||||||
|
"""Export knowledge base"""
|
||||||
|
try:
|
||||||
|
logger.info(f"Exporting as {format_choice}")
|
||||||
|
|
||||||
|
# Get statistics
|
||||||
|
stats = self.ingestion_agent.get_statistics()
|
||||||
|
|
||||||
|
# Create export content
|
||||||
|
content = {
|
||||||
|
'title': 'Knowledge Base Export',
|
||||||
|
'summary': f"Total documents in knowledge base: {len(self.documents)}",
|
||||||
|
'sections': [
|
||||||
|
{
|
||||||
|
'title': 'Documents',
|
||||||
|
'content': '\n'.join([f"- {doc.filename}" for doc in self.documents.values()])
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'title': 'Statistics',
|
||||||
|
'content': f"Total chunks stored: {stats['total_chunks']}"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
# Export
|
||||||
|
if format_choice == "Markdown":
|
||||||
|
output = self.export_agent.process(content, format="markdown")
|
||||||
|
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
|
||||||
|
elif format_choice == "HTML":
|
||||||
|
output = self.export_agent.process(content, format="html")
|
||||||
|
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.html"
|
||||||
|
else: # Text
|
||||||
|
output = self.export_agent.process(content, format="text")
|
||||||
|
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
|
||||||
|
|
||||||
|
# Save file
|
||||||
|
export_path = os.path.join(TEMP_UPLOAD_PATH, filename)
|
||||||
|
with open(export_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(output)
|
||||||
|
|
||||||
|
return f"✅ Exported as {format_choice}", export_path
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error exporting: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", None
|
||||||
|
|
||||||
|
def get_statistics(self):
|
||||||
|
"""Get knowledge base statistics"""
|
||||||
|
try:
|
||||||
|
stats = self.ingestion_agent.get_statistics()
|
||||||
|
|
||||||
|
total_docs = len(self.documents)
|
||||||
|
total_chunks = stats.get('total_chunks', 0)
|
||||||
|
total_chars = sum(doc.total_chars for doc in self.documents.values())
|
||||||
|
|
||||||
|
# Check if data is persisted
|
||||||
|
persistence_status = "✅ Enabled" if os.path.exists(DOCUMENTS_METADATA_PATH) else "⚠️ Not configured"
|
||||||
|
vectorstore_size = self._get_directory_size(VECTORSTORE_PATH)
|
||||||
|
|
||||||
|
stats_text = f"""## Knowledge Base Statistics
|
||||||
|
|
||||||
|
**Persistence Status:** {persistence_status}
|
||||||
|
**Total Documents:** {total_docs}
|
||||||
|
**Total Chunks:** {total_chunks:,}
|
||||||
|
**Total Characters:** {total_chars:,}
|
||||||
|
**Vector Store Size:** {vectorstore_size}
|
||||||
|
|
||||||
|
### Storage Locations
|
||||||
|
- **Vector DB:** `{VECTORSTORE_PATH}/`
|
||||||
|
- **Metadata:** `{DOCUMENTS_METADATA_PATH}`
|
||||||
|
|
||||||
|
**📝 Note:** Your data persists across app restarts!
|
||||||
|
|
||||||
|
**Recent Documents:**
|
||||||
|
|
||||||
|
"""
|
||||||
|
if self.documents:
|
||||||
|
stats_text += "\n".join([f"- {doc.filename} ({doc.num_chunks} chunks, added {doc.created_at.strftime('%Y-%m-%d')})"
|
||||||
|
for doc in list(self.documents.values())[-10:]])
|
||||||
|
else:
|
||||||
|
stats_text += "\n*No documents yet. Upload some to get started!*"
|
||||||
|
|
||||||
|
return stats_text
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def _get_directory_size(self, path):
|
||||||
|
"""Calculate directory size"""
|
||||||
|
try:
|
||||||
|
total_size = 0
|
||||||
|
for dirpath, dirnames, filenames in os.walk(path):
|
||||||
|
for filename in filenames:
|
||||||
|
filepath = os.path.join(dirpath, filename)
|
||||||
|
if os.path.exists(filepath):
|
||||||
|
total_size += os.path.getsize(filepath)
|
||||||
|
|
||||||
|
# Convert to human readable
|
||||||
|
for unit in ['B', 'KB', 'MB', 'GB']:
|
||||||
|
if total_size < 1024.0:
|
||||||
|
return f"{total_size:.1f} {unit}"
|
||||||
|
total_size /= 1024.0
|
||||||
|
return f"{total_size:.1f} TB"
|
||||||
|
except Exception:
|
||||||
|
return "Unknown"
|
||||||
|
|
||||||
|
def get_document_list(self):
|
||||||
|
"""Get list of documents for dropdown"""
|
||||||
|
new_choices = [f"{doc.filename} -|- {doc.id}" for doc in self.documents.values()]
|
||||||
|
return gr.update(choices=new_choices, value=None)
|
||||||
|
|
||||||
|
|
||||||
|
def delete_document(self, doc_selector):
|
||||||
|
"""Delete a document from the knowledge base"""
|
||||||
|
if not doc_selector:
|
||||||
|
return "⚠️ Please select a document to delete", self.get_document_list()
|
||||||
|
|
||||||
|
try:
|
||||||
|
doc_id = doc_selector.split(" - ")[0]
|
||||||
|
document = self.documents.get(doc_id)
|
||||||
|
|
||||||
|
if not document:
|
||||||
|
return "❌ Document not found", self.get_document_list()
|
||||||
|
|
||||||
|
# Delete from ChromaDB
|
||||||
|
success = self.ingestion_agent.delete_document(doc_id)
|
||||||
|
|
||||||
|
if success:
|
||||||
|
# Remove from documents dict
|
||||||
|
filename = document.filename
|
||||||
|
del self.documents[doc_id]
|
||||||
|
|
||||||
|
# Save updated metadata
|
||||||
|
self._save_documents_metadata()
|
||||||
|
|
||||||
|
return f"✅ Deleted: {filename}", self.get_document_list()
|
||||||
|
else:
|
||||||
|
return f"❌ Error deleting document", self.get_document_list()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error deleting document: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", self.get_document_list()
|
||||||
|
|
||||||
|
def clear_all_documents(self):
|
||||||
|
"""Clear entire knowledge base"""
|
||||||
|
try:
|
||||||
|
# Delete collection
|
||||||
|
self.client.delete_collection("knowledge_base")
|
||||||
|
|
||||||
|
# Recreate empty collection
|
||||||
|
self.collection = self.client.create_collection(
|
||||||
|
name="knowledge_base",
|
||||||
|
metadata={"description": "Personal knowledge management collection"}
|
||||||
|
)
|
||||||
|
|
||||||
|
# Update agents with new collection
|
||||||
|
self.ingestion_agent.collection = self.collection
|
||||||
|
self.question_agent.collection = self.collection
|
||||||
|
self.summary_agent.collection = self.collection
|
||||||
|
self.connection_agent.collection = self.collection
|
||||||
|
|
||||||
|
# Clear documents
|
||||||
|
self.documents = {}
|
||||||
|
self._save_documents_metadata()
|
||||||
|
|
||||||
|
return "✅ All documents cleared from knowledge base"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error clearing database: {e}")
|
||||||
|
return f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
def create_ui():
|
||||||
|
"""Create Gradio interface"""
|
||||||
|
|
||||||
|
# Initialize app
|
||||||
|
app = KnowledgeHub()
|
||||||
|
|
||||||
|
# Custom CSS
|
||||||
|
custom_css = """
|
||||||
|
.main-header {
|
||||||
|
text-align: center;
|
||||||
|
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
|
||||||
|
color: white;
|
||||||
|
padding: 30px;
|
||||||
|
border-radius: 10px;
|
||||||
|
margin-bottom: 20px;
|
||||||
|
}
|
||||||
|
.stat-box {
|
||||||
|
background: #f8f9fa;
|
||||||
|
padding: 15px;
|
||||||
|
border-radius: 8px;
|
||||||
|
border-left: 4px solid #667eea;
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
|
||||||
|
with gr.Blocks(title="KnowledgeHub", css=custom_css, theme=gr.themes.Soft()) as interface:
|
||||||
|
|
||||||
|
# Header
|
||||||
|
gr.HTML("""
|
||||||
|
<div class="main-header">
|
||||||
|
<h1>🧠 KnowledgeHub</h1>
|
||||||
|
<p>Personal Knowledge Management & Research Assistant</p>
|
||||||
|
<p style="font-size: 14px; opacity: 0.9;">
|
||||||
|
Powered by Ollama (Llama 3.2) • Fully Local & Private
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
""")
|
||||||
|
|
||||||
|
# Main tabs
|
||||||
|
with gr.Tabs():
|
||||||
|
|
||||||
|
# Tab 1: Upload Documents
|
||||||
|
with gr.Tab("📤 Upload Documents"):
|
||||||
|
gr.Markdown("### Upload your documents to build your knowledge base")
|
||||||
|
gr.Markdown("*Supported formats: PDF, DOCX, TXT, MD, HTML, PY*")
|
||||||
|
gr.Markdown("*💡 Tip: You can select multiple files at once!*")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
file_input = gr.File(
|
||||||
|
label="Select Document(s)",
|
||||||
|
file_types=[".pdf", ".docx", ".txt", ".md", ".html", ".py"],
|
||||||
|
file_count="multiple" # Enable multiple file selection
|
||||||
|
)
|
||||||
|
upload_btn = gr.Button("📤 Upload & Process", variant="primary")
|
||||||
|
|
||||||
|
with gr.Column():
|
||||||
|
upload_status = gr.Markdown("Ready to upload documents")
|
||||||
|
|
||||||
|
# Results table for batch uploads
|
||||||
|
with gr.Row():
|
||||||
|
upload_results = gr.Dataframe(
|
||||||
|
headers=["Status", "Filename", "Chunks", "Size"],
|
||||||
|
label="Upload Results",
|
||||||
|
wrap=True,
|
||||||
|
visible=True
|
||||||
|
)
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
document_preview = gr.Textbox(
|
||||||
|
label="Document Preview (First Uploaded)",
|
||||||
|
lines=10,
|
||||||
|
max_lines=15
|
||||||
|
)
|
||||||
|
|
||||||
|
upload_btn.click(
|
||||||
|
fn=app.upload_document,
|
||||||
|
inputs=[file_input],
|
||||||
|
outputs=[upload_status, document_preview, upload_results]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 2: Ask Questions
|
||||||
|
with gr.Tab("❓ Ask Questions"):
|
||||||
|
gr.Markdown("### Ask questions about your documents")
|
||||||
|
gr.Markdown("*Uses RAG (Retrieval Augmented Generation) to answer based on your knowledge base*")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column(scale=3):
|
||||||
|
question_input = gr.Textbox(
|
||||||
|
label="Your Question",
|
||||||
|
placeholder="What would you like to know?",
|
||||||
|
lines=3
|
||||||
|
)
|
||||||
|
|
||||||
|
with gr.Column(scale=1):
|
||||||
|
top_k_slider = gr.Slider(
|
||||||
|
minimum=1,
|
||||||
|
maximum=10,
|
||||||
|
value=5,
|
||||||
|
step=1,
|
||||||
|
label="Number of sources"
|
||||||
|
)
|
||||||
|
ask_btn = gr.Button("🔍 Ask", variant="primary")
|
||||||
|
|
||||||
|
qa_status = gr.Markdown("Ready to answer questions")
|
||||||
|
answer_output = gr.Markdown(label="Answer")
|
||||||
|
|
||||||
|
sources_table = gr.Dataframe(
|
||||||
|
headers=["#", "Document", "Relevance", "Preview"],
|
||||||
|
label="Sources",
|
||||||
|
wrap=True
|
||||||
|
)
|
||||||
|
|
||||||
|
ask_btn.click(
|
||||||
|
fn=app.ask_question,
|
||||||
|
inputs=[question_input, top_k_slider],
|
||||||
|
outputs=[answer_output, sources_table, qa_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 3: Summarize
|
||||||
|
with gr.Tab("📝 Summarize"):
|
||||||
|
gr.Markdown("### Generate summaries and extract key points")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
doc_selector = gr.Dropdown(
|
||||||
|
choices=[],
|
||||||
|
label="Select Document",
|
||||||
|
info="Choose a document to summarize",
|
||||||
|
allow_custom_value=True
|
||||||
|
)
|
||||||
|
refresh_btn = gr.Button("🔄 Refresh List")
|
||||||
|
summarize_btn = gr.Button("📝 Generate Summary", variant="primary")
|
||||||
|
summary_status = gr.Markdown("Ready to generate summaries")
|
||||||
|
|
||||||
|
with gr.Column(scale=2):
|
||||||
|
summary_output = gr.Markdown(label="Summary")
|
||||||
|
|
||||||
|
summarize_btn.click(
|
||||||
|
fn=app.create_summary,
|
||||||
|
inputs=[doc_selector],
|
||||||
|
outputs=[summary_output, summary_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
refresh_btn.click(
|
||||||
|
fn=app.get_document_list,
|
||||||
|
outputs=[doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 4: Find Connections
|
||||||
|
with gr.Tab("🔗 Find Connections"):
|
||||||
|
gr.Markdown("### Discover relationships between documents")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
conn_doc_selector = gr.Dropdown(
|
||||||
|
choices=[],
|
||||||
|
label="Select Document",
|
||||||
|
info="Find documents related to this one",
|
||||||
|
allow_custom_value=True
|
||||||
|
)
|
||||||
|
conn_top_k = gr.Slider(
|
||||||
|
minimum=1,
|
||||||
|
maximum=10,
|
||||||
|
value=5,
|
||||||
|
step=1,
|
||||||
|
label="Number of related documents"
|
||||||
|
)
|
||||||
|
refresh_conn_btn = gr.Button("🔄 Refresh List")
|
||||||
|
find_btn = gr.Button("🔗 Find Connections", variant="primary")
|
||||||
|
connection_status = gr.Markdown("Ready to find connections")
|
||||||
|
|
||||||
|
connection_output = gr.Markdown(label="Connections")
|
||||||
|
|
||||||
|
connections_table = gr.Dataframe(
|
||||||
|
headers=["#", "Document", "Similarity", "Preview"],
|
||||||
|
label="Related Documents",
|
||||||
|
wrap=True
|
||||||
|
)
|
||||||
|
|
||||||
|
find_btn.click(
|
||||||
|
fn=app.find_connections,
|
||||||
|
inputs=[conn_doc_selector, conn_top_k],
|
||||||
|
outputs=[connection_output, connections_table, connection_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
refresh_conn_btn.click(
|
||||||
|
fn=app.get_document_list,
|
||||||
|
outputs=[conn_doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 5: Export
|
||||||
|
with gr.Tab("💾 Export"):
|
||||||
|
gr.Markdown("### Export your knowledge base")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
format_choice = gr.Radio(
|
||||||
|
choices=["Markdown", "HTML", "Text"],
|
||||||
|
value="Markdown",
|
||||||
|
label="Export Format"
|
||||||
|
)
|
||||||
|
export_btn = gr.Button("💾 Export", variant="primary")
|
||||||
|
|
||||||
|
with gr.Column():
|
||||||
|
export_status = gr.Markdown("Ready to export")
|
||||||
|
export_file = gr.File(label="Download Export")
|
||||||
|
|
||||||
|
export_btn.click(
|
||||||
|
fn=app.export_knowledge,
|
||||||
|
inputs=[format_choice],
|
||||||
|
outputs=[export_status, export_file]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 6: Manage Documents
|
||||||
|
with gr.Tab("🗂️ Manage Documents"):
|
||||||
|
gr.Markdown("### Manage your document library")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
gr.Markdown("#### Delete Document")
|
||||||
|
delete_doc_selector = gr.Dropdown(
|
||||||
|
choices=[],
|
||||||
|
label="Select Document to Delete",
|
||||||
|
info="Choose a document to remove from knowledge base"
|
||||||
|
)
|
||||||
|
with gr.Row():
|
||||||
|
refresh_delete_btn = gr.Button("🔄 Refresh List")
|
||||||
|
delete_btn = gr.Button("🗑️ Delete Document", variant="stop")
|
||||||
|
delete_status = gr.Markdown("")
|
||||||
|
|
||||||
|
with gr.Column():
|
||||||
|
gr.Markdown("#### Clear All Documents")
|
||||||
|
gr.Markdown("⚠️ **Warning:** This will delete your entire knowledge base!")
|
||||||
|
clear_confirm = gr.Textbox(
|
||||||
|
label="Type 'DELETE ALL' to confirm",
|
||||||
|
placeholder="DELETE ALL"
|
||||||
|
)
|
||||||
|
clear_all_btn = gr.Button("🗑️ Clear All Documents", variant="stop")
|
||||||
|
clear_status = gr.Markdown("")
|
||||||
|
|
||||||
|
def confirm_and_clear(confirm_text):
|
||||||
|
if confirm_text.strip() == "DELETE ALL":
|
||||||
|
return app.clear_all_documents()
|
||||||
|
else:
|
||||||
|
return "⚠️ Please type 'DELETE ALL' to confirm"
|
||||||
|
|
||||||
|
delete_btn.click(
|
||||||
|
fn=app.delete_document,
|
||||||
|
inputs=[delete_doc_selector],
|
||||||
|
outputs=[delete_status, delete_doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
refresh_delete_btn.click(
|
||||||
|
fn=app.get_document_list,
|
||||||
|
outputs=[delete_doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
clear_all_btn.click(
|
||||||
|
fn=confirm_and_clear,
|
||||||
|
inputs=[clear_confirm],
|
||||||
|
outputs=[clear_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 7: Statistics
|
||||||
|
with gr.Tab("📊 Statistics"):
|
||||||
|
gr.Markdown("### Knowledge Base Overview")
|
||||||
|
|
||||||
|
stats_output = gr.Markdown()
|
||||||
|
stats_btn = gr.Button("🔄 Refresh Statistics", variant="primary")
|
||||||
|
|
||||||
|
stats_btn.click(
|
||||||
|
fn=app.get_statistics,
|
||||||
|
outputs=[stats_output]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Auto-load stats on tab open
|
||||||
|
interface.load(
|
||||||
|
fn=app.get_statistics,
|
||||||
|
outputs=[stats_output]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Footer
|
||||||
|
gr.HTML("""
|
||||||
|
<div style="text-align: center; margin-top: 30px; padding: 20px; color: #666;">
|
||||||
|
<p>🔒 All processing happens locally on your machine • Your data never leaves your computer</p>
|
||||||
|
<p style="font-size: 12px;">Powered by Ollama, ChromaDB, and Sentence Transformers</p>
|
||||||
|
</div>
|
||||||
|
""")
|
||||||
|
|
||||||
|
return interface
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
logger.info("Starting KnowledgeHub...")
|
||||||
|
|
||||||
|
# Create and launch interface
|
||||||
|
interface = create_ui()
|
||||||
|
interface.launch(
|
||||||
|
server_name="127.0.0.1",
|
||||||
|
server_port=7860,
|
||||||
|
share=False,
|
||||||
|
inbrowser=True
|
||||||
|
)
|
||||||
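The `IngestionAgent`, `QuestionAgent`, and the other agents wired up above live in the `agents/` package, which is not part of this excerpt. As a rough sketch of the retrieve-then-generate flow the Q&A tab relies on, the snippet below shows one way the pieces could fit together; the vector-store path, collection name handling, and prompt wording are assumptions, while `EmbeddingModel` and `OllamaClient` come from `utils/` later in this commit.

```python
# Rough, illustrative RAG sketch - NOT the actual QuestionAgent implementation.
# Assumes ChromaDB data under ./vectorstore and a running local Ollama server.
import chromadb

from utils import EmbeddingModel, OllamaClient

client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection("knowledge_base")
embedding_model = EmbeddingModel()
llm_client = OllamaClient(model="llama3.2")


def answer_question(question: str, top_k: int = 5) -> str:
    # 1. Embed the question and retrieve the top-k most similar chunks.
    results = collection.query(
        query_embeddings=[embedding_model.embed_query(question)],
        n_results=top_k,
    )
    documents = results.get("documents") or [[]]
    context = "\n\n".join(documents[0])

    # 2. Ask the local LLM to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_client.generate(prompt, temperature=0.2)
```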
@@ -0,0 +1,13 @@
|
|||||||
|
"""
|
||||||
|
models
|
||||||
|
"""
|
||||||
|
from .knowledge_graph import KnowledgeGraph
|
||||||
|
from .document import Document, DocumentChunk, SearchResult, Summary
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
'KnowledgeGraph',
|
||||||
|
'Document',
|
||||||
|
'DocumentChunk',
|
||||||
|
'SearchResult',
|
||||||
|
'Summary'
|
||||||
|
]
|
||||||
@@ -0,0 +1,82 @@
|
|||||||
|
"""
|
||||||
|
Document data models
|
||||||
|
"""
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class DocumentChunk:
|
||||||
|
"""Represents a chunk of a document"""
|
||||||
|
id: str
|
||||||
|
document_id: str
|
||||||
|
content: str
|
||||||
|
chunk_index: int
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
preview = self.content[:100] + "..." if len(self.content) > 100 else self.content
|
||||||
|
return f"Chunk {self.chunk_index}: {preview}"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Document:
|
||||||
|
"""Represents a complete document"""
|
||||||
|
id: str
|
||||||
|
filename: str
|
||||||
|
filepath: str
|
||||||
|
content: str
|
||||||
|
chunks: List[DocumentChunk]
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
created_at: datetime = field(default_factory=datetime.now)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def num_chunks(self) -> int:
|
||||||
|
return len(self.chunks)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def total_chars(self) -> int:
|
||||||
|
return len(self.content)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def extension(self) -> str:
|
||||||
|
return self.metadata.get('extension', '')
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"Document: {self.filename} ({self.num_chunks} chunks, {self.total_chars} chars)"
|
||||||
|
|
||||||
|
def to_dict(self) -> Dict:
|
||||||
|
"""Convert to dictionary for storage"""
|
||||||
|
return {
|
||||||
|
'id': self.id,
|
||||||
|
'filename': self.filename,
|
||||||
|
'filepath': self.filepath,
|
||||||
|
'content': self.content[:500] + '...' if len(self.content) > 500 else self.content,
|
||||||
|
'num_chunks': self.num_chunks,
|
||||||
|
'total_chars': self.total_chars,
|
||||||
|
'extension': self.extension,
|
||||||
|
'created_at': self.created_at.isoformat(),
|
||||||
|
'metadata': self.metadata
|
||||||
|
}
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class SearchResult:
|
||||||
|
"""Represents a search result from the vector database"""
|
||||||
|
chunk: DocumentChunk
|
||||||
|
score: float
|
||||||
|
document_id: str
|
||||||
|
document_name: str
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"{self.document_name} (score: {self.score:.2f})"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Summary:
|
||||||
|
"""Represents a document summary"""
|
||||||
|
document_id: str
|
||||||
|
document_name: str
|
||||||
|
summary_text: str
|
||||||
|
key_points: List[str] = field(default_factory=list)
|
||||||
|
created_at: datetime = field(default_factory=datetime.now)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"Summary of {self.document_name}: {self.summary_text[:100]}..."
|
||||||
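A quick usage sketch of the data models above; the values are illustrative only, since in the app these objects are built by the ingestion pipeline.

```python
# Illustrative values; in the app these objects come from the ingestion pipeline.
from models import Document, DocumentChunk

chunk = DocumentChunk(
    id="doc1-0",
    document_id="doc1",
    content="KnowledgeHub stores every document as overlapping text chunks.",
    chunk_index=0,
)
doc = Document(
    id="doc1",
    filename="notes.md",
    filepath="/tmp/notes.md",
    content=chunk.content,
    chunks=[chunk],
    metadata={"extension": ".md"},
)

print(doc)                           # Document: notes.md (1 chunks, 62 chars)
print(doc.to_dict()["extension"])    # .md (via the extension property)
```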
@@ -0,0 +1,110 @@
|
|||||||
|
"""
|
||||||
|
Knowledge Graph data models
|
||||||
|
"""
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from typing import List, Dict, Set
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class KnowledgeNode:
|
||||||
|
"""Represents a concept or entity in the knowledge graph"""
|
||||||
|
id: str
|
||||||
|
name: str
|
||||||
|
node_type: str # 'document', 'concept', 'entity', 'topic'
|
||||||
|
description: str = ""
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
created_at: datetime = field(default_factory=datetime.now)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"{self.node_type.capitalize()}: {self.name}"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class KnowledgeEdge:
|
||||||
|
"""Represents a relationship between nodes"""
|
||||||
|
source_id: str
|
||||||
|
target_id: str
|
||||||
|
relationship: str # 'related_to', 'cites', 'contains', 'similar_to'
|
||||||
|
weight: float = 1.0
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"{self.source_id} --[{self.relationship}]--> {self.target_id}"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class KnowledgeGraph:
|
||||||
|
"""Represents the complete knowledge graph"""
|
||||||
|
nodes: Dict[str, KnowledgeNode] = field(default_factory=dict)
|
||||||
|
edges: List[KnowledgeEdge] = field(default_factory=list)
|
||||||
|
|
||||||
|
def add_node(self, node: KnowledgeNode):
|
||||||
|
"""Add a node to the graph"""
|
||||||
|
self.nodes[node.id] = node
|
||||||
|
|
||||||
|
def add_edge(self, edge: KnowledgeEdge):
|
||||||
|
"""Add an edge to the graph"""
|
||||||
|
if edge.source_id in self.nodes and edge.target_id in self.nodes:
|
||||||
|
self.edges.append(edge)
|
||||||
|
|
||||||
|
def get_neighbors(self, node_id: str) -> List[str]:
|
||||||
|
"""Get all nodes connected to a given node"""
|
||||||
|
neighbors = set()
|
||||||
|
for edge in self.edges:
|
||||||
|
if edge.source_id == node_id:
|
||||||
|
neighbors.add(edge.target_id)
|
||||||
|
elif edge.target_id == node_id:
|
||||||
|
neighbors.add(edge.source_id)
|
||||||
|
return list(neighbors)
|
||||||
|
|
||||||
|
def get_related_documents(self, node_id: str, max_depth: int = 2) -> Set[str]:
|
||||||
|
"""Get all documents related to a node within max_depth hops"""
|
||||||
|
related = set()
|
||||||
|
visited = set()
|
||||||
|
queue = [(node_id, 0)]
|
||||||
|
|
||||||
|
while queue:
|
||||||
|
current_id, depth = queue.pop(0)
|
||||||
|
|
||||||
|
if current_id in visited or depth > max_depth:
|
||||||
|
continue
|
||||||
|
|
||||||
|
visited.add(current_id)
|
||||||
|
|
||||||
|
# If this is a document node, add it
|
||||||
|
if current_id in self.nodes and self.nodes[current_id].node_type == 'document':
|
||||||
|
related.add(current_id)
|
||||||
|
|
||||||
|
# Add neighbors to queue
|
||||||
|
if depth < max_depth:
|
||||||
|
for neighbor_id in self.get_neighbors(current_id):
|
||||||
|
if neighbor_id not in visited:
|
||||||
|
queue.append((neighbor_id, depth + 1))
|
||||||
|
|
||||||
|
return related
|
||||||
|
|
||||||
|
def to_networkx(self):
|
||||||
|
"""Convert to NetworkX graph for visualization"""
|
||||||
|
try:
|
||||||
|
import networkx as nx
|
||||||
|
|
||||||
|
G = nx.Graph()
|
||||||
|
|
||||||
|
# Add nodes
|
||||||
|
for node_id, node in self.nodes.items():
|
||||||
|
G.add_node(node_id,
|
||||||
|
name=node.name,
|
||||||
|
type=node.node_type,
|
||||||
|
description=node.description)
|
||||||
|
|
||||||
|
# Add edges
|
||||||
|
for edge in self.edges:
|
||||||
|
G.add_edge(edge.source_id, edge.target_id,
|
||||||
|
relationship=edge.relationship,
|
||||||
|
weight=edge.weight)
|
||||||
|
|
||||||
|
return G
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"KnowledgeGraph: {len(self.nodes)} nodes, {len(self.edges)} edges"
|
||||||
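A small, illustrative example of the graph API above. The IDs are made up, and nothing in this excerpt shows how the graph is populated inside the app; `KnowledgeNode` and `KnowledgeEdge` are imported from the module directly because only `KnowledgeGraph` is re-exported by `models/__init__.py`.

```python
# Illustrative IDs; BFS-based lookup of related documents through a shared topic node.
from models.knowledge_graph import KnowledgeGraph, KnowledgeNode, KnowledgeEdge

graph = KnowledgeGraph()
graph.add_node(KnowledgeNode(id="doc_a", name="RAG notes", node_type="document"))
graph.add_node(KnowledgeNode(id="doc_b", name="ChromaDB guide", node_type="document"))
graph.add_node(KnowledgeNode(id="topic_rag", name="Retrieval Augmented Generation", node_type="topic"))

graph.add_edge(KnowledgeEdge(source_id="doc_a", target_id="topic_rag", relationship="related_to"))
graph.add_edge(KnowledgeEdge(source_id="doc_b", target_id="topic_rag", relationship="related_to"))

print(graph)                                      # KnowledgeGraph: 3 nodes, 2 edges
print(sorted(graph.get_neighbors("topic_rag")))   # ['doc_a', 'doc_b']
print(graph.get_related_documents("doc_a"))       # {'doc_a', 'doc_b'} (includes the start node)
```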
@@ -0,0 +1,26 @@
|
|||||||
|
# Core Dependencies
gradio>=4.0.0
chromadb>=0.4.0
sentence-transformers>=2.2.0
python-dotenv>=1.0.0

# Document Processing
pypdf>=3.0.0
python-docx>=1.0.0
markdown>=3.4.0
beautifulsoup4>=4.12.0

# Data Processing
numpy>=1.24.0
pandas>=2.0.0
tqdm>=4.65.0

# Visualization
plotly>=5.14.0
networkx>=3.0

# Ollama Client
requests>=2.31.0

# Optional but useful
scikit-learn>=1.3.0
71
community-contributions/sach91-bootcamp/week8/start.bat
Normal file
@@ -0,0 +1,71 @@
|
|||||||
|
@echo off
|
||||||
|
REM KnowledgeHub Startup Script for Windows
|
||||||
|
|
||||||
|
echo 🧠 Starting KnowledgeHub...
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Check if Ollama is installed
|
||||||
|
where ollama >nul 2>nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo ❌ Ollama is not installed or not in PATH
|
||||||
|
echo Please install Ollama from https://ollama.com/download
|
||||||
|
pause
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
REM Check Python
|
||||||
|
where python >nul 2>nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo ❌ Python is not installed or not in PATH
|
||||||
|
echo Please install Python 3.8+ from https://www.python.org/downloads/
|
||||||
|
pause
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Prerequisites found
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Check if Ollama service is running
|
||||||
|
tasklist /FI "IMAGENAME eq ollama.exe" 2>NUL | find /I /N "ollama.exe">NUL
|
||||||
|
if "%ERRORLEVEL%"=="1" (
|
||||||
|
echo ⚠️ Ollama is not running. Please start Ollama first.
|
||||||
|
echo You can start it from the Start menu or by running: ollama serve
|
||||||
|
pause
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Ollama is running
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Check if model exists
|
||||||
|
ollama list | find "llama3.2" >nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo 📥 Llama 3.2 model not found. Pulling model...
|
||||||
|
echo This may take a few minutes on first run...
|
||||||
|
ollama pull llama3.2
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Model ready
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Install dependencies
|
||||||
|
echo 🔍 Checking dependencies...
|
||||||
|
python -c "import gradio" 2>nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo 📦 Installing dependencies...
|
||||||
|
pip install -r requirements.txt
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Dependencies ready
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Launch application
|
||||||
|
echo 🚀 Launching KnowledgeHub...
|
||||||
|
echo The application will open in your browser at http://127.0.0.1:7860
|
||||||
|
echo.
|
||||||
|
echo Press Ctrl+C to stop the application
|
||||||
|
echo.
|
||||||
|
|
||||||
|
python app.py
|
||||||
|
|
||||||
|
pause
|
||||||
42
community-contributions/sach91-bootcamp/week8/start.sh
Executable file
@@ -0,0 +1,42 @@
|
|||||||
|
#!/bin/bash

# KnowledgeHub Startup Script

echo "🧠 Starting KnowledgeHub..."
echo ""

# Check if Ollama is running
if ! pgrep -x "ollama" > /dev/null; then
    echo "⚠️ Ollama is not running. Starting Ollama..."
    ollama serve &
    sleep 3
fi

# Check if llama3.2 model exists
if ! ollama list | grep -q "llama3.2"; then
    echo "📥 Llama 3.2 model not found. Pulling model..."
    echo "This may take a few minutes on first run..."
    ollama pull llama3.2
fi

echo "✅ Ollama is ready"
echo ""

# Check Python dependencies
echo "🔍 Checking dependencies..."
if ! python -c "import gradio" 2>/dev/null; then
    echo "📦 Installing dependencies..."
    pip install -r requirements.txt
fi

echo "✅ Dependencies ready"
echo ""

# Launch the application
echo "🚀 Launching KnowledgeHub..."
echo "The application will open in your browser at http://127.0.0.1:7860"
echo ""
echo "Press Ctrl+C to stop the application"
echo ""

python app.py
@@ -0,0 +1,12 @@
|
|||||||
|
"""
|
||||||
|
models
|
||||||
|
"""
|
||||||
|
from .document_parser import DocumentParser
|
||||||
|
from .embeddings import EmbeddingModel
|
||||||
|
from .ollama_client import OllamaClient
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
'DocumentParser',
|
||||||
|
'EmbeddingModel',
|
||||||
|
'OllamaClient'
|
||||||
|
]
|
||||||
@@ -0,0 +1,218 @@
|
|||||||
|
"""
|
||||||
|
Document Parser - Extract text from various document formats
|
||||||
|
"""
|
||||||
|
import os
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class DocumentParser:
|
||||||
|
"""Parse various document formats into text chunks"""
|
||||||
|
|
||||||
|
SUPPORTED_FORMATS = ['.pdf', '.docx', '.txt', '.md', '.html', '.py']
|
||||||
|
|
||||||
|
def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
|
||||||
|
"""
|
||||||
|
Initialize document parser
|
||||||
|
|
||||||
|
Args:
|
||||||
|
chunk_size: Maximum characters per chunk
|
||||||
|
chunk_overlap: Overlap between chunks for context preservation
|
||||||
|
"""
|
||||||
|
self.chunk_size = chunk_size
|
||||||
|
self.chunk_overlap = chunk_overlap
|
||||||
|
|
||||||
|
def parse_file(self, file_path: str) -> Dict:
|
||||||
|
"""
|
||||||
|
Parse a file and return structured document data
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with document metadata and chunks
|
||||||
|
"""
|
||||||
|
path = Path(file_path)
|
||||||
|
|
||||||
|
if not path.exists():
|
||||||
|
raise FileNotFoundError(f"File not found: {file_path}")
|
||||||
|
|
||||||
|
extension = path.suffix.lower()
|
||||||
|
|
||||||
|
if extension not in self.SUPPORTED_FORMATS:
|
||||||
|
raise ValueError(f"Unsupported format: {extension}")
|
||||||
|
|
||||||
|
# Extract text based on file type
|
||||||
|
if extension == '.pdf':
|
||||||
|
text = self._parse_pdf(file_path)
|
||||||
|
elif extension == '.docx':
|
||||||
|
text = self._parse_docx(file_path)
|
||||||
|
elif extension == '.txt' or extension == '.py':
|
||||||
|
text = self._parse_txt(file_path)
|
||||||
|
elif extension == '.md':
|
||||||
|
text = self._parse_markdown(file_path)
|
||||||
|
elif extension == '.html':
|
||||||
|
text = self._parse_html(file_path)
|
||||||
|
else:
|
||||||
|
text = ""
|
||||||
|
|
||||||
|
# Create chunks
|
||||||
|
chunks = self._create_chunks(text)
|
||||||
|
|
||||||
|
return {
|
||||||
|
'filename': path.name,
|
||||||
|
'filepath': str(path.absolute()),
|
||||||
|
'extension': extension,
|
||||||
|
'text': text,
|
||||||
|
'chunks': chunks,
|
||||||
|
'num_chunks': len(chunks),
|
||||||
|
'total_chars': len(text)
|
||||||
|
}
|
||||||
|
|
||||||
|
def _parse_pdf(self, file_path: str) -> str:
|
||||||
|
"""Extract text from PDF"""
|
||||||
|
try:
|
||||||
|
from pypdf import PdfReader
|
||||||
|
|
||||||
|
reader = PdfReader(file_path)
|
||||||
|
text = ""
|
||||||
|
|
||||||
|
for page in reader.pages:
|
||||||
|
text += page.extract_text() + "\n\n"
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
logger.error("pypdf not installed. Install with: pip install pypdf")
|
||||||
|
return ""
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing PDF: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_docx(self, file_path: str) -> str:
|
||||||
|
"""Extract text from DOCX"""
|
||||||
|
try:
|
||||||
|
from docx import Document
|
||||||
|
|
||||||
|
doc = Document(file_path)
|
||||||
|
text = "\n\n".join([para.text for para in doc.paragraphs if para.text.strip()])
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
logger.error("python-docx not installed. Install with: pip install python-docx")
|
||||||
|
return ""
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing DOCX: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_txt(self, file_path: str) -> str:
|
||||||
|
"""Extract text from TXT"""
|
||||||
|
try:
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
return f.read().strip()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing TXT: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_markdown(self, file_path: str) -> str:
|
||||||
|
"""Extract text from Markdown"""
|
||||||
|
try:
|
||||||
|
import markdown
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
md_text = f.read()
|
||||||
|
|
||||||
|
# Convert markdown to HTML then extract text
|
||||||
|
html = markdown.markdown(md_text)
|
||||||
|
soup = BeautifulSoup(html, 'html.parser')
|
||||||
|
text = soup.get_text()
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
# Fallback: just read as plain text
|
||||||
|
return self._parse_txt(file_path)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing Markdown: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_html(self, file_path: str) -> str:
|
||||||
|
"""Extract text from HTML"""
|
||||||
|
try:
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
html = f.read()
|
||||||
|
|
||||||
|
soup = BeautifulSoup(html, 'html.parser')
|
||||||
|
|
||||||
|
# Remove script and style elements
|
||||||
|
for script in soup(["script", "style"]):
|
||||||
|
script.decompose()
|
||||||
|
|
||||||
|
text = soup.get_text()
|
||||||
|
|
||||||
|
# Clean up whitespace
|
||||||
|
lines = (line.strip() for line in text.splitlines())
|
||||||
|
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
|
||||||
|
text = '\n'.join(chunk for chunk in chunks if chunk)
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
logger.error("beautifulsoup4 not installed. Install with: pip install beautifulsoup4")
|
||||||
|
return ""
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing HTML: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _create_chunks(self, text: str) -> List[str]:
|
||||||
|
"""
|
||||||
|
Split text into overlapping chunks
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Full text to chunk
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of text chunks
|
||||||
|
"""
|
||||||
|
if not text:
|
||||||
|
return []
|
||||||
|
|
||||||
|
chunks = []
|
||||||
|
start = 0
|
||||||
|
text_length = len(text)
|
||||||
|
|
||||||
|
while start < text_length:
|
||||||
|
logger.info(f'Processing chunk at {start}, for len {text_length}.')
|
||||||
|
|
||||||
|
end = start + self.chunk_size
|
||||||
|
|
||||||
|
# If this isn't the last chunk, try to break at a sentence or paragraph
|
||||||
|
if end < text_length:
|
||||||
|
# Look for paragraph break first
|
||||||
|
break_pos = text.rfind('\n\n', start, end)
|
||||||
|
if break_pos == -1:
|
||||||
|
# Look for sentence break
|
||||||
|
break_pos = text.rfind('. ', start, end)
|
||||||
|
if break_pos == -1:
|
||||||
|
# Look for any space
|
||||||
|
break_pos = text.rfind(' ', start, end)
|
||||||
|
|
||||||
|
if break_pos != -1 and break_pos > start and break_pos > end - self.chunk_overlap:
|
||||||
|
end = break_pos + 1
|
||||||
|
|
||||||
|
chunk = text[start:end].strip()
|
||||||
|
if chunk:
|
||||||
|
chunks.append(chunk)
|
||||||
|
|
||||||
|
# Move start position with overlap
|
||||||
|
start = end - self.chunk_overlap
|
||||||
|
if start < 0:
|
||||||
|
start = 0
|
||||||
|
|
||||||
|
return chunks
|
||||||
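A short usage sketch for `DocumentParser`; the sample path is a placeholder, and any supported extension would work the same way.

```python
# Usage sketch; "docs/example.md" is a placeholder path, not part of this commit.
from utils import DocumentParser

parser = DocumentParser(chunk_size=1000, chunk_overlap=200)
parsed = parser.parse_file("docs/example.md")

print(parsed["filename"], parsed["num_chunks"], parsed["total_chars"])
for i, chunk in enumerate(parsed["chunks"][:3]):
    print(f"chunk {i}: {chunk[:80]}...")
```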
@@ -0,0 +1,84 @@
|
|||||||
|
"""
|
||||||
|
Embeddings utility using sentence-transformers
|
||||||
|
"""
|
||||||
|
from sentence_transformers import SentenceTransformer
|
||||||
|
import numpy as np
|
||||||
|
from typing import List, Union
|
||||||
|
import logging
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class EmbeddingModel:
|
||||||
|
"""Wrapper for sentence transformer embeddings"""
|
||||||
|
|
||||||
|
def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
|
||||||
|
"""
|
||||||
|
Initialize embedding model
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_name: HuggingFace model name for embeddings
|
||||||
|
"""
|
||||||
|
self.model_name = model_name
|
||||||
|
logger.info(f"Loading embedding model: {model_name}")
|
||||||
|
self.model = SentenceTransformer(model_name)
|
||||||
|
self.dimension = self.model.get_sentence_embedding_dimension()
|
||||||
|
logger.info(f"Embedding dimension: {self.dimension}")
|
||||||
|
|
||||||
|
def embed(self, texts: Union[str, List[str]]) -> np.ndarray:
|
||||||
|
"""
|
||||||
|
Generate embeddings for text(s)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
texts: Single text or list of texts
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Numpy array of embeddings
|
||||||
|
"""
|
||||||
|
if isinstance(texts, str):
|
||||||
|
texts = [texts]
|
||||||
|
|
||||||
|
embeddings = self.model.encode(texts, show_progress_bar=False)
|
||||||
|
return embeddings
|
||||||
|
|
||||||
|
def embed_query(self, query: str) -> List[float]:
|
||||||
|
"""
|
||||||
|
Embed a single query - returns as list for ChromaDB compatibility
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query: Query text
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of floats representing the embedding
|
||||||
|
"""
|
||||||
|
embedding = self.model.encode([query], show_progress_bar=False)[0]
|
||||||
|
return embedding.tolist()
|
||||||
|
|
||||||
|
def embed_documents(self, documents: List[str]) -> List[List[float]]:
|
||||||
|
"""
|
||||||
|
Embed multiple documents - returns as list of lists for ChromaDB
|
||||||
|
|
||||||
|
Args:
|
||||||
|
documents: List of document texts
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of embeddings (each as list of floats)
|
||||||
|
"""
|
||||||
|
embeddings = self.model.encode(documents, show_progress_bar=False)
|
||||||
|
return embeddings.tolist()
|
||||||
|
|
||||||
|
def similarity(self, text1: str, text2: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate cosine similarity between two texts
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text1: First text
|
||||||
|
text2: Second text
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Similarity score between 0 and 1
|
||||||
|
"""
|
||||||
|
emb1, emb2 = self.model.encode([text1, text2])
|
||||||
|
|
||||||
|
# Cosine similarity
|
||||||
|
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
|
||||||
|
return float(similarity)
|
||||||
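A short usage sketch for `EmbeddingModel`; the all-MiniLM-L6-v2 weights are downloaded on first use and produce 384-dimensional vectors.

```python
# Usage sketch; illustrative strings only.
from utils import EmbeddingModel

embedder = EmbeddingModel()

query_vec = embedder.embed_query("How do I export my knowledge base?")
doc_vecs = embedder.embed_documents([
    "Exports can be saved as Markdown, HTML, or plain text.",
    "ChromaDB persists vectors in the local vector store.",
])

print(len(query_vec), embedder.dimension)    # 384 384
print(len(doc_vecs), len(doc_vecs[0]))       # 2 384
print(embedder.similarity("export my notes", "download my documents"))  # cosine score
```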
@@ -0,0 +1,107 @@
|
|||||||
|
"""
|
||||||
|
Ollama Client - Wrapper for local Ollama API
|
||||||
|
"""
|
||||||
|
import requests
|
||||||
|
import json
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
import logging
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class OllamaClient:
|
||||||
|
"""Client for interacting with local Ollama models"""
|
||||||
|
|
||||||
|
def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3.2"):
|
||||||
|
self.base_url = base_url
|
||||||
|
self.model = model
|
||||||
|
self.api_url = f"{base_url}/api"
|
||||||
|
|
||||||
|
def generate(self, prompt: str, system: Optional[str] = None,
|
||||||
|
temperature: float = 0.7, max_tokens: int = 2048) -> str:
|
||||||
|
"""Generate text from a prompt"""
|
||||||
|
try:
|
||||||
|
payload = {
|
||||||
|
"model": self.model,
|
||||||
|
"prompt": prompt,
|
||||||
|
"stream": False,
|
||||||
|
"options": {
|
||||||
|
"temperature": temperature,
|
||||||
|
"num_predict": max_tokens
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if system:
|
||||||
|
payload["system"] = system
|
||||||
|
|
||||||
|
response = requests.post(
|
||||||
|
f"{self.api_url}/generate",
|
||||||
|
json=payload,
|
||||||
|
timeout=1200
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
result = response.json()
|
||||||
|
return result.get("response", "").strip()
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Ollama API error: {e}")
|
||||||
|
return f"Error: Unable to connect to Ollama. Is it running? ({str(e)})"
|
||||||
|
|
||||||
|
def chat(self, messages: List[Dict[str, str]],
|
||||||
|
temperature: float = 0.7, max_tokens: int = 2048) -> str:
|
||||||
|
"""Chat completion with message history"""
|
||||||
|
try:
|
||||||
|
payload = {
|
||||||
|
"model": self.model,
|
||||||
|
"messages": messages,
|
||||||
|
"stream": False,
|
||||||
|
"options": {
|
||||||
|
"temperature": temperature,
|
||||||
|
"num_predict": max_tokens
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
response = requests.post(
|
||||||
|
f"{self.api_url}/chat",
|
||||||
|
json=payload,
|
||||||
|
timeout=1200
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
result = response.json()
|
||||||
|
return result.get("message", {}).get("content", "").strip()
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Ollama API error: {e}")
|
||||||
|
return f"Error: Unable to connect to Ollama. Is it running? ({str(e)})"
|
||||||
|
|
||||||
|
def check_connection(self) -> bool:
|
||||||
|
"""Check if Ollama is running and model is available"""
|
||||||
|
try:
|
||||||
|
response = requests.get(f"{self.base_url}/api/tags", timeout=5)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
models = response.json().get("models", [])
|
||||||
|
model_names = [m["name"] for m in models]
|
||||||
|
|
||||||
|
if self.model not in model_names:
|
||||||
|
logger.warning(f"Model {self.model} not found. Available: {model_names}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Cannot connect to Ollama: {e}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
def list_models(self) -> List[str]:
|
||||||
|
"""List available Ollama models"""
|
||||||
|
try:
|
||||||
|
response = requests.get(f"{self.base_url}/api/tags", timeout=5)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
models = response.json().get("models", [])
|
||||||
|
return [m["name"] for m in models]
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException:
|
||||||
|
return []
|
||||||
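A short usage sketch for `OllamaClient`, assuming `ollama serve` is running locally and `ollama pull llama3.2` has completed. Note that `check_connection` compares the model name against `ollama list` exactly, so a tagged name such as `llama3.2:latest` may need to be passed explicitly.

```python
# Usage sketch; requires a local Ollama server with the llama3.2 model available.
from utils import OllamaClient

llm = OllamaClient(model="llama3.2")

if llm.check_connection():
    print(llm.list_models())
    print(llm.generate(
        "In two sentences, why does local inference help with privacy?",
        system="You are a concise technical assistant.",
        temperature=0.3,
    ))
    print(llm.chat([
        {"role": "user", "content": "Give one tip for chunking long documents."}
    ]))
else:
    print("Ollama is not reachable at http://localhost:11434")
```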
129
community-contributions/sach91-bootcamp/week8/verify_setup.py
Normal file
@@ -0,0 +1,129 @@
|
|||||||
|
"""
|
||||||
|
Setup Verification Script for KnowledgeHub
|
||||||
|
Run this to check if everything is configured correctly
|
||||||
|
"""
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
|
||||||
|
print("🔍 KnowledgeHub Setup Verification\n")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
# Check Python version
|
||||||
|
print(f"✓ Python version: {sys.version}")
|
||||||
|
print(f"✓ Python executable: {sys.executable}")
|
||||||
|
print(f"✓ Current directory: {os.getcwd()}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check directory structure
|
||||||
|
print("📁 Checking directory structure...")
|
||||||
|
required_dirs = ['agents', 'models', 'utils']
|
||||||
|
for dir_name in required_dirs:
|
||||||
|
if os.path.isdir(dir_name):
|
||||||
|
init_file = os.path.join(dir_name, '__init__.py')
|
||||||
|
if os.path.exists(init_file):
|
||||||
|
print(f" ✓ {dir_name}/ exists with __init__.py")
|
||||||
|
else:
|
||||||
|
print(f" ⚠️ {dir_name}/ exists but missing __init__.py")
|
||||||
|
else:
|
||||||
|
print(f" ❌ {dir_name}/ directory not found")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check required files
|
||||||
|
print("📄 Checking required files...")
|
||||||
|
required_files = ['app.py', 'requirements.txt']
|
||||||
|
for file_name in required_files:
|
||||||
|
if os.path.exists(file_name):
|
||||||
|
print(f" ✓ {file_name} exists")
|
||||||
|
else:
|
||||||
|
print(f" ❌ {file_name} not found")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Try importing modules
|
||||||
|
print("📦 Testing imports...")
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
from utils import OllamaClient, EmbeddingModel, DocumentParser
|
||||||
|
print(" ✓ utils module imported successfully")
|
||||||
|
except ImportError as e:
|
||||||
|
print(f" ❌ Cannot import utils: {e}")
|
||||||
|
errors.append(str(e))
|
||||||
|
|
||||||
|
try:
|
||||||
|
from models import Document, DocumentChunk, SearchResult, Summary
|
||||||
|
print(" ✓ models module imported successfully")
|
||||||
|
except ImportError as e:
|
||||||
|
print(f" ❌ Cannot import models: {e}")
|
||||||
|
errors.append(str(e))
|
||||||
|
|
||||||
|
try:
|
||||||
|
from agents import (
|
||||||
|
IngestionAgent, QuestionAgent, SummaryAgent,
|
||||||
|
ConnectionAgent, ExportAgent
|
||||||
|
)
|
||||||
|
print(" ✓ agents module imported successfully")
|
||||||
|
except ImportError as e:
|
||||||
|
print(f" ❌ Cannot import agents: {e}")
|
||||||
|
errors.append(str(e))
|
||||||
|
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check dependencies
|
||||||
|
print("📚 Checking Python dependencies...")
|
||||||
|
required_packages = [
|
||||||
|
'gradio', 'chromadb', 'sentence_transformers',
|
||||||
|
'requests', 'numpy', 'tqdm'
|
||||||
|
]
|
||||||
|
|
||||||
|
missing_packages = []
|
||||||
|
for package in required_packages:
|
||||||
|
try:
|
||||||
|
__import__(package.replace('-', '_'))
|
||||||
|
print(f" ✓ {package} installed")
|
||||||
|
except ImportError:
|
||||||
|
print(f" ❌ {package} not installed")
|
||||||
|
missing_packages.append(package)
|
||||||
|
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check Ollama
|
||||||
|
print("🤖 Checking Ollama...")
|
||||||
|
try:
|
||||||
|
import requests
|
||||||
|
response = requests.get('http://localhost:11434/api/tags', timeout=2)
|
||||||
|
if response.status_code == 200:
|
||||||
|
print(" ✓ Ollama is running")
|
||||||
|
models = response.json().get('models', [])
|
||||||
|
if models:
|
||||||
|
print(f" ✓ Available models: {[m['name'] for m in models]}")
|
||||||
|
if any('llama3.2' in m['name'] for m in models):
|
||||||
|
print(" ✓ llama3.2 model found")
|
||||||
|
else:
|
||||||
|
print(" ⚠️ llama3.2 model not found. Run: ollama pull llama3.2")
|
||||||
|
else:
|
||||||
|
print(" ⚠️ No models found. Run: ollama pull llama3.2")
|
||||||
|
else:
|
||||||
|
print(" ⚠️ Ollama responded but with error")
|
||||||
|
except Exception as e:
|
||||||
|
print(f" ❌ Cannot connect to Ollama: {e}")
|
||||||
|
print(" Start Ollama with: ollama serve")
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
# Final summary
|
||||||
|
if errors or missing_packages:
|
||||||
|
print("\n⚠️ ISSUES FOUND:\n")
|
||||||
|
if errors:
|
||||||
|
print("Import Errors:")
|
||||||
|
for error in errors:
|
||||||
|
print(f" - {error}")
|
||||||
|
if missing_packages:
|
||||||
|
print("\nMissing Packages:")
|
||||||
|
print(f" Run: pip install {' '.join(missing_packages)}")
|
||||||
|
print("\n💡 Fix these issues before running app.py")
|
||||||
|
else:
|
||||||
|
print("\n✅ All checks passed! You're ready to run:")
|
||||||
|
print(" python app.py")
|
||||||
|
|
||||||
|
print()
|
||||||
Block a user