sach91 bootcamp week8 exercise
259
community-contributions/sach91-bootcamp/week8/README.md
Normal file
@@ -0,0 +1,259 @@
# 🧠 KnowledgeHub - Personal Knowledge Management & Research Assistant

An elegant, fully local AI-powered knowledge management system that helps you organize, search, and understand your documents using state-of-the-art LLM technology.

## ✨ Features

### 🎯 Core Capabilities
- **📤 Document Ingestion**: Upload PDF, DOCX, TXT, MD, and HTML files
- **❓ Intelligent Q&A**: Ask questions and get answers from your documents using RAG
- **📝 Smart Summarization**: Generate concise summaries with key points
- **🔗 Connection Discovery**: Find relationships between documents
- **💾 Multi-format Export**: Export as Markdown, HTML, or plain text
- **📊 Statistics Dashboard**: Track your knowledge base growth

### 🔒 Privacy-First
- **100% Local Processing**: All data stays on your machine
- **No Cloud Dependencies**: Uses Ollama for local LLM inference
- **Open Source**: Full transparency and control

### ⚡ Technology Stack
- **LLM**: Ollama with Llama 3.2 (3B) or Llama 3.1 (8B)
- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2)
- **Vector Database**: ChromaDB
- **UI**: Gradio
- **Document Processing**: pypdf, python-docx, beautifulsoup4

## 🚀 Quick Start

### Prerequisites

1. **Python 3.8+** installed
2. **Ollama** installed and running

#### Installing Ollama

**macOS/Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download from [ollama.com/download](https://ollama.com/download)

### Installation

1. **Clone or download this repository**

2. **Install Python dependencies:**
```bash
pip install -r requirements.txt
```

3. **Pull a Llama model using Ollama:**
```bash
# For faster inference (recommended for most users)
ollama pull llama3.2

# OR for better quality (requires more RAM)
ollama pull llama3.1
```

4. **Start the Ollama server** (if not already running):
```bash
ollama serve
```

5. **Launch KnowledgeHub:**
```bash
python app.py
```

The application will open in your browser at `http://127.0.0.1:7860`.

## 📖 Usage Guide

### 1. Upload Documents
- Go to the "Upload Documents" tab
- Select a file (PDF, DOCX, TXT, MD, or HTML)
- Click "Upload & Process"
- The document will be chunked and stored in your local vector database

### 2. Ask Questions
- Go to the "Ask Questions" tab
- Type your question in natural language
- Adjust the number of sources to retrieve (default: 5)
- Click "Ask" to get an AI-generated answer with sources

### 3. Summarize Documents
- Go to the "Summarize" tab
- Select a document from the dropdown
- Click "Generate Summary"
- Get a concise summary with key points

### 4. Find Connections
- Go to the "Find Connections" tab
- Select a document to analyze
- Adjust how many related documents to find
- See documents that are semantically similar

### 5. Export Knowledge
- Go to the "Export" tab
- Choose your format (Markdown, HTML, or Text)
- Click "Export" to download your knowledge base

### 6. View Statistics
- Go to the "Statistics" tab
- See an overview of your knowledge base
- Track total documents, chunks, and characters

## 🏗️ Architecture

```
KnowledgeHub/
├── agents/                  # Specialized AI agents
│   ├── base_agent.py        # Base class for all agents
│   ├── ingestion_agent.py   # Document processing
│   ├── question_agent.py    # RAG-based Q&A
│   ├── summary_agent.py     # Summarization
│   ├── connection_agent.py  # Finding relationships
│   └── export_agent.py      # Exporting data
├── models/                  # Data models
│   ├── document.py          # Document structures
│   └── knowledge_graph.py   # Graph structures
├── utils/                   # Utilities
│   ├── ollama_client.py     # Ollama API wrapper
│   ├── embeddings.py        # Embedding generation
│   └── document_parser.py   # File parsing
├── vectorstore/             # ChromaDB storage (auto-created)
├── temp_uploads/            # Temporary file storage (auto-created)
├── app.py                   # Main Gradio application
└── requirements.txt         # Python dependencies
```

## 🎯 Multi-Agent Framework

KnowledgeHub uses a multi-agent architecture:

1. **Ingestion Agent**: Parses documents, creates chunks, generates embeddings
2. **Question Agent**: Retrieves relevant context and answers questions
3. **Summary Agent**: Creates concise summaries and extracts key points
4. **Connection Agent**: Finds semantic relationships between documents
5. **Export Agent**: Formats and exports knowledge in multiple formats

Each agent is independent, reusable, and focused on a specific task, following best practices in agentic AI development.
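
The same agents can also be driven programmatically, without the Gradio UI. A minimal sketch, assuming the module layout shown above and a running local Ollama server (the constructor arguments mirror the ones used in `app.py`; the PDF path is a hypothetical example):

```python
# Minimal sketch of composing two agents outside the Gradio UI.
import chromadb
from utils import OllamaClient, EmbeddingModel
from agents import IngestionAgent, QuestionAgent

client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection(name="knowledge_base")

llm = OllamaClient(model="llama3.2")   # shared local LLM client
embedder = EmbeddingModel()            # all-MiniLM-L6-v2 by default

ingestion = IngestionAgent(collection=collection, embedding_model=embedder, llm_client=llm)
qa = QuestionAgent(collection=collection, embedding_model=embedder, llm_client=llm)

document = ingestion.process("notes/my_paper.pdf")   # hypothetical path: chunk, embed, store
result = qa.process("What are the main findings?", top_k=5)
print(result['answer'])
```

Because every agent takes the same shared `llm_client` and ChromaDB collection, new agents can be added without touching the existing ones.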

## ⚙️ Configuration

### Changing Models

Edit `app.py` to use different models:

```python
# For Llama 3.1 8B (better quality, more RAM)
self.llm_client = OllamaClient(model="llama3.1")

# For Llama 3.2 3B (faster, less RAM)
self.llm_client = OllamaClient(model="llama3.2")
```

### Adjusting Chunk Size

Edit `agents/ingestion_agent.py`:

```python
self.parser = DocumentParser(
    chunk_size=1000,    # Characters per chunk
    chunk_overlap=200   # Overlap between chunks
)
```

### Changing Embedding Model

Edit `app.py`:

```python
self.embedding_model = EmbeddingModel(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```

## 🔧 Troubleshooting

### "Cannot connect to Ollama"
- Ensure Ollama is installed: `ollama --version`
- Start the Ollama service: `ollama serve`
- Verify the model is pulled: `ollama list` (see the quick check below)
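
You can also confirm connectivity from Python using the project's own wrapper. A quick sketch, assuming Ollama is listening on its default local address:

```python
# Quick connectivity check using the bundled client wrapper.
from utils.ollama_client import OllamaClient

client = OllamaClient(model="llama3.2")
print("Ollama reachable:", client.check_connection())  # the same check app.py runs at startup
```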

### "Module not found" errors
- Ensure all dependencies are installed: `pip install -r requirements.txt`
- Try upgrading pip: `pip install --upgrade pip`

### "Out of memory" errors
- Use Llama 3.2 (3B) instead of Llama 3.1 (8B)
- Reduce `chunk_size` in the document parser
- Process fewer documents at once

### Slow response times
- Ensure you're using a CUDA-enabled GPU (if available)
- Reduce the number of retrieved chunks (the `top_k` parameter; see the sketch below)
- Use a smaller model (llama3.2)
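
For example, retrieval depth can be lowered programmatically through the Question Agent. A sketch, where `hub` stands in for an existing `KnowledgeHub` instance from `app.py` and the question text is illustrative:

```python
# Sketch: retrieve fewer chunks per question to cut prompt size and latency.
hub.question_agent.set_top_k(3)          # default is 5
result = hub.question_agent.process("What does chapter 2 cover?")
```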

## 🎓 Learning Resources

This project demonstrates key concepts in LLM engineering:

- **RAG (Retrieval-Augmented Generation)**: Combining retrieval with generation
- **Vector Databases**: Using ChromaDB for semantic search
- **Multi-Agent Systems**: Specialized agents working together
- **Embeddings**: Semantic representation of text
- **Local LLM Deployment**: Using Ollama for privacy-focused AI

## 📊 Performance

**Hardware Requirements:**
- Minimum: 8 GB RAM, CPU only
- Recommended: 16 GB RAM, NVIDIA GPU with CUDA
- Optimal: 32 GB RAM, GPU (RTX 3060 or better)

**Processing Speed** (Llama 3.2 on an M1 Mac):
- Document ingestion: ~2-5 seconds per page
- Question answering: ~5-15 seconds
- Summarization: ~10-20 seconds

## 🤝 Contributing

This is a learning project showcasing LLM engineering principles. Feel free to:
- Experiment with different models
- Add new agents for specialized tasks
- Improve the UI
- Optimize performance

## 📄 License

This project is open source and available for educational purposes.

## 🙏 Acknowledgments

Built with:
- [Ollama](https://ollama.com/) - Local LLM runtime
- [Gradio](https://gradio.app/) - UI framework
- [ChromaDB](https://www.trychroma.com/) - Vector database
- [Sentence Transformers](https://www.sbert.net/) - Embeddings
- [Llama](https://ai.meta.com/llama/) - Meta's open-source LLMs

## 🎯 Next Steps

Potential enhancements:
1. Add support for images and diagrams
2. Implement multi-document chat history
3. Build a visual knowledge graph
4. Add collaborative features
5. Create a mobile app interface
6. Implement advanced filters and search
7. Add citation tracking
8. Create automated study guides

---

**Made with ❤️ for the LLM Engineering Community**
@@ -0,0 +1,18 @@
"""
KnowledgeHub Agents
"""
from .base_agent import BaseAgent
from .ingestion_agent import IngestionAgent
from .question_agent import QuestionAgent
from .summary_agent import SummaryAgent
from .connection_agent import ConnectionAgent
from .export_agent import ExportAgent

__all__ = [
    'BaseAgent',
    'IngestionAgent',
    'QuestionAgent',
    'SummaryAgent',
    'ConnectionAgent',
    'ExportAgent'
]
@@ -0,0 +1,91 @@
"""
Base Agent class - Foundation for all specialized agents
"""
from abc import ABC, abstractmethod
import logging
from typing import Optional, Dict, Any
from utils.ollama_client import OllamaClient

logger = logging.getLogger(__name__)


class BaseAgent(ABC):
    """Abstract base class for all agents"""

    def __init__(self, name: str, llm_client: Optional[OllamaClient] = None,
                 model: str = "llama3.2"):
        """
        Initialize base agent

        Args:
            name: Agent name for logging
            llm_client: Shared Ollama client (creates new one if None)
            model: Ollama model to use
        """
        self.name = name
        self.model = model

        # Use shared client or create new one
        if llm_client is None:
            self.llm = OllamaClient(model=model)
            logger.info(f"{self.name} initialized with new LLM client (model: {model})")
        else:
            self.llm = llm_client
            logger.info(f"{self.name} initialized with shared LLM client (model: {model})")

    def generate(self, prompt: str, system: Optional[str] = None,
                 temperature: float = 0.7, max_tokens: int = 2048) -> str:
        """
        Generate text using the LLM

        Args:
            prompt: User prompt
            system: System message (optional)
            temperature: Sampling temperature
            max_tokens: Maximum tokens to generate

        Returns:
            Generated text
        """
        logger.info(f"{self.name} generating response")
        response = self.llm.generate(
            prompt=prompt,
            system=system,
            temperature=temperature,
            max_tokens=max_tokens
        )
        logger.debug(f"{self.name} generated {len(response)} characters")
        return response

    def chat(self, messages: list, temperature: float = 0.7,
             max_tokens: int = 2048) -> str:
        """
        Chat completion with message history

        Args:
            messages: List of message dicts with 'role' and 'content'
            temperature: Sampling temperature
            max_tokens: Maximum tokens to generate

        Returns:
            Generated text
        """
        logger.info(f"{self.name} processing chat with {len(messages)} messages")
        response = self.llm.chat(
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        logger.debug(f"{self.name} generated {len(response)} characters")
        return response

    @abstractmethod
    def process(self, *args, **kwargs) -> Any:
        """
        Main processing method - must be implemented by subclasses

        Each agent implements its specialized logic here
        """
        pass

    def __str__(self):
        return f"{self.name} (model: {self.model})"
@@ -0,0 +1,289 @@
"""
Connection Agent - Finds relationships and connections between documents
"""
import logging
from typing import List, Dict, Tuple
from agents.base_agent import BaseAgent
from models.knowledge_graph import KnowledgeNode, KnowledgeEdge, KnowledgeGraph
from utils.embeddings import EmbeddingModel
import chromadb
import numpy as np

logger = logging.getLogger(__name__)


class ConnectionAgent(BaseAgent):
    """Agent that discovers connections between documents and concepts"""

    def __init__(self, collection: chromadb.Collection,
                 embedding_model: EmbeddingModel,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize connection agent

        Args:
            collection: ChromaDB collection with documents
            embedding_model: Model for computing similarities
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="ConnectionAgent", llm_client=llm_client, model=model)

        self.collection = collection
        self.embedding_model = embedding_model

        logger.info(f"{self.name} initialized")

    def process(self, document_id: str = None, query: str = None,
                top_k: int = 5) -> Dict:
        """
        Find documents related to a document or query

        Args:
            document_id: ID of reference document
            query: Search query (used if document_id not provided)
            top_k: Number of related documents to find

        Returns:
            Dictionary with related documents and connections
        """
        if document_id:
            logger.info(f"{self.name} finding connections for document: {document_id}")
            return self._find_related_to_document(document_id, top_k)
        elif query:
            logger.info(f"{self.name} finding connections for query: {query[:100]}")
            return self._find_related_to_query(query, top_k)
        else:
            return {'related': [], 'error': 'No document_id or query provided'}

    def _find_related_to_document(self, document_id: str, top_k: int) -> Dict:
        """Find documents related to a specific document"""
        try:
            # Get chunks from the document
            results = self.collection.get(
                where={"document_id": document_id},
                include=['embeddings', 'documents', 'metadatas']
            )

            if not results['ids']:
                return {'related': [], 'error': 'Document not found'}

            # Use the first chunk's embedding as representative
            query_embedding = results['embeddings'][0]
            document_name = results['metadatas'][0].get('filename', 'Unknown')

            # Search for similar chunks from OTHER documents
            search_results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=top_k * 3,  # Get more to filter out same document
                include=['documents', 'metadatas', 'distances']
            )

            # Filter out chunks from the same document
            related = []
            seen_docs = set([document_id])

            if search_results['ids']:
                for i in range(len(search_results['ids'][0])):
                    related_doc_id = search_results['metadatas'][0][i].get('document_id')

                    if related_doc_id not in seen_docs:
                        seen_docs.add(related_doc_id)

                        similarity = 1.0 - search_results['distances'][0][i]

                        related.append({
                            'document_id': related_doc_id,
                            'document_name': search_results['metadatas'][0][i].get('filename', 'Unknown'),
                            'similarity': float(similarity),
                            'preview': search_results['documents'][0][i][:150] + "..."
                        })

                        if len(related) >= top_k:
                            break

            return {
                'source_document': document_name,
                'source_id': document_id,
                'related': related,
                'num_related': len(related)
            }

        except Exception as e:
            logger.error(f"Error finding related documents: {e}")
            return {'related': [], 'error': str(e)}

    def _find_related_to_query(self, query: str, top_k: int) -> Dict:
        """Find documents related to a query"""
        try:
            # Generate query embedding
            query_embedding = self.embedding_model.embed_query(query)

            # Search
            results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=top_k * 2,  # Get more to deduplicate by document
                include=['documents', 'metadatas', 'distances']
            )

            # Deduplicate by document
            related = []
            seen_docs = set()

            if results['ids']:
                for i in range(len(results['ids'][0])):
                    doc_id = results['metadatas'][0][i].get('document_id')

                    if doc_id not in seen_docs:
                        seen_docs.add(doc_id)

                        similarity = 1.0 - results['distances'][0][i]

                        related.append({
                            'document_id': doc_id,
                            'document_name': results['metadatas'][0][i].get('filename', 'Unknown'),
                            'similarity': float(similarity),
                            'preview': results['documents'][0][i][:150] + "..."
                        })

                        if len(related) >= top_k:
                            break

            return {
                'query': query,
                'related': related,
                'num_related': len(related)
            }

        except Exception as e:
            logger.error(f"Error finding related documents: {e}")
            return {'related': [], 'error': str(e)}

    def build_knowledge_graph(self, similarity_threshold: float = 0.7) -> KnowledgeGraph:
        """
        Build a knowledge graph showing document relationships

        Args:
            similarity_threshold: Minimum similarity to create an edge

        Returns:
            KnowledgeGraph object
        """
        logger.info(f"{self.name} building knowledge graph")

        graph = KnowledgeGraph()

        try:
            # Get all documents
            all_results = self.collection.get(
                include=['embeddings', 'metadatas']
            )

            if not all_results['ids']:
                return graph

            # Group by document
            documents = {}
            for i, metadata in enumerate(all_results['metadatas']):
                doc_id = metadata.get('document_id')
                if doc_id not in documents:
                    documents[doc_id] = {
                        'name': metadata.get('filename', 'Unknown'),
                        'embedding': all_results['embeddings'][i]
                    }

            # Create nodes
            for doc_id, doc_data in documents.items():
                node = KnowledgeNode(
                    id=doc_id,
                    name=doc_data['name'],
                    node_type='document',
                    description=f"Document: {doc_data['name']}"
                )
                graph.add_node(node)

            # Create edges based on similarity
            doc_ids = list(documents.keys())
            for i, doc_id1 in enumerate(doc_ids):
                emb1 = np.array(documents[doc_id1]['embedding'])

                for doc_id2 in doc_ids[i+1:]:
                    emb2 = np.array(documents[doc_id2]['embedding'])

                    # Calculate similarity
                    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

                    if similarity >= similarity_threshold:
                        edge = KnowledgeEdge(
                            source_id=doc_id1,
                            target_id=doc_id2,
                            relationship='similar_to',
                            weight=float(similarity)
                        )
                        graph.add_edge(edge)

            logger.info(f"{self.name} built graph with {len(graph.nodes)} nodes and {len(graph.edges)} edges")
            return graph

        except Exception as e:
            logger.error(f"Error building knowledge graph: {e}")
            return graph

    def explain_connection(self, doc_id1: str, doc_id2: str) -> str:
        """
        Use LLM to explain why two documents are related

        Args:
            doc_id1: First document ID
            doc_id2: Second document ID

        Returns:
            Explanation text
        """
        try:
            # Get sample chunks from each document
            results1 = self.collection.get(
                where={"document_id": doc_id1},
                limit=2,
                include=['documents', 'metadatas']
            )

            results2 = self.collection.get(
                where={"document_id": doc_id2},
                limit=2,
                include=['documents', 'metadatas']
            )

            if not results1['ids'] or not results2['ids']:
                return "Could not retrieve documents"

            doc1_name = results1['metadatas'][0].get('filename', 'Document 1')
            doc2_name = results2['metadatas'][0].get('filename', 'Document 2')

            doc1_text = " ".join(results1['documents'][:2])[:1000]
            doc2_text = " ".join(results2['documents'][:2])[:1000]

            system_prompt = """You analyze documents and explain their relationships.
Provide a brief, clear explanation of how two documents are related."""

            user_prompt = f"""Analyze these two documents and explain how they are related:

Document 1 ({doc1_name}):
{doc1_text}

Document 2 ({doc2_name}):
{doc2_text}

How are these documents related? Provide a concise explanation:"""

            explanation = self.generate(
                prompt=user_prompt,
                system=system_prompt,
                temperature=0.3,
                max_tokens=256
            )

            return explanation

        except Exception as e:
            logger.error(f"Error explaining connection: {e}")
            return f"Error: {str(e)}"
@@ -0,0 +1,233 @@
"""
Export Agent - Generates formatted reports and exports
"""
import logging
from typing import List, Dict
from datetime import datetime
from agents.base_agent import BaseAgent
from models.document import Summary

logger = logging.getLogger(__name__)


class ExportAgent(BaseAgent):
    """Agent that exports summaries and reports in various formats"""

    def __init__(self, llm_client=None, model: str = "llama3.2"):
        """
        Initialize export agent

        Args:
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="ExportAgent", llm_client=llm_client, model=model)

        logger.info(f"{self.name} initialized")

    def process(self, content: Dict, format: str = "markdown") -> str:
        """
        Export content in specified format

        Args:
            content: Content dictionary to export
            format: Export format ('markdown', 'text', 'html')

        Returns:
            Formatted content string
        """
        logger.info(f"{self.name} exporting as {format}")

        if format == "markdown":
            return self._export_markdown(content)
        elif format == "text":
            return self._export_text(content)
        elif format == "html":
            return self._export_html(content)
        else:
            return str(content)

    def _export_markdown(self, content: Dict) -> str:
        """Export as Markdown"""
        md = []
        md.append("# Knowledge Report")
        md.append(f"\n*Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}*\n")

        if 'title' in content:
            md.append(f"## {content['title']}\n")

        if 'summary' in content:
            md.append("### Summary\n")
            md.append(f"{content['summary']}\n")

        if 'key_points' in content and content['key_points']:
            md.append("### Key Points\n")
            for point in content['key_points']:
                md.append(f"- {point}")
            md.append("")

        if 'sections' in content:
            for section in content['sections']:
                md.append(f"### {section['title']}\n")
                md.append(f"{section['content']}\n")

        if 'sources' in content and content['sources']:
            md.append("### Sources\n")
            for i, source in enumerate(content['sources'], 1):
                md.append(f"{i}. {source}")
            md.append("")

        return "\n".join(md)

    def _export_text(self, content: Dict) -> str:
        """Export as plain text"""
        lines = []
        lines.append("=" * 60)
        lines.append("KNOWLEDGE REPORT")
        lines.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
        lines.append("=" * 60)
        lines.append("")

        if 'title' in content:
            lines.append(content['title'])
            lines.append("-" * len(content['title']))
            lines.append("")

        if 'summary' in content:
            lines.append("SUMMARY:")
            lines.append(content['summary'])
            lines.append("")

        if 'key_points' in content and content['key_points']:
            lines.append("KEY POINTS:")
            for i, point in enumerate(content['key_points'], 1):
                lines.append(f"  {i}. {point}")
            lines.append("")

        if 'sections' in content:
            for section in content['sections']:
                lines.append(section['title'].upper())
                lines.append("-" * 40)
                lines.append(section['content'])
                lines.append("")

        if 'sources' in content and content['sources']:
            lines.append("SOURCES:")
            for i, source in enumerate(content['sources'], 1):
                lines.append(f"  {i}. {source}")

        lines.append("")
        lines.append("=" * 60)

        return "\n".join(lines)

    def _export_html(self, content: Dict) -> str:
        """Export as HTML"""
        html = []
        html.append("<!DOCTYPE html>")
        html.append("<html>")
        html.append("<head>")
        html.append("  <meta charset='utf-8'>")
        html.append("  <title>Knowledge Report</title>")
        html.append("  <style>")
        html.append("    body { font-family: Arial, sans-serif; max-width: 800px; margin: 40px auto; padding: 20px; }")
        html.append("    h1 { color: #333; border-bottom: 3px solid #007bff; padding-bottom: 10px; }")
        html.append("    h2 { color: #555; margin-top: 30px; }")
        html.append("    .meta { color: #888; font-style: italic; }")
        html.append("    .key-points { background: #f8f9fa; padding: 15px; border-left: 4px solid #007bff; }")
        html.append("    .source { color: #666; font-size: 0.9em; }")
        html.append("  </style>")
        html.append("</head>")
        html.append("<body>")

        html.append("  <h1>Knowledge Report</h1>")
        html.append(f"  <p class='meta'>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>")

        if 'title' in content:
            html.append(f"  <h2>{content['title']}</h2>")

        if 'summary' in content:
            html.append("  <h3>Summary</h3>")
            html.append(f"  <p>{content['summary']}</p>")

        if 'key_points' in content and content['key_points']:
            html.append("  <h3>Key Points</h3>")
            html.append("  <div class='key-points'>")
            html.append("    <ul>")
            for point in content['key_points']:
                html.append(f"      <li>{point}</li>")
            html.append("    </ul>")
            html.append("  </div>")

        if 'sections' in content:
            for section in content['sections']:
                html.append(f"  <h3>{section['title']}</h3>")
                html.append(f"  <p>{section['content']}</p>")

        if 'sources' in content and content['sources']:
            html.append("  <h3>Sources</h3>")
            html.append("  <ol class='source'>")
            for source in content['sources']:
                html.append(f"    <li>{source}</li>")
            html.append("  </ol>")

        html.append("</body>")
        html.append("</html>")

        return "\n".join(html)

    def create_study_guide(self, summaries: List[Summary]) -> str:
        """
        Create a study guide from multiple summaries

        Args:
            summaries: List of Summary objects

        Returns:
            Formatted study guide
        """
        logger.info(f"{self.name} creating study guide from {len(summaries)} summaries")

        # Compile all content
        all_summaries = "\n\n".join([
            f"{s.document_name}:\n{s.summary_text}"
            for s in summaries
        ])

        all_key_points = []
        for s in summaries:
            all_key_points.extend(s.key_points)

        # Use LLM to create cohesive study guide
        system_prompt = """You create excellent study guides that synthesize information from multiple sources.
Create a well-organized study guide with clear sections, key concepts, and important points."""

        user_prompt = f"""Create a comprehensive study guide based on these document summaries:

{all_summaries}

Create a well-structured study guide with:
1. An overview
2. Key concepts
3. Important details
4. Study tips

Study Guide:"""

        study_guide = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.5,
            max_tokens=2048
        )

        # Format as markdown
        content = {
            'title': 'Study Guide',
            'sections': [
                {'title': 'Overview', 'content': study_guide},
                {'title': 'Key Points from All Documents', 'content': '\n'.join([f"• {p}" for p in all_key_points[:15]])}
            ],
            'sources': [s.document_name for s in summaries]
        }

        return self._export_markdown(content)
@@ -0,0 +1,157 @@
"""
Ingestion Agent - Processes and stores documents in the vector database
"""
import logging
from typing import Dict, List
import uuid
from datetime import datetime

from agents.base_agent import BaseAgent
from models.document import Document, DocumentChunk
from utils.document_parser import DocumentParser
from utils.embeddings import EmbeddingModel
import chromadb

logger = logging.getLogger(__name__)


class IngestionAgent(BaseAgent):
    """Agent responsible for ingesting and storing documents"""

    def __init__(self, collection: chromadb.Collection,
                 embedding_model: EmbeddingModel,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize ingestion agent

        Args:
            collection: ChromaDB collection for storage
            embedding_model: Model for generating embeddings
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="IngestionAgent", llm_client=llm_client, model=model)

        self.collection = collection
        self.embedding_model = embedding_model
        self.parser = DocumentParser(chunk_size=1000, chunk_overlap=200)

        logger.info(f"{self.name} ready with ChromaDB collection")

    def process(self, file_path: str) -> Document:
        """
        Process and ingest a document

        Args:
            file_path: Path to the document file

        Returns:
            Document object with metadata
        """
        logger.info(f"{self.name} processing: {file_path}")

        # Parse the document
        parsed = self.parser.parse_file(file_path)

        # Generate document ID
        doc_id = str(uuid.uuid4())

        # Create document chunks
        chunks = []
        chunk_texts = []
        chunk_ids = []
        chunk_metadatas = []

        for i, chunk_text in enumerate(parsed['chunks']):
            chunk_id = f"{doc_id}_chunk_{i}"

            chunk = DocumentChunk(
                id=chunk_id,
                document_id=doc_id,
                content=chunk_text,
                chunk_index=i,
                metadata={
                    'filename': parsed['filename'],
                    'extension': parsed['extension'],
                    'total_chunks': len(parsed['chunks'])
                }
            )

            chunks.append(chunk)
            chunk_texts.append(chunk_text)
            chunk_ids.append(chunk_id)
            chunk_metadatas.append({
                'document_id': doc_id,
                'filename': parsed['filename'],
                'chunk_index': i,
                'extension': parsed['extension']
            })

        # Generate embeddings
        logger.info(f"{self.name} generating embeddings for {len(chunks)} chunks")
        embeddings = self.embedding_model.embed_documents(chunk_texts)

        # Store in ChromaDB
        logger.info(f"{self.name} storing in ChromaDB")
        self.collection.add(
            ids=chunk_ids,
            documents=chunk_texts,
            embeddings=embeddings,
            metadatas=chunk_metadatas
        )

        # Create document object
        document = Document(
            id=doc_id,
            filename=parsed['filename'],
            filepath=parsed['filepath'],
            content=parsed['text'],
            chunks=chunks,
            metadata={
                'extension': parsed['extension'],
                'num_chunks': len(chunks),
                'total_chars': parsed['total_chars']
            },
            created_at=datetime.now()
        )

        logger.info(f"{self.name} successfully ingested: {document}")
        return document

    def get_statistics(self) -> Dict:
        """Get statistics about stored documents"""
        try:
            count = self.collection.count()
            return {
                'total_chunks': count,
                'collection_name': self.collection.name
            }
        except Exception as e:
            logger.error(f"Error getting statistics: {e}")
            return {'total_chunks': 0, 'error': str(e)}

    def delete_document(self, document_id: str) -> bool:
        """
        Delete all chunks of a document

        Args:
            document_id: ID of document to delete

        Returns:
            True if successful
        """
        try:
            # Get all chunk IDs for this document
            results = self.collection.get(
                where={"document_id": document_id}
            )

            if results['ids']:
                self.collection.delete(ids=results['ids'])
                logger.info(f"{self.name} deleted document {document_id}")
                return True

            return False

        except Exception as e:
            logger.error(f"Error deleting document: {e}")
            return False
@@ -0,0 +1,156 @@
"""
Question Agent - Answers questions using RAG (Retrieval Augmented Generation)
"""
import logging
from typing import List, Dict, Any
from agents.base_agent import BaseAgent
from models.document import SearchResult, DocumentChunk
from utils.embeddings import EmbeddingModel
import chromadb

logger = logging.getLogger(__name__)


class QuestionAgent(BaseAgent):
    """Agent that answers questions using retrieved context"""

    def __init__(self, collection: chromadb.Collection,
                 embedding_model: EmbeddingModel,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize question agent

        Args:
            collection: ChromaDB collection with documents
            embedding_model: Model for query embeddings
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="QuestionAgent", llm_client=llm_client, model=model)

        self.collection = collection
        self.embedding_model = embedding_model
        self.top_k = 5  # Number of chunks to retrieve

        logger.info(f"{self.name} initialized")

    def retrieve(self, query: str, top_k: int = None) -> List[SearchResult]:
        """
        Retrieve relevant document chunks for a query

        Args:
            query: Search query
            top_k: Number of results to return (uses self.top_k if None)

        Returns:
            List of SearchResult objects
        """
        if top_k is None:
            top_k = self.top_k

        logger.info(f"{self.name} retrieving top {top_k} chunks for query")

        # Generate query embedding
        query_embedding = self.embedding_model.embed_query(query)

        # Search ChromaDB
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        # Convert to SearchResult objects
        search_results = []

        if results['ids'] and len(results['ids']) > 0:
            for i in range(len(results['ids'][0])):
                chunk = DocumentChunk(
                    id=results['ids'][0][i],
                    document_id=results['metadatas'][0][i].get('document_id', ''),
                    content=results['documents'][0][i],
                    chunk_index=results['metadatas'][0][i].get('chunk_index', 0),
                    metadata=results['metadatas'][0][i]
                )

                result = SearchResult(
                    chunk=chunk,
                    score=1.0 - results['distances'][0][i],  # Convert distance to similarity
                    document_id=results['metadatas'][0][i].get('document_id', ''),
                    document_name=results['metadatas'][0][i].get('filename', 'Unknown')
                )

                search_results.append(result)

        logger.info(f"{self.name} retrieved {len(search_results)} results")
        return search_results

    def process(self, question: str, top_k: int = None) -> Dict[str, Any]:
        """
        Answer a question using RAG

        Args:
            question: User's question
            top_k: Number of chunks to retrieve

        Returns:
            Dictionary with answer and sources
        """
        logger.info(f"{self.name} processing question: {question[:100]}...")

        # Retrieve relevant chunks
        search_results = self.retrieve(question, top_k)

        if not search_results:
            return {
                'answer': "I don't have any relevant information in my knowledge base to answer this question.",
                'sources': [],
                'context_used': ""
            }

        # Build context from retrieved chunks
        context_parts = []
        sources = []

        for i, result in enumerate(search_results, 1):
            context_parts.append(f"[Source {i}] {result.chunk.content}")
            sources.append({
                'document': result.document_name,
                'score': result.score,
                'preview': result.chunk.content[:150] + "..."
            })

        context = "\n\n".join(context_parts)

        # Create prompt for LLM
        system_prompt = """You are a helpful research assistant. Answer questions based on the provided context.
Be accurate and cite sources when possible. If the context doesn't contain enough information to answer fully, say so.
Keep your answer concise and relevant."""

        user_prompt = f"""Context from my knowledge base:

{context}

Question: {question}

Answer based on the context above. If you reference specific information, mention which source(s) you're using."""

        # Generate answer
        answer = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.3,  # Lower temperature for more factual responses
            max_tokens=1024
        )

        logger.info(f"{self.name} generated answer ({len(answer)} chars)")

        return {
            'answer': answer,
            'sources': sources,
            'context_used': context,
            'num_sources': len(sources)
        }

    def set_top_k(self, k: int):
        """Set the number of chunks to retrieve"""
        self.top_k = k
        logger.info(f"{self.name} top_k set to {k}")
@@ -0,0 +1,181 @@
"""
Summary Agent - Creates summaries and extracts key points from documents
"""
import logging
from typing import Dict, List
from agents.base_agent import BaseAgent
from models.document import Summary
import chromadb

logger = logging.getLogger(__name__)


class SummaryAgent(BaseAgent):
    """Agent that creates summaries of documents"""

    def __init__(self, collection: chromadb.Collection,
                 llm_client=None, model: str = "llama3.2"):
        """
        Initialize summary agent

        Args:
            collection: ChromaDB collection with documents
            llm_client: Optional shared LLM client
            model: Ollama model name
        """
        super().__init__(name="SummaryAgent", llm_client=llm_client, model=model)
        self.collection = collection

        logger.info(f"{self.name} initialized")

    def process(self, document_id: str = None, document_text: str = None,
                document_name: str = "Unknown") -> Summary:
        """
        Create a summary of a document

        Args:
            document_id: ID of document in ChromaDB (retrieves chunks if provided)
            document_text: Full document text (used if document_id not provided)
            document_name: Name of the document

        Returns:
            Summary object
        """
        logger.info(f"{self.name} creating summary for: {document_name}")

        # Get document text
        if document_id:
            text = self._get_document_text(document_id)
            if not text:
                return Summary(
                    document_id=document_id,
                    document_name=document_name,
                    summary_text="Error: Could not retrieve document",
                    key_points=[]
                )
        elif document_text:
            text = document_text
        else:
            return Summary(
                document_id="",
                document_name=document_name,
                summary_text="Error: No document provided",
                key_points=[]
            )

        # Truncate if too long (to fit in context)
        max_chars = 8000
        if len(text) > max_chars:
            logger.warning(f"{self.name} truncating document from {len(text)} to {max_chars} chars")
            text = text[:max_chars] + "\n\n[Document truncated...]"

        # Generate summary
        summary_text = self._generate_summary(text)

        # Extract key points
        key_points = self._extract_key_points(text)

        summary = Summary(
            document_id=document_id or "",
            document_name=document_name,
            summary_text=summary_text,
            key_points=key_points
        )

        logger.info(f"{self.name} completed summary with {len(key_points)} key points")
        return summary

    def _get_document_text(self, document_id: str) -> str:
        """Retrieve and reconstruct document text from chunks"""
        try:
            results = self.collection.get(
                where={"document_id": document_id}
            )

            if not results['ids']:
                return ""

            # Sort by chunk index
            chunks_data = list(zip(
                results['documents'],
                results['metadatas']
            ))

            chunks_data.sort(key=lambda x: x[1].get('chunk_index', 0))

            # Combine chunks
            text = "\n\n".join([chunk[0] for chunk in chunks_data])
            return text

        except Exception as e:
            logger.error(f"Error retrieving document: {e}")
            return ""

    def _generate_summary(self, text: str) -> str:
        """Generate a concise summary of the text"""
        system_prompt = """You are an expert at creating concise, informative summaries.
Your summaries capture the main ideas and key information in clear, accessible language.
Keep summaries to 3-5 sentences unless the document is very long."""

        user_prompt = f"""Please create a concise summary of the following document:

{text}

Summary:"""

        summary = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.3,
            max_tokens=512
        )

        return summary.strip()

    def _extract_key_points(self, text: str) -> List[str]:
        """Extract key points from the text"""
        system_prompt = """You extract the most important key points from documents.
List 3-7 key points as concise bullet points. Each point should be a complete, standalone statement."""

        user_prompt = f"""Please extract the key points from the following document:

{text}

List the key points (one per line, without bullets or numbers):"""

        response = self.generate(
            prompt=user_prompt,
            system=system_prompt,
            temperature=0.3,
            max_tokens=512
        )

        # Parse the response into a list
        key_points = []
        for line in response.split('\n'):
            line = line.strip()
            # Remove common list markers
            line = line.lstrip('•-*0123456789.)')
            line = line.strip()

            if line and len(line) > 10:  # Filter out very short lines
                key_points.append(line)

        return key_points[:7]  # Limit to 7 points

    def summarize_multiple(self, document_ids: List[str]) -> List[Summary]:
        """
        Create summaries for multiple documents

        Args:
            document_ids: List of document IDs

        Returns:
            List of Summary objects
        """
        summaries = []

        for doc_id in document_ids:
            summary = self.process(document_id=doc_id)
            summaries.append(summary)

        return summaries
846
community-contributions/sach91-bootcamp/week8/app.py
Normal file
@@ -0,0 +1,846 @@
"""
KnowledgeHub - Personal Knowledge Management & Research Assistant
Main Gradio Application
"""
import os
import logging
import json
import gradio as gr
from pathlib import Path
import chromadb
from datetime import datetime

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Import utilities and agents
from utils import OllamaClient, EmbeddingModel, DocumentParser
from agents import (
    IngestionAgent, QuestionAgent, SummaryAgent,
    ConnectionAgent, ExportAgent
)
from models import Document

# Constants
VECTORSTORE_PATH = "./vectorstore"
TEMP_UPLOAD_PATH = "./temp_uploads"
DOCUMENTS_METADATA_PATH = "./vectorstore/documents_metadata.json"

# Ensure directories exist
os.makedirs(VECTORSTORE_PATH, exist_ok=True)
os.makedirs(TEMP_UPLOAD_PATH, exist_ok=True)


class KnowledgeHub:
    """Main application class managing all agents"""

    def __init__(self):
        logger.info("Initializing KnowledgeHub...")

        # Initialize ChromaDB
        self.client = chromadb.PersistentClient(path=VECTORSTORE_PATH)
        self.collection = self.client.get_or_create_collection(
            name="knowledge_base",
            metadata={"description": "Personal knowledge management collection"}
        )

        # Initialize embedding model
        self.embedding_model = EmbeddingModel()

        # Initialize shared LLM client
        self.llm_client = OllamaClient(model="llama3.2")

        # Check Ollama connection
        if not self.llm_client.check_connection():
            logger.warning("⚠️ Cannot connect to Ollama. Please ensure Ollama is running.")
            logger.warning("Start Ollama with: ollama serve")
        else:
            logger.info("✓ Connected to Ollama")

        # Initialize agents
        self.ingestion_agent = IngestionAgent(
            collection=self.collection,
            embedding_model=self.embedding_model,
            llm_client=self.llm_client
        )

        self.question_agent = QuestionAgent(
            collection=self.collection,
            embedding_model=self.embedding_model,
            llm_client=self.llm_client
        )

        self.summary_agent = SummaryAgent(
            collection=self.collection,
            llm_client=self.llm_client
        )

        self.connection_agent = ConnectionAgent(
            collection=self.collection,
            embedding_model=self.embedding_model,
            llm_client=self.llm_client
        )

        self.export_agent = ExportAgent(
            llm_client=self.llm_client
        )

        # Track uploaded documents
        self.documents = {}

        # Load existing documents from metadata file
        self._load_documents_metadata()

        logger.info("✓ KnowledgeHub initialized successfully")

    def _save_documents_metadata(self):
        """Save document metadata to JSON file"""
        try:
            metadata = {
                doc_id: doc.to_dict()
                for doc_id, doc in self.documents.items()
            }

            with open(DOCUMENTS_METADATA_PATH, 'w') as f:
                json.dump(metadata, f, indent=2)

            logger.debug(f"Saved metadata for {len(metadata)} documents")
        except Exception as e:
            logger.error(f"Error saving document metadata: {e}")

    def _load_documents_metadata(self):
        """Load document metadata from JSON file"""
        try:
            if os.path.exists(DOCUMENTS_METADATA_PATH):
                with open(DOCUMENTS_METADATA_PATH, 'r') as f:
                    metadata = json.load(f)

                # Reconstruct Document objects (simplified - without chunks)
                for doc_id, doc_data in metadata.items():
                    # Create a minimal Document object for UI purposes
                    # Full chunks are still in ChromaDB
                    doc = Document(
                        id=doc_id,
                        filename=doc_data['filename'],
                        filepath=doc_data.get('filepath', ''),
                        content=doc_data.get('content', ''),
                        chunks=[],  # Chunks are in ChromaDB
                        metadata=doc_data.get('metadata', {}),
                        created_at=datetime.fromisoformat(doc_data['created_at'])
                    )
                    self.documents[doc_id] = doc

                logger.info(f"✓ Loaded {len(self.documents)} existing documents from storage")
            else:
                logger.info("No existing documents found (starting fresh)")

        except Exception as e:
            logger.error(f"Error loading document metadata: {e}")
            logger.info("Starting with empty document list")

    def upload_document(self, files, progress=gr.Progress()):
        """Handle document upload - supports single or multiple files with progress tracking"""
        if files is None or len(files) == 0:
            return "⚠️ Please select file(s) to upload", "", []

        # Convert single file to list for consistent handling
        if not isinstance(files, list):
            files = [files]

        results = []
        successful = 0
        failed = 0
        total_chunks = 0

        # Initialize progress tracking
        progress(0, desc="Starting upload...")

        for file_idx, file in enumerate(files, 1):
            # Update progress
            progress_pct = (file_idx - 1) / len(files)
            progress(progress_pct, desc=f"Processing {file_idx}/{len(files)}: {Path(file.name).name}")

            try:
                logger.info(f"Processing file {file_idx}/{len(files)}: {file.name}")

                # Save uploaded file temporarily
                temp_path = os.path.join(TEMP_UPLOAD_PATH, Path(file.name).name)

                # Copy file content
                with open(temp_path, 'wb') as f:
                    f.write(file.read() if hasattr(file, 'read') else open(file.name, 'rb').read())

                # Process document
                document = self.ingestion_agent.process(temp_path)

                # Store document reference
                self.documents[document.id] = document

                # Track stats
                successful += 1
                total_chunks += document.num_chunks

                # Add to results
                results.append({
                    'status': '✅',
                    'filename': document.filename,
                    'chunks': document.num_chunks,
                    'size': f"{document.total_chars:,} chars"
                })

                # Clean up temp file
                os.remove(temp_path)

            except Exception as e:
                logger.error(f"Error processing {file.name}: {e}")
                failed += 1
                results.append({
                    'status': '❌',
                    'filename': Path(file.name).name,
                    'chunks': 0,
                    'size': f"Error: {str(e)[:50]}"
                })

        # Final progress update
        progress(1.0, desc="Upload complete!")

        # Save metadata once after all uploads
        if successful > 0:
            self._save_documents_metadata()

        # Create summary
        summary = f"""## Upload Complete! 🎉

**Total Files:** {len(files)}
**✅ Successful:** {successful}
**❌ Failed:** {failed}
**Total Chunks Created:** {total_chunks:,}

{f"⚠️ **{failed} file(s) failed** - Check results table below for details" if failed > 0 else "All files processed successfully!"}
"""

        # Create detailed results table
        results_table = [[r['status'], r['filename'], r['chunks'], r['size']] for r in results]

        # Create preview of first successful document
        preview = ""
        for doc in self.documents.values():
            if doc.filename in [r['filename'] for r in results if r['status'] == '✅']:
                preview = doc.content[:500] + "..." if len(doc.content) > 500 else doc.content
                break

        return summary, preview, results_table

    def ask_question(self, question, top_k, progress=gr.Progress()):
        """Handle question answering with progress tracking"""
        if not question.strip():
            return "⚠️ Please enter a question", [], ""

        try:
            # Initial status
            progress(0, desc="Processing your question...")
            status = "🔄 **Searching knowledge base...**\n\nRetrieving relevant documents..."
|
|
||||||
|
logger.info(f"Answering question: {question[:100]}")
|
||||||
|
|
||||||
|
# Update progress
|
||||||
|
progress(0.3, desc="Finding relevant documents...")
|
||||||
|
|
||||||
|
result = self.question_agent.process(question, top_k=top_k)
|
||||||
|
|
||||||
|
# Update progress
|
||||||
|
progress(0.7, desc="Generating answer with LLM...")
|
||||||
|
|
||||||
|
# Format answer
|
||||||
|
answer = f"""### Answer\n\n{result['answer']}\n\n"""
|
||||||
|
|
||||||
|
if result['sources']:
|
||||||
|
answer += f"**Sources:** {result['num_sources']} documents referenced\n\n"
|
||||||
|
|
||||||
|
# Format sources for display
|
||||||
|
sources_data = []
|
||||||
|
for i, source in enumerate(result['sources'], 1):
|
||||||
|
sources_data.append([
|
||||||
|
i,
|
||||||
|
source['document'],
|
||||||
|
f"{source['score']:.2%}",
|
||||||
|
source['preview']
|
||||||
|
])
|
||||||
|
|
||||||
|
progress(1.0, desc="Answer ready!")
|
||||||
|
|
||||||
|
return answer, sources_data, "✅ Answer generated successfully!"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error answering question: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", [], f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def create_summary(self, doc_selector, progress=gr.Progress()):
|
||||||
|
"""Create document summary with progress tracking"""
|
||||||
|
if not doc_selector:
|
||||||
|
return "⚠️ Please select a document to summarize", ""
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Initial status
|
||||||
|
progress(0, desc="Preparing to summarize...")
|
||||||
|
|
||||||
|
logger.info(f'doc_selector : {doc_selector}')
|
||||||
|
doc_id = doc_selector.split(" -|- ")[1]
|
||||||
|
document = self.documents.get(doc_id)
|
||||||
|
|
||||||
|
if not document:
|
||||||
|
return "", "❌ Document not found"
|
||||||
|
|
||||||
|
# Update status
|
||||||
|
status_msg = f"🔄 **Generating summary for:** {document.filename}\n\nPlease wait, this may take 10-20 seconds..."
|
||||||
|
progress(0.3, desc=f"Analyzing {document.filename}...")
|
||||||
|
|
||||||
|
logger.info(f"Creating summary for: {document.filename}")
|
||||||
|
|
||||||
|
# Generate summary
|
||||||
|
summary = self.summary_agent.process(
|
||||||
|
document_id=doc_id,
|
||||||
|
document_name=document.filename
|
||||||
|
)
|
||||||
|
|
||||||
|
progress(1.0, desc="Summary complete!")
|
||||||
|
|
||||||
|
# Format result
|
||||||
|
result = f"""## Summary of {summary.document_name}\n\n{summary.summary_text}\n\n"""
|
||||||
|
|
||||||
|
if summary.key_points:
|
||||||
|
result += "### Key Points\n\n"
|
||||||
|
for point in summary.key_points:
|
||||||
|
result += f"- {point}\n"
|
||||||
|
|
||||||
|
return result, "✅ Summary generated successfully!"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error creating summary: {e}")
|
||||||
|
return "", f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def find_connections(self, doc_selector, top_k, progress=gr.Progress()):
|
||||||
|
"""Find related documents with progress tracking"""
|
||||||
|
if not doc_selector:
|
||||||
|
return "⚠️ Please select a document", [], ""
|
||||||
|
|
||||||
|
try:
|
||||||
|
progress(0, desc="Preparing to find connections...")
|
||||||
|
|
||||||
|
doc_id = doc_selector.split(" -|- ")[1]
|
||||||
|
document = self.documents.get(doc_id)
|
||||||
|
|
||||||
|
if not document:
|
||||||
|
return "❌ Document not found", [], "❌ Document not found"
|
||||||
|
|
||||||
|
status = f"🔄 **Finding documents related to:** {document.filename}\n\nSearching knowledge base..."
|
||||||
|
progress(0.3, desc=f"Analyzing {document.filename}...")
|
||||||
|
|
||||||
|
logger.info(f"Finding connections for: {document.filename}")
|
||||||
|
|
||||||
|
result = self.connection_agent.process(document_id=doc_id, top_k=top_k)
|
||||||
|
|
||||||
|
progress(0.8, desc="Calculating similarity scores...")
|
||||||
|
|
||||||
|
if 'error' in result:
|
||||||
|
return f"❌ Error: {result['error']}", [], f"❌ Error: {result['error']}"
|
||||||
|
|
||||||
|
message = f"""## Related Documents\n\n**Source:** {result['source_document']}\n\n"""
|
||||||
|
message += f"**Found {result['num_related']} related documents:**\n\n"""
|
||||||
|
|
||||||
|
# Format for table
|
||||||
|
table_data = []
|
||||||
|
for i, rel in enumerate(result['related'], 1):
|
||||||
|
table_data.append([
|
||||||
|
i,
|
||||||
|
rel['document_name'],
|
||||||
|
f"{rel['similarity']:.2%}",
|
||||||
|
rel['preview']
|
||||||
|
])
|
||||||
|
|
||||||
|
progress(1.0, desc="Connections found!")
|
||||||
|
|
||||||
|
return message, table_data, "✅ Related documents found!"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error finding connections: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", [], f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def export_knowledge(self, format_choice):
|
||||||
|
"""Export knowledge base"""
|
||||||
|
try:
|
||||||
|
logger.info(f"Exporting as {format_choice}")
|
||||||
|
|
||||||
|
# Get statistics
|
||||||
|
stats = self.ingestion_agent.get_statistics()
|
||||||
|
|
||||||
|
# Create export content
|
||||||
|
content = {
|
||||||
|
'title': 'Knowledge Base Export',
|
||||||
|
'summary': f"Total documents in knowledge base: {len(self.documents)}",
|
||||||
|
'sections': [
|
||||||
|
{
|
||||||
|
'title': 'Documents',
|
||||||
|
'content': '\n'.join([f"- {doc.filename}" for doc in self.documents.values()])
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'title': 'Statistics',
|
||||||
|
'content': f"Total chunks stored: {stats['total_chunks']}"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
# Export
|
||||||
|
if format_choice == "Markdown":
|
||||||
|
output = self.export_agent.process(content, format="markdown")
|
||||||
|
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
|
||||||
|
elif format_choice == "HTML":
|
||||||
|
output = self.export_agent.process(content, format="html")
|
||||||
|
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.html"
|
||||||
|
else: # Text
|
||||||
|
output = self.export_agent.process(content, format="text")
|
||||||
|
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
|
||||||
|
|
||||||
|
# Save file
|
||||||
|
export_path = os.path.join(TEMP_UPLOAD_PATH, filename)
|
||||||
|
with open(export_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(output)
|
||||||
|
|
||||||
|
return f"✅ Exported as {format_choice}", export_path
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error exporting: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", None
|
||||||
|
|
||||||
|
def get_statistics(self):
|
||||||
|
"""Get knowledge base statistics"""
|
||||||
|
try:
|
||||||
|
stats = self.ingestion_agent.get_statistics()
|
||||||
|
|
||||||
|
total_docs = len(self.documents)
|
||||||
|
total_chunks = stats.get('total_chunks', 0)
|
||||||
|
total_chars = sum(doc.total_chars for doc in self.documents.values())
|
||||||
|
|
||||||
|
# Check if data is persisted
|
||||||
|
persistence_status = "✅ Enabled" if os.path.exists(DOCUMENTS_METADATA_PATH) else "⚠️ Not configured"
|
||||||
|
vectorstore_size = self._get_directory_size(VECTORSTORE_PATH)
|
||||||
|
|
||||||
|
stats_text = f"""## Knowledge Base Statistics
|
||||||
|
|
||||||
|
**Persistence Status:** {persistence_status}
|
||||||
|
**Total Documents:** {total_docs}
|
||||||
|
**Total Chunks:** {total_chunks:,}
|
||||||
|
**Total Characters:** {total_chars:,}
|
||||||
|
**Vector Store Size:** {vectorstore_size}
|
||||||
|
|
||||||
|
### Storage Locations
|
||||||
|
- **Vector DB:** `{VECTORSTORE_PATH}/`
|
||||||
|
- **Metadata:** `{DOCUMENTS_METADATA_PATH}`
|
||||||
|
|
||||||
|
**📝 Note:** Your data persists across app restarts!
|
||||||
|
|
||||||
|
**Recent Documents:**
|
||||||
|
|
||||||
|
"""
|
||||||
|
if self.documents:
|
||||||
|
stats_text += "\n".join([f"- {doc.filename} ({doc.num_chunks} chunks, added {doc.created_at.strftime('%Y-%m-%d')})"
|
||||||
|
for doc in list(self.documents.values())[-10:]])
|
||||||
|
else:
|
||||||
|
stats_text += "\n*No documents yet. Upload some to get started!*"
|
||||||
|
|
||||||
|
return stats_text
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
def _get_directory_size(self, path):
|
||||||
|
"""Calculate directory size"""
|
||||||
|
try:
|
||||||
|
total_size = 0
|
||||||
|
for dirpath, dirnames, filenames in os.walk(path):
|
||||||
|
for filename in filenames:
|
||||||
|
filepath = os.path.join(dirpath, filename)
|
||||||
|
if os.path.exists(filepath):
|
||||||
|
total_size += os.path.getsize(filepath)
|
||||||
|
|
||||||
|
# Convert to human readable
|
||||||
|
for unit in ['B', 'KB', 'MB', 'GB']:
|
||||||
|
if total_size < 1024.0:
|
||||||
|
return f"{total_size:.1f} {unit}"
|
||||||
|
total_size /= 1024.0
|
||||||
|
return f"{total_size:.1f} TB"
|
||||||
|
except Exception:
|
||||||
|
return "Unknown"
|
||||||
|
|
||||||
|
def get_document_list(self):
|
||||||
|
"""Get list of documents for dropdown"""
|
||||||
|
new_choices = [f"{doc.filename} -|- {doc.id}" for doc in self.documents.values()]
|
||||||
|
return gr.update(choices=new_choices, value=None)
|
||||||
|
|
||||||
|
|
||||||
|
def delete_document(self, doc_selector):
|
||||||
|
"""Delete a document from the knowledge base"""
|
||||||
|
if not doc_selector:
|
||||||
|
return "⚠️ Please select a document to delete", self.get_document_list()
|
||||||
|
|
||||||
|
try:
|
||||||
|
doc_id = doc_selector.split(" - ")[0]
|
||||||
|
document = self.documents.get(doc_id)
|
||||||
|
|
||||||
|
if not document:
|
||||||
|
return "❌ Document not found", self.get_document_list()
|
||||||
|
|
||||||
|
# Delete from ChromaDB
|
||||||
|
success = self.ingestion_agent.delete_document(doc_id)
|
||||||
|
|
||||||
|
if success:
|
||||||
|
# Remove from documents dict
|
||||||
|
filename = document.filename
|
||||||
|
del self.documents[doc_id]
|
||||||
|
|
||||||
|
# Save updated metadata
|
||||||
|
self._save_documents_metadata()
|
||||||
|
|
||||||
|
return f"✅ Deleted: {filename}", self.get_document_list()
|
||||||
|
else:
|
||||||
|
return f"❌ Error deleting document", self.get_document_list()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error deleting document: {e}")
|
||||||
|
return f"❌ Error: {str(e)}", self.get_document_list()
|
||||||
|
|
||||||
|
def clear_all_documents(self):
|
||||||
|
"""Clear entire knowledge base"""
|
||||||
|
try:
|
||||||
|
# Delete collection
|
||||||
|
self.client.delete_collection("knowledge_base")
|
||||||
|
|
||||||
|
# Recreate empty collection
|
||||||
|
self.collection = self.client.create_collection(
|
||||||
|
name="knowledge_base",
|
||||||
|
metadata={"description": "Personal knowledge management collection"}
|
||||||
|
)
|
||||||
|
|
||||||
|
# Update agents with new collection
|
||||||
|
self.ingestion_agent.collection = self.collection
|
||||||
|
self.question_agent.collection = self.collection
|
||||||
|
self.summary_agent.collection = self.collection
|
||||||
|
self.connection_agent.collection = self.collection
|
||||||
|
|
||||||
|
# Clear documents
|
||||||
|
self.documents = {}
|
||||||
|
self._save_documents_metadata()
|
||||||
|
|
||||||
|
return "✅ All documents cleared from knowledge base"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error clearing database: {e}")
|
||||||
|
return f"❌ Error: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
def create_ui():
|
||||||
|
"""Create Gradio interface"""
|
||||||
|
|
||||||
|
# Initialize app
|
||||||
|
app = KnowledgeHub()
|
||||||
|
|
||||||
|
# Custom CSS
|
||||||
|
custom_css = """
|
||||||
|
.main-header {
|
||||||
|
text-align: center;
|
||||||
|
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
|
||||||
|
color: white;
|
||||||
|
padding: 30px;
|
||||||
|
border-radius: 10px;
|
||||||
|
margin-bottom: 20px;
|
||||||
|
}
|
||||||
|
.stat-box {
|
||||||
|
background: #f8f9fa;
|
||||||
|
padding: 15px;
|
||||||
|
border-radius: 8px;
|
||||||
|
border-left: 4px solid #667eea;
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
|
||||||
|
with gr.Blocks(title="KnowledgeHub", css=custom_css, theme=gr.themes.Soft()) as interface:
|
||||||
|
|
||||||
|
# Header
|
||||||
|
gr.HTML("""
|
||||||
|
<div class="main-header">
|
||||||
|
<h1>🧠 KnowledgeHub</h1>
|
||||||
|
<p>Personal Knowledge Management & Research Assistant</p>
|
||||||
|
<p style="font-size: 14px; opacity: 0.9;">
|
||||||
|
Powered by Ollama (Llama 3.2) • Fully Local & Private
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
""")
|
||||||
|
|
||||||
|
# Main tabs
|
||||||
|
with gr.Tabs():
|
||||||
|
|
||||||
|
# Tab 1: Upload Documents
|
||||||
|
with gr.Tab("📤 Upload Documents"):
|
||||||
|
gr.Markdown("### Upload your documents to build your knowledge base")
|
||||||
|
gr.Markdown("*Supported formats: PDF, DOCX, TXT, MD, HTML, PY*")
|
||||||
|
gr.Markdown("*💡 Tip: You can select multiple files at once!*")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
file_input = gr.File(
|
||||||
|
label="Select Document(s)",
|
||||||
|
file_types=[".pdf", ".docx", ".txt", ".md", ".html", ".py"],
|
||||||
|
file_count="multiple" # Enable multiple file selection
|
||||||
|
)
|
||||||
|
upload_btn = gr.Button("📤 Upload & Process", variant="primary")
|
||||||
|
|
||||||
|
with gr.Column():
|
||||||
|
upload_status = gr.Markdown("Ready to upload documents")
|
||||||
|
|
||||||
|
# Results table for batch uploads
|
||||||
|
with gr.Row():
|
||||||
|
upload_results = gr.Dataframe(
|
||||||
|
headers=["Status", "Filename", "Chunks", "Size"],
|
||||||
|
label="Upload Results",
|
||||||
|
wrap=True,
|
||||||
|
visible=True
|
||||||
|
)
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
document_preview = gr.Textbox(
|
||||||
|
label="Document Preview (First Uploaded)",
|
||||||
|
lines=10,
|
||||||
|
max_lines=15
|
||||||
|
)
|
||||||
|
|
||||||
|
upload_btn.click(
|
||||||
|
fn=app.upload_document,
|
||||||
|
inputs=[file_input],
|
||||||
|
outputs=[upload_status, document_preview, upload_results]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 2: Ask Questions
|
||||||
|
with gr.Tab("❓ Ask Questions"):
|
||||||
|
gr.Markdown("### Ask questions about your documents")
|
||||||
|
gr.Markdown("*Uses RAG (Retrieval Augmented Generation) to answer based on your knowledge base*")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column(scale=3):
|
||||||
|
question_input = gr.Textbox(
|
||||||
|
label="Your Question",
|
||||||
|
placeholder="What would you like to know?",
|
||||||
|
lines=3
|
||||||
|
)
|
||||||
|
|
||||||
|
with gr.Column(scale=1):
|
||||||
|
top_k_slider = gr.Slider(
|
||||||
|
minimum=1,
|
||||||
|
maximum=10,
|
||||||
|
value=5,
|
||||||
|
step=1,
|
||||||
|
label="Number of sources"
|
||||||
|
)
|
||||||
|
ask_btn = gr.Button("🔍 Ask", variant="primary")
|
||||||
|
|
||||||
|
qa_status = gr.Markdown("Ready to answer questions")
|
||||||
|
answer_output = gr.Markdown(label="Answer")
|
||||||
|
|
||||||
|
sources_table = gr.Dataframe(
|
||||||
|
headers=["#", "Document", "Relevance", "Preview"],
|
||||||
|
label="Sources",
|
||||||
|
wrap=True
|
||||||
|
)
|
||||||
|
|
||||||
|
ask_btn.click(
|
||||||
|
fn=app.ask_question,
|
||||||
|
inputs=[question_input, top_k_slider],
|
||||||
|
outputs=[answer_output, sources_table, qa_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 3: Summarize
|
||||||
|
with gr.Tab("📝 Summarize"):
|
||||||
|
gr.Markdown("### Generate summaries and extract key points")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
doc_selector = gr.Dropdown(
|
||||||
|
choices=[],
|
||||||
|
label="Select Document",
|
||||||
|
info="Choose a document to summarize",
|
||||||
|
allow_custom_value=True
|
||||||
|
)
|
||||||
|
refresh_btn = gr.Button("🔄 Refresh List")
|
||||||
|
summarize_btn = gr.Button("📝 Generate Summary", variant="primary")
|
||||||
|
summary_status = gr.Markdown("Ready to generate summaries")
|
||||||
|
|
||||||
|
with gr.Column(scale=2):
|
||||||
|
summary_output = gr.Markdown(label="Summary")
|
||||||
|
|
||||||
|
summarize_btn.click(
|
||||||
|
fn=app.create_summary,
|
||||||
|
inputs=[doc_selector],
|
||||||
|
outputs=[summary_output, summary_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
refresh_btn.click(
|
||||||
|
fn=app.get_document_list,
|
||||||
|
outputs=[doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 4: Find Connections
|
||||||
|
with gr.Tab("🔗 Find Connections"):
|
||||||
|
gr.Markdown("### Discover relationships between documents")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
conn_doc_selector = gr.Dropdown(
|
||||||
|
choices=[],
|
||||||
|
label="Select Document",
|
||||||
|
info="Find documents related to this one",
|
||||||
|
allow_custom_value=True
|
||||||
|
)
|
||||||
|
conn_top_k = gr.Slider(
|
||||||
|
minimum=1,
|
||||||
|
maximum=10,
|
||||||
|
value=5,
|
||||||
|
step=1,
|
||||||
|
label="Number of related documents"
|
||||||
|
)
|
||||||
|
refresh_conn_btn = gr.Button("🔄 Refresh List")
|
||||||
|
find_btn = gr.Button("🔗 Find Connections", variant="primary")
|
||||||
|
connection_status = gr.Markdown("Ready to find connections")
|
||||||
|
|
||||||
|
connection_output = gr.Markdown(label="Connections")
|
||||||
|
|
||||||
|
connections_table = gr.Dataframe(
|
||||||
|
headers=["#", "Document", "Similarity", "Preview"],
|
||||||
|
label="Related Documents",
|
||||||
|
wrap=True
|
||||||
|
)
|
||||||
|
|
||||||
|
find_btn.click(
|
||||||
|
fn=app.find_connections,
|
||||||
|
inputs=[conn_doc_selector, conn_top_k],
|
||||||
|
outputs=[connection_output, connections_table, connection_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
refresh_conn_btn.click(
|
||||||
|
fn=app.get_document_list,
|
||||||
|
outputs=[conn_doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 5: Export
|
||||||
|
with gr.Tab("💾 Export"):
|
||||||
|
gr.Markdown("### Export your knowledge base")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
format_choice = gr.Radio(
|
||||||
|
choices=["Markdown", "HTML", "Text"],
|
||||||
|
value="Markdown",
|
||||||
|
label="Export Format"
|
||||||
|
)
|
||||||
|
export_btn = gr.Button("💾 Export", variant="primary")
|
||||||
|
|
||||||
|
with gr.Column():
|
||||||
|
export_status = gr.Markdown("Ready to export")
|
||||||
|
export_file = gr.File(label="Download Export")
|
||||||
|
|
||||||
|
export_btn.click(
|
||||||
|
fn=app.export_knowledge,
|
||||||
|
inputs=[format_choice],
|
||||||
|
outputs=[export_status, export_file]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 6: Manage Documents
|
||||||
|
with gr.Tab("🗂️ Manage Documents"):
|
||||||
|
gr.Markdown("### Manage your document library")
|
||||||
|
|
||||||
|
with gr.Row():
|
||||||
|
with gr.Column():
|
||||||
|
gr.Markdown("#### Delete Document")
|
||||||
|
delete_doc_selector = gr.Dropdown(
|
||||||
|
choices=[],
|
||||||
|
label="Select Document to Delete",
|
||||||
|
info="Choose a document to remove from knowledge base"
|
||||||
|
)
|
||||||
|
with gr.Row():
|
||||||
|
refresh_delete_btn = gr.Button("🔄 Refresh List")
|
||||||
|
delete_btn = gr.Button("🗑️ Delete Document", variant="stop")
|
||||||
|
delete_status = gr.Markdown("")
|
||||||
|
|
||||||
|
with gr.Column():
|
||||||
|
gr.Markdown("#### Clear All Documents")
|
||||||
|
gr.Markdown("⚠️ **Warning:** This will delete your entire knowledge base!")
|
||||||
|
clear_confirm = gr.Textbox(
|
||||||
|
label="Type 'DELETE ALL' to confirm",
|
||||||
|
placeholder="DELETE ALL"
|
||||||
|
)
|
||||||
|
clear_all_btn = gr.Button("🗑️ Clear All Documents", variant="stop")
|
||||||
|
clear_status = gr.Markdown("")
|
||||||
|
|
||||||
|
def confirm_and_clear(confirm_text):
|
||||||
|
if confirm_text.strip() == "DELETE ALL":
|
||||||
|
return app.clear_all_documents()
|
||||||
|
else:
|
||||||
|
return "⚠️ Please type 'DELETE ALL' to confirm"
|
||||||
|
|
||||||
|
delete_btn.click(
|
||||||
|
fn=app.delete_document,
|
||||||
|
inputs=[delete_doc_selector],
|
||||||
|
outputs=[delete_status, delete_doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
refresh_delete_btn.click(
|
||||||
|
fn=app.get_document_list,
|
||||||
|
outputs=[delete_doc_selector]
|
||||||
|
)
|
||||||
|
|
||||||
|
clear_all_btn.click(
|
||||||
|
fn=confirm_and_clear,
|
||||||
|
inputs=[clear_confirm],
|
||||||
|
outputs=[clear_status]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Tab 7: Statistics
|
||||||
|
with gr.Tab("📊 Statistics"):
|
||||||
|
gr.Markdown("### Knowledge Base Overview")
|
||||||
|
|
||||||
|
stats_output = gr.Markdown()
|
||||||
|
stats_btn = gr.Button("🔄 Refresh Statistics", variant="primary")
|
||||||
|
|
||||||
|
stats_btn.click(
|
||||||
|
fn=app.get_statistics,
|
||||||
|
outputs=[stats_output]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Auto-load stats on tab open
|
||||||
|
interface.load(
|
||||||
|
fn=app.get_statistics,
|
||||||
|
outputs=[stats_output]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Footer
|
||||||
|
gr.HTML("""
|
||||||
|
<div style="text-align: center; margin-top: 30px; padding: 20px; color: #666;">
|
||||||
|
<p>🔒 All processing happens locally on your machine • Your data never leaves your computer</p>
|
||||||
|
<p style="font-size: 12px;">Powered by Ollama, ChromaDB, and Sentence Transformers</p>
|
||||||
|
</div>
|
||||||
|
""")
|
||||||
|
|
||||||
|
return interface
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
logger.info("Starting KnowledgeHub...")
|
||||||
|
|
||||||
|
# Create and launch interface
|
||||||
|
interface = create_ui()
|
||||||
|
interface.launch(
|
||||||
|
server_name="127.0.0.1",
|
||||||
|
server_port=7860,
|
||||||
|
share=False,
|
||||||
|
inbrowser=True
|
||||||
|
)
|
||||||
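The `IngestionAgent`, `QuestionAgent`, and the other agents wired up above live in the `agents/` package, which is not part of this excerpt. As a rough sketch of the retrieve-then-generate flow the Q&A tab relies on, the snippet below shows one way the pieces could fit together; the vector-store path, collection name handling, and prompt wording are assumptions, while `EmbeddingModel` and `OllamaClient` come from `utils/` later in this commit.

```python
# Rough, illustrative RAG sketch - NOT the actual QuestionAgent implementation.
# Assumes ChromaDB data under ./vectorstore and a running local Ollama server.
import chromadb

from utils import EmbeddingModel, OllamaClient

client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection("knowledge_base")
embedding_model = EmbeddingModel()
llm_client = OllamaClient(model="llama3.2")


def answer_question(question: str, top_k: int = 5) -> str:
    # 1. Embed the question and retrieve the top-k most similar chunks.
    results = collection.query(
        query_embeddings=[embedding_model.embed_query(question)],
        n_results=top_k,
    )
    documents = results.get("documents") or [[]]
    context = "\n\n".join(documents[0])

    # 2. Ask the local LLM to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_client.generate(prompt, temperature=0.2)
```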
@@ -0,0 +1,13 @@
|
|||||||
|
"""
|
||||||
|
models
|
||||||
|
"""
|
||||||
|
from .knowledge_graph import KnowledgeGraph
|
||||||
|
from .document import Document, DocumentChunk, SearchResult, Summary
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
'KnowledgeGraph',
|
||||||
|
'Document',
|
||||||
|
'DocumentChunk',
|
||||||
|
'SearchResult',
|
||||||
|
'Summary'
|
||||||
|
]
|
||||||
@@ -0,0 +1,82 @@
|
|||||||
|
"""
|
||||||
|
Document data models
|
||||||
|
"""
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class DocumentChunk:
|
||||||
|
"""Represents a chunk of a document"""
|
||||||
|
id: str
|
||||||
|
document_id: str
|
||||||
|
content: str
|
||||||
|
chunk_index: int
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
preview = self.content[:100] + "..." if len(self.content) > 100 else self.content
|
||||||
|
return f"Chunk {self.chunk_index}: {preview}"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Document:
|
||||||
|
"""Represents a complete document"""
|
||||||
|
id: str
|
||||||
|
filename: str
|
||||||
|
filepath: str
|
||||||
|
content: str
|
||||||
|
chunks: List[DocumentChunk]
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
created_at: datetime = field(default_factory=datetime.now)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def num_chunks(self) -> int:
|
||||||
|
return len(self.chunks)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def total_chars(self) -> int:
|
||||||
|
return len(self.content)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def extension(self) -> str:
|
||||||
|
return self.metadata.get('extension', '')
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"Document: {self.filename} ({self.num_chunks} chunks, {self.total_chars} chars)"
|
||||||
|
|
||||||
|
def to_dict(self) -> Dict:
|
||||||
|
"""Convert to dictionary for storage"""
|
||||||
|
return {
|
||||||
|
'id': self.id,
|
||||||
|
'filename': self.filename,
|
||||||
|
'filepath': self.filepath,
|
||||||
|
'content': self.content[:500] + '...' if len(self.content) > 500 else self.content,
|
||||||
|
'num_chunks': self.num_chunks,
|
||||||
|
'total_chars': self.total_chars,
|
||||||
|
'extension': self.extension,
|
||||||
|
'created_at': self.created_at.isoformat(),
|
||||||
|
'metadata': self.metadata
|
||||||
|
}
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class SearchResult:
|
||||||
|
"""Represents a search result from the vector database"""
|
||||||
|
chunk: DocumentChunk
|
||||||
|
score: float
|
||||||
|
document_id: str
|
||||||
|
document_name: str
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"{self.document_name} (score: {self.score:.2f})"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Summary:
|
||||||
|
"""Represents a document summary"""
|
||||||
|
document_id: str
|
||||||
|
document_name: str
|
||||||
|
summary_text: str
|
||||||
|
key_points: List[str] = field(default_factory=list)
|
||||||
|
created_at: datetime = field(default_factory=datetime.now)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"Summary of {self.document_name}: {self.summary_text[:100]}..."
|
||||||
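A quick usage sketch of the data models above; the values are illustrative only, since in the app these objects are built by the ingestion pipeline.

```python
# Illustrative values; in the app these objects come from the ingestion pipeline.
from models import Document, DocumentChunk

chunk = DocumentChunk(
    id="doc1-0",
    document_id="doc1",
    content="KnowledgeHub stores every document as overlapping text chunks.",
    chunk_index=0,
)
doc = Document(
    id="doc1",
    filename="notes.md",
    filepath="/tmp/notes.md",
    content=chunk.content,
    chunks=[chunk],
    metadata={"extension": ".md"},
)

print(doc)                           # Document: notes.md (1 chunks, 62 chars)
print(doc.to_dict()["extension"])    # .md (via the extension property)
```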
@@ -0,0 +1,110 @@
|
|||||||
|
"""
|
||||||
|
Knowledge Graph data models
|
||||||
|
"""
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from typing import List, Dict, Set
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class KnowledgeNode:
|
||||||
|
"""Represents a concept or entity in the knowledge graph"""
|
||||||
|
id: str
|
||||||
|
name: str
|
||||||
|
node_type: str # 'document', 'concept', 'entity', 'topic'
|
||||||
|
description: str = ""
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
created_at: datetime = field(default_factory=datetime.now)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"{self.node_type.capitalize()}: {self.name}"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class KnowledgeEdge:
|
||||||
|
"""Represents a relationship between nodes"""
|
||||||
|
source_id: str
|
||||||
|
target_id: str
|
||||||
|
relationship: str # 'related_to', 'cites', 'contains', 'similar_to'
|
||||||
|
weight: float = 1.0
|
||||||
|
metadata: Dict = field(default_factory=dict)
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"{self.source_id} --[{self.relationship}]--> {self.target_id}"
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class KnowledgeGraph:
|
||||||
|
"""Represents the complete knowledge graph"""
|
||||||
|
nodes: Dict[str, KnowledgeNode] = field(default_factory=dict)
|
||||||
|
edges: List[KnowledgeEdge] = field(default_factory=list)
|
||||||
|
|
||||||
|
def add_node(self, node: KnowledgeNode):
|
||||||
|
"""Add a node to the graph"""
|
||||||
|
self.nodes[node.id] = node
|
||||||
|
|
||||||
|
def add_edge(self, edge: KnowledgeEdge):
|
||||||
|
"""Add an edge to the graph"""
|
||||||
|
if edge.source_id in self.nodes and edge.target_id in self.nodes:
|
||||||
|
self.edges.append(edge)
|
||||||
|
|
||||||
|
def get_neighbors(self, node_id: str) -> List[str]:
|
||||||
|
"""Get all nodes connected to a given node"""
|
||||||
|
neighbors = set()
|
||||||
|
for edge in self.edges:
|
||||||
|
if edge.source_id == node_id:
|
||||||
|
neighbors.add(edge.target_id)
|
||||||
|
elif edge.target_id == node_id:
|
||||||
|
neighbors.add(edge.source_id)
|
||||||
|
return list(neighbors)
|
||||||
|
|
||||||
|
def get_related_documents(self, node_id: str, max_depth: int = 2) -> Set[str]:
|
||||||
|
"""Get all documents related to a node within max_depth hops"""
|
||||||
|
related = set()
|
||||||
|
visited = set()
|
||||||
|
queue = [(node_id, 0)]
|
||||||
|
|
||||||
|
while queue:
|
||||||
|
current_id, depth = queue.pop(0)
|
||||||
|
|
||||||
|
if current_id in visited or depth > max_depth:
|
||||||
|
continue
|
||||||
|
|
||||||
|
visited.add(current_id)
|
||||||
|
|
||||||
|
# If this is a document node, add it
|
||||||
|
if current_id in self.nodes and self.nodes[current_id].node_type == 'document':
|
||||||
|
related.add(current_id)
|
||||||
|
|
||||||
|
# Add neighbors to queue
|
||||||
|
if depth < max_depth:
|
||||||
|
for neighbor_id in self.get_neighbors(current_id):
|
||||||
|
if neighbor_id not in visited:
|
||||||
|
queue.append((neighbor_id, depth + 1))
|
||||||
|
|
||||||
|
return related
|
||||||
|
|
||||||
|
def to_networkx(self):
|
||||||
|
"""Convert to NetworkX graph for visualization"""
|
||||||
|
try:
|
||||||
|
import networkx as nx
|
||||||
|
|
||||||
|
G = nx.Graph()
|
||||||
|
|
||||||
|
# Add nodes
|
||||||
|
for node_id, node in self.nodes.items():
|
||||||
|
G.add_node(node_id,
|
||||||
|
name=node.name,
|
||||||
|
type=node.node_type,
|
||||||
|
description=node.description)
|
||||||
|
|
||||||
|
# Add edges
|
||||||
|
for edge in self.edges:
|
||||||
|
G.add_edge(edge.source_id, edge.target_id,
|
||||||
|
relationship=edge.relationship,
|
||||||
|
weight=edge.weight)
|
||||||
|
|
||||||
|
return G
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return f"KnowledgeGraph: {len(self.nodes)} nodes, {len(self.edges)} edges"
|
||||||
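A small, illustrative example of the graph API above. The IDs are made up, and nothing in this excerpt shows how the graph is populated inside the app; `KnowledgeNode` and `KnowledgeEdge` are imported from the module directly because only `KnowledgeGraph` is re-exported by `models/__init__.py`.

```python
# Illustrative IDs; BFS-based lookup of related documents through a shared topic node.
from models.knowledge_graph import KnowledgeGraph, KnowledgeNode, KnowledgeEdge

graph = KnowledgeGraph()
graph.add_node(KnowledgeNode(id="doc_a", name="RAG notes", node_type="document"))
graph.add_node(KnowledgeNode(id="doc_b", name="ChromaDB guide", node_type="document"))
graph.add_node(KnowledgeNode(id="topic_rag", name="Retrieval Augmented Generation", node_type="topic"))

graph.add_edge(KnowledgeEdge(source_id="doc_a", target_id="topic_rag", relationship="related_to"))
graph.add_edge(KnowledgeEdge(source_id="doc_b", target_id="topic_rag", relationship="related_to"))

print(graph)                                      # KnowledgeGraph: 3 nodes, 2 edges
print(sorted(graph.get_neighbors("topic_rag")))   # ['doc_a', 'doc_b']
print(graph.get_related_documents("doc_a"))       # {'doc_a', 'doc_b'} (includes the start node)
```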
@@ -0,0 +1,26 @@
|
|||||||
|
# Core Dependencies
gradio>=4.0.0
chromadb>=0.4.0
sentence-transformers>=2.2.0
python-dotenv>=1.0.0

# Document Processing
pypdf>=3.0.0
python-docx>=1.0.0
markdown>=3.4.0
beautifulsoup4>=4.12.0

# Data Processing
numpy>=1.24.0
pandas>=2.0.0
tqdm>=4.65.0

# Visualization
plotly>=5.14.0
networkx>=3.0

# Ollama Client
requests>=2.31.0

# Optional but useful
scikit-learn>=1.3.0
71
community-contributions/sach91-bootcamp/week8/start.bat
Normal file
@@ -0,0 +1,71 @@
|
|||||||
|
@echo off
|
||||||
|
REM KnowledgeHub Startup Script for Windows
|
||||||
|
|
||||||
|
echo 🧠 Starting KnowledgeHub...
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Check if Ollama is installed
|
||||||
|
where ollama >nul 2>nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo ❌ Ollama is not installed or not in PATH
|
||||||
|
echo Please install Ollama from https://ollama.com/download
|
||||||
|
pause
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
REM Check Python
|
||||||
|
where python >nul 2>nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo ❌ Python is not installed or not in PATH
|
||||||
|
echo Please install Python 3.8+ from https://www.python.org/downloads/
|
||||||
|
pause
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Prerequisites found
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Check if Ollama service is running
|
||||||
|
tasklist /FI "IMAGENAME eq ollama.exe" 2>NUL | find /I /N "ollama.exe">NUL
|
||||||
|
if "%ERRORLEVEL%"=="1" (
|
||||||
|
echo ⚠️ Ollama is not running. Please start Ollama first.
|
||||||
|
echo You can start it from the Start menu or by running: ollama serve
|
||||||
|
pause
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Ollama is running
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Check if model exists
|
||||||
|
ollama list | find "llama3.2" >nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo 📥 Llama 3.2 model not found. Pulling model...
|
||||||
|
echo This may take a few minutes on first run...
|
||||||
|
ollama pull llama3.2
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Model ready
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Install dependencies
|
||||||
|
echo 🔍 Checking dependencies...
|
||||||
|
python -c "import gradio" 2>nul
|
||||||
|
if %errorlevel% neq 0 (
|
||||||
|
echo 📦 Installing dependencies...
|
||||||
|
pip install -r requirements.txt
|
||||||
|
)
|
||||||
|
|
||||||
|
echo ✅ Dependencies ready
|
||||||
|
echo.
|
||||||
|
|
||||||
|
REM Launch application
|
||||||
|
echo 🚀 Launching KnowledgeHub...
|
||||||
|
echo The application will open in your browser at http://127.0.0.1:7860
|
||||||
|
echo.
|
||||||
|
echo Press Ctrl+C to stop the application
|
||||||
|
echo.
|
||||||
|
|
||||||
|
python app.py
|
||||||
|
|
||||||
|
pause
|
||||||
42
community-contributions/sach91-bootcamp/week8/start.sh
Executable file
@@ -0,0 +1,42 @@
|
|||||||
|
#!/bin/bash

# KnowledgeHub Startup Script

echo "🧠 Starting KnowledgeHub..."
echo ""

# Check if Ollama is running
if ! pgrep -x "ollama" > /dev/null; then
    echo "⚠️ Ollama is not running. Starting Ollama..."
    ollama serve &
    sleep 3
fi

# Check if llama3.2 model exists
if ! ollama list | grep -q "llama3.2"; then
    echo "📥 Llama 3.2 model not found. Pulling model..."
    echo "This may take a few minutes on first run..."
    ollama pull llama3.2
fi

echo "✅ Ollama is ready"
echo ""

# Check Python dependencies
echo "🔍 Checking dependencies..."
if ! python -c "import gradio" 2>/dev/null; then
    echo "📦 Installing dependencies..."
    pip install -r requirements.txt
fi

echo "✅ Dependencies ready"
echo ""

# Launch the application
echo "🚀 Launching KnowledgeHub..."
echo "The application will open in your browser at http://127.0.0.1:7860"
echo ""
echo "Press Ctrl+C to stop the application"
echo ""

python app.py
@@ -0,0 +1,12 @@
|
|||||||
|
"""
|
||||||
|
models
|
||||||
|
"""
|
||||||
|
from .document_parser import DocumentParser
|
||||||
|
from .embeddings import EmbeddingModel
|
||||||
|
from .ollama_client import OllamaClient
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
'DocumentParser',
|
||||||
|
'EmbeddingModel',
|
||||||
|
'OllamaClient'
|
||||||
|
]
|
||||||
@@ -0,0 +1,218 @@
|
|||||||
|
"""
|
||||||
|
Document Parser - Extract text from various document formats
|
||||||
|
"""
|
||||||
|
import os
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class DocumentParser:
|
||||||
|
"""Parse various document formats into text chunks"""
|
||||||
|
|
||||||
|
SUPPORTED_FORMATS = ['.pdf', '.docx', '.txt', '.md', '.html', '.py']
|
||||||
|
|
||||||
|
def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
|
||||||
|
"""
|
||||||
|
Initialize document parser
|
||||||
|
|
||||||
|
Args:
|
||||||
|
chunk_size: Maximum characters per chunk
|
||||||
|
chunk_overlap: Overlap between chunks for context preservation
|
||||||
|
"""
|
||||||
|
self.chunk_size = chunk_size
|
||||||
|
self.chunk_overlap = chunk_overlap
|
||||||
|
|
||||||
|
def parse_file(self, file_path: str) -> Dict:
|
||||||
|
"""
|
||||||
|
Parse a file and return structured document data
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with document metadata and chunks
|
||||||
|
"""
|
||||||
|
path = Path(file_path)
|
||||||
|
|
||||||
|
if not path.exists():
|
||||||
|
raise FileNotFoundError(f"File not found: {file_path}")
|
||||||
|
|
||||||
|
extension = path.suffix.lower()
|
||||||
|
|
||||||
|
if extension not in self.SUPPORTED_FORMATS:
|
||||||
|
raise ValueError(f"Unsupported format: {extension}")
|
||||||
|
|
||||||
|
# Extract text based on file type
|
||||||
|
if extension == '.pdf':
|
||||||
|
text = self._parse_pdf(file_path)
|
||||||
|
elif extension == '.docx':
|
||||||
|
text = self._parse_docx(file_path)
|
||||||
|
elif extension == '.txt' or extension == '.py':
|
||||||
|
text = self._parse_txt(file_path)
|
||||||
|
elif extension == '.md':
|
||||||
|
text = self._parse_markdown(file_path)
|
||||||
|
elif extension == '.html':
|
||||||
|
text = self._parse_html(file_path)
|
||||||
|
else:
|
||||||
|
text = ""
|
||||||
|
|
||||||
|
# Create chunks
|
||||||
|
chunks = self._create_chunks(text)
|
||||||
|
|
||||||
|
return {
|
||||||
|
'filename': path.name,
|
||||||
|
'filepath': str(path.absolute()),
|
||||||
|
'extension': extension,
|
||||||
|
'text': text,
|
||||||
|
'chunks': chunks,
|
||||||
|
'num_chunks': len(chunks),
|
||||||
|
'total_chars': len(text)
|
||||||
|
}
|
||||||
|
|
||||||
|
def _parse_pdf(self, file_path: str) -> str:
|
||||||
|
"""Extract text from PDF"""
|
||||||
|
try:
|
||||||
|
from pypdf import PdfReader
|
||||||
|
|
||||||
|
reader = PdfReader(file_path)
|
||||||
|
text = ""
|
||||||
|
|
||||||
|
for page in reader.pages:
|
||||||
|
text += page.extract_text() + "\n\n"
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
logger.error("pypdf not installed. Install with: pip install pypdf")
|
||||||
|
return ""
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing PDF: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_docx(self, file_path: str) -> str:
|
||||||
|
"""Extract text from DOCX"""
|
||||||
|
try:
|
||||||
|
from docx import Document
|
||||||
|
|
||||||
|
doc = Document(file_path)
|
||||||
|
text = "\n\n".join([para.text for para in doc.paragraphs if para.text.strip()])
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
logger.error("python-docx not installed. Install with: pip install python-docx")
|
||||||
|
return ""
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing DOCX: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_txt(self, file_path: str) -> str:
|
||||||
|
"""Extract text from TXT"""
|
||||||
|
try:
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
return f.read().strip()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing TXT: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_markdown(self, file_path: str) -> str:
|
||||||
|
"""Extract text from Markdown"""
|
||||||
|
try:
|
||||||
|
import markdown
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
md_text = f.read()
|
||||||
|
|
||||||
|
# Convert markdown to HTML then extract text
|
||||||
|
html = markdown.markdown(md_text)
|
||||||
|
soup = BeautifulSoup(html, 'html.parser')
|
||||||
|
text = soup.get_text()
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
# Fallback: just read as plain text
|
||||||
|
return self._parse_txt(file_path)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing Markdown: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _parse_html(self, file_path: str) -> str:
|
||||||
|
"""Extract text from HTML"""
|
||||||
|
try:
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
html = f.read()
|
||||||
|
|
||||||
|
soup = BeautifulSoup(html, 'html.parser')
|
||||||
|
|
||||||
|
# Remove script and style elements
|
||||||
|
for script in soup(["script", "style"]):
|
||||||
|
script.decompose()
|
||||||
|
|
||||||
|
text = soup.get_text()
|
||||||
|
|
||||||
|
# Clean up whitespace
|
||||||
|
lines = (line.strip() for line in text.splitlines())
|
||||||
|
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
|
||||||
|
text = '\n'.join(chunk for chunk in chunks if chunk)
|
||||||
|
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
logger.error("beautifulsoup4 not installed. Install with: pip install beautifulsoup4")
|
||||||
|
return ""
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error parsing HTML: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _create_chunks(self, text: str) -> List[str]:
|
||||||
|
"""
|
||||||
|
Split text into overlapping chunks
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Full text to chunk
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of text chunks
|
||||||
|
"""
|
||||||
|
if not text:
|
||||||
|
return []
|
||||||
|
|
||||||
|
chunks = []
|
||||||
|
start = 0
|
||||||
|
text_length = len(text)
|
||||||
|
|
||||||
|
while start < text_length:
|
||||||
|
logger.info(f'Processing chunk at {start}, for len {text_length}.')
|
||||||
|
|
||||||
|
end = start + self.chunk_size
|
||||||
|
|
||||||
|
# If this isn't the last chunk, try to break at a sentence or paragraph
|
||||||
|
if end < text_length:
|
||||||
|
# Look for paragraph break first
|
||||||
|
break_pos = text.rfind('\n\n', start, end)
|
||||||
|
if break_pos == -1:
|
||||||
|
# Look for sentence break
|
||||||
|
break_pos = text.rfind('. ', start, end)
|
||||||
|
if break_pos == -1:
|
||||||
|
# Look for any space
|
||||||
|
break_pos = text.rfind(' ', start, end)
|
||||||
|
|
||||||
|
if break_pos != -1 and break_pos > start and break_pos > end - self.chunk_overlap:
|
||||||
|
end = break_pos + 1
|
||||||
|
|
||||||
|
chunk = text[start:end].strip()
|
||||||
|
if chunk:
|
||||||
|
chunks.append(chunk)
|
||||||
|
|
||||||
|
# Move start position with overlap
|
||||||
|
start = end - self.chunk_overlap
|
||||||
|
if start < 0:
|
||||||
|
start = 0
|
||||||
|
|
||||||
|
return chunks
|
||||||
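A short usage sketch for `DocumentParser`; the sample path is a placeholder, and any supported extension would work the same way.

```python
# Usage sketch; "docs/example.md" is a placeholder path, not part of this commit.
from utils import DocumentParser

parser = DocumentParser(chunk_size=1000, chunk_overlap=200)
parsed = parser.parse_file("docs/example.md")

print(parsed["filename"], parsed["num_chunks"], parsed["total_chars"])
for i, chunk in enumerate(parsed["chunks"][:3]):
    print(f"chunk {i}: {chunk[:80]}...")
```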
@@ -0,0 +1,84 @@
|
|||||||
|
"""
|
||||||
|
Embeddings utility using sentence-transformers
|
||||||
|
"""
|
||||||
|
from sentence_transformers import SentenceTransformer
|
||||||
|
import numpy as np
|
||||||
|
from typing import List, Union
|
||||||
|
import logging
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class EmbeddingModel:
|
||||||
|
"""Wrapper for sentence transformer embeddings"""
|
||||||
|
|
||||||
|
def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
|
||||||
|
"""
|
||||||
|
Initialize embedding model
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_name: HuggingFace model name for embeddings
|
||||||
|
"""
|
||||||
|
self.model_name = model_name
|
||||||
|
logger.info(f"Loading embedding model: {model_name}")
|
||||||
|
self.model = SentenceTransformer(model_name)
|
||||||
|
self.dimension = self.model.get_sentence_embedding_dimension()
|
||||||
|
logger.info(f"Embedding dimension: {self.dimension}")
|
||||||
|
|
||||||
|
def embed(self, texts: Union[str, List[str]]) -> np.ndarray:
|
||||||
|
"""
|
||||||
|
Generate embeddings for text(s)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
texts: Single text or list of texts
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Numpy array of embeddings
|
||||||
|
"""
|
||||||
|
if isinstance(texts, str):
|
||||||
|
texts = [texts]
|
||||||
|
|
||||||
|
embeddings = self.model.encode(texts, show_progress_bar=False)
|
||||||
|
return embeddings
|
||||||
|
|
||||||
|
def embed_query(self, query: str) -> List[float]:
|
||||||
|
"""
|
||||||
|
Embed a single query - returns as list for ChromaDB compatibility
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query: Query text
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of floats representing the embedding
|
||||||
|
"""
|
||||||
|
embedding = self.model.encode([query], show_progress_bar=False)[0]
|
||||||
|
return embedding.tolist()
|
||||||
|
|
||||||
|
def embed_documents(self, documents: List[str]) -> List[List[float]]:
|
||||||
|
"""
|
||||||
|
Embed multiple documents - returns as list of lists for ChromaDB
|
||||||
|
|
||||||
|
Args:
|
||||||
|
documents: List of document texts
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of embeddings (each as list of floats)
|
||||||
|
"""
|
||||||
|
embeddings = self.model.encode(documents, show_progress_bar=False)
|
||||||
|
return embeddings.tolist()
|
||||||
|
|
||||||
|
def similarity(self, text1: str, text2: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate cosine similarity between two texts
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text1: First text
|
||||||
|
text2: Second text
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Similarity score between 0 and 1
|
||||||
|
"""
|
||||||
|
emb1, emb2 = self.model.encode([text1, text2])
|
||||||
|
|
||||||
|
# Cosine similarity
|
||||||
|
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
|
||||||
|
return float(similarity)
|
||||||
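A short usage sketch for `EmbeddingModel`; the all-MiniLM-L6-v2 weights are downloaded on first use and produce 384-dimensional vectors.

```python
# Usage sketch; illustrative strings only.
from utils import EmbeddingModel

embedder = EmbeddingModel()

query_vec = embedder.embed_query("How do I export my knowledge base?")
doc_vecs = embedder.embed_documents([
    "Exports can be saved as Markdown, HTML, or plain text.",
    "ChromaDB persists vectors in the local vector store.",
])

print(len(query_vec), embedder.dimension)    # 384 384
print(len(doc_vecs), len(doc_vecs[0]))       # 2 384
print(embedder.similarity("export my notes", "download my documents"))  # cosine score
```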
@@ -0,0 +1,107 @@
|
|||||||
|
"""
|
||||||
|
Ollama Client - Wrapper for local Ollama API
|
||||||
|
"""
|
||||||
|
import requests
|
||||||
|
import json
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
import logging
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class OllamaClient:
|
||||||
|
"""Client for interacting with local Ollama models"""
|
||||||
|
|
||||||
|
def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3.2"):
|
||||||
|
self.base_url = base_url
|
||||||
|
self.model = model
|
||||||
|
self.api_url = f"{base_url}/api"
|
||||||
|
|
||||||
|
def generate(self, prompt: str, system: Optional[str] = None,
|
||||||
|
temperature: float = 0.7, max_tokens: int = 2048) -> str:
|
||||||
|
"""Generate text from a prompt"""
|
||||||
|
try:
|
||||||
|
payload = {
|
||||||
|
"model": self.model,
|
||||||
|
"prompt": prompt,
|
||||||
|
"stream": False,
|
||||||
|
"options": {
|
||||||
|
"temperature": temperature,
|
||||||
|
"num_predict": max_tokens
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if system:
|
||||||
|
payload["system"] = system
|
||||||
|
|
||||||
|
response = requests.post(
|
||||||
|
f"{self.api_url}/generate",
|
||||||
|
json=payload,
|
||||||
|
timeout=1200
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
result = response.json()
|
||||||
|
return result.get("response", "").strip()
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Ollama API error: {e}")
|
||||||
|
return f"Error: Unable to connect to Ollama. Is it running? ({str(e)})"
|
||||||
|
|
||||||
|
def chat(self, messages: List[Dict[str, str]],
|
||||||
|
temperature: float = 0.7, max_tokens: int = 2048) -> str:
|
||||||
|
"""Chat completion with message history"""
|
||||||
|
try:
|
||||||
|
payload = {
|
||||||
|
"model": self.model,
|
||||||
|
"messages": messages,
|
||||||
|
"stream": False,
|
||||||
|
"options": {
|
||||||
|
"temperature": temperature,
|
||||||
|
"num_predict": max_tokens
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
response = requests.post(
|
||||||
|
f"{self.api_url}/chat",
|
||||||
|
json=payload,
|
||||||
|
timeout=1200
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
result = response.json()
|
||||||
|
return result.get("message", {}).get("content", "").strip()
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Ollama API error: {e}")
|
||||||
|
return f"Error: Unable to connect to Ollama. Is it running? ({str(e)})"
|
||||||
|
|
||||||
|
def check_connection(self) -> bool:
|
||||||
|
"""Check if Ollama is running and model is available"""
|
||||||
|
try:
|
||||||
|
response = requests.get(f"{self.base_url}/api/tags", timeout=5)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
models = response.json().get("models", [])
|
||||||
|
model_names = [m["name"] for m in models]
|
||||||
|
|
||||||
|
if self.model not in model_names:
|
||||||
|
logger.warning(f"Model {self.model} not found. Available: {model_names}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
logger.error(f"Cannot connect to Ollama: {e}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
def list_models(self) -> List[str]:
|
||||||
|
"""List available Ollama models"""
|
||||||
|
try:
|
||||||
|
response = requests.get(f"{self.base_url}/api/tags", timeout=5)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
models = response.json().get("models", [])
|
||||||
|
return [m["name"] for m in models]
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException:
|
||||||
|
return []
|
||||||
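A short usage sketch for `OllamaClient`, assuming `ollama serve` is running locally and `ollama pull llama3.2` has completed. Note that `check_connection` compares the model name against `ollama list` exactly, so a tagged name such as `llama3.2:latest` may need to be passed explicitly.

```python
# Usage sketch; requires a local Ollama server with the llama3.2 model available.
from utils import OllamaClient

llm = OllamaClient(model="llama3.2")

if llm.check_connection():
    print(llm.list_models())
    print(llm.generate(
        "In two sentences, why does local inference help with privacy?",
        system="You are a concise technical assistant.",
        temperature=0.3,
    ))
    print(llm.chat([
        {"role": "user", "content": "Give one tip for chunking long documents."}
    ]))
else:
    print("Ollama is not reachable at http://localhost:11434")
```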
129
community-contributions/sach91-bootcamp/week8/verify_setup.py
Normal file
@@ -0,0 +1,129 @@
|
|||||||
|
"""
|
||||||
|
Setup Verification Script for KnowledgeHub
|
||||||
|
Run this to check if everything is configured correctly
|
||||||
|
"""
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
|
||||||
|
print("🔍 KnowledgeHub Setup Verification\n")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
# Check Python version
|
||||||
|
print(f"✓ Python version: {sys.version}")
|
||||||
|
print(f"✓ Python executable: {sys.executable}")
|
||||||
|
print(f"✓ Current directory: {os.getcwd()}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check directory structure
|
||||||
|
print("📁 Checking directory structure...")
|
||||||
|
required_dirs = ['agents', 'models', 'utils']
|
||||||
|
for dir_name in required_dirs:
|
||||||
|
if os.path.isdir(dir_name):
|
||||||
|
init_file = os.path.join(dir_name, '__init__.py')
|
||||||
|
if os.path.exists(init_file):
|
||||||
|
print(f" ✓ {dir_name}/ exists with __init__.py")
|
||||||
|
else:
|
||||||
|
print(f" ⚠️ {dir_name}/ exists but missing __init__.py")
|
||||||
|
else:
|
||||||
|
print(f" ❌ {dir_name}/ directory not found")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check required files
|
||||||
|
print("📄 Checking required files...")
|
||||||
|
required_files = ['app.py', 'requirements.txt']
|
||||||
|
for file_name in required_files:
|
||||||
|
if os.path.exists(file_name):
|
||||||
|
print(f" ✓ {file_name} exists")
|
||||||
|
else:
|
||||||
|
print(f" ❌ {file_name} not found")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Try importing modules
|
||||||
|
print("📦 Testing imports...")
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
from utils import OllamaClient, EmbeddingModel, DocumentParser
|
||||||
|
print(" ✓ utils module imported successfully")
|
||||||
|
except ImportError as e:
|
||||||
|
print(f" ❌ Cannot import utils: {e}")
|
||||||
|
errors.append(str(e))
|
||||||
|
|
||||||
|
try:
|
||||||
|
from models import Document, DocumentChunk, SearchResult, Summary
|
||||||
|
print(" ✓ models module imported successfully")
|
||||||
|
except ImportError as e:
|
||||||
|
print(f" ❌ Cannot import models: {e}")
|
||||||
|
errors.append(str(e))
|
||||||
|
|
||||||
|
try:
|
||||||
|
from agents import (
|
||||||
|
IngestionAgent, QuestionAgent, SummaryAgent,
|
||||||
|
ConnectionAgent, ExportAgent
|
||||||
|
)
|
||||||
|
print(" ✓ agents module imported successfully")
|
||||||
|
except ImportError as e:
|
||||||
|
print(f" ❌ Cannot import agents: {e}")
|
||||||
|
errors.append(str(e))
|
||||||
|
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check dependencies
|
||||||
|
print("📚 Checking Python dependencies...")
|
||||||
|
required_packages = [
|
||||||
|
'gradio', 'chromadb', 'sentence_transformers',
|
||||||
|
'requests', 'numpy', 'tqdm'
|
||||||
|
]
|
||||||
|
|
||||||
|
missing_packages = []
|
||||||
|
for package in required_packages:
|
||||||
|
try:
|
||||||
|
__import__(package.replace('-', '_'))
|
||||||
|
print(f" ✓ {package} installed")
|
||||||
|
except ImportError:
|
||||||
|
print(f" ❌ {package} not installed")
|
||||||
|
missing_packages.append(package)
|
||||||
|
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check Ollama
|
||||||
|
print("🤖 Checking Ollama...")
|
||||||
|
try:
|
||||||
|
import requests
|
||||||
|
response = requests.get('http://localhost:11434/api/tags', timeout=2)
|
||||||
|
if response.status_code == 200:
|
||||||
|
print(" ✓ Ollama is running")
|
||||||
|
models = response.json().get('models', [])
|
||||||
|
if models:
|
||||||
|
print(f" ✓ Available models: {[m['name'] for m in models]}")
|
||||||
|
if any('llama3.2' in m['name'] for m in models):
|
||||||
|
print(" ✓ llama3.2 model found")
|
||||||
|
else:
|
||||||
|
print(" ⚠️ llama3.2 model not found. Run: ollama pull llama3.2")
|
||||||
|
else:
|
||||||
|
print(" ⚠️ No models found. Run: ollama pull llama3.2")
|
||||||
|
else:
|
||||||
|
print(" ⚠️ Ollama responded but with error")
|
||||||
|
except Exception as e:
|
||||||
|
print(f" ❌ Cannot connect to Ollama: {e}")
|
||||||
|
print(" Start Ollama with: ollama serve")
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
# Final summary
|
||||||
|
if errors or missing_packages:
|
||||||
|
print("\n⚠️ ISSUES FOUND:\n")
|
||||||
|
if errors:
|
||||||
|
print("Import Errors:")
|
||||||
|
for error in errors:
|
||||||
|
print(f" - {error}")
|
||||||
|
if missing_packages:
|
||||||
|
print("\nMissing Packages:")
|
||||||
|
print(f" Run: pip install {' '.join(missing_packages)}")
|
||||||
|
print("\n💡 Fix these issues before running app.py")
|
||||||
|
else:
|
||||||
|
print("\n✅ All checks passed! You're ready to run:")
|
||||||
|
print(" python app.py")
|
||||||
|
|
||||||
|
print()
|
||||||
Block a user