Merge pull request #911 from sach91/sach91-bootcamp-wk8

sach91 bootcamp week8 exercise
This commit is contained in:
Ed Donner
2025-10-30 22:08:13 -04:00
committed by GitHub
20 changed files with 3124 additions and 0 deletions

View File

@@ -0,0 +1,259 @@
# 🧠 KnowledgeHub - Personal Knowledge Management & Research Assistant
An elegant, fully local AI-powered knowledge management system that helps you organize, search, and understand your documents using state-of-the-art LLM technology.
## ✨ Features
### 🎯 Core Capabilities
- **📤 Document Ingestion**: Upload PDF, DOCX, TXT, MD, and HTML files
- **❓ Intelligent Q&A**: Ask questions and get answers from your documents using RAG
- **📝 Smart Summarization**: Generate concise summaries with key points
- **🔗 Connection Discovery**: Find relationships between documents
- **💾 Multi-format Export**: Export as Markdown, HTML, or plain text
- **📊 Statistics Dashboard**: Track your knowledge base growth
### 🔒 Privacy-First
- **100% Local Processing**: All data stays on your machine
- **No Cloud Dependencies**: Uses Ollama for local LLM inference
- **Open Source**: Full transparency and control
### ⚡ Technology Stack
- **LLM**: Ollama with Llama 3.2 (3B) or Llama 3.1 (8B)
- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2)
- **Vector Database**: ChromaDB
- **UI**: Gradio
- **Document Processing**: pypdf, python-docx, beautifulsoup4
## 🚀 Quick Start
### Prerequisites
1. **Python 3.8+** installed
2. **Ollama** installed and running
#### Installing Ollama
**macOS/Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
**Windows:**
Download from [ollama.com/download](https://ollama.com/download)
### Installation
1. **Clone or download this repository**
2. **Install Python dependencies:**
```bash
pip install -r requirements.txt
```
3. **Pull Llama model using Ollama:**
```bash
# For faster inference (recommended for most users)
ollama pull llama3.2
# OR for better quality (requires more RAM)
ollama pull llama3.1
```
4. **Start Ollama server** (if not already running):
```bash
ollama serve
```
5. **Launch KnowledgeHub:**
```bash
python app.py
```
The application will open in your browser at `http://127.0.0.1:7860`
## 📖 Usage Guide
### 1. Upload Documents
- Go to the "Upload Documents" tab
- Select a file (PDF, DOCX, TXT, MD, or HTML)
- Click "Upload & Process"
- The document will be chunked and stored in your local vector database
### 2. Ask Questions
- Go to the "Ask Questions" tab
- Type your question in natural language
- Adjust the number of sources to retrieve (default: 5)
- Click "Ask" to get an AI-generated answer with sources
### 3. Summarize Documents
- Go to the "Summarize" tab
- Select a document from the dropdown
- Click "Generate Summary"
- Get a concise summary with key points
### 4. Find Connections
- Go to the "Find Connections" tab
- Select a document to analyze
- Adjust how many related documents to find
- See documents that are semantically similar
### 5. Export Knowledge
- Go to the "Export" tab
- Choose your format (Markdown, HTML, or Text)
- Click "Export" to download your knowledge base
### 6. View Statistics
- Go to the "Statistics" tab
- See overview of your knowledge base
- Track total documents, chunks, and characters
## 🏗️ Architecture
```
KnowledgeHub/
├── agents/ # Specialized AI agents
│ ├── base_agent.py # Base class for all agents
│ ├── ingestion_agent.py # Document processing
│ ├── question_agent.py # RAG-based Q&A
│ ├── summary_agent.py # Summarization
│ ├── connection_agent.py # Finding relationships
│ └── export_agent.py # Exporting data
├── models/ # Data models
│ ├── document.py # Document structures
│ └── knowledge_graph.py # Graph structures
├── utils/ # Utilities
│ ├── ollama_client.py # Ollama API wrapper
│ ├── embeddings.py # Embedding generation
│ └── document_parser.py # File parsing
├── vectorstore/ # ChromaDB storage (auto-created)
├── temp_uploads/ # Temporary file storage (auto-created)
├── app.py # Main Gradio application
└── requirements.txt # Python dependencies
```
## 🎯 Multi-Agent Framework
KnowledgeHub uses a sophisticated multi-agent architecture:
1. **Ingestion Agent**: Parses documents, creates chunks, generates embeddings
2. **Question Agent**: Retrieves relevant context and answers questions
3. **Summary Agent**: Creates concise summaries and extracts key points
4. **Connection Agent**: Finds semantic relationships between documents
5. **Export Agent**: Formats and exports knowledge in multiple formats
Each agent is independent, reusable, and focused on a specific task, following best practices in agentic AI development.
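Here is a minimal sketch of that wiring, distilled from `app.py` in this repository (the filename `notes.pdf` is just a placeholder):

```python
import chromadb
from utils import OllamaClient, EmbeddingModel
from agents import IngestionAgent, QuestionAgent

# Shared infrastructure: one vector store, one LLM client, one embedder
client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection(name="knowledge_base")
llm = OllamaClient(model="llama3.2")
embedder = EmbeddingModel()

# Agents share the collection and the LLM client
ingestion = IngestionAgent(collection=collection, embedding_model=embedder, llm_client=llm)
qa = QuestionAgent(collection=collection, embedding_model=embedder, llm_client=llm)

document = ingestion.process("notes.pdf")  # parse, chunk, embed, store
result = qa.process("What are the main findings?", top_k=5)
print(result["answer"])
```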
## ⚙️ Configuration
### Changing Models
Edit `app.py` to use different models:
```python
# For Llama 3.1 8B (better quality, more RAM)
self.llm_client = OllamaClient(model="llama3.1")
# For Llama 3.2 3B (faster, less RAM)
self.llm_client = OllamaClient(model="llama3.2")
```
### Adjusting Chunk Size
Edit `agents/ingestion_agent.py`:
```python
self.parser = DocumentParser(
chunk_size=1000, # Characters per chunk
chunk_overlap=200 # Overlap between chunks
)
```
### Changing Embedding Model
Edit `app.py`:
```python
self.embedding_model = EmbeddingModel(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```
## 🔧 Troubleshooting
### "Cannot connect to Ollama"
- Ensure Ollama is installed: `ollama --version`
- Start the Ollama service: `ollama serve`
- Verify the model is pulled: `ollama list`
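You can also probe the server directly; Ollama listens on port 11434 by default:
```bash
# Returns a JSON list of installed models if the server is running
curl http://localhost:11434/api/tags
```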
### "Module not found" errors
- Ensure all dependencies are installed: `pip install -r requirements.txt`
- Try upgrading pip: `pip install --upgrade pip`
### "Out of memory" errors
- Use Llama 3.2 (3B) instead of Llama 3.1 (8B)
- Reduce chunk_size in document parser
- Process fewer documents at once
### Slow response times
- Ensure you're using a CUDA-enabled GPU (if available)
- Reduce the number of retrieved chunks (top_k parameter)
- Use a smaller model (llama3.2)
## 🎓 Learning Resources
This project demonstrates key concepts in LLM engineering:
- **RAG (Retrieval Augmented Generation)**: Combining retrieval with generation
- **Vector Databases**: Using ChromaDB for semantic search
- **Multi-Agent Systems**: Specialized agents working together
- **Embeddings**: Semantic representation of text
- **Local LLM Deployment**: Using Ollama for privacy-focused AI
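For reference, the retrieval-then-generation loop implemented by the Question Agent condenses to a few lines. A simplified sketch (error handling and source tracking omitted):

```python
def answer(question, collection, embedder, llm, top_k=5):
    query_emb = embedder.embed_query(question)            # embed the query
    hits = collection.query(query_embeddings=[query_emb],
                            n_results=top_k)              # semantic search
    context = "\n\n".join(hits["documents"][0])           # stitch retrieved chunks
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt=prompt,
                        system="Answer using only the context above.",
                        temperature=0.3, max_tokens=1024)
```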
## 📊 Performance
**Hardware Requirements:**
- Minimum: 8GB RAM, CPU
- Recommended: 16GB RAM, GPU (NVIDIA with CUDA)
- Optimal: 32GB RAM, GPU (RTX 3060 or better)
**Processing Speed** (Llama 3.2 on M1 Mac):
- Document ingestion: ~2-5 seconds per page
- Question answering: ~5-15 seconds
- Summarization: ~10-20 seconds
## 🤝 Contributing
This is a learning project showcasing LLM engineering principles. Feel free to:
- Experiment with different models
- Add new agents for specialized tasks
- Improve the UI
- Optimize performance
## 📄 License
This project is open source and available for educational purposes.
## 🙏 Acknowledgments
Built with:
- [Ollama](https://ollama.com/) - Local LLM runtime
- [Gradio](https://gradio.app/) - UI framework
- [ChromaDB](https://www.trychroma.com/) - Vector database
- [Sentence Transformers](https://www.sbert.net/) - Embeddings
- [Llama](https://ai.meta.com/llama/) - Meta's open source LLMs
## 🎯 Next Steps
Potential enhancements:
1. Add support for images and diagrams
2. Implement multi-document chat history
3. Build a visual knowledge graph
4. Add collaborative features
5. Create mobile app interface
6. Implement advanced filters and search
7. Add citation tracking
8. Create automated study guides
---
**Made with ❤️ for the LLM Engineering Community**

View File

@@ -0,0 +1,18 @@
"""
KnowledgeHub Agents
"""
from .base_agent import BaseAgent
from .ingestion_agent import IngestionAgent
from .question_agent import QuestionAgent
from .summary_agent import SummaryAgent
from .connection_agent import ConnectionAgent
from .export_agent import ExportAgent
__all__ = [
'BaseAgent',
'IngestionAgent',
'QuestionAgent',
'SummaryAgent',
'ConnectionAgent',
'ExportAgent'
]

View File

@@ -0,0 +1,91 @@
"""
Base Agent class - Foundation for all specialized agents
"""
from abc import ABC, abstractmethod
import logging
from typing import Optional, Dict, Any
from utils.ollama_client import OllamaClient
logger = logging.getLogger(__name__)
class BaseAgent(ABC):
"""Abstract base class for all agents"""
def __init__(self, name: str, llm_client: Optional[OllamaClient] = None,
model: str = "llama3.2"):
"""
Initialize base agent
Args:
name: Agent name for logging
llm_client: Shared Ollama client (creates new one if None)
model: Ollama model to use
"""
self.name = name
self.model = model
# Use shared client or create new one
if llm_client is None:
self.llm = OllamaClient(model=model)
logger.info(f"{self.name} initialized with new LLM client (model: {model})")
else:
self.llm = llm_client
logger.info(f"{self.name} initialized with shared LLM client (model: {model})")
def generate(self, prompt: str, system: Optional[str] = None,
temperature: float = 0.7, max_tokens: int = 2048) -> str:
"""
Generate text using the LLM
Args:
prompt: User prompt
system: System message (optional)
temperature: Sampling temperature
max_tokens: Maximum tokens to generate
Returns:
Generated text
"""
logger.info(f"{self.name} generating response")
response = self.llm.generate(
prompt=prompt,
system=system,
temperature=temperature,
max_tokens=max_tokens
)
logger.debug(f"{self.name} generated {len(response)} characters")
return response
def chat(self, messages: list, temperature: float = 0.7,
max_tokens: int = 2048) -> str:
"""
Chat completion with message history
Args:
messages: List of message dicts with 'role' and 'content'
temperature: Sampling temperature
max_tokens: Maximum tokens to generate
Returns:
Generated text
"""
logger.info(f"{self.name} processing chat with {len(messages)} messages")
response = self.llm.chat(
messages=messages,
temperature=temperature,
max_tokens=max_tokens
)
logger.debug(f"{self.name} generated {len(response)} characters")
return response
@abstractmethod
def process(self, *args, **kwargs) -> Any:
"""
Main processing method - must be implemented by subclasses
Each agent implements its specialized logic here
"""
pass
def __str__(self):
return f"{self.name} (model: {self.model})"

View File

@@ -0,0 +1,289 @@
"""
Connection Agent - Finds relationships and connections between documents
"""
import logging
from typing import List, Dict, Tuple
from agents.base_agent import BaseAgent
from models.knowledge_graph import KnowledgeNode, KnowledgeEdge, KnowledgeGraph
from utils.embeddings import EmbeddingModel
import chromadb
import numpy as np
logger = logging.getLogger(__name__)
class ConnectionAgent(BaseAgent):
"""Agent that discovers connections between documents and concepts"""
def __init__(self, collection: chromadb.Collection,
embedding_model: EmbeddingModel,
llm_client=None, model: str = "llama3.2"):
"""
Initialize connection agent
Args:
collection: ChromaDB collection with documents
embedding_model: Model for computing similarities
llm_client: Optional shared LLM client
model: Ollama model name
"""
super().__init__(name="ConnectionAgent", llm_client=llm_client, model=model)
self.collection = collection
self.embedding_model = embedding_model
logger.info(f"{self.name} initialized")
def process(self, document_id: str = None, query: str = None,
top_k: int = 5) -> Dict:
"""
Find documents related to a document or query
Args:
document_id: ID of reference document
query: Search query (used if document_id not provided)
top_k: Number of related documents to find
Returns:
Dictionary with related documents and connections
"""
if document_id:
logger.info(f"{self.name} finding connections for document: {document_id}")
return self._find_related_to_document(document_id, top_k)
elif query:
logger.info(f"{self.name} finding connections for query: {query[:100]}")
return self._find_related_to_query(query, top_k)
else:
return {'related': [], 'error': 'No document_id or query provided'}
def _find_related_to_document(self, document_id: str, top_k: int) -> Dict:
"""Find documents related to a specific document"""
try:
# Get chunks from the document
results = self.collection.get(
where={"document_id": document_id},
include=['embeddings', 'documents', 'metadatas']
)
if not results['ids']:
return {'related': [], 'error': 'Document not found'}
# Use the first chunk's embedding as representative
query_embedding = results['embeddings'][0]
document_name = results['metadatas'][0].get('filename', 'Unknown')
# Search for similar chunks from OTHER documents
search_results = self.collection.query(
query_embeddings=[query_embedding],
n_results=top_k * 3, # Get more to filter out same document
include=['documents', 'metadatas', 'distances']
)
# Filter out chunks from the same document
related = []
seen_docs = set([document_id])
if search_results['ids']:
for i in range(len(search_results['ids'][0])):
related_doc_id = search_results['metadatas'][0][i].get('document_id')
if related_doc_id not in seen_docs:
seen_docs.add(related_doc_id)
similarity = 1.0 - search_results['distances'][0][i]
related.append({
'document_id': related_doc_id,
'document_name': search_results['metadatas'][0][i].get('filename', 'Unknown'),
'similarity': float(similarity),
'preview': search_results['documents'][0][i][:150] + "..."
})
if len(related) >= top_k:
break
return {
'source_document': document_name,
'source_id': document_id,
'related': related,
'num_related': len(related)
}
except Exception as e:
logger.error(f"Error finding related documents: {e}")
return {'related': [], 'error': str(e)}
def _find_related_to_query(self, query: str, top_k: int) -> Dict:
"""Find documents related to a query"""
try:
# Generate query embedding
query_embedding = self.embedding_model.embed_query(query)
# Search
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=top_k * 2, # Get more to deduplicate by document
include=['documents', 'metadatas', 'distances']
)
# Deduplicate by document
related = []
seen_docs = set()
if results['ids']:
for i in range(len(results['ids'][0])):
doc_id = results['metadatas'][0][i].get('document_id')
if doc_id not in seen_docs:
seen_docs.add(doc_id)
similarity = 1.0 - results['distances'][0][i]
related.append({
'document_id': doc_id,
'document_name': results['metadatas'][0][i].get('filename', 'Unknown'),
'similarity': float(similarity),
'preview': results['documents'][0][i][:150] + "..."
})
if len(related) >= top_k:
break
return {
'query': query,
'related': related,
'num_related': len(related)
}
except Exception as e:
logger.error(f"Error finding related documents: {e}")
return {'related': [], 'error': str(e)}
def build_knowledge_graph(self, similarity_threshold: float = 0.7) -> KnowledgeGraph:
"""
Build a knowledge graph showing document relationships
Args:
similarity_threshold: Minimum similarity to create an edge
Returns:
KnowledgeGraph object
"""
logger.info(f"{self.name} building knowledge graph")
graph = KnowledgeGraph()
try:
# Get all documents
all_results = self.collection.get(
include=['embeddings', 'metadatas']
)
if not all_results['ids']:
return graph
# Group by document
documents = {}
for i, metadata in enumerate(all_results['metadatas']):
doc_id = metadata.get('document_id')
if doc_id not in documents:
documents[doc_id] = {
'name': metadata.get('filename', 'Unknown'),
'embedding': all_results['embeddings'][i]
}
# Create nodes
for doc_id, doc_data in documents.items():
node = KnowledgeNode(
id=doc_id,
name=doc_data['name'],
node_type='document',
description=f"Document: {doc_data['name']}"
)
graph.add_node(node)
# Create edges based on similarity
doc_ids = list(documents.keys())
for i, doc_id1 in enumerate(doc_ids):
emb1 = np.array(documents[doc_id1]['embedding'])
for doc_id2 in doc_ids[i+1:]:
emb2 = np.array(documents[doc_id2]['embedding'])
# Calculate similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
if similarity >= similarity_threshold:
edge = KnowledgeEdge(
source_id=doc_id1,
target_id=doc_id2,
relationship='similar_to',
weight=float(similarity)
)
graph.add_edge(edge)
logger.info(f"{self.name} built graph with {len(graph.nodes)} nodes and {len(graph.edges)} edges")
return graph
except Exception as e:
logger.error(f"Error building knowledge graph: {e}")
return graph
def explain_connection(self, doc_id1: str, doc_id2: str) -> str:
"""
Use LLM to explain why two documents are related
Args:
doc_id1: First document ID
doc_id2: Second document ID
Returns:
Explanation text
"""
try:
# Get sample chunks from each document
results1 = self.collection.get(
where={"document_id": doc_id1},
limit=2,
include=['documents', 'metadatas']
)
results2 = self.collection.get(
where={"document_id": doc_id2},
limit=2,
include=['documents', 'metadatas']
)
if not results1['ids'] or not results2['ids']:
return "Could not retrieve documents"
doc1_name = results1['metadatas'][0].get('filename', 'Document 1')
doc2_name = results2['metadatas'][0].get('filename', 'Document 2')
doc1_text = " ".join(results1['documents'][:2])[:1000]
doc2_text = " ".join(results2['documents'][:2])[:1000]
system_prompt = """You analyze documents and explain their relationships.
Provide a brief, clear explanation of how two documents are related."""
user_prompt = f"""Analyze these two documents and explain how they are related:
Document 1 ({doc1_name}):
{doc1_text}
Document 2 ({doc2_name}):
{doc2_text}
How are these documents related? Provide a concise explanation:"""
explanation = self.generate(
prompt=user_prompt,
system=system_prompt,
temperature=0.3,
max_tokens=256
)
return explanation
except Exception as e:
logger.error(f"Error explaining connection: {e}")
return f"Error: {str(e)}"

View File

@@ -0,0 +1,233 @@
"""
Export Agent - Generates formatted reports and exports
"""
import logging
from typing import List, Dict
from datetime import datetime
from agents.base_agent import BaseAgent
from models.document import Summary
logger = logging.getLogger(__name__)
class ExportAgent(BaseAgent):
"""Agent that exports summaries and reports in various formats"""
def __init__(self, llm_client=None, model: str = "llama3.2"):
"""
Initialize export agent
Args:
llm_client: Optional shared LLM client
model: Ollama model name
"""
super().__init__(name="ExportAgent", llm_client=llm_client, model=model)
logger.info(f"{self.name} initialized")
def process(self, content: Dict, format: str = "markdown") -> str:
"""
Export content in specified format
Args:
content: Content dictionary to export
format: Export format ('markdown', 'text', 'html')
Returns:
Formatted content string
"""
logger.info(f"{self.name} exporting as {format}")
if format == "markdown":
return self._export_markdown(content)
elif format == "text":
return self._export_text(content)
elif format == "html":
return self._export_html(content)
else:
return str(content)
def _export_markdown(self, content: Dict) -> str:
"""Export as Markdown"""
md = []
md.append(f"# Knowledge Report")
md.append(f"\n*Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}*\n")
if 'title' in content:
md.append(f"## {content['title']}\n")
if 'summary' in content:
md.append(f"### Summary\n")
md.append(f"{content['summary']}\n")
if 'key_points' in content and content['key_points']:
md.append(f"### Key Points\n")
for point in content['key_points']:
md.append(f"- {point}")
md.append("")
if 'sections' in content:
for section in content['sections']:
md.append(f"### {section['title']}\n")
md.append(f"{section['content']}\n")
if 'sources' in content and content['sources']:
md.append(f"### Sources\n")
for i, source in enumerate(content['sources'], 1):
md.append(f"{i}. {source}")
md.append("")
return "\n".join(md)
def _export_text(self, content: Dict) -> str:
"""Export as plain text"""
lines = []
lines.append("=" * 60)
lines.append("KNOWLEDGE REPORT")
lines.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
lines.append("=" * 60)
lines.append("")
if 'title' in content:
lines.append(content['title'])
lines.append("-" * len(content['title']))
lines.append("")
if 'summary' in content:
lines.append("SUMMARY:")
lines.append(content['summary'])
lines.append("")
if 'key_points' in content and content['key_points']:
lines.append("KEY POINTS:")
for i, point in enumerate(content['key_points'], 1):
lines.append(f" {i}. {point}")
lines.append("")
if 'sections' in content:
for section in content['sections']:
lines.append(section['title'].upper())
lines.append("-" * 40)
lines.append(section['content'])
lines.append("")
if 'sources' in content and content['sources']:
lines.append("SOURCES:")
for i, source in enumerate(content['sources'], 1):
lines.append(f" {i}. {source}")
lines.append("")
lines.append("=" * 60)
return "\n".join(lines)
def _export_html(self, content: Dict) -> str:
"""Export as HTML"""
html = []
html.append("<!DOCTYPE html>")
html.append("<html>")
html.append("<head>")
html.append(" <meta charset='utf-8'>")
html.append(" <title>Knowledge Report</title>")
html.append(" <style>")
html.append(" body { font-family: Arial, sans-serif; max-width: 800px; margin: 40px auto; padding: 20px; }")
html.append(" h1 { color: #333; border-bottom: 3px solid #007bff; padding-bottom: 10px; }")
html.append(" h2 { color: #555; margin-top: 30px; }")
html.append(" .meta { color: #888; font-style: italic; }")
html.append(" .key-points { background: #f8f9fa; padding: 15px; border-left: 4px solid #007bff; }")
html.append(" .source { color: #666; font-size: 0.9em; }")
html.append(" </style>")
html.append("</head>")
html.append("<body>")
html.append(" <h1>Knowledge Report</h1>")
html.append(f" <p class='meta'>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>")
if 'title' in content:
html.append(f" <h2>{content['title']}</h2>")
if 'summary' in content:
html.append(f" <h3>Summary</h3>")
html.append(f" <p>{content['summary']}</p>")
if 'key_points' in content and content['key_points']:
html.append(" <h3>Key Points</h3>")
html.append(" <div class='key-points'>")
html.append(" <ul>")
for point in content['key_points']:
html.append(f" <li>{point}</li>")
html.append(" </ul>")
html.append(" </div>")
if 'sections' in content:
for section in content['sections']:
html.append(f" <h3>{section['title']}</h3>")
html.append(f" <p>{section['content']}</p>")
if 'sources' in content and content['sources']:
html.append(" <h3>Sources</h3>")
html.append(" <ol class='source'>")
for source in content['sources']:
html.append(f" <li>{source}</li>")
html.append(" </ol>")
html.append("</body>")
html.append("</html>")
return "\n".join(html)
def create_study_guide(self, summaries: List[Summary]) -> str:
"""
Create a study guide from multiple summaries
Args:
summaries: List of Summary objects
Returns:
Formatted study guide
"""
logger.info(f"{self.name} creating study guide from {len(summaries)} summaries")
# Compile all content
all_summaries = "\n\n".join([
f"{s.document_name}:\n{s.summary_text}"
for s in summaries
])
all_key_points = []
for s in summaries:
all_key_points.extend(s.key_points)
# Use LLM to create cohesive study guide
system_prompt = """You create excellent study guides that synthesize information from multiple sources.
Create a well-organized study guide with clear sections, key concepts, and important points."""
user_prompt = f"""Create a comprehensive study guide based on these document summaries:
{all_summaries}
Create a well-structured study guide with:
1. An overview
2. Key concepts
3. Important details
4. Study tips
Study Guide:"""
study_guide = self.generate(
prompt=user_prompt,
system=system_prompt,
temperature=0.5,
max_tokens=2048
)
# Format as markdown
content = {
'title': 'Study Guide',
'sections': [
{'title': 'Overview', 'content': study_guide},
{'title': 'Key Points from All Documents', 'content': '\n'.join([f"- {p}" for p in all_key_points[:15]])}
],
'sources': [s.document_name for s in summaries]
}
return self._export_markdown(content)
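# --- Illustrative usage (not part of the original file) ---
# Assuming summaries from SummaryAgent.summarize_multiple(...):
#   exporter = ExportAgent()
#   guide_md = exporter.create_study_guide(summaries)
#   with open("study_guide.md", "w", encoding="utf-8") as f:
#       f.write(guide_md)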

View File

@@ -0,0 +1,157 @@
"""
Ingestion Agent - Processes and stores documents in the vector database
"""
import logging
from typing import Dict, List
import uuid
from datetime import datetime
from agents.base_agent import BaseAgent
from models.document import Document, DocumentChunk
from utils.document_parser import DocumentParser
from utils.embeddings import EmbeddingModel
import chromadb
logger = logging.getLogger(__name__)
class IngestionAgent(BaseAgent):
"""Agent responsible for ingesting and storing documents"""
def __init__(self, collection: chromadb.Collection,
embedding_model: EmbeddingModel,
llm_client=None, model: str = "llama3.2"):
"""
Initialize ingestion agent
Args:
collection: ChromaDB collection for storage
embedding_model: Model for generating embeddings
llm_client: Optional shared LLM client
model: Ollama model name
"""
super().__init__(name="IngestionAgent", llm_client=llm_client, model=model)
self.collection = collection
self.embedding_model = embedding_model
self.parser = DocumentParser(chunk_size=1000, chunk_overlap=200)
logger.info(f"{self.name} ready with ChromaDB collection")
def process(self, file_path: str) -> Document:
"""
Process and ingest a document
Args:
file_path: Path to the document file
Returns:
Document object with metadata
"""
logger.info(f"{self.name} processing: {file_path}")
# Parse the document
parsed = self.parser.parse_file(file_path)
# Generate document ID
doc_id = str(uuid.uuid4())
# Create document chunks
chunks = []
chunk_texts = []
chunk_ids = []
chunk_metadatas = []
for i, chunk_text in enumerate(parsed['chunks']):
chunk_id = f"{doc_id}_chunk_{i}"
chunk = DocumentChunk(
id=chunk_id,
document_id=doc_id,
content=chunk_text,
chunk_index=i,
metadata={
'filename': parsed['filename'],
'extension': parsed['extension'],
'total_chunks': len(parsed['chunks'])
}
)
chunks.append(chunk)
chunk_texts.append(chunk_text)
chunk_ids.append(chunk_id)
chunk_metadatas.append({
'document_id': doc_id,
'filename': parsed['filename'],
'chunk_index': i,
'extension': parsed['extension']
})
# Generate embeddings
logger.info(f"{self.name} generating embeddings for {len(chunks)} chunks")
embeddings = self.embedding_model.embed_documents(chunk_texts)
# Store in ChromaDB
logger.info(f"{self.name} storing in ChromaDB")
self.collection.add(
ids=chunk_ids,
documents=chunk_texts,
embeddings=embeddings,
metadatas=chunk_metadatas
)
# Create document object
document = Document(
id=doc_id,
filename=parsed['filename'],
filepath=parsed['filepath'],
content=parsed['text'],
chunks=chunks,
metadata={
'extension': parsed['extension'],
'num_chunks': len(chunks),
'total_chars': parsed['total_chars']
},
created_at=datetime.now()
)
logger.info(f"{self.name} successfully ingested: {document}")
return document
def get_statistics(self) -> Dict:
"""Get statistics about stored documents"""
try:
count = self.collection.count()
return {
'total_chunks': count,
'collection_name': self.collection.name
}
except Exception as e:
logger.error(f"Error getting statistics: {e}")
return {'total_chunks': 0, 'error': str(e)}
def delete_document(self, document_id: str) -> bool:
"""
Delete all chunks of a document
Args:
document_id: ID of document to delete
Returns:
True if successful
"""
try:
# Get all chunk IDs for this document
results = self.collection.get(
where={"document_id": document_id}
)
if results['ids']:
self.collection.delete(ids=results['ids'])
logger.info(f"{self.name} deleted document {document_id}")
return True
return False
except Exception as e:
logger.error(f"Error deleting document: {e}")
return False

View File

@@ -0,0 +1,156 @@
"""
Question Agent - Answers questions using RAG (Retrieval Augmented Generation)
"""
import logging
from typing import Any, Dict, List
from agents.base_agent import BaseAgent
from models.document import SearchResult, DocumentChunk
from utils.embeddings import EmbeddingModel
import chromadb
logger = logging.getLogger(__name__)
class QuestionAgent(BaseAgent):
"""Agent that answers questions using retrieved context"""
def __init__(self, collection: chromadb.Collection,
embedding_model: EmbeddingModel,
llm_client=None, model: str = "llama3.2"):
"""
Initialize question agent
Args:
collection: ChromaDB collection with documents
embedding_model: Model for query embeddings
llm_client: Optional shared LLM client
model: Ollama model name
"""
super().__init__(name="QuestionAgent", llm_client=llm_client, model=model)
self.collection = collection
self.embedding_model = embedding_model
self.top_k = 5 # Number of chunks to retrieve
logger.info(f"{self.name} initialized")
def retrieve(self, query: str, top_k: int = None) -> List[SearchResult]:
"""
Retrieve relevant document chunks for a query
Args:
query: Search query
top_k: Number of results to return (uses self.top_k if None)
Returns:
List of SearchResult objects
"""
if top_k is None:
top_k = self.top_k
logger.info(f"{self.name} retrieving top {top_k} chunks for query")
# Generate query embedding
query_embedding = self.embedding_model.embed_query(query)
# Search ChromaDB
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=top_k
)
# Convert to SearchResult objects
search_results = []
if results['ids'] and len(results['ids']) > 0:
for i in range(len(results['ids'][0])):
chunk = DocumentChunk(
id=results['ids'][0][i],
document_id=results['metadatas'][0][i].get('document_id', ''),
content=results['documents'][0][i],
chunk_index=results['metadatas'][0][i].get('chunk_index', 0),
metadata=results['metadatas'][0][i]
)
result = SearchResult(
chunk=chunk,
score=1.0 - results['distances'][0][i], # Convert distance to similarity
document_id=results['metadatas'][0][i].get('document_id', ''),
document_name=results['metadatas'][0][i].get('filename', 'Unknown')
)
search_results.append(result)
logger.info(f"{self.name} retrieved {len(search_results)} results")
return search_results
def process(self, question: str, top_k: int = None) -> Dict[str, Any]:
"""
Answer a question using RAG
Args:
question: User's question
top_k: Number of chunks to retrieve
Returns:
Dictionary with answer and sources
"""
logger.info(f"{self.name} processing question: {question[:100]}...")
# Retrieve relevant chunks
search_results = self.retrieve(question, top_k)
if not search_results:
return {
'answer': "I don't have any relevant information in my knowledge base to answer this question.",
'sources': [],
'context_used': ""
}
# Build context from retrieved chunks
context_parts = []
sources = []
for i, result in enumerate(search_results, 1):
context_parts.append(f"[Source {i}] {result.chunk.content}")
sources.append({
'document': result.document_name,
'score': result.score,
'preview': result.chunk.content[:150] + "..."
})
context = "\n\n".join(context_parts)
# Create prompt for LLM
system_prompt = """You are a helpful research assistant. Answer questions based on the provided context.
Be accurate and cite sources when possible. If the context doesn't contain enough information to answer fully, say so.
Keep your answer concise and relevant."""
user_prompt = f"""Context from my knowledge base:
{context}
Question: {question}
Answer based on the context above. If you reference specific information, mention which source(s) you're using."""
# Generate answer
answer = self.generate(
prompt=user_prompt,
system=system_prompt,
temperature=0.3, # Lower temperature for more factual responses
max_tokens=1024
)
logger.info(f"{self.name} generated answer ({len(answer)} chars)")
return {
'answer': answer,
'sources': sources,
'context_used': context,
'num_sources': len(sources)
}
def set_top_k(self, k: int):
"""Set the number of chunks to retrieve"""
self.top_k = k
logger.info(f"{self.name} top_k set to {k}")

View File

@@ -0,0 +1,181 @@
"""
Summary Agent - Creates summaries and extracts key points from documents
"""
import logging
from typing import Dict, List
from agents.base_agent import BaseAgent
from models.document import Summary
import chromadb
logger = logging.getLogger(__name__)
class SummaryAgent(BaseAgent):
"""Agent that creates summaries of documents"""
def __init__(self, collection: chromadb.Collection,
llm_client=None, model: str = "llama3.2"):
"""
Initialize summary agent
Args:
collection: ChromaDB collection with documents
llm_client: Optional shared LLM client
model: Ollama model name
"""
super().__init__(name="SummaryAgent", llm_client=llm_client, model=model)
self.collection = collection
logger.info(f"{self.name} initialized")
def process(self, document_id: str = None, document_text: str = None,
document_name: str = "Unknown") -> Summary:
"""
Create a summary of a document
Args:
document_id: ID of document in ChromaDB (retrieves chunks if provided)
document_text: Full document text (used if document_id not provided)
document_name: Name of the document
Returns:
Summary object
"""
logger.info(f"{self.name} creating summary for: {document_name}")
# Get document text
if document_id:
text = self._get_document_text(document_id)
if not text:
return Summary(
document_id=document_id,
document_name=document_name,
summary_text="Error: Could not retrieve document",
key_points=[]
)
elif document_text:
text = document_text
else:
return Summary(
document_id="",
document_name=document_name,
summary_text="Error: No document provided",
key_points=[]
)
# Truncate if too long (to fit in context)
max_chars = 8000
if len(text) > max_chars:
logger.warning(f"{self.name} truncating document from {len(text)} to {max_chars} chars")
text = text[:max_chars] + "\n\n[Document truncated...]"
# Generate summary
summary_text = self._generate_summary(text)
# Extract key points
key_points = self._extract_key_points(text)
summary = Summary(
document_id=document_id or "",
document_name=document_name,
summary_text=summary_text,
key_points=key_points
)
logger.info(f"{self.name} completed summary with {len(key_points)} key points")
return summary
def _get_document_text(self, document_id: str) -> str:
"""Retrieve and reconstruct document text from chunks"""
try:
results = self.collection.get(
where={"document_id": document_id}
)
if not results['ids']:
return ""
# Sort by chunk index
chunks_data = list(zip(
results['documents'],
results['metadatas']
))
chunks_data.sort(key=lambda x: x[1].get('chunk_index', 0))
# Combine chunks
text = "\n\n".join([chunk[0] for chunk in chunks_data])
return text
except Exception as e:
logger.error(f"Error retrieving document: {e}")
return ""
def _generate_summary(self, text: str) -> str:
"""Generate a concise summary of the text"""
system_prompt = """You are an expert at creating concise, informative summaries.
Your summaries capture the main ideas and key information in clear, accessible language.
Keep summaries to 3-5 sentences unless the document is very long."""
user_prompt = f"""Please create a concise summary of the following document:
{text}
Summary:"""
summary = self.generate(
prompt=user_prompt,
system=system_prompt,
temperature=0.3,
max_tokens=512
)
return summary.strip()
def _extract_key_points(self, text: str) -> List[str]:
"""Extract key points from the text"""
system_prompt = """You extract the most important key points from documents.
List 3-7 key points as concise bullet points. Each point should be a complete, standalone statement."""
user_prompt = f"""Please extract the key points from the following document:
{text}
List the key points (one per line, without bullets or numbers):"""
response = self.generate(
prompt=user_prompt,
system=system_prompt,
temperature=0.3,
max_tokens=512
)
# Parse the response into a list
key_points = []
for line in response.split('\n'):
line = line.strip()
# Remove common list markers
line = line.lstrip('•-*0123456789.)')
line = line.strip()
if line and len(line) > 10: # Filter out very short lines
key_points.append(line)
return key_points[:7] # Limit to 7 points
def summarize_multiple(self, document_ids: List[str]) -> List[Summary]:
"""
Create summaries for multiple documents
Args:
document_ids: List of document IDs
Returns:
List of Summary objects
"""
summaries = []
for doc_id in document_ids:
summary = self.process(document_id=doc_id)
summaries.append(summary)
return summaries

View File

@@ -0,0 +1,846 @@
"""
KnowledgeHub - Personal Knowledge Management & Research Assistant
Main Gradio Application
"""
import os
import logging
import json
import gradio as gr
from pathlib import Path
import chromadb
from datetime import datetime
# Setup logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Import utilities and agents
from utils import OllamaClient, EmbeddingModel, DocumentParser
from agents import (
IngestionAgent, QuestionAgent, SummaryAgent,
ConnectionAgent, ExportAgent
)
from models import Document
# Constants
VECTORSTORE_PATH = "./vectorstore"
TEMP_UPLOAD_PATH = "./temp_uploads"
DOCUMENTS_METADATA_PATH = "./vectorstore/documents_metadata.json"
# Ensure directories exist
os.makedirs(VECTORSTORE_PATH, exist_ok=True)
os.makedirs(TEMP_UPLOAD_PATH, exist_ok=True)
class KnowledgeHub:
"""Main application class managing all agents"""
def __init__(self):
logger.info("Initializing KnowledgeHub...")
# Initialize ChromaDB
self.client = chromadb.PersistentClient(path=VECTORSTORE_PATH)
self.collection = self.client.get_or_create_collection(
name="knowledge_base",
metadata={"description": "Personal knowledge management collection"}
)
# Initialize embedding model
self.embedding_model = EmbeddingModel()
# Initialize shared LLM client
self.llm_client = OllamaClient(model="llama3.2")
# Check Ollama connection
if not self.llm_client.check_connection():
logger.warning("⚠️ Cannot connect to Ollama. Please ensure Ollama is running.")
logger.warning("Start Ollama with: ollama serve")
else:
logger.info("✓ Connected to Ollama")
# Initialize agents
self.ingestion_agent = IngestionAgent(
collection=self.collection,
embedding_model=self.embedding_model,
llm_client=self.llm_client
)
self.question_agent = QuestionAgent(
collection=self.collection,
embedding_model=self.embedding_model,
llm_client=self.llm_client
)
self.summary_agent = SummaryAgent(
collection=self.collection,
llm_client=self.llm_client
)
self.connection_agent = ConnectionAgent(
collection=self.collection,
embedding_model=self.embedding_model,
llm_client=self.llm_client
)
self.export_agent = ExportAgent(
llm_client=self.llm_client
)
# Track uploaded documents
self.documents = {}
# Load existing documents from metadata file
self._load_documents_metadata()
logger.info("✓ KnowledgeHub initialized successfully")
def _save_documents_metadata(self):
"""Save document metadata to JSON file"""
try:
metadata = {
doc_id: doc.to_dict()
for doc_id, doc in self.documents.items()
}
with open(DOCUMENTS_METADATA_PATH, 'w') as f:
json.dump(metadata, f, indent=2)
logger.debug(f"Saved metadata for {len(metadata)} documents")
except Exception as e:
logger.error(f"Error saving document metadata: {e}")
def _load_documents_metadata(self):
"""Load document metadata from JSON file"""
try:
if os.path.exists(DOCUMENTS_METADATA_PATH):
with open(DOCUMENTS_METADATA_PATH, 'r') as f:
metadata = json.load(f)
# Reconstruct Document objects (simplified - without chunks)
for doc_id, doc_data in metadata.items():
# Create a minimal Document object for UI purposes
# Full chunks are still in ChromaDB
doc = Document(
id=doc_id,
filename=doc_data['filename'],
filepath=doc_data.get('filepath', ''),
content=doc_data.get('content', ''),
chunks=[], # Chunks are in ChromaDB
metadata=doc_data.get('metadata', {}),
created_at=datetime.fromisoformat(doc_data['created_at'])
)
self.documents[doc_id] = doc
logger.info(f"✓ Loaded {len(self.documents)} existing documents from storage")
else:
logger.info("No existing documents found (starting fresh)")
except Exception as e:
logger.error(f"Error loading document metadata: {e}")
logger.info("Starting with empty document list")
def upload_document(self, files, progress=gr.Progress()):
"""Handle document upload - supports single or multiple files with progress tracking"""
if files is None or len(files) == 0:
return "⚠️ Please select file(s) to upload", "", []
# Convert single file to list for consistent handling
if not isinstance(files, list):
files = [files]
results = []
successful = 0
failed = 0
total_chunks = 0
# Initialize progress tracking
progress(0, desc="Starting upload...")
for file_idx, file in enumerate(files, 1):
# Update progress
progress_pct = (file_idx - 1) / len(files)
progress(progress_pct, desc=f"Processing {file_idx}/{len(files)}: {Path(file.name).name}")
try:
logger.info(f"Processing file {file_idx}/{len(files)}: {file.name}")
# Save uploaded file temporarily
temp_path = os.path.join(TEMP_UPLOAD_PATH, Path(file.name).name)
# Copy file content
with open(temp_path, 'wb') as f:
    if hasattr(file, 'read'):
        f.write(file.read())
    else:
        # Read via the path to avoid leaking an unclosed file handle
        with open(file.name, 'rb') as src:
            f.write(src.read())
# Process document
document = self.ingestion_agent.process(temp_path)
# Store document reference
self.documents[document.id] = document
# Track stats
successful += 1
total_chunks += document.num_chunks
# Add to results
results.append({
'status': '✅',
'filename': document.filename,
'chunks': document.num_chunks,
'size': f"{document.total_chars:,} chars"
})
# Clean up temp file
os.remove(temp_path)
except Exception as e:
logger.error(f"Error processing {file.name}: {e}")
failed += 1
results.append({
'status': '❌',
'filename': Path(file.name).name,
'chunks': 0,
'size': f"Error: {str(e)[:50]}"
})
# Final progress update
progress(1.0, desc="Upload complete!")
# Save metadata once after all uploads
if successful > 0:
self._save_documents_metadata()
# Create summary
summary = f"""## Upload Complete! 🎉
**Total Files:** {len(files)}
**✅ Successful:** {successful}
**❌ Failed:** {failed}
**Total Chunks Created:** {total_chunks:,}
{f"⚠️ **{failed} file(s) failed** - Check results table below for details" if failed > 0 else "All files processed successfully!"}
"""
# Create detailed results table
results_table = [[r['status'], r['filename'], r['chunks'], r['size']] for r in results]
# Create preview of first successful document
preview = ""
for doc in self.documents.values():
if doc.filename in [r['filename'] for r in results if r['status'] == '✅']:
preview = doc.content[:500] + "..." if len(doc.content) > 500 else doc.content
break
return summary, preview, results_table
def ask_question(self, question, top_k, progress=gr.Progress()):
"""Handle question answering with progress tracking"""
if not question.strip():
return "⚠️ Please enter a question", [], ""
try:
# Initial status
progress(0, desc="Processing your question...")
status = "🔄 **Searching knowledge base...**\n\nRetrieving relevant documents..."
logger.info(f"Answering question: {question[:100]}")
# Update progress
progress(0.3, desc="Finding relevant documents...")
result = self.question_agent.process(question, top_k=top_k)
# Update progress
progress(0.7, desc="Generating answer with LLM...")
# Format answer
answer = f"""### Answer\n\n{result['answer']}\n\n"""
if result['sources']:
answer += f"**Sources:** {result['num_sources']} documents referenced\n\n"
# Format sources for display
sources_data = []
for i, source in enumerate(result['sources'], 1):
sources_data.append([
i,
source['document'],
f"{source['score']:.2%}",
source['preview']
])
progress(1.0, desc="Answer ready!")
return answer, sources_data, "✅ Answer generated successfully!"
except Exception as e:
logger.error(f"Error answering question: {e}")
return f"❌ Error: {str(e)}", [], f"❌ Error: {str(e)}"
def create_summary(self, doc_selector, progress=gr.Progress()):
"""Create document summary with progress tracking"""
if not doc_selector:
return "⚠️ Please select a document to summarize", ""
try:
# Initial status
progress(0, desc="Preparing to summarize...")
logger.debug(f"doc_selector: {doc_selector}")
doc_id = doc_selector.split(" -|- ")[1]
document = self.documents.get(doc_id)
if not document:
return "", "❌ Document not found"
# Update status
status_msg = f"🔄 **Generating summary for:** {document.filename}\n\nPlease wait, this may take 10-20 seconds..."
progress(0.3, desc=f"Analyzing {document.filename}...")
logger.info(f"Creating summary for: {document.filename}")
# Generate summary
summary = self.summary_agent.process(
document_id=doc_id,
document_name=document.filename
)
progress(1.0, desc="Summary complete!")
# Format result
result = f"""## Summary of {summary.document_name}\n\n{summary.summary_text}\n\n"""
if summary.key_points:
result += "### Key Points\n\n"
for point in summary.key_points:
result += f"- {point}\n"
return result, "✅ Summary generated successfully!"
except Exception as e:
logger.error(f"Error creating summary: {e}")
return "", f"❌ Error: {str(e)}"
def find_connections(self, doc_selector, top_k, progress=gr.Progress()):
"""Find related documents with progress tracking"""
if not doc_selector:
return "⚠️ Please select a document", [], ""
try:
progress(0, desc="Preparing to find connections...")
doc_id = doc_selector.split(" -|- ")[1]
document = self.documents.get(doc_id)
if not document:
return "❌ Document not found", [], "❌ Document not found"
status = f"🔄 **Finding documents related to:** {document.filename}\n\nSearching knowledge base..."
progress(0.3, desc=f"Analyzing {document.filename}...")
logger.info(f"Finding connections for: {document.filename}")
result = self.connection_agent.process(document_id=doc_id, top_k=top_k)
progress(0.8, desc="Calculating similarity scores...")
if 'error' in result:
return f"❌ Error: {result['error']}", [], f"❌ Error: {result['error']}"
message = f"""## Related Documents\n\n**Source:** {result['source_document']}\n\n"""
message += f"**Found {result['num_related']} related documents:**\n\n"""
# Format for table
table_data = []
for i, rel in enumerate(result['related'], 1):
table_data.append([
i,
rel['document_name'],
f"{rel['similarity']:.2%}",
rel['preview']
])
progress(1.0, desc="Connections found!")
return message, table_data, "✅ Related documents found!"
except Exception as e:
logger.error(f"Error finding connections: {e}")
return f"❌ Error: {str(e)}", [], f"❌ Error: {str(e)}"
def export_knowledge(self, format_choice):
"""Export knowledge base"""
try:
logger.info(f"Exporting as {format_choice}")
# Get statistics
stats = self.ingestion_agent.get_statistics()
# Create export content
content = {
'title': 'Knowledge Base Export',
'summary': f"Total documents in knowledge base: {len(self.documents)}",
'sections': [
{
'title': 'Documents',
'content': '\n'.join([f"- {doc.filename}" for doc in self.documents.values()])
},
{
'title': 'Statistics',
'content': f"Total chunks stored: {stats['total_chunks']}"
}
]
}
# Export
if format_choice == "Markdown":
output = self.export_agent.process(content, format="markdown")
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
elif format_choice == "HTML":
output = self.export_agent.process(content, format="html")
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.html"
else: # Text
output = self.export_agent.process(content, format="text")
filename = f"knowledge_export_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
# Save file
export_path = os.path.join(TEMP_UPLOAD_PATH, filename)
with open(export_path, 'w', encoding='utf-8') as f:
f.write(output)
return f"✅ Exported as {format_choice}", export_path
except Exception as e:
logger.error(f"Error exporting: {e}")
return f"❌ Error: {str(e)}", None
def get_statistics(self):
"""Get knowledge base statistics"""
try:
stats = self.ingestion_agent.get_statistics()
total_docs = len(self.documents)
total_chunks = stats.get('total_chunks', 0)
total_chars = sum(doc.total_chars for doc in self.documents.values())
# Check if data is persisted
persistence_status = "✅ Enabled" if os.path.exists(DOCUMENTS_METADATA_PATH) else "⚠️ Not configured"
vectorstore_size = self._get_directory_size(VECTORSTORE_PATH)
stats_text = f"""## Knowledge Base Statistics
**Persistence Status:** {persistence_status}
**Total Documents:** {total_docs}
**Total Chunks:** {total_chunks:,}
**Total Characters:** {total_chars:,}
**Vector Store Size:** {vectorstore_size}
### Storage Locations
- **Vector DB:** `{VECTORSTORE_PATH}/`
- **Metadata:** `{DOCUMENTS_METADATA_PATH}`
**📝 Note:** Your data persists across app restarts!
**Recent Documents:**
"""
# Append the recent-documents list once
if self.documents:
    stats_text += "\n".join([f"- {doc.filename} ({doc.num_chunks} chunks, added {doc.created_at.strftime('%Y-%m-%d')})"
                             for doc in list(self.documents.values())[-10:]])
else:
    stats_text += "\n*No documents yet. Upload some to get started!*"
return stats_text
except Exception as e:
return f"❌ Error: {str(e)}"
def _get_directory_size(self, path):
"""Calculate directory size"""
try:
total_size = 0
for dirpath, dirnames, filenames in os.walk(path):
for filename in filenames:
filepath = os.path.join(dirpath, filename)
if os.path.exists(filepath):
total_size += os.path.getsize(filepath)
# Convert to human readable
for unit in ['B', 'KB', 'MB', 'GB']:
if total_size < 1024.0:
return f"{total_size:.1f} {unit}"
total_size /= 1024.0
return f"{total_size:.1f} TB"
except:
return "Unknown"
def get_document_list(self):
"""Get list of documents for dropdown"""
new_choices = [f"{doc.filename} -|- {doc.id}" for doc in self.documents.values()]
return gr.update(choices=new_choices, value=None)
def delete_document(self, doc_selector):
"""Delete a document from the knowledge base"""
if not doc_selector:
return "⚠️ Please select a document to delete", self.get_document_list()
try:
doc_id = doc_selector.split(" -|- ")[1]
document = self.documents.get(doc_id)
if not document:
return "❌ Document not found", self.get_document_list()
# Delete from ChromaDB
success = self.ingestion_agent.delete_document(doc_id)
if success:
# Remove from documents dict
filename = document.filename
del self.documents[doc_id]
# Save updated metadata
self._save_documents_metadata()
return f"✅ Deleted: {filename}", self.get_document_list()
else:
return f"❌ Error deleting document", self.get_document_list()
except Exception as e:
logger.error(f"Error deleting document: {e}")
return f"❌ Error: {str(e)}", self.get_document_list()
def clear_all_documents(self):
"""Clear entire knowledge base"""
try:
# Delete collection
self.client.delete_collection("knowledge_base")
# Recreate empty collection
self.collection = self.client.create_collection(
name="knowledge_base",
metadata={"description": "Personal knowledge management collection"}
)
# Update agents with new collection
self.ingestion_agent.collection = self.collection
self.question_agent.collection = self.collection
self.summary_agent.collection = self.collection
self.connection_agent.collection = self.collection
# Clear documents
self.documents = {}
self._save_documents_metadata()
return "✅ All documents cleared from knowledge base"
except Exception as e:
logger.error(f"Error clearing database: {e}")
return f"❌ Error: {str(e)}"
def create_ui():
"""Create Gradio interface"""
# Initialize app
app = KnowledgeHub()
# Custom CSS
custom_css = """
.main-header {
text-align: center;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
padding: 30px;
border-radius: 10px;
margin-bottom: 20px;
}
.stat-box {
background: #f8f9fa;
padding: 15px;
border-radius: 8px;
border-left: 4px solid #667eea;
}
"""
with gr.Blocks(title="KnowledgeHub", css=custom_css, theme=gr.themes.Soft()) as interface:
# Header
gr.HTML("""
<div class="main-header">
<h1>🧠 KnowledgeHub</h1>
<p>Personal Knowledge Management & Research Assistant</p>
<p style="font-size: 14px; opacity: 0.9;">
Powered by Ollama (Llama 3.2) • Fully Local & Private
</p>
</div>
""")
# Main tabs
with gr.Tabs():
# Tab 1: Upload Documents
with gr.Tab("📤 Upload Documents"):
gr.Markdown("### Upload your documents to build your knowledge base")
gr.Markdown("*Supported formats: PDF, DOCX, TXT, MD, HTML, PY*")
gr.Markdown("*💡 Tip: You can select multiple files at once!*")
with gr.Row():
with gr.Column():
file_input = gr.File(
label="Select Document(s)",
file_types=[".pdf", ".docx", ".txt", ".md", ".html", ".py"],
file_count="multiple" # Enable multiple file selection
)
upload_btn = gr.Button("📤 Upload & Process", variant="primary")
with gr.Column():
upload_status = gr.Markdown("Ready to upload documents")
# Results table for batch uploads
with gr.Row():
upload_results = gr.Dataframe(
headers=["Status", "Filename", "Chunks", "Size"],
label="Upload Results",
wrap=True,
visible=True
)
with gr.Row():
document_preview = gr.Textbox(
label="Document Preview (First Uploaded)",
lines=10,
max_lines=15
)
upload_btn.click(
fn=app.upload_document,
inputs=[file_input],
outputs=[upload_status, document_preview, upload_results]
)
# Tab 2: Ask Questions
with gr.Tab("❓ Ask Questions"):
gr.Markdown("### Ask questions about your documents")
gr.Markdown("*Uses RAG (Retrieval Augmented Generation) to answer based on your knowledge base*")
with gr.Row():
with gr.Column(scale=3):
question_input = gr.Textbox(
label="Your Question",
placeholder="What would you like to know?",
lines=3
)
with gr.Column(scale=1):
top_k_slider = gr.Slider(
minimum=1,
maximum=10,
value=5,
step=1,
label="Number of sources"
)
ask_btn = gr.Button("🔍 Ask", variant="primary")
qa_status = gr.Markdown("Ready to answer questions")
answer_output = gr.Markdown(label="Answer")
sources_table = gr.Dataframe(
headers=["#", "Document", "Relevance", "Preview"],
label="Sources",
wrap=True
)
ask_btn.click(
fn=app.ask_question,
inputs=[question_input, top_k_slider],
outputs=[answer_output, sources_table, qa_status]
)
# Tab 3: Summarize
with gr.Tab("📝 Summarize"):
gr.Markdown("### Generate summaries and extract key points")
with gr.Row():
with gr.Column():
doc_selector = gr.Dropdown(
choices=[],
label="Select Document",
info="Choose a document to summarize",
allow_custom_value=True
)
refresh_btn = gr.Button("🔄 Refresh List")
summarize_btn = gr.Button("📝 Generate Summary", variant="primary")
summary_status = gr.Markdown("Ready to generate summaries")
with gr.Column(scale=2):
summary_output = gr.Markdown(label="Summary")
summarize_btn.click(
fn=app.create_summary,
inputs=[doc_selector],
outputs=[summary_output, summary_status]
)
refresh_btn.click(
fn=app.get_document_list,
outputs=[doc_selector]
)
# Tab 4: Find Connections
with gr.Tab("🔗 Find Connections"):
gr.Markdown("### Discover relationships between documents")
with gr.Row():
with gr.Column():
conn_doc_selector = gr.Dropdown(
choices=[],
label="Select Document",
info="Find documents related to this one",
allow_custom_value=True
)
conn_top_k = gr.Slider(
minimum=1,
maximum=10,
value=5,
step=1,
label="Number of related documents"
)
refresh_conn_btn = gr.Button("🔄 Refresh List")
find_btn = gr.Button("🔗 Find Connections", variant="primary")
connection_status = gr.Markdown("Ready to find connections")
connection_output = gr.Markdown(label="Connections")
connections_table = gr.Dataframe(
headers=["#", "Document", "Similarity", "Preview"],
label="Related Documents",
wrap=True
)
find_btn.click(
fn=app.find_connections,
inputs=[conn_doc_selector, conn_top_k],
outputs=[connection_output, connections_table, connection_status]
)
refresh_conn_btn.click(
fn=app.get_document_list,
outputs=[conn_doc_selector]
)
# Tab 5: Export
with gr.Tab("💾 Export"):
gr.Markdown("### Export your knowledge base")
with gr.Row():
with gr.Column():
format_choice = gr.Radio(
choices=["Markdown", "HTML", "Text"],
value="Markdown",
label="Export Format"
)
export_btn = gr.Button("💾 Export", variant="primary")
with gr.Column():
export_status = gr.Markdown("Ready to export")
export_file = gr.File(label="Download Export")
export_btn.click(
fn=app.export_knowledge,
inputs=[format_choice],
outputs=[export_status, export_file]
)
# Tab 6: Manage Documents
with gr.Tab("🗂️ Manage Documents"):
gr.Markdown("### Manage your document library")
with gr.Row():
with gr.Column():
gr.Markdown("#### Delete Document")
delete_doc_selector = gr.Dropdown(
choices=[],
label="Select Document to Delete",
info="Choose a document to remove from knowledge base"
)
with gr.Row():
refresh_delete_btn = gr.Button("🔄 Refresh List")
delete_btn = gr.Button("🗑️ Delete Document", variant="stop")
delete_status = gr.Markdown("")
with gr.Column():
gr.Markdown("#### Clear All Documents")
gr.Markdown("⚠️ **Warning:** This will delete your entire knowledge base!")
clear_confirm = gr.Textbox(
label="Type 'DELETE ALL' to confirm",
placeholder="DELETE ALL"
)
clear_all_btn = gr.Button("🗑️ Clear All Documents", variant="stop")
clear_status = gr.Markdown("")
def confirm_and_clear(confirm_text):
if confirm_text.strip() == "DELETE ALL":
return app.clear_all_documents()
else:
return "⚠️ Please type 'DELETE ALL' to confirm"
delete_btn.click(
fn=app.delete_document,
inputs=[delete_doc_selector],
outputs=[delete_status, delete_doc_selector]
)
refresh_delete_btn.click(
fn=app.get_document_list,
outputs=[delete_doc_selector]
)
clear_all_btn.click(
fn=confirm_and_clear,
inputs=[clear_confirm],
outputs=[clear_status]
)
# Tab 7: Statistics
with gr.Tab("📊 Statistics"):
gr.Markdown("### Knowledge Base Overview")
stats_output = gr.Markdown()
stats_btn = gr.Button("🔄 Refresh Statistics", variant="primary")
stats_btn.click(
fn=app.get_statistics,
outputs=[stats_output]
)
        # Auto-load stats when the page loads (Blocks.load fires on page load, not tab open)
interface.load(
fn=app.get_statistics,
outputs=[stats_output]
)
# Footer
gr.HTML("""
<div style="text-align: center; margin-top: 30px; padding: 20px; color: #666;">
<p>🔒 All processing happens locally on your machine • Your data never leaves your computer</p>
<p style="font-size: 12px;">Powered by Ollama, ChromaDB, and Sentence Transformers</p>
</div>
""")
return interface
if __name__ == "__main__":
logger.info("Starting KnowledgeHub...")
# Create and launch interface
interface = create_ui()
interface.launch(
server_name="127.0.0.1",
server_port=7860,
share=False,
inbrowser=True
)

View File

@@ -0,0 +1,13 @@
"""
models - data classes for documents, search results, summaries, and the knowledge graph
"""
from .knowledge_graph import KnowledgeGraph
from .document import Document, DocumentChunk, SearchResult, Summary
__all__ = [
'KnowledgeGraph',
'Document',
'DocumentChunk',
'SearchResult',
'Summary'
]

View File

@@ -0,0 +1,82 @@
"""
Document data models
"""
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from datetime import datetime
@dataclass
class DocumentChunk:
"""Represents a chunk of a document"""
id: str
document_id: str
content: str
chunk_index: int
metadata: Dict = field(default_factory=dict)
def __str__(self):
preview = self.content[:100] + "..." if len(self.content) > 100 else self.content
return f"Chunk {self.chunk_index}: {preview}"
@dataclass
class Document:
"""Represents a complete document"""
id: str
filename: str
filepath: str
content: str
chunks: List[DocumentChunk]
metadata: Dict = field(default_factory=dict)
created_at: datetime = field(default_factory=datetime.now)
@property
def num_chunks(self) -> int:
return len(self.chunks)
@property
def total_chars(self) -> int:
return len(self.content)
@property
def extension(self) -> str:
return self.metadata.get('extension', '')
def __str__(self):
return f"Document: {self.filename} ({self.num_chunks} chunks, {self.total_chars} chars)"
def to_dict(self) -> Dict:
"""Convert to dictionary for storage"""
return {
'id': self.id,
'filename': self.filename,
'filepath': self.filepath,
'content': self.content[:500] + '...' if len(self.content) > 500 else self.content,
'num_chunks': self.num_chunks,
'total_chars': self.total_chars,
'extension': self.extension,
'created_at': self.created_at.isoformat(),
'metadata': self.metadata
}
@dataclass
class SearchResult:
"""Represents a search result from the vector database"""
chunk: DocumentChunk
score: float
document_id: str
document_name: str
def __str__(self):
return f"{self.document_name} (score: {self.score:.2f})"
@dataclass
class Summary:
"""Represents a document summary"""
document_id: str
document_name: str
summary_text: str
key_points: List[str] = field(default_factory=list)
created_at: datetime = field(default_factory=datetime.now)
def __str__(self):
return f"Summary of {self.document_name}: {self.summary_text[:100]}..."

View File

@@ -0,0 +1,110 @@
"""
Knowledge Graph data models
"""
from dataclasses import dataclass, field
from typing import List, Dict, Set
from datetime import datetime
@dataclass
class KnowledgeNode:
"""Represents a concept or entity in the knowledge graph"""
id: str
name: str
node_type: str # 'document', 'concept', 'entity', 'topic'
description: str = ""
metadata: Dict = field(default_factory=dict)
created_at: datetime = field(default_factory=datetime.now)
def __str__(self):
return f"{self.node_type.capitalize()}: {self.name}"
@dataclass
class KnowledgeEdge:
"""Represents a relationship between nodes"""
source_id: str
target_id: str
relationship: str # 'related_to', 'cites', 'contains', 'similar_to'
weight: float = 1.0
metadata: Dict = field(default_factory=dict)
def __str__(self):
return f"{self.source_id} --[{self.relationship}]--> {self.target_id}"
@dataclass
class KnowledgeGraph:
"""Represents the complete knowledge graph"""
nodes: Dict[str, KnowledgeNode] = field(default_factory=dict)
edges: List[KnowledgeEdge] = field(default_factory=list)
def add_node(self, node: KnowledgeNode):
"""Add a node to the graph"""
self.nodes[node.id] = node
def add_edge(self, edge: KnowledgeEdge):
"""Add an edge to the graph"""
if edge.source_id in self.nodes and edge.target_id in self.nodes:
self.edges.append(edge)
def get_neighbors(self, node_id: str) -> List[str]:
"""Get all nodes connected to a given node"""
neighbors = set()
for edge in self.edges:
if edge.source_id == node_id:
neighbors.add(edge.target_id)
elif edge.target_id == node_id:
neighbors.add(edge.source_id)
return list(neighbors)
def get_related_documents(self, node_id: str, max_depth: int = 2) -> Set[str]:
"""Get all documents related to a node within max_depth hops"""
related = set()
visited = set()
queue = [(node_id, 0)]
while queue:
current_id, depth = queue.pop(0)
if current_id in visited or depth > max_depth:
continue
visited.add(current_id)
# If this is a document node, add it
if current_id in self.nodes and self.nodes[current_id].node_type == 'document':
related.add(current_id)
# Add neighbors to queue
if depth < max_depth:
for neighbor_id in self.get_neighbors(current_id):
if neighbor_id not in visited:
queue.append((neighbor_id, depth + 1))
return related
def to_networkx(self):
"""Convert to NetworkX graph for visualization"""
try:
import networkx as nx
G = nx.Graph()
# Add nodes
for node_id, node in self.nodes.items():
G.add_node(node_id,
name=node.name,
type=node.node_type,
description=node.description)
# Add edges
for edge in self.edges:
G.add_edge(edge.source_id, edge.target_id,
relationship=edge.relationship,
weight=edge.weight)
return G
except ImportError:
return None
def __str__(self):
return f"KnowledgeGraph: {len(self.nodes)} nodes, {len(self.edges)} edges"

View File

@@ -0,0 +1,26 @@
# Core Dependencies
gradio>=4.0.0
chromadb>=0.4.0
sentence-transformers>=2.2.0
python-dotenv>=1.0.0
# Document Processing
pypdf>=3.0.0
python-docx>=1.0.0
markdown>=3.4.0
beautifulsoup4>=4.12.0
# Data Processing
numpy>=1.24.0
pandas>=2.0.0
tqdm>=4.65.0
# Visualization
plotly>=5.14.0
networkx>=3.0
# Ollama Client
requests>=2.31.0
# Optional but useful
scikit-learn>=1.3.0

View File

@@ -0,0 +1,71 @@
@echo off
REM KnowledgeHub Startup Script for Windows
echo 🧠 Starting KnowledgeHub...
echo.
REM Check if Ollama is installed
where ollama >nul 2>nul
if %errorlevel% neq 0 (
echo ❌ Ollama is not installed or not in PATH
echo Please install Ollama from https://ollama.com/download
pause
exit /b 1
)
REM Check Python
where python >nul 2>nul
if %errorlevel% neq 0 (
echo ❌ Python is not installed or not in PATH
echo Please install Python 3.8+ from https://www.python.org/downloads/
pause
exit /b 1
)
echo ✅ Prerequisites found
echo.
REM Check if Ollama service is running
tasklist /FI "IMAGENAME eq ollama.exe" 2>NUL | find /I /N "ollama.exe">NUL
if "%ERRORLEVEL%"=="1" (
echo ⚠️ Ollama is not running. Please start Ollama first.
echo You can start it from the Start menu or by running: ollama serve
pause
exit /b 1
)
echo ✅ Ollama is running
echo.
REM Check if model exists
ollama list | find "llama3.2" >nul
if %errorlevel% neq 0 (
echo 📥 Llama 3.2 model not found. Pulling model...
echo This may take a few minutes on first run...
ollama pull llama3.2
)
echo ✅ Model ready
echo.
REM Install dependencies
echo 🔍 Checking dependencies...
python -c "import gradio" 2>nul
if %errorlevel% neq 0 (
echo 📦 Installing dependencies...
pip install -r requirements.txt
)
echo ✅ Dependencies ready
echo.
REM Launch application
echo 🚀 Launching KnowledgeHub...
echo The application will open in your browser at http://127.0.0.1:7860
echo.
echo Press Ctrl+C to stop the application
echo.
python app.py
pause

View File

@@ -0,0 +1,42 @@
#!/bin/bash
# KnowledgeHub Startup Script
echo "🧠 Starting KnowledgeHub..."
echo ""
# Check if Ollama is running
if ! pgrep -x "ollama" > /dev/null; then
echo "⚠️ Ollama is not running. Starting Ollama..."
ollama serve &
sleep 3
fi
# Check if llama3.2 model exists
if ! ollama list | grep -q "llama3.2"; then
echo "📥 Llama 3.2 model not found. Pulling model..."
echo "This may take a few minutes on first run..."
ollama pull llama3.2
fi
echo "✅ Ollama is ready"
echo ""
# Check Python dependencies
echo "🔍 Checking dependencies..."
if ! python -c "import gradio" 2>/dev/null; then
echo "📦 Installing dependencies..."
pip install -r requirements.txt
fi
echo "✅ Dependencies ready"
echo ""
# Launch the application
echo "🚀 Launching KnowledgeHub..."
echo "The application will open in your browser at http://127.0.0.1:7860"
echo ""
echo "Press Ctrl+C to stop the application"
echo ""
python app.py

View File

@@ -0,0 +1,12 @@
"""
utils - document parsing, embedding, and Ollama client helpers
"""
from .document_parser import DocumentParser
from .embeddings import EmbeddingModel
from .ollama_client import OllamaClient
__all__ = [
'DocumentParser',
'EmbeddingModel',
'OllamaClient'
]

View File

@@ -0,0 +1,218 @@
"""
Document Parser - Extract text from various document formats
"""
import os
from typing import List, Dict, Optional
import logging
from pathlib import Path
logger = logging.getLogger(__name__)
class DocumentParser:
"""Parse various document formats into text chunks"""
SUPPORTED_FORMATS = ['.pdf', '.docx', '.txt', '.md', '.html', '.py']
def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
"""
Initialize document parser
Args:
chunk_size: Maximum characters per chunk
chunk_overlap: Overlap between chunks for context preservation
"""
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
def parse_file(self, file_path: str) -> Dict:
"""
Parse a file and return structured document data
Args:
file_path: Path to the file
Returns:
Dictionary with document metadata and chunks
"""
path = Path(file_path)
if not path.exists():
raise FileNotFoundError(f"File not found: {file_path}")
extension = path.suffix.lower()
if extension not in self.SUPPORTED_FORMATS:
raise ValueError(f"Unsupported format: {extension}")
# Extract text based on file type
if extension == '.pdf':
text = self._parse_pdf(file_path)
elif extension == '.docx':
text = self._parse_docx(file_path)
        elif extension in ('.txt', '.py'):
text = self._parse_txt(file_path)
elif extension == '.md':
text = self._parse_markdown(file_path)
elif extension == '.html':
text = self._parse_html(file_path)
else:
text = ""
# Create chunks
chunks = self._create_chunks(text)
return {
'filename': path.name,
'filepath': str(path.absolute()),
'extension': extension,
'text': text,
'chunks': chunks,
'num_chunks': len(chunks),
'total_chars': len(text)
}
def _parse_pdf(self, file_path: str) -> str:
"""Extract text from PDF"""
try:
from pypdf import PdfReader
reader = PdfReader(file_path)
text = ""
for page in reader.pages:
                text += (page.extract_text() or "") + "\n\n"  # guard: extraction may yield None/empty
return text.strip()
except ImportError:
logger.error("pypdf not installed. Install with: pip install pypdf")
return ""
except Exception as e:
logger.error(f"Error parsing PDF: {e}")
return ""
def _parse_docx(self, file_path: str) -> str:
"""Extract text from DOCX"""
try:
from docx import Document
doc = Document(file_path)
text = "\n\n".join([para.text for para in doc.paragraphs if para.text.strip()])
return text.strip()
except ImportError:
logger.error("python-docx not installed. Install with: pip install python-docx")
return ""
except Exception as e:
logger.error(f"Error parsing DOCX: {e}")
return ""
def _parse_txt(self, file_path: str) -> str:
"""Extract text from TXT"""
try:
with open(file_path, 'r', encoding='utf-8') as f:
return f.read().strip()
except Exception as e:
logger.error(f"Error parsing TXT: {e}")
return ""
def _parse_markdown(self, file_path: str) -> str:
"""Extract text from Markdown"""
try:
import markdown
from bs4 import BeautifulSoup
with open(file_path, 'r', encoding='utf-8') as f:
md_text = f.read()
# Convert markdown to HTML then extract text
html = markdown.markdown(md_text)
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
return text.strip()
except ImportError:
# Fallback: just read as plain text
return self._parse_txt(file_path)
except Exception as e:
logger.error(f"Error parsing Markdown: {e}")
return ""
def _parse_html(self, file_path: str) -> str:
"""Extract text from HTML"""
try:
from bs4 import BeautifulSoup
with open(file_path, 'r', encoding='utf-8') as f:
html = f.read()
soup = BeautifulSoup(html, 'html.parser')
# Remove script and style elements
for script in soup(["script", "style"]):
script.decompose()
text = soup.get_text()
# Clean up whitespace
lines = (line.strip() for line in text.splitlines())
            # Split on runs of double spaces; splitting on a single space would put
            # every word on its own line after the join below
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
return text.strip()
except ImportError:
logger.error("beautifulsoup4 not installed. Install with: pip install beautifulsoup4")
return ""
except Exception as e:
logger.error(f"Error parsing HTML: {e}")
return ""
def _create_chunks(self, text: str) -> List[str]:
"""
Split text into overlapping chunks
Args:
text: Full text to chunk
Returns:
List of text chunks
"""
if not text:
return []
chunks = []
start = 0
text_length = len(text)
while start < text_length:
            logger.debug(f"Chunking at offset {start} of {text_length} characters")
            end = start + self.chunk_size
            # If this isn't the last chunk, prefer to break at a paragraph, then a
            # sentence, then any space -- but search only the final overlap window,
            # so a break early in the chunk cannot block a later, better one or
            # stall the loop's progress
            if end < text_length:
                window_start = max(start, end - self.chunk_overlap)
                for separator in ('\n\n', '. ', ' '):
                    break_pos = text.rfind(separator, window_start, end)
                    if break_pos != -1:
                        end = break_pos + 1
                        break
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
# Move start position with overlap
start = end - self.chunk_overlap
if start < 0:
start = 0
return chunks
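# --- Usage sketch ---
# A minimal, hypothetical example; 'notes.md' is an assumed path, not a file
# shipped with the project.
if __name__ == "__main__":
    parser = DocumentParser(chunk_size=500, chunk_overlap=100)
    parsed = parser.parse_file("notes.md")
    print(f"{parsed['filename']}: {parsed['num_chunks']} chunks, "
          f"{parsed['total_chars']} chars")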

View File

@@ -0,0 +1,84 @@
"""
Embeddings utility using sentence-transformers
"""
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Union
import logging
logger = logging.getLogger(__name__)
class EmbeddingModel:
"""Wrapper for sentence transformer embeddings"""
def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
"""
Initialize embedding model
Args:
model_name: HuggingFace model name for embeddings
"""
self.model_name = model_name
logger.info(f"Loading embedding model: {model_name}")
self.model = SentenceTransformer(model_name)
self.dimension = self.model.get_sentence_embedding_dimension()
logger.info(f"Embedding dimension: {self.dimension}")
def embed(self, texts: Union[str, List[str]]) -> np.ndarray:
"""
Generate embeddings for text(s)
Args:
texts: Single text or list of texts
Returns:
Numpy array of embeddings
"""
if isinstance(texts, str):
texts = [texts]
embeddings = self.model.encode(texts, show_progress_bar=False)
return embeddings
def embed_query(self, query: str) -> List[float]:
"""
Embed a single query - returns as list for ChromaDB compatibility
Args:
query: Query text
Returns:
List of floats representing the embedding
"""
embedding = self.model.encode([query], show_progress_bar=False)[0]
return embedding.tolist()
def embed_documents(self, documents: List[str]) -> List[List[float]]:
"""
Embed multiple documents - returns as list of lists for ChromaDB
Args:
documents: List of document texts
Returns:
List of embeddings (each as list of floats)
"""
embeddings = self.model.encode(documents, show_progress_bar=False)
return embeddings.tolist()
def similarity(self, text1: str, text2: str) -> float:
"""
Calculate cosine similarity between two texts
Args:
text1: First text
text2: Second text
Returns:
Similarity score between 0 and 1
"""
emb1, emb2 = self.model.encode([text1, text2])
# Cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
return float(similarity)
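# --- Usage sketch ---
# A minimal example; the first run downloads all-MiniLM-L6-v2 (~80 MB) from
# HuggingFace, after which everything runs locally.
if __name__ == "__main__":
    model = EmbeddingModel()
    print(f"dimension: {model.dimension}")  # 384 for all-MiniLM-L6-v2
    score = model.similarity("local LLM inference",
                             "running language models on your own machine")
    print(f"similarity: {score:.2f}")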

View File

@@ -0,0 +1,107 @@
"""
Ollama Client - Wrapper for local Ollama API
"""
import requests
import json
from typing import List, Dict, Optional
import logging
logger = logging.getLogger(__name__)
class OllamaClient:
"""Client for interacting with local Ollama models"""
def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3.2"):
self.base_url = base_url
self.model = model
self.api_url = f"{base_url}/api"
def generate(self, prompt: str, system: Optional[str] = None,
temperature: float = 0.7, max_tokens: int = 2048) -> str:
"""Generate text from a prompt"""
try:
payload = {
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": temperature,
"num_predict": max_tokens
}
}
if system:
payload["system"] = system
response = requests.post(
f"{self.api_url}/generate",
json=payload,
timeout=1200
)
response.raise_for_status()
result = response.json()
return result.get("response", "").strip()
except requests.exceptions.RequestException as e:
logger.error(f"Ollama API error: {e}")
return f"Error: Unable to connect to Ollama. Is it running? ({str(e)})"
def chat(self, messages: List[Dict[str, str]],
temperature: float = 0.7, max_tokens: int = 2048) -> str:
"""Chat completion with message history"""
try:
payload = {
"model": self.model,
"messages": messages,
"stream": False,
"options": {
"temperature": temperature,
"num_predict": max_tokens
}
}
response = requests.post(
f"{self.api_url}/chat",
json=payload,
timeout=1200
)
response.raise_for_status()
result = response.json()
return result.get("message", {}).get("content", "").strip()
except requests.exceptions.RequestException as e:
logger.error(f"Ollama API error: {e}")
return f"Error: Unable to connect to Ollama. Is it running? ({str(e)})"
def check_connection(self) -> bool:
"""Check if Ollama is running and model is available"""
try:
response = requests.get(f"{self.base_url}/api/tags", timeout=5)
response.raise_for_status()
models = response.json().get("models", [])
model_names = [m["name"] for m in models]
            # Ollama reports tagged names (e.g. "llama3.2:latest"), so match the
            # base name as well as the exact name
            base_names = {name.split(':')[0] for name in model_names}
            if self.model not in model_names and self.model not in base_names:
                logger.warning(f"Model {self.model} not found. Available: {model_names}")
                return False
return True
except requests.exceptions.RequestException as e:
logger.error(f"Cannot connect to Ollama: {e}")
return False
def list_models(self) -> List[str]:
"""List available Ollama models"""
try:
response = requests.get(f"{self.base_url}/api/tags", timeout=5)
response.raise_for_status()
models = response.json().get("models", [])
return [m["name"] for m in models]
except requests.exceptions.RequestException:
return []
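# --- Usage sketch ---
# A minimal example; assumes Ollama is serving on localhost:11434 and that
# llama3.2 has been pulled (ollama pull llama3.2).
if __name__ == "__main__":
    client = OllamaClient()
    if client.check_connection():
        print(client.generate("Explain RAG in one sentence.", temperature=0.2))
    else:
        print("Ollama not ready. Models available:", client.list_models())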

View File

@@ -0,0 +1,129 @@
"""
Setup Verification Script for KnowledgeHub
Run this to check if everything is configured correctly
"""
import sys
import os
print("🔍 KnowledgeHub Setup Verification\n")
print("=" * 60)
# Check Python version
print(f"✓ Python version: {sys.version}")
print(f"✓ Python executable: {sys.executable}")
print(f"✓ Current directory: {os.getcwd()}")
print()
# Check directory structure
print("📁 Checking directory structure...")
required_dirs = ['agents', 'models', 'utils']
for dir_name in required_dirs:
if os.path.isdir(dir_name):
init_file = os.path.join(dir_name, '__init__.py')
if os.path.exists(init_file):
print(f"{dir_name}/ exists with __init__.py")
else:
print(f" ⚠️ {dir_name}/ exists but missing __init__.py")
else:
print(f"{dir_name}/ directory not found")
print()
# Check required files
print("📄 Checking required files...")
required_files = ['app.py', 'requirements.txt']
for file_name in required_files:
if os.path.exists(file_name):
print(f"{file_name} exists")
else:
print(f"{file_name} not found")
print()
# Try importing modules
print("📦 Testing imports...")
errors = []
try:
from utils import OllamaClient, EmbeddingModel, DocumentParser
print(" ✓ utils module imported successfully")
except ImportError as e:
print(f" ❌ Cannot import utils: {e}")
errors.append(str(e))
try:
from models import Document, DocumentChunk, SearchResult, Summary
print(" ✓ models module imported successfully")
except ImportError as e:
print(f" ❌ Cannot import models: {e}")
errors.append(str(e))
try:
from agents import (
IngestionAgent, QuestionAgent, SummaryAgent,
ConnectionAgent, ExportAgent
)
print(" ✓ agents module imported successfully")
except ImportError as e:
print(f" ❌ Cannot import agents: {e}")
errors.append(str(e))
print()
# Check dependencies
print("📚 Checking Python dependencies...")
required_packages = [
'gradio', 'chromadb', 'sentence_transformers',
'requests', 'numpy', 'tqdm'
]
missing_packages = []
for package in required_packages:
try:
__import__(package.replace('-', '_'))
print(f"{package} installed")
except ImportError:
print(f"{package} not installed")
missing_packages.append(package)
print()
# Check Ollama
print("🤖 Checking Ollama...")
try:
import requests
response = requests.get('http://localhost:11434/api/tags', timeout=2)
if response.status_code == 200:
print(" ✓ Ollama is running")
models = response.json().get('models', [])
if models:
print(f" ✓ Available models: {[m['name'] for m in models]}")
if any('llama3.2' in m['name'] for m in models):
print(" ✓ llama3.2 model found")
else:
print(" ⚠️ llama3.2 model not found. Run: ollama pull llama3.2")
else:
print(" ⚠️ No models found. Run: ollama pull llama3.2")
else:
print(" ⚠️ Ollama responded but with error")
except Exception as e:
print(f" ❌ Cannot connect to Ollama: {e}")
print(" Start Ollama with: ollama serve")
print()
print("=" * 60)
# Final summary
if errors or missing_packages:
print("\n⚠️ ISSUES FOUND:\n")
if errors:
print("Import Errors:")
for error in errors:
print(f" - {error}")
if missing_packages:
print("\nMissing Packages:")
print(f" Run: pip install {' '.join(missing_packages)}")
print("\n💡 Fix these issues before running app.py")
else:
print("\n✅ All checks passed! You're ready to run:")
print(" python app.py")
print()