Adding shabsi4u youtube video summarizer for day 1

This commit is contained in:
shabsi4u
2025-09-16 15:45:47 +05:30
parent 236749eb29
commit 07ccfaa3ed
7 changed files with 6223 additions and 0 deletions

View File

@@ -0,0 +1,188 @@
# YouTube Video Summarizer
A Python tool that automatically fetches YouTube video transcripts and generates comprehensive summaries using OpenAI's GPT-4o-mini model. Features intelligent chunking for large videos and high-quality summarization.
## Features
- 🎬 **YouTube Integration**: Automatically fetches video transcripts
- 🤖 **AI-Powered Summaries**: Uses GPT-4o-mini for high-quality summaries
- 📊 **Smart Chunking**: Handles large videos by splitting into manageable chunks
- 🔄 **Automatic Stitching**: Combines chunk summaries into cohesive final summaries
- 💰 **Cost-Effective**: Optimized for GPT-4o-mini's token limits
- 🛡️ **Error Handling**: Robust error handling with helpful messages
## Installation
### Prerequisites
- Python 3.8 or higher
### Option 1: Using the installation script (Recommended)
```bash
# Run the automated installation script
python install.py
# The script lets you choose between UV and pip,
# then prints the command to run the summarizer with your chosen method
```
### Option 2: Using UV
```bash
# Install UV if not already installed
pip install uv
# Install dependencies and create virtual environment
uv sync
# Run the script
uv run python youtube_video_summarizer.py
```
### Option 3: Using pip
```bash
# Install dependencies
pip install -r requirements.txt
# Run the script
python youtube_video_summarizer.py
```
### Optional Dependencies
#### With UV:
```bash
# For Jupyter notebook support
uv sync --extra jupyter
# For development dependencies (testing, linting, etc.)
uv sync --extra dev
```
#### With pip:
```bash
# For Jupyter notebook support
pip install ipython jupyter
# For development dependencies
pip install pytest black flake8 mypy
```
## Setup
1. **Get an OpenAI API Key**:
- Visit [OpenAI API](https://platform.openai.com/api-keys)
- Create a new API key
2. **Create a .env file**:
```bash
echo "OPENAI_API_KEY=your_api_key_here" > .env
```
3. **Update the video URL** in `youtube_video_summarizer.py`:
```python
video_url = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
```
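Once the `.env` file is in place, the script picks up the key via `python-dotenv`. A minimal sketch of what its `get_api_key()` helper does:
```python
# Sketch of the key-loading step (mirrors get_api_key in youtube_video_summarizer.py)
import os
from dotenv import load_dotenv

load_dotenv(override=True)             # reads .env from the working directory
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY is not set")
```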
## Usage
### Basic Usage
```python
from youtube_video_summarizer import YouTubeVideo, summarize_video
# Create video object
video = YouTubeVideo("https://www.youtube.com/watch?v=VIDEO_ID")
# Generate summary
summary = summarize_video(video)
print(summary)
```
### Advanced Usage with Custom Settings
```python
# Custom chunking settings
summary = summarize_video(
video,
use_chunking=True,
max_chunk_tokens=4000
)
```
## How It Works
1. **Video Processing**: Fetches YouTube video metadata and transcript
2. **Token Analysis**: Counts tokens to determine if chunking is needed
3. **Smart Chunking**: Splits large transcripts into manageable pieces
4. **Individual Summaries**: Generates summaries for each chunk
5. **Intelligent Stitching**: Combines chunk summaries into final result
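A minimal sketch of this pipeline, wired together from the functions `youtube_video_summarizer.py` exports (the URL is a placeholder):
```python
from youtube_video_summarizer import YouTubeVideo, count_tokens, chunk_transcript, summarize_video

video = YouTubeVideo("https://www.youtube.com/watch?v=VIDEO_ID")  # step 1: metadata + transcript
text = " ".join(segment.text for segment in video.transcript)
print(f"Transcript tokens: {count_tokens(text)}")                 # step 2: token analysis
print(f"Chunks: {len(chunk_transcript(video.transcript, max_tokens=4000))}")  # step 3: smart chunking
summary = summarize_video(video)  # steps 4-5: chunk summaries + stitching happen inside
print(summary)
```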
## Configuration
### Model Settings
- **Model**: GPT-4o-mini (cost-effective and high-quality)
- **Temperature**: 0.3 (focused, consistent output)
- **Max Tokens**: 2,000 (optimal for summaries)
### Chunking Settings
- **Max Chunk Size**: 4,000 tokens by default (an optimal size can also be derived from the model's context window)
- **Overlap**: 5% of chunk size (maintains context)
- **Auto-detection**: Automatically determines if chunking is needed
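These defaults mirror the arithmetic inside the script's `get_optimal_chunk_size()`:
```python
# Token budget behind the chunk-size calculation (mirrors get_optimal_chunk_size)
context_window = 8192                    # limit the script assumes for gpt-4o-mini
reserved = 800 + 300 + 2000 + 500        # system prompt + user overhead + output + safety buffer
chunk_size = max(context_window - reserved, 2000)  # 4592 here; never below 2,000
overlap = int(chunk_size * 0.05)         # 5% of the chunk carries over between chunks
```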
## Error Handling
The script includes comprehensive error handling:
- ✅ **Missing Dependencies**: Clear installation instructions
- ✅ **Invalid URLs**: YouTube URL validation
- ✅ **API Errors**: OpenAI API error handling
- ✅ **Network Issues**: Clear error messages for failed requests
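For example, missing dependencies are detected with guarded imports (abridged from the script):
```python
import sys

try:
    import requests
except ImportError:
    print("❌ Error: 'requests' module not found.")
    print("💡 Install with: pip install requests")
    sys.exit(1)
```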
## Requirements
- **Python**: 3.8 or higher
- **OpenAI API Key**: Required for summarization
- **Internet Connection**: For YouTube and OpenAI API access
## Dependencies
### Core Dependencies
- `requests`: HTTP requests
- `tiktoken`: Token counting
- `python-dotenv`: Environment variable management
- `openai`: OpenAI API client
- `youtube-transcript-api`: YouTube transcript fetching
- `beautifulsoup4`: HTML parsing
### Optional Dependencies
- `ipython`: Jupyter notebook support
- `jupyter`: Jupyter notebook support
## Troubleshooting
### Common Issues
1. **ModuleNotFoundError**:
- With UV: Run `uv sync` to install dependencies
- With pip: Run `pip install -r requirements.txt`
2. **UV not found**: Install UV with `pip install uv` or run `python install.py`
3. **OpenAI API Error**: Check your API key in `.env` file
4. **YouTube Transcript Error**: Video may not have transcripts available
5. **Token Limit Error**: Video transcript is too long (rare with chunking)
### Getting Help
If you encounter issues:
1. Check the error messages (they include helpful installation instructions)
2. Ensure all dependencies are installed:
- With UV: `uv sync`
- With pip: `pip install -r requirements.txt`
3. Verify your OpenAI API key is correct
4. Check that the YouTube video has transcripts available
5. Try running with the appropriate command:
- With UV: `uv run python youtube_video_summarizer.py`
- With pip: `python youtube_video_summarizer.py`
## License
This project is part of the LLM Engineering course materials.
## Contributing
Feel free to submit issues and enhancement requests!

View File

@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Installation script for YouTube Video Summarizer
This script installs all required dependencies for the project using either UV or pip.
"""
import subprocess
import sys
import os
import shutil
def run_command(command, description):
"""Run a command and handle errors"""
print(f"🔄 {description}...")
try:
result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
print(f"{description} completed successfully")
return True
except subprocess.CalledProcessError as e:
print(f"{description} failed:")
print(f" Error: {e.stderr}")
return False
def check_python_version():
"""Check if Python version is compatible"""
version = sys.version_info
if version.major < 3 or (version.major == 3 and version.minor < 8):
print("❌ Python 3.8 or higher is required")
print(f" Current version: {version.major}.{version.minor}.{version.micro}")
return False
print(f"✅ Python {version.major}.{version.minor}.{version.micro} is compatible")
return True
def check_uv_installed():
"""Check if UV is installed"""
if shutil.which("uv"):
print("✅ UV is already installed")
return True
else:
print("❌ UV is not installed")
return False
def install_uv():
"""Install UV package manager"""
print("🔄 Installing UV...")
try:
# Try to install UV using pip first
if not run_command(f"{sys.executable} -m pip install uv", "Installing UV via pip"):
# Fallback to curl installation
install_script = "curl -LsSf https://astral.sh/uv/install.sh | sh"
if not run_command(install_script, "Installing UV via curl"):
print("❌ Failed to install UV. Please install it manually:")
print(" pip install uv")
print(" or visit: https://github.com/astral-sh/uv")
return False
return True
except Exception as e:
print(f"❌ Error installing UV: {e}")
return False
def choose_package_manager():
"""Let user choose between UV and pip"""
print("\n📦 Choose your package manager:")
print("1. UV (recommended - faster, better dependency resolution)")
print("2. pip (traditional Python package manager)")
while True:
choice = input("\nEnter your choice (1 or 2): ").strip()
if choice == "1":
return "uv"
elif choice == "2":
return "pip"
else:
print("❌ Invalid choice. Please enter 1 or 2.")
def install_dependencies_uv():
"""Install dependencies using UV"""
print("🚀 Installing YouTube Video Summarizer dependencies with UV...")
print("=" * 60)
# Check if UV is installed, install if not
if not check_uv_installed():
if not install_uv():
return False
# Check if pyproject.toml exists
pyproject_file = os.path.join(os.path.dirname(__file__), "pyproject.toml")
if not os.path.exists(pyproject_file):
print("❌ pyproject.toml not found. Please ensure you're in the project directory.")
return False
# Install dependencies using UV
if not run_command("uv sync", "Installing dependencies with UV"):
return False
print("=" * 60)
print("🎉 Installation completed successfully!")
print("\n📋 Next steps:")
print("1. Create a .env file with your OpenAI API key:")
print(" OPENAI_API_KEY=your_api_key_here")
print("2. Run the script:")
print(" uv run python youtube_video_summarizer.py")
print("\n💡 For Jupyter notebook support, install with:")
print(" uv sync --extra jupyter")
print("\n💡 For development dependencies, install with:")
print(" uv sync --extra dev")
return True
def install_dependencies_pip():
"""Install dependencies using pip"""
print("🚀 Installing YouTube Video Summarizer dependencies with pip...")
print("=" * 60)
# Upgrade pip first
if not run_command(f"{sys.executable} -m pip install --upgrade pip", "Upgrading pip"):
return False
# Install dependencies from requirements.txt
requirements_file = os.path.join(os.path.dirname(__file__), "requirements.txt")
if os.path.exists(requirements_file):
if not run_command(f"{sys.executable} -m pip install -r {requirements_file}", "Installing dependencies from requirements.txt"):
return False
else:
# Install core dependencies individually
core_deps = [
"requests",
"tiktoken",
"python-dotenv",
"openai",
"youtube-transcript-api",
"beautifulsoup4"
]
for dep in core_deps:
if not run_command(f"{sys.executable} -m pip install {dep}", f"Installing {dep}"):
return False
print("=" * 60)
print("🎉 Installation completed successfully!")
print("\n📋 Next steps:")
print("1. Create a .env file with your OpenAI API key:")
print(" OPENAI_API_KEY=your_api_key_here")
print("2. Run the script:")
print(" python youtube_video_summarizer.py")
print("\n💡 For Jupyter notebook support, also install:")
print(" pip install jupyter ipython")
return True
def install_dependencies():
"""Install required dependencies using chosen package manager"""
# Check Python version
if not check_python_version():
return False
# Let user choose package manager
package_manager = choose_package_manager()
if package_manager == "uv":
return install_dependencies_uv()
else:
return install_dependencies_pip()
def main():
"""Main installation function"""
print("🎬 YouTube Video Summarizer - Installation Script")
print("=" * 60)
if install_dependencies():
print("\n✅ All dependencies installed successfully!")
print("🚀 You can now run the YouTube Video Summarizer!")
else:
print("\n❌ Installation failed. Please check the error messages above.")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,78 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "youtube-video-summarizer"
version = "1.0.0"
description = "A tool to summarize YouTube videos using OpenAI's GPT models"
readme = "README.md"
requires-python = ">=3.8"
license = {text = "MIT"}
authors = [
{name = "YouTube Video Summarizer Team"},
]
keywords = ["youtube", "summarizer", "openai", "transcript", "ai"]
classifiers = [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Multimedia :: Video",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]
dependencies = [
"requests>=2.25.0",
"tiktoken>=0.5.0",
"python-dotenv>=0.19.0",
"openai>=1.0.0",
"youtube-transcript-api>=0.6.0",
"beautifulsoup4>=4.9.0",
]
[project.optional-dependencies]
jupyter = [
"ipython>=7.0.0",
"jupyter>=1.0.0",
]
dev = [
"pytest>=6.0.0",
"black>=22.0.0",
"flake8>=4.0.0",
"mypy>=0.950",
]
[project.urls]
Homepage = "https://github.com/your-username/youtube-video-summarizer"
Repository = "https://github.com/your-username/youtube-video-summarizer"
Issues = "https://github.com/your-username/youtube-video-summarizer/issues"
[project.scripts]
youtube-summarizer = "youtube_video_summarizer:main"
[tool.uv]
dev-dependencies = [
"pytest>=6.0.0",
"black>=22.0.0",
"flake8>=4.0.0",
"mypy>=0.950",
]
[tool.black]
line-length = 88
target-version = ['py38']
[tool.mypy]
python_version = "3.8"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true

View File

@@ -0,0 +1,17 @@
# Core dependencies for YouTube Video Summarizer
requests>=2.25.0
tiktoken>=0.5.0
python-dotenv>=0.19.0
openai>=1.0.0
youtube-transcript-api>=0.6.0
beautifulsoup4>=4.9.0
# Optional dependencies for Jupyter notebook support
ipython>=7.0.0
jupyter>=1.0.0
# Development dependencies (optional)
pytest>=6.0.0
black>=22.0.0
flake8>=4.0.0
mypy>=0.950

File diff suppressed because it is too large

View File

@@ -0,0 +1,906 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e371ea2b",
"metadata": {},
"source": [
"# YouTube Video Summarizer\n",
"\n",
"This notebook provides a comprehensive solution for summarizing YouTube videos using OpenAI's GPT models. It includes:\n",
"\n",
"- **Automatic transcript extraction** from YouTube videos\n",
"- **Intelligent chunking** for large videos that exceed token limits\n",
"- **Smart summarization** with academic-quality output\n",
"- **Error handling** and dependency management\n",
"\n",
"## Features\n",
"\n",
"- ✅ Extracts transcripts from YouTube videos\n",
"- ✅ Handles videos of any length with automatic chunking\n",
"- ✅ Generates structured, academic-quality summaries\n",
"- ✅ Includes proper error handling and dependency checks\n",
"- ✅ Optimized for different OpenAI models\n",
"- ✅ Interactive notebook format for easy testing\n",
"\n",
"## Prerequisites\n",
"\n",
"Make sure you have the required dependencies installed:\n",
"```bash\n",
"pip install -r requirements.txt\n",
"```\n",
"\n",
"You'll also need an OpenAI API key set in your environment variables or `.env` file.\n"
]
},
{
"cell_type": "markdown",
"id": "95b713e0",
"metadata": {},
"source": [
"## 1. Import Dependencies and Setup\n",
"\n",
"First, let's import all required libraries and set up the environment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c940970b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"import sys\n",
"\n",
"# Check for required dependencies and provide helpful error messages\n",
"try:\n",
" import requests\n",
" print(\"✅ requests imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'requests' module not found.\")\n",
" print(\"💡 Install with: pip install requests\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" import tiktoken\n",
" print(\"✅ tiktoken imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'tiktoken' module not found.\")\n",
" print(\"💡 Install with: pip install tiktoken\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from dotenv import load_dotenv\n",
" print(\"✅ python-dotenv imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'python-dotenv' module not found.\")\n",
" print(\"💡 Install with: pip install python-dotenv\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from openai import OpenAI\n",
" print(\"✅ openai imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'openai' module not found.\")\n",
" print(\"💡 Install with: pip install openai\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from youtube_transcript_api import YouTubeTranscriptApi\n",
" print(\"✅ youtube-transcript-api imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'youtube-transcript-api' module not found.\")\n",
" print(\"💡 Install with: pip install youtube-transcript-api\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from bs4 import BeautifulSoup\n",
" print(\"✅ beautifulsoup4 imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'beautifulsoup4' module not found.\")\n",
" print(\"💡 Install with: pip install beautifulsoup4\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from IPython.display import Markdown, display\n",
" print(\"✅ IPython.display imported successfully\")\n",
"except ImportError:\n",
" # IPython is optional for Jupyter notebooks\n",
" print(\"⚠️ Warning: IPython not available (optional for Jupyter notebooks)\")\n",
" Markdown = None\n",
" display = None\n",
"\n",
"print(\"\\n🎉 All dependencies imported successfully!\")\n"
]
},
{
"cell_type": "markdown",
"id": "603e9c3b",
"metadata": {},
"source": [
"## 2. Configuration and Constants\n",
"\n",
"Set up headers for web scraping and define the YouTubeVideo class.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8584ca1a",
"metadata": {},
"outputs": [],
"source": [
"# Headers for website scraping\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class YouTubeVideo:\n",
" \"\"\"Class to handle YouTube video data extraction and processing\"\"\"\n",
" \n",
" def __init__(self, url):\n",
" \"\"\"\n",
" Initialize YouTube video object\n",
" \n",
" Args:\n",
" url (str): YouTube video URL\n",
" \"\"\"\n",
" self.url = url\n",
" youtube_pattern = r'https://www\\.youtube\\.com/watch\\?v=[a-zA-Z0-9_-]+'\n",
" \n",
" if re.match(youtube_pattern, url):\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.video_id = url.split(\"v=\")[1]\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" self.transcript = YouTubeTranscriptApi().fetch(self.video_id)\n",
" else:\n",
" raise ValueError(\"Invalid YouTube URL\")\n",
" \n",
" def get_transcript_text(self):\n",
" \"\"\"Get transcript as a single text string\"\"\"\n",
" return \" \".join([segment.text for segment in self.transcript])\n",
" \n",
" def get_video_info(self):\n",
" \"\"\"Get basic video information\"\"\"\n",
" return {\n",
" \"title\": self.title,\n",
" \"video_id\": self.video_id,\n",
" \"url\": self.url,\n",
" \"transcript_length\": len(self.transcript)\n",
" }\n",
"\n",
"print(\"✅ YouTubeVideo class defined successfully\")\n"
]
},
{
"cell_type": "markdown",
"id": "235e9998",
"metadata": {},
"source": [
"## 3. OpenAI API Setup\n",
"\n",
"Functions to handle OpenAI API key and client initialization.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4fa7aba3",
"metadata": {},
"outputs": [],
"source": [
"def get_api_key():\n",
" \"\"\"Get OpenAI API key from environment variables\"\"\"\n",
" load_dotenv(override=True)\n",
" api_key = os.getenv(\"OPENAI_API_KEY\")\n",
" if not api_key:\n",
" raise ValueError(\"OPENAI_API_KEY is not set. Please set it in your environment variables or .env file.\")\n",
" return api_key\n",
"\n",
"def get_openai_client():\n",
" \"\"\"Initialize and return OpenAI client\"\"\"\n",
" api_key = get_api_key()\n",
" return OpenAI(api_key=api_key)\n",
"\n",
"# Test API connection\n",
"try:\n",
" client = get_openai_client()\n",
" print(\"✅ OpenAI client initialized successfully\")\n",
" print(\"✅ API key is valid\")\n",
"except Exception as e:\n",
" print(f\"❌ Error initializing OpenAI client: {e}\")\n",
" print(\"💡 Make sure you have set your OPENAI_API_KEY environment variable\")\n"
]
},
{
"cell_type": "markdown",
"id": "4d3223f4",
"metadata": {},
"source": [
"## 4. Token Counting and Chunking Functions\n",
"\n",
"Functions to handle token counting and intelligent chunking of large transcripts.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71f68ad0",
"metadata": {},
"outputs": [],
"source": [
"def count_tokens(text, model=\"gpt-4o-mini\"):\n",
" \"\"\"Count tokens in text using tiktoken with fallback\"\"\"\n",
" try:\n",
" # Try model-specific encoding first\n",
" encoding = tiktoken.encoding_for_model(model)\n",
" return len(encoding.encode(text))\n",
" except KeyError:\n",
" # Fallback to cl100k_base encoding (used by most OpenAI models)\n",
" # This ensures compatibility even if model-specific encoding isn't available\n",
" encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
" return len(encoding.encode(text))\n",
" except Exception as e:\n",
" # Ultimate fallback - rough estimation\n",
" print(f\"Warning: Token counting failed ({e}), using rough estimation\")\n",
" return len(text.split()) * 1.3 # Rough word-to-token ratio\n",
"\n",
"def get_optimal_chunk_size(model=\"gpt-4o-mini\"):\n",
" \"\"\"Calculate optimal chunk size based on model's context window\"\"\"\n",
" model_limits = {\n",
" \"gpt-4o-mini\": 8192,\n",
" \"gpt-4o\": 128000,\n",
" \"gpt-4-turbo\": 128000,\n",
" \"gpt-3.5-turbo\": 4096,\n",
" \"gpt-4\": 8192,\n",
" }\n",
" \n",
" context_window = model_limits.get(model, 8192) # Default to 8K\n",
" \n",
" # Reserve tokens for:\n",
" # - System prompt: ~800 tokens\n",
" # - User prompt overhead: ~300 tokens \n",
" # - Output: ~2000 tokens\n",
" # - Safety buffer: ~500 tokens\n",
" reserved_tokens = 800 + 300 + 2000 + 500\n",
" \n",
" optimal_chunk_size = context_window - reserved_tokens\n",
" \n",
" # Ensure minimum chunk size\n",
" return max(optimal_chunk_size, 2000)\n",
"\n",
"print(\"✅ Token counting and chunk size functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6647838",
"metadata": {},
"outputs": [],
"source": [
"def chunk_transcript(transcript, max_tokens=4000, overlap_tokens=200, model=\"gpt-4o-mini\"):\n",
" \"\"\"\n",
" Split transcript into chunks that fit within token limits\n",
" \n",
" Args:\n",
" transcript: List of transcript segments from YouTube\n",
" max_tokens: Maximum tokens per chunk (auto-calculated if None)\n",
" overlap_tokens: Number of tokens to overlap between chunks\n",
" model: Model name for token limit calculation\n",
" \n",
" Returns:\n",
" List of transcript chunks\n",
" \"\"\"\n",
" # Auto-calculate max_tokens based on model if not provided\n",
" if max_tokens is None:\n",
" max_tokens = get_optimal_chunk_size(model)\n",
" \n",
" # Auto-calculate overlap as percentage of max_tokens\n",
" if overlap_tokens is None:\n",
" overlap_tokens = int(max_tokens * 0.05) # 5% overlap\n",
" \n",
" # Convert transcript to text\n",
" transcript_text = \" \".join([segment.text for segment in transcript])\n",
" \n",
" # If transcript is small enough, return as single chunk\n",
" if count_tokens(transcript_text) <= max_tokens:\n",
" return [transcript_text]\n",
" \n",
" # Split into sentences for better chunking\n",
" sentences = re.split(r'[.!?]+', transcript_text)\n",
" chunks = []\n",
" current_chunk = \"\"\n",
" \n",
" for sentence in sentences:\n",
" sentence = sentence.strip()\n",
" if not sentence:\n",
" continue\n",
" \n",
" # Check if adding this sentence would exceed token limit\n",
" test_chunk = current_chunk + \" \" + sentence if current_chunk else sentence\n",
" \n",
" if count_tokens(test_chunk) <= max_tokens:\n",
" current_chunk = test_chunk\n",
" else:\n",
" # Save current chunk and start new one\n",
" if current_chunk:\n",
" chunks.append(current_chunk)\n",
" \n",
" # Start new chunk with overlap from previous chunk\n",
" if chunks and overlap_tokens > 0:\n",
" # Get last few words from previous chunk for overlap\n",
" prev_words = current_chunk.split()[-overlap_tokens//4:] # Rough word-to-token ratio\n",
" current_chunk = \" \".join(prev_words) + \" \" + sentence\n",
" else:\n",
" current_chunk = sentence\n",
" \n",
" # Add the last chunk\n",
" if current_chunk:\n",
" chunks.append(current_chunk)\n",
" \n",
" return chunks\n",
"\n",
"print(\"✅ Chunking function defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "7ee3f8a4",
"metadata": {},
"source": [
"## 5. Prompt Generation Functions\n",
"\n",
"Functions to generate system prompts, user prompts, and stitching prompts for the summarization process.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7f20bf5",
"metadata": {},
"outputs": [],
"source": [
"def generate_system_prompt():\n",
" \"\"\"Generate the system prompt for video summarization\"\"\"\n",
" return f\"\"\"\n",
" You are an expert YouTube video summarizer. Your job is to take the full transcript of a video and generate a structured, precise, and academically grounded summary.\n",
"\n",
" Your output must include:\n",
"\n",
" 1. Title\n",
" - Either reuse the video's title (if it is clear, accurate, and concise)\n",
" - Or generate a new, sharper, more descriptive title that best reflects the actual content covered.\n",
"\n",
" 2. Topic & Area of Coverage\n",
" - Provide a 12 line highlight of the main topic of the video and the specific area it best covers.\n",
" - Format:\n",
" - Domain (e.g., Finance, Health, Technology, Psychology, Fitness, Productivity, etc.)\n",
" - Sub-area (e.g., investment strategies, portfolio design; training routine, best exercises; productivity systems, cognitive science insights, etc.)\n",
"\n",
" 3. Summary of the Video\n",
" - A structured, clear, and concise summary of the video.\n",
" - Focus only on relevant, high-value content.\n",
" - Skip fluff, tangents, product promotions, personal banter, or irrelevant side discussions.\n",
" - Include key insights, frameworks, step-by-step methods, and actionable advice.\n",
" - Where applicable, reference scientific studies, historical sources, or authoritative references (with author + year or journal if mentioned in the video, or inferred if the reference is well known).\n",
"\n",
" Style & Quality Rules:\n",
" - Be extremely specific: avoid vague generalizations.\n",
" - Use precise language and structured formatting (bullet points, numbered lists, sub-sections if needed).\n",
" - Prioritize clarity and factual accuracy.\n",
" - Write as though preparing an executive briefing or academic digest.\n",
" - If the transcript includes non-relevant sections (jokes, ads, unrelated chit-chat), skip summarizing them entirely.\n",
" \"\"\"\n",
"\n",
"def generate_user_prompt(website, transcript_chunk=None):\n",
" \"\"\"Generate user prompt for video summarization\"\"\"\n",
" if transcript_chunk:\n",
" return f\"\"\"Here is a portion of a YouTube video transcript. Use the system instructions to generate a summary of this section.\n",
"\n",
" Video Title: {website.title}\n",
"\n",
" Transcript Section: {transcript_chunk}\n",
" \"\"\"\n",
" else:\n",
" return f\"\"\"Here is the transcript of a YouTube video. Use the system instructions to generate the output.\n",
"\n",
" Video Title: {website.title}\n",
"\n",
" Transcript: {website.transcript}\n",
" \"\"\"\n",
"\n",
"def generate_stitching_prompt(chunk_summaries, video_title):\n",
" \"\"\"Generate prompt for stitching together chunk summaries\"\"\"\n",
" return f\"\"\"You are an expert at combining multiple summaries into a cohesive, comprehensive summary.\n",
"\n",
" Video Title: {video_title}\n",
"\n",
" Below are summaries of different sections of this video. Combine them into a single, well-structured summary that:\n",
" 1. Maintains the original structure and quality standards\n",
" 2. Eliminates redundancy between sections\n",
" 3. Creates smooth transitions between topics\n",
" 4. Preserves all important information \n",
" 5. Maintains the academic, professional tone\n",
" 6. Include examples and nuances where relevant\n",
" 7. Include the citations and references where applicable\n",
"\n",
" Section Summaries:\n",
" {chr(10).join([f\"Section {i+1}: {summary}\" for i, summary in enumerate(chunk_summaries)])}\n",
"\n",
" Please provide a unified, comprehensive summary following the same format as the individual sections.\n",
" Make sure the final summary is cohesive and logical.\n",
" \"\"\"\n",
"\n",
"print(\"✅ Prompt generation functions defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "5c9a620d",
"metadata": {},
"source": [
"## 6. Summarization Functions\n",
"\n",
"Core functions for summarizing videos with support for both single-chunk and chunked processing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc8a183b",
"metadata": {},
"outputs": [],
"source": [
"def summarize_single_chunk(website, client):\n",
" \"\"\"Summarize a single chunk (small video)\"\"\"\n",
" system_prompt = generate_system_prompt()\n",
" user_prompt = generate_user_prompt(website)\n",
" \n",
" try:\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" max_tokens=2000,\n",
" temperature=0.3\n",
" )\n",
" \n",
" return response.choices[0].message.content\n",
" \n",
" except Exception as e:\n",
" return f\"Error generating summary: {str(e)}\"\n",
"\n",
"def summarize_with_chunking(website, client, max_chunk_tokens=4000):\n",
" \"\"\"Summarize a large video by chunking and stitching\"\"\"\n",
" print(\"Video is large, using chunking strategy...\")\n",
" \n",
" # Chunk the transcript\n",
" chunks = chunk_transcript(website.transcript, max_chunk_tokens)\n",
" print(f\"Split into {len(chunks)} chunks\")\n",
" \n",
" # Summarize each chunk\n",
" chunk_summaries = []\n",
" system_prompt = generate_system_prompt()\n",
" \n",
" for i, chunk in enumerate(chunks):\n",
" print(f\"Processing chunk {i+1}/{len(chunks)}...\")\n",
" user_prompt = generate_user_prompt(website, chunk)\n",
" \n",
" try:\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" max_tokens=1500, # Smaller for chunks\n",
" temperature=0.3\n",
" )\n",
" \n",
" chunk_summaries.append(response.choices[0].message.content)\n",
" \n",
" except Exception as e:\n",
" chunk_summaries.append(f\"Error in chunk {i+1}: {str(e)}\")\n",
" \n",
" # Stitch the summaries together\n",
" print(\"Stitching summaries together...\")\n",
" stitching_prompt = generate_stitching_prompt(chunk_summaries, website.title)\n",
" \n",
" try:\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are an expert at combining multiple summaries into a cohesive, comprehensive summary.\"},\n",
" {\"role\": \"user\", \"content\": stitching_prompt}\n",
" ],\n",
" max_tokens=2000,\n",
" temperature=0.3\n",
" )\n",
" \n",
" return response.choices[0].message.content\n",
" \n",
" except Exception as e:\n",
" return f\"Error stitching summaries: {str(e)}\"\n",
"\n",
"print(\"✅ Summarization functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99168160",
"metadata": {},
"outputs": [],
"source": [
"def summarize_video(website, use_chunking=True, max_chunk_tokens=4000):\n",
" \"\"\"Summarize a YouTube video using OpenAI API with optional chunking for large videos\"\"\"\n",
" client = get_openai_client()\n",
" \n",
" # Check if we need chunking\n",
" transcript_text = \" \".join([segment.text for segment in website.transcript])\n",
" total_tokens = count_tokens(transcript_text)\n",
" \n",
" print(f\"Total transcript tokens: {total_tokens}\")\n",
" \n",
" if total_tokens <= max_chunk_tokens and not use_chunking:\n",
" # Single summary for small videos\n",
" return summarize_single_chunk(website, client)\n",
" else:\n",
" # Chunked summary for large videos\n",
" return summarize_with_chunking(website, client, max_chunk_tokens)\n",
"\n",
"print(\"✅ Main summarization function defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "54a76dab",
"metadata": {},
"source": [
"## 7. Interactive Demo\n",
"\n",
"Now let's test the YouTube video summarizer with a sample video. You can replace the URL with any YouTube video you want to summarize.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87badeff",
"metadata": {},
"outputs": [],
"source": [
"# Example usage - replace with your YouTube URL\n",
"video_url = \"https://www.youtube.com/watch?v=Xan5JnecLNA\"\n",
"\n",
"try:\n",
" # Create YouTube video object\n",
" print(\"🎬 Fetching video data...\")\n",
" video = YouTubeVideo(video_url)\n",
" \n",
" # Display video info\n",
" print(f\"📺 Video Title: {video.title}\")\n",
" print(f\"🆔 Video ID: {video.video_id}\")\n",
" \n",
" # Count tokens in transcript\n",
" transcript_text = video.get_transcript_text()\n",
" total_tokens = count_tokens(transcript_text)\n",
" print(f\"📊 Total transcript tokens: {total_tokens}\")\n",
" \n",
" # Show video info\n",
" info = video.get_video_info()\n",
" print(f\"📝 Transcript segments: {info['transcript_length']}\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ Error: {str(e)}\")\n",
" print(\"💡 Make sure the YouTube URL is valid and the video has captions available\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9e4cf2f",
"metadata": {},
"outputs": [],
"source": [
"# Generate summary (automatically uses chunking if needed)\n",
"if 'video' in locals():\n",
" print(\"\\n🤖 Generating summary...\")\n",
" print(\"⏳ This may take a few minutes for long videos...\")\n",
" \n",
" try:\n",
" summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)\n",
" \n",
" # Display results with nice formatting\n",
" print(\"\\n\" + \"=\"*60)\n",
" print(\"📋 FINAL SUMMARY\")\n",
" print(\"=\"*60)\n",
" \n",
" # Use IPython display if available for better formatting\n",
" if display and Markdown:\n",
" display(Markdown(summary))\n",
" else:\n",
" print(summary)\n",
" \n",
" except Exception as e:\n",
" print(f\"❌ Error generating summary: {str(e)}\")\n",
"else:\n",
" print(\"⚠️ Please run the previous cell first to load a video\")\n"
]
},
{
"cell_type": "markdown",
"id": "42ff8a15",
"metadata": {},
"source": [
"## 8. Testing and Utility Functions\n",
"\n",
"Additional functions for testing the chunking functionality and other utilities.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d798b08f",
"metadata": {},
"outputs": [],
"source": [
"def test_chunking():\n",
" \"\"\"Test function to demonstrate chunking with a sample transcript\"\"\"\n",
" # Sample transcript for testing\n",
" sample_transcript = [\n",
" {\"text\": \"This is a sample transcript segment 1. \" * 100}, # ~1000 tokens\n",
" {\"text\": \"This is a sample transcript segment 2. \" * 100}, # ~1000 tokens\n",
" {\"text\": \"This is a sample transcript segment 3. \" * 100}, # ~1000 tokens\n",
" {\"text\": \"This is a sample transcript segment 4. \" * 100}, # ~1000 tokens\n",
" {\"text\": \"This is a sample transcript segment 5. \" * 100}, # ~1000 tokens\n",
" ]\n",
" \n",
" print(\"🧪 Testing chunking functionality...\")\n",
" chunks = chunk_transcript(sample_transcript, max_tokens=2000, overlap_tokens=100)\n",
" \n",
" print(f\"📊 Original transcript: {count_tokens(' '.join([s['text'] for s in sample_transcript]))} tokens\")\n",
" print(f\"📦 Number of chunks: {len(chunks)}\")\n",
" \n",
" for i, chunk in enumerate(chunks):\n",
" print(f\"📄 Chunk {i+1}: {count_tokens(chunk)} tokens\")\n",
"\n",
"def analyze_video_tokens(video_url):\n",
" \"\"\"Analyze token count and chunking strategy for a video\"\"\"\n",
" try:\n",
" video = YouTubeVideo(video_url)\n",
" transcript_text = video.get_transcript_text()\n",
" total_tokens = count_tokens(transcript_text)\n",
" \n",
" print(f\"📺 Video: {video.title}\")\n",
" print(f\"📊 Total tokens: {total_tokens}\")\n",
" print(f\"📦 Optimal chunk size: {get_optimal_chunk_size()}\")\n",
" \n",
" if total_tokens > 4000:\n",
" chunks = chunk_transcript(video.transcript, max_tokens=4000)\n",
" print(f\"🔀 Would be split into {len(chunks)} chunks\")\n",
" print(\"✅ Chunking strategy recommended\")\n",
" else:\n",
" print(\"✅ Single summary strategy sufficient\")\n",
" \n",
" except Exception as e:\n",
" print(f\"❌ Error analyzing video: {str(e)}\")\n",
"\n",
"print(\"✅ Testing and utility functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfd789e5",
"metadata": {},
"outputs": [],
"source": [
"# Test chunking functionality (optional)\n",
"# Uncomment the line below to test chunking with sample data\n",
"# test_chunking()\n"
]
},
{
"cell_type": "markdown",
"id": "3528125f",
"metadata": {},
"source": [
"## 9. Usage Instructions\n",
"\n",
"### How to Use This Notebook\n",
"\n",
"1. **Set up your OpenAI API key**:\n",
" - Create a `.env` file in the same directory as this notebook\n",
" - Add your API key: `OPENAI_API_KEY=your_api_key_here`\n",
" - Or set it as an environment variable\n",
"\n",
"2. **Install dependencies**:\n",
" ```bash\n",
" pip install -r requirements.txt\n",
" ```\n",
"\n",
"3. **Run the cells in order**:\n",
" - Start with the import and setup cells\n",
" - Modify the `video_url` variable in the demo section\n",
" - Run the demo cells to test the summarizer\n",
"\n",
"### Customization Options\n",
"\n",
"- **Change the model**: Modify the model parameter in the summarization functions\n",
"- **Adjust chunk size**: Change `max_chunk_tokens` parameter\n",
"- **Modify prompts**: Edit the prompt generation functions for different output styles\n",
"- **Add error handling**: Extend the exception handling as needed\n",
"\n",
"### Features\n",
"\n",
"- ✅ **Automatic transcript extraction** from YouTube videos\n",
"- ✅ **Intelligent chunking** for videos exceeding token limits\n",
"- ✅ **Academic-quality summaries** with structured output\n",
"- ✅ **Error handling** and dependency validation\n",
"- ✅ **Interactive testing** with sample data\n",
"- ✅ **Token analysis** and optimization recommendations\n",
"\n",
"### Troubleshooting\n",
"\n",
"- **\"No transcript available\"**: The video may not have captions enabled\n",
"- **\"Invalid YouTube URL\"**: Make sure the URL follows the correct format\n",
"- **\"API key not set\"**: Check your `.env` file or environment variables\n",
"- **Import errors**: Run `pip install -r requirements.txt` to install dependencies\n"
]
},
{
"cell_type": "markdown",
"id": "a5a44fb8",
"metadata": {},
"source": [
"## 10. Advanced Usage Examples\n",
"\n",
"Here are some advanced usage patterns you can try with this notebook.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2bef390a",
"metadata": {},
"outputs": [],
"source": [
"# Example 1: Analyze multiple videos\n",
"video_urls = [\n",
" \"https://www.youtube.com/watch?v=Xan5JnecLNA\",\n",
" # Add more URLs here\n",
"]\n",
"\n",
"for url in video_urls:\n",
" print(f\"\\n{'='*50}\")\n",
" print(f\"Analyzing: {url}\")\n",
" print('='*50)\n",
" analyze_video_tokens(url)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbdb5cd8",
"metadata": {},
"outputs": [],
"source": [
"# Example 2: Custom summarization with different parameters\n",
"def custom_summarize(video_url, model=\"gpt-4o-mini\", max_tokens=3000, temperature=0.1):\n",
" \"\"\"Custom summarization with specific parameters\"\"\"\n",
" try:\n",
" video = YouTubeVideo(video_url)\n",
" client = get_openai_client()\n",
" \n",
" # Use custom chunking parameters\n",
" chunks = chunk_transcript(video.transcript, max_tokens=max_tokens)\n",
" \n",
" if len(chunks) == 1:\n",
" # Single chunk\n",
" system_prompt = generate_system_prompt()\n",
" user_prompt = generate_user_prompt(video, chunks[0])\n",
" \n",
" response = client.chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" max_tokens=2000,\n",
" temperature=temperature\n",
" )\n",
" \n",
" return response.choices[0].message.content\n",
" else:\n",
" # Multiple chunks - use standard chunking approach\n",
" return summarize_with_chunking(video, client, max_tokens)\n",
" \n",
" except Exception as e:\n",
" return f\"Error: {str(e)}\"\n",
"\n",
"# Example usage:\n",
"# custom_summary = custom_summarize(\"https://www.youtube.com/watch?v=Xan5JnecLNA\")\n",
"# print(custom_summary)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,421 @@
import os
import re
import sys
# Check for required dependencies and provide helpful error messages
try:
import requests
except ImportError:
print("❌ Error: 'requests' module not found.")
print("💡 Install with: pip install requests")
print(" Or: pip install -r requirements.txt")
sys.exit(1)
try:
import tiktoken
except ImportError:
print("❌ Error: 'tiktoken' module not found.")
print("💡 Install with: pip install tiktoken")
print(" Or: pip install -r requirements.txt")
sys.exit(1)
try:
from dotenv import load_dotenv
except ImportError:
print("❌ Error: 'python-dotenv' module not found.")
print("💡 Install with: pip install python-dotenv")
print(" Or: pip install -r requirements.txt")
sys.exit(1)
try:
from openai import OpenAI
except ImportError:
print("❌ Error: 'openai' module not found.")
print("💡 Install with: pip install openai")
print(" Or: pip install -r requirements.txt")
sys.exit(1)
try:
from youtube_transcript_api import YouTubeTranscriptApi
except ImportError:
print("❌ Error: 'youtube-transcript-api' module not found.")
print("💡 Install with: pip install youtube-transcript-api")
print(" Or: pip install -r requirements.txt")
sys.exit(1)
try:
from bs4 import BeautifulSoup
except ImportError:
print("❌ Error: 'beautifulsoup4' module not found.")
print("💡 Install with: pip install beautifulsoup4")
print(" Or: pip install -r requirements.txt")
sys.exit(1)
try:
from IPython.display import Markdown, display
except ImportError:
# IPython is optional for Jupyter notebooks
print("⚠️ Warning: IPython not available (optional for Jupyter notebooks)")
Markdown = None
display = None
# Request headers and the class for the YouTube video to summarize
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}
class YouTubeVideo:
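    """Class to handle YouTube video data extraction and processing"""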
def __init__(self, url):
self.url = url
youtube_pattern = r'https://www\.youtube\.com/watch\?v=[a-zA-Z0-9_-]+'
if re.match(youtube_pattern, url):
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
            self.video_id = url.split("v=")[1].split("&")[0]
self.title = soup.title.string if soup.title else "No title found"
self.transcript = YouTubeTranscriptApi().fetch(self.video_id)
else:
raise ValueError("Invalid YouTube URL")
#get api key and openai client
def get_api_key():
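    """Get the OpenAI API key from environment variables or a .env file"""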
load_dotenv(override=True)
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY is not set")
return api_key
def get_openai_client():
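    """Initialize and return an OpenAI client"""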
api_key = get_api_key()
return OpenAI(api_key=api_key)
#count tokens
def count_tokens(text, model="gpt-4o-mini"):
"""Count tokens in text using tiktoken with fallback"""
try:
# Try model-specific encoding first
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
except KeyError:
# Fallback to cl100k_base encoding (used by most OpenAI models)
# This ensures compatibility even if model-specific encoding isn't available
encoding = tiktoken.get_encoding("cl100k_base")
return len(encoding.encode(text))
except Exception as e:
# Ultimate fallback - rough estimation
print(f"Warning: Token counting failed ({e}), using rough estimation")
        return int(len(text.split()) * 1.3)  # Rough word-to-token ratio
def get_optimal_chunk_size(model="gpt-4o-mini"):
"""Calculate optimal chunk size based on model's context window"""
model_limits = {
"gpt-4o-mini": 8192,
"gpt-4o": 128000,
"gpt-4-turbo": 128000,
"gpt-3.5-turbo": 4096,
"gpt-4": 8192,
}
context_window = model_limits.get(model, 8192) # Default to 8K
# Reserve tokens for:
# - System prompt: ~800 tokens
# - User prompt overhead: ~300 tokens
# - Output: ~2000 tokens
# - Safety buffer: ~500 tokens
reserved_tokens = 800 + 300 + 2000 + 500
optimal_chunk_size = context_window - reserved_tokens
# Ensure minimum chunk size
return max(optimal_chunk_size, 2000)
#chunk transcript
def chunk_transcript(transcript, max_tokens=4000, overlap_tokens=200, model="gpt-4o-mini"):
"""
Split transcript into chunks that fit within token limits
Args:
transcript: List of transcript segments from YouTube
max_tokens: Maximum tokens per chunk (auto-calculated if None)
overlap_tokens: Number of tokens to overlap between chunks
model: Model name for token limit calculation
Returns:
List of transcript chunks
"""
# Auto-calculate max_tokens based on model if not provided
if max_tokens is None:
max_tokens = get_optimal_chunk_size(model)
# Auto-calculate overlap as percentage of max_tokens
if overlap_tokens is None:
overlap_tokens = int(max_tokens * 0.05) # 5% overlap
# Convert transcript to text
transcript_text = " ".join([segment.text for segment in transcript])
# If transcript is small enough, return as single chunk
if count_tokens(transcript_text) <= max_tokens:
return [transcript_text]
# Split into sentences for better chunking
sentences = re.split(r'[.!?]+', transcript_text)
chunks = []
current_chunk = ""
for sentence in sentences:
sentence = sentence.strip()
if not sentence:
continue
# Check if adding this sentence would exceed token limit
test_chunk = current_chunk + " " + sentence if current_chunk else sentence
if count_tokens(test_chunk) <= max_tokens:
current_chunk = test_chunk
else:
# Save current chunk and start new one
if current_chunk:
chunks.append(current_chunk)
# Start new chunk with overlap from previous chunk
if chunks and overlap_tokens > 0:
# Get last few words from previous chunk for overlap
prev_words = current_chunk.split()[-overlap_tokens//4:] # Rough word-to-token ratio
current_chunk = " ".join(prev_words) + " " + sentence
else:
current_chunk = sentence
# Add the last chunk
if current_chunk:
chunks.append(current_chunk)
return chunks
#generate system prompt
def generate_system_prompt():
return f"""
You are an expert YouTube video summarizer. Your job is to take the full transcript of a video and generate a structured, precise, and academically grounded summary.
Your output must include:
1. Title
    - Either reuse the video's title (if it is clear, accurate, and concise)
- Or generate a new, sharper, more descriptive title that best reflects the actual content covered.
2. Topic & Area of Coverage
    - Provide a 1-2 line highlight of the main topic of the video and the specific area it best covers.
- Format:
- Domain (e.g., Finance, Health, Technology, Psychology, Fitness, Productivity, etc.)
- Sub-area (e.g., investment strategies, portfolio design; training routine, best exercises; productivity systems, cognitive science insights, etc.)
3. Summary of the Video
- A structured, clear, and concise summary of the video.
- Focus only on relevant, high-value content.
- Skip fluff, tangents, product promotions, personal banter, or irrelevant side discussions.
- Include key insights, frameworks, step-by-step methods, and actionable advice.
- Where applicable, reference scientific studies, historical sources, or authoritative references (with author + year or journal if mentioned in the video, or inferred if the reference is well known).
Style & Quality Rules:
- Be extremely specific: avoid vague generalizations.
- Use precise language and structured formatting (bullet points, numbered lists, sub-sections if needed).
- Prioritize clarity and factual accuracy.
- Write as though preparing an executive briefing or academic digest.
- If the transcript includes non-relevant sections (jokes, ads, unrelated chit-chat), skip summarizing them entirely.
"""
#generate user prompt
def generate_user_prompt(website, transcript_chunk=None):
if transcript_chunk:
return f"""Here is a portion of a YouTube video transcript. Use the system instructions to generate a summary of this section.
Video Title: {website.title}
Transcript Section: {transcript_chunk}
"""
else:
return f"""Here is the transcript of a YouTube video. Use the system instructions to generate the output.
Video Title: {website.title}
Transcript: {website.transcript}
"""
#generate stitching prompt
def generate_stitching_prompt(chunk_summaries, video_title):
"""Generate prompt for stitching together chunk summaries"""
return f"""You are an expert at combining multiple summaries into a cohesive, comprehensive summary.
Video Title: {video_title}
Below are summaries of different sections of this video. Combine them into a single, well-structured summary that:
1. Maintains the original structure and quality standards
2. Eliminates redundancy between sections
3. Creates smooth transitions between topics
4. Preserves all important information
5. Maintains the academic, professional tone
6. Include examples and nuances where relevant
7. Include the citations and references where applicable
Section Summaries:
{chr(10).join([f"Section {i+1}: {summary}" for i, summary in enumerate(chunk_summaries)])}
Please provide a unified, comprehensive summary following the same format as the individual sections.
Make sure the final summary is cohesive and logical.
"""
#summarize video
def summarize_video(website, use_chunking=True, max_chunk_tokens=4000):
"""Summarize a YouTube video using OpenAI API with optional chunking for large videos"""
client = get_openai_client()
# Check if we need chunking
transcript_text = " ".join([segment.text for segment in website.transcript])
total_tokens = count_tokens(transcript_text)
print(f"Total transcript tokens: {total_tokens}")
    if total_tokens <= max_chunk_tokens or not use_chunking:
        # Single summary when the video is small or chunking is disabled
return summarize_single_chunk(website, client)
else:
# Chunked summary for large videos
return summarize_with_chunking(website, client, max_chunk_tokens)
#summarize single chunk
def summarize_single_chunk(website, client):
"""Summarize a single chunk (small video)"""
system_prompt = generate_system_prompt()
user_prompt = generate_user_prompt(website)
try:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=2000,
temperature=0.3
)
return response.choices[0].message.content
except Exception as e:
return f"Error generating summary: {str(e)}"
#summarize with chunking
def summarize_with_chunking(website, client, max_chunk_tokens=4000):
"""Summarize a large video by chunking and stitching"""
print("Video is large, using chunking strategy...")
# Chunk the transcript
chunks = chunk_transcript(website.transcript, max_chunk_tokens)
print(f"Split into {len(chunks)} chunks")
# Summarize each chunk
chunk_summaries = []
system_prompt = generate_system_prompt()
for i, chunk in enumerate(chunks):
print(f"Processing chunk {i+1}/{len(chunks)}...")
user_prompt = generate_user_prompt(website, chunk)
try:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=1500, # Smaller for chunks
temperature=0.3
)
chunk_summaries.append(response.choices[0].message.content)
except Exception as e:
chunk_summaries.append(f"Error in chunk {i+1}: {str(e)}")
# Stitch the summaries together
print("Stitching summaries together...")
stitching_prompt = generate_stitching_prompt(chunk_summaries, website.title)
try:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are an expert at combining multiple summaries into a cohesive, comprehensive summary."},
{"role": "user", "content": stitching_prompt}
],
max_tokens=2000,
temperature=0.3
)
return response.choices[0].message.content
except Exception as e:
return f"Error stitching summaries: {str(e)}"
#main function
def main():
"""Main function to demonstrate usage"""
# Example usage - replace with actual YouTube URL
video_url = "https://www.youtube.com/watch?v=Xan5JnecLNA"
try:
# Create YouTube video object
print("Fetching video data...")
video = YouTubeVideo(video_url)
# Display video info
print(f"Video Title: {video.title}")
print(f"Video ID: {video.video_id}")
# Count tokens in transcript
transcript_text = " ".join([segment.text for segment in video.transcript])
total_tokens = count_tokens(transcript_text)
print(f"Total transcript tokens: {total_tokens}")
# Generate summary (automatically uses chunking if needed)
print("\nGenerating summary...")
summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)
# Display results
print("\n" + "="*50)
print("FINAL SUMMARY")
print("="*50)
print(summary)
except Exception as e:
print(f"Error: {str(e)}")
def test_chunking():
"""Test function to demonstrate chunking with a sample transcript"""
    from types import SimpleNamespace
    # Sample transcript for testing; SimpleNamespace gives each segment a
    # .text attribute, matching what chunk_transcript expects
    sample_transcript = [
        SimpleNamespace(text="This is a sample transcript segment 1. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 2. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 3. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 4. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 5. " * 100),  # ~1000 tokens
    ]
print("Testing chunking functionality...")
chunks = chunk_transcript(sample_transcript, max_tokens=2000, overlap_tokens=100)
print(f"Original transcript: {count_tokens(' '.join([s['text'] for s in sample_transcript]))} tokens")
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {count_tokens(chunk)} tokens")
if __name__ == "__main__":
# Uncomment the line below to test chunking
# test_chunking()
# Run main function
main()