Merge pull request #665 from shabsi4u/community-contributions-branch
[shabsi4u] Add YouTube Video Summarizer with AI-powered transcript analysis
This commit is contained in:
188
week1/community-contributions/Youtube_video_summarizer/README.md
Normal file
@@ -0,0 +1,188 @@
# YouTube Video Summarizer

A Python tool that automatically fetches YouTube video transcripts and generates comprehensive summaries using OpenAI's GPT-4o-mini model. Features intelligent chunking for large videos and high-quality summarization.

## Features

- 🎬 **YouTube Integration**: Automatically fetches video transcripts
- 🤖 **AI-Powered Summaries**: Uses GPT-4o-mini for high-quality summaries
- 📊 **Smart Chunking**: Handles large videos by splitting into manageable chunks
- 🔄 **Automatic Stitching**: Combines chunk summaries into cohesive final summaries
- 💰 **Cost-Effective**: Optimized for GPT-4o-mini's token limits
- 🛡️ **Error Handling**: Robust error handling with helpful messages

## Installation

### Prerequisites
- Python 3.8 or higher

### Option 1: Using the installation script (Recommended)
```bash
# Run the automated installation script
python install.py

# The script will let you choose between UV and pip
# Then run the script with your chosen method
```

### Option 2: Using UV
```bash
# Install UV if not already installed
pip install uv

# Install dependencies and create virtual environment
uv sync

# Run the script
uv run python youtube_video_summarizer.py
```

### Option 3: Using pip
```bash
# Install dependencies
pip install -r requirements.txt

# Run the script
python youtube_video_summarizer.py
```

### Optional Dependencies

#### With UV:
```bash
# For Jupyter notebook support
uv sync --extra jupyter

# For development dependencies (testing, linting, etc.)
uv sync --extra dev
```

#### With pip:
```bash
# For Jupyter notebook support
pip install ipython jupyter

# For development dependencies
pip install pytest black flake8 mypy
```

## Setup

1. **Get an OpenAI API Key**:
   - Visit [OpenAI API](https://platform.openai.com/api-keys)
   - Create a new API key

2. **Create a .env file**:
   ```bash
   echo "OPENAI_API_KEY=your_api_key_here" > .env
   ```

3. **Update the video URL** in `youtube_video_summarizer.py`:
   ```python
   video_url = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
   ```

## Usage

### Basic Usage
```python
from youtube_video_summarizer import YouTubeVideo, summarize_video

# Create video object
video = YouTubeVideo("https://www.youtube.com/watch?v=VIDEO_ID")

# Generate summary
summary = summarize_video(video)
print(summary)
```

### Advanced Usage with Custom Settings
```python
# Custom chunking settings
summary = summarize_video(
    video,
    use_chunking=True,
    max_chunk_tokens=4000
)
```

## How It Works

1. **Video Processing**: Fetches YouTube video metadata and transcript
2. **Token Analysis**: Counts tokens to determine if chunking is needed
3. **Smart Chunking**: Splits large transcripts into manageable pieces
4. **Individual Summaries**: Generates summaries for each chunk
5. **Intelligent Stitching**: Combines chunk summaries into final result
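
The steps above can be sketched as a single pipeline. This is a minimal sketch, not the project's actual implementation: `summarize_video_sketch` and `chunk_text` are illustrative stand-ins, and the `summarize` callable is a placeholder for the real GPT-4o-mini call.

```python
def count_tokens(text):
    # Rough stand-in for tiktoken: ~1.3 tokens per word
    return int(len(text.split()) * 1.3)

def chunk_text(text, max_tokens=4000):
    """Greedily pack words into chunks of at most roughly max_tokens."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if count_tokens(" ".join(current)) >= max_tokens:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_video_sketch(transcript_text, summarize, max_tokens=4000):
    """Chunk -> summarize each chunk -> stitch by summarizing the summaries."""
    if count_tokens(transcript_text) <= max_tokens:
        return summarize(transcript_text)  # small video: one call is enough
    chunk_summaries = [summarize(c) for c in chunk_text(transcript_text, max_tokens)]
    return summarize(" ".join(chunk_summaries))  # stitching pass
```

In the real tool, `summarize` would be a function that sends the text to the OpenAI API; here any `str -> str` callable works for experimentation.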

## Configuration

### Model Settings
- **Model**: GPT-4o-mini (cost-effective and high-quality)
- **Temperature**: 0.3 (focused, consistent output)
- **Max Tokens**: 2,000 (optimal for summaries)
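
These settings translate into a chat-completions request roughly like the following sketch (the actual call lives in `youtube_video_summarizer.py`; `build_request` is an illustrative helper, not part of the project's API):

```python
def build_request(system_prompt, transcript_text):
    """Assemble request parameters matching the settings above."""
    return {
        "model": "gpt-4o-mini",  # cost-effective and high-quality
        "temperature": 0.3,      # focused, consistent output
        "max_tokens": 2000,      # optimal for summaries
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript_text},
        ],
    }
```

With the `openai` client, these parameters would be passed as `client.chat.completions.create(**build_request(...))`.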

### Chunking Settings
- **Max Chunk Size**: 4,000 tokens (auto-calculated per model)
- **Overlap**: 5% of chunk size (maintains context)
- **Auto-detection**: Automatically determines if chunking is needed
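
The auto-calculation works roughly like this sketch, which mirrors the reserved-token budget used in the notebook's `get_optimal_chunk_size` (the 8,192-token default context window is the value assumed in the project code):

```python
def optimal_chunk_size(context_window=8192):
    # Reserve room for the system prompt (~800), user prompt overhead (~300),
    # the output (~2000), and a safety buffer (~500)
    reserved = 800 + 300 + 2000 + 500
    return max(context_window - reserved, 2000)  # never below 2,000 tokens

overlap = int(optimal_chunk_size() * 0.05)  # 5% of chunk size carried between chunks
```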

## Error Handling

The script includes comprehensive error handling:
- ✅ **Missing Dependencies**: Clear installation instructions
- ✅ **Invalid URLs**: YouTube URL validation
- ✅ **API Errors**: OpenAI API error handling
- ✅ **Network Issues**: Request timeout and retry logic

## Requirements

- **Python**: 3.8 or higher
- **OpenAI API Key**: Required for summarization
- **Internet Connection**: For YouTube and OpenAI API access

## Dependencies

### Core Dependencies
- `requests`: HTTP requests
- `tiktoken`: Token counting
- `python-dotenv`: Environment variable management
- `openai`: OpenAI API client
- `youtube-transcript-api`: YouTube transcript fetching
- `beautifulsoup4`: HTML parsing

### Optional Dependencies
- `ipython`: Jupyter notebook support
- `jupyter`: Jupyter notebook support

## Troubleshooting

### Common Issues

1. **ModuleNotFoundError**:
   - With UV: Run `uv sync` to install dependencies
   - With pip: Run `pip install -r requirements.txt`
2. **UV not found**: Install UV with `pip install uv` or run `python install.py`
3. **OpenAI API Error**: Check your API key in the `.env` file
4. **YouTube Transcript Error**: The video may not have transcripts available
5. **Token Limit Error**: The video transcript is too long (rare with chunking)

### Getting Help

If you encounter issues:
1. Check the error messages (they include helpful installation instructions)
2. Ensure all dependencies are installed:
   - With UV: `uv sync`
   - With pip: `pip install -r requirements.txt`
3. Verify your OpenAI API key is correct
4. Check that the YouTube video has transcripts available
5. Try running with the appropriate command:
   - With UV: `uv run python youtube_video_summarizer.py`
   - With pip: `python youtube_video_summarizer.py`

## License

This project is part of the LLM Engineering course materials.

## Contributing

Feel free to submit issues and enhancement requests!
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Installation script for YouTube Video Summarizer
This script installs all required dependencies for the project using either UV or pip.
"""

import subprocess
import sys
import os
import shutil


def run_command(command, description):
    """Run a command and handle errors"""
    print(f"🔄 {description}...")
    try:
        subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
        print(f"✅ {description} completed successfully")
        return True
    except subprocess.CalledProcessError as e:
        print(f"❌ {description} failed:")
        print(f"   Error: {e.stderr}")
        return False


def check_python_version():
    """Check if Python version is compatible"""
    version = sys.version_info
    if version.major < 3 or (version.major == 3 and version.minor < 8):
        print("❌ Python 3.8 or higher is required")
        print(f"   Current version: {version.major}.{version.minor}.{version.micro}")
        return False
    print(f"✅ Python {version.major}.{version.minor}.{version.micro} is compatible")
    return True


def check_uv_installed():
    """Check if UV is installed"""
    if shutil.which("uv"):
        print("✅ UV is already installed")
        return True
    print("❌ UV is not installed")
    return False


def install_uv():
    """Install UV package manager"""
    print("🔄 Installing UV...")
    try:
        # Try to install UV using pip first
        if not run_command(f"{sys.executable} -m pip install uv", "Installing UV via pip"):
            # Fall back to the curl installer
            install_script = "curl -LsSf https://astral.sh/uv/install.sh | sh"
            if not run_command(install_script, "Installing UV via curl"):
                print("❌ Failed to install UV. Please install it manually:")
                print("   pip install uv")
                print("   or visit: https://github.com/astral-sh/uv")
                return False
        return True
    except Exception as e:
        print(f"❌ Error installing UV: {e}")
        return False


def choose_package_manager():
    """Let user choose between UV and pip"""
    print("\n📦 Choose your package manager:")
    print("1. UV (recommended - faster, better dependency resolution)")
    print("2. pip (traditional Python package manager)")

    while True:
        choice = input("\nEnter your choice (1 or 2): ").strip()
        if choice == "1":
            return "uv"
        elif choice == "2":
            return "pip"
        else:
            print("❌ Invalid choice. Please enter 1 or 2.")


def install_dependencies_uv():
    """Install dependencies using UV"""
    print("🚀 Installing YouTube Video Summarizer dependencies with UV...")
    print("=" * 60)

    # Check if UV is installed, install if not
    if not check_uv_installed():
        if not install_uv():
            return False

    # Check if pyproject.toml exists
    pyproject_file = os.path.join(os.path.dirname(__file__), "pyproject.toml")
    if not os.path.exists(pyproject_file):
        print("❌ pyproject.toml not found. Please ensure you're in the project directory.")
        return False

    # Install dependencies using UV
    if not run_command("uv sync", "Installing dependencies with UV"):
        return False

    print("=" * 60)
    print("🎉 Installation completed successfully!")
    print("\n📋 Next steps:")
    print("1. Create a .env file with your OpenAI API key:")
    print("   OPENAI_API_KEY=your_api_key_here")
    print("2. Run the script:")
    print("   uv run python youtube_video_summarizer.py")
    print("\n💡 For Jupyter notebook support, install with:")
    print("   uv sync --extra jupyter")
    print("\n💡 For development dependencies, install with:")
    print("   uv sync --extra dev")

    return True


def install_dependencies_pip():
    """Install dependencies using pip"""
    print("🚀 Installing YouTube Video Summarizer dependencies with pip...")
    print("=" * 60)

    # Upgrade pip first
    if not run_command(f"{sys.executable} -m pip install --upgrade pip", "Upgrading pip"):
        return False

    # Install dependencies from requirements.txt
    requirements_file = os.path.join(os.path.dirname(__file__), "requirements.txt")
    if os.path.exists(requirements_file):
        if not run_command(f"{sys.executable} -m pip install -r {requirements_file}", "Installing dependencies from requirements.txt"):
            return False
    else:
        # Install core dependencies individually
        core_deps = [
            "requests",
            "tiktoken",
            "python-dotenv",
            "openai",
            "youtube-transcript-api",
            "beautifulsoup4",
        ]

        for dep in core_deps:
            if not run_command(f"{sys.executable} -m pip install {dep}", f"Installing {dep}"):
                return False

    print("=" * 60)
    print("🎉 Installation completed successfully!")
    print("\n📋 Next steps:")
    print("1. Create a .env file with your OpenAI API key:")
    print("   OPENAI_API_KEY=your_api_key_here")
    print("2. Run the script:")
    print("   python youtube_video_summarizer.py")
    print("\n💡 For Jupyter notebook support, also install:")
    print("   pip install jupyter ipython")

    return True


def install_dependencies():
    """Install required dependencies using chosen package manager"""
    # Check Python version
    if not check_python_version():
        return False

    # Let user choose package manager
    package_manager = choose_package_manager()

    if package_manager == "uv":
        return install_dependencies_uv()
    return install_dependencies_pip()


def main():
    """Main installation function"""
    print("🎬 YouTube Video Summarizer - Installation Script")
    print("=" * 60)

    if install_dependencies():
        print("\n✅ All dependencies installed successfully!")
        print("🚀 You can now run the YouTube Video Summarizer!")
    else:
        print("\n❌ Installation failed. Please check the error messages above.")
        sys.exit(1)


if __name__ == "__main__":
    main()
@@ -0,0 +1,78 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "youtube-video-summarizer"
version = "1.0.0"
description = "A tool to summarize YouTube videos using OpenAI's GPT models"
readme = "README.md"
requires-python = ">=3.8"
license = {text = "MIT"}
authors = [
    {name = "YouTube Video Summarizer Team"},
]
keywords = ["youtube", "summarizer", "openai", "transcript", "ai"]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Topic :: Multimedia :: Video",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]

dependencies = [
    "requests>=2.25.0",
    "tiktoken>=0.5.0",
    "python-dotenv>=0.19.0",
    "openai>=1.0.0",
    "youtube-transcript-api>=0.6.0",
    "beautifulsoup4>=4.9.0",
]

[project.optional-dependencies]
jupyter = [
    "ipython>=7.0.0",
    "jupyter>=1.0.0",
]
dev = [
    "pytest>=6.0.0",
    "black>=22.0.0",
    "flake8>=4.0.0",
    "mypy>=0.950",
]

[project.urls]
Homepage = "https://github.com/your-username/youtube-video-summarizer"
Repository = "https://github.com/your-username/youtube-video-summarizer"
Issues = "https://github.com/your-username/youtube-video-summarizer/issues"

[project.scripts]
youtube-summarizer = "youtube_video_summarizer:main"

[tool.uv]
dev-dependencies = [
    "pytest>=6.0.0",
    "black>=22.0.0",
    "flake8>=4.0.0",
    "mypy>=0.950",
]

[tool.black]
line-length = 88
target-version = ['py38']

[tool.mypy]
python_version = "3.8"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
@@ -0,0 +1,17 @@
# Core dependencies for YouTube Video Summarizer
requests>=2.25.0
tiktoken>=0.5.0
python-dotenv>=0.19.0
openai>=1.0.0
youtube-transcript-api>=0.6.0
beautifulsoup4>=4.9.0

# Optional dependencies for Jupyter notebook support
ipython>=7.0.0
jupyter>=1.0.0

# Development dependencies (optional)
pytest>=6.0.0
black>=22.0.0
flake8>=4.0.0
mypy>=0.950
4435
week1/community-contributions/Youtube_video_summarizer/uv.lock
generated
Normal file
File diff suppressed because it is too large
@@ -0,0 +1,906 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "e371ea2b",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# YouTube Video Summarizer\n",
|
||||||
|
"\n",
|
||||||
|
"This notebook provides a comprehensive solution for summarizing YouTube videos using OpenAI's GPT models. It includes:\n",
|
||||||
|
"\n",
|
||||||
|
"- **Automatic transcript extraction** from YouTube videos\n",
|
||||||
|
"- **Intelligent chunking** for large videos that exceed token limits\n",
|
||||||
|
"- **Smart summarization** with academic-quality output\n",
|
||||||
|
"- **Error handling** and dependency management\n",
|
||||||
|
"\n",
|
||||||
|
"## Features\n",
|
||||||
|
"\n",
|
||||||
|
"- ✅ Extracts transcripts from YouTube videos\n",
|
||||||
|
"- ✅ Handles videos of any length with automatic chunking\n",
|
||||||
|
"- ✅ Generates structured, academic-quality summaries\n",
|
||||||
|
"- ✅ Includes proper error handling and dependency checks\n",
|
||||||
|
"- ✅ Optimized for different OpenAI models\n",
|
||||||
|
"- ✅ Interactive notebook format for easy testing\n",
|
||||||
|
"\n",
|
||||||
|
"## Prerequisites\n",
|
||||||
|
"\n",
|
||||||
|
"Make sure you have the required dependencies installed:\n",
|
||||||
|
"```bash\n",
|
||||||
|
"pip install -r requirements.txt\n",
|
||||||
|
"```\n",
|
||||||
|
"\n",
|
||||||
|
"You'll also need an OpenAI API key set in your environment variables or `.env` file.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "95b713e0",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 1. Import Dependencies and Setup\n",
|
||||||
|
"\n",
|
||||||
|
"First, let's import all required libraries and set up the environment.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "c940970b",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import os\n",
|
||||||
|
"import re\n",
|
||||||
|
"import sys\n",
|
||||||
|
"\n",
|
||||||
|
"# Check for required dependencies and provide helpful error messages\n",
|
||||||
|
"try:\n",
|
||||||
|
" import requests\n",
|
||||||
|
" print(\"✅ requests imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" print(\"❌ Error: 'requests' module not found.\")\n",
|
||||||
|
" print(\"💡 Install with: pip install requests\")\n",
|
||||||
|
" print(\" Or: pip install -r requirements.txt\")\n",
|
||||||
|
" sys.exit(1)\n",
|
||||||
|
"\n",
|
||||||
|
"try:\n",
|
||||||
|
" import tiktoken\n",
|
||||||
|
" print(\"✅ tiktoken imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" print(\"❌ Error: 'tiktoken' module not found.\")\n",
|
||||||
|
" print(\"💡 Install with: pip install tiktoken\")\n",
|
||||||
|
" print(\" Or: pip install -r requirements.txt\")\n",
|
||||||
|
" sys.exit(1)\n",
|
||||||
|
"\n",
|
||||||
|
"try:\n",
|
||||||
|
" from dotenv import load_dotenv\n",
|
||||||
|
" print(\"✅ python-dotenv imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" print(\"❌ Error: 'python-dotenv' module not found.\")\n",
|
||||||
|
" print(\"💡 Install with: pip install python-dotenv\")\n",
|
||||||
|
" print(\" Or: pip install -r requirements.txt\")\n",
|
||||||
|
" sys.exit(1)\n",
|
||||||
|
"\n",
|
||||||
|
"try:\n",
|
||||||
|
" from openai import OpenAI\n",
|
||||||
|
" print(\"✅ openai imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" print(\"❌ Error: 'openai' module not found.\")\n",
|
||||||
|
" print(\"💡 Install with: pip install openai\")\n",
|
||||||
|
" print(\" Or: pip install -r requirements.txt\")\n",
|
||||||
|
" sys.exit(1)\n",
|
||||||
|
"\n",
|
||||||
|
"try:\n",
|
||||||
|
" from youtube_transcript_api import YouTubeTranscriptApi\n",
|
||||||
|
" print(\"✅ youtube-transcript-api imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" print(\"❌ Error: 'youtube-transcript-api' module not found.\")\n",
|
||||||
|
" print(\"💡 Install with: pip install youtube-transcript-api\")\n",
|
||||||
|
" print(\" Or: pip install -r requirements.txt\")\n",
|
||||||
|
" sys.exit(1)\n",
|
||||||
|
"\n",
|
||||||
|
"try:\n",
|
||||||
|
" from bs4 import BeautifulSoup\n",
|
||||||
|
" print(\"✅ beautifulsoup4 imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" print(\"❌ Error: 'beautifulsoup4' module not found.\")\n",
|
||||||
|
" print(\"💡 Install with: pip install beautifulsoup4\")\n",
|
||||||
|
" print(\" Or: pip install -r requirements.txt\")\n",
|
||||||
|
" sys.exit(1)\n",
|
||||||
|
"\n",
|
||||||
|
"try:\n",
|
||||||
|
" from IPython.display import Markdown, display\n",
|
||||||
|
" print(\"✅ IPython.display imported successfully\")\n",
|
||||||
|
"except ImportError:\n",
|
||||||
|
" # IPython is optional for Jupyter notebooks\n",
|
||||||
|
" print(\"⚠️ Warning: IPython not available (optional for Jupyter notebooks)\")\n",
|
||||||
|
" Markdown = None\n",
|
||||||
|
" display = None\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\n🎉 All dependencies imported successfully!\")\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "603e9c3b",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 2. Configuration and Constants\n",
|
||||||
|
"\n",
|
||||||
|
"Set up headers for web scraping and define the YouTubeVideo class.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "8584ca1a",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Headers for website scraping\n",
|
||||||
|
"headers = {\n",
|
||||||
|
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"class YouTubeVideo:\n",
|
||||||
|
" \"\"\"Class to handle YouTube video data extraction and processing\"\"\"\n",
|
||||||
|
" \n",
|
||||||
|
" def __init__(self, url):\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" Initialize YouTube video object\n",
|
||||||
|
" \n",
|
||||||
|
" Args:\n",
|
||||||
|
" url (str): YouTube video URL\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" self.url = url\n",
|
||||||
|
" youtube_pattern = r'https://www\\.youtube\\.com/watch\\?v=[a-zA-Z0-9_-]+'\n",
|
||||||
|
" \n",
|
||||||
|
" if re.match(youtube_pattern, url):\n",
|
||||||
|
" response = requests.get(url, headers=headers)\n",
|
||||||
|
" soup = BeautifulSoup(response.content, 'html.parser')\n",
|
||||||
|
" self.video_id = url.split(\"v=\")[1]\n",
|
||||||
|
" self.title = soup.title.string if soup.title else \"No title found\"\n",
|
||||||
|
" self.transcript = YouTubeTranscriptApi().fetch(self.video_id)\n",
|
||||||
|
" else:\n",
|
||||||
|
" raise ValueError(\"Invalid YouTube URL\")\n",
|
||||||
|
" \n",
|
||||||
|
" def get_transcript_text(self):\n",
|
||||||
|
" \"\"\"Get transcript as a single text string\"\"\"\n",
|
||||||
|
" return \" \".join([segment.text for segment in self.transcript])\n",
|
||||||
|
" \n",
|
||||||
|
" def get_video_info(self):\n",
|
||||||
|
" \"\"\"Get basic video information\"\"\"\n",
|
||||||
|
" return {\n",
|
||||||
|
" \"title\": self.title,\n",
|
||||||
|
" \"video_id\": self.video_id,\n",
|
||||||
|
" \"url\": self.url,\n",
|
||||||
|
" \"transcript_length\": len(self.transcript)\n",
|
||||||
|
" }\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"✅ YouTubeVideo class defined successfully\")\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "235e9998",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 3. OpenAI API Setup\n",
|
||||||
|
"\n",
|
||||||
|
"Functions to handle OpenAI API key and client initialization.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "4fa7aba3",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def get_api_key():\n",
|
||||||
|
" \"\"\"Get OpenAI API key from environment variables\"\"\"\n",
|
||||||
|
" load_dotenv(override=True)\n",
|
||||||
|
" api_key = os.getenv(\"OPENAI_API_KEY\")\n",
|
||||||
|
" if not api_key:\n",
|
||||||
|
" raise ValueError(\"OPENAI_API_KEY is not set. Please set it in your environment variables or .env file.\")\n",
|
||||||
|
" return api_key\n",
|
||||||
|
"\n",
|
||||||
|
"def get_openai_client():\n",
|
||||||
|
" \"\"\"Initialize and return OpenAI client\"\"\"\n",
|
||||||
|
" api_key = get_api_key()\n",
|
||||||
|
" return OpenAI(api_key=api_key)\n",
|
||||||
|
"\n",
|
||||||
|
"# Test API connection\n",
|
||||||
|
"try:\n",
|
||||||
|
" client = get_openai_client()\n",
|
||||||
|
" print(\"✅ OpenAI client initialized successfully\")\n",
|
||||||
|
" print(\"✅ API key is valid\")\n",
|
||||||
|
"except Exception as e:\n",
|
||||||
|
" print(f\"❌ Error initializing OpenAI client: {e}\")\n",
|
||||||
|
" print(\"💡 Make sure you have set your OPENAI_API_KEY environment variable\")\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "4d3223f4",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 4. Token Counting and Chunking Functions\n",
|
||||||
|
"\n",
|
||||||
|
"Functions to handle token counting and intelligent chunking of large transcripts.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "71f68ad0",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def count_tokens(text, model=\"gpt-4o-mini\"):\n",
|
||||||
|
" \"\"\"Count tokens in text using tiktoken with fallback\"\"\"\n",
|
||||||
|
" try:\n",
|
||||||
|
" # Try model-specific encoding first\n",
|
||||||
|
" encoding = tiktoken.encoding_for_model(model)\n",
|
||||||
|
" return len(encoding.encode(text))\n",
|
||||||
|
" except KeyError:\n",
|
||||||
|
" # Fallback to cl100k_base encoding (used by most OpenAI models)\n",
|
||||||
|
" # This ensures compatibility even if model-specific encoding isn't available\n",
|
||||||
|
" encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
|
||||||
|
" return len(encoding.encode(text))\n",
|
||||||
|
" except Exception as e:\n",
|
||||||
|
" # Ultimate fallback - rough estimation\n",
|
||||||
|
" print(f\"Warning: Token counting failed ({e}), using rough estimation\")\n",
|
||||||
|
" return len(text.split()) * 1.3 # Rough word-to-token ratio\n",
|
||||||
|
"\n",
|
||||||
|
"def get_optimal_chunk_size(model=\"gpt-4o-mini\"):\n",
|
||||||
|
" \"\"\"Calculate optimal chunk size based on model's context window\"\"\"\n",
|
||||||
|
" model_limits = {\n",
|
||||||
|
" \"gpt-4o-mini\": 8192,\n",
|
||||||
|
" \"gpt-4o\": 128000,\n",
|
||||||
|
" \"gpt-4-turbo\": 128000,\n",
|
||||||
|
" \"gpt-3.5-turbo\": 4096,\n",
|
||||||
|
" \"gpt-4\": 8192,\n",
|
||||||
|
" }\n",
|
||||||
|
" \n",
|
||||||
|
" context_window = model_limits.get(model, 8192) # Default to 8K\n",
|
||||||
|
" \n",
|
||||||
|
" # Reserve tokens for:\n",
|
||||||
|
" # - System prompt: ~800 tokens\n",
|
||||||
|
" # - User prompt overhead: ~300 tokens \n",
|
||||||
|
" # - Output: ~2000 tokens\n",
|
||||||
|
" # - Safety buffer: ~500 tokens\n",
|
||||||
|
" reserved_tokens = 800 + 300 + 2000 + 500\n",
|
||||||
|
" \n",
|
||||||
|
" optimal_chunk_size = context_window - reserved_tokens\n",
|
||||||
|
" \n",
|
||||||
|
" # Ensure minimum chunk size\n",
|
||||||
|
" return max(optimal_chunk_size, 2000)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"✅ Token counting and chunk size functions defined\")\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "b6647838",
|
||||||
|
"metadata": {},
"outputs": [],
"source": [
"def chunk_transcript(transcript, max_tokens=None, overlap_tokens=None, model=\"gpt-4o-mini\"):\n",
"    \"\"\"\n",
"    Split transcript into chunks that fit within token limits\n",
"    \n",
"    Args:\n",
"        transcript: List of transcript segments from YouTube\n",
"        max_tokens: Maximum tokens per chunk (auto-calculated from the model if None)\n",
"        overlap_tokens: Number of tokens to overlap between chunks (5% of max_tokens if None)\n",
"        model: Model name for token limit calculation\n",
"    \n",
"    Returns:\n",
"        List of transcript chunks\n",
"    \"\"\"\n",
"    # Auto-calculate max_tokens based on the model if not provided\n",
"    if max_tokens is None:\n",
"        max_tokens = get_optimal_chunk_size(model)\n",
"    \n",
"    # Auto-calculate the overlap as a percentage of max_tokens\n",
"    if overlap_tokens is None:\n",
"        overlap_tokens = int(max_tokens * 0.05)  # 5% overlap\n",
"    \n",
"    # Convert the transcript to text\n",
"    transcript_text = \" \".join([segment.text for segment in transcript])\n",
"    \n",
"    # If the transcript is small enough, return it as a single chunk\n",
"    if count_tokens(transcript_text) <= max_tokens:\n",
"        return [transcript_text]\n",
"    \n",
"    # Split into sentences for better chunking\n",
"    sentences = re.split(r'[.!?]+', transcript_text)\n",
"    chunks = []\n",
"    current_chunk = \"\"\n",
"    \n",
"    for sentence in sentences:\n",
"        sentence = sentence.strip()\n",
"        if not sentence:\n",
"            continue\n",
"        \n",
"        # Check if adding this sentence would exceed the token limit\n",
"        test_chunk = current_chunk + \" \" + sentence if current_chunk else sentence\n",
"        \n",
"        if count_tokens(test_chunk) <= max_tokens:\n",
"            current_chunk = test_chunk\n",
"        else:\n",
"            # Save the current chunk and start a new one\n",
"            if current_chunk:\n",
"                chunks.append(current_chunk)\n",
"            \n",
"            # Start the new chunk with overlap from the previous chunk\n",
"            if chunks and overlap_tokens > 0:\n",
"                # Take the last few words of the previous chunk as overlap\n",
"                prev_words = current_chunk.split()[-overlap_tokens//4:]  # rough word-to-token ratio\n",
"                current_chunk = \" \".join(prev_words) + \" \" + sentence\n",
"            else:\n",
"                current_chunk = sentence\n",
"    \n",
"    # Add the last chunk\n",
"    if current_chunk:\n",
"        chunks.append(current_chunk)\n",
"    \n",
"    return chunks\n",
"\n",
"print(\"✅ Chunking function defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "7ee3f8a4",
"metadata": {},
"source": [
"## 5. Prompt Generation Functions\n",
"\n",
"Functions to generate system prompts, user prompts, and stitching prompts for the summarization process.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7f20bf5",
"metadata": {},
"outputs": [],
"source": [
"def generate_system_prompt():\n",
"    \"\"\"Generate the system prompt for video summarization\"\"\"\n",
"    return \"\"\"\n",
"    You are an expert YouTube video summarizer. Your job is to take the full transcript of a video and generate a structured, precise, and academically grounded summary.\n",
"\n",
"    Your output must include:\n",
"\n",
"    1. Title\n",
"    - Either reuse the video's title (if it is clear, accurate, and concise)\n",
"    - Or generate a new, sharper, more descriptive title that best reflects the actual content covered.\n",
"\n",
"    2. Topic & Area of Coverage\n",
"    - Provide a 1–2 line highlight of the main topic of the video and the specific area it best covers.\n",
"    - Format:\n",
"        - Domain (e.g., Finance, Health, Technology, Psychology, Fitness, Productivity, etc.)\n",
"        - Sub-area (e.g., investment strategies, portfolio design; training routine, best exercises; productivity systems, cognitive science insights, etc.)\n",
"\n",
"    3. Summary of the Video\n",
"    - A structured, clear, and concise summary of the video.\n",
"    - Focus only on relevant, high-value content.\n",
"    - Skip fluff, tangents, product promotions, personal banter, or irrelevant side discussions.\n",
"    - Include key insights, frameworks, step-by-step methods, and actionable advice.\n",
"    - Where applicable, reference scientific studies, historical sources, or authoritative references (with author + year or journal if mentioned in the video, or inferred if the reference is well known).\n",
"\n",
"    Style & Quality Rules:\n",
"    - Be extremely specific: avoid vague generalizations.\n",
"    - Use precise language and structured formatting (bullet points, numbered lists, sub-sections if needed).\n",
"    - Prioritize clarity and factual accuracy.\n",
"    - Write as though preparing an executive briefing or academic digest.\n",
"    - If the transcript includes non-relevant sections (jokes, ads, unrelated chit-chat), skip summarizing them entirely.\n",
"    \"\"\"\n",
"\n",
"def generate_user_prompt(website, transcript_chunk=None):\n",
"    \"\"\"Generate the user prompt for video summarization\"\"\"\n",
"    if transcript_chunk:\n",
"        return f\"\"\"Here is a portion of a YouTube video transcript. Use the system instructions to generate a summary of this section.\n",
"\n",
"    Video Title: {website.title}\n",
"\n",
"    Transcript Section: {transcript_chunk}\n",
"    \"\"\"\n",
"    else:\n",
"        return f\"\"\"Here is the transcript of a YouTube video. Use the system instructions to generate the output.\n",
"\n",
"    Video Title: {website.title}\n",
"\n",
"    Transcript: {\" \".join(segment.text for segment in website.transcript)}\n",
"    \"\"\"\n",
"\n",
"def generate_stitching_prompt(chunk_summaries, video_title):\n",
"    \"\"\"Generate the prompt for stitching chunk summaries together\"\"\"\n",
"    return f\"\"\"You are an expert at combining multiple summaries into a cohesive, comprehensive summary.\n",
"\n",
"    Video Title: {video_title}\n",
"\n",
"    Below are summaries of different sections of this video. Combine them into a single, well-structured summary that:\n",
"    1. Maintains the original structure and quality standards\n",
"    2. Eliminates redundancy between sections\n",
"    3. Creates smooth transitions between topics\n",
"    4. Preserves all important information\n",
"    5. Maintains the academic, professional tone\n",
"    6. Includes examples and nuances where relevant\n",
"    7. Includes citations and references where applicable\n",
"\n",
"    Section Summaries:\n",
"    {chr(10).join([f\"Section {i+1}: {summary}\" for i, summary in enumerate(chunk_summaries)])}\n",
"\n",
"    Please provide a unified, comprehensive summary following the same format as the individual sections.\n",
"    Make sure the final summary is cohesive and logical.\n",
"    \"\"\"\n",
"\n",
"print(\"✅ Prompt generation functions defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "5c9a620d",
"metadata": {},
"source": [
"## 6. Summarization Functions\n",
"\n",
"Core functions for summarizing videos with support for both single-chunk and chunked processing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc8a183b",
"metadata": {},
"outputs": [],
"source": [
"def summarize_single_chunk(website, client):\n",
"    \"\"\"Summarize a single chunk (small video)\"\"\"\n",
"    system_prompt = generate_system_prompt()\n",
"    user_prompt = generate_user_prompt(website)\n",
"    \n",
"    try:\n",
"        response = client.chat.completions.create(\n",
"            model=\"gpt-4o-mini\",\n",
"            messages=[\n",
"                {\"role\": \"system\", \"content\": system_prompt},\n",
"                {\"role\": \"user\", \"content\": user_prompt}\n",
"            ],\n",
"            max_tokens=2000,\n",
"            temperature=0.3\n",
"        )\n",
"        \n",
"        return response.choices[0].message.content\n",
"        \n",
"    except Exception as e:\n",
"        return f\"Error generating summary: {str(e)}\"\n",
"\n",
"def summarize_with_chunking(website, client, max_chunk_tokens=4000):\n",
"    \"\"\"Summarize a large video by chunking and stitching\"\"\"\n",
"    print(\"Video is large, using chunking strategy...\")\n",
"    \n",
"    # Chunk the transcript\n",
"    chunks = chunk_transcript(website.transcript, max_chunk_tokens)\n",
"    print(f\"Split into {len(chunks)} chunks\")\n",
"    \n",
"    # Summarize each chunk\n",
"    chunk_summaries = []\n",
"    system_prompt = generate_system_prompt()\n",
"    \n",
"    for i, chunk in enumerate(chunks):\n",
"        print(f\"Processing chunk {i+1}/{len(chunks)}...\")\n",
"        user_prompt = generate_user_prompt(website, chunk)\n",
"        \n",
"        try:\n",
"            response = client.chat.completions.create(\n",
"                model=\"gpt-4o-mini\",\n",
"                messages=[\n",
"                    {\"role\": \"system\", \"content\": system_prompt},\n",
"                    {\"role\": \"user\", \"content\": user_prompt}\n",
"                ],\n",
"                max_tokens=1500,  # smaller budget per chunk\n",
"                temperature=0.3\n",
"            )\n",
"            \n",
"            chunk_summaries.append(response.choices[0].message.content)\n",
"            \n",
"        except Exception as e:\n",
"            chunk_summaries.append(f\"Error in chunk {i+1}: {str(e)}\")\n",
"    \n",
"    # Stitch the summaries together\n",
"    print(\"Stitching summaries together...\")\n",
"    stitching_prompt = generate_stitching_prompt(chunk_summaries, website.title)\n",
"    \n",
"    try:\n",
"        response = client.chat.completions.create(\n",
"            model=\"gpt-4o-mini\",\n",
"            messages=[\n",
"                {\"role\": \"system\", \"content\": \"You are an expert at combining multiple summaries into a cohesive, comprehensive summary.\"},\n",
"                {\"role\": \"user\", \"content\": stitching_prompt}\n",
"            ],\n",
"            max_tokens=2000,\n",
"            temperature=0.3\n",
"        )\n",
"        \n",
"        return response.choices[0].message.content\n",
"        \n",
"    except Exception as e:\n",
"        return f\"Error stitching summaries: {str(e)}\"\n",
"\n",
"print(\"✅ Summarization functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99168160",
"metadata": {},
"outputs": [],
"source": [
"def summarize_video(website, use_chunking=True, max_chunk_tokens=4000):\n",
"    \"\"\"Summarize a YouTube video using the OpenAI API, with optional chunking for large videos\"\"\"\n",
"    client = get_openai_client()\n",
"    \n",
"    # Check whether we need chunking\n",
"    transcript_text = \" \".join([segment.text for segment in website.transcript])\n",
"    total_tokens = count_tokens(transcript_text)\n",
"    \n",
"    print(f\"Total transcript tokens: {total_tokens}\")\n",
"    \n",
"    if total_tokens <= max_chunk_tokens or not use_chunking:\n",
"        # Single summary for small videos (or when chunking is disabled)\n",
"        return summarize_single_chunk(website, client)\n",
"    else:\n",
"        # Chunked summary for large videos\n",
"        return summarize_with_chunking(website, client, max_chunk_tokens)\n",
"\n",
"print(\"✅ Main summarization function defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "54a76dab",
"metadata": {},
"source": [
"## 7. Interactive Demo\n",
"\n",
"Now let's test the YouTube video summarizer with a sample video. You can replace the URL with any YouTube video you want to summarize.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87badeff",
"metadata": {},
"outputs": [],
"source": [
"# Example usage - replace with your YouTube URL\n",
"video_url = \"https://www.youtube.com/watch?v=Xan5JnecLNA\"\n",
"\n",
"try:\n",
"    # Create the YouTube video object\n",
"    print(\"🎬 Fetching video data...\")\n",
"    video = YouTubeVideo(video_url)\n",
"    \n",
"    # Display video info\n",
"    print(f\"📺 Video Title: {video.title}\")\n",
"    print(f\"🆔 Video ID: {video.video_id}\")\n",
"    \n",
"    # Count tokens in the transcript\n",
"    transcript_text = video.get_transcript_text()\n",
"    total_tokens = count_tokens(transcript_text)\n",
"    print(f\"📊 Total transcript tokens: {total_tokens}\")\n",
"    \n",
"    # Show video info\n",
"    info = video.get_video_info()\n",
"    print(f\"📝 Transcript segments: {info['transcript_length']}\")\n",
"    \n",
"except Exception as e:\n",
"    print(f\"❌ Error: {str(e)}\")\n",
"    print(\"💡 Make sure the YouTube URL is valid and the video has captions available\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9e4cf2f",
"metadata": {},
"outputs": [],
"source": [
"# Generate summary (automatically uses chunking if needed)\n",
"if 'video' in locals():\n",
"    print(\"\\n🤖 Generating summary...\")\n",
"    print(\"⏳ This may take a few minutes for long videos...\")\n",
"    \n",
"    try:\n",
"        summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)\n",
"        \n",
"        # Display results with nice formatting\n",
"        print(\"\\n\" + \"=\"*60)\n",
"        print(\"📋 FINAL SUMMARY\")\n",
"        print(\"=\"*60)\n",
"        \n",
"        # Use IPython display if available for better formatting\n",
"        if display and Markdown:\n",
"            display(Markdown(summary))\n",
"        else:\n",
"            print(summary)\n",
"        \n",
"    except Exception as e:\n",
"        print(f\"❌ Error generating summary: {str(e)}\")\n",
"else:\n",
"    print(\"⚠️ Please run the previous cell first to load a video\")\n"
]
},
{
"cell_type": "markdown",
"id": "42ff8a15",
"metadata": {},
"source": [
"## 8. Testing and Utility Functions\n",
"\n",
"Additional functions for testing the chunking functionality and other utilities.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d798b08f",
"metadata": {},
"outputs": [],
"source": [
"def test_chunking():\n",
"    \"\"\"Test function to demonstrate chunking with a sample transcript\"\"\"\n",
"    # Sample transcript for testing; segments need a .text attribute like real transcript entries\n",
"    from types import SimpleNamespace\n",
"    sample_transcript = [\n",
"        SimpleNamespace(text=f\"This is a sample transcript segment {i}. \" * 100)  # ~1000 tokens each\n",
"        for i in range(1, 6)\n",
"    ]\n",
"    \n",
"    print(\"🧪 Testing chunking functionality...\")\n",
"    chunks = chunk_transcript(sample_transcript, max_tokens=2000, overlap_tokens=100)\n",
"    \n",
"    print(f\"📊 Original transcript: {count_tokens(' '.join([s.text for s in sample_transcript]))} tokens\")\n",
"    print(f\"📦 Number of chunks: {len(chunks)}\")\n",
"    \n",
"    for i, chunk in enumerate(chunks):\n",
"        print(f\"📄 Chunk {i+1}: {count_tokens(chunk)} tokens\")\n",
"\n",
"def analyze_video_tokens(video_url):\n",
"    \"\"\"Analyze token count and chunking strategy for a video\"\"\"\n",
"    try:\n",
"        video = YouTubeVideo(video_url)\n",
"        transcript_text = video.get_transcript_text()\n",
"        total_tokens = count_tokens(transcript_text)\n",
"        \n",
"        print(f\"📺 Video: {video.title}\")\n",
"        print(f\"📊 Total tokens: {total_tokens}\")\n",
"        print(f\"📦 Optimal chunk size: {get_optimal_chunk_size()}\")\n",
"        \n",
"        if total_tokens > 4000:\n",
"            chunks = chunk_transcript(video.transcript, max_tokens=4000)\n",
"            print(f\"🔀 Would be split into {len(chunks)} chunks\")\n",
"            print(\"✅ Chunking strategy recommended\")\n",
"        else:\n",
"            print(\"✅ Single summary strategy sufficient\")\n",
"        \n",
"    except Exception as e:\n",
"        print(f\"❌ Error analyzing video: {str(e)}\")\n",
"\n",
"print(\"✅ Testing and utility functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfd789e5",
"metadata": {},
"outputs": [],
"source": [
"# Test the chunking functionality (optional)\n",
"# Uncomment the line below to test chunking with sample data\n",
"# test_chunking()\n"
]
},
{
"cell_type": "markdown",
"id": "3528125f",
"metadata": {},
"source": [
"## 9. Usage Instructions\n",
"\n",
"### How to Use This Notebook\n",
"\n",
"1. **Set up your OpenAI API key**:\n",
"   - Create a `.env` file in the same directory as this notebook\n",
"   - Add your API key: `OPENAI_API_KEY=your_api_key_here`\n",
"   - Or set it as an environment variable\n",
"\n",
"2. **Install dependencies**:\n",
"   ```bash\n",
"   pip install -r requirements.txt\n",
"   ```\n",
"\n",
"3. **Run the cells in order**:\n",
"   - Start with the import and setup cells\n",
"   - Modify the `video_url` variable in the demo section\n",
"   - Run the demo cells to test the summarizer\n",
"\n",
"### Customization Options\n",
"\n",
"- **Change the model**: Modify the `model` parameter in the summarization functions\n",
"- **Adjust chunk size**: Change the `max_chunk_tokens` parameter\n",
"- **Modify prompts**: Edit the prompt generation functions for different output styles\n",
"- **Add error handling**: Extend the exception handling as needed\n",
"\n",
"### Features\n",
"\n",
"- ✅ **Automatic transcript extraction** from YouTube videos\n",
"- ✅ **Intelligent chunking** for videos exceeding token limits\n",
"- ✅ **Academic-quality summaries** with structured output\n",
"- ✅ **Error handling** and dependency validation\n",
"- ✅ **Interactive testing** with sample data\n",
"- ✅ **Token analysis** and optimization recommendations\n",
"\n",
"### Troubleshooting\n",
"\n",
"- **\"No transcript available\"**: The video may not have captions enabled\n",
"- **\"Invalid YouTube URL\"**: Make sure the URL follows the `https://www.youtube.com/watch?v=...` format\n",
"- **\"API key not set\"**: Check your `.env` file or environment variables\n",
"- **Import errors**: Run `pip install -r requirements.txt` to install the dependencies\n"
]
},
{
"cell_type": "markdown",
"id": "a5a44fb8",
"metadata": {},
"source": [
"## 10. Advanced Usage Examples\n",
"\n",
"Here are some advanced usage patterns you can try with this notebook.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2bef390a",
"metadata": {},
"outputs": [],
"source": [
"# Example 1: Analyze multiple videos\n",
"video_urls = [\n",
"    \"https://www.youtube.com/watch?v=Xan5JnecLNA\",\n",
"    # Add more URLs here\n",
"]\n",
"\n",
"for url in video_urls:\n",
"    print(f\"\\n{'='*50}\")\n",
"    print(f\"Analyzing: {url}\")\n",
"    print('='*50)\n",
"    analyze_video_tokens(url)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbdb5cd8",
"metadata": {},
"outputs": [],
"source": [
"# Example 2: Custom summarization with different parameters\n",
"def custom_summarize(video_url, model=\"gpt-4o-mini\", max_tokens=3000, temperature=0.1):\n",
"    \"\"\"Custom summarization with specific parameters\"\"\"\n",
"    try:\n",
"        video = YouTubeVideo(video_url)\n",
"        client = get_openai_client()\n",
"        \n",
"        # Use custom chunking parameters\n",
"        chunks = chunk_transcript(video.transcript, max_tokens=max_tokens)\n",
"        \n",
"        if len(chunks) == 1:\n",
"            # Single chunk\n",
"            system_prompt = generate_system_prompt()\n",
"            user_prompt = generate_user_prompt(video, chunks[0])\n",
"            \n",
"            response = client.chat.completions.create(\n",
"                model=model,\n",
"                messages=[\n",
"                    {\"role\": \"system\", \"content\": system_prompt},\n",
"                    {\"role\": \"user\", \"content\": user_prompt}\n",
"                ],\n",
"                max_tokens=2000,\n",
"                temperature=temperature\n",
"            )\n",
"            \n",
"            return response.choices[0].message.content\n",
"        else:\n",
"            # Multiple chunks - use the standard chunking approach\n",
"            return summarize_with_chunking(video, client, max_tokens)\n",
"        \n",
"    except Exception as e:\n",
"        return f\"Error: {str(e)}\"\n",
"\n",
"# Example usage:\n",
"# custom_summary = custom_summarize(\"https://www.youtube.com/watch?v=Xan5JnecLNA\")\n",
"# print(custom_summary)\n"
]
},
{
"cell_type": "markdown",
"id": "4028fa5e",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c100b384-2c3e-49de-92ce-f5dd0b4b58c0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,421 @@
import os
import re
import sys

# Check for required dependencies and provide helpful error messages
try:
    import requests
except ImportError:
    print("❌ Error: 'requests' module not found.")
    print("💡 Install with: pip install requests")
    print("   Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    import tiktoken
except ImportError:
    print("❌ Error: 'tiktoken' module not found.")
    print("💡 Install with: pip install tiktoken")
    print("   Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from dotenv import load_dotenv
except ImportError:
    print("❌ Error: 'python-dotenv' module not found.")
    print("💡 Install with: pip install python-dotenv")
    print("   Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from openai import OpenAI
except ImportError:
    print("❌ Error: 'openai' module not found.")
    print("💡 Install with: pip install openai")
    print("   Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from youtube_transcript_api import YouTubeTranscriptApi
except ImportError:
    print("❌ Error: 'youtube-transcript-api' module not found.")
    print("💡 Install with: pip install youtube-transcript-api")
    print("   Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("❌ Error: 'beautifulsoup4' module not found.")
    print("💡 Install with: pip install beautifulsoup4")
    print("   Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from IPython.display import Markdown, display
except ImportError:
    # IPython is optional (only needed for rich output in Jupyter notebooks)
    print("⚠️ Warning: IPython not available (optional for Jupyter notebooks)")
    Markdown = None
    display = None

# Headers and class for the YouTube page to summarize
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class YouTubeVideo:
    def __init__(self, url):
        self.url = url
        youtube_pattern = r'https://www\.youtube\.com/watch\?v=[a-zA-Z0-9_-]+'
        if re.match(youtube_pattern, url):
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.content, 'html.parser')
            self.video_id = url.split("v=")[1].split("&")[0]  # drop any extra query parameters
            self.title = soup.title.string if soup.title else "No title found"
            self.transcript = YouTubeTranscriptApi().fetch(self.video_id)
        else:
            raise ValueError("Invalid YouTube URL")
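
# Illustrative check of the URL pattern above. The `_demo_pattern` name is
# demonstration-only (an assumption, not part of the tool); note the pattern
# accepts only full "watch?v=" URLs, not youtu.be short links:
_demo_pattern = r'https://www\.youtube\.com/watch\?v=[a-zA-Z0-9_-]+'
assert re.match(_demo_pattern, "https://www.youtube.com/watch?v=Xan5JnecLNA")
assert not re.match(_demo_pattern, "https://youtu.be/Xan5JnecLNA")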

# Get the API key and OpenAI client
def get_api_key():
    load_dotenv(override=True)
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    return api_key

def get_openai_client():
    api_key = get_api_key()
    return OpenAI(api_key=api_key)

# Count tokens
def count_tokens(text, model="gpt-4o-mini"):
    """Count tokens in text using tiktoken, with fallbacks"""
    try:
        # Try the model-specific encoding first
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except KeyError:
        # Fall back to the cl100k_base encoding (used by most OpenAI models);
        # this keeps things working even if no model-specific encoding is available
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception as e:
        # Ultimate fallback: rough estimation
        print(f"Warning: Token counting failed ({e}), using rough estimation")
        return int(len(text.split()) * 1.3)  # rough word-to-token ratio, as an integer


def get_optimal_chunk_size(model="gpt-4o-mini"):
    """Calculate the optimal chunk size based on the model's context window"""
    model_limits = {
        "gpt-4o-mini": 8192,
        "gpt-4o": 128000,
        "gpt-4-turbo": 128000,
        "gpt-3.5-turbo": 4096,
        "gpt-4": 8192,
    }

    context_window = model_limits.get(model, 8192)  # default to 8K

    # Reserve tokens for:
    # - System prompt: ~800 tokens
    # - User prompt overhead: ~300 tokens
    # - Output: ~2000 tokens
    # - Safety buffer: ~500 tokens
    reserved_tokens = 800 + 300 + 2000 + 500

    optimal_chunk_size = context_window - reserved_tokens

    # Ensure a minimum chunk size
    return max(optimal_chunk_size, 2000)
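
# Worked example of the budget above (demonstration-only arithmetic, not used
# by the tool): with the default 8K context window, the per-chunk budget is
# 8192 - (800 + 300 + 2000 + 500) = 4592 tokens.
assert 8192 - (800 + 300 + 2000 + 500) == 4592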

# Chunk the transcript
def chunk_transcript(transcript, max_tokens=None, overlap_tokens=None, model="gpt-4o-mini"):
    """
    Split transcript into chunks that fit within token limits

    Args:
        transcript: List of transcript segments from YouTube
        max_tokens: Maximum tokens per chunk (auto-calculated from the model if None)
        overlap_tokens: Number of tokens to overlap between chunks (5% of max_tokens if None)
        model: Model name for token limit calculation

    Returns:
        List of transcript chunks
    """
    # Auto-calculate max_tokens based on the model if not provided
    if max_tokens is None:
        max_tokens = get_optimal_chunk_size(model)

    # Auto-calculate the overlap as a percentage of max_tokens
    if overlap_tokens is None:
        overlap_tokens = int(max_tokens * 0.05)  # 5% overlap

    # Convert the transcript to text
    transcript_text = " ".join([segment.text for segment in transcript])

    # If the transcript is small enough, return it as a single chunk
    if count_tokens(transcript_text) <= max_tokens:
        return [transcript_text]

    # Split into sentences for better chunking
    sentences = re.split(r'[.!?]+', transcript_text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        # Check if adding this sentence would exceed the token limit
        test_chunk = current_chunk + " " + sentence if current_chunk else sentence

        if count_tokens(test_chunk) <= max_tokens:
            current_chunk = test_chunk
        else:
            # Save the current chunk and start a new one
            if current_chunk:
                chunks.append(current_chunk)

            # Start the new chunk with overlap from the previous chunk
            if chunks and overlap_tokens > 0:
                # Take the last few words of the previous chunk as overlap
                prev_words = current_chunk.split()[-overlap_tokens//4:]  # rough word-to-token ratio
                current_chunk = " ".join(prev_words) + " " + sentence
            else:
                current_chunk = sentence

    # Add the last chunk
    if current_chunk:
        chunks.append(current_chunk)

    return chunks
|
||||||
|
|
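The chunker relies on a `count_tokens` helper defined earlier in this file. A minimal sketch of such a helper, assuming `tiktoken` is installed (with a rough characters-per-token fallback when it is not — the fallback heuristic is an assumption, not part of the original code):

```python
def count_tokens(text, model="gpt-4o-mini"):
    """Count tokens in text; fall back to a rough estimate if tiktoken is unavailable."""
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception:
        # Rough heuristic: ~4 characters per token for English text
        return max(1, len(text) // 4)

n = count_tokens("Split transcript into chunks that fit within token limits.")
print(n)
```

The exact count depends on the tokenizer, which is why the chunk-size calculations above leave generous headroom.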

# Generate system prompt
def generate_system_prompt():
    return """
You are an expert YouTube video summarizer. Your job is to take the full transcript of a video and generate a structured, precise, and academically grounded summary.

Your output must include:

1. Title
- Either reuse the video’s title (if it is clear, accurate, and concise)
- Or generate a new, sharper, more descriptive title that best reflects the actual content covered.

2. Topic & Area of Coverage
- Provide a 1–2 line highlight of the main topic of the video and the specific area it best covers.
- Format:
  - Domain (e.g., Finance, Health, Technology, Psychology, Fitness, Productivity, etc.)
  - Sub-area (e.g., investment strategies, portfolio design; training routine, best exercises; productivity systems, cognitive science insights, etc.)

3. Summary of the Video
- A structured, clear, and concise summary of the video.
- Focus only on relevant, high-value content.
- Skip fluff, tangents, product promotions, personal banter, or irrelevant side discussions.
- Include key insights, frameworks, step-by-step methods, and actionable advice.
- Where applicable, reference scientific studies, historical sources, or authoritative references (with author + year or journal if mentioned in the video, or inferred if the reference is well known).

Style & Quality Rules:
- Be extremely specific: avoid vague generalizations.
- Use precise language and structured formatting (bullet points, numbered lists, sub-sections if needed).
- Prioritize clarity and factual accuracy.
- Write as though preparing an executive briefing or academic digest.
- If the transcript includes non-relevant sections (jokes, ads, unrelated chit-chat), skip summarizing them entirely.
"""

# Generate user prompt
def generate_user_prompt(website, transcript_chunk=None):
    if transcript_chunk:
        return f"""Here is a portion of a YouTube video transcript. Use the system instructions to generate a summary of this section.

Video Title: {website.title}

Transcript Section: {transcript_chunk}
"""
    else:
        return f"""Here is the transcript of a YouTube video. Use the system instructions to generate the output.

Video Title: {website.title}

Transcript: {" ".join(segment.text for segment in website.transcript)}
"""

# Generate stitching prompt
def generate_stitching_prompt(chunk_summaries, video_title):
    """Generate a prompt for stitching together chunk summaries"""
    return f"""You are an expert at combining multiple summaries into a cohesive, comprehensive summary.

Video Title: {video_title}

Below are summaries of different sections of this video. Combine them into a single, well-structured summary that:
1. Maintains the original structure and quality standards
2. Eliminates redundancy between sections
3. Creates smooth transitions between topics
4. Preserves all important information
5. Maintains the academic, professional tone
6. Includes examples and nuances where relevant
7. Includes citations and references where applicable

Section Summaries:
{chr(10).join([f"Section {i+1}: {summary}" for i, summary in enumerate(chunk_summaries)])}

Please provide a unified, comprehensive summary following the same format as the individual sections.
Make sure the final summary is cohesive and logical.
"""

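The `chr(10).join(...)` expression in the stitching prompt puts each numbered section summary on its own line (`chr(10)` is `"\n"`; f-string expressions could not contain backslashes before Python 3.12). A quick standalone illustration with made-up summaries:

```python
chunk_summaries = ["Intro covers tokenization.", "Middle covers chunking.", "End covers stitching."]
numbered = chr(10).join([f"Section {i+1}: {summary}" for i, summary in enumerate(chunk_summaries)])
print(numbered)
# → Section 1: Intro covers tokenization.
#   Section 2: Middle covers chunking.
#   Section 3: End covers stitching.
```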

# Summarize video
def summarize_video(website, use_chunking=True, max_chunk_tokens=4000):
    """Summarize a YouTube video using the OpenAI API, with optional chunking for large videos"""
    client = get_openai_client()

    # Check whether we need chunking
    transcript_text = " ".join([segment.text for segment in website.transcript])
    total_tokens = count_tokens(transcript_text)

    print(f"Total transcript tokens: {total_tokens}")

    if total_tokens <= max_chunk_tokens or not use_chunking:
        # Single summary for small videos (or when chunking is disabled)
        return summarize_single_chunk(website, client)
    else:
        # Chunked summary for large videos
        return summarize_with_chunking(website, client, max_chunk_tokens)

# Summarize single chunk
def summarize_single_chunk(website, client):
    """Summarize a single chunk (small video)"""
    system_prompt = generate_system_prompt()
    user_prompt = generate_user_prompt(website)

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=2000,
            temperature=0.3
        )

        return response.choices[0].message.content

    except Exception as e:
        return f"Error generating summary: {str(e)}"

# Summarize with chunking
def summarize_with_chunking(website, client, max_chunk_tokens=4000):
    """Summarize a large video by chunking and stitching"""
    print("Video is large, using chunking strategy...")

    # Chunk the transcript
    chunks = chunk_transcript(website.transcript, max_chunk_tokens)
    print(f"Split into {len(chunks)} chunks")

    # Summarize each chunk
    chunk_summaries = []
    system_prompt = generate_system_prompt()

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")
        user_prompt = generate_user_prompt(website, chunk)

        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                max_tokens=1500,  # Smaller budget for per-chunk summaries
                temperature=0.3
            )

            chunk_summaries.append(response.choices[0].message.content)

        except Exception as e:
            chunk_summaries.append(f"Error in chunk {i+1}: {str(e)}")

    # Stitch the summaries together
    print("Stitching summaries together...")
    stitching_prompt = generate_stitching_prompt(chunk_summaries, website.title)

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert at combining multiple summaries into a cohesive, comprehensive summary."},
                {"role": "user", "content": stitching_prompt}
            ],
            max_tokens=2000,
            temperature=0.3
        )

        return response.choices[0].message.content

    except Exception as e:
        return f"Error stitching summaries: {str(e)}"

# Main function
def main():
    """Main function to demonstrate usage"""
    # Example usage - replace with an actual YouTube URL
    video_url = "https://www.youtube.com/watch?v=Xan5JnecLNA"

    try:
        # Create YouTube video object
        print("Fetching video data...")
        video = YouTubeVideo(video_url)

        # Display video info
        print(f"Video Title: {video.title}")
        print(f"Video ID: {video.video_id}")

        # Count tokens in the transcript
        transcript_text = " ".join([segment.text for segment in video.transcript])
        total_tokens = count_tokens(transcript_text)
        print(f"Total transcript tokens: {total_tokens}")

        # Generate summary (automatically uses chunking if needed)
        print("\nGenerating summary...")
        summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)

        # Display results
        print("\n" + "="*50)
        print("FINAL SUMMARY")
        print("="*50)
        print(summary)

    except Exception as e:
        print(f"Error: {str(e)}")


def test_chunking():
    """Test function to demonstrate chunking with a sample transcript"""
    # Segments need a .text attribute (like real transcript segments),
    # since chunk_transcript reads segment.text
    from types import SimpleNamespace

    # Sample transcript for testing (each segment is roughly 1000 tokens)
    sample_transcript = [
        SimpleNamespace(text="This is a sample transcript segment 1. " * 100),
        SimpleNamespace(text="This is a sample transcript segment 2. " * 100),
        SimpleNamespace(text="This is a sample transcript segment 3. " * 100),
        SimpleNamespace(text="This is a sample transcript segment 4. " * 100),
        SimpleNamespace(text="This is a sample transcript segment 5. " * 100),
    ]

    print("Testing chunking functionality...")
    chunks = chunk_transcript(sample_transcript, max_tokens=2000, overlap_tokens=100)

    print(f"Original transcript: {count_tokens(' '.join([s.text for s in sample_transcript]))} tokens")
    print(f"Number of chunks: {len(chunks)}")

    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}: {count_tokens(chunk)} tokens")


if __name__ == "__main__":
    # Uncomment the line below to test chunking
    # test_chunking()

    # Run the main function
    main()