diff --git a/week3/community-contributions/hopeogbons/README.md b/week3/community-contributions/hopeogbons/README.md
new file mode 100644
index 0000000..741ff59
--- /dev/null
+++ b/week3/community-contributions/hopeogbons/README.md
@@ -0,0 +1,197 @@
+# 🎙️ Audio Transcription Assistant
+
+An AI-powered audio transcription tool that converts speech to text in multiple languages using OpenAI's Whisper model.
+
+## Why I Built This
+
+In today's content-driven world, audio and video are everywhere: podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?
+
+Manual transcription is time-consuming and expensive. I wanted to build something that could:
+
+- Accept audio files in any format (MP3, WAV, etc.)
+- Transcribe them accurately using AI
+- Support multiple languages
+- Work locally on my Mac **and** on cloud GPUs (Google Colab)
+
+That's where **Whisper**, OpenAI's powerful speech recognition model, comes in.
+
+## Features
+
+- 📤 **Upload any audio file** (MP3, WAV, M4A, FLAC, etc.)
+- 🌍 **12+ languages supported** with auto-detection
+- 🤖 **Accurate AI-powered transcription** using Whisper
+- ⚡ **Cross-platform** - works on CPU (Mac) or GPU (Colab)
+- 🎨 **Clean web interface** built with Gradio
+- 🚀 **Fast processing** with optimized model settings
+
+## Tech Stack
+
+- **OpenAI Whisper** - Speech recognition model
+- **Gradio** - Web interface framework
+- **PyTorch** - Deep learning backend
+- **NumPy** - Numerical computing
+- **ffmpeg** - Audio file processing
+
+## Installation
+
+### Prerequisites
+
+- Python 3.12+
+- ffmpeg (for audio processing)
+- uv package manager (or pip)
+
+### Setup
+
+1. Clone this repository or download the notebook
+
+2. Install dependencies:
+
+```bash
+# Install compatible NumPy version
+uv pip install --reinstall "numpy==1.26.4"
+
+# Install PyTorch
+uv pip install torch torchvision torchaudio
+
+# Install Gradio and Whisper
+uv pip install gradio openai-whisper ffmpeg-python
+
+# (Optional) Install Ollama for LLM features
+uv pip install ollama
+```
+
+3. **For Mac users**, ensure ffmpeg is installed:
+
+```bash
+brew install ffmpeg
+```
+
+## Usage
+
+### Running Locally
+
+1. Open the Jupyter notebook `week3 EXERCISE_hopeogbons.ipynb`
+
+2. Run all cells in order:
+
+   - Cell 1: Install dependencies
+   - Cell 2: Import libraries
+   - Cell 3: Load Whisper model
+   - Cell 4: Define transcription function
+   - Cell 5: Build Gradio interface
+   - Cell 6: Launch the app
+
+3. The app will automatically open in your browser
+
+4. Upload an audio file, select the language, and click Submit!
+
+### Running on Google Colab
+
+For GPU acceleration:
+
+1. Open the notebook in Google Colab
+2. Runtime → Change runtime type → **GPU (T4)**
+3. Run all cells in order
+4. The model will automatically use GPU acceleration
+
+**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.
+
+## Supported Languages
+
+- 🇬🇧 English
+- 🇪🇸 Spanish
+- 🇫🇷 French
+- 🇩🇪 German
+- 🇮🇹 Italian
+- 🇵🇹 Portuguese
+- 🇨🇳 Chinese
+- 🇯🇵 Japanese
+- 🇰🇷 Korean
+- 🇷🇺 Russian
+- 🇸🇦 Arabic
+- 🌐 Auto-detect
+
+## How It Works
+
+1. **Upload** - User uploads an audio file through the Gradio interface
+2. **Process** - ffmpeg decodes the audio file
+3. **Transcribe** - Whisper model processes the audio and generates text
+4. **Display** - Transcription is shown in the output box
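+
+Under the hood, step 3 comes down to a couple of calls to the `openai-whisper` API. Here is a minimal sketch that condenses what the notebook's cells do (the file name is a placeholder; the notebook's `transcribe_audio` function wraps the same call with file validation, a language dropdown, and error handling):
+
+```python
+import torch
+import whisper
+
+# Use the Colab GPU when available, otherwise fall back to the CPU (Mac)
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# The "base" model (~140MB) is downloaded automatically on first use
+model = whisper.load_model("base", device=device)
+
+# language=None lets Whisper auto-detect; pass a code such as "fr" to force French
+result = model.transcribe("my_recording.mp3", language=None, task="transcribe")
+print(result["text"])
+```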
+
+The Whisper "base" model strikes a balance between speed and accuracy:
+
+- Fast enough for real-time use on CPU
+- Accurate enough for most transcription needs
+- Small enough (~140MB) for quick downloads
+
+## Example Transcriptions
+
+The app successfully transcribed:
+
+- English podcast episodes
+- French language audio (detected and transcribed)
+- Multi-speaker conversations
+- Audio with background noise
+
+## What I Learned
+
+Building this transcription assistant taught me:
+
+- **Audio processing** with ffmpeg and Whisper
+- **Cross-platform compatibility** (Mac CPU vs Colab GPU)
+- **Dependency management** (dealing with NumPy version conflicts!)
+- **Async handling** in Jupyter notebooks with Gradio
+- **Model optimization** (choosing the right Whisper model size)
+
+The biggest challenge? Getting ffmpeg and NumPy to play nicely together across different environments. But solving those issues made me understand the stack much better.
+
+## Troubleshooting
+
+### Common Issues
+
+**1. "No module named 'whisper'" error**
+
+- Make sure you've installed `openai-whisper`, not just `whisper`
+- Restart your kernel after installation
+
+**2. "ffmpeg not found" error**
+
+- Install ffmpeg: `brew install ffmpeg` (Mac) or `apt-get install ffmpeg` (Linux)
+
+**3. NumPy version conflicts**
+
+- Use NumPy 1.26.4: `uv pip install --reinstall "numpy==1.26.4"`
+- Restart kernel after reinstalling
+
+**4. Gradio event loop errors**
+
+- Use `prevent_thread_lock=True` in `app.launch()`
+- Restart kernel if errors persist
+
+## Future Enhancements
+
+- [ ] Support for real-time audio streaming
+- [ ] Speaker diarization (identifying different speakers)
+- [ ] Export transcripts to multiple formats (SRT, VTT, TXT)
+- [ ] Integration with LLMs for summarization
+- [ ] Batch processing for multiple files
+
+## Contributing
+
+Feel free to fork this project and submit pull requests with improvements!
+
+## License
+
+This project is open source and available under the MIT License.
+
+## Acknowledgments
+
+- **OpenAI** for the amazing Whisper model
+- **Gradio** team for the intuitive interface framework
+- **Andela LLM Engineering Program** for the learning opportunity
+
+---
+
+**Built with ❤️ as part of the Andela LLM Engineering Program**
+
+For questions or feedback, feel free to reach out!
diff --git a/week3/community-contributions/hopeogbons/french_language_i_do_not_understand.mp3 b/week3/community-contributions/hopeogbons/french_language_i_do_not_understand.mp3
new file mode 100644
index 0000000..3eb5bf8
Binary files /dev/null and b/week3/community-contributions/hopeogbons/french_language_i_do_not_understand.mp3 differ
diff --git a/week3/community-contributions/hopeogbons/week3 EXERCISE_hopeogbons.ipynb b/week3/community-contributions/hopeogbons/week3 EXERCISE_hopeogbons.ipynb
new file mode 100644
index 0000000..d843850
--- /dev/null
+++ b/week3/community-contributions/hopeogbons/week3 EXERCISE_hopeogbons.ipynb
@@ -0,0 +1,397 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "270ed08b",
+   "metadata": {},
+   "source": [
+    "# 🎙️ Audio Transcription Assistant\n",
+    "\n",
+    "## Why I Built This\n",
+    "\n",
+    "In today's content-driven world, audio and video are everywhere: podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?\n",
+    "\n",
+    "Manual transcription is time-consuming and expensive. 
I wanted to build something that could:\n", + "- Accept audio files in any format (MP3, WAV, etc.)\n", + "- Transcribe them accurately using AI\n", + "- Support multiple languages\n", + "- Work locally on my Mac **and** on cloud GPUs (Google Colab)\n", + "\n", + "That's where **Whisper** comes inโ€”OpenAI's powerful speech recognition model.\n", + "\n", + "---\n", + "\n", + "## What This Does\n", + "\n", + "This app lets you:\n", + "- ๐Ÿ“ค Upload any audio file\n", + "- ๐ŸŒ Choose from 12+ languages (or auto-detect)\n", + "- ๐Ÿค– Get accurate AI-powered transcription\n", + "- โšก Process on CPU (Mac) or GPU (Colab)\n", + "\n", + "**Tech:** OpenAI Whisper โ€ข Gradio UI โ€ข PyTorch โ€ข Cross-platform (Mac/Colab)\n", + "\n", + "---\n", + "\n", + "**Note:** This is a demonstration. For production use, consider privacy and data handling policies.\n" + ] + }, + { + "cell_type": "markdown", + "id": "c37e5165", + "metadata": {}, + "source": [ + "## Step 1: Install Dependencies\n", + "\n", + "Installing everything needed:\n", + "- **NumPy 1.26.4** - Compatible version for Whisper\n", + "- **PyTorch** - Deep learning framework\n", + "- **Whisper** - OpenAI's speech recognition model\n", + "- **Gradio** - Web interface\n", + "- **ffmpeg** - Audio file processing\n", + "- **Ollama** - For local LLM support (optional)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "8c66b0ca", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/usr/local/bin/ffmpeg\n" + ] + } + ], + "source": [ + "# Package installation\n", + "\n", + "!uv pip install -q --reinstall \"numpy==1.26.4\"\n", + "!uv pip install -q torch torchvision torchaudio\n", + "!uv pip install -q gradio openai-whisper ffmpeg-python\n", + "!uv pip install -q ollama\n", + "\n", + "# Ensure ffmpeg is available (Mac)\n", + "!which ffmpeg || brew install ffmpeg" + ] + }, + { + "cell_type": "markdown", + "id": "f31d64ee", + "metadata": {}, + "source": [ + "## Step 2: Import Libraries\n", + "\n", + "The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "4782261a", + "metadata": {}, + "outputs": [], + "source": [ + "# Imports\n", + "\n", + "import os\n", + "import numpy as np\n", + "import gradio as gr\n", + "import whisper\n", + "import torch\n", + "import ollama" + ] + }, + { + "cell_type": "markdown", + "id": "93a41b23", + "metadata": {}, + "source": [ + "## Step 3: Load Whisper Model\n", + "\n", + "Loading the **base** modelโ€”a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). 
The model is ~140MB and will download automatically on first run.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "130ed059", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading Whisper model...\n", + "Using device: cpu\n", + "โœ… Model loaded successfully!\n", + "Model type: \n", + "Has transcribe method: True\n" + ] + } + ], + "source": [ + "# Model initialization\n", + "\n", + "print(\"Loading Whisper model...\")\n", + "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "print(f\"Using device: {device}\")\n", + "\n", + "whisper_model = whisper.load_model(\"base\", device=device)\n", + "print(\"โœ… Model loaded successfully!\")\n", + "print(f\"Model type: {type(whisper_model)}\")\n", + "print(f\"Has transcribe method: {hasattr(whisper_model, 'transcribe')}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "d84f6cfe", + "metadata": {}, + "source": [ + "## Step 4: Transcription Function\n", + "\n", + "This is the core logic:\n", + "- Accepts an audio file and target language\n", + "- Maps language names to Whisper's language codes\n", + "- Transcribes the audio using the loaded model\n", + "- Returns the transcribed text\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4f2c4b2c", + "metadata": {}, + "outputs": [], + "source": [ + "# Transcription function\n", + "\n", + "def transcribe_audio(audio_file, target_language):\n", + " \"\"\"Transcribe audio file to text in the specified language.\"\"\"\n", + " if audio_file is None:\n", + " return \"Please upload an audio file.\"\n", + " \n", + " try:\n", + " # Language codes for Whisper\n", + " language_map = {\n", + " \"English\": \"en\",\n", + " \"Spanish\": \"es\",\n", + " \"French\": \"fr\",\n", + " \"German\": \"de\",\n", + " \"Italian\": \"it\",\n", + " \"Portuguese\": \"pt\",\n", + " \"Chinese\": \"zh\",\n", + " \"Japanese\": \"ja\",\n", + " \"Korean\": \"ko\",\n", + " \"Russian\": \"ru\",\n", + " \"Arabic\": \"ar\",\n", + " \"Auto-detect\": None\n", + " }\n", + " \n", + " lang_code = language_map.get(target_language)\n", + " \n", + " # Get file path from Gradio File component (returns path string directly)\n", + " audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file\n", + " \n", + " if not audio_path or not os.path.exists(audio_path):\n", + " return \"Invalid audio file or file not found\"\n", + "\n", + " # Transcribe using whisper_model.transcribe()\n", + " result = whisper_model.transcribe(\n", + " audio_path,\n", + " language=lang_code,\n", + " task=\"transcribe\",\n", + " verbose=False # Hide confusing progress bar\n", + " )\n", + " \n", + " return result[\"text\"]\n", + " \n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\"\n" + ] + }, + { + "cell_type": "markdown", + "id": "dd928784", + "metadata": {}, + "source": [ + "## Step 5: Build the Interface\n", + "\n", + "Creating a simple, clean Gradio interface with:\n", + "- **File uploader** for audio files\n", + "- **Language dropdown** with 12+ options\n", + "- **Transcription output** box\n", + "- Auto-launches in browser for convenience\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "5ce2c944", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "โœ… App ready! 
Run the next cell to launch.\n" + ] + } + ], + "source": [ + "# Gradio interface\n", + "\n", + "app = gr.Interface(\n", + " fn=transcribe_audio,\n", + " inputs=[\n", + " gr.File(label=\"Upload Audio File\", file_types=[\"audio\"]),\n", + " gr.Dropdown(\n", + " choices=[\n", + " \"English\", \"Spanish\", \"French\", \"German\", \"Italian\",\n", + " \"Portuguese\", \"Chinese\", \"Japanese\", \"Korean\",\n", + " \"Russian\", \"Arabic\", \"Auto-detect\"\n", + " ],\n", + " value=\"English\",\n", + " label=\"Language\"\n", + " )\n", + " ],\n", + " outputs=gr.Textbox(label=\"Transcription\", lines=15),\n", + " title=\"๐ŸŽ™๏ธ Audio Transcription\",\n", + " description=\"Upload an audio file to transcribe it.\",\n", + " flagging_mode=\"never\"\n", + ")\n", + "\n", + "print(\"โœ… App ready! Run the next cell to launch.\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "049ac197", + "metadata": {}, + "source": [ + "## Step 6: Launch the App\n", + "\n", + "Starting the Gradio server with Jupyter compatibility (`prevent_thread_lock=True`). The app will open automatically in your browser.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa6c8d9a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* Running on local URL: http://127.0.0.1:7860\n", + "* To create a public link, set `share=True` in `launch()`.\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", + " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", + "100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10416/10416 [00:06<00:00, 1723.31frames/s]\n", + "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", + " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", + "100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10416/10416 [00:30<00:00, 341.64frames/s]\n", + "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", + " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", + "100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2289/2289 [00:01<00:00, 1205.18frames/s]\n" + ] + } + ], + "source": [ + "# Launch\n", + "\n", + "# Close any previous instances\n", + "try:\n", + " app.close()\n", + "except:\n", + " pass\n", + "\n", + "# Start the app\n", + "app.launch(inbrowser=True, prevent_thread_lock=True)\n" + ] + }, + { + "cell_type": "markdown", + "id": "c3c2ec24", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## ๐Ÿ’ก How to Use\n", + "\n", + "1. **Upload** an audio file (MP3, WAV, M4A, etc.)\n", + "2. **Select** your language (or use Auto-detect)\n", + "3. **Click** Submit\n", + "4. **Get** your transcription!\n", + "\n", + "---\n", + "\n", + "## ๐Ÿš€ Running on Google Colab\n", + "\n", + "For GPU acceleration on Colab:\n", + "1. Runtime โ†’ Change runtime type โ†’ **GPU (T4)**\n", + "2. Run all cells in order\n", + "3. The model will use GPU automatically\n", + "\n", + "**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.\n", + "\n", + "---\n", + "\n", + "## ๐Ÿ“ Supported Languages\n", + "\n", + "English โ€ข Spanish โ€ข French โ€ข German โ€ข Italian โ€ข Portuguese โ€ข Chinese โ€ข Japanese โ€ข Korean โ€ข Russian โ€ข Arabic โ€ข Auto-detect\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}