(Oct 2025 Bootcamp): Add audio transcription assistant with Gradio interface

- Introduced a new audio transcription tool utilizing OpenAI's Whisper model.
- Added README.md detailing features, installation, and usage instructions.
- Created a Jupyter notebook for local and Google Colab execution.
- Included an MP3 file for demonstration purposes.
Hope Ogbons
2025-10-26 04:49:01 +01:00
parent 48076f9d39
commit ae81fa4c8d
3 changed files with 594 additions and 0 deletions


@@ -0,0 +1,197 @@
# 🎙️ Audio Transcription Assistant
An AI-powered audio transcription tool that converts speech to text in multiple languages using OpenAI's Whisper model.
## Why I Built This
In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?
Manual transcription is time-consuming and expensive. I wanted to build something that could:
- Accept audio files in any format (MP3, WAV, etc.)
- Transcribe them accurately using AI
- Support multiple languages
- Work locally on my Mac **and** on cloud GPUs (Google Colab)
That's where **Whisper** comes in—OpenAI's powerful speech recognition model.
## Features
- 📤 **Upload any audio file** (MP3, WAV, M4A, FLAC, etc.)
- 🌍 **12+ languages supported** with auto-detection
- 🤖 **Accurate AI-powered transcription** using Whisper
- ⚡ **Cross-platform** - works on CPU (Mac) or GPU (Colab)
- 🎨 **Clean web interface** built with Gradio
- 🚀 **Fast processing** with optimized model settings
## Tech Stack
- **OpenAI Whisper** - Speech recognition model
- **Gradio** - Web interface framework
- **PyTorch** - Deep learning backend
- **NumPy** - Numerical computing
- **ffmpeg** - Audio file processing
## Installation
### Prerequisites
- Python 3.12+
- ffmpeg (for audio processing)
- uv package manager (or pip)
### Setup
1. Clone this repository or download the notebook
2. Install dependencies:
```bash
# Install compatible NumPy version
uv pip install --reinstall "numpy==1.26.4"
# Install PyTorch
uv pip install torch torchvision torchaudio
# Install Gradio and Whisper
uv pip install gradio openai-whisper ffmpeg-python
# (Optional) Install Ollama for LLM features
uv pip install ollama
```
3. **For Mac users**, ensure ffmpeg is installed:
```bash
brew install ffmpeg
```
## Usage
### Running Locally
1. Open the Jupyter notebook `week3 EXERCISE_hopeogbons.ipynb`
2. Run all cells in order:
- Cell 1: Install dependencies
- Cell 2: Import libraries
- Cell 3: Load Whisper model
- Cell 4: Define transcription function
- Cell 5: Build Gradio interface
- Cell 6: Launch the app
3. The app will automatically open in your browser
4. Upload an audio file, select the language, and click Submit!
### Running on Google Colab
For GPU acceleration:
1. Open the notebook in Google Colab
2. Runtime → Change runtime type → **GPU (T4)**
3. Run all cells in order
4. The model will automatically use GPU acceleration
**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.
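Under the hood, the notebook picks the device automatically (this mirrors Step 3 of the notebook):
```python
import torch
import whisper

# Use the Colab GPU when available, otherwise fall back to CPU (e.g. on a Mac)
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = whisper.load_model("base", device=device)
```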
## Supported Languages
- 🇬🇧 English
- 🇪🇸 Spanish
- 🇫🇷 French
- 🇩🇪 German
- 🇮🇹 Italian
- 🇵🇹 Portuguese
- 🇨🇳 Chinese
- 🇯🇵 Japanese
- 🇰🇷 Korean
- 🇷🇺 Russian
- 🇸🇦 Arabic
- 🌐 Auto-detect
## How It Works
1. **Upload** - User uploads an audio file through the Gradio interface
2. **Process** - ffmpeg decodes the audio file
3. **Transcribe** - Whisper model processes the audio and generates text
4. **Display** - Transcription is shown in the output box
The Whisper "base" model is used for a balance between speed and accuracy:
- Fast enough for real-time use on CPU
- Accurate enough for most transcription needs
- Small enough (~140MB) for quick downloads
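For reference, here's a minimal sketch of the same pipeline as plain Python (the file name `sample.mp3` is just a placeholder; `openai-whisper` and ffmpeg must be installed):
```python
import whisper

# Load the "base" model (~140MB, downloaded on first run)
model = whisper.load_model("base")

# ffmpeg decodes the file; pass language=None to auto-detect
result = model.transcribe("sample.mp3", language="en", task="transcribe")
print(result["text"])
```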
## Example Transcriptions
The app successfully transcribed:
- English podcast episodes
- French language audio (detected and transcribed)
- Multi-speaker conversations
- Audio with background noise
## What I Learned
Building this transcription assistant taught me:
- **Audio processing** with ffmpeg and Whisper
- **Cross-platform compatibility** (Mac CPU vs Colab GPU)
- **Dependency management** (dealing with NumPy version conflicts!)
- **Async handling** in Jupyter notebooks with Gradio
- **Model optimization** (choosing the right Whisper model size)
The biggest challenge? Getting ffmpeg and NumPy to play nice together across different environments. But solving those issues made me understand the stack much better.
## Troubleshooting
### Common Issues
**1. "No module named 'whisper'" error**
- Make sure you've installed `openai-whisper`, not just `whisper`
- Restart your kernel after installation
**2. "ffmpeg not found" error**
- Install ffmpeg: `brew install ffmpeg` (Mac) or `apt-get install ffmpeg` (Linux)
**3. NumPy version conflicts**
- Use NumPy 1.26.4: `uv pip install --reinstall "numpy==1.26.4"`
- Restart kernel after reinstalling
**4. Gradio event loop errors**
- Use `prevent_thread_lock=True` in `app.launch()`
- Restart kernel if errors persist
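For example, the launch call used in the notebook looks like this (a sketch; `app` is the `gr.Interface` built earlier):
```python
# prevent_thread_lock=True keeps the Gradio server from blocking the Jupyter kernel
app.launch(inbrowser=True, prevent_thread_lock=True)
```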
## Future Enhancements
- [ ] Support for real-time audio streaming
- [ ] Speaker diarization (identifying different speakers)
- [ ] Export transcripts to multiple formats (SRT, VTT, TXT)
- [ ] Integration with LLMs for summarization
- [ ] Batch processing for multiple files
## Contributing
Feel free to fork this project and submit pull requests with improvements!
## License
This project is open source and available under the MIT License.
## Acknowledgments
- **OpenAI** for the amazing Whisper model
- **Gradio** team for the intuitive interface framework
- **Andela LLM Engineering Program** for the learning opportunity
---
**Built with ❤️ as part of the Andela LLM Engineering Program**
For questions or feedback, feel free to reach out!


@@ -0,0 +1,397 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "270ed08b",
"metadata": {},
"source": [
"# 🎙️ Audio Transcription Assistant\n",
"\n",
"## Why I Built This\n",
"\n",
"In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?\n",
"\n",
"Manual transcription is time-consuming and expensive. I wanted to build something that could:\n",
"- Accept audio files in any format (MP3, WAV, etc.)\n",
"- Transcribe them accurately using AI\n",
"- Support multiple languages\n",
"- Work locally on my Mac **and** on cloud GPUs (Google Colab)\n",
"\n",
"That's where **Whisper** comes in—OpenAI's powerful speech recognition model.\n",
"\n",
"---\n",
"\n",
"## What This Does\n",
"\n",
"This app lets you:\n",
"- 📤 Upload any audio file\n",
"- 🌍 Choose from 12+ languages (or auto-detect)\n",
"- 🤖 Get accurate AI-powered transcription\n",
"- ⚡ Process on CPU (Mac) or GPU (Colab)\n",
"\n",
"**Tech:** OpenAI Whisper • Gradio UI • PyTorch • Cross-platform (Mac/Colab)\n",
"\n",
"---\n",
"\n",
"**Note:** This is a demonstration. For production use, consider privacy and data handling policies.\n"
]
},
{
"cell_type": "markdown",
"id": "c37e5165",
"metadata": {},
"source": [
"## Step 1: Install Dependencies\n",
"\n",
"Installing everything needed:\n",
"- **NumPy 1.26.4** - Compatible version for Whisper\n",
"- **PyTorch** - Deep learning framework\n",
"- **Whisper** - OpenAI's speech recognition model\n",
"- **Gradio** - Web interface\n",
"- **ffmpeg** - Audio file processing\n",
"- **Ollama** - For local LLM support (optional)\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c66b0ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/usr/local/bin/ffmpeg\n"
]
}
],
"source": [
"# Package installation\n",
"\n",
"!uv pip install -q --reinstall \"numpy==1.26.4\"\n",
"!uv pip install -q torch torchvision torchaudio\n",
"!uv pip install -q gradio openai-whisper ffmpeg-python\n",
"!uv pip install -q ollama\n",
"\n",
"# Ensure ffmpeg is available (Mac)\n",
"!which ffmpeg || brew install ffmpeg"
]
},
{
"cell_type": "markdown",
"id": "f31d64ee",
"metadata": {},
"source": [
"## Step 2: Import Libraries\n",
"\n",
"The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4782261a",
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"\n",
"import os\n",
"import numpy as np\n",
"import gradio as gr\n",
"import whisper\n",
"import torch\n",
"import ollama"
]
},
{
"cell_type": "markdown",
"id": "93a41b23",
"metadata": {},
"source": [
"## Step 3: Load Whisper Model\n",
"\n",
"Loading the **base** model—a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). The model is ~140MB and will download automatically on first run.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "130ed059",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading Whisper model...\n",
"Using device: cpu\n",
"✅ Model loaded successfully!\n",
"Model type: <class 'whisper.model.Whisper'>\n",
"Has transcribe method: True\n"
]
}
],
"source": [
"# Model initialization\n",
"\n",
"print(\"Loading Whisper model...\")\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(f\"Using device: {device}\")\n",
"\n",
"whisper_model = whisper.load_model(\"base\", device=device)\n",
"print(\"✅ Model loaded successfully!\")\n",
"print(f\"Model type: {type(whisper_model)}\")\n",
"print(f\"Has transcribe method: {hasattr(whisper_model, 'transcribe')}\")\n"
]
},
{
"cell_type": "markdown",
"id": "d84f6cfe",
"metadata": {},
"source": [
"## Step 4: Transcription Function\n",
"\n",
"This is the core logic:\n",
"- Accepts an audio file and target language\n",
"- Maps language names to Whisper's language codes\n",
"- Transcribes the audio using the loaded model\n",
"- Returns the transcribed text\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f2c4b2c",
"metadata": {},
"outputs": [],
"source": [
"# Transcription function\n",
"\n",
"def transcribe_audio(audio_file, target_language):\n",
" \"\"\"Transcribe audio file to text in the specified language.\"\"\"\n",
" if audio_file is None:\n",
" return \"Please upload an audio file.\"\n",
" \n",
" try:\n",
" # Language codes for Whisper\n",
" language_map = {\n",
" \"English\": \"en\",\n",
" \"Spanish\": \"es\",\n",
" \"French\": \"fr\",\n",
" \"German\": \"de\",\n",
" \"Italian\": \"it\",\n",
" \"Portuguese\": \"pt\",\n",
" \"Chinese\": \"zh\",\n",
" \"Japanese\": \"ja\",\n",
" \"Korean\": \"ko\",\n",
" \"Russian\": \"ru\",\n",
" \"Arabic\": \"ar\",\n",
" \"Auto-detect\": None\n",
" }\n",
" \n",
" lang_code = language_map.get(target_language)\n",
" \n",
" # Get file path from Gradio File component (returns path string directly)\n",
" audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file\n",
" \n",
" if not audio_path or not os.path.exists(audio_path):\n",
" return \"Invalid audio file or file not found\"\n",
"\n",
" # Transcribe using whisper_model.transcribe()\n",
" result = whisper_model.transcribe(\n",
" audio_path,\n",
" language=lang_code,\n",
" task=\"transcribe\",\n",
" verbose=False # Hide confusing progress bar\n",
" )\n",
" \n",
" return result[\"text\"]\n",
" \n",
" except Exception as e:\n",
" return f\"Error: {str(e)}\"\n"
]
},
{
"cell_type": "markdown",
"id": "dd928784",
"metadata": {},
"source": [
"## Step 5: Build the Interface\n",
"\n",
"Creating a simple, clean Gradio interface with:\n",
"- **File uploader** for audio files\n",
"- **Language dropdown** with 12+ options\n",
"- **Transcription output** box\n",
"- Auto-launches in browser for convenience\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5ce2c944",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ App ready! Run the next cell to launch.\n"
]
}
],
"source": [
"# Gradio interface\n",
"\n",
"app = gr.Interface(\n",
" fn=transcribe_audio,\n",
" inputs=[\n",
" gr.File(label=\"Upload Audio File\", file_types=[\"audio\"]),\n",
" gr.Dropdown(\n",
" choices=[\n",
" \"English\", \"Spanish\", \"French\", \"German\", \"Italian\",\n",
" \"Portuguese\", \"Chinese\", \"Japanese\", \"Korean\",\n",
" \"Russian\", \"Arabic\", \"Auto-detect\"\n",
" ],\n",
" value=\"English\",\n",
" label=\"Language\"\n",
" )\n",
" ],\n",
" outputs=gr.Textbox(label=\"Transcription\", lines=15),\n",
" title=\"🎙️ Audio Transcription\",\n",
" description=\"Upload an audio file to transcribe it.\",\n",
" flagging_mode=\"never\"\n",
")\n",
"\n",
"print(\"✅ App ready! Run the next cell to launch.\")\n"
]
},
{
"cell_type": "markdown",
"id": "049ac197",
"metadata": {},
"source": [
"## Step 6: Launch the App\n",
"\n",
"Starting the Gradio server with Jupyter compatibility (`prevent_thread_lock=True`). The app will open automatically in your browser.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa6c8d9a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7860\n",
"* To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
" warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 10416/10416 [00:06<00:00, 1723.31frames/s]\n",
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
" warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 10416/10416 [00:30<00:00, 341.64frames/s]\n",
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
" warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 2289/2289 [00:01<00:00, 1205.18frames/s]\n"
]
}
],
"source": [
"# Launch\n",
"\n",
"# Close any previous instances\n",
"try:\n",
" app.close()\n",
"except:\n",
" pass\n",
"\n",
"# Start the app\n",
"app.launch(inbrowser=True, prevent_thread_lock=True)\n"
]
},
{
"cell_type": "markdown",
"id": "c3c2ec24",
"metadata": {},
"source": [
"---\n",
"\n",
"## 💡 How to Use\n",
"\n",
"1. **Upload** an audio file (MP3, WAV, M4A, etc.)\n",
"2. **Select** your language (or use Auto-detect)\n",
"3. **Click** Submit\n",
"4. **Get** your transcription!\n",
"\n",
"---\n",
"\n",
"## 🚀 Running on Google Colab\n",
"\n",
"For GPU acceleration on Colab:\n",
"1. Runtime → Change runtime type → **GPU (T4)**\n",
"2. Run all cells in order\n",
"3. The model will use GPU automatically\n",
"\n",
"**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.\n",
"\n",
"---\n",
"\n",
"## 📝 Supported Languages\n",
"\n",
"English • Spanish • French • German • Italian • Portuguese • Chinese • Japanese • Korean • Russian • Arabic • Auto-detect\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}