{ "cells": [ { "cell_type": "markdown", "id": "270ed08b", "metadata": {}, "source": [ "# πŸŽ™οΈ Audio Transcription Assistant\n", "\n", "## Why I Built This\n", "\n", "In today's content-driven world, audio and video are everywhereβ€”podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?\n", "\n", "Manual transcription is time-consuming and expensive. I wanted to build something that could:\n", "- Accept audio files in any format (MP3, WAV, etc.)\n", "- Transcribe them accurately using AI\n", "- Support multiple languages\n", "- Work locally on my Mac **and** on cloud GPUs (Google Colab)\n", "\n", "That's where **Whisper** comes inβ€”OpenAI's powerful speech recognition model.\n", "\n", "---\n", "\n", "## What This Does\n", "\n", "This app lets you:\n", "- πŸ“€ Upload any audio file\n", "- 🌍 Choose from 12+ languages (or auto-detect)\n", "- πŸ€– Get accurate AI-powered transcription\n", "- ⚑ Process on CPU (Mac) or GPU (Colab)\n", "\n", "**Tech:** OpenAI Whisper β€’ Gradio UI β€’ PyTorch β€’ Cross-platform (Mac/Colab)\n", "\n", "---\n", "\n", "**Note:** This is a demonstration. For production use, consider privacy and data handling policies.\n" ] }, { "cell_type": "markdown", "id": "c37e5165", "metadata": {}, "source": [ "## Step 1: Install Dependencies\n", "\n", "Installing everything needed:\n", "- **NumPy 1.26.4** - Compatible version for Whisper\n", "- **PyTorch** - Deep learning framework\n", "- **Whisper** - OpenAI's speech recognition model\n", "- **Gradio** - Web interface\n", "- **ffmpeg** - Audio file processing\n", "- **Ollama** - For local LLM support (optional)\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "8c66b0ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/local/bin/ffmpeg\n" ] } ], "source": [ "# Package installation\n", "\n", "!uv pip install -q --reinstall \"numpy==1.26.4\"\n", "!uv pip install -q torch torchvision torchaudio\n", "!uv pip install -q gradio openai-whisper ffmpeg-python\n", "!uv pip install -q ollama\n", "\n", "# Ensure ffmpeg is available (Mac)\n", "!which ffmpeg || brew install ffmpeg" ] }, { "cell_type": "markdown", "id": "f31d64ee", "metadata": {}, "source": [ "## Step 2: Import Libraries\n", "\n", "The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "4782261a", "metadata": {}, "outputs": [], "source": [ "# Imports\n", "\n", "import os\n", "import numpy as np\n", "import gradio as gr\n", "import whisper\n", "import torch\n", "import ollama" ] }, { "cell_type": "markdown", "id": "93a41b23", "metadata": {}, "source": [ "## Step 3: Load Whisper Model\n", "\n", "Loading the **base** modelβ€”a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). 
The `base` model is ~140MB and will download automatically on first run.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "130ed059", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading Whisper model...\n", "Using device: cpu\n", "βœ… Model loaded successfully!\n", "Model type: <class 'whisper.model.Whisper'>\n", "Has transcribe method: True\n" ] } ], "source": [ "# Model initialization\n", "\n", "print(\"Loading Whisper model...\")\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(f\"Using device: {device}\")\n", "\n", "whisper_model = whisper.load_model(\"base\", device=device)\n", "print(\"βœ… Model loaded successfully!\")\n", "print(f\"Model type: {type(whisper_model)}\")\n", "print(f\"Has transcribe method: {hasattr(whisper_model, 'transcribe')}\")\n" ] }, { "cell_type": "markdown", "id": "d84f6cfe", "metadata": {}, "source": [ "## Step 4: Transcription Function\n", "\n", "This is the core logic:\n", "- Accepts an audio file and target language\n", "- Maps language names to Whisper's language codes\n", "- Transcribes the audio using the loaded model\n", "- Returns the transcribed text\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "4f2c4b2c", "metadata": {}, "outputs": [], "source": [ "# Transcription function\n", "\n", "def transcribe_audio(audio_file, target_language):\n", " \"\"\"Transcribe audio file to text in the specified language.\"\"\"\n", " if audio_file is None:\n", " return \"Please upload an audio file.\"\n", " \n", " try:\n", " # Language codes for Whisper\n", " language_map = {\n", " \"English\": \"en\",\n", " \"Spanish\": \"es\",\n", " \"French\": \"fr\",\n", " \"German\": \"de\",\n", " \"Italian\": \"it\",\n", " \"Portuguese\": \"pt\",\n", " \"Chinese\": \"zh\",\n", " \"Japanese\": \"ja\",\n", " \"Korean\": \"ko\",\n", " \"Russian\": \"ru\",\n", " \"Arabic\": \"ar\",\n", " \"Auto-detect\": None\n", " }\n", " \n", " lang_code = language_map.get(target_language) # None tells Whisper to auto-detect\n", " \n", " # Gradio's File component may pass a filepath string or a temp-file object with a .name attribute\n", " audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file\n", " \n", " if not audio_path or not os.path.exists(audio_path):\n", " return \"Invalid audio file or file not found\"\n", "\n", " # Transcribe using whisper_model.transcribe()\n", " result = whisper_model.transcribe(\n", " audio_path,\n", " language=lang_code,\n", " task=\"transcribe\",\n", " verbose=False # Suppress per-segment text output (a frames progress bar may still appear)\n", " )\n", " \n", " return result[\"text\"]\n", " \n", " except Exception as e:\n", " return f\"Error: {str(e)}\"\n" ] }, { "cell_type": "markdown", "id": "dd928784", "metadata": {}, "source": [ "## Step 5: Build the Interface\n", "\n", "Creating a simple, clean Gradio interface with:\n", "- **File uploader** for audio files\n", "- **Language dropdown** with 12 options (11 languages plus Auto-detect)\n", "- **Transcription output** box\n", "- Auto-launch in the browser for convenience\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "5ce2c944", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… App ready! 
Run the next cell to launch.\n" ] } ], "source": [ "# Gradio interface\n", "\n", "app = gr.Interface(\n", " fn=transcribe_audio,\n", " inputs=[\n", " gr.File(label=\"Upload Audio File\", file_types=[\"audio\"]),\n", " gr.Dropdown(\n", " choices=[\n", " \"English\", \"Spanish\", \"French\", \"German\", \"Italian\",\n", " \"Portuguese\", \"Chinese\", \"Japanese\", \"Korean\",\n", " \"Russian\", \"Arabic\", \"Auto-detect\"\n", " ],\n", " value=\"English\",\n", " label=\"Language\"\n", " )\n", " ],\n", " outputs=gr.Textbox(label=\"Transcription\", lines=15),\n", " title=\"πŸŽ™οΈ Audio Transcription\",\n", " description=\"Upload an audio file to transcribe it.\",\n", " flagging_mode=\"never\"\n", ")\n", "\n", "print(\"βœ… App ready! Run the next cell to launch.\")\n" ] }, { "cell_type": "markdown", "id": "049ac197", "metadata": {}, "source": [ "## Step 6: Launch the App\n", "\n", "Starting the Gradio server with Jupyter compatibility (`prevent_thread_lock=True`). The app will open automatically in your browser.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fa6c8d9a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* Running on local URL: http://127.0.0.1:7860\n", "* To create a public link, set `share=True` in `launch()`.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10416/10416 [00:06<00:00, 1723.31frames/s]\n", "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10416/10416 [00:30<00:00, 341.64frames/s]\n", "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2289/2289 [00:01<00:00, 1205.18frames/s]\n" ] } ], "source": [ "# Launch\n", "\n", "# Close any previous instances\n", "try:\n", " app.close()\n", "except:\n", " pass\n", "\n", "# Start the app\n", "app.launch(inbrowser=True, prevent_thread_lock=True)\n" ] }, { "cell_type": "markdown", "id": "c3c2ec24", "metadata": {}, "source": [ "---\n", "\n", "## πŸ’‘ How to Use\n", "\n", "1. **Upload** an audio file (MP3, WAV, M4A, etc.)\n", "2. **Select** your language (or use Auto-detect)\n", "3. **Click** Submit\n", "4. **Get** your transcription!\n", "\n", "---\n", "\n", "## πŸš€ Running on Google Colab\n", "\n", "For GPU acceleration on Colab:\n", "1. Runtime β†’ Change runtime type β†’ **GPU (T4)**\n", "2. Run all cells in order\n", "3. The model will use GPU automatically\n", "\n", "**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.\n", "\n", "---\n", "\n", "## πŸ“ Supported Languages\n", "\n", "English β€’ Spanish β€’ French β€’ German β€’ Italian β€’ Portuguese β€’ Chinese β€’ Japanese β€’ Korean β€’ Russian β€’ Arabic β€’ Auto-detect\n" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }