{ "cells": [ { "cell_type": "markdown", "id": "270ed08b", "metadata": {}, "source": [ "# πŸŽ™οΈ Audio Transcription Assistant\n", "\n", "## Why I Built This\n", "\n", "In today's content-driven world, audio and video are everywhereβ€”podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?\n", "\n", "Manual transcription is time-consuming and expensive. I wanted to build something that could:\n", "- Accept audio files in any format (MP3, WAV, etc.)\n", "- Transcribe them accurately using AI\n", "- Support multiple languages\n", "- Work locally on my Mac **and** on cloud GPUs (Google Colab)\n", "\n", "That's where **Whisper** comes inβ€”OpenAI's powerful speech recognition model.\n", "\n", "---\n", "\n", "## What This Does\n", "\n", "This app lets you:\n", "- πŸ“€ Upload any audio file\n", "- 🌍 Choose from 12+ languages (or auto-detect)\n", "- πŸ€– Get accurate AI-powered transcription\n", "- ⚑ Process on CPU (Mac) or GPU (Colab)\n", "\n", "**Tech:** OpenAI Whisper β€’ Gradio UI β€’ PyTorch β€’ Cross-platform (Mac/Colab)\n", "\n", "---\n", "\n", "**Note:** This is a demonstration. For production use, consider privacy and data handling policies.\n" ] }, { "cell_type": "markdown", "id": "c37e5165", "metadata": {}, "source": [ "## Step 1: Install Dependencies\n", "\n", "Installing everything needed:\n", "- **NumPy 1.26.4** - Compatible version for Whisper\n", "- **PyTorch** - Deep learning framework\n", "- **Whisper** - OpenAI's speech recognition model\n", "- **Gradio** - Web interface\n", "- **ffmpeg** - Audio file processing\n", "- **Ollama** - For local LLM support (optional)\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "8c66b0ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/local/bin/ffmpeg\n" ] } ], "source": [ "# Package installation\n", "\n", "!uv pip install -q --reinstall \"numpy==1.26.4\"\n", "!uv pip install -q torch torchvision torchaudio\n", "!uv pip install -q gradio openai-whisper ffmpeg-python\n", "!uv pip install -q ollama\n", "\n", "# Ensure ffmpeg is available (Mac)\n", "!which ffmpeg || brew install ffmpeg" ] }, { "cell_type": "markdown", "id": "f31d64ee", "metadata": {}, "source": [ "## Step 2: Import Libraries\n", "\n", "The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "4782261a", "metadata": {}, "outputs": [], "source": [ "# Imports\n", "\n", "import os\n", "import numpy as np\n", "import gradio as gr\n", "import whisper\n", "import torch\n", "import ollama" ] }, { "cell_type": "markdown", "id": "93a41b23", "metadata": {}, "source": [ "## Step 3: Load Whisper Model\n", "\n", "Loading the **base** modelβ€”a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). 
The `base` model is ~140MB and will download automatically on first run.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "130ed059", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading Whisper model...\n", "Using device: cpu\n", "βœ… Model loaded successfully!\n", "Model type: <class 'whisper.model.Whisper'>\n", "Has transcribe method: True\n" ] } ], "source": [ "# Model initialization\n", "\n", "print(\"Loading Whisper model...\")\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(f\"Using device: {device}\")\n", "\n", "whisper_model = whisper.load_model(\"base\", device=device)\n", "print(\"βœ… Model loaded successfully!\")\n", "print(f\"Model type: {type(whisper_model)}\")\n", "print(f\"Has transcribe method: {hasattr(whisper_model, 'transcribe')}\")\n" ] }, { "cell_type": "markdown", "id": "d84f6cfe", "metadata": {}, "source": [ "## Step 4: Transcription Function\n", "\n", "This is the core logic:\n", "- Accepts an audio file and target language\n", "- Maps language names to Whisper's language codes\n", "- Transcribes the audio using the loaded model\n", "- Returns the transcribed text\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "4f2c4b2c", "metadata": {}, "outputs": [], "source": [ "# Transcription function\n", "\n", "def transcribe_audio(audio_file, target_language):\n", " \"\"\"Transcribe audio file to text in the specified language.\"\"\"\n", " if audio_file is None:\n", " return \"Please upload an audio file.\"\n", " \n", " try:\n", " # Language codes for Whisper\n", " language_map = {\n", " \"English\": \"en\",\n", " \"Spanish\": \"es\",\n", " \"French\": \"fr\",\n", " \"German\": \"de\",\n", " \"Italian\": \"it\",\n", " \"Portuguese\": \"pt\",\n", " \"Chinese\": \"zh\",\n", " \"Japanese\": \"ja\",\n", " \"Korean\": \"ko\",\n", " \"Russian\": \"ru\",\n", " \"Arabic\": \"ar\",\n", " \"Auto-detect\": None\n", " }\n", " \n", " lang_code = language_map.get(target_language) # None tells Whisper to auto-detect\n", " \n", " # Gradio's File component may pass a filepath string or a temp-file object with a .name attribute\n", " audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file\n", " \n", " if not audio_path or not os.path.exists(audio_path):\n", " return \"Invalid audio file or file not found\"\n", "\n", " # Transcribe using whisper_model.transcribe()\n", " result = whisper_model.transcribe(\n", " audio_path,\n", " language=lang_code,\n", " task=\"transcribe\",\n", " verbose=False # Suppress per-segment text output (a frames progress bar may still appear)\n", " )\n", " \n", " return result[\"text\"]\n", " \n", " except Exception as e:\n", " return f\"Error: {str(e)}\"\n" ] }, { "cell_type": "markdown", "id": "dd928784", "metadata": {}, "source": [ "## Step 5: Build the Interface\n", "\n", "Creating a simple, clean Gradio interface with:\n", "- **File uploader** for audio files\n", "- **Language dropdown** with 12 options (11 languages plus Auto-detect)\n", "- **Transcription output** box\n", "- Auto-launch in the browser for convenience\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "5ce2c944", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… App ready! 
Run the next cell to launch.\n" ] } ], "source": [ "# Gradio interface\n", "\n", "app = gr.Interface(\n", " fn=transcribe_audio,\n", " inputs=[\n", " gr.File(label=\"Upload Audio File\", file_types=[\"audio\"]),\n", " gr.Dropdown(\n", " choices=[\n", " \"English\", \"Spanish\", \"French\", \"German\", \"Italian\",\n", " \"Portuguese\", \"Chinese\", \"Japanese\", \"Korean\",\n", " \"Russian\", \"Arabic\", \"Auto-detect\"\n", " ],\n", " value=\"English\",\n", " label=\"Language\"\n", " )\n", " ],\n", " outputs=gr.Textbox(label=\"Transcription\", lines=15),\n", " title=\"πŸŽ™οΈ Audio Transcription\",\n", " description=\"Upload an audio file to transcribe it.\",\n", " flagging_mode=\"never\"\n", ")\n", "\n", "print(\"βœ… App ready! Run the next cell to launch.\")\n" ] }, { "cell_type": "markdown", "id": "049ac197", "metadata": {}, "source": [ "## Step 6: Launch the App\n", "\n", "Starting the Gradio server with Jupyter compatibility (`prevent_thread_lock=True`). The app will open automatically in your browser.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fa6c8d9a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* Running on local URL: http://127.0.0.1:7860\n", "* To create a public link, set `share=True` in `launch()`.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10416/10416 [00:06<00:00, 1723.31frames/s]\n", "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10416/10416 [00:30<00:00, 341.64frames/s]\n", "/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n", " warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2289/2289 [00:01<00:00, 1205.18frames/s]\n" ] } ], "source": [ "# Launch\n", "\n", "# Close any previous instances\n", "try:\n", " app.close()\n", "except:\n", " pass\n", "\n", "# Start the app\n", "app.launch(inbrowser=True, prevent_thread_lock=True)\n" ] }, { "cell_type": "markdown", "id": "c3c2ec24", "metadata": {}, "source": [ "---\n", "\n", "## πŸ’‘ How to Use\n", "\n", "1. **Upload** an audio file (MP3, WAV, M4A, etc.)\n", "2. **Select** your language (or use Auto-detect)\n", "3. **Click** Submit\n", "4. **Get** your transcription!\n", "\n", "---\n", "\n", "## πŸš€ Running on Google Colab\n", "\n", "For GPU acceleration on Colab:\n", "1. Runtime β†’ Change runtime type β†’ **GPU (T4)**\n", "2. Run all cells in order\n", "3. The model will use GPU automatically\n", "\n", "**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.\n", "\n", "---\n", "\n", "## πŸ“ Supported Languages\n", "\n", "English β€’ Spanish β€’ French β€’ German β€’ Italian β€’ Portuguese β€’ Chinese β€’ Japanese β€’ Korean β€’ Russian β€’ Arabic β€’ Auto-detect\n" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }