- Introduced a new audio transcription tool utilizing OpenAI's Whisper model.
- Added README.md detailing features, installation, and usage instructions.
- Created a Jupyter notebook for local and Google Colab execution.
- Included an MP3 file for demonstration purposes.
{
"cells": [
{
"cell_type": "markdown",
"id": "270ed08b",
"metadata": {},
"source": [
"# 🎙️ Audio Transcription Assistant\n",
"\n",
"## Why I Built This\n",
"\n",
"In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?\n",
"\n",
"Manual transcription is time-consuming and expensive. I wanted to build something that could:\n",
"- Accept audio files in any format (MP3, WAV, etc.)\n",
"- Transcribe them accurately using AI\n",
"- Support multiple languages\n",
"- Work locally on my Mac **and** on cloud GPUs (Google Colab)\n",
"\n",
"That's where **Whisper** comes in—OpenAI's powerful speech recognition model.\n",
"\n",
"---\n",
"\n",
"## What This Does\n",
"\n",
"This app lets you:\n",
"- 📤 Upload any audio file\n",
"- 🌍 Choose from 11 languages (or auto-detect)\n",
"- 🤖 Get accurate AI-powered transcription\n",
"- ⚡ Process on CPU (Mac) or GPU (Colab)\n",
"\n",
"**Tech:** OpenAI Whisper • Gradio UI • PyTorch • Cross-platform (Mac/Colab)\n",
"\n",
"---\n",
"\n",
"**Note:** This is a demonstration. For production use, consider privacy and data handling policies.\n"
]
},
{
"cell_type": "markdown",
"id": "c37e5165",
"metadata": {},
"source": [
"## Step 1: Install Dependencies\n",
"\n",
"Installing everything needed:\n",
"- **NumPy 1.26.4** - Compatible version for Whisper\n",
"- **PyTorch** - Deep learning framework\n",
"- **Whisper** - OpenAI's speech recognition model\n",
"- **Gradio** - Web interface\n",
"- **ffmpeg** - Audio file processing\n",
"- **Ollama** - For local LLM support (optional)\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c66b0ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/usr/local/bin/ffmpeg\n"
]
}
],
"source": [
"# Package installation\n",
"\n",
"!uv pip install -q --reinstall \"numpy==1.26.4\"\n",
"!uv pip install -q torch torchvision torchaudio\n",
"!uv pip install -q gradio openai-whisper ffmpeg-python\n",
"!uv pip install -q ollama\n",
"\n",
"# Ensure ffmpeg is available (Mac; Colab users: see the note below)\n",
"!which ffmpeg || brew install ffmpeg"
]
},
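{
"cell_type": "markdown",
"id": "colab-ffmpeg-note",
"metadata": {},
"source": [
"**Colab note (optional):** The `brew install ffmpeg` fallback above is macOS-only. On Google Colab (Linux), Homebrew isn't available, so below is a minimal, hedged sketch of a platform-aware check; Colab images usually ship with ffmpeg already, so it is often a no-op.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "colab-ffmpeg-check",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: install ffmpeg only if it's missing (assumes apt-get on Linux/Colab, brew on macOS)\n",
"import platform\n",
"import shutil\n",
"\n",
"if shutil.which(\"ffmpeg\") is None:\n",
"    if platform.system() == \"Linux\":\n",
"        !apt-get -qq install -y ffmpeg\n",
"    elif platform.system() == \"Darwin\":\n",
"        !brew install ffmpeg\n",
"\n",
"print(\"ffmpeg found at:\", shutil.which(\"ffmpeg\"))\n"
]
},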
{
"cell_type": "markdown",
"id": "f31d64ee",
"metadata": {},
"source": [
"## Step 2: Import Libraries\n",
"\n",
"The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4782261a",
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"\n",
"import os\n",
"import numpy as np\n",
"import gradio as gr\n",
"import whisper\n",
"import torch\n",
"import ollama  # optional; not used by the transcription flow itself"
]
},
{
"cell_type": "markdown",
"id": "93a41b23",
"metadata": {},
"source": [
"## Step 3: Load Whisper Model\n",
"\n",
"Loading the **base** model—a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). The model is ~140MB and will download automatically on first run.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "130ed059",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading Whisper model...\n",
"Using device: cpu\n",
"✅ Model loaded successfully!\n",
"Model type: <class 'whisper.model.Whisper'>\n",
"Has transcribe method: True\n"
]
}
],
"source": [
"# Model initialization\n",
"\n",
"print(\"Loading Whisper model...\")\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(f\"Using device: {device}\")\n",
"\n",
"whisper_model = whisper.load_model(\"base\", device=device)\n",
"print(\"✅ Model loaded successfully!\")\n",
"print(f\"Model type: {type(whisper_model)}\")\n",
"print(f\"Has transcribe method: {hasattr(whisper_model, 'transcribe')}\")\n"
]
},
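{
"cell_type": "markdown",
"id": "model-size-note",
"metadata": {},
"source": [
"Optional: Whisper ships several checkpoint sizes (`tiny`, `base`, `small`, `medium`, `large`), trading speed for accuracy. The sketch below shows how you could swap the checkpoint; `MODEL_SIZE` is just an illustrative variable, not something the app requires.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "model-size-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pick a different Whisper checkpoint (larger = more accurate, slower, bigger download)\n",
"MODEL_SIZE = \"base\"  # illustrative; try \"small\" or \"medium\" on a GPU runtime\n",
"\n",
"whisper_model = whisper.load_model(MODEL_SIZE, device=device)\n",
"print(f\"Loaded Whisper '{MODEL_SIZE}' on {device}\")\n"
]
},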
{
"cell_type": "markdown",
"id": "d84f6cfe",
"metadata": {},
"source": [
"## Step 4: Transcription Function\n",
"\n",
"This is the core logic:\n",
"- Accepts an audio file and target language\n",
"- Maps language names to Whisper's language codes\n",
"- Transcribes the audio using the loaded model\n",
"- Returns the transcribed text\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f2c4b2c",
"metadata": {},
"outputs": [],
"source": [
"# Transcription function\n",
"\n",
"def transcribe_audio(audio_file, target_language):\n",
"    \"\"\"Transcribe audio file to text in the specified language.\"\"\"\n",
"    if audio_file is None:\n",
"        return \"Please upload an audio file.\"\n",
"\n",
"    try:\n",
"        # Language codes for Whisper\n",
"        language_map = {\n",
"            \"English\": \"en\",\n",
"            \"Spanish\": \"es\",\n",
"            \"French\": \"fr\",\n",
"            \"German\": \"de\",\n",
"            \"Italian\": \"it\",\n",
"            \"Portuguese\": \"pt\",\n",
"            \"Chinese\": \"zh\",\n",
"            \"Japanese\": \"ja\",\n",
"            \"Korean\": \"ko\",\n",
"            \"Russian\": \"ru\",\n",
"            \"Arabic\": \"ar\",\n",
"            \"Auto-detect\": None\n",
"        }\n",
"\n",
"        lang_code = language_map.get(target_language)\n",
"\n",
"        # Get the file path from the Gradio File component (a path string or an object with a .name attribute)\n",
"        audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file\n",
"\n",
"        if not audio_path or not os.path.exists(audio_path):\n",
"            return \"Invalid audio file or file not found\"\n",
"\n",
"        # Transcribe using whisper_model.transcribe()\n",
"        result = whisper_model.transcribe(\n",
"            audio_path,\n",
"            language=lang_code,\n",
"            task=\"transcribe\",\n",
"            verbose=False  # suppress per-segment text output (a progress bar may still appear)\n",
"        )\n",
"\n",
"        return result[\"text\"]\n",
"\n",
"    except Exception as e:\n",
"        return f\"Error: {str(e)}\"\n"
]
},
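{
"cell_type": "markdown",
"id": "direct-call-note",
"metadata": {},
"source": [
"You can sanity-check the function before wiring up the UI. The path below is a placeholder for the demo MP3 bundled with this project (or any audio file on disk); `sample.mp3` is a hypothetical filename, so adjust it to match your file.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "direct-call-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: call transcribe_audio directly with a file path (placeholder filename)\n",
"sample_path = \"sample.mp3\"  # hypothetical; replace with your own audio file\n",
"\n",
"if os.path.exists(sample_path):\n",
"    print(transcribe_audio(sample_path, \"Auto-detect\"))\n",
"else:\n",
"    print(f\"No file at {sample_path!r} - update the path to try a direct call.\")\n"
]
},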
{
"cell_type": "markdown",
"id": "dd928784",
"metadata": {},
"source": [
"## Step 5: Build the Interface\n",
"\n",
"Creating a simple, clean Gradio interface with:\n",
"- **File uploader** for audio files\n",
"- **Language dropdown** with 12 options (11 languages plus auto-detect)\n",
"- **Transcription output** box\n",
"- Automatic browser launch when started in Step 6\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5ce2c944",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ App ready! Run the next cell to launch.\n"
]
}
],
"source": [
"# Gradio interface\n",
"\n",
"app = gr.Interface(\n",
"    fn=transcribe_audio,\n",
"    inputs=[\n",
"        gr.File(label=\"Upload Audio File\", file_types=[\"audio\"]),\n",
"        gr.Dropdown(\n",
"            choices=[\n",
"                \"English\", \"Spanish\", \"French\", \"German\", \"Italian\",\n",
"                \"Portuguese\", \"Chinese\", \"Japanese\", \"Korean\",\n",
"                \"Russian\", \"Arabic\", \"Auto-detect\"\n",
"            ],\n",
"            value=\"English\",\n",
"            label=\"Language\"\n",
"        )\n",
"    ],\n",
"    outputs=gr.Textbox(label=\"Transcription\", lines=15),\n",
"    title=\"🎙️ Audio Transcription\",\n",
"    description=\"Upload an audio file to transcribe it.\",\n",
"    flagging_mode=\"never\"\n",
")\n",
"\n",
"print(\"✅ App ready! Run the next cell to launch.\")\n"
]
},
{
"cell_type": "markdown",
"id": "049ac197",
"metadata": {},
"source": [
"## Step 6: Launch the App\n",
"\n",
"Starting the Gradio server with Jupyter compatibility (`prevent_thread_lock=True`). The app will open automatically in your browser.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa6c8d9a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7860\n",
"* To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
"  warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 10416/10416 [00:06<00:00, 1723.31frames/s]\n",
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
"  warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 10416/10416 [00:30<00:00, 341.64frames/s]\n",
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
"  warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 2289/2289 [00:01<00:00, 1205.18frames/s]\n"
]
}
],
"source": [
"# Launch\n",
"\n",
"# Close any previous instances\n",
"try:\n",
"    app.close()\n",
"except Exception:\n",
"    pass\n",
"\n",
"# Start the app\n",
"app.launch(inbrowser=True, prevent_thread_lock=True)\n"
]
},
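{
"cell_type": "markdown",
"id": "colab-share-note",
"metadata": {},
"source": [
"On Google Colab, `http://127.0.0.1:7860` isn't reachable from your browser. As the Gradio log above suggests, you would typically pass `share=True` to get a temporary public link. A hedged sketch follows; the `google.colab` module check is a common heuristic for detecting Colab.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "colab-share-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: relaunch with a public share link when running on Colab\n",
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
"    app.close()\n",
"    app.launch(share=True, prevent_thread_lock=True)\n"
]
},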
{
"cell_type": "markdown",
"id": "c3c2ec24",
"metadata": {},
"source": [
"---\n",
"\n",
"## 💡 How to Use\n",
"\n",
"1. **Upload** an audio file (MP3, WAV, M4A, etc.)\n",
"2. **Select** your language (or use Auto-detect)\n",
"3. **Click** Submit\n",
"4. **Get** your transcription!\n",
"\n",
"---\n",
"\n",
"## 🚀 Running on Google Colab\n",
"\n",
"For GPU acceleration on Colab:\n",
"1. Runtime → Change runtime type → **GPU (T4)**\n",
"2. Run all cells in order\n",
"3. The model will use GPU automatically\n",
"\n",
"**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.\n",
"\n",
"---\n",
"\n",
"## 📝 Supported Languages\n",
"\n",
"English • Spanish • French • German • Italian • Portuguese • Chinese • Japanese • Korean • Russian • Arabic • Auto-detect\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}