(Oct 2025 Bootcamp): Add audio transcription assistant with Gradio interface

- Introduced a new audio transcription tool utilizing OpenAI's Whisper model.
- Added README.md detailing features, installation, and usage instructions.
- Created a Jupyter notebook for local and Google Colab execution.
- Included an MP3 file for demonstration purposes.
Hope Ogbons
2025-10-26 04:49:01 +01:00
parent 48076f9d39
commit ae81fa4c8d
3 changed files with 594 additions and 0 deletions


@@ -0,0 +1,197 @@
# 🎙️ Audio Transcription Assistant
An AI-powered audio transcription tool that converts speech to text in multiple languages using OpenAI's Whisper model.
## Why I Built This
In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?
Manual transcription is time-consuming and expensive. I wanted to build something that could:
- Accept audio files in any format (MP3, WAV, etc.)
- Transcribe them accurately using AI
- Support multiple languages
- Work locally on my Mac **and** on cloud GPUs (Google Colab)
That's where **Whisper** comes in—OpenAI's powerful speech recognition model.
## Features
- 📤 **Upload any audio file** (MP3, WAV, M4A, FLAC, etc.)
- 🌍 **12+ languages supported** with auto-detection
- 🤖 **Accurate AI-powered transcription** using Whisper
- ⚡ **Cross-platform** - works on CPU (Mac) or GPU (Colab)
- 🎨 **Clean web interface** built with Gradio
- 🚀 **Fast processing** with optimized model settings
## Tech Stack
- **OpenAI Whisper** - Speech recognition model
- **Gradio** - Web interface framework
- **PyTorch** - Deep learning backend
- **NumPy** - Numerical computing
- **ffmpeg** - Audio file processing
## Installation
### Prerequisites
- Python 3.12+
- ffmpeg (for audio processing)
- uv package manager (or pip)
### Setup
1. Clone this repository or download the notebook
2. Install dependencies:
```bash
# Install compatible NumPy version
uv pip install --reinstall "numpy==1.26.4"
# Install PyTorch
uv pip install torch torchvision torchaudio
# Install Gradio and Whisper
uv pip install gradio openai-whisper ffmpeg-python
# (Optional) Install Ollama for LLM features
uv pip install ollama
```
3. **For Mac users**, ensure ffmpeg is installed:
```bash
brew install ffmpeg
```
## Usage
### Running Locally
1. Open the Jupyter notebook `week3 EXERCISE_hopeogbons.ipynb`
2. Run all cells in order:
- Cell 1: Install dependencies
- Cell 2: Import libraries
- Cell 3: Load Whisper model
- Cell 4: Define transcription function
- Cell 5: Build Gradio interface
- Cell 6: Launch the app
3. The app will automatically open in your browser
4. Upload an audio file, select the language, and click Submit!
### Running on Google Colab
For GPU acceleration:
1. Open the notebook in Google Colab
2. Runtime → Change runtime type → **GPU (T4)**
3. Run all cells in order
4. The model will automatically use GPU acceleration
**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.
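Under the hood, the notebook picks the device automatically (this mirrors Step 3 of the notebook):
```python
import torch
import whisper

# Use the Colab GPU when available, otherwise fall back to CPU (e.g. on a Mac)
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = whisper.load_model("base", device=device)
```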
## Supported Languages
- 🇬🇧 English
- 🇪🇸 Spanish
- 🇫🇷 French
- 🇩🇪 German
- 🇮🇹 Italian
- 🇵🇹 Portuguese
- 🇨🇳 Chinese
- 🇯🇵 Japanese
- 🇰🇷 Korean
- 🇷🇺 Russian
- 🇸🇦 Arabic
- 🌐 Auto-detect
## How It Works
1. **Upload** - User uploads an audio file through the Gradio interface
2. **Process** - ffmpeg decodes the audio file
3. **Transcribe** - Whisper model processes the audio and generates text
4. **Display** - Transcription is shown in the output box
The Whisper "base" model is used for a balance between speed and accuracy:
- Fast enough for real-time use on CPU
- Accurate enough for most transcription needs
- Small enough (~140MB) for quick downloads
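For reference, here's a minimal sketch of the same pipeline as plain Python (the file name `sample.mp3` is just a placeholder; `openai-whisper` and ffmpeg must be installed):
```python
import whisper

# Load the "base" model (~140MB, downloaded on first run)
model = whisper.load_model("base")

# ffmpeg decodes the file; pass language=None to auto-detect
result = model.transcribe("sample.mp3", language="en", task="transcribe")
print(result["text"])
```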
## Example Transcriptions
The app successfully transcribed:
- English podcast episodes
- French language audio (detected and transcribed)
- Multi-speaker conversations
- Audio with background noise
## What I Learned
Building this transcription assistant taught me:
- **Audio processing** with ffmpeg and Whisper
- **Cross-platform compatibility** (Mac CPU vs Colab GPU)
- **Dependency management** (dealing with NumPy version conflicts!)
- **Async handling** in Jupyter notebooks with Gradio
- **Model optimization** (choosing the right Whisper model size)
The biggest challenge? Getting ffmpeg and NumPy to play nice together across different environments. But solving those issues made me understand the stack much better.
## Troubleshooting
### Common Issues
**1. "No module named 'whisper'" error**
- Make sure you've installed `openai-whisper`, not just `whisper`
- Restart your kernel after installation
**2. "ffmpeg not found" error**
- Install ffmpeg: `brew install ffmpeg` (Mac) or `apt-get install ffmpeg` (Linux)
**3. NumPy version conflicts**
- Use NumPy 1.26.4: `uv pip install --reinstall "numpy==1.26.4"`
- Restart kernel after reinstalling
**4. Gradio event loop errors**
- Use `prevent_thread_lock=True` in `app.launch()`
- Restart kernel if errors persist
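For example, the launch call used in the notebook looks like this (a sketch; `app` is the `gr.Interface` built earlier):
```python
# prevent_thread_lock=True keeps the Gradio server from blocking the Jupyter kernel
app.launch(inbrowser=True, prevent_thread_lock=True)
```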
## Future Enhancements
- [ ] Support for real-time audio streaming
- [ ] Speaker diarization (identifying different speakers)
- [ ] Export transcripts to multiple formats (SRT, VTT, TXT)
- [ ] Integration with LLMs for summarization
- [ ] Batch processing for multiple files
## Contributing
Feel free to fork this project and submit pull requests with improvements!
## License
This project is open source and available under the MIT License.
## Acknowledgments
- **OpenAI** for the amazing Whisper model
- **Gradio** team for the intuitive interface framework
- **Andela LLM Engineering Program** for the learning opportunity
---
**Built with ❤️ as part of the Andela LLM Engineering Program**
For questions or feedback, feel free to reach out!


@@ -0,0 +1,397 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "270ed08b",
"metadata": {},
"source": [
"# 🎙️ Audio Transcription Assistant\n",
"\n",
"## Why I Built This\n",
"\n",
"In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?\n",
"\n",
"Manual transcription is time-consuming and expensive. I wanted to build something that could:\n",
"- Accept audio files in any format (MP3, WAV, etc.)\n",
"- Transcribe them accurately using AI\n",
"- Support multiple languages\n",
"- Work locally on my Mac **and** on cloud GPUs (Google Colab)\n",
"\n",
"That's where **Whisper** comes in—OpenAI's powerful speech recognition model.\n",
"\n",
"---\n",
"\n",
"## What This Does\n",
"\n",
"This app lets you:\n",
"- 📤 Upload any audio file\n",
"- 🌍 Choose from 12+ languages (or auto-detect)\n",
"- 🤖 Get accurate AI-powered transcription\n",
"- ⚡ Process on CPU (Mac) or GPU (Colab)\n",
"\n",
"**Tech:** OpenAI Whisper • Gradio UI • PyTorch • Cross-platform (Mac/Colab)\n",
"\n",
"---\n",
"\n",
"**Note:** This is a demonstration. For production use, consider privacy and data handling policies.\n"
]
},
{
"cell_type": "markdown",
"id": "c37e5165",
"metadata": {},
"source": [
"## Step 1: Install Dependencies\n",
"\n",
"Installing everything needed:\n",
"- **NumPy 1.26.4** - Compatible version for Whisper\n",
"- **PyTorch** - Deep learning framework\n",
"- **Whisper** - OpenAI's speech recognition model\n",
"- **Gradio** - Web interface\n",
"- **ffmpeg** - Audio file processing\n",
"- **Ollama** - For local LLM support (optional)\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c66b0ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/usr/local/bin/ffmpeg\n"
]
}
],
"source": [
"# Package installation\n",
"\n",
"!uv pip install -q --reinstall \"numpy==1.26.4\"\n",
"!uv pip install -q torch torchvision torchaudio\n",
"!uv pip install -q gradio openai-whisper ffmpeg-python\n",
"!uv pip install -q ollama\n",
"\n",
"# Ensure ffmpeg is available (Mac)\n",
"!which ffmpeg || brew install ffmpeg"
]
},
{
"cell_type": "markdown",
"id": "f31d64ee",
"metadata": {},
"source": [
"## Step 2: Import Libraries\n",
"\n",
"The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4782261a",
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"\n",
"import os\n",
"import numpy as np\n",
"import gradio as gr\n",
"import whisper\n",
"import torch\n",
"import ollama"
]
},
{
"cell_type": "markdown",
"id": "93a41b23",
"metadata": {},
"source": [
"## Step 3: Load Whisper Model\n",
"\n",
"Loading the **base** model—a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). The model is ~140MB and will download automatically on first run.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "130ed059",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading Whisper model...\n",
"Using device: cpu\n",
"✅ Model loaded successfully!\n",
"Model type: <class 'whisper.model.Whisper'>\n",
"Has transcribe method: True\n"
]
}
],
"source": [
"# Model initialization\n",
"\n",
"print(\"Loading Whisper model...\")\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(f\"Using device: {device}\")\n",
"\n",
"whisper_model = whisper.load_model(\"base\", device=device)\n",
"print(\"✅ Model loaded successfully!\")\n",
"print(f\"Model type: {type(whisper_model)}\")\n",
"print(f\"Has transcribe method: {hasattr(whisper_model, 'transcribe')}\")\n"
]
},
{
"cell_type": "markdown",
"id": "d84f6cfe",
"metadata": {},
"source": [
"## Step 4: Transcription Function\n",
"\n",
"This is the core logic:\n",
"- Accepts an audio file and target language\n",
"- Maps language names to Whisper's language codes\n",
"- Transcribes the audio using the loaded model\n",
"- Returns the transcribed text\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f2c4b2c",
"metadata": {},
"outputs": [],
"source": [
"# Transcription function\n",
"\n",
"def transcribe_audio(audio_file, target_language):\n",
" \"\"\"Transcribe audio file to text in the specified language.\"\"\"\n",
" if audio_file is None:\n",
" return \"Please upload an audio file.\"\n",
" \n",
" try:\n",
" # Language codes for Whisper\n",
" language_map = {\n",
" \"English\": \"en\",\n",
" \"Spanish\": \"es\",\n",
" \"French\": \"fr\",\n",
" \"German\": \"de\",\n",
" \"Italian\": \"it\",\n",
" \"Portuguese\": \"pt\",\n",
" \"Chinese\": \"zh\",\n",
" \"Japanese\": \"ja\",\n",
" \"Korean\": \"ko\",\n",
" \"Russian\": \"ru\",\n",
" \"Arabic\": \"ar\",\n",
" \"Auto-detect\": None\n",
" }\n",
" \n",
" lang_code = language_map.get(target_language)\n",
" \n",
" # Get file path from Gradio File component (returns path string directly)\n",
" audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file\n",
" \n",
" if not audio_path or not os.path.exists(audio_path):\n",
" return \"Invalid audio file or file not found\"\n",
"\n",
" # Transcribe using whisper_model.transcribe()\n",
" result = whisper_model.transcribe(\n",
" audio_path,\n",
" language=lang_code,\n",
" task=\"transcribe\",\n",
" verbose=False # Hide confusing progress bar\n",
" )\n",
" \n",
" return result[\"text\"]\n",
" \n",
" except Exception as e:\n",
" return f\"Error: {str(e)}\"\n"
]
},
{
"cell_type": "markdown",
"id": "dd928784",
"metadata": {},
"source": [
"## Step 5: Build the Interface\n",
"\n",
"Creating a simple, clean Gradio interface with:\n",
"- **File uploader** for audio files\n",
"- **Language dropdown** with 12+ options\n",
"- **Transcription output** box\n",
"- Auto-launches in browser for convenience\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5ce2c944",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ App ready! Run the next cell to launch.\n"
]
}
],
"source": [
"# Gradio interface\n",
"\n",
"app = gr.Interface(\n",
" fn=transcribe_audio,\n",
" inputs=[\n",
" gr.File(label=\"Upload Audio File\", file_types=[\"audio\"]),\n",
" gr.Dropdown(\n",
" choices=[\n",
" \"English\", \"Spanish\", \"French\", \"German\", \"Italian\",\n",
" \"Portuguese\", \"Chinese\", \"Japanese\", \"Korean\",\n",
" \"Russian\", \"Arabic\", \"Auto-detect\"\n",
" ],\n",
" value=\"English\",\n",
" label=\"Language\"\n",
" )\n",
" ],\n",
" outputs=gr.Textbox(label=\"Transcription\", lines=15),\n",
" title=\"🎙️ Audio Transcription\",\n",
" description=\"Upload an audio file to transcribe it.\",\n",
" flagging_mode=\"never\"\n",
")\n",
"\n",
"print(\"✅ App ready! Run the next cell to launch.\")\n"
]
},
{
"cell_type": "markdown",
"id": "049ac197",
"metadata": {},
"source": [
"## Step 6: Launch the App\n",
"\n",
"Starting the Gradio server with Jupyter compatibility (`prevent_thread_lock=True`). The app will open automatically in your browser.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa6c8d9a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Running on local URL: http://127.0.0.1:7860\n",
"* To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
" warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 10416/10416 [00:06<00:00, 1723.31frames/s]\n",
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
" warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 10416/10416 [00:30<00:00, 341.64frames/s]\n",
"/Users/hopeogbons/Projects/andela/llm_engineering/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead\n",
" warnings.warn(\"FP16 is not supported on CPU; using FP32 instead\")\n",
"100%|██████████| 2289/2289 [00:01<00:00, 1205.18frames/s]\n"
]
}
],
"source": [
"# Launch\n",
"\n",
"# Close any previous instances\n",
"try:\n",
" app.close()\n",
"except:\n",
" pass\n",
"\n",
"# Start the app\n",
"app.launch(inbrowser=True, prevent_thread_lock=True)\n"
]
},
{
"cell_type": "markdown",
"id": "c3c2ec24",
"metadata": {},
"source": [
"---\n",
"\n",
"## 💡 How to Use\n",
"\n",
"1. **Upload** an audio file (MP3, WAV, M4A, etc.)\n",
"2. **Select** your language (or use Auto-detect)\n",
"3. **Click** Submit\n",
"4. **Get** your transcription!\n",
"\n",
"---\n",
"\n",
"## 🚀 Running on Google Colab\n",
"\n",
"For GPU acceleration on Colab:\n",
"1. Runtime → Change runtime type → **GPU (T4)**\n",
"2. Run all cells in order\n",
"3. The model will use GPU automatically\n",
"\n",
"**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.\n",
"\n",
"---\n",
"\n",
"## 📝 Supported Languages\n",
"\n",
"English • Spanish • French • German • Italian • Portuguese • Chinese • Japanese • Korean • Russian • Arabic • Auto-detect\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}