🎙️ Audio Transcription Assistant

An AI-powered audio transcription tool that converts speech to text in multiple languages using OpenAI's Whisper model.

Why I Built This

In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?

Manual transcription is time-consuming and expensive. I wanted to build something that could:

  • Accept audio files in any format (MP3, WAV, etc.)
  • Transcribe them accurately using AI
  • Support multiple languages
  • Work locally on my Mac and on cloud GPUs (Google Colab)

That's where Whisper comes in—OpenAI's powerful speech recognition model.

Features

  • 📤 Upload any audio file (MP3, WAV, M4A, FLAC, etc.)
  • 🌍 11 languages supported, plus auto-detection
  • 🤖 Accurate AI-powered transcription using Whisper
  • 💻 Cross-platform - works on CPU (Mac) or GPU (Colab)
  • 🎨 Clean web interface built with Gradio
  • 🚀 Fast processing with optimized model settings

Tech Stack

  • OpenAI Whisper - Speech recognition model
  • Gradio - Web interface framework
  • PyTorch - Deep learning backend
  • NumPy - Numerical computing
  • ffmpeg - Audio file processing

Installation

Prerequisites

  • Python 3.12+
  • ffmpeg (for audio processing)
  • uv package manager (or pip)

Setup

  1. Clone this repository or download the notebook

  2. Install dependencies:

# Install compatible NumPy version
uv pip install --reinstall "numpy==1.26.4"

# Install PyTorch
uv pip install torch torchvision torchaudio

# Install Gradio and Whisper
uv pip install gradio openai-whisper ffmpeg-python

# (Optional) Install Ollama for LLM features
uv pip install ollama
  3. For Mac users, ensure ffmpeg is installed:
brew install ffmpeg
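
To confirm the environment before opening the notebook, a quick sanity check (a minimal sketch; run it in a fresh kernel):

# Sanity check: confirm the key packages import and ffmpeg is on the PATH
import shutil
import numpy, torch, gradio, whisper

print("numpy:", numpy.__version__)    # expect 1.26.4
print("torch:", torch.__version__)
print("gradio:", gradio.__version__)
print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND - run brew install ffmpeg")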

Usage

Running Locally

  1. Open the Jupyter notebook week3 EXERCISE_hopeogbons.ipynb

  2. Run all cells in order (a condensed sketch of the key cells follows these steps):

    • Cell 1: Install dependencies
    • Cell 2: Import libraries
    • Cell 3: Load Whisper model
    • Cell 4: Define transcription function
    • Cell 5: Build Gradio interface
    • Cell 6: Launch the app
  3. The app will automatically open in your browser

  4. Upload an audio file, select the language, and click Submit!
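
For reference, a condensed sketch of what cells 3-6 might look like. Function and widget names here are assumptions; the notebook itself is the source of truth.

import whisper
import gradio as gr

# Cell 3: load the "base" model (~140MB, downloaded on first run)
model = whisper.load_model("base")

# Cell 4: transcription function - Whisper auto-detects the language
# when none is specified
def transcribe(audio_path, language):
    options = {} if language == "Auto-detect" else {"language": language.lower()}
    result = model.transcribe(audio_path, **options)
    return result["text"]

# Cell 5: Gradio interface (dropdown abbreviated to three languages here)
app = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(type="filepath", label="Upload audio"),
        gr.Dropdown(
            ["Auto-detect", "English", "Spanish", "French"],
            value="Auto-detect",
            label="Language",
        ),
    ],
    outputs=gr.Textbox(label="Transcription"),
    title="🎙️ Audio Transcription Assistant",
)

# Cell 6: launch without blocking the notebook's event loop
app.launch(prevent_thread_lock=True)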

Running on Google Colab

For GPU acceleration:

  1. Open the notebook in Google Colab
  2. Runtime → Change runtime type → GPU (T4)
  3. Run all cells in order
  4. The model will automatically use GPU acceleration

Note: First run downloads the Whisper model (~140MB) - this is a one-time download.
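
Device selection can also be made explicit; a minimal sketch of what the model-loading cell might do on Colab:

import torch
import whisper

# Use the Colab GPU (e.g. T4) when present, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
print("Whisper running on:", device)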

Supported Languages

  • 🇬🇧 English
  • 🇪🇸 Spanish
  • 🇫🇷 French
  • 🇩🇪 German
  • 🇮🇹 Italian
  • 🇵🇹 Portuguese
  • 🇨🇳 Chinese
  • 🇯🇵 Japanese
  • 🇰🇷 Korean
  • 🇷🇺 Russian
  • 🇸🇦 Arabic
  • 🌐 Auto-detect
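
Auto-detect uses Whisper's built-in language identification on the first 30 seconds of audio. A minimal sketch (the filename is a placeholder):

import whisper

model = whisper.load_model("base")

# Load the audio and fit it to the 30-second window Whisper expects
audio = whisper.load_audio("example.mp3")  # placeholder filename
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and ask the model for language probabilities
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))  # e.g. "en"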

How It Works

  1. Upload - User uploads an audio file through the Gradio interface
  2. Process - ffmpeg decodes the audio file
  3. Transcribe - Whisper model processes the audio and generates text
  4. Display - Transcription is shown in the output box

The Whisper "base" model is used for a balance between speed and accuracy:

  • Fast enough for real-time use on CPU
  • Accurate enough for most transcription needs
  • Small enough (~140MB) for quick downloads
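
Other sizes trade speed for accuracy; whisper exposes the full list:

import whisper

# Model checkpoints shipped with Whisper, smallest to largest; ".en" variants
# are English-only and tend to be slightly more accurate for English audio
print(whisper.available_models())
# e.g. ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', ...]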

Example Transcriptions

The app successfully transcribed:

  • English podcast episodes
  • French language audio (detected and transcribed)
  • Multi-speaker conversations
  • Audio with background noise

What I Learned

Building this transcription assistant taught me:

  • Audio processing with ffmpeg and Whisper
  • Cross-platform compatibility (Mac CPU vs Colab GPU)
  • Dependency management (dealing with NumPy version conflicts!)
  • Async handling in Jupyter notebooks with Gradio
  • Model optimization (choosing the right Whisper model size)

The biggest challenge? Getting ffmpeg and NumPy to play nice together across different environments. But solving those issues made me understand the stack much better.

Troubleshooting

Common Issues

1. "No module named 'whisper'" error

  • Make sure you've installed openai-whisper, not just whisper
  • Restart your kernel after installation

2. "ffmpeg not found" error

  • Install ffmpeg: brew install ffmpeg (Mac) or sudo apt-get install ffmpeg (Debian/Ubuntu)

3. NumPy version conflicts

  • Use NumPy 1.26.4: uv pip install --reinstall "numpy==1.26.4"
  • Restart kernel after reinstalling

4. Gradio event loop errors

  • Use prevent_thread_lock=True in app.launch()
  • Restart kernel if errors persist
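
A minimal self-contained repro of the fix, using a stub interface:

import gradio as gr

# Stub interface just to demonstrate the launch flag
app = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")

# prevent_thread_lock returns control to the notebook cell instead of
# blocking on Gradio's server loop
app.launch(prevent_thread_lock=True)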

Future Enhancements

  • Support for real-time audio streaming
  • Speaker diarization (identifying different speakers)
  • Export transcripts to multiple formats (SRT, VTT, TXT)
  • Integration with LLMs for summarization
  • Batch processing for multiple files

Contributing

Feel free to fork this project and submit pull requests with improvements!

License

This project is open source and available under the MIT License.

Acknowledgments

  • OpenAI for the amazing Whisper model
  • Gradio team for the intuitive interface framework
  • Andela LLM Engineering Program for the learning opportunity

Built with ❤️ as part of the Andela LLM Engineering Program

For questions or feedback, feel free to reach out!