# 🎙️ Audio Transcription Assistant
An AI-powered audio transcription tool that converts speech to text in multiple languages using OpenAI's Whisper model.
## Why I Built This
In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?

Manual transcription is time-consuming and expensive. I wanted to build something that could:

- Accept audio files in any format (MP3, WAV, etc.)
- Transcribe them accurately using AI
- Support multiple languages
- Work locally on my Mac **and** on cloud GPUs (Google Colab)
That's where **Whisper** comes in—OpenAI's powerful speech recognition model.
## Features
- 📤 **Upload any audio file** (MP3, WAV, M4A, FLAC, etc.)
- 🌍 **12 language options**, including auto-detection
- 🤖 **Accurate AI-powered transcription** using Whisper
- ⚡ **Cross-platform** - works on CPU (Mac) or GPU (Colab)
- 🎨 **Clean web interface** built with Gradio
- 🚀 **Fast processing** with optimized model settings
## Tech Stack
- **OpenAI Whisper** - Speech recognition model
- **Gradio** - Web interface framework
- **PyTorch** - Deep learning backend
- **NumPy** - Numerical computing
- **ffmpeg** - Audio file processing
## Installation
### Prerequisites
- Python 3.12+
- ffmpeg (for audio processing)
- uv package manager (or pip)
### Setup
1. Clone this repository or download the notebook

2. Install dependencies:

```bash
# Install compatible NumPy version
uv pip install --reinstall "numpy==1.26.4"

# Install PyTorch
uv pip install torch torchvision torchaudio

# Install Gradio and Whisper
uv pip install gradio openai-whisper ffmpeg-python

# (Optional) Install Ollama for LLM features
uv pip install ollama
```
3. **For Mac users**, ensure ffmpeg is installed:

```bash
brew install ffmpeg
```
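
With setup complete, a quick sanity check can confirm that the packages import cleanly and the NumPy pin took effect (an optional, illustrative snippet; restart your kernel first if it was already running):

```python
# Verify the key packages import and report their versions.
import numpy
import torch
import gradio
import whisper  # provided by the openai-whisper package

print("NumPy:", numpy.__version__)    # expect 1.26.4
print("PyTorch:", torch.__version__)
print("Gradio:", gradio.__version__)
print("CUDA available:", torch.cuda.is_available())
```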
## Usage
### Running Locally
1. Open the Jupyter notebook `week3 EXERCISE_hopeogbons.ipynb`

2. Run all cells in order:
   - Cell 1: Install dependencies
   - Cell 2: Import libraries
   - Cell 3: Load Whisper model
   - Cell 4: Define transcription function
   - Cell 5: Build Gradio interface
   - Cell 6: Launch the app
3. The app will automatically open in your browser

4. Upload an audio file, select the language, and click Submit!
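
Condensed, the notebook's cells fit together roughly like this (a minimal sketch rather than the notebook verbatim; the function name, dropdown labels, and interface options are illustrative):

```python
import torch
import whisper
import gradio as gr

# Cell 3: load the base model once, on GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# Cell 4: transcription function; passing language=None lets Whisper auto-detect
def transcribe(audio_path, language):
    lang = None if language == "Auto-detect" else language.lower()
    result = model.transcribe(audio_path, language=lang)
    return result["text"]

# Cell 5: Gradio interface with a file upload and a language dropdown
LANGUAGES = ["Auto-detect", "English", "Spanish", "French", "German",
             "Italian", "Portuguese", "Chinese", "Japanese", "Korean",
             "Russian", "Arabic"]
app = gr.Interface(
    fn=transcribe,
    inputs=[gr.Audio(type="filepath"), gr.Dropdown(LANGUAGES, value="Auto-detect")],
    outputs=gr.Textbox(label="Transcription"),
    title="🎙️ Audio Transcription Assistant",
)

# Cell 6: launch without blocking the notebook's event loop
app.launch(prevent_thread_lock=True)
```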
### Running on Google Colab
For GPU acceleration:
1. Open the notebook in Google Colab
2. Runtime → Change runtime type → **GPU (T4)**
3. Run all cells in order
4. The model will automatically use GPU acceleration
**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.
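
Nothing in the code needs to change between environments; the device is picked at runtime. A hedged sketch of the pattern (the filename is a placeholder):

```python
import torch
import whisper

# On Colab with a T4 this resolves to "cuda"; on a Mac it falls back to "cpu".
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# fp16 decoding is GPU-only; forcing fp16=False on CPU avoids the
# "FP16 is not supported on CPU" warning.
result = model.transcribe("sample.mp3", fp16=(device == "cuda"))
print(result["text"])
```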
## Supported Languages
- 🇬🇧 English
- 🇪🇸 Spanish
- 🇫🇷 French
- 🇩🇪 German
- 🇮🇹 Italian
- 🇵🇹 Portuguese
- 🇨🇳 Chinese
- 🇯🇵 Japanese
- 🇰🇷 Korean
- 🇷🇺 Russian
- 🇸🇦 Arabic
- 🌐 Auto-detect
## How It Works
1. **Upload** - User uploads an audio file through the Gradio interface
2. **Process** - ffmpeg decodes the audio file
3. **Transcribe** - Whisper model processes the audio and generates text
4. **Display** - Transcription is shown in the output box
The Whisper `base` model is used for a balance between speed and accuracy:

- Fast enough for near-real-time use on CPU
- Accurate enough for most transcription needs
- Small enough (~140MB) for quick downloads
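
Note that `model.transcribe()` returns more than the text: when no language is passed, the detected language comes back in the result, and per-segment timestamps are included too (useful for the SRT/VTT export idea under Future Enhancements). A small illustration, with a placeholder filename:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("interview.mp3")  # no language given: Whisper auto-detects

print(result["language"])  # detected language code, e.g. "fr" for French
print(result["text"])      # the full transcript as one string

# Each segment carries start/end timestamps in seconds plus its text
for seg in result["segments"][:3]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```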
## Example Transcriptions
The app successfully transcribed:
- English podcast episodes
- French language audio (detected and transcribed)
- Multi-speaker conversations
- Audio with background noise
## What I Learned
Building this transcription assistant taught me:
- **Audio processing** with ffmpeg and Whisper
- **Cross-platform compatibility** (Mac CPU vs Colab GPU)
- **Dependency management** (dealing with NumPy version conflicts!)
- **Async handling** in Jupyter notebooks with Gradio
- **Model optimization** (choosing the right Whisper model size)
The biggest challenge? Getting ffmpeg and NumPy to play nice together across different environments. But solving those issues made me understand the stack much better.
## Troubleshooting
### Common Issues
**1. "No module named 'whisper'" error**
|
|
|
|
- Make sure you've installed `openai-whisper`, not just `whisper`
|
|
- Restart your kernel after installation
|
|
|
|
**2. "ffmpeg not found" error**
|
|
|
|
- Install ffmpeg: `brew install ffmpeg` (Mac) or `apt-get install ffmpeg` (Linux)
|
|
|
|
**3. NumPy version conflicts**

- Use NumPy 1.26.4: `uv pip install --reinstall "numpy==1.26.4"`
- Restart kernel after reinstalling
**4. Gradio event loop errors**

- Use `prevent_thread_lock=True` in `app.launch()` (see the snippet below)
- Restart kernel if errors persist
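
For reference, a launch call that plays nicely with Jupyter (`inbrowser=True` is optional):

```python
# prevent_thread_lock=True hands control back to the notebook instead of
# blocking on Gradio's server thread; inbrowser=True opens the UI automatically.
app.launch(prevent_thread_lock=True, inbrowser=True)
```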
## Future Enhancements
- [ ] Support for real-time audio streaming
- [ ] Speaker diarization (identifying different speakers)
- [ ] Export transcripts to multiple formats (SRT, VTT, TXT)
- [ ] Integration with LLMs for summarization
- [ ] Batch processing for multiple files
## Contributing
Feel free to fork this project and submit pull requests with improvements!
## License
This project is open source and available under the MIT License.
## Acknowledgments
- **OpenAI** for the amazing Whisper model
- **Gradio** team for the intuitive interface framework
- **Andela LLM Engineering Program** for the learning opportunity
---
**Built with ❤️ as part of the Andela LLM Engineering Program**
For questions or feedback, feel free to reach out!