# ๐ŸŽ™๏ธ Audio Transcription Assistant An AI-powered audio transcription tool that converts speech to text in multiple languages using OpenAI's Whisper model. ## Why I Built This In today's content-driven world, audio and video are everywhereโ€”podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings? Manual transcription is time-consuming and expensive. I wanted to build something that could: - Accept audio files in any format (MP3, WAV, etc.) - Transcribe them accurately using AI - Support multiple languages - Work locally on my Mac **and** on cloud GPUs (Google Colab) That's where **Whisper** comes inโ€”OpenAI's powerful speech recognition model. ## Features - ๐Ÿ“ค **Upload any audio file** (MP3, WAV, M4A, FLAC, etc.) - ๐ŸŒ **12+ languages supported** with auto-detection - ๐Ÿค– **Accurate AI-powered transcription** using Whisper - โšก **Cross-platform** - works on CPU (Mac) or GPU (Colab) - ๐ŸŽจ **Clean web interface** built with Gradio - ๐Ÿš€ **Fast processing** with optimized model settings ## Tech Stack - **OpenAI Whisper** - Speech recognition model - **Gradio** - Web interface framework - **PyTorch** - Deep learning backend - **NumPy** - Numerical computing - **ffmpeg** - Audio file processing ## Installation ### Prerequisites - Python 3.12+ - ffmpeg (for audio processing) - uv package manager (or pip) ### Setup 1. Clone this repository or download the notebook 2. Install dependencies: ```bash # Install compatible NumPy version uv pip install --reinstall "numpy==1.26.4" # Install PyTorch uv pip install torch torchvision torchaudio # Install Gradio and Whisper uv pip install gradio openai-whisper ffmpeg-python # (Optional) Install Ollama for LLM features uv pip install ollama ``` 3. **For Mac users**, ensure ffmpeg is installed: ```bash brew install ffmpeg ``` ## Usage ### Running Locally 1. Open the Jupyter notebook `week3 EXERCISE_hopeogbons.ipynb` 2. Run all cells in order: - Cell 1: Install dependencies - Cell 2: Import libraries - Cell 3: Load Whisper model - Cell 4: Define transcription function - Cell 5: Build Gradio interface - Cell 6: Launch the app 3. The app will automatically open in your browser 4. Upload an audio file, select the language, and click Submit! ### Running on Google Colab For GPU acceleration: 1. Open the notebook in Google Colab 2. Runtime โ†’ Change runtime type โ†’ **GPU (T4)** 3. Run all cells in order 4. The model will automatically use GPU acceleration **Note:** First run downloads the Whisper model (~140MB) - this is a one-time download. ## Supported Languages - ๐Ÿ‡ฌ๐Ÿ‡ง English - ๐Ÿ‡ช๐Ÿ‡ธ Spanish - ๐Ÿ‡ซ๐Ÿ‡ท French - ๐Ÿ‡ฉ๐Ÿ‡ช German - ๐Ÿ‡ฎ๐Ÿ‡น Italian - ๐Ÿ‡ต๐Ÿ‡น Portuguese - ๐Ÿ‡จ๐Ÿ‡ณ Chinese - ๐Ÿ‡ฏ๐Ÿ‡ต Japanese - ๐Ÿ‡ฐ๐Ÿ‡ท Korean - ๐Ÿ‡ท๐Ÿ‡บ Russian - ๐Ÿ‡ธ๐Ÿ‡ฆ Arabic - ๐ŸŒ Auto-detect ## How It Works 1. **Upload** - User uploads an audio file through the Gradio interface 2. **Process** - ffmpeg decodes the audio file 3. **Transcribe** - Whisper model processes the audio and generates text 4. 
## Example Transcriptions

The app has successfully transcribed:

- English podcast episodes
- French-language audio (auto-detected and transcribed)
- Multi-speaker conversations
- Audio with background noise

## What I Learned

Building this transcription assistant taught me:

- **Audio processing** with ffmpeg and Whisper
- **Cross-platform compatibility** (Mac CPU vs. Colab GPU)
- **Dependency management** (dealing with NumPy version conflicts!)
- **Async handling** in Jupyter notebooks with Gradio
- **Model optimization** (choosing the right Whisper model size)

The biggest challenge? Getting ffmpeg and NumPy to play nicely together across different environments. But solving those issues gave me a much better understanding of the stack.

## Troubleshooting

### Common Issues

**1. "No module named 'whisper'" error**

- Make sure you've installed `openai-whisper`, not just `whisper`
- Restart your kernel after installation

**2. "ffmpeg not found" error**

- Install ffmpeg: `brew install ffmpeg` (Mac) or `apt-get install ffmpeg` (Linux)

**3. NumPy version conflicts**

- Pin NumPy to 1.26.4: `uv pip install --reinstall "numpy==1.26.4"`
- Restart the kernel after reinstalling

**4. Gradio event loop errors**

- Pass `prevent_thread_lock=True` to `app.launch()`
- Restart the kernel if errors persist

## Future Enhancements

- [ ] Support for real-time audio streaming
- [ ] Speaker diarization (identifying different speakers)
- [ ] Export transcripts to multiple formats (SRT, VTT, TXT)
- [ ] Integration with LLMs for summarization
- [ ] Batch processing for multiple files

## Contributing

Feel free to fork this project and submit pull requests with improvements!

## License

This project is open source and available under the MIT License.

## Acknowledgments

- **OpenAI** for the amazing Whisper model
- **Gradio** team for the intuitive interface framework
- **Andela LLM Engineering Program** for the learning opportunity

---

**Built with ❤️ as part of the Andela LLM Engineering Program**

For questions or feedback, feel free to reach out!