How does my real-time YouTube transcription tool work?
I am proud to share my latest open-source project: a tool that transcribes the YouTube videos you watch in real time, as they play.
Here is a step-by-step guide to how it works:
1. Capturing the sound coming out of your computer
The tool "listens" to exactly what your speakers are playing through a loopback device, usually a virtual audio cable. The best-known options are VB-Cable (Windows), BlackHole (Mac), or the built-in Stereo Mix input on Windows, when the sound driver provides it and it is enabled.
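To make the device selection concrete, here is a minimal sketch of how a loopback input could be picked out of the list that sounddevice returns. The name fragments in `LOOPBACK_HINTS` and the helper `find_loopback` are assumptions for illustration, not the project's actual code; device names vary by driver and version.

```python
# Assumed name fragments for common loopback inputs (they vary by driver).
LOOPBACK_HINTS = ("cable output", "blackhole", "stereo mix")


def find_loopback(devices, hints=LOOPBACK_HINTS):
    """Return (index, name) of the first input device matching a hint, or None.

    `devices` is any sequence of dicts with "name" and "max_input_channels"
    keys, which is the shape sounddevice.query_devices() returns.
    """
    for index, dev in enumerate(devices):
        name = dev["name"]
        is_input = dev["max_input_channels"] > 0
        if is_input and any(hint in name.lower() for hint in hints):
            return index, name
    return None


def find_loopback_on_this_machine():
    """Query the real device list (requires sounddevice and PortAudio)."""
    import sounddevice as sd  # imported lazily so the helper above stays testable
    return find_loopback(sd.query_devices())
```

The returned index can then be passed as the `device` argument when opening a sounddevice input stream.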
2. Cutting the sound into small pieces
Every ~5 seconds, the tool cuts off a piece of audio. That is short enough to stay fast and long enough for the AI to pick up the context.
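The cutting step can be sketched with a small numpy buffer that collects incoming audio blocks and releases fixed-length pieces. The class name `ChunkBuffer` and the 16 kHz sample rate are assumptions for this example (16 kHz mono is what Whisper models expect); the real tool may buffer differently.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono, the rate Whisper models expect
CHUNK_SECONDS = 5      # the ~5 second piece length described above


class ChunkBuffer:
    """Accumulates mono audio blocks and hands back fixed-length chunks."""

    def __init__(self, chunk_samples=SAMPLE_RATE * CHUNK_SECONDS):
        self.chunk_samples = chunk_samples
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, block):
        """Add a block of samples; return a list of complete chunks (maybe empty)."""
        block = np.asarray(block, dtype=np.float32).ravel()
        self._pending = np.concatenate([self._pending, block])
        chunks = []
        while len(self._pending) >= self.chunk_samples:
            chunks.append(self._pending[:self.chunk_samples])
            self._pending = self._pending[self.chunk_samples:]
        return chunks
```

Each audio callback would call `push()` with its block; most calls return an empty list, and about once every 5 seconds one full chunk comes out.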
3. Automatic transcription by local AI
Each small piece is sent to a Whisper model running through whisper.cpp. The model runs entirely on your machine → nothing sent to the cloud, no subscription, no data leaks.
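One possible way to hand a piece to whisper.cpp is to save it as a 16 kHz, 16-bit mono WAV file and call the command-line program on it. This is only a sketch: the binary path `./whisper-cli` and the model path are hypothetical placeholders (older builds name the binary `main`), and the project may well use Python bindings instead of a subprocess.

```python
import subprocess
import wave

import numpy as np

WHISPER_BIN = "./whisper-cli"            # hypothetical path to the whisper.cpp binary
MODEL_PATH = "models/ggml-base.en.bin"   # hypothetical path to a downloaded model


def write_wav(path, samples, sample_rate=16_000):
    """Save float samples in [-1, 1] as a 16-bit mono WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())


def build_command(wav_path, binary=WHISPER_BIN, model=MODEL_PATH):
    """Build the CLI call: -m model, -f input file, -nt to omit whisper's own timestamps."""
    return [binary, "-m", model, "-f", wav_path, "-nt"]


def transcribe(wav_path):
    """Run whisper.cpp locally and return the recognized text."""
    result = subprocess.run(
        build_command(wav_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Nothing in this flow touches the network: the audio goes to a local file and a local process.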
4. Automatic addition of timestamps
As soon as a piece has been transcribed, its time is noted in front of the text:
[00:03:42] And that's where the story gets really interesting...
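A timestamp in this format can be produced with a few lines of arithmetic. The sketch below assumes the time shown is the number of seconds elapsed since capture started; the function names are made up for the example.

```python
def format_timestamp(elapsed_seconds):
    """Turn elapsed seconds into a [HH:MM:SS] prefix."""
    total = int(elapsed_seconds)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{seconds:02d}]"


def stamp_line(elapsed_seconds, text):
    """Prefix a transcribed piece of text with its timestamp."""
    return f"{format_timestamp(elapsed_seconds)} {text.strip()}"
```

For example, `stamp_line(222, "And that's where the story gets really interesting...")` gives the line shown above, since 222 seconds is 3 minutes and 42 seconds.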
5. Progressive writing to a text file
Everything is appended continuously to a .txt file marked with the date and time of day. At the end of the video → you have the complete transcript, ready to copy and paste or reread.
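Progressive writing can be as simple as opening the file in append mode for every new line. In this sketch the date and time of day go into the file name, which is an assumption about how the tool marks its output; the naming pattern `transcript_...txt` is likewise invented for the example.

```python
from datetime import datetime
from pathlib import Path


def new_transcript_path(directory=".", now=None):
    """Build a file name carrying the date and time of day (assumed naming scheme)."""
    now = now or datetime.now()
    return Path(directory) / f"transcript_{now:%Y-%m-%d_%H-%M-%S}.txt"


def append_line(path, line):
    """Append one line and flush, so the file is usable even if the script is stopped."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()
```

Because each line is written and closed immediately, the transcript can be opened in an editor while the video is still playing.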
Quick summary & technologies used
In a nutshell:
You start a YouTube video → you run the script → you watch as usual → Ctrl+C to stop → you have the full time-stamped text in a file.
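The run-until-Ctrl+C flow can be sketched as a small loop that treats `KeyboardInterrupt` as the normal way to finish. The three callables are hypothetical stand-ins for the capture, transcription, and file-writing steps, and deriving the timestamp from the chunk count is an assumption made to keep the example short.

```python
def run_until_interrupted(next_chunk, transcribe, write_line, chunk_seconds=5):
    """Process chunks until Ctrl+C; return how many chunks were handled.

    next_chunk()            -> blocks until the next audio chunk is ready
    transcribe(chunk)       -> returns the recognized text (possibly empty)
    write_line(secs, text)  -> stores the text with its start time in seconds
    """
    chunks_done = 0
    try:
        while True:
            chunk = next_chunk()
            text = transcribe(chunk).strip()
            if text:  # skip silent chunks that produce no text
                write_line(chunks_done * chunk_seconds, text)
            chunks_done += 1
    except KeyboardInterrupt:
        pass  # Ctrl+C is the expected way to stop
    return chunks_done
```

Catching the interrupt inside the function lets the script exit cleanly instead of printing a traceback when the video ends.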
Technologies used:
- sounddevice: captures live audio
- whisper.cpp: fast, local speech recognition
- numpy: audio buffer management
Happy transcribing, everyone! 🚀