Transcripts will be commonplace

I began transcribing podcast episodes four years ago. I used Otter.ai, a transcription service that's grown into a product that transcribes meetings, captures slides, and generates summaries. Today, there are several machine-learning-based transcription offerings, some of which are free if you run them on your device.

We're almost at a point where artificial intelligence can transcribe large audio pieces without making mistakes.

I've used Descript¹ (paid) heavily over the past few years and explored alternatives. Descript transcribes and lets you edit audio and video by editing text. Transcriptions are highly accurate, but even a small error rate means serious editing if you want transcripts of hours-long conversations to be correct, especially when they include technical keywords and niche words. These tools let you provide sample text and a glossary of words in your audio to improve their accuracy.²
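As a rough sketch of what providing a glossary can look like with Whisper's command-line interface, the snippet below joins a few niche terms into a short prompt and passes it through the --initial_prompt parameter. The file name, the glossary terms, the "Glossary:" wording, and the helper names are made up for illustration, and the command is only printed, not run.

```python
import shlex


def build_prompt(glossary):
    """Join glossary terms into one sentence to give Whisper as context."""
    return "Glossary: " + ", ".join(glossary) + "."


def build_whisper_command(audio_path, glossary, model="base"):
    """Return the argument list for a Whisper CLI call (not executed here)."""
    return [
        "whisper", audio_path,
        "--model", model,
        "--initial_prompt", build_prompt(glossary),
    ]


# Hypothetical episode file and niche terms.
command = build_whisper_command("episode.mp3", ["Descript", "diarization", "Whisper"])
print(shlex.join(command))
```

How much a prompt like this helps will depend on the audio and the terms, so it's worth comparing a transcript with and without it.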

These workflows are still much better (and faster) than transcribing manually. If you don't have time to fix mistakes, you can often get away with adding a disclaimer that "transcripts were automatically generated and may contain errors."

In September 2022, OpenAI trained and open-sourced a neural network called Whisper that, in their own words, "approaches human level robustness and accuracy on English speech recognition." That's a big step. The community can extend Whisper and use it for free.³ (I've been using Whisper and played with it on Live 98.)
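To put the hosted option in perspective, here's a small sketch that estimates what a transcription would cost at the $0.006 per minute quoted in the footnote. The function name is mine, and the price may have changed since this was written.

```python
PRICE_PER_MINUTE_USD = 0.006  # rate quoted in the footnote; treat it as an assumption


def transcription_cost(hours, price_per_minute=PRICE_PER_MINUTE_USD):
    """Estimated cost in dollars of transcribing `hours` of audio."""
    return round(hours * 60 * price_per_minute, 2)


for hours in (1, 3, 30):
    print(f"{hours} h of audio: ${transcription_cost(hours):.2f}")
```

At that rate, an hour of audio comes to about $0.36 and thirty hours to about $10.80.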

These systems can predict word-level timestamps—making it possible to highlight the exact word spoken at a given time—and perform something called speaker diarization, a fancy way to say that the AI knows who's talking when by identifying the active speaker.
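Here's a minimal sketch of how word-level timestamps and speaker turns could be used together: given a playback time, look up the word to highlight and who is speaking. The sample data and function names are invented for illustration and don't come from any particular tool's output format.

```python
from bisect import bisect_right

# Made-up sample data, sorted by start time and non-overlapping.
# Words: (start_seconds, end_seconds, word)
words = [
    (0.00, 0.40, "Transcripts"),
    (0.40, 0.65, "will"),
    (0.65, 0.80, "be"),
    (0.80, 1.50, "commonplace"),
    (1.60, 2.00, "Indeed"),
]
# Diarization turns: (start_seconds, end_seconds, speaker)
turns = [(0.00, 1.50, "Speaker A"), (1.50, 3.00, "Speaker B")]


def find_at(intervals, t):
    """Return the label of the interval containing time t, or None."""
    starts = [start for start, _, _ in intervals]
    i = bisect_right(starts, t) - 1  # last interval starting at or before t
    if i >= 0 and t < intervals[i][1]:
        return intervals[i][2]
    return None  # t falls in a pause or outside the recording


def word_and_speaker_at(t):
    """Word to highlight and active speaker at playback time t."""
    return find_at(words, t), find_at(turns, t)


print(word_and_speaker_at(0.5))
print(word_and_speaker_at(1.7))
```

A player could call something like this as the audio advances to highlight the current word and label the speaker.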

Soon enough, transcripts will be commonplace. Transcripts will be free, automatic, and accurate; we'll expect them to be there.

Indeed, Spotify is already transcribing trending podcasts, and YouTube generates captions for every video. I imagine WhatsApp will transcribe voice notes so you can read them when you can't play them.

Transcripts help listeners follow content, browse through long pieces, or refer to particular points of a conversation. But they also give editors a way to navigate episodes quickly and get an idea of their content, making it easier to edit by removing or moving blocks of audio around to make a conversation more fluid. They also help editors write episode descriptions, notes, and chapters, and machine learning is starting to do these tasks for us automatically, which is exciting.

Soon enough, we'll hit record, delegate all this manual labor to the machine⁴, and focus on our next piece of content when done.

¹ Descript is a paid service. Their Pro subscription comes with thirty hours of monthly transcription.
² Descript has a Glossary of words for this purpose, and Whisper's command-line interface accepts a parameter called --initial_prompt to provide text style and uncommon words.
³ Whisper is also available as a cloud service that costs a little over half a cent per minute of audio transcribed ($0.006/min). You can browse OpenAI's API pricing page.
⁴ OpenAI's GPT-3.5-turbo and GPT-4, the models behind ChatGPT, can perform these tasks. You can ask them to summarize a text or extract keywords and topics from a paragraph. OneAI already offers a service to extract relevant keywords and generate text summaries.

March 28, 2023
Nono Martínez Alonso
