STT (Speech-to-Text)
Definition
AI technology that converts spoken audio into written text. Also called ASR (Automatic Speech Recognition).
Why It Matters
STT is the gateway for every workflow that starts with spoken input: meeting notes, podcast transcripts, voicemail-to-text, voice-controlled apps, and accessibility captions. Modern STT is near-human accurate on clean audio, and on noisy audio its accuracy declines gradually rather than collapsing.
Key Points
- Whisper large-v3 (released November 2023): 1.5B parameters, 99-language support, WER ~3 % on English broadcast audio.
- Word Error Rate (WER) = (substitutions + deletions + insertions) / total reference words. Below 10 % is usable; below 5 % is good.
- Speaker diarisation (who said what, when) is handled by a separate model; pyannote.audio is the most widely used open-source option.
- Real-time STT requires streaming models; Whisper processes audio in fixed 30-second windows and is not natively streaming. Faster-Whisper, a CTranslate2 reimplementation of Whisper, cuts latency but is still chunk-based; Parakeet is a streaming-capable alternative.
- Output formats: plain text, SRT subtitles, VTT subtitles, JSON with per-word timestamps and confidence scores.
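The WER formula in the list above can be computed with a word-level edit distance. The following is a minimal sketch, assuming whitespace tokenisation and exact string matching; real evaluations usually normalise case and punctuation first. The function name `word_error_rate` is illustrative and does not come from any particular library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / total reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100 % when the hypothesis contains many extra words.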
Example
Whisper large-v3, paired with a separate diarisation model, transcribes a 1-hour meeting in ~5 minutes on a single A100. Word error rate on conference-room audio runs 3–8 %; on phone-call audio, 8–15 %. Output can be plain text, SRT subtitles, or JSON with timestamps.
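To show how timestamped output maps to SRT subtitles, here is a minimal sketch. It assumes each segment is a dict with `start` and `end` in seconds and a `text` string; that shape is an assumption for illustration, not the guaranteed output of any specific tool, and the function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical segments, as a transcription step might produce them.
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello everyone."},
        {"start": 2.5, "end": 4.0, "text": "Let's begin."},
    ]
    print(segments_to_srt(demo))
```

VTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.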
Common Misconception
The misconception is that published benchmark accuracy carries over to any recording. Benchmark figures are typically measured on clean, domain-specific audio. Accent, background noise, overlapping speakers, and technical vocabulary (medical, legal, financial) can each independently raise WER by 5–20 percentage points above the published benchmark figure.
Related Terms
- TTS (Text-to-Speech): AI technology that converts written text into natural-sounding spoken audio.
- Computer Vision: AI that can understand and analyze images and video content.
- Multimodal AI: AI models that can process multiple types of input: text, images, audio, video.
STT (Speech-to-Text) on Rewind.ai
Rewind.ai's transcribe tool runs Whisper large-v3 + Parakeet for speed–quality trade-offs. Upload audio; output is plain text, SRT, VTT, or timestamped JSON.
Quick Facts
| Term | STT (Speech-to-Text) |
| Related | TTS (Text-to-Speech), Computer Vision, Multimodal AI |