Faster-Whisper: Real-Time Speech-to-Text on GeneralCompute

Faster-Whisper reimplements OpenAI's Whisper on CTranslate2 with INT8 inference, running several times faster at the same accuracy. Here is how it works, how streaming differs from batch transcription, and how it fits into a real-time STT to LLM to TTS voice pipeline.

Author: General Compute
Published: 2026-06-09
Tags: faster-whisper, speech-to-text, voice-ai, ctranslate2

Whisper is the model most teams reach for when they need speech-to-text, and for good reason. It is accurate, it handles dozens of languages, and the weights are open. The problem shows up the moment you try to use it in something interactive. The reference implementation from OpenAI is built on PyTorch, and on modest hardware it can take longer to transcribe a clip than the clip takes to play. That is fine for batch jobs over a podcast archive. It is a dealbreaker for a voice agent that has to answer someone in real time.

Faster-Whisper is the project that closes that gap. It is a reimplementation of Whisper on top of CTranslate2, and it runs the same models several times faster while using less memory and matching the reference implementation's accuracy. This post covers what it actually does, how INT8 quantization fits in, the difference between streaming and batch transcription, and where it sits in a full voice pipeline.

## What Faster-Whisper Is

Faster-Whisper is not a new model. It uses the exact same Whisper weights that OpenAI released, from tiny up through large-v3. What changes is the engine underneath. Instead of running the model through PyTorch, Faster-Whisper runs it through [CTranslate2](https://github.com/OpenNMT/CTranslate2), a C++ inference library originally built for fast transformer translation and later extended to cover speech and other architectures.

The practical result is a drop-in replacement that is faster and lighter. On the same hardware, transcribing the same audio with the large-v3 model, Faster-Whisper runs up to about four times faster than the reference implementation and uses noticeably less GPU memory. With INT8 quantization the memory footprint drops further, which lets you fit larger models on smaller GPUs or run more concurrent streams on the same card.

The API stays close to what you would expect:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)

print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

The `segments` generator is lazy. Transcription does not actually run until you start iterating, which matters for streaming, as we will get to. The `info` object gives you the detected language and a confidence score, plus the duration the model saw.
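To see the laziness for yourself, you can time the `transcribe` call separately from the iteration. The sketch below uses a hypothetical helper, `timed_transcribe`, which is not part of Faster-Whisper. It assumes `model` is a `WhisperModel` instance, or any object whose `transcribe` method returns a `(segments, info)` pair.

```python
import time


def timed_transcribe(model, audio, **kwargs):
    """Transcribe `audio` and report where the time is spent.

    Hypothetical helper: `model` is assumed to behave like
    faster_whisper.WhisperModel, returning (segments_generator, info).
    """
    t0 = time.perf_counter()
    # Setup work (loading audio, optional language detection) happens here.
    segments, info = model.transcribe(audio, **kwargs)
    t1 = time.perf_counter()

    # Consuming the generator is what actually decodes the segments.
    results = list(segments)
    t2 = time.perf_counter()

    timing = {"call_s": t1 - t0, "decode_s": t2 - t1}
    return results, info, timing
```

Calling `timed_transcribe(model, "audio.wav", beam_size=5)` should show most of the time under `decode_s`, which is the part a streaming loop pays for on every pass.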

## Why CTranslate2 Makes It Faster

The speedup is not magic. CTranslate2 applies a set of inference optimizations that the reference PyTorch path either skips or does less aggressively. A few of them carry most of the weight.

It uses optimized, fused kernels written for inference rather than the general-purpose operations PyTorch uses for both training and inference. Fusing operations means fewer trips to GPU memory and less kernel launch overhead, which adds up across the many layers of the encoder and decoder.

It supports lower-precision compute, including FP16, BF16, and INT8, and it picks efficient kernels for each. Whisper's encoder processes a fixed-size spectrogram, and the decoder generates text tokens one at a time, so both halves benefit from faster matrix multiplies at reduced precision.

It manages memory carefully, reusing buffers and keeping the working set small. This is where the lower memory usage comes from, and it is also part of why you can run more concurrent transcriptions on a single GPU.

The encoder runs once per audio chunk and is compute-heavy. The decoder runs autoregressively, one token per step, and is latency-sensitive in the same way an LLM's decode phase is. CTranslate2 optimizes both, which is why the end-to-end speedup holds across short clips and long files.

## INT8 and the Accuracy Question

The compute type you choose is the main knob you have for trading speed and memory against a small amount of accuracy. Faster-Whisper exposes it directly through the `compute_type` argument.

The common options are `float16` (the usual default on GPU), `int8_float16` (INT8 weights with FP16 compute for some operations), and `int8` (full INT8, which is the typical choice on CPU). Quantizing the weights to INT8 roughly halves the memory the model needs and speeds up the matrix multiplies, because 8-bit integer operations move less data and run faster on hardware that supports them.

The accuracy cost is usually small. For most audio, an INT8 large-v3 transcribes within a hair of the FP16 version, and the difference is invisible in normal use. Where you can see it is on hard audio: heavy accents, noisy backgrounds, overlapping speakers, or domain-specific terminology. There the small precision loss can nudge the word error rate up. The right move is to test on audio that looks like your real traffic rather than on clean read speech, because clean speech hides the degradation that messy audio reveals.

```python
# INT8 on CPU, useful when you have no GPU or want to pack many streams
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

# INT8 weights with FP16 compute on GPU, a good speed/quality balance
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
```
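To put a number on the accuracy cost, a common approach is to transcribe the same clips at each precision and compare the word error rate against a human reference transcript. The function below is a minimal sketch of WER as word-level edit distance. It only lowercases and splits on whitespace, so punctuation and number formatting count as errors unless you normalize them first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Run it over a sample of your own noisy or accented audio for both the `float16` and `int8_float16` transcripts. If the two scores are close, the quantized model is probably safe for that traffic.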

A useful pattern is to pick the model size first, then the precision. A smaller model at higher precision and a larger model at INT8 can land at similar memory budgets, and which one transcribes your audio better is an empirical question. In practice, a larger model quantized to INT8 often beats a smaller model at FP16 on the same memory, because Whisper's accuracy scales strongly with size.
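The memory side of that comparison is simple arithmetic. The sketch below estimates weight storage only, using the approximate parameter counts published for the Whisper model family. It ignores activations, decoder caches, and the tensors that stay in floating point after quantization, so treat the results as a lower bound rather than a measurement.

```python
# Approximate parameter counts, in millions, for the Whisper model family.
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v3": 1550,
}

# Bytes used to store each weight at a given compute type.
BYTES_PER_WEIGHT = {
    "float32": 4,
    "float16": 2,
    "int8_float16": 1,  # weights are stored as INT8
    "int8": 1,
}


def weight_memory_gb(model_size: str, compute_type: str) -> float:
    """Rough weight-only footprint in gigabytes (10^9 bytes)."""
    params = PARAMS_MILLIONS[model_size] * 1e6
    return params * BYTES_PER_WEIGHT[compute_type] / 1e9


for size, ctype in [("medium", "float16"), ("large-v3", "int8"), ("large-v3", "float16")]:
    print(f"{size:9s} {ctype:8s} ~{weight_memory_gb(size, ctype):.2f} GB")
```

By this estimate `medium` at FP16 and `large-v3` at INT8 both need roughly 1.5 GB for weights, which is the matched-budget comparison worth running on your own audio.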

## Streaming Versus Batch

There are two very different ways to use Faster-Whisper, and confusing them is a common source of frustration.

Batch transcription is the simple case. You have a complete audio file, you hand the whole thing to the model, and you get back the full transcript. This is what you want for transcribing recordings: meetings, calls, video, archives. You care about total throughput and accuracy, not about latency on the first word. Batch mode gives the model full context for every window, and it pairs well with the VAD (voice activity detection) filter, which skips silence. That both speeds things up and avoids hallucinated text during quiet stretches.

```python
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
```

Streaming transcription is the hard case, and it is what real-time voice needs. The audio arrives a chunk at a time from a microphone or a phone call, and you want partial transcripts as the person is still talking, not a single result after they stop. Whisper was not designed as a streaming model. It expects a 30-second window and processes it as a unit. To make it feel real-time, you run it on a sliding buffer of recent audio, transcribe that buffer repeatedly as new audio comes in, and reconcile the overlapping results.

The trade-offs in a streaming setup come down to chunk size. Short chunks give you lower latency but less context per inference, which hurts accuracy and can produce unstable partial transcripts that rewrite themselves as more audio arrives. Long chunks are more accurate but make the user wait. Most real-time systems use a VAD to detect when someone has finished an utterance and finalize the transcript at that boundary, while emitting rough partials in between for responsiveness. The VAD endpoint is doing double duty here: it stabilizes the transcript and it tells the rest of your pipeline when it is the agent's turn to respond.
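One simple way to tame unstable partials is an agreement rule, sometimes called a local-agreement policy: a word is treated as stable only once two consecutive passes over the growing buffer produce it in the same position. The class below is a minimal sketch of that idea, not something Faster-Whisper provides. `PartialStabilizer` is a hypothetical name, and the sketch assumes each new hypothesis covers the same utterance from its start, so the buffer is not trimmed mid-utterance.

```python
class PartialStabilizer:
    """Commit words once two consecutive hypotheses agree on them."""

    def __init__(self):
        self.committed = []  # words that will not be rewritten
        self._previous = []  # unconfirmed words from the last pass

    def update(self, hypothesis: str):
        """Feed the latest transcript; return (stable_text, tentative_text)."""
        words = hypothesis.split()
        # Only look past what is already committed. Simplification: a later
        # pass that rewrites a committed word is ignored.
        tail = words[len(self.committed):]

        agreed = 0
        for old, new in zip(self._previous, tail):
            if old != new:
                break
            agreed += 1

        self.committed.extend(tail[:agreed])
        self._previous = tail[agreed:]
        return " ".join(self.committed), " ".join(self._previous)

    def reset(self):
        """Call at an utterance boundary, after finalizing."""
        self.committed, self._previous = [], []
```

In a UI you would render the stable text normally and the tentative text in a lighter style, so rewrites only ever touch the tentative part. The cost is that each word appears as stable one pass later than it otherwise would.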

A practical streaming loop looks roughly like this:

```python
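import numpy as np

SAMPLE_RATE = 16_000      # faster-whisper expects 16 kHz mono float32 arrays
MAX_BUFFER_SECONDS = 30   # Whisper's window length; cap the rolling buffer here

# audio_stream yields raw float32 PCM chunks from your mic or call.
# audio_stream, emit_partial, emit_final, and utterance_ended are placeholders
# for your own capture, output, and VAD code.
buffer = np.array([], dtype=np.float32)

for chunk in audio_stream:
    # append the new audio and keep only the most recent window as context
    buffer = np.concatenate([buffer, chunk])[-MAX_BUFFER_SECONDS * SAMPLE_RATE:]

    # transcribe the rolling buffer; strip each segment so joins are clean
    segments, _ = model.transcribe(buffer, beam_size=1, language="en")
    partial = " ".join(s.text.strip() for s in segments)
    emit_partial(partial)

    # when the VAD says the utterance ended, finalize and reset
    if utterance_ended(buffer):
        emit_final(partial)
        buffer = np.array([], dtype=np.float32)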
```

Note `beam_size=1` here. Greedy decoding is faster than beam search and the latency matters more than the small accuracy gain in a live setting. For batch jobs you would raise the beam size back up.
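The loop leaves `utterance_ended` undefined. As a stand-in for experiments, here is a minimal energy-threshold sketch: it reports an endpoint when the buffer contains some speech-level audio followed by a stretch of near-silence. This is a deliberately crude substitute for a trained VAD such as Silero, and the threshold, frame size, and silence duration are assumed values that would need tuning against your own audio.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumes 16 kHz mono float32 samples in [-1, 1]


def _rms(samples: np.ndarray) -> float:
    """Root-mean-square level of a block of samples."""
    return float(np.sqrt(np.mean(np.square(samples.astype(np.float64)))))


def utterance_ended(buffer: np.ndarray, silence_ms: int = 500,
                    threshold: float = 0.01,
                    sample_rate: int = SAMPLE_RATE) -> bool:
    """True when speech was heard and the trailing `silence_ms` is quiet."""
    tail_len = int(sample_rate * silence_ms / 1000)
    if len(buffer) <= tail_len:
        return False  # not enough audio to judge yet

    head, tail = buffer[:-tail_len], buffer[-tail_len:]

    # Check the earlier audio in 30 ms frames for anything above the threshold.
    frame = int(sample_rate * 0.03)
    heard_speech = any(
        _rms(head[i:i + frame]) >= threshold
        for i in range(0, len(head), frame)
    )
    return heard_speech and _rms(tail) < threshold
```

An energy gate like this will misfire on background noise and on quiet speakers, which is why a production endpoint detector usually relies on a model-based VAD instead.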

## The STT to LLM to TTS Pipeline

Real-time speech-to-text rarely stands alone. It is usually the front of a voice agent: the user speaks, you transcribe (STT), you send the text to a language model (LLM), the model responds, and you speak that response back with text-to-speech (TTS). The user's sense of how fast the system is comes from the sum of those three stages plus the network between them.

This is where inference speed across the whole stack starts to matter, because the latencies add up. A rough budget for a conversational agent that feels natural is on the order of 800 milliseconds to a second from the moment the user stops talking to the moment they hear a response. Push past that and the conversation starts to feel like a walkie-talkie. Within that budget you have to fit:

- **Endpoint detection.** Deciding the user has actually finished, not just paused. This is the VAD, and tuning it is its own problem: too eager and you cut people off, too patient and every exchange feels sluggish.
- **Final transcription.** Faster-Whisper turning the last chunk of audio into text. With a quantized model on a fast GPU this can be tens of milliseconds for a short utterance, though the exact figure depends on model size and hardware.
- **LLM generation.** The model reading the transcript and producing a response. This is usually the largest and most variable slice, and it is dominated by time to first token plus how fast the model streams.
- **TTS.** Turning the response text into audio, ideally streaming the first words out before the full response is generated.
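A budget like this is easier to reason about when it is written down as numbers. The sketch below sums per-stage latencies and reports the headroom against the lower end of the target. Every figure in `stage_ms` is a hypothetical placeholder, not a measurement, and the `network` entry is an added assumption for the hops between stages.

```python
BUDGET_MS = 800  # lower end of the 800 ms to 1 s target

# Hypothetical per-stage latencies in milliseconds; replace with measurements.
stage_ms = {
    "endpoint_detection": 200,
    "final_transcription": 60,
    "llm_first_token": 250,
    "tts_first_audio": 150,
    "network": 80,
}


def budget_report(stages: dict, budget_ms: int = BUDGET_MS) -> dict:
    """Total the stages and show what is left of the budget."""
    total = sum(stages.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "slowest": max(stages, key=stages.get),
    }


print(budget_report(stage_ms))

# The same pipeline with a slow transcription step.
slow_stt = dict(stage_ms, final_transcription=400)
print(budget_report(slow_stt))
```

With these placeholder numbers the first pipeline fits with 60 ms to spare, while the 400 ms transcription variant overshoots the budget by 280 ms.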

Two things keep this budget achievable. The first is streaming everything. You do not wait for the full transcript before sending text to the LLM, and you do not wait for the full LLM response before starting TTS. You pipeline the stages so the first audio of the answer starts playing while the later words are still being generated. The second is fast inference at each stage, because the slowest stage sets the floor. A transcription step that takes 400 milliseconds eats half your budget before the LLM has even seen the text.
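On the LLM-to-TTS side, pipelining usually means cutting the token stream into speakable units as it arrives. The generator below is a minimal sketch of that idea: it yields each sentence as soon as the punctuation that ends it is followed by whitespace. `sentences_from_tokens` is a hypothetical helper, and the sentence rule is naive, so abbreviations such as "Dr. Smith" will split early.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at . ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"([.!?])\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments."""
    pending = ""
    for token in tokens:
        pending += token
        while True:
            match = _SENTENCE_END.search(pending)
            if match is None:
                break
            cut = match.end(1)  # index just past the punctuation mark
            yield pending[:cut].strip()
            pending = pending[cut:].lstrip()
    # Flush whatever is left when the stream ends.
    if pending.strip():
        yield pending.strip()
```

Each yielded sentence can go straight to the TTS engine, so the first audio starts while the model is still generating the rest of the reply.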

This is the reason the model engine matters as much as the model. Faster-Whisper makes the STT stage cheap enough to leave room for the LLM, and running the LLM stage on infrastructure tuned for low time-to-first-token and high token throughput is what keeps the whole loop inside the window. If any single stage runs on slow inference, the conversation drags regardless of how good the others are. Serving the language model on hardware built for latency is what makes a sub-second voice loop realistic rather than aspirational.

## Practical Notes

A few things worth knowing before you ship.

The VAD filter is worth turning on for almost everything. Whisper has a known tendency to hallucinate text during silence, repeating a phrase or inventing one when there is nothing to transcribe. Filtering out non-speech with the built-in Silero VAD both removes those hallucinations and saves compute by not running the model on quiet audio.

Language matters for latency. If you know the audio is English, set `language="en"` instead of letting the model detect it. Language detection runs an extra pass over the first chunk, and skipping it shaves time off the first result.

Model size is the biggest lever on the speed-accuracy curve. For real-time voice on clean audio, the `small` or `medium` models are often fast and accurate enough, and they leave far more latency budget for the LLM than large-v3 does. For batch transcription of important recordings, use large-v3 and a higher beam size, since you are not racing a clock.

Concurrency is where INT8 pays off twice. The smaller memory footprint lets you run more simultaneous transcription streams on one GPU, which is exactly what you need when a voice product has many calls happening at once. Sizing this is a throughput exercise: measure how many concurrent streams a single GPU sustains at your target latency, then scale out from there.
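Once you have load-test results, the sizing step is mechanical. The sketch below picks the highest concurrency level whose p95 latency stays within a target, then estimates a GPU count for a given peak load. The measurements are hypothetical placeholders, and the helper names are illustrative rather than part of any library.

```python
import math


def p95(latencies_ms):
    """Nearest-rank 95th percentile."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def max_streams_within(measurements, target_p95_ms):
    """Highest tested concurrency whose p95 meets the target (0 if none)."""
    best = 0
    for streams in sorted(measurements):
        if p95(measurements[streams]) > target_p95_ms:
            break  # latency grows with load, so stop at the first miss
        best = streams
    return best


def gpus_needed(peak_streams, streams_per_gpu):
    """GPUs required to serve the peak, rounding up."""
    if streams_per_gpu <= 0:
        raise ValueError("no concurrency level met the latency target")
    return math.ceil(peak_streams / streams_per_gpu)


# Hypothetical load-test results: concurrent streams -> latencies in ms.
measurements = {
    1: [40, 42, 45, 41],
    4: [55, 60, 58, 70],
    8: [90, 110, 95, 130],
    16: [200, 260, 240, 300],
}

per_gpu = max_streams_within(measurements, target_p95_ms=150)
print(per_gpu, gpus_needed(100, per_gpu))
```

With these placeholder numbers a single GPU sustains 8 streams under a 150 ms target, so 100 simultaneous calls would need 13 GPUs. Real runs need far more than four samples per level for the percentile to mean anything.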

## Trying It

Faster-Whisper is one of the easier wins in a voice stack because it is a drop-in speedup. You keep the Whisper weights and accuracy you already trust, and you get several times the speed and lower memory for the cost of switching the engine. Start with a `medium` or `large-v3` model at `int8_float16` on a GPU, turn on the VAD filter, and measure the real latency on audio that looks like your traffic.

The harder part is the rest of the pipeline. The transcription stage is only fast enough to matter if the language model behind it is also fast, because the user hears the sum of every stage. If you are building a voice agent and want the LLM half of the loop to fit inside a sub-second budget, you can point an OpenAI-compatible client at [General Compute](https://generalcompute.com) and run the generation step on inference tuned for low latency and high throughput. Pair a quantized Faster-Whisper front end with a fast LLM backend and the conversation starts to feel like a conversation rather than a series of pauses.