Build a Real-Time Voice AI Agent with General Compute
Most voice AI feels sluggish. You say something, wait a beat too long, and the illusion of a natural conversation breaks. The problem usually isn't the speech-to-text or text-to-speech models. It's the LLM inference in the middle.
In this tutorial, we'll build a real-time voice AI agent that responds in under 500ms end-to-end. We'll also show something that no other inference provider can currently offer: using a reasoning model in a voice pipeline without blowing through your latency budget.
How Voice AI Agents Work
A voice AI agent is a three-stage pipeline:
- Speech-to-Text (STT): Converts the user's audio into text. Typical latency: 100-300ms.
- LLM Inference: Processes the transcribed text and generates a response. Typical latency: 200-2000ms.
- Text-to-Speech (TTS): Converts the LLM's text response back into audio. Typical latency: 100-300ms.
The LLM step accounts for 50-70% of total latency in most setups. Human conversational turn-taking has a natural gap of about 200-300ms. Anything above a second feels like you're talking to someone on a bad connection. Anything above two seconds and users start checking if the thing is frozen.
The critical insight: TTS needs to start playing as soon as the first tokens arrive from the LLM. You stream tokens out of the model and into the speech synthesizer in real time. This means time-to-first-token (TTFT) matters more than total generation time for perceived responsiveness. In voice, TTFT directly determines time-to-first-audio-token (TTFAT), which is what the user actually perceives.
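A common way to exploit this is sentence-level chunking: buffer the token stream until a sentence boundary, then hand each complete sentence to the synthesizer so audio starts with the first finished sentence instead of the full response. A minimal sketch in plain Python, independent of any particular TTS API:

```python
import re

def chunk_for_tts(token_stream):
    """Group streamed LLM tokens into sentence-sized chunks for TTS.

    Yields a chunk as soon as a sentence boundary appears, so speech
    synthesis can begin before the model finishes generating.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush every complete sentence as soon as its boundary appears
        match = re.search(r"[.!?]\s", buffer)
        while match:
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            match = re.search(r"[.!?]\s", buffer)
    if buffer.strip():
        yield buffer.strip()

tokens = ["Sure", ", I can ", "help. ", "What time ", "works for you?"]
print(list(chunk_for_tts(tokens)))
# ['Sure, I can help.', 'What time works for you?']
```

Frameworks like Pipecat do this buffering for you; the sketch just shows why TTFT, not total generation time, sets the floor on perceived latency.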
The Reasoning Model Problem in Voice AI
Here's something that doesn't get talked about enough in the voice AI space: everyone is stuck using basic chat models.
Reasoning models (DeepSeek R1, Qwen QwQ, models with chain-of-thought) produce significantly better answers for complex queries. They think through problems step by step before responding. For a customer support agent that needs to reason about a billing issue, or a medical triage bot that needs to weigh symptoms, the quality difference between a standard chat model and a reasoning model is substantial.
But reasoning models have a problem for voice: they think before they speak. That thinking phase adds hundreds of milliseconds to multiple seconds of latency before the first useful token comes out. On most inference providers, the TTFT for a reasoning model is so high that it completely destroys the conversational experience. You'd be asking users to sit in silence for 3-5 seconds while the model thinks. That's unusable for voice.
This is why virtually every voice AI company today is limited to standard chat models. The TTFAT budget is too tight for reasoning on slow infrastructure.
With General Compute, the math changes. Our inference is fast enough that you can run a reasoning model and still hit voice-grade latency targets. The thinking phase that takes 2-3 seconds on other providers happens in a few hundred milliseconds on our infrastructure. That means you can give your voice agent the ability to actually reason through complex questions while still responding fast enough to maintain natural conversation flow.
This is a meaningful capability gap. Your competitors' voice agents are limited to pattern-matching with chat models. Yours can think.
Choosing the Stack
For this tutorial we'll use:
- STT: Deepgram -- fast streaming transcription, generous free tier
- LLM: Llama 3.3 70B via General Compute (and optionally a reasoning model for complex queries)
- TTS: Cartesia Sonic -- low-latency, high-quality streaming voice synthesis
- Framework: Pipecat -- open-source Python framework for voice AI pipelines
- Transport: Daily -- WebRTC, built into Pipecat for browser-based interaction
Why Pipecat? It handles the plumbing of wiring STT, LLM, and TTS together with proper streaming, interruption handling, and voice activity detection. It supports OpenAI-compatible providers out of the box, which means General Compute works with no custom integration.
Setting Up
You'll need Python 3.10+ and API keys for each service. Install the dependencies:
```bash
pip install "pipecat-ai[daily,deepgram,cartesia,openai]"
```
Set up your environment variables:
```bash
export GENERAL_COMPUTE_API_KEY="your-gc-api-key"
export DEEPGRAM_API_KEY="your-deepgram-api-key"
export CARTESIA_API_KEY="your-cartesia-api-key"
export DAILY_API_KEY="your-daily-api-key"
```
The General Compute API key works just like an OpenAI key. Sign up at generalcompute.com to get one.
Building the Agent
Here's the full agent. We'll walk through each part below.
```python
import asyncio
import os

from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.services.openai import OpenAILLMService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.services.daily import DailyTransport, DailyParams


async def main():
    # Transport -- WebRTC via Daily
    transport = DailyTransport(
        room_url="",  # Will be created automatically
        token="",
        bot_name="Voice Agent",
        params=DailyParams(audio_out_enabled=True, audio_in_enabled=True),
    )

    # Speech-to-Text
    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

    # LLM -- General Compute via OpenAI-compatible API
    llm = OpenAILLMService(
        api_key=os.getenv("GENERAL_COMPUTE_API_KEY"),
        base_url="https://api.generalcompute.com",
        model="llama-3.3-70b",
    )

    # Text-to-Speech
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # Friendly voice
    )

    # System prompt -- keep it concise for voice
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful voice assistant. Keep your responses "
                "concise -- one to two sentences when possible. Be natural "
                "and conversational. Don't use markdown, bullet points, or "
                "formatting since your responses will be spoken aloud."
            ),
        }
    ]

    # Build the pipeline: STT -> LLM -> TTS
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )

    # Send the initial context to prime the LLM
    await task.queue_frame(LLMMessagesFrame(messages))

    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
```
The LLM Configuration
The part that matters most is the LLM service configuration:
```python
llm = OpenAILLMService(
    api_key=os.getenv("GENERAL_COMPUTE_API_KEY"),
    base_url="https://api.generalcompute.com",
    model="llama-3.3-70b",
)
```
Because General Compute's API is OpenAI-compatible, Pipecat's built-in OpenAI service works without modifications. You point it at GC's base URL and you're done. If you're already using another OpenAI-compatible provider, switching is a one-line change.
Swapping in a Reasoning Model
Want your voice agent to actually think through complex questions? Change the model:
```python
llm = OpenAILLMService(
    api_key=os.getenv("GENERAL_COMPUTE_API_KEY"),
    base_url="https://api.generalcompute.com",
    model="deepseek-r1-0528",
)
```
On General Compute, DeepSeek R1's thinking phase is fast enough that the additional latency stays within voice-grade bounds. On other providers, this same model would add seconds of silence before the first word is spoken.
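One practical wrinkle: DeepSeek R1 emits its chain of thought before the answer, typically wrapped in `<think>...</think>` markers (some OpenAI-compatible providers instead return reasoning in a separate field, in which case no filtering is needed). You don't want the TTS to speak any of that, so filter it out of the token stream. A minimal sketch, under the simplifying assumption that each marker arrives as a whole token; a production filter would also buffer partial tags split across token boundaries:

```python
def strip_think(tokens):
    """Drop reasoning tokens so TTS never speaks the chain of thought.

    Assumes <think> and </think> each arrive as a whole token.
    """
    thinking = False
    for tok in tokens:
        if tok.strip() == "<think>":
            thinking = True
        elif tok.strip() == "</think>":
            thinking = False
        elif not thinking:
            yield tok

stream = ["<think>", "user asked 2+2...", "</think>", "The answer ", "is four."]
print(list(strip_think(stream)))
# ['The answer ', 'is four.']
```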
You can also build a hybrid approach: route simple queries to a fast chat model and complex queries to a reasoning model. Pipecat's pipeline is modular enough to support this with a classification step before the LLM.
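As a sketch of that routing step, here's a deliberately crude heuristic classifier; in production the classifier could itself be a small, fast LLM call. The model names match the ones used in this tutorial, and the keyword list is purely illustrative:

```python
FAST_MODEL = "llama-3.3-70b"
REASONING_MODEL = "deepseek-r1-0528"

# Illustrative stand-in for a real classifier: long or multi-clause
# queries, or queries touching tricky topics, go to the reasoning model.
COMPLEX_HINTS = ("refund", "charged", "compare", "explain", "policy")

def pick_model(user_text: str) -> str:
    text = user_text.lower()
    if len(text.split()) > 20 or any(hint in text for hint in COMPLEX_HINTS):
        return REASONING_MODEL
    return FAST_MODEL

print(pick_model("What are your hours?"))            # llama-3.3-70b
print(pick_model("I was charged twice for this"))    # deepseek-r1-0528
```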
Prompt Engineering for Voice
Voice agents need different prompts than chat agents. A few things to keep in mind:
- Short responses. A three-paragraph answer that looks great in a chat UI is painful to listen to. Instruct the model to keep responses to one or two sentences.
- No formatting. Markdown, bullet points, and numbered lists don't translate to speech. Tell the model to write in plain, conversational language.
- Conversational tone. Written text and spoken text sound different. "I'd be happy to assist you with that" sounds robotic when spoken aloud. "Sure, here's what I found" sounds natural.
Handling Interruptions
Real conversations involve interruptions. A user might start talking while the agent is still responding. Pipecat handles this through Voice Activity Detection (VAD). When it detects the user speaking, it stops the current TTS output and processes the new input.
This is enabled with allow_interruptions=True in the pipeline params. Without it, the agent would finish its entire response before listening again, which feels unnatural.
Measuring Latency
Once your agent is running, you'll want to measure where time is being spent. Pipecat's enable_metrics=True flag logs timing for each pipeline stage.
The metrics you care about:
- TTFT (Time to First Token): How long after STT completes does the LLM start generating? This is the single most important number for voice AI.
- TTFAT (Time to First Audio Token): End-to-end time from user silence to agent audio. This is what the user actually experiences.
- TPS (Tokens Per Second): How fast the LLM generates output. Higher TPS means the spoken response keeps up without awkward pauses mid-sentence.
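A quick back-of-envelope check shows what TPS floor speech actually demands, assuming a typical conversational speaking rate of about 150 words per minute and roughly 1.3 tokens per English word (both rough assumptions):

```python
WORDS_PER_MINUTE = 150   # typical conversational speaking rate (assumption)
TOKENS_PER_WORD = 1.3    # rough average for English text (assumption)

# TTS consumes text at roughly this rate; the LLM must generate faster
# than this on average to avoid mid-sentence pauses
required_tps = WORDS_PER_MINUTE / 60 * TOKENS_PER_WORD
print(f"Speech consumes ~{required_tps:.2f} tokens/s")  # ~3.25 tokens/s
```

The average is easy to beat; what matters in practice is stalls, since any pause in generation becomes an audible pause in speech.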
You can also measure the LLM step in isolation:
```python
import time


async def measure_llm_latency(llm, messages):
    start = time.perf_counter()
    first_token_time = None

    response = await llm.client.chat.completions.create(
        model=llm.model,
        messages=messages,
        stream=True,
    )

    async for chunk in response:
        if chunk.choices[0].delta.content and first_token_time is None:
            first_token_time = time.perf_counter()
            print(f"TTFT: {(first_token_time - start) * 1000:.0f}ms")

    total = time.perf_counter() - start
    print(f"Total generation: {total * 1000:.0f}ms")
```
With General Compute serving Llama 3.3 70B, you should see TTFT in the 80-150ms range. With a reasoning model like DeepSeek R1, the thinking overhead adds some latency, but it stays well under the 500ms TTFAT threshold that voice requires. Try the same reasoning model on another provider and you'll see why this matters.
Running It
Start the agent:
```bash
python agent.py
```
Pipecat will create a Daily room and print the URL. Open it in your browser, allow microphone access, and start talking.
Production Considerations
This tutorial gives you a working prototype. Here's what to think about for production.
Scaling concurrent sessions. Each voice session needs its own pipeline instance. Daily and LiveKit both handle WebRTC scaling, but you'll need to manage pipeline instances. Consider running each session as a separate process or using an orchestrator.
Model routing. In production, you probably want a mix of models. Simple queries ("what are your hours?") go to a fast 8B model. Complex queries ("I was charged twice and my refund was applied to the wrong account") get routed to a reasoning model. General Compute serves multiple model sizes, so you can route dynamically based on query complexity.
Function calling. For real applications, you'll want the agent to do things: check a calendar, look up an order, book a reservation. General Compute's API supports function calling, so you can add tools to the LLM step and the agent will call them as part of the conversation.
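As a sketch, here's an OpenAI-style tool definition for a hypothetical `lookup_order` function (the name and fields are illustrative, not part of any real API). You'd pass the schema alongside your messages and dispatch the model's tool calls to your own backend:

```python
# OpenAI-style tool schema; `lookup_order` is a hypothetical example --
# wire it to your real order system.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Look up the status of a customer's order by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, e.g. 'A-1042'.",
                    },
                },
                "required": ["order_id"],
            },
        },
    }
]

def lookup_order(order_id: str) -> dict:
    # Stub: replace with a real database or API call
    return {"order_id": order_id, "status": "shipped"}
```

For voice, keep tool results terse: the model's summary of the result is what gets spoken, and long payloads invite long answers.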
Phone integration. For phone-based agents, swap Daily for Twilio as your transport layer. The rest of the pipeline stays the same.
Persistent memory. For multi-turn conversations that span sessions, store the message array to a database keyed by user or session ID and reload it when they come back.
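A minimal sketch of that pattern, using JSON files on disk as the store (swap in a real database for production):

```python
import json
from pathlib import Path

def save_history(session_id: str, messages: list, store: Path = Path("sessions")) -> None:
    """Persist the message array so a returning caller resumes where they left off."""
    store.mkdir(exist_ok=True)
    (store / f"{session_id}.json").write_text(json.dumps(messages))

def load_history(session_id: str, store: Path = Path("sessions")) -> list:
    path = store / f"{session_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    # New session: start from the system prompt alone
    return [{"role": "system", "content": "You are a helpful voice assistant."}]
```

Call `save_history` when a session ends and `load_history` before priming the pipeline with the initial context.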
Why This Matters
The voice AI space is growing fast, but almost every company in it is constrained by their inference provider. They're all using standard chat models because reasoning models are too slow on available infrastructure. They're all designing around the same latency limitations.
General Compute removes that constraint. You get fast enough inference to use the best models available, including reasoning models, while staying within the tight latency requirements that voice demands. That means your voice agents can be both fast and smart, which is a combination that wasn't previously available.
The full code from this tutorial works out of the box with a General Compute API key. Sign up at generalcompute.com, grab your key, and you can have a working voice agent running in about 15 minutes.