Deepgram - Voice AI API for Speech-to-Text, TTS, and Voice Agents

Most voice AI vendors make you ask a sales rep what speech-to-text actually costs. Deepgram publishes a rate card. Nova-3, its current transcription model, runs $0.0048 per minute for streaming audio and $0.0077 per minute for pre-recorded files; new accounts start with $200 in free credit on a pay-as-you-go plan. That transparency is the first useful signal about who the platform is built for: developers who want to wire voice into an application and see the per-minute math before they commit, not enterprises that only move after a procurement cycle.

Deepgram is an API-first voice platform with three products behind a single API key. Speech-to-text is the Nova family. Text-to-speech is Aura. The newer pitch ties them together: a single Voice Agent API that orchestrates transcription, a language model, and speech synthesis behind one endpoint, so you are not stitching three vendors into a real-time loop yourself.

The three pieces

Nova for transcription. Nova-3 is the listed default for turning audio into text, in both a real-time streaming mode and a batch mode for recorded files. Deepgram advertises 45+ languages on the speech-to-text side, with monolingual and multilingual variants priced separately — the multilingual model costs a little more ($0.0058/min streaming versus $0.0048 for monolingual).
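For a sense of what the batch mode looks like in practice, here is a minimal sketch that builds (but does not send) a pre-recorded transcription request. The endpoint path, the `model` query parameter, the `Token` authorization scheme, and the JSON body with a `url` field are assumptions based on Deepgram's public REST API and are not taken from this article, so check them against the current docs. The helper name, the placeholder key, and the example audio URL are hypothetical.

```python
import json
import urllib.request


def build_transcription_request(api_key: str, audio_url: str,
                                model: str = "nova-3") -> urllib.request.Request:
    """Build a POST request asking for a hosted audio file to be transcribed.

    Assumed (verify against the docs): /v1/listen endpoint, `model` query
    parameter, `Authorization: Token <key>` header, JSON body {"url": ...}.
    """
    endpoint = f"https://api.deepgram.com/v1/listen?model={model}"
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical key and audio URL, for illustration only.
req = build_transcription_request("YOUR_API_KEY", "https://example.com/call.wav")
print(req.get_method(), req.full_url)

# Sending it needs a real key and network access:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Building the request separately from sending it keeps the sketch runnable without credentials and makes the assumed pieces easy to inspect before any audio leaves your machine.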

Aura for synthesis. On the text-to-speech side, Aura is billed per character rather than per minute: Aura-2 at $0.030 per 1,000 characters, the older Aura-1 at half that. Per-character pricing is the honest unit for TTS — what you pay tracks the script length, not how slowly the voice reads it.
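The per-character arithmetic is simple enough to sketch. The rates below are the ones quoted in this article; the function name and dictionary keys are hypothetical, and the sketch assumes every character (spaces and punctuation included) is billed, which is worth confirming against the actual billing rules.

```python
# USD per 1,000 characters, as quoted in the article.
AURA_RATES_PER_1K_CHARS = {"aura-2": 0.030, "aura-1": 0.0150}


def tts_cost(text: str, model: str = "aura-2") -> float:
    """Estimate synthesis cost in USD for one script.

    Assumption: all characters count toward billing, including whitespace.
    """
    return len(text) / 1000 * AURA_RATES_PER_1K_CHARS[model]


script = "Thanks for calling. How can I help you today?"
print(f"{len(script)} characters cost about ${tts_cost(script):.6f} on Aura-2")
```

Note that the cost depends only on `len(text)`: a slow, deliberate voice and a fast one cost the same for the same script.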

Flux and the Voice Agent API. The piece Deepgram is leaning into hardest is the conversational layer. Flux is a speech recognition model aimed specifically at voice agents — the kind of system that has to know when a caller has actually finished talking, not just transcribe a file after the fact. It is multilingual across ten languages (English, Spanish, German, French, Hindi, Russian, Portuguese, Japanese, Italian, Dutch). The Voice Agent API wraps that turn-taking model together with an LLM and Aura, and the explicit argument is latency and cost: one orchestrated API call instead of three round-trips you have to glue together and keep in sync.

What it actually costs to run

The pricing is concrete enough to reason about before signing up:

Nova-3 (STT, monolingual): $0.0048/min streaming, $0.0077/min pre-recorded
Nova-3 (STT, multilingual): $0.0058/min streaming, $0.0092/min pre-recorded
Flux (STT, English): $0.0065/min streaming
Aura-2 (TTS): $0.030 / 1,000 characters
Aura-1 (TTS): $0.0150 / 1,000 characters
Add-ons: redaction and speaker diarization at $0.0020/min each

Roughly, an hour of streaming transcription on Nova-3 costs about 29 cents ($0.0048 × 60 ≈ $0.288), so the $200 starter credit covers on the order of 690 hours of streaming audio before a card is charged. Annual prepayment on the Growth plan is advertised at up to 20% off.
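To run the same math for your own workload, a small estimator built from the rate card above might look like this. The rates are the article's; the model labels, function names, and the assumption that each add-on simply adds its per-minute fee on top of the base rate are illustrative, not confirmed billing rules.

```python
# USD per minute, copied from the rate card above. Keys are hypothetical labels.
STT_RATES_PER_MIN = {
    ("nova-3-mono", "streaming"): 0.0048,
    ("nova-3-mono", "prerecorded"): 0.0077,
    ("nova-3-multi", "streaming"): 0.0058,
    ("nova-3-multi", "prerecorded"): 0.0092,
    ("flux-en", "streaming"): 0.0065,
}
ADDON_RATE_PER_MIN = 0.0020  # redaction or diarization, each


def rate_per_minute(model: str, mode: str, addons: int = 0) -> float:
    """Base rate plus add-ons (assumed to stack additively per minute)."""
    return STT_RATES_PER_MIN[(model, mode)] + addons * ADDON_RATE_PER_MIN


def stt_cost(minutes: float, model: str = "nova-3-mono",
             mode: str = "streaming", addons: int = 0) -> float:
    """Estimated transcription cost in USD."""
    return minutes * rate_per_minute(model, mode, addons)


def credit_minutes(credit_usd: float, model: str = "nova-3-mono",
                   mode: str = "streaming", addons: int = 0) -> float:
    """How many minutes of audio a given credit balance would cover."""
    return credit_usd / rate_per_minute(model, mode, addons)


print(f"1 hour, Nova-3 streaming: ${stt_cost(60):.3f}")
print(f"$200 credit: about {credit_minutes(200) / 60:.0f} hours of streaming")
```

Passing `addons=2` models a call-center style workload with both redaction and diarization switched on, which raises the effective streaming rate from $0.0048 to $0.0088 per minute under the additive assumption.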

Why the self-hosting note matters

Most hosted voice APIs only run in the vendor’s cloud. Deepgram lists on-premise and private-cloud deployment as a supported option, and pairs it with the compliance set you would expect for regulated audio: SOC 2 Type 1 and 2, HIPAA, GDPR, CCPA, and PCI. Read together, those two facts point at the real target customer — contact centers, medical transcription, and speech analytics shops that cannot send raw call recordings to an arbitrary third-party endpoint. If you fit that description, the self-hosted path is the detail to dig into first, because it changes the entire deployment and pricing conversation.

Who should look elsewhere

This is a usage-priced cloud service, and that cuts both ways. If you have a one-off recording to transcribe or a hobby project, an open-source local model you run yourself costs nothing per minute and keeps the audio on your machine. The per-minute rate only pays off when transcription or synthesis is a continuous part of a product. The Voice Agent API is also clearly meant for people building real-time conversational systems; if all you need is plain batch transcription, you are buying into a larger surface area than the job requires.

One honest gap: the landing page sells low latency and consolidation hard but does not put public accuracy or latency numbers next to those claims, so the real test is your own audio. The $200 credit exists precisely so you can run that test before deciding. If you like the “one API instead of many” model in general, it rhymes with what OpenRouter does for text LLMs — a single interface standing in front of several moving parts.
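One way to make "test your own audio" concrete is to score transcripts against a hand-corrected reference using word error rate (WER), a standard accuracy metric that the article itself does not prescribe. The sketch below is a plain word-level edit distance; the normalization (lowercasing and whitespace splitting) is a simplifying assumption, and a serious evaluation would also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Running the same handful of recordings through each candidate service and comparing WER gives you the accuracy comparison the landing page leaves out.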


See for yourself

The fastest way to judge Deepgram is to throw your own audio at it. Grab the free credit, read the pricing, and try the models at deepgram.com — your recordings will tell you more than any benchmark on the homepage.