/v1/audio/speech is OpenAI-SDK compatible: same body shape, same response shape (binary audio with Content-Type per format). Supports full or streaming responses and respects org-level ZDR flags.
Syntax
The response's Content-Type header indicates the audio format.
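As a minimal sketch of a non-streaming call, the snippet below builds the POST request with the standard library only. The host in `BASE_URL` is a placeholder, and Bearer-token authentication is an assumption based on the stated OpenAI-SDK compatibility; the body fields (`model`, `input`, `voice`, `response_format`) follow the OpenAI speech request shape. The helper names are illustrative.

```python
import json
import urllib.request

# Placeholder host (assumption): replace with your gateway's real base URL.
BASE_URL = "https://gateway.example.com"


def build_speech_request(api_key, text, model="openai/tts-1",
                         voice="alloy", response_format="mp3"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    body = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Bearer auth is assumed from OpenAI-SDK compatibility.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key, text, **options):
    """Send the request and return (content_type, audio_bytes)."""
    request = build_speech_request(api_key, text, **options)
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type"), response.read()
```

Calling `synthesize(key, "Hello")` would return the binary audio together with the Content-Type header, which can then be used to pick a file extension.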
Output formats
| response_format | Content-Type | When to use |
|---|---|---|
| mp3 (default) | audio/mpeg | Compatible with all browsers and players |
| opus | audio/opus | Low-latency bidirectional streaming (WebRTC) |
| wav | audio/wav | Post-processing (ASR, mixing). Lossless |
| pcm | audio/L16 | No container, custom engine integration |
| aac | audio/aac | Native iOS / Apple devices |
| flac | audio/flac | Lossless, compressed. Archival |
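One way to use this table on the client side is to map the returned Content-Type back to a format name, for example to choose a file extension. The sketch below copies the values from the table; the function names are illustrative, and using the format name directly as the extension is a convenience assumption.

```python
# response_format -> Content-Type, copied from the table above.
CONTENT_TYPES = {
    "mp3": "audio/mpeg",
    "opus": "audio/opus",
    "wav": "audio/wav",
    "pcm": "audio/L16",
    "aac": "audio/aac",
    "flac": "audio/flac",
}


def format_from_content_type(content_type):
    """Return the response_format name for a Content-Type header value."""
    # Drop parameters such as "; rate=24000" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    for fmt, ctype in CONTENT_TYPES.items():
        if ctype.lower() == media_type:
            return fmt
    raise ValueError(f"unrecognized audio Content-Type: {content_type!r}")


def output_filename(stem, content_type):
    """Build a file name whose extension matches the returned format."""
    return f"{stem}.{format_from_content_type(content_type)}"
```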
Real-time streaming
Pass stream: true to receive audio in chunks as the model generates it, which lowers perceived latency.
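A streaming call might look like the sketch below, which writes each chunk to disk as soon as it arrives. It assumes the streamed response is raw audio bytes delivered over a chunked HTTP response, that `BASE_URL` is a placeholder host, and that authentication uses a Bearer token; only the `stream` field comes from this page.

```python
import json
import urllib.request

# Placeholder host (assumption): replace with your gateway's real base URL.
BASE_URL = "https://gateway.example.com"


def streaming_body(text, model="openai/tts-1", voice="alloy"):
    """Request body that asks the gateway to stream the audio."""
    return {"model": model, "input": text, "voice": voice, "stream": True}


def copy_chunks(source, sink, chunk_size=4096):
    """Copy a file-like source to sink chunk by chunk; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total


def stream_speech_to_file(api_key, text, path):
    """POST a streaming request and write audio to path as it arrives."""
    request = urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(streaming_body(text)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        return copy_chunks(response, out)
```

In a playback scenario, `sink` could be an audio device wrapper instead of a file, so playback starts before synthesis has finished.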
Tone instructions (gpt-4o-mini-tts)
The openai/gpt-4o-mini-tts model accepts an instructions field for style prompting.
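A request body using the instructions field could be assembled as in this sketch. The client-side check that rejects instructions for other models is a design choice of the example, not documented gateway behavior, and the function name is illustrative.

```python
INSTRUCTION_MODELS = {"openai/gpt-4o-mini-tts"}


def speech_body(text, model="openai/gpt-4o-mini-tts", voice="alloy",
                instructions=None):
    """Build a speech request body, optionally with style instructions."""
    body = {"model": model, "input": text, "voice": voice}
    if instructions is not None:
        # Client-side guard (assumption): fail early for models that this
        # page does not list as accepting the instructions field.
        if model not in INSTRUCTION_MODELS:
            raise ValueError(f"{model} is not listed as accepting instructions")
        body["instructions"] = instructions
    return body


example = speech_body(
    "Your order has shipped.",
    instructions="Speak in a warm, upbeat tone.",
)
```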
Available models
| Model ID | Voices | Price | ZDR |
|---|---|---|---|
| openai/tts-1 | alloy, echo, fable, onyx, nova, shimmer | $15 / 1M chars | ✓ |
| openai/tts-1-hd | alloy, echo, fable, onyx, nova, shimmer | $30 / 1M chars | ✓ |
| openai/gpt-4o-mini-tts | alloy, echo, fable, onyx, nova, shimmer, verse | $12 / 1M chars | ✓ |
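The prices in this table can be turned into a rough cost estimate. The sketch below assumes that billable characters equal `len(text)` and that the +5% markup mentioned under Pricing structure is applied on top of the listed prices; neither detail is confirmed here, so the markup is a parameter.

```python
# USD per 1M characters, copied from the table above.
PRICE_PER_MILLION_CHARS = {
    "openai/tts-1": 15.0,
    "openai/tts-1-hd": 30.0,
    "openai/gpt-4o-mini-tts": 12.0,
}


def estimate_cost_usd(model, text, markup=0.05):
    """Estimate the cost of synthesizing text with the given model.

    Assumptions: characters are counted with len(text), and the markup
    is added on top of the listed price.
    """
    base = PRICE_PER_MILLION_CHARS[model] * len(text) / 1_000_000
    return base * (1 + markup)
```

For example, 1,000 characters on openai/tts-1 come to about $0.015 before markup under these assumptions.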
Pricing structure
TTS models are billed per character of synthesized text, not per token or per second. This mirrors provider pricing and makes cost predictable: 1,000 characters ≈ 150 words, regardless of audio duration. Geek Hub applies the standard +5% markup.
ZDR pre-flight
If your org requires ZDR for the TTS model’s group, or if the request includes zdr: true, the gateway verifies ZDR compliance before processing. If compliance cannot be verified, the request fails with HTTP 422 and the response lists alternative models.
See Zero Data Retention.
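A client might handle the pre-flight failure as sketched below. The shape of the 422 response body is not specified on this page, so the example only tries to parse it as JSON and otherwise keeps the raw text; the exception and function names are hypothetical, and treating every 422 as a ZDR failure is a simplifying assumption.

```python
import json


class ZDRNotVerified(Exception):
    """Raised when the gateway answers HTTP 422 to a ZDR-constrained request."""

    def __init__(self, details):
        super().__init__("ZDR could not be verified for the requested model")
        self.details = details  # parsed JSON if possible, else raw text


def zdr_speech_body(text, model="openai/tts-1", voice="alloy"):
    """Request body that asks for a ZDR pre-flight check on this call."""
    return {"model": model, "input": text, "voice": voice, "zdr": True}


def audio_or_raise(status, body):
    """Return the audio bytes on success; raise on 422 or other errors."""
    if status == 422:
        try:
            details = json.loads(body)
        except ValueError:  # body was not valid JSON
            details = body.decode("utf-8", errors="replace")
        raise ZDRNotVerified(details)
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    return body
```

With `urllib`, a 422 surfaces as `urllib.error.HTTPError`; its `code` attribute and `read()` result can be passed to `audio_or_raise`, and the caller can inspect `details` for the suggested alternatives.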