Audio Model Leaderboard

Compare 15 audio models for TTS, STT, and music generation

#1

Fish Speech V1.5

fish-speech-v1.5

Top choice for multilingual accuracy with industry-leading performance scores

Fish Audio
TTS
Multilingual Excellence
Industry-Leading Accuracy
Zero-shot
Low Latency
Cost/1M Chars: $2.50
Languages: 50
Voice Options: 500+
#2

CosyVoice2 0.5B

cosyvoice2-0.5b

Ultra-low 150ms latency with fine-grained emotion and dialect control

Alibaba
TTS
Ultra-Low Latency
Emotion Control
Dialect Support
Cross-lingual
Cost/1M Chars: $2.00
Languages: 20
Voice Options: 300+
#3

IndexTTS-2

indextts-2

Professional zero-shot TTS with precise emotion and duration control

IndexAI
TTS
Zero-shot
Emotion Control
Duration Control
Professional Grade
Cost/1M Chars: $4.00
Languages: 30
Voice Options: 400+
#4

Kimi Audio

kimi-audio

Conversational AI that can chat and respond in spoken form with emotion

Moonshot AI
CONVERSATIONAL
Speech Conversation
ASR
Audio QA
Emotion Recognition
Multimodal
Cost/Minute: $0.050
Languages: 15
#5

Dia 1.6B

dia-1.6b

Open TTS specialized in dialogue with emotions and nonverbal sounds

Open Source
TTS
Dialogue Specialized
Emotions
Nonverbal Sounds
Open Source
0
Languages: 20
Voice Options: 100+
#6

MiniMax Audio

minimax-audio

Advanced platform combining TTS, voice cloning, custom voices, and music generation

MiniMax
PLATFORM
Text-to-Speech
Voice Cloning
Custom Voices
Music Generation
Long-form
Cost/1M Chars: $5.00
Languages: 40
Voice Options: 300+
#7

ElevenLabs Turbo v2

eleven-turbo-v2

Leading TTS with natural voices and emotion control

ElevenLabs
TTS
Voice Cloning
Multilingual
Low Latency
Emotional Range
Cost/1M Chars: $3.00
Languages: 32
Voice Options: 1000+
#8

High-quality text-to-speech from OpenAI

OpenAI
TTS
High Quality
Multilingual
API Integration
Cost/1M Chars: $15.00
Languages: 57
Voice Options: 6+
#9

Whisper Large v3

whisper-large-v3

Best-in-class speech recognition, open source

OpenAI
STT
Multilingual
Timestamps
Word-level
Speaker Detection
Cost/Minute: $0.006
Languages: 99
#10

Suno Music v4

suno-music-v4

Excels at vocal synthesis with remarkably natural-sounding singing voices

Suno
MUSIC
Music Generation
Vocal Synthesis
Lyrics
Multiple Genres
Natural Singing
#11

Udio Music v2

udio-music-v2

Hierarchical framework with specialized networks for sophisticated structure

Udio
MUSIC
Extended Songs
Hierarchical Generation
Structural Awareness
Audio Inpainting
Stems
#12

Stable Audio 2.0

stable-audio-2.0

Sound design capabilities for experimental music and innovative sonic territories

Stability AI
MUSIC
Music Generation
Sound Design
Experimental
Innovative Sonics
Open Source
#13

Fast, accurate speech-to-text optimized for real-time

Deepgram
STT
Real-time
Speaker Detection
High Accuracy
Low Latency
Cost/Minute: $0.004
Languages: 36
#14

PlayHT 3.0

playht-3.0

Ultra-realistic TTS with advanced voice cloning

PlayHT
TTS
Voice Cloning
Emotional Control
Ultra-realistic
Cost/1M Chars: $8.00
Languages: 142
Voice Options: 900+
#15

Azure Neural TTS

azure-neural-tts

Enterprise TTS with custom voice support

Microsoft
TTS
Enterprise
Custom Voices
SSML
Neural
Cost/1M Chars: $16.00
Languages: 119
Voice Options: 400+