ElevenLabs v3: Setting new standards in AI-powered speech synthesis

ElevenLabs has released Eleven v3 (Alpha), its most advanced text-to-speech model to date. The new model brings significant improvements in the expressiveness and naturalness of synthetic speech, greatly expanding the possibilities for professional applications.

Audio tags: Precise control over emotions and style

The central feature of ElevenLabs v3 is inline audio tags. These allow users to specifically control emotional and stylistic aspects of the generated speech.

The tags function as direct instructions in the text:

  • [whispers] for quiet, confidential passages
  • [laughs] for natural laughter
  • [angry] for angry or intense moments
  • [excited] for enthusiastic expressions
  • [sighs] for thoughtful pauses

These control elements can also be combined: [happily][shouts] We did it! [laughs]. The result is speech output that sounds much more natural and expressive than previous systems.
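Because the tags are plain inline text, prompt strings can be assembled programmatically. The small helper below is purely illustrative (it is not part of any ElevenLabs SDK); only the tag names themselves come from the list above.

```python
def tag(text: str, *tags: str) -> str:
    """Prefix a passage with one or more inline audio tags.

    Tag names such as "whispers" or "laughs" are taken from the
    article; this helper is just a convenience for building prompt
    strings, not an official API.
    """
    return "".join(f"[{t}]" for t in tags) + " " + text

line = tag("We did it!", "happily", "shouts") + " [laughs]"
print(line)  # [happily][shouts] We did it! [laughs]
```

The resulting string is passed to the model unchanged; the tags are consumed as performance directions rather than read aloud.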

Comprehensive language support: 70+ languages available

ElevenLabs v3 supports over 70 languages, covering a large part of global communication. The spectrum ranges from widely spoken languages such as German, English, and Mandarin Chinese to less common languages such as Luxembourgish and Lingala.

The model takes into account language-specific characteristics such as regional accents, cultural intonation patterns, and the characteristic speech melody of different languages. German texts sound authentically German, while French texts retain the typical French intonation.

Dialogue mode: Natural conversations between multiple speakers

An important new feature is the text-to-dialogue mode. For the first time, users can generate realistic conversations between different speakers. The system is capable of:

  • Natural interruptions in the flow of conversation
  • Emotional transitions between different speakers
  • Context-aware responses to previous statements
  • Smooth speaker changes without audible breaks

The new Text-to-Dialogue API works with structured JSON objects that define each contribution to the conversation. The model automatically organizes the conversation flow and ensures natural-sounding dialogues.

Technical improvements over its predecessor

ElevenLabs v3 is based on a completely redesigned architecture. Compared to its predecessor, v2, the new version offers significant advances in several areas.

While v2 already achieved good results with individual voices, v3 enables true multi-speaker dialogues for the first time. Support for audio tags has been expanded from basic functions to a comprehensive system for emotional and stylistic control.

Language support has been expanded from 29 languages in v2 to over 70 languages in v3. The new dialogue capability in particular was not available in v2 and represents an important enhancement of functionality.

The new model does require more prompt engineering than its predecessors, but in return it offers significantly better control over the result.

Availability and conditions

ElevenLabs v3 is available now via the ElevenLabs platform. Until the end of June 2025, users of the web interface receive an 80% discount on v3 usage.

The public API is still in development. Companies can already request early access through sales. For applications with real-time requirements, ElevenLabs continues to recommend the v2.5 Turbo or Flash models, as v3 has been optimized primarily for quality-oriented applications.

Important note: Professional Voice Clones do not currently work optimally with v3. ElevenLabs recommends using Instant Voice Clones or the predefined voices instead for best results with the new features.
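For readers who want to prepare integrations ahead of the public API, a request for single-voice synthesis can be assembled along the lines of ElevenLabs' existing text-to-speech REST convention. The sketch below only builds the request; the `eleven_v3` model identifier and the exact body fields are assumptions, not confirmed values.

```python
def build_tts_request(voice_id: str, text: str, api_key: str):
    """Assemble (url, headers, body) for a hypothetical v3 synthesis call.

    The endpoint path follows the existing ElevenLabs text-to-speech
    REST convention; the "eleven_v3" model identifier is an assumption
    for illustration only.
    """
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = {"text": text, "model_id": "eleven_v3"}
    return url, headers, body
```

Sending the request (e.g. with `requests.post(url, headers=headers, json=body)`) would return synthesized audio once v3 access is enabled for the account.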

Significance for the speech synthesis industry

ElevenLabs v3 represents an important advance in the development of natural speech synthesis. The ability to specifically control emotions, intonation, and non-verbal elements such as laughter or sighing greatly expands the range of possible applications.

The development also demonstrates the rapid progress being made in the field of generative AI. The combination of improved emotional expressiveness, comprehensive language support, and dialogue functions makes v3 a versatile tool for a variety of applications.

Conclusion: Significant improvements in speech quality

ElevenLabs v3 brings noticeable improvements in the quality and naturalness of synthetic speech. The new audio tags, expanded language support, and dialogue mode significantly broaden the possibilities for professional applications.

For users who work with language technology, v3 offers new possibilities for creating expressive audio content. Discounted access during the alpha phase makes it easy to test and evaluate the new features.

Justus Becker

I have a passion for storytelling. AI enthusiast and addicted to midjourney.