ElevenLabs | CodeWorlds

ElevenLabs - complete guide to the AI voice platform

What is ElevenLabs?

ElevenLabs is an AI platform specializing in generating realistic speech, voice cloning, and building conversational voice agents. Founded in 2022 by Piotr Dąbkowski (ex-Google ML) and Mati Staniszewski (ex-Palantir), the company quickly became the market leader in speech synthesis.

In February 2026, ElevenLabs announced a $500 million funding round at an $11 billion valuation, eyeing a potential IPO. The platform offers over 3,000 voices in 70+ languages, voice cloning from just a few minutes of recording, and Conversational AI for building interactive voice agents.

What sets ElevenLabs apart from the competition is the quality of generated speech - voices sound natural, with proper intonation, emotions, and pacing. The Eleven v3 model can interpret the emotional context of text and modulate the voice accordingly.

Why ElevenLabs?

Key advantages

Highest voice quality - The most realistic AI voices on the market
70+ languages - Including Polish, with culturally appropriate pronunciation and intonation
Voice cloning - Clone a voice from just a few minutes of recording
Conversational AI - Platform for building real-time voice agents
Audio tags (v3) - Control emotions and speaking style directly in text
Multi-speaker dialogue - Natural dialogues between multiple speakers in a single audio file
Ultra-low latency - Flash v2.5 reaches roughly 75 ms of latency, which suits real-time applications

ElevenLabs vs Amazon Polly vs Google Cloud TTS vs OpenAI TTS

Feature	ElevenLabs	Amazon Polly	Google Cloud TTS	OpenAI TTS
Voice quality	Best	Good	Very good	Very good
Languages	70+	30+	40+	57
Voice cloning	Yes	No	No	No
Audio tags (emotions)	Yes (v3)	No	No	No
Multi-speaker	Yes (v3)	No	No	No
Conversational AI	Yes	No	Dialogflow	Realtime API
Free plan	10 min/mo	5M chars/mo (12 mo)	1M chars/mo	None
Cost (TTS)	From $0.12/1K chars	$4/1M chars	$4-$16/1M chars	$15/1M chars
Latency	75ms (Flash)	~200ms	~200ms	~300ms
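The cost row mixes units (per 1K characters for ElevenLabs, per 1M characters for the others), so the figures are easier to compare once normalized. The sketch below does that arithmetic with the list prices from the table; the dictionary labels and the function name are illustrative, and the ElevenLabs entry uses the lowest "From" price only.

```python
# Prices copied from the comparison table, stored as (amount_usd, per_chars).
TABLE_PRICES = {
    "ElevenLabs (lowest tier)": (0.12, 1_000),
    "Amazon Polly": (4.00, 1_000_000),
    "Google Cloud TTS (low end)": (4.00, 1_000_000),
    "Google Cloud TTS (high end)": (16.00, 1_000_000),
    "OpenAI TTS": (15.00, 1_000_000),
}


def usd_per_million_chars(amount_usd: float, per_chars: int) -> float:
    """Scale a price quoted per `per_chars` characters to a per-1M-character price."""
    return round(amount_usd * 1_000_000 / per_chars, 2)


for name, (amount, per_chars) in TABLE_PRICES.items():
    print(f"{name}: ${usd_per_million_chars(amount, per_chars):,.2f} per 1M chars")
```

On these list prices, the lowest ElevenLabs tier works out to $120 per 1M characters, which is the scale to keep in mind when reading the table.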

TTS models

ElevenLabs offers several models tailored to different use cases:

Eleven v3

The newest and most powerful model, released in June 2025.

Languages: 70+
Character limit: 5,000 per request
Features: Audio tags, multi-speaker dialogue, natural emotional context
Use cases: Audiobooks, podcasts, dubbing, premium content

Multilingual v2

The flagship model for high-quality speech synthesis.

Languages: 29
Character limit: 10,000 per request
Features: Most nuanced expression, excellent intonation
Use cases: Professional voiceovers, ads, e-learning

Flash v2.5

A model optimized for low latency.

Languages: 32
Character limit: 40,000 per request
Latency: ~75ms
Use cases: Conversational AI, voice assistants, real-time applications

Turbo v2.5

A fast model at lower cost.

Languages: 32
Character limit: 40,000 per request
Latency: 250-300ms
Use cases: Mass audio production, content automation
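The per-request character limits above are the detail most likely to matter in code. As a rough sketch, the lookup below records the limits listed in this section and reports which models can take a given text in a single request. The `ModelSpec` class and `models_that_fit` helper are hypothetical names, not part of the SDK, and the model ID strings are assumed to match the ones used in the API examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    model_id: str   # identifier passed as model_id in API calls (assumed)
    max_chars: int  # per-request character limit from the list above


MODELS = {
    "Eleven v3": ModelSpec("eleven_v3", 5_000),
    "Multilingual v2": ModelSpec("eleven_multilingual_v2", 10_000),
    "Flash v2.5": ModelSpec("eleven_flash_v2_5", 40_000),
    "Turbo v2.5": ModelSpec("eleven_turbo_v2_5", 40_000),
}


def models_that_fit(text: str) -> list[str]:
    """Return the names of models whose per-request limit covers the whole text."""
    return [name for name, spec in MODELS.items() if len(text) <= spec.max_chars]


# A 6,000-character script is too long for Eleven v3 but fits the other models.
print(models_that_fit("a" * 6_000))
```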

Audio tags (Eleven v3)

Audio tags are a breakthrough feature of Eleven v3 that lets you control emotions, style, and speaking manner directly in the text.

Code

TEXT

[excited] I can't believe we won the championship!

[whispers] Don't tell anyone, but I have a secret.

[sighs] Another Monday morning...

[laughs] That's the funniest thing I've heard all week!

[sad] I'm going to miss this place.

You can also combine tags with natural context:

Code

TEXT

"I'm so proud of you," she said [tearfully]. "You've come so far."

The model interprets both tags and textual context (punctuation, emotion-describing adjectives), producing very natural results.
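Because tags are plain bracketed words inside the text, they are easy to inspect or remove before sending a script to the API. The sketch below assumes tags consist only of letters and spaces inside square brackets, as in the examples above; `find_audio_tags` and `strip_audio_tags` are hypothetical helper names. Stripping is useful when the same script is reused with a model that does not support tags.

```python
import re

# Matches a bracketed tag such as [excited] or [whispers] (letters and spaces only).
TAG_PATTERN = re.compile(r"\[([A-Za-z][A-Za-z ]*)\]")


def find_audio_tags(text: str) -> list[str]:
    """Return the tag names found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_audio_tags(text: str) -> str:
    """Remove tags and tidy the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    collapsed = re.sub(r"\s{2,}", " ", without_tags)
    # Drop a space left in front of punctuation, e.g. 'said [tearfully].' -> 'said.'
    return re.sub(r"\s+([.,!?;:])", r"\1", collapsed).strip()


line = "[whispers] Don't tell anyone, but I have a secret."
print(find_audio_tags(line))
print(strip_audio_tags(line))
```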

Getting started with the API

SDK installation

Code

Bash

pip install elevenlabs

npm install elevenlabs

API key configuration

Generate an API key in the ElevenLabs dashboard and set it as an environment variable:

Code

Bash

export ELEVENLABS_API_KEY="your-api-key"
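The Python examples in this guide create the client with no arguments, so they depend on this variable being present. A small check like the following can fail early with a readable message; `load_api_key` is a hypothetical helper, not an SDK function.

```python
import os


def load_api_key(env_var: str = "ELEVENLABS_API_KEY") -> str:
    """Read the API key from the environment, raising a clear error if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"{env_var} is not set; export it before running the examples.")
    return api_key
```

Call `load_api_key()` once at startup so that a missing key shows up before the first API request.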

Basic text-to-speech (Python)

Code

Python

from elevenlabs.client import ElevenLabs
from elevenlabs.play import play

elevenlabs = ElevenLabs()

audio = elevenlabs.text_to_speech.convert(
    text="The first move is what sets everything in motion.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)

play(audio)

Basic text-to-speech (TypeScript)

Code

TypeScript

import { ElevenLabsClient } from "elevenlabs";
import { createWriteStream } from "fs";

const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

const audio = await elevenlabs.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "The first move is what sets everything in motion.",
  model_id: "eleven_v3",
  output_format: "mp3_44100_128",
});

const writeStream = createWriteStream("output.mp3");
audio.pipe(writeStream);

Streaming audio

Code

Python

from elevenlabs import stream
from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

audio_stream = elevenlabs.text_to_speech.stream(
    text="This is a streaming example. The audio plays as it generates.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
)

stream(audio_stream)

You can also process chunks manually:

Code

Python

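# Use a fresh stream here: stream(audio_stream) above already consumed the iterator.
for chunk in audio_stream:
    if isinstance(chunk, bytes):
        process_audio_chunk(chunk)  # placeholder for your own handler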
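One possible handler simply accumulates the chunks into a single bytes object. The sketch below uses a fake generator in place of the real stream so that it runs without an API key; `fake_audio_stream` and `collect_audio` are illustrative names.

```python
from typing import Iterable, Iterator


def fake_audio_stream() -> Iterator[bytes]:
    """Stand-in for the SDK stream: yields a few small byte chunks."""
    yield b"ID3"
    yield b"\x00\x01"
    yield b"\x02\x03\x04"


def collect_audio(chunks: Iterable[object]) -> bytes:
    """Concatenate all bytes chunks, skipping anything that is not bytes."""
    buffer = bytearray()
    for chunk in chunks:
        if isinstance(chunk, bytes):
            buffer.extend(chunk)
    return bytes(buffer)


audio_bytes = collect_audio(fake_audio_stream())
print(len(audio_bytes))  # prints 8
```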

Async client

Code

Python

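import asyncio
from elevenlabs.client import AsyncElevenLabs

elevenlabs = AsyncElevenLabs()

async def generate_speech() -> bytes:
    # With the async client, convert() produces audio chunks as an async iterator,
    # so collect them with `async for` rather than awaiting the call directly.
    chunks = []
    async for chunk in elevenlabs.text_to_speech.convert(
        text="Async speech generation is great for web servers.",
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_v3",
    ):
        chunks.append(chunk)
    return b"".join(chunks)

audio = asyncio.run(generate_speech())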
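On a web server, several requests may want audio at the same time, and it can help to cap how many run concurrently. The sketch below shows that pattern with `asyncio.Semaphore`; `fake_synthesize` is a hypothetical stand-in for the real async TTS call, so the example runs offline.

```python
import asyncio


async def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for an async TTS call; returns dummy bytes."""
    await asyncio.sleep(0.01)
    return text.encode("utf-8")


async def synthesize_all(texts: list[str], max_concurrent: int = 2) -> list[bytes]:
    """Run all requests, allowing at most `max_concurrent` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(text: str) -> bytes:
        async with semaphore:
            return await fake_synthesize(text)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(t) for t in texts))


results = asyncio.run(synthesize_all(["one", "two", "three"]))
print(results)
```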

Searching voices

Code

Python

from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

response = elevenlabs.voices.search()
for voice in response.voices:
    print(f"{voice.name} ({voice.voice_id})")

Voice cloning

ElevenLabs offers two types of voice cloning:

Instant voice cloning

Quick cloning from short audio samples (30 seconds - a few minutes). Available from the Starter plan.

Code

Python

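from io import BytesIO

from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

# Upload the samples as file-like objects rather than as path strings.
sample_paths = ["./sample_0.mp3", "./sample_1.mp3", "./sample_2.mp3"]
samples = []
for path in sample_paths:
    with open(path, "rb") as f:
        samples.append(BytesIO(f.read()))

voice = elevenlabs.voices.ivc.create(
    name="Alex",
    description="An old American male voice with a slight hoarseness in his throat. Perfect for news",
    files=samples,
)

# The cloned voice is used like any other voice ID.
audio = elevenlabs.text_to_speech.convert(
    text="This is my cloned voice speaking.",
    voice_id=voice.voice_id,
    model_id="eleven_v3",
)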

Professional voice cloning

Advanced cloning from longer recordings (30+ minutes), delivering the highest reproduction quality. Requires the Creator plan or higher. The process includes identity verification and consent for cloning.

Conversational AI

Conversational AI is the ElevenLabs platform for building interactive real-time voice agents. It combines speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS) into a single pipeline with monitoring and analytics.

Architecture

Speech-to-text - Scribe (ElevenLabs' own model) converts speech to text
LLM - Your choice of model (GPT-4o, Claude, Gemini) processes text and generates a response
Text-to-speech - ElevenLabs TTS converts the response to natural speech
Knowledge base - Optional knowledge base (documents, FAQ) accessible to the agent
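The components above can be pictured as one function call per conversation turn. The sketch below is a conceptual model only, with plain callables standing in for Scribe, the LLM, and TTS; `VoicePipeline` and its fields are hypothetical names and do not reflect how the platform is implemented internally.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    speech_to_text: Callable[[bytes], str]
    llm: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]
    knowledge_base: Optional[Callable[[str], str]] = None  # optional retrieval step

    def handle_turn(self, user_audio: bytes) -> bytes:
        """Process one user utterance: STT -> (knowledge base) -> LLM -> TTS."""
        transcript = self.speech_to_text(user_audio)
        prompt = transcript
        if self.knowledge_base is not None:
            context = self.knowledge_base(transcript)
            prompt = f"Context: {context}\nUser: {transcript}"
        reply = self.llm(prompt)
        return self.text_to_speech(reply)


# Stub components so the sketch runs offline.
pipeline = VoicePipeline(
    speech_to_text=lambda audio: audio.decode("utf-8"),
    llm=lambda prompt: f"You said: {prompt}",
    text_to_speech=lambda text: text.encode("utf-8"),
)
print(pipeline.handle_turn(b"hello"))
```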

Creating a conversational agent

Code

Python

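import signal

from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

elevenlabs = ElevenLabs()

audio_interface = DefaultAudioInterface()

conversation = Conversation(
    client=elevenlabs,
    agent_id="your-agent-id",
    requires_auth=True,
    audio_interface=audio_interface,
)

conversation.start_session()

# End the session cleanly on Ctrl+C, then block until it has finished.
# Calling end_session() right after start_session() would hang up immediately.
signal.signal(signal.SIGINT, lambda sig, frame: conversation.end_session())
conversation.wait_for_session_end()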

Agent with tool calling

Code

Python

import asyncio
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import ClientTools, Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

elevenlabs = ElevenLabs()

async def main():
    custom_loop = asyncio.get_running_loop()
    client_tools = ClientTools(loop=custom_loop)

    async def get_weather(params):
        location = params.get("location", "Unknown")
        return f"Weather in {location}: Sunny, 72°F"

    async def check_order(params):
        order_id = params.get("order_id", "")
        return f"Order {order_id}: Shipped, arriving tomorrow."

    client_tools.register("get_weather", get_weather, is_async=True)
    client_tools.register("check_order", check_order, is_async=True)

    conversation = Conversation(
        client=elevenlabs,
        agent_id="your-agent-id",
        requires_auth=True,
        audio_interface=DefaultAudioInterface(),
        client_tools=client_tools,
    )

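    conversation.start_session()
    # Keep the event loop alive until the call ends so async tools can still run;
    # wait_for_session_end() blocks, so run it in a worker thread.
    await asyncio.to_thread(conversation.wait_for_session_end)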

asyncio.run(main())

Tool registration

Code

Python

from elevenlabs.conversational_ai.conversation import ClientTools

client_tools = ClientTools()

def calculate_sum(params):
    numbers = params.get("numbers", [])
    return sum(numbers)

async def fetch_data(params):
    url = params.get("url")
    return {"data": "fetched"}

client_tools.register("calculate_sum", calculate_sum, is_async=False)
client_tools.register("fetch_data", fetch_data, is_async=True)

Additional products

Scribe (Speech-to-Text)

ElevenLabs' own STT model. Transcribes audio with character-level accuracy, timestamps, and speaker diarization.

Eleven Music

AI music generator launched in August 2025. Creates studio-quality music from natural language prompts. Developed in collaboration with record labels and artists - generated music is cleared for commercial use.

Video dubbing and localization

Localizes films and videos into 70+ languages while preserving the original speaker's voice, emotions, and timing.

Reader App

A mobile app (iOS/Android) that lets you listen to articles, PDFs, and ePubs with AI voices.

Audiobook publishing

A platform for creating and publishing AI-generated audiobooks, launched in February 2025.

Pricing

Plans

Plan	Price/mo	TTS minutes	Conversational AI	Voice cloning
Free	$0	~10 min	-	-
Starter	$5	~30 min	-	Instant
Creator	$22	~100 min	-	Professional
Pro	$99	~500 min	Yes	Professional
Scale	$330	~2,000 min	Yes	Professional
Business	$1,320	11,000 min	13,750 min	Professional
Enterprise	Custom	Custom	Custom	Custom
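Choosing a plan from this table is a simple threshold lookup. The sketch below hard-codes the approximate TTS minutes from the table and returns the first plan that covers a monthly need; `cheapest_plan` is an illustrative helper, and the minute figures are the table's approximations rather than exact quotas.

```python
# (plan name, price in USD per month, approximate TTS minutes) from the table above.
PLANS = [
    ("Free", 0, 10),
    ("Starter", 5, 30),
    ("Creator", 22, 100),
    ("Pro", 99, 500),
    ("Scale", 330, 2_000),
    ("Business", 1_320, 11_000),
]


def cheapest_plan(minutes_needed: float) -> str:
    """Return the lowest-priced plan whose included minutes cover the need."""
    for name, _price, minutes in PLANS:  # PLANS is sorted by ascending price
        if minutes_needed <= minutes:
            return name
    return "Enterprise"  # custom pricing beyond the Business allowance


print(cheapest_plan(120))
```

This considers TTS minutes only; needing Conversational AI or a particular cloning tier can push the choice to a higher plan.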

Per-character costs (TTS, Multilingual v2)

Plan	Cost per 1K chars
Creator	$0.30
Pro	$0.24
Scale	$0.18
Business	$0.12
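These rates translate into a budget with one multiplication. The sketch below estimates the cost of a given character count at each plan's rate from the table; `estimate_tts_cost` is an illustrative helper and ignores any characters already included in the plan.

```python
# Cost per 1K characters (USD) from the table above.
RATE_PER_1K = {"Creator": 0.30, "Pro": 0.24, "Scale": 0.18, "Business": 0.12}


def estimate_tts_cost(num_chars: int, plan: str) -> float:
    """Estimate the cost in USD of synthesizing `num_chars` characters on a plan."""
    return round(num_chars / 1000 * RATE_PER_1K[plan], 2)


# A 500,000-character audiobook manuscript at each plan's rate.
for plan in RATE_PER_1K:
    print(f"{plan}: ${estimate_tts_cost(500_000, plan):.2f}")
```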

Conversational AI

The cost of Conversational AI on the Business plan is $0.08/min. Unused minutes reset monthly.

Comparison with competitors

Platform	TTS cost	Conversational AI	Voice cloning
ElevenLabs	From $0.12/1K chars	$0.08/min	Yes
Amazon Polly	$4/1M chars	No	No
Google Cloud TTS	$4-$16/1M chars	Via Dialogflow	No
OpenAI TTS	$15/1M chars	Realtime API	No
Play.ht	From $0.10/1K chars	No	Yes

Savings

Annual plans offer 2 months free. Unused credits roll over to the next month when upgrading plans.

Security and compliance

Encryption - Data encrypted in transit and at rest
SOC 2 - Compliant
HIPAA - Compliance support available
GDPR - Compliant
EU Data Residency - Option to store data in the EU
Zero Retention - Mode that stores no data, for sensitive applications
Consent verification - Required for voice cloning

Audio formats

Format	Sample rate	Description
MP3	22.05-44.1 kHz	Default, universal
PCM	16-44.1 kHz	Raw audio, lowest latency
μ-law	8 kHz	Telephony
A-law	8 kHz	Telephony (Europe)
Opus	48 kHz	WebRTC, streaming
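The API examples in this guide select a format with strings such as `mp3_44100_128`. Assuming these follow a `codec_samplerate[_bitrate]` pattern (an inference from those examples, not a documented grammar), a small parser can turn them into structured values. `OutputFormat` and `parse_output_format` are hypothetical names.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the string carries no bitrate part


def parse_output_format(value: str) -> OutputFormat:
    """Split a string like 'mp3_44100_128' into codec, sample rate, and bitrate."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"Unexpected output format: {value!r}")
    codec = parts[0]
    sample_rate_hz = int(parts[1])
    bitrate_kbps = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate_hz, bitrate_kbps)


print(parse_output_format("mp3_44100_128"))
print(parse_output_format("mp3_22050_32"))
```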

Practical applications

Multi-voice podcast (v3)

Code

Python

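from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

# Multi-speaker audio goes through the Text to Dialogue endpoint: each line
# carries its own voice_id, and audio tags such as [excited] sit inside the text.
HOST_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"
GUEST_VOICE_ID = "your-guest-voice-id"  # placeholder: replace with a real voice ID

audio = elevenlabs.text_to_dialogue.convert(
    model_id="eleven_v3",
    inputs=[
        {"text": "Welcome to Tech Talk! Today we're discussing the future of AI.", "voice_id": HOST_VOICE_ID},
        {"text": "Thanks for having me. I think 2026 is going to be a breakthrough year.", "voice_id": GUEST_VOICE_ID},
        {"text": "[excited] Absolutely! Let's dive right in.", "voice_id": HOST_VOICE_ID},
    ],
)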

Audiobook generator

Code

Python

from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

chapters = [
    "Chapter 1: The Beginning. It was a dark and stormy night...",
    "Chapter 2: The Journey. The next morning brought clear skies...",
    "Chapter 3: The Discovery. Deep in the forest, she found...",
]

for i, chapter in enumerate(chapters):
    audio = elevenlabs.text_to_speech.convert(
        text=chapter,
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_multilingual_v2",
        output_format="mp3_44100_128",
    )
    with open(f"chapter_{i+1}.mp3", "wb") as f:
        for chunk in audio:
            f.write(chunk)

Real-time voice assistant (Next.js)

Code

TypeScript

import { ElevenLabsClient } from "elevenlabs";
import { NextResponse } from "next/server";

const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

export async function POST(request: Request) {
  const { text, voiceId } = await request.json();

  const audio = await elevenlabs.textToSpeech.convert(voiceId, {
    text,
    model_id: "eleven_flash_v2_5",
    output_format: "mp3_22050_32",
  });

  return new NextResponse(audio as unknown as ReadableStream, {
    headers: {
      "Content-Type": "audio/mpeg",
    },
  });
}

React integration

Code

TypeScript

import { useState, useRef } from "react";

export function TextToSpeechPlayer() {
  const [text, setText] = useState("");
  const [isLoading, setIsLoading] = useState(false);
  const audioRef = useRef<HTMLAudioElement>(null);

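  const generateSpeech = async () => {
    setIsLoading(true);

    try {
      const response = await fetch("/api/tts", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text,
          voiceId: "JBFqnCBsd6RMkjVDRZzb",
        }),
      });

      // Surface HTTP errors instead of trying to play an error body as audio.
      if (!response.ok) {
        throw new Error(`TTS request failed with status ${response.status}`);
      }

      const blob = await response.blob();
      const url = URL.createObjectURL(blob);

      if (audioRef.current) {
        audioRef.current.src = url;
        await audioRef.current.play();
      }
    } catch (error) {
      console.error(error);
    } finally {
      // Always re-enable the button, even when the request fails.
      setIsLoading(false);
    }
  };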

  return (
    <div className="flex flex-col gap-4 p-6 max-w-md">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter text to convert to speech..."
        className="w-full p-3 border rounded-lg dark:bg-gray-800 dark:border-gray-600"
        rows={4}
      />
      <button
        onClick={generateSpeech}
        disabled={isLoading || !text}
        className="px-4 py-2 bg-blue-500 hover:bg-blue-600 text-white rounded-lg disabled:opacity-50"
      >
        {isLoading ? "Generating..." : "Generate Speech"}
      </button>
      <audio ref={audioRef} controls className="w-full" />
    </div>
  );
}

Limitations and challenges

Cost at scale - Costs grow quickly at high audio volumes, especially with premium models
Character limit - Eleven v3 accepts at most 5,000 characters per request (vs 40,000 for Flash v2.5 and Turbo v2.5)
Free plan - Only 10 minutes per month, no commercial use, requires attribution
Voice cloning ethics - Requires consent verification, which slows down the process
No self-hosting - Models available only via API, no on-premise option
Quality in minor languages - Some languages have lower quality than English
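The character limit is the one limitation that can be handled in code: longer texts can be split into several requests. The sketch below packs whole sentences greedily into chunks that stay under a limit (5,000 by default, matching Eleven v3); `split_for_model` is a hypothetical helper, and the sentence detection is a simple regex that will not handle every abbreviation correctly.

```python
import re


def split_for_model(text: str, max_chars: int = 5_000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_model("One. Two. Three.", max_chars=9))
```

Each chunk can then be sent as its own request and the resulting audio files concatenated. Splitting at sentence boundaries keeps the cuts at natural pauses.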

FAQ

Does ElevenLabs support Polish?

Yes, Polish is one of the 70+ supported languages. The v3 model offers the best Polish quality with culturally appropriate intonation. For real-time applications, Flash v2.5 also supports Polish.

How much does voice cloning cost?

Instant voice cloning is available from the Starter plan ($5/mo). Professional voice cloning requires the Creator plan ($22/mo) or higher. The cloning process itself has no additional fee - you pay for audio generation as usual.

Can I use ElevenLabs for a commercial audiobook?

Yes, from the Starter plan you have commercial usage rights. ElevenLabs also offers a dedicated platform for publishing audiobooks in the Reader app.

How does ElevenLabs compare to OpenAI TTS?

ElevenLabs offers higher voice quality, voice cloning, audio tags, multi-speaker dialogue, and lower latency (about 75 ms with Flash v2.5 vs roughly 300 ms). OpenAI TTS is simpler to use and has a Realtime API for conversations, but it supports neither voice cloning nor comparably fine-grained emotion control.

Can I build a voice agent with ElevenLabs?

Yes, Conversational AI combines STT (Scribe), LLM (your choice), and TTS into a single pipeline. It supports tool calling, knowledge base, and monitoring. SDKs are available for Python, TypeScript, Flutter, Swift, and Kotlin.

Do audio tags work in all models?

No, audio tags ([excited], [whispers], etc.) are a feature exclusive to the Eleven v3 model. Older models (Multilingual v2, Flash v2.5) interpret emotions from text context but don't support tags.

Summary

ElevenLabs is the undisputed leader in AI speech synthesis quality in 2026. The v3 model with audio tags, multi-speaker dialogue, and 70+ languages sets a new industry standard. Conversational AI allows building voice agents comparable to Vapi, but with the advantage of the best voice quality on the market.

For developers, ElevenLabs offers well-documented SDKs (Python, TypeScript), streaming API, voice cloning, and flexible architecture connecting to any LLM. The main trade-offs are cost at scale and no self-hosting option - but if voice quality is the priority, ElevenLabs is hard to beat.