ElevenLabs

ElevenLabs is an AI platform for generating realistic speech, voice cloning, and building voice agents. Guide to API, SDK, TTS models, Conversational AI, and practical applications.

ElevenLabs - complete guide to the AI voice platform

What is ElevenLabs?

ElevenLabs is an AI platform specializing in generating realistic speech, voice cloning, and building conversational voice agents. Founded in 2022 by Piotr Dąbkowski (ex-Google ML) and Mati Staniszewski (ex-Palantir), the company quickly became the market leader in speech synthesis.

In February 2026, ElevenLabs announced a $500 million funding round at an $11 billion valuation, eyeing a potential IPO. The platform offers over 3,000 voices in 70+ languages, voice cloning from just a few minutes of recording, and Conversational AI for building interactive voice agents.

What sets ElevenLabs apart from the competition is the quality of generated speech - voices sound natural, with proper intonation, emotions, and pacing. The Eleven v3 model can interpret the emotional context of text and modulate the voice accordingly.

Why ElevenLabs?

Key advantages

  1. Highest voice quality - The most realistic AI voices on the market
  2. 70+ languages - Including Polish, with culturally appropriate pronunciation and intonation
  3. Voice cloning - Clone a voice from just a few minutes of recording
  4. Conversational AI - Platform for building real-time voice agents
  5. Audio tags (v3) - Control emotions and speaking style directly in text
  6. Multi-speaker dialogue - Natural dialogues between multiple speakers in a single audio file
  7. Ultra-low latency - Flash v2.5 reaches about 75 ms of latency, ideal for real-time applications

ElevenLabs vs Amazon Polly vs Google Cloud TTS vs OpenAI TTS

| Feature | ElevenLabs | Amazon Polly | Google Cloud TTS | OpenAI TTS |
| --- | --- | --- | --- | --- |
| Voice quality | Best | Good | Very good | Very good |
| Languages | 70+ | 30+ | 40+ | 57 |
| Voice cloning | Yes | No | No | No |
| Audio tags (emotions) | Yes (v3) | No | No | No |
| Multi-speaker | Yes (v3) | No | No | No |
| Conversational AI | Yes | No | Dialogflow | Realtime API |
| Free plan | 10 min/mo | 5M chars/mo (12 mo) | 1M chars/mo | None |
| Cost (TTS) | From $0.12/1K chars | $4/1M chars | $4-$16/1M chars | $15/1M chars |
| Latency | 75ms (Flash) | ~200ms | ~200ms | ~300ms |
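The cost row mixes units (per 1K and per 1M characters), which makes the figures hard to compare at a glance. Here is a minimal sketch that normalizes the list prices to dollars per million characters. The dictionary and helper names are illustrative, and the prices are copied from the table rather than fetched from any provider.

```python
# List prices copied from the comparison table (USD).
# Each entry is (price, number of characters that price covers).
LIST_PRICES = {
    "ElevenLabs": (0.12, 1_000),
    "Amazon Polly": (4.00, 1_000_000),
    "Google Cloud TTS (low end)": (4.00, 1_000_000),
    "OpenAI TTS": (15.00, 1_000_000),
}


def price_per_million(price: float, per_chars: int) -> float:
    """Scale a price quoted for `per_chars` characters to one million characters."""
    return round(price * 1_000_000 / per_chars, 2)


normalized = {name: price_per_million(*quote) for name, quote in LIST_PRICES.items()}

# Print the cheapest provider first.
for name, usd in sorted(normalized.items(), key=lambda item: item[1]):
    print(f"{name}: ${usd:,.2f} per 1M characters")
```

On these list prices, the lowest ElevenLabs rate works out to $120 per million characters, so the per-character comparison favors the other providers. The argument for ElevenLabs rests on quality and features.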

TTS models

ElevenLabs offers several models tailored to different use cases:

Eleven v3

The newest and most powerful model, released in June 2025.

  • Languages: 70+
  • Character limit: 5,000 per request
  • Features: Audio tags, multi-speaker dialogue, natural emotional context
  • Use cases: Audiobooks, podcasts, dubbing, premium content

Multilingual v2

The flagship model for high-quality speech synthesis.

  • Languages: 29
  • Character limit: 10,000 per request
  • Features: Most nuanced expression, excellent intonation
  • Use cases: Professional voiceovers, ads, e-learning

Flash v2.5

A model optimized for low latency.

  • Languages: 32
  • Character limit: 40,000 per request
  • Latency: ~75ms
  • Use cases: Conversational AI, voice assistants, real-time applications

Turbo v2.5

A fast model at lower cost.

  • Languages: 32
  • Character limit: 40,000 per request
  • Latency: 250-300ms
  • Use cases: Mass audio production, content automation
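Because each model has a different per-request character limit, longer texts have to be split before they are sent. A minimal sketch of one way to do that, breaking at sentence boundaries, follows. The function name is hypothetical, the limits are the ones listed above, and the Turbo v2.5 model ID is an assumption (the other three IDs appear in this guide's examples).

```python
import re

# Per-request character limits as listed above.
# "eleven_turbo_v2_5" is an assumed model ID; the others appear in this guide.
CHAR_LIMITS = {
    "eleven_v3": 5_000,
    "eleven_multilingual_v2": 10_000,
    "eleven_flash_v2_5": 40_000,
    "eleven_turbo_v2_5": 40_000,
}


def split_for_model(text: str, model_id: str) -> list[str]:
    """Split text into chunks within the model's limit, breaking at sentence ends."""
    limit = CHAR_LIMITS[model_id]
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_text = " ".join(["This is a sentence."] * 1000)  # 19,999 characters
print(len(split_for_model(long_text, "eleven_v3")))          # chunks needed for v3
print(len(split_for_model(long_text, "eleven_flash_v2_5")))  # fits in one request
```

Each chunk can then be passed to a separate text-to-speech request and the resulting audio concatenated.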

Audio tags (Eleven v3)

Audio tags are a breakthrough feature of Eleven v3 that lets you control emotions, style, and speaking manner directly in the text.

Code
TEXT
[excited] I can't believe we won the championship!

[whispers] Don't tell anyone, but I have a secret.

[sighs] Another Monday morning...

[laughs] That's the funniest thing I've heard all week!

[sad] I'm going to miss this place.

You can also combine tags with natural context:

Code
TEXT
"I'm so proud of you," she said [tearfully]. "You've come so far."

The model interprets both tags and textual context (punctuation, emotion-describing adjectives), producing very natural results.
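Since audio tags only work on Eleven v3, a script written for v3 may need its tags removed before it is sent to another model. The sketch below assumes a tag is one or more lowercase words in square brackets, as in the examples above. The helper names are hypothetical and not part of the SDK.

```python
import re

# Assumption: an audio tag is one or more lowercase words in square brackets,
# e.g. [excited], [whispers], [tearfully].
AUDIO_TAG = re.compile(r"\[[a-z][a-z ]*\]")


def extract_tags(text: str) -> list[str]:
    """Return the tag names found in the text, in order of appearance."""
    return [match.strip("[]") for match in AUDIO_TAG.findall(text)]


def strip_tags(text: str) -> str:
    """Remove audio tags and tidy up the whitespace they leave behind."""
    without_tags = AUDIO_TAG.sub("", text)
    without_tags = re.sub(r"\s+([.,!?])", r"\1", without_tags)  # no space before punctuation
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()


line = '"I\'m so proud of you," she said [tearfully]. "You\'ve come so far."'
print(extract_tags(line))
print(strip_tags(line))
```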

Getting started with the API

SDK installation

Code
Bash
# Python SDK
pip install elevenlabs

# Node.js SDK (package name as used by the TypeScript examples in this guide)
npm install elevenlabs

API key configuration

Generate an API key in the ElevenLabs dashboard and set it as an environment variable:

Code
Bash
export ELEVENLABS_API_KEY="your-api-key"
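A missing key usually only shows up as an authentication error on the first request. A small sketch that fails fast at startup instead; the helper name is hypothetical, and it only reads the ELEVENLABS_API_KEY variable set above.

```python
import os
from typing import Mapping

ENV_VAR = "ELEVENLABS_API_KEY"


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before starting the app.")
    return key
```

The returned value can be passed to the client explicitly, for example `ElevenLabs(api_key=load_api_key())`, rather than relying on the SDK reading the environment on its own.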

Basic text-to-speech (Python)

Code
Python
from elevenlabs.client import ElevenLabs
from elevenlabs.play import play

elevenlabs = ElevenLabs()

audio = elevenlabs.text_to_speech.convert(
    text="The first move is what sets everything in motion.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)

play(audio)

Basic text-to-speech (TypeScript)

Code
TypeScript
import { ElevenLabsClient } from "elevenlabs";
import { createWriteStream } from "fs";

const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

const audio = await elevenlabs.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "The first move is what sets everything in motion.",
  model_id: "eleven_v3",
  output_format: "mp3_44100_128",
});

const writeStream = createWriteStream("output.mp3");
audio.pipe(writeStream);

Streaming audio

Code
Python
from elevenlabs import stream
from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

audio_stream = elevenlabs.text_to_speech.stream(
    text="This is a streaming example. The audio plays as it generates.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
)

stream(audio_stream)

You can also process chunks manually:

Code
Python
for chunk in audio_stream:
    if isinstance(chunk, bytes):
        process_audio_chunk(chunk)
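The `process_audio_chunk` call above is a placeholder. One common choice is to write chunks to disk as they arrive. A minimal sketch, with a hypothetical helper name, that accepts any iterable of byte chunks such as the `audio_stream` from the streaming example:

```python
from pathlib import Path
from typing import Iterable


def save_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write audio chunks to a file as they arrive; return the bytes written."""
    total = 0
    with Path(path).open("wb") as out:
        for chunk in chunks:
            # Skip empty or non-bytes items, mirroring the isinstance check above.
            if isinstance(chunk, bytes) and chunk:
                out.write(chunk)
                total += len(chunk)
    return total
```

Because the function only depends on the iterable, it can be tested with a plain list of bytes before wiring it to a live stream.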

Async client

Code
Python
import asyncio
from elevenlabs.client import AsyncElevenLabs

elevenlabs = AsyncElevenLabs()

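async def generate_speech() -> bytes:
    # In the async client, convert() yields audio chunks, so it is consumed
    # with `async for` rather than awaited directly.
    chunks = []
    async for chunk in elevenlabs.text_to_speech.convert(
        text="Async speech generation is great for web servers.",
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_v3",
    ):
        chunks.append(chunk)
    return b"".join(chunks)

audio = asyncio.run(generate_speech())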

Searching voices

Code
Python
from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

response = elevenlabs.voices.search()
for voice in response.voices:
    print(f"{voice.name} ({voice.voice_id})")

Voice cloning

ElevenLabs offers two types of voice cloning:

Instant voice cloning

Quick cloning from short audio samples (30 seconds - a few minutes). Available from the Starter plan.

Code
Python
from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

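# The SDK expects file contents (binary file objects), not path strings.
sample_paths = ["./sample_0.mp3", "./sample_1.mp3", "./sample_2.mp3"]
sample_files = [open(path, "rb") for path in sample_paths]

try:
    voice = elevenlabs.voices.ivc.create(
        name="Alex",
        description="An old American male voice with a slight hoarseness in his throat. Perfect for news",
        files=sample_files,
    )
finally:
    # Close the samples whether or not the upload succeeded.
    for sample in sample_files:
        sample.close()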

audio = elevenlabs.text_to_speech.convert(
    text="This is my cloned voice speaking.",
    voice_id=voice.voice_id,
    model_id="eleven_v3",
)

Professional voice cloning

Advanced cloning from longer recordings (30+ minutes), delivering the highest reproduction quality. Requires the Creator plan or higher. The process includes identity verification and consent for cloning.

Conversational AI

Conversational AI is the ElevenLabs platform for building interactive real-time voice agents. It combines STT, LLM, and TTS into a single pipeline with monitoring and analytics.

Architecture

  1. Speech-to-text - Scribe (ElevenLabs' own model) converts speech to text
  2. LLM - Your choice of model (GPT-4o, Claude, Gemini) processes text and generates a response
  3. Text-to-speech - ElevenLabs TTS converts the response to natural speech
  4. Knowledge base - Optional knowledge base (documents, FAQ) accessible to the agent
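The four stages above can be read as a simple function pipeline. The sketch below is not the ElevenLabs implementation. It uses stand-in callables for each stage (all names are hypothetical) to show how one conversational turn flows from incoming audio to outgoing audio.

```python
from typing import Callable, Optional


def run_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
    lookup_knowledge: Optional[Callable[[str], str]] = None,
) -> bytes:
    """One conversational turn: speech -> text -> LLM reply -> speech."""
    user_text = transcribe(audio_in)                 # 1. speech-to-text
    context = ""
    if lookup_knowledge is not None:
        context = lookup_knowledge(user_text)        # 4. optional knowledge base
    reply_text = generate_reply(user_text, context)  # 2. LLM
    return synthesize(reply_text)                    # 3. text-to-speech


# Stand-in stages so the sketch runs without any network calls.
reply_audio = run_turn(
    b"fake-audio",
    transcribe=lambda audio: "where is my order",
    generate_reply=lambda text, ctx: f"You asked: {text}. {ctx}".strip(),
    synthesize=lambda text: text.encode("utf-8"),
    lookup_knowledge=lambda text: "Orders ship within two days.",
)
print(reply_audio.decode("utf-8"))
```

In the hosted product these stages run as a managed real-time pipeline. The sketch only illustrates the order in which data moves between them.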

Creating a conversational agent

Code
Python
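import signal

from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

elevenlabs = ElevenLabs()

audio_interface = DefaultAudioInterface()

conversation = Conversation(
    client=elevenlabs,
    agent_id="your-agent-id",
    requires_auth=True,
    audio_interface=audio_interface,
)

conversation.start_session()

# End the session cleanly on Ctrl+C.
signal.signal(signal.SIGINT, lambda sig, frame: conversation.end_session())

# Block until the session ends; calling end_session() right after
# start_session() would close the conversation before anyone speaks.
conversation_id = conversation.wait_for_session_end()
print(f"Conversation ID: {conversation_id}")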

Agent with tool calling

Code
Python
import asyncio
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import ClientTools, Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

elevenlabs = ElevenLabs()

async def main():
    custom_loop = asyncio.get_running_loop()
    client_tools = ClientTools(loop=custom_loop)

    async def get_weather(params):
        location = params.get("location", "Unknown")
        return f"Weather in {location}: Sunny, 72°F"

    async def check_order(params):
        order_id = params.get("order_id", "")
        return f"Order {order_id}: Shipped, arriving tomorrow."

    client_tools.register("get_weather", get_weather, is_async=True)
    client_tools.register("check_order", check_order, is_async=True)

    conversation = Conversation(
        client=elevenlabs,
        agent_id="your-agent-id",
        requires_auth=True,
        audio_interface=DefaultAudioInterface(),
        client_tools=client_tools,
    )

    conversation.start_session()

asyncio.run(main())

Tool registration

Code
Python
from elevenlabs.conversational_ai.conversation import ClientTools

client_tools = ClientTools()

def calculate_sum(params):
    numbers = params.get("numbers", [])
    return sum(numbers)

async def fetch_data(params):
    url = params.get("url")
    return {"data": "fetched"}

client_tools.register("calculate_sum", calculate_sum, is_async=False)
client_tools.register("fetch_data", fetch_data, is_async=True)

Additional products

Scribe (Speech-to-Text)

ElevenLabs' own STT model. Transcribes audio with character-level timestamps and speaker diarization.

Eleven Music

AI music generator launched in August 2025. Creates studio-quality music from natural language prompts. Developed in collaboration with record labels and artists - generated music is cleared for commercial use.

Video dubbing and localization

Localizes films and videos into 70+ languages while preserving the original speaker's voice, emotions, and timing.

Reader App

A mobile app (iOS/Android) that lets you listen to articles, PDFs, and ePubs with AI voices.

Audiobook publishing

A platform for creating and publishing AI-generated audiobooks, launched in February 2025.

Pricing

Plans

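| Plan | Price/mo | TTS minutes | Conversational AI | Voice cloning |
| --- | --- | --- | --- | --- |
| Free | $0 | ~10 min | - | - |
| Starter | $5 | ~30 min | - | Instant |
| Creator | $22 | ~100 min | - | Professional |
| Pro | $99 | ~500 min | Yes | Professional |
| Scale | $330 | ~2,000 min | Yes | Professional |
| Business | $1,320 | 11,000 min | 13,750 min | Professional |
| Enterprise | Custom | Custom | Custom | Custom |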

Per-character costs (TTS, Multilingual v2)

| Plan | Cost per 1K chars |
| --- | --- |
| Creator | $0.30 |
| Pro | $0.24 |
| Scale | $0.18 |
| Business | $0.12 |
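To see what these rates mean for a concrete job, the cost of a text is its length in thousands of characters times the plan's rate. A minimal sketch with the rates copied from the table above. The function name is hypothetical, and the result is an estimate rather than a billing figure.

```python
# Per-1K-character rates from the table above (USD, Multilingual v2).
RATE_PER_1K = {"Creator": 0.30, "Pro": 0.24, "Scale": 0.18, "Business": 0.12}


def estimate_cost(text: str, plan: str) -> float:
    """Estimated USD cost to synthesize `text` at the plan's per-character rate."""
    return round(len(text) / 1_000 * RATE_PER_1K[plan], 4)


chapter = "x" * 45_000  # stand-in for a 45,000-character chapter
for plan in RATE_PER_1K:
    print(f"{plan}: ${estimate_cost(chapter, plan):.2f}")
```

For a 45,000-character chapter this gives $13.50 on Creator and $5.40 on Business, which shows how quickly the per-character rate matters at audiobook scale.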

Conversational AI

On the Business plan, Conversational AI costs $0.08/min. Unused minutes do not carry over; the allowance resets each month.

Comparison with competitors

| Platform | TTS cost | Conversational AI | Voice cloning |
| --- | --- | --- | --- |
| ElevenLabs | From $0.12/1K chars | $0.08/min | Yes |
| Amazon Polly | $4/1M chars | No | No |
| Google Cloud TTS | $4-$16/1M chars | Via Dialogflow | No |
| OpenAI TTS | $15/1M chars | Realtime API | No |
| Play.ht | From $0.10/1K chars | No | Yes |

Savings

Annual plans offer 2 months free. Unused credits roll over to the next month when upgrading plans.
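As a quick worked example of the annual discount, assume "2 months free" means paying for 10 of 12 months (an interpretation, not a quoted billing rule). The effective monthly price then follows directly from the plan prices listed earlier.

```python
# Monthly list prices from the plans table (USD).
MONTHLY_PRICE = {"Starter": 5, "Creator": 22, "Pro": 99, "Scale": 330, "Business": 1_320}


def annual_billing(monthly: float, free_months: int = 2) -> tuple[float, float]:
    """Return (yearly total, effective monthly price) with `free_months` waived."""
    yearly = monthly * (12 - free_months)
    return yearly, round(yearly / 12, 2)


for plan, price in MONTHLY_PRICE.items():
    yearly, effective = annual_billing(price)
    print(f"{plan}: ${yearly:,.0f}/year, about ${effective:,.2f}/month")
```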

Security and compliance

  • Encryption - Data is encrypted in transit and at rest
  • SOC 2 - SOC 2 compliant
  • HIPAA - Support for HIPAA compliance
  • GDPR - GDPR compliant
  • EU Data Residency - Option to store data in the EU
  • Zero Retention - Mode that stores no data, for sensitive applications
  • Consent verification - Speaker consent is verified for voice cloning

Audio formats

| Format | Sample rate | Description |
| --- | --- | --- |
| MP3 | 22.05-44.1 kHz | Default, universal |
| PCM | 16-44.1 kHz | Raw audio, lowest latency |
| μ-law | 8 kHz | Telephony |
| A-law | 8 kHz | Telephony (Europe) |
| Opus | 48 kHz | WebRTC, streaming |
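The `output_format` values used in this guide's examples, such as "mp3_44100_128" and "mp3_22050_32", read as codec, sample rate in Hz, and bitrate in kbps. A small sketch that splits such an identifier into its parts. The parser is hypothetical, and the two-part form for formats without a bitrate (for example raw PCM) is an assumption.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the identifier carries no bitrate


def parse_output_format(value: str) -> OutputFormat:
    """Split an identifier like 'mp3_44100_128' into codec, sample rate, and bitrate."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output format: {value!r}")
    codec, sample_rate = parts[0], int(parts[1])
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate, bitrate)


print(parse_output_format("mp3_44100_128"))
print(parse_output_format("mp3_22050_32"))
```

A helper like this is handy for setting the right `Content-Type` header or player configuration from the same string that is sent to the API.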

Practical applications

Multi-voice podcast (v3)

Code
Python
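from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

HOST_VOICE = "JBFqnCBsd6RMkjVDRZzb"
GUEST_VOICE = "your-guest-voice-id"  # placeholder: replace with a second voice ID

# Multi-speaker audio goes through the text-to-dialogue endpoint: each line
# carries its own voice_id, and audio tags are written inside the text.
audio = elevenlabs.text_to_dialogue.convert(
    model_id="eleven_v3",
    inputs=[
        {"text": "Welcome to Tech Talk! Today we're discussing the future of AI.", "voice_id": HOST_VOICE},
        {"text": "Thanks for having me. I think 2026 is going to be a breakthrough year.", "voice_id": GUEST_VOICE},
        {"text": "[excited] Absolutely! Let's dive right in.", "voice_id": HOST_VOICE},
    ],
)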

Audiobook generator

Code
Python
from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

chapters = [
    "Chapter 1: The Beginning. It was a dark and stormy night...",
    "Chapter 2: The Journey. The next morning brought clear skies...",
    "Chapter 3: The Discovery. Deep in the forest, she found...",
]

for i, chapter in enumerate(chapters):
    audio = elevenlabs.text_to_speech.convert(
        text=chapter,
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_multilingual_v2",
        output_format="mp3_44100_128",
    )
    with open(f"chapter_{i+1}.mp3", "wb") as f:
        for chunk in audio:
            f.write(chunk)

Real-time voice assistant (Next.js)

Code
TypeScript
import { ElevenLabsClient } from "elevenlabs";
import { NextResponse } from "next/server";

const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

export async function POST(request: Request) {
  const { text, voiceId } = await request.json();

  const audio = await elevenlabs.textToSpeech.convert(voiceId, {
    text,
    model_id: "eleven_flash_v2_5",
    output_format: "mp3_22050_32",
  });

  return new NextResponse(audio as unknown as ReadableStream, {
    headers: {
      "Content-Type": "audio/mpeg",
    },
  });
}

React integration

Code
TypeScript
import { useState, useRef } from "react";

export function TextToSpeechPlayer() {
  const [text, setText] = useState("");
  const [isLoading, setIsLoading] = useState(false);
  const audioRef = useRef<HTMLAudioElement>(null);

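  const generateSpeech = async () => {
    setIsLoading(true);

    try {
      const response = await fetch("/api/tts", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text,
          voiceId: "JBFqnCBsd6RMkjVDRZzb",
        }),
      });

      // Surface HTTP errors instead of trying to play an error body as audio.
      if (!response.ok) {
        throw new Error(`TTS request failed with status ${response.status}`);
      }

      const blob = await response.blob();
      const url = URL.createObjectURL(blob);

      if (audioRef.current) {
        audioRef.current.src = url;
        await audioRef.current.play();
      }
    } catch (error) {
      console.error(error);
    } finally {
      // Always re-enable the button, even when the request fails.
      setIsLoading(false);
    }
  };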

  return (
    <div className="flex flex-col gap-4 p-6 max-w-md">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter text to convert to speech..."
        className="w-full p-3 border rounded-lg dark:bg-gray-800 dark:border-gray-600"
        rows={4}
      />
      <button
        onClick={generateSpeech}
        disabled={isLoading || !text}
        className="px-4 py-2 bg-blue-500 hover:bg-blue-600 text-white rounded-lg disabled:opacity-50"
      >
        {isLoading ? "Generating..." : "Generate Speech"}
      </button>
      <audio ref={audioRef} controls className="w-full" />
    </div>
  );
}

Limitations and challenges

  1. Cost at scale - Costs grow quickly at high audio volumes, especially with premium models
  2. Character limit - v3 has a 5,000-character limit per request (vs 40,000 for Flash v2.5)
  3. Free plan - Only 10 minutes per month, no commercial use, requires attribution
  4. Voice cloning ethics - Requires consent verification, which slows down the process
  5. No self-hosting - Models available only via API, no on-premise option
  6. Quality in minor languages - Some languages have lower quality than English

FAQ

Does ElevenLabs support Polish?

Yes, Polish is one of the 70+ supported languages. The v3 model offers the best Polish quality with culturally appropriate intonation. For real-time applications, Flash v2.5 also supports Polish.

How much does voice cloning cost?

Instant voice cloning is available from the Starter plan ($5/mo). Professional voice cloning requires the Creator plan ($22/mo) or higher. The cloning process itself has no additional fee - you pay for audio generation as usual.

Can I use ElevenLabs for a commercial audiobook?

Yes, from the Starter plan you have commercial usage rights. ElevenLabs also offers a dedicated platform for publishing audiobooks in the Reader app.

How does ElevenLabs compare to OpenAI TTS?

ElevenLabs offers higher voice quality, voice cloning, audio tags, multi-speaker dialogue, and lower latency (75ms vs ~300ms). OpenAI TTS is simpler to use and has a Realtime API for conversations, but doesn't support voice cloning or such advanced emotion control.

Can I build a voice agent with ElevenLabs?

Yes, Conversational AI combines STT (Scribe), LLM (your choice), and TTS into a single pipeline. It supports tool calling, knowledge base, and monitoring. SDKs are available for Python, TypeScript, Flutter, Swift, and Kotlin.

Do audio tags work in all models?

No, audio tags ([excited], [whispers], etc.) are a feature exclusive to the Eleven v3 model. Older models (Multilingual v2, Flash v2.5) interpret emotions from text context but don't support tags.

Summary

ElevenLabs is the undisputed leader in AI speech synthesis quality in 2026. The v3 model with audio tags, multi-speaker dialogue, and 70+ languages sets a new industry standard. Conversational AI allows building voice agents comparable to Vapi, but with the advantage of the best voice quality on the market.

For developers, ElevenLabs offers well-documented SDKs (Python, TypeScript), streaming API, voice cloning, and flexible architecture connecting to any LLM. The main trade-offs are cost at scale and no self-hosting option - but if voice quality is the priority, ElevenLabs is hard to beat.