FlashLabs Unveils Chroma 1.0: A Milestone in Open-Source Real-Time Voice AI
Edited by: Veronika Radoslavskaya
Applied AI research lab FlashLabs has announced the release of Chroma 1.0, marking a significant shift in how humans interact with artificial intelligence through speech. FlashLabs describes Chroma as the world's first open-source, end-to-end (E2E) speech-to-speech model, designed specifically to operate at "human speed" by eliminating the technical delays inherent in traditional voice systems. By moving away from fragmented processing pipelines, the model enables fluid, natural conversations that support elements such as emotional nuance and immediate turn-taking.
Native Speech Architecture
Most existing voice assistants rely on a multi-step pipeline: converting speech to text (ASR), processing that text with a language model (LLM), and finally synthesizing a spoken response (TTS). This cascaded approach introduces noticeable latency: the delay between a user finishing a sentence and the AI starting its reply. Chroma 1.0 operates natively in voice, achieving an end-to-end "Time to First Token" (TTFT) of under 150 ms. This near-instantaneous response allows the AI to react to interruptions and maintain natural prosody (the rhythm and intonation of human speech) without the lag typical of older systems.
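As a rough illustration of what that latency figure measures, the sketch below times the gap between handing the model a finished utterance and receiving the first chunk of generated audio. The `model.stream_response` interface is assumed purely for illustration; it is not a documented Chroma 1.0 API.

```python
import time

def measure_ttft(model, user_audio):
    """Measure time to first token (TTFT) for a streaming speech model.

    TTFT here is the gap between submitting the user's finished utterance
    and receiving the first chunk of generated audio back from the model.
    `model.stream_response` is a hypothetical streaming interface used only
    to make the idea concrete.
    """
    start = time.perf_counter()
    for _first_chunk in model.stream_response(user_audio):
        # Stop as soon as the first audio chunk arrives.
        return time.perf_counter() - start
    return None  # the model produced no output
```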
High-Fidelity Voice Cloning
A core feature of Chroma 1.0 is its advanced voice cloning capability, which requires only a few seconds of audio to create a personalized digital voice. In internal evaluations, the model achieved a speaker similarity score (SIM) of 0.817, which FlashLabs notes is nearly 11% above the human baseline for voice recognition. This suggests that high-quality, recognizable voice identities can now be generated without the need for massive datasets or extensive fine-tuning cycles.
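Speaker similarity scores of this kind are conventionally computed as the cosine similarity between speaker embeddings of the reference recording and the generated audio. The minimal sketch below shows that calculation; the embedding extractor itself (typically a speaker-verification model) is outside its scope, and it is not FlashLabs' evaluation code.

```python
import numpy as np

def speaker_similarity(ref_embedding: np.ndarray, gen_embedding: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings.

    SIM values such as the reported 0.817 are typically obtained this way,
    using embeddings produced by a speaker-verification model.
    """
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    gen = gen_embedding / np.linalg.norm(gen_embedding)
    return float(np.dot(ref, gen))

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=256), rng.normal(size=256)
print(f"SIM: {speaker_similarity(a, b):.3f}")
```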
Efficient Scale and Availability
Despite its sophisticated reasoning capabilities, Chroma 1.0 is built on a compact architecture of approximately 4 billion parameters. This efficiency makes the model suitable for a variety of applications, including:
- Autonomous Voice Agents: Creating responsive assistants for personal or professional use.
- Edge Deployment: Running the model locally on devices where low latency and data privacy are priorities.
- Interactive NPCs: Enabling non-player characters in video games to engage in unscripted, real-time vocal dialogue.
- Real-Time Translation: Powering tools that can translate spoken language almost as fast as it is uttered.
FlashLabs has released Chroma 1.0 as an open-source project, with the model weights available on Hugging Face and the inference code hosted on GitHub. This open-access approach is intended to allow researchers and developers worldwide to build upon this real-time intelligence, fostering a new era of "agentic" systems that operate at the speed of natural human conversation.
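For developers who want to experiment locally, the weights can be fetched with the standard Hugging Face Hub client, as in the sketch below. The repository id shown is a placeholder; consult FlashLabs' Hugging Face page and GitHub repository for the actual names.

```python
# Fetch the released model files with the Hugging Face Hub client.
from huggingface_hub import snapshot_download

# "FlashLabs/Chroma-1.0" is a placeholder repository id, not a confirmed name.
local_dir = snapshot_download(repo_id="FlashLabs/Chroma-1.0")
print(f"Model files downloaded to: {local_dir}")
```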
Sources
IT News Online
PR Newswire
MarkTechPost
GitHub
Hugging Face
FlashIntel | Forbes Technology Council