Inside a Bilingual Voice AI Call: How NuPlay Handles Mid-Sentence Language Switches

Over 41 million people in the US speak Spanish as their primary language. A large share of them also speak English - and when they're frustrated, confused, or simply more comfortable, they switch. Mid-sentence. Without warning. Sometimes mid-word.

Contact centers have known this for decades. The response, for most of that time, has been to build two separate call flows: press 1 for English, oprima 2 para español. It works - badly. The caller who switches languages mid-call gets dropped into the wrong flow, gets routed to hold, or gets a robotic response that ignores the switch entirely.

Voice AI was supposed to fix this. It hasn't. Not because the idea is wrong, but because most platforms treat bilingual handling as a feature - a checkbox, a configuration option, a language selector in the builder settings. That framing is wrong. Bilingual voice AI is an infrastructure problem, and it requires an infrastructure solution.

This is what that solution actually looks like.

First, understand what's actually breaking

When an enterprise deploys a voice AI agent today, the bilingual problem shows up in three specific places. Each one is distinct. Most vendors only address one, if any.

Problem 1 - The ASR layer can't hear the switch

Automatic Speech Recognition (ASR) is the layer that converts audio to text. Most ASR models are trained on monolingual data. When a caller says "I need help with my cuenta," a monolingual English ASR might transcribe it as "I need help with my cuenta" - or it might hallucinate an English word that sounds similar. "Counter." "Counted." The word gets lost before the AI has even started processing.

This is the root failure. Everything downstream depends on the transcript. If the ASR can't produce an accurate transcript across both languages, no amount of clever LLM prompting will save the call.

Problem 2 - Language detection is too slow, or too coarse

Even if the ASR captures the word correctly, the system still needs to know what language it's dealing with before it can respond appropriately. Most approaches use utterance-level language detection - they look at the caller's whole turn, assign a single language label, and proceed. This fails on code-switched input because the turn contains both languages. One label has to win. The losing language gets ignored.

Some systems use word-level detection instead. This creates a different problem: filler words and short phrases trigger false switches. A caller says "okay" - a word that exists in Spanish too - and the system flips to Spanish. It's noise, not signal.
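Both failure modes are easy to reproduce with a toy detector. The sketch below is illustrative only: the two lexicons and the function names are hypothetical stand-ins for a trained model, not part of any vendor's system.

```python
from collections import Counter

# Tiny illustrative lexicons (hypothetical; a real system would use a trained model).
EN = {"i", "need", "help", "with", "my", "the", "account", "okay"}
ES = {"necesito", "ayuda", "con", "mi", "la", "cuenta", "okay"}

def tag_words(utterance):
    """Word-level tags: 'en', 'es', 'shared' (in both lexicons), or 'unknown'."""
    tags = []
    for word in utterance.lower().split():
        in_en, in_es = word in EN, word in ES
        if in_en and in_es:
            tags.append((word, "shared"))
        elif in_es:
            tags.append((word, "es"))
        elif in_en:
            tags.append((word, "en"))
        else:
            tags.append((word, "unknown"))
    return tags

def utterance_label(utterance):
    """Utterance-level label: majority vote over the unambiguous words."""
    counts = Counter(tag for _, tag in tag_words(utterance) if tag in ("en", "es"))
    return counts.most_common(1)[0][0] if counts else "unknown"

# One label has to win: the Spanish word is outvoted five to one.
print(utterance_label("I need help with my cuenta"))
# A shared word carries no language signal on its own.
print(tag_words("okay"))
```

The mixed sentence gets a single English label, so "cuenta" is effectively ignored, while "okay" is ambiguous and any switch triggered by it alone would be a guess.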

Problem 3 - The LLM doesn't know when to actually switch

Even if the system correctly identifies that a switch has occurred, the response layer has to decide what to do with that information. Most implementations either respond in a fixed language regardless, or mirror every detected switch - including false ones. Neither is right.

The real requirement: respond in the language the caller has clearly committed to, not the language of the last word they said.

These three failure points compound each other. A bad transcript leads to wrong language detection, which leads to a mismatched response. By the time the caller hears the agent respond in the wrong language, the damage to the interaction is already done.

What a real solution looks like: 5 layers

At Nurix, we didn't solve this by adding a language toggle. We built a five-layer pipeline that handles code-switching at every stage of a call - from the moment audio hits the system to post-call quality evaluation. Here's each layer, what it does, and why it matters for enterprise contact centers.

Layer 1 - ASR:

Speech Recognition with Native Code-Switching Support

Deepgram Nova 3, self-hosted

The first thing a voice AI touches is audio. Before any LLM sees the conversation, before any language detection runs, the ASR model has to produce an accurate transcript across both English and Spanish - including mixed sentences.

We use Deepgram Nova 3, deployed in our own infrastructure. Nova 3 is trained on code-switched audio, which means it's seen real-world speech where languages mix within a single sentence. It doesn't force the transcript into one language. It transcribes what was actually said.

Why it matters for enterprises: A bad transcript at Layer 1 corrupts everything downstream. If your ASR can't handle "mi cuenta," no amount of post-processing will recover that word. This is the most commonly underestimated failure point in voice AI procurement - most buyers evaluate LLM quality and voice naturalness, not ASR accuracy on mixed-language audio.

Business impact: transcript accuracy → fewer failed calls

Layer 2 - Language Identification Model:

Sentence-Level, Not Word-Level

Text-level LID on top of the transcript

Once the transcript is produced, the system needs to identify which language the caller is operating in. This runs on the text of the transcript - not on individual words, but on sentence structure and grammatical patterns.

Sentence-level detection is the right granularity here. Word-level detection is too noisy - shared words between English and Spanish create constant false positives. Utterance-level detection is too coarse - it can't handle mid-sentence switches. Sentence-level sits in between: it's sensitive enough to catch real switches, stable enough to ignore fillers and cognates.
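As a rough illustration of sentence-level granularity, the sketch below splits a transcript turn into sentences and labels each one separately. It assumes the transcript arrives punctuated, and it uses small hypothetical function-word lists in place of a real text LID model; none of the names here come from NuPlay.

```python
import re

# Hypothetical function-word lists; stand-ins for a trained text LID model.
EN_FUNCTION = {"i", "the", "my", "with", "is", "to", "need", "can", "you", "and"}
ES_FUNCTION = {"yo", "el", "la", "mi", "con", "es", "para", "necesito", "puede", "y", "que"}

def split_sentences(turn):
    """Split a transcript turn on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", turn) if s.strip()]

def identify_sentence(sentence):
    """Return (language, confidence) for one sentence.

    Confidence is the share of function-word evidence that supports the
    winning language; sentences with no evidence are 'unknown'.
    """
    words = re.findall(r"[a-záéíóúñü']+", sentence.lower())
    en = sum(w in EN_FUNCTION for w in words)
    es = sum(w in ES_FUNCTION for w in words)
    if en + es == 0:
        return ("unknown", 0.0)
    lang = "en" if en >= es else "es"
    return (lang, max(en, es) / (en + es))

turn = "I need help with my cuenta. Es que no puedo entrar."
for sentence in split_sentences(turn):
    print(sentence, identify_sentence(sentence))
```

The borrowed noun "cuenta" does not flip the first sentence, because the surrounding function words are English, while the second sentence is labeled Spanish on its own. A lone "okay" produces no evidence either way and comes back as unknown.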

Why it matters for enterprises: This is the layer that determines whether a language switch is real or noise. Getting this wrong means the agent either ignores real switches (caller feels unheard) or over-corrects on false ones (agent keeps flipping languages, call sounds broken).

Business impact: accurate detection → appropriate agent response

Layer 3 - Confidence Scoring:

Don't Guess When You're Not Sure

Turn-level confidence threshold before acting

Language identification isn't always certain. Short turns, heavy background noise, or highly mixed input can produce low-confidence detections. At Layer 3, the system assigns a confidence score to each detected language. If that score falls below a set threshold, the system doesn't proceed with a potentially wrong assumption.

Instead, it can do one of two things: ask for clarification ("Did you want to continue in Spanish?") or route the call for human review. The key point is it doesn't guess. A wrong-language response delivered with confidence is worse than a short clarifying question.
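The gating logic can be sketched in a few lines. The threshold value, the clarification limit, and the choice to try one clarifying question before escalating are all assumptions made for illustration; they are not NuPlay's actual settings.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    RESPOND = "respond"            # confident enough to answer in the detected language
    CLARIFY = "clarify"            # ask the caller which language they prefer
    HUMAN_REVIEW = "human_review"  # route the call to a person

@dataclass
class Detection:
    language: str
    confidence: float  # 0.0 to 1.0

# Assumed values for illustration only.
CONFIDENCE_THRESHOLD = 0.75
MAX_CLARIFICATIONS = 1

def decide(detection, clarifications_so_far=0):
    """Pick the next action for a turn instead of guessing on low confidence."""
    if detection.confidence >= CONFIDENCE_THRESHOLD:
        return Action.RESPOND
    if clarifications_so_far < MAX_CLARIFICATIONS:
        return Action.CLARIFY
    return Action.HUMAN_REVIEW

print(decide(Detection("es", 0.92)))
print(decide(Detection("es", 0.40)))
print(decide(Detection("es", 0.40), clarifications_so_far=1))
```

Capping the number of clarifying questions is a design choice in this sketch: it keeps the agent from asking "Did you want to continue in Spanish?" on every uncertain turn.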

Why it matters for enterprises: Confidence scoring is the difference between a system that handles ambiguity gracefully and one that bulldozes through it and creates a bad experience. In regulated industries - banking, insurance, healthcare - the stakes of a wrong-language interaction are especially high. Mishandled bilingual calls contribute directly to complaints, repeat call rates, and compliance exposure.

Business impact: fewer bad responses → lower repeat call rate

Layer 4 - Language Mirroring:

The Agent Responds in the Right Language

LLM responds in detected language; switches only on full-sentence commitment

This is the layer callers actually experience. Language Mirroring is the logic that governs how the AI agent responds to a language switch. The rule is precise: a full sentence in a new language triggers a switch. A single filler word - "okay," "sí," "sure" - does not.

This matters because bilingual callers don't always commit to a switch. They test the water with a word. If the agent flips language on every test word, the conversation becomes incoherent. Language Mirroring waits for commitment before following.

Primary and secondary language are configured at the agent level in NuPlay's builder - enterprises can set English as primary with Spanish as secondary, or configure agents for specific populations where Spanish is the primary language and English is secondary.
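A minimal sketch of the mirroring rule as a small state machine follows. The filler list, the three-word minimum used to approximate a "full sentence," and the class and field names are hypothetical; the real commitment test is not described in this article.

```python
from dataclasses import dataclass

# Assumed for illustration: what counts as a filler and as a full sentence.
FILLERS = {"okay", "ok", "sí", "si", "sure", "yes", "no", "gracias", "thanks"}
MIN_WORDS = 3

@dataclass
class MirroringState:
    primary: str = "en"
    secondary: str = "es"
    current: str = ""

    def __post_init__(self):
        # The agent starts in the configured primary language.
        if not self.current:
            self.current = self.primary

    def update(self, sentence, detected_language):
        """Follow the caller only when a full sentence commits to a language."""
        words = [w.strip(".,!?¿¡").lower() for w in sentence.split()]
        words = [w for w in words if w]
        is_filler_only = all(w in FILLERS for w in words)
        is_full_sentence = len(words) >= MIN_WORDS and not is_filler_only
        if is_full_sentence and detected_language in (self.primary, self.secondary):
            self.current = detected_language
        return self.current

state = MirroringState(primary="en", secondary="es")
print(state.update("I need help with my account.", "en"))
print(state.update("sí", "es"))
print(state.update("Es que no puedo entrar a mi cuenta.", "es"))
print(state.update("okay", "en"))
print(state.update("Can you reset my password?", "en"))
```

In this run the agent stays in English through the lone "sí," follows the caller into Spanish on a full Spanish sentence, ignores the "okay," and returns to English only when the caller commits to it again.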

Why it matters for enterprises: This is where the customer experience lives. An agent that mirrors language correctly feels natural - callers don't have to think about what language they're speaking in. That reduction in cognitive load shortens handle time, reduces frustration, and improves resolution rates. One of the active deals that surfaced this requirement had callers who would start a call in English, switch to Spanish when emotional, and then switch back. The agent needs to track all three states without breaking flow.

Business impact: natural conversation → lower handle time, higher CSAT

Layer 5 - Post-Call Analysis:

Post-Call Language Switch Quality Scoring

Automated QA across 12 conversational metrics

The first four layers handle a call in real time. Layer 5 looks back at what happened. NuPlay's post-call evaluation system runs automatically on every call and scores a set of conversational quality metrics. Language Switch Quality is one of them.

Specifically, it scores whether the agent correctly tracked language across the conversation - did it switch when it should have, stay put when it shouldn't have, handle low-confidence turns correctly? These scores feed back into agent improvement cycles, allowing teams to identify systematic failures and fix them before they compound at scale.
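One simple way to picture such a metric is the share of agent turns delivered in the language the caller had committed to. The sketch below is an assumption about how a score like this could be computed, not NuPlay's scoring method; the `Turn` fields are hypothetical, and `expected_language` is presumed to come from a reference labeling of the call.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    index: int
    expected_language: str  # language the caller had committed to at this point
    agent_language: str     # language the agent actually responded in

def language_switch_quality(turns):
    """Return (score, failures): the fraction of correct turns and the wrong ones."""
    if not turns:
        return None, []
    failures = [t for t in turns if t.agent_language != t.expected_language]
    score = 1 - len(failures) / len(turns)
    return score, failures

call = [
    Turn(1, "en", "en"),
    Turn(2, "es", "en"),  # agent missed the switch to Spanish
    Turn(3, "es", "es"),
    Turn(4, "en", "en"),
]
score, failures = language_switch_quality(call)
print(score, [t.index for t in failures])
```

Returning the failing turns alongside the score is what makes the number actionable: a reviewer can jump straight to the turns where language tracking broke.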

Why it matters for enterprises: Without post-call QA on language switching, you're flying blind. You might know your containment rate. You might know your CSAT score. But you don't know how often your bilingual agent responded in the wrong language - unless you're listening to every call manually. Automated scoring at scale closes that gap. It turns a gut-feel problem ("we think the Spanish handling is fine") into a data problem ("language switch quality is 91% - here are the 9% of calls where it broke and why").

Business impact: systematic QA → measurable improvement over time

How this compares to what everyone else is doing

Most voice AI vendors do address multilingual support in some form. The question is how, and the architectural differences have direct business consequences.

Translation Layer
How it works: Translate input to English → process in English → translate response back to Spanish.
The problem: Adds latency at every step. Loses nuance in translation. The AI is reasoning in a language it didn't hear the call in.

Separate Language Agents
How it works: Two agents configured - one English, one Spanish. Routing logic decides which one handles the call.
The problem: Falls apart on mid-sentence switches. No routing decision can happen fast enough mid-turn. Context is lost on handoff.

"Language Count as a Feature"
How it works: "Supports 20 languages" - typically means the LLM can respond in 20 languages if prompted correctly.
The problem: No code-switching support. No ASR accuracy guarantee. No QA. Works in demos, breaks on real calls.

NuPlay - Native 5-Layer Stack
How it works: ASR with code-switching support → sentence-level LID → confidence scoring → Language Mirroring → post-call QA.
Result: Handles mid-sentence switches in real time. No translation step. QA closes the loop on every call.

"Bilingual isn't a feature you toggle on. It's a five-layer infrastructure problem and most voice AI is solving it with one layer, badly."

What this means for enterprise contact centers

If you run a contact center that serves a bilingual population, the business case for getting this right isn't abstract. Here's where it shows up on your metrics:

AHT: Callers who don't have to manage their own language switching, or be transferred, resolve faster. A caller who switches to Spanish and gets a Spanish response doesn't lose their train of thought.
Repeat Calls: Many "unresolved" calls in bilingual populations are language failures, not complexity failures. The caller understood. The agent didn't. That's a solvable problem.
Containment: Bilingual callers who hit a language wall escalate to human agents at higher rates. Fixing the language handling keeps more calls in the automated flow.
Market Reach: Over 41 million people in the US speak Spanish as their primary language. A contact center that serves them fluently is accessible to a population that many competitors effectively wall out.

There's also a compliance angle that's rarely discussed. In regulated industries - financial services, healthcare, insurance - a caller who was mishandled due to a language failure is a liability. If your QA system can't tell you how often that happened, you don't know the exposure you're carrying.

The origin of this approach: what India taught us

Nurix built its original voice AI infrastructure for the Indian market - a market where code-switching isn't occasional, it's the default mode of communication. Hinglish (Hindi + English), Tanglish (Tamil + English), and dozens of other combinations mean that a voice AI deployed in India has to handle language mixing from day one or it doesn't work at all.

That constraint forced us to solve this problem at the infrastructure level early. The 5-layer stack wasn't designed for Spanish-English in the US - it was designed for a market where failure wasn't an option. Adapting it for English-Spanish code-switching in American contact centers is a translation of the architecture, not a rebuild of it.

That's the real differentiation. Most US-focused voice AI vendors are encountering this problem for the first time as Spanish-English demand grows in enterprise contact centers. We've been solving it for years.

What to ask your current vendor

If you're evaluating voice AI platforms for a bilingual deployment, five questions will tell you most of what you need to know:

What ASR model are you using, and does it support code-switched audio? Most vendors use off-the-shelf ASR. Ask specifically whether it's been evaluated on Spanish-English mixed sentences. Ask for WER (word error rate) data on code-switched test sets.
How does your language detection work - at the word level, sentence level, or utterance level? Word-level is too noisy. Utterance-level misses intra-sentence switches. Sentence-level is the right granularity. If they can't answer this question, they haven't thought about it.
What happens when language confidence is low? The system should have a defined behavior for uncertainty. If the answer is "it picks the most likely language," that's a red flag.
What's the switch trigger - word, sentence, or something else? A filler-word trigger means the agent will flip language constantly on false positives. Ask specifically what triggers a language change in the agent response.
How do you measure language switch quality post-call? If the answer is "manually" or "we don't," you can't improve what you can't measure. Automated post-call scoring on language handling should be a standard feature for any enterprise deployment.
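For the first question, it helps to know what a WER figure actually measures. WER is the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of reference words. The sketch below is a plain word-level edit distance with hypothetical example sentences, not a vendor benchmark tool.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "I need help with my cuenta"
hypothesis = "I need help with my counter"
print(round(word_error_rate(reference, hypothesis), 3))
```

One wrong word in a six-word sentence looks like a small error rate, which is why aggregate WER can hide the problem: the single substituted word here is the only Spanish word in the sentence. Asking for WER on a code-switched test set specifically is what surfaces it.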
