AI Speech Analytics

Inside a Bilingual Voice AI Call: How NuPlay Handles Mid-Sentence Language Switches

Written by
Dr. Anushtha Singh
Created On
16 June 2026


Over 41 million people in the US speak Spanish as their primary language. A large share of them also speak English - and when they're frustrated, confused, or simply more comfortable, they switch. Mid-sentence. Without warning. Sometimes mid-word.

Contact centers have known this for decades. The response, for most of that time, has been to build two separate call flows: press 1 for English, oprima 2 para español. It works - badly. The caller who switches languages mid-call gets dropped into the wrong flow, gets routed to hold, or gets a robotic response that ignores the switch entirely.

Voice AI was supposed to fix this. It hasn't. Not because the idea is wrong, but because most platforms treat bilingual handling as a feature - a checkbox, a configuration option, a language selector in the builder settings. That framing is wrong. Bilingual voice AI is an infrastructure problem, and it requires an infrastructure solution.

This is what that solution actually looks like.

First, understand what's actually breaking

When an enterprise deploys a voice AI agent today, the bilingual problem shows up in three specific places. Each one is distinct. Most vendors only address one, if any.

Problem 1 - The ASR layer can't hear the switch

Automatic Speech Recognition (ASR) is the layer that converts audio to text. Most ASR models are trained on monolingual data. When a caller says "I need help with my cuenta," a monolingual English ASR might get lucky and transcribe it as "I need help with my cuenta" - or it might hallucinate an English word that sounds similar. "Counter." "Counted." The word gets lost before the AI has even started processing.

This is the root failure. Everything downstream depends on the transcript. If the ASR can't produce an accurate transcript across both languages, no amount of clever LLM prompting will save the call.

Problem 2 - Language detection is too slow, or too coarse

Even if the ASR captures the word correctly, the system still needs to know what language it's dealing with before it can respond appropriately. Most approaches use utterance-level language detection - they look at the caller's whole turn, assign a single language label, and proceed. This fails on code-switched input because a single turn can contain both languages. One label has to win. The losing language gets ignored.

Some systems use word-level detection instead. This creates a different problem: filler words and short phrases trigger false switches. A caller says "okay" - a word that exists in Spanish too - and the system flips to Spanish. It's noise, not signal.

Problem 3 - The LLM doesn't know when to actually switch

Even if the system correctly identifies that a switch has occurred, the response layer has to decide what to do with that information. Most implementations either respond in a fixed language regardless, or mirror every detected switch - including false ones. Neither is right.

The real requirement: respond in the language the caller has clearly committed to, not the language of the last word they said.

These three failure points compound each other. A bad transcript leads to wrong language detection, which leads to a mismatched response. By the time the caller hears the agent respond in the wrong language, the damage to the interaction is already done.

What a real solution looks like: 5 layers

At Nurix, we didn't solve this by adding a language toggle. We built a five-layer pipeline that handles code-switching at every stage of a call - from the moment audio hits the system to post-call quality evaluation. Here's each layer, what it does, and why it matters for enterprise contact centers.

Layer 1 - ASR: Speech Recognition with Native Code-Switching Support

Deepgram Nova 3, self-hosted

The first thing a voice AI touches is audio. Before any LLM sees the conversation, before any language detection runs, the ASR model has to produce an accurate transcript across both English and Spanish - including mixed sentences.

We use Deepgram Nova 3, deployed in our own infrastructure. Nova 3 is trained on code-switched audio, which means it's seen real-world speech where languages mix within a single sentence. It doesn't force the transcript into one language. It transcribes what was actually said.

Why it matters for enterprises: A bad transcript at Layer 1 corrupts everything downstream. If your ASR can't handle "mi cuenta," no amount of post-processing will recover that word. This is the most commonly underestimated failure point in voice AI procurement - most buyers evaluate LLM quality and voice naturalness, not ASR accuracy on mixed-language audio.

Business impact: transcript accuracy → fewer failed calls
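
One way to put a number on transcript accuracy for mixed-language audio is word error rate (WER): the word-level edit distance between a reference transcript and the ASR output, divided by the reference length. The sketch below is a minimal, generic WER calculation, not NuPlay's or Deepgram's evaluation code. The function name and the sample transcripts are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the Spanish word is replaced by a similar-sounding English one.
reference = "I need help with my cuenta"
print(word_error_rate(reference, "I need help with my cuenta"))   # no errors
print(word_error_rate(reference, "I need help with my counter"))  # 1 substitution over 6 words
```

A single wrong word in a six-word turn looks small as a percentage, but if that word is the one carrying the caller's intent, the turn is effectively lost. That is why WER is worth measuring on a code-switched test set specifically, not only on monolingual audio.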

Layer 2 - Language Identification Model: Sentence-Level, Not Word-Level

Text-level LID on top of the transcript

Once the transcript is produced, the system needs to identify which language the caller is operating in. This runs on the text of the transcript - not on individual words, but on sentence structure and grammatical patterns.

Sentence-level detection is the right granularity here. Word-level detection is too noisy - shared words between English and Spanish create constant false positives. Utterance-level detection is too coarse - it can't handle mid-sentence switches. Sentence-level sits in between: it's sensitive enough to catch real switches, stable enough to ignore fillers and cognates.

Why it matters for enterprises: This is the layer that determines whether a language switch is real or noise. Getting this wrong means the agent either ignores real switches (caller feels unheard) or over-corrects on false ones (agent keeps flipping languages, call sounds broken).

Business impact: accurate detection → appropriate agent response
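
To make the granularity argument concrete, here is a toy sketch of sentence-level labelling on a transcript. It is not NuPlay's LID model: the function-word lists, the function names, and the assumption that the transcript arrives with sentence punctuation are all illustrative. A production system would use a trained classifier in place of hand-written word lists.

```python
import re

# Toy function-word lists, assumed for illustration only.
EN_FUNCTION_WORDS = {"i", "the", "my", "with", "is", "to", "need", "want", "and"}
ES_FUNCTION_WORDS = {"yo", "el", "la", "mi", "con", "es", "que", "de",
                     "no", "puedo", "necesito", "quiero"}


def split_sentences(transcript: str) -> list[str]:
    """Split on sentence-final punctuation and drop empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]+", transcript) if s.strip()]


def label_sentence(sentence: str) -> str:
    """Label one sentence by which language's function words dominate."""
    words = re.findall(r"[a-záéíóúñü']+", sentence.lower())
    en_hits = sum(w in EN_FUNCTION_WORDS for w in words)
    es_hits = sum(w in ES_FUNCTION_WORDS for w in words)
    if en_hits == es_hits:
        return "unknown"  # fillers and ties carry no language signal
    return "en" if en_hits > es_hits else "es"


def label_transcript(transcript: str) -> list[tuple[str, str]]:
    """Return (sentence, label) pairs for every sentence in the transcript."""
    return [(s, label_sentence(s)) for s in split_sentences(transcript)]


for sentence, label in label_transcript(
    "I need help with my cuenta. Okay. No puedo entrar a mi cuenta."
):
    print(f"{label:8} {sentence}")
```

In this sketch, the borrowed word "cuenta" does not flip the first sentence away from English, the lone "Okay" gets no label at all, and the full Spanish sentence is labelled Spanish. That is the behaviour described above: stable on borrowed words and fillers, sensitive to a real switch.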

Layer 3 - Confidence Scoring: Don't Guess When You're Not Sure

Turn-level confidence threshold before acting

Language identification isn't always certain. Short turns, heavy background noise, or highly mixed input can produce low-confidence detections. At Layer 3, the system assigns a confidence score to each detected language. If that score falls below threshold, the system doesn't proceed with a potentially wrong assumption.

Instead, it can do one of two things: ask for clarification ("Did you want to continue in Spanish?") or route the call for human review. The key point is it doesn't guess. A wrong-language response delivered with confidence is worse than a short clarifying question.

Why it matters for enterprises: Confidence scoring is the difference between a system that handles ambiguity gracefully and one that bulldozes through it and creates a bad experience. In regulated industries - banking, insurance, healthcare - the stakes of a wrong-language interaction are especially high. Mishandled bilingual calls contribute directly to complaints, repeat call rates, and compliance exposure.

Business impact: fewer bad responses → lower repeat call rate
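
A threshold gate of this kind can be sketched in a few lines. Everything specific here is an assumption made for illustration: the 0.8 threshold, the rule of asking for clarification once before routing to a human, and all names are hypothetical rather than NuPlay's actual configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"            # confident: respond in the detected language
    CLARIFY = "clarify"            # unsure: ask the caller which language they want
    HUMAN_REVIEW = "human_review"  # still unsure: route the call to a person


@dataclass
class Detection:
    language: str
    confidence: float  # 0.0 to 1.0, as reported by the language ID step


CONFIDENCE_THRESHOLD = 0.8  # hypothetical value; tune on real call data
MAX_CLARIFICATIONS = 1      # hypothetical: ask once, then escalate


def decide(detection: Detection, clarifications_asked: int = 0) -> Action:
    """Pick the next step for a turn without guessing on low confidence."""
    if detection.confidence >= CONFIDENCE_THRESHOLD:
        return Action.PROCEED
    if clarifications_asked < MAX_CLARIFICATIONS:
        return Action.CLARIFY
    return Action.HUMAN_REVIEW


print(decide(Detection("es", 0.93)))                          # Action.PROCEED
print(decide(Detection("es", 0.55)))                          # Action.CLARIFY
print(decide(Detection("es", 0.55), clarifications_asked=1))  # Action.HUMAN_REVIEW
```

The design point is that the low-confidence branch never returns a language. The system either acts on a detection it trusts or does something other than respond.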

Layer 4 - Language Mirroring: The Agent Responds in the Right Language

LLM responds in detected language; switches only on full-sentence commitment

This is the layer callers actually experience. Language Mirroring is the logic that governs how the AI agent responds to a language switch. The rule is precise: a full sentence in a new language triggers a switch. A single filler word - "okay," "sí," "sure" - does not.

This matters because bilingual callers don't always commit to a switch. They test the water with a word. If the agent flips language on every test word, the conversation becomes incoherent. Language Mirroring waits for commitment before following.

Primary and secondary language are configured at the agent level in NuPlay's builder - enterprises can set English as primary with Spanish as secondary, or configure agents for specific populations where Spanish is the primary language and English is secondary.

Why it matters for enterprises: This is where the customer experience lives. An agent that mirrors language correctly feels natural - callers don't have to think about what language they're speaking in. That reduction in cognitive load shortens handle time, reduces frustration, and improves resolution rates. One of the active deals that surfaced this requirement had callers who would start a call in English, switch to Spanish when emotional, and then switch back. The agent needs to track all three states without breaking flow.

Business impact: natural conversation → lower handle time, higher CSAT
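
The switching rule can be sketched as a small piece of state that only changes on commitment. This is an illustration, not NuPlay's implementation: the class name, the filler list, and the use of "at least three non-filler words" as a stand-in for "a full sentence" are assumptions. The detected language for each turn is assumed to come from the language identification step.

```python
FILLERS = {"okay", "ok", "sí", "si", "sure", "bueno"}  # assumed filler list
MIN_SENTENCE_WORDS = 3  # hypothetical proxy for "a full sentence"


class LanguageMirror:
    """Tracks the response language; follows the caller only on commitment."""

    def __init__(self, primary: str = "en", secondary: str = "es") -> None:
        self.supported = {primary, secondary}
        self.current = primary  # calls start in the agent's primary language

    def observe(self, text: str, detected: str) -> str:
        """Update state from one caller turn and return the response language."""
        words = [w.strip(".,!?¿¡").lower() for w in text.split()]
        content_words = [w for w in words if w and w not in FILLERS]
        committed = len(content_words) >= MIN_SENTENCE_WORDS
        if committed and detected in self.supported and detected != self.current:
            self.current = detected
        return self.current


mirror = LanguageMirror(primary="en", secondary="es")
call = [
    ("I need help with my account", "en"),
    ("sí", "es"),                           # filler: no switch
    ("No puedo entrar a mi cuenta", "es"),  # full sentence: switch to Spanish
    ("okay", "en"),                         # filler: stay in Spanish
    ("Can you reset my password", "en"),    # full sentence: switch back
]
results = [mirror.observe(text, detected) for text, detected in call]
print(results)
```

The sample call follows the pattern from the deal described above: English, then Spanish, then English again, with filler words in between that leave the state untouched.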

Layer 5 - Post-Call Analysis: Post-Call Language Switch Quality Scoring

Automated QA across 12 conversational metrics

The first four layers handle a call in real time. Layer 5 looks back at what happened. NuPlay's post-call evaluation system runs automatically on every call and scores a set of conversational quality metrics. Language Switch Quality is one of them.

Specifically, it scores whether the agent correctly tracked language across the conversation - did it switch when it should have, stay put when it shouldn't have, handle low-confidence turns correctly? These scores feed back into agent improvement cycles, allowing teams to identify systematic failures and fix them before they compound at scale.

Why it matters for enterprises: Without post-call QA on language switching, you're flying blind. You might know your containment rate. You might know your CSAT score. But you don't know how often your bilingual agent responded in the wrong language - unless you're listening to every call manually. Automated scoring at scale closes that gap. It turns a gut-feel problem ("we think the Spanish handling is fine") into a data problem ("language switch quality is 91% - here are the 9% of calls where it broke and why").

Business impact: systematic QA → measurable improvement over time
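
As a rough illustration of how such a score could be computed, the sketch below compares the language the agent used on each turn with the language it should have used. The article does not describe NuPlay's actual scoring formula, so the per-turn match rate, the `Turn` record, and the source of the "expected" labels (for example, human review or a replay of the switching rule) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int
    expected_language: str  # language the agent should have responded in
    agent_language: str     # language the agent actually responded in


def language_switch_quality(turns: list[Turn]) -> tuple[float, list[int]]:
    """Return (share of turns in the right language, indices of failed turns)."""
    if not turns:
        raise ValueError("cannot score a call with no turns")
    failures = [t.index for t in turns if t.agent_language != t.expected_language]
    score = 1 - len(failures) / len(turns)
    return score, failures


# Hypothetical call: the agent missed the switch to Spanish on turn 2.
call = [
    Turn(0, "en", "en"),
    Turn(1, "en", "en"),
    Turn(2, "es", "en"),
    Turn(3, "es", "es"),
]
score, failed_turns = language_switch_quality(call)
print(f"language switch quality: {score:.0%}, failed turns: {failed_turns}")
```

Returning the failed turn indices alongside the score is what makes the number actionable: a team can jump straight to the turns where language handling broke.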

How this compares to what everyone else is doing

Most voice AI vendors do address multilingual support in some form. The question is how, and the architectural differences have direct business consequences.

  • Translation Layer
    How it works: Translate input to English → process in English → translate response back to Spanish.
    The problem: Adds latency at every step. Loses nuance in translation. The AI is reasoning in a language it didn't hear the call in.
  • Separate Language Agents
    How it works: Two agents configured - one English, one Spanish. Routing logic decides which one handles the call.
    The problem: Falls apart on mid-sentence switches. No routing decision can happen fast enough mid-turn. Context is lost on handoff.
  • "Language Count as a Feature"
    How it works: "Supports 20 languages" - typically means the LLM can respond in 20 languages if prompted correctly.
    The problem: No code-switching support. No ASR accuracy guarantee. No QA. Works in demos, breaks on real calls.
  • NuPlay - Native 5-Layer Stack
    How it works: ASR with code-switching support → sentence-level LID → confidence scoring → Language Mirroring → post-call QA.
    Result: Handles mid-sentence switches in real time. No translation step. QA closes the loop on every call.

"Bilingual isn't a feature you toggle on. It's a five-layer infrastructure problem, and most voice AI is solving it with one layer, badly."

What this means for enterprise contact centers

If you run a contact center that serves a bilingual population, the business case for getting this right isn't abstract. Here's where it shows up on your metrics:

  • AHT: Callers who don't have to manage their own language switching, or be transferred, resolve faster. A caller who switches to Spanish and gets a Spanish response doesn't lose their train of thought.
  • Repeat Calls: Many "unresolved" calls in bilingual populations are language failures, not complexity failures. The caller understood. The agent didn't. That's a solvable problem.
  • Containment: Bilingual callers who hit a language wall escalate to human agents at higher rates. Fixing the language handling keeps more calls in the automated flow.
  • Market Reach: 41M native Spanish speakers in the US. A contact center that serves them fluently is accessible to a population that many competitors effectively wall out.

There's also a compliance angle that's rarely discussed. In regulated industries - financial services, healthcare, insurance - a caller who was mishandled due to a language failure is a liability. If your QA system can't tell you how often that happened, you don't know the exposure you're carrying.

The origin of this approach: what India taught us

Nurix built its original voice AI infrastructure for the Indian market - a market where code-switching isn't occasional, it's the default mode of communication. Hinglish (Hindi + English), Tanglish (Tamil + English), and dozens of other combinations mean that a voice AI deployed in India has to handle language mixing from day one or it doesn't work at all.

That constraint forced us to solve this problem at the infrastructure level early. The 5-layer stack wasn't designed for Spanish-English in the US - it was designed for a market where failure wasn't an option. Adapting it for English-Spanish code-switching in American contact centers is a translation of the architecture, not a rebuild of it.

That's the real differentiation. Most US-focused voice AI vendors are encountering this problem for the first time as Spanish-English demand grows in enterprise contact centers. We've been solving it for years.

What to ask your current vendor

If you're evaluating voice AI platforms for a bilingual deployment, five questions will tell you most of what you need to know:

  1. What ASR model are you using, and does it support code-switched audio? Most vendors use off-the-shelf ASR. Ask specifically whether it's been evaluated on Spanish-English mixed sentences. Ask for WER (word error rate) data on code-switched test sets.

  2. How does your language detection work - at the word level, sentence level, or utterance level? Word-level is too noisy. Utterance-level misses intra-sentence switches. Sentence-level is the right granularity. If they can't answer this question, they haven't thought about it.

  3. What happens when language confidence is low? The system should have a defined behavior for uncertainty. If the answer is "it picks the most likely language," that's a red flag.

  4. What's the switch trigger - word, sentence, or something else? A filler-word trigger means the agent will flip language constantly on false positives. Ask specifically what triggers a language change in the agent response.

  5. How do you measure language switch quality post-call? If the answer is "manually" or "we don't," you can't improve what you can't measure. Automated post-call scoring on language handling should be a standard feature for any enterprise deployment.

What is code-switching and why does it matter for voice AI?

Code-switching is when a bilingual speaker moves between two languages mid-conversation - sometimes mid-sentence. It's not a mistake or a sign of confusion. It's how bilingual people naturally communicate, especially under stress or when they're more comfortable in one language for a specific topic. For voice AI, it matters because most systems are built assuming the caller speaks one language for the entire call. The moment that assumption breaks, the call breaks with it.

Can't I just build two separate voice AI agents - one in English, one in Spanish?

You can, but it only works if your callers never switch. The moment a caller starts in English and moves to Spanish mid-call, the two-agent model has no answer. There's no handoff mechanism fast enough to work mid-sentence, and any transfer means the caller loses context and has to repeat themselves. Two agents solve the routing problem. They don't solve the language switching problem.

How is NuPlay's approach different from a translation-based system?

Translation-based systems convert the caller's Spanish input to English, process it, then translate the response back to Spanish. Every turn adds latency. The AI reasons in a language it didn't hear the call in - which means nuance, tone, and context get lost at every step. NuPlay detects and responds in the caller's language natively. There's no translation layer, no added latency, and no information lost between what the caller said and what the agent understood.

What if the caller only uses one Spanish word - does the agent switch languages immediately?

No. NuPlay's Language Mirroring layer switches only when a caller commits to a full sentence in a new language. A single Spanish word - especially a filler like "sí," "okay," or "bueno" - doesn't trigger a switch. This is intentional. Bilingual speakers borrow words constantly without intending to switch languages. The system waits for a clear commitment before following, which keeps the conversation stable and prevents the agent from flipping back and forth on noise.

How do you know if your bilingual voice AI is actually working post-deployment?

Most enterprises don't, and that's the real problem. Without automated post-call QA on language switching, the only way to know is to manually audit calls. NuPlay's post-call scoring evaluates language switch quality automatically across every call: did the agent switch when it should have, hold language when it shouldn't have switched, handle low-confidence turns correctly? That data turns a subjective question ("is our Spanish handling good?") into a measurable one you can track and improve over time.
