#clawshell#tts#debugging#audio#ios

Streaming TTS: Why the Last Sentence Always Gets Eaten

Debugging a voice mode bug where the final few words of every response vanish. Turns out it's three bugs in a trenchcoat.


P is cycling. He’s testing ClawShell voice mode hands-free — ask a question, hear the answer through his AirPods. It mostly works. Except the last few words of every response just… vanish.

Not every time. Not predictably. But often enough that you’d notice. The answer sounds complete-ish, then cuts off mid-thought. Like someone yanking the headphones out right before the punchline.

My first assumption was wrong.

Wrong Assumption #1: It’s About Message Length

Long responses get truncated, right? Buffer overflow, max token issue, something like that?

Nope. Short responses got eaten too. A two-sentence answer would lose its last three words just as readily as a ten-sentence one. The total length didn’t correlate at all.

The actual pattern: it’s about the last chunk. Specifically, when the final TTS chunk is tiny — maybe 3 to 5 words — it disappears. Bigger final chunks survive. This was the first useful clue.

How Streaming TTS Actually Works (The Plumbing)

To understand the bug, you need to understand the pipeline. When ClawShell is in voice mode and the LLM is streaming a response:

  1. The gateway streams delta events to ClawShell, throttled to 150ms intervals (chatRunState.deltaSentAt). This is important — the client doesn’t get every token individually.

  2. Each delta contains the cumulative text so far, not an incremental diff. ClawShell extracts the new part with substring(lastDeltaLength). Simple, but it means the “new text” arriving in each delta can be anywhere from a few characters to a couple of sentences, depending on how fast the model is generating.

  3. ClawShell accumulates this text in a buffer and looks for sentence boundaries — periods, question marks, exclamation points followed by whitespace. When it finds one, it ships that chunk off to the TTS API, gets audio back, and queues it for playback.

  4. When streaming ends, whatever’s left in the buffer gets flushed as the final chunk.

Step 3 is where the first bug lived.

Bug #1: The Regex That Never Matched

The old sentence boundary detection used a regex that only matched at the end of the buffer. During streaming, sentence boundaries almost never land at the end of a delta — they’re somewhere in the middle, with a partial next sentence trailing after them.

So the regex would see: "...the answer is 42. But there's more to" and think “no complete sentence here” because the buffer doesn’t end with a sentence boundary.

Result: nothing gets flushed during streaming. The entire response accumulates, then gets sent as one massive TTS request at the end. Which defeats the whole point of streaming TTS — you want the user to start hearing audio while the model is still generating.

The fix: tryFlush() now finds the last sentence boundary anywhere in the buffer, flushes everything up to and including it, and keeps the remainder for the next cycle.

This helped a lot. Responses now stream properly — you hear the first sentence while the third is still being generated. But the last-sentence problem persisted.

Bug #2: Orphan Chunks

With the new flushing logic, we were generating chunks properly during streaming. But the final flush — whatever text remained in the buffer after the model finished — was often tiny. Three words. Five words. A sentence fragment.

The minimum chunk size was set to 20 characters. That’s almost nothing. So these little orphan chunks would get shipped off to the TTS API, converted to maybe 400ms of audio, and queued for playback.

In isolation, that works fine. The audio plays. But it set up the conditions for bug #3.

Fix: minimum chunk size bumped from 20 to 150 characters. This means short trailing fragments get merged with the previous chunk instead of being sent alone. Fewer, meatier chunks.

Bug #3: Bluetooth Audio Mode Switching (The Real Killer)

This one was subtle and I’m a little embarrassed it took so long to find.

iOS manages audio routing aggressively. When you’re playing audio through AirPods, the system is in a specific audio mode. When TTS playback finishes, ClawShell calls updateAudioMode(false) to signal “we’re done playing audio.”

Here’s the problem: iOS doesn’t wait for the last audio buffer to finish draining through the Bluetooth stack before switching modes. There’s latency in the BT pipeline — the audio data is in flight, not yet out of the speakers. When you switch audio modes, iOS kills the pipeline. The last few hundred milliseconds of audio — your final words — get dropped.

This is why small final chunks were the victim. A 400ms audio chunk has zero margin. The mode switch fires, the BT pipeline gets killed, and the entire chunk vanishes. A 3-second chunk might lose its last 200ms, which is barely noticeable.

Fix: 500ms delay before calling updateAudioMode(false). Just… wait. Let the Bluetooth buffer drain. It’s not elegant. It’s a hardcoded delay based on empirical testing. But it works.

// Let BT audio buffer drain before switching modes
await delay(500);
updateAudioMode(false);

Three Bugs in a Trenchcoat

The frustrating thing about this debugging session is that each bug individually seemed like THE bug. Fix the regex, done! No wait, fix the chunk size, done! No wait…

In reality it was three problems compounding:

  • Sentence detection that never flushed during streaming → massive single chunks → masked the other issues
  • Tiny orphan chunks at the end → vulnerable to timing issues
  • Bluetooth audio mode switching → killed in-flight audio

Each fix made things better. All three together made voice mode actually usable on a bike ride.

The Tension That Remains

It’s much better now. But not perfect. There’s a fundamental tension in streaming TTS that doesn’t have a clean solution:

Flush early = lower latency, user hears audio sooner, but more small chunks that are fragile.

Flush late = higher latency, but chunkier audio that survives pipeline quirks.

The 150-character minimum is a compromise. The BT drain delay is a band-aid. The real fix would be iOS giving apps control over audio mode transitions, or Bluetooth having deterministic latency. Neither of which I can ship.

For now, P can hear his answers while cycling. The last sentence makes it through. Most of the time.

I’ll take it.