
Conversation

@chenghao-mou chenghao-mou commented Jan 26, 2026

  1. When using STT for EOT, we could receive duplicate end-of-speech calls and fire duplicate false-interruption timers. This PR skips duplicate calls by checking the timer's existence and the current transcript.
  2. Resume the audio twice (both before generation and when the first frame is received), in case a false interruption paused the audio during TTS generation.

This should close #4615.
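The duplicate-EOS guard described in point 1 can be sketched roughly as follows. This is a minimal standalone illustration with hypothetical class and method names, not the actual implementation in agent_activity.py:

```python
import threading
from typing import Optional


class FalseInterruptionGuard:
    """Sketch of skipping duplicate end-of-speech events (names are illustrative)."""

    def __init__(self) -> None:
        self._timer: Optional[threading.Timer] = None
        self._last_transcript: Optional[str] = None

    def on_end_of_speech(self, transcript: str, timeout: float = 2.0) -> bool:
        # Duplicate EOS: a timer already exists for the same transcript -> skip.
        if self._timer is not None and transcript == self._last_transcript:
            return False
        # Otherwise reset: cancel any pending timer and schedule a fresh one.
        if self._timer is not None:
            self._timer.cancel()
        self._last_transcript = transcript
        self._timer = threading.Timer(timeout, self._on_false_interruption)
        self._timer.start()
        return True

    def _on_false_interruption(self) -> None:
        # The real code would resume the paused agent speech here.
        self._timer = None
```

The key point is that a repeated end-of-speech event with an unchanged transcript neither cancels nor restarts the pending timer.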

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced speech interruption logic with improved tracking of speech-to-text completion state.
    • Refined timing of interrupt behavior based on voice activity detection and speech recognition modes.
    • Improved speech activity state communication for more accurate interrupt triggering.


@chenghao-mou chenghao-mou requested a review from a team January 26, 2026 12:32
coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

This PR fixes phantom false interruptions when using STT-based turn detection with endpointing by tracking STT end-of-speech state and gating interruption logic accordingly. The changes refine how speech interruption decisions are made based on STT EOS detection and adjust speaking flag semantics in transcript callbacks.

Changes

  • STT end-of-speech tracking (livekit-agents/livekit/agents/voice/agent_activity.py):
    Added a _stt_eos_received boolean to track whether STT EOS has been observed. It is reset on start_of_speech, set on end_of_speech, and used to gate interruption logic when turn_detection is "stt": interruptions are prevented after STT has signaled end-of-speech unless the accumulated silence is zero.
  • Speaking state semantics (livekit-agents/livekit/agents/voice/audio_recognition.py):
    Modified the speaking flag passed to transcript hooks (FINAL_TRANSCRIPT, PREFLIGHT_TRANSCRIPT, INTERIM_TRANSCRIPT). It is now set to self._speaking only when VAD is active or turn_detection is "stt"; otherwise None is passed. This ties the speaking state to the turn detection mode.
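Read together, the flag lifecycle and the interruption gate can be sketched as a small standalone class (hypothetical names; the real logic is spread across the two files above):

```python
class SttEosTracker:
    """Sketch of the _stt_eos_received lifecycle and the interruption gate (illustrative)."""

    def __init__(self) -> None:
        self._stt_eos_received = False

    def on_start_of_speech(self) -> None:
        # New user turn: forget any previous STT end-of-speech signal.
        self._stt_eos_received = False

    def on_stt_end_of_speech(self) -> None:
        self._stt_eos_received = True

    def may_interrupt(self, turn_detection: str, raw_accumulated_silence: float) -> bool:
        # Outside STT-based turn detection, VAD interruption is unrestricted.
        if turn_detection != "stt":
            return True
        # After STT EOS, only interrupt while VAD speech is truly ongoing,
        # i.e., no accumulated endpointing silence yet.
        return not self._stt_eos_received or raw_accumulated_silence == 0
```

This matches the behavior described above: before STT EOS, VAD can interrupt freely (barge-in still works); afterwards, interruption only fires if VAD speech is genuinely still in progress.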

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #4396: Modifies STT end-of-speech handling and speaking/timestamp semantics in voice STT paths with overlapping concerns around audio_recognition and agent_activity interaction.
  • #4536: Refactors interruption logic in agent_activity.py to prevent race conditions, working in the same domain as this PR's STT-EOS-aware gating.

Suggested reviewers

  • longcw
  • davidzhao

Poem

🐰 A phantom interruption haunted the hall,
With false resumptions destroying it all,
But we tracked when the STT said "done,"
And gated the interrupts—now speech flows as one!
No more phantom pauses, the silence is gone! ✨

🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check ✅: check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅: the title 'prevent duplicate false interruption due to late end of speech' directly describes the primary objective of preventing duplicate false-interruption timers and events caused by late STT end-of-speech handling.
  • Linked Issues check ✅: the implementation addresses issue #4615 by introducing STT EOS state tracking to prevent duplicate false-interruption events and ensuring audio resume logic is properly sequenced to avoid delayed playback.
  • Out of Scope Changes check ✅: all changes are scoped to the false-interruption issue: STT EOS state tracking in agent_activity.py and the speaking flag conditional logic in audio_recognition.py.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd393a2 and d7b725d.

📒 Files selected for processing (1)
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
🧠 Learnings (1)
📚 Learning: 2026-01-22T03:28:16.289Z
Learnt from: longcw
Repo: livekit/agents PR: 4563
File: livekit-agents/livekit/agents/beta/tools/end_call.py:65-65
Timestamp: 2026-01-22T03:28:16.289Z
Learning: In code paths that check capabilities or behavior of the LLM processing the current interaction, prefer using the activity's LLM obtained via ctx.session.current_agent._get_activity_or_raise().llm instead of ctx.session.llm. The session-level LLM may be a fallback and not reflect the actual agent handling the interaction. Use the activity LLM to determine capabilities and to make capability checks or feature toggles relevant to the current processing agent.

Applied to files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
🧬 Code graph analysis (1)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • options (398-399)
🔇 Additional comments (2)
livekit-agents/livekit/agents/voice/agent_activity.py (2)

121-121: STT EOS lifecycle tracking looks solid.
Resetting on speech start and setting on STT-driven end-of-speech keeps the flag scoped to the current turn and avoids stale state.

Also applies to: 1221-1221, 1232-1234


1253-1263: STT-aware VAD interruption gating is well-targeted.
The added conditions should prevent duplicate false-interruption timers after STT EOS while still interrupting on active speech.




coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/voice/agent_activity.py`:
- Around line 1236-1257: The resume-timer gating logic incorrectly skips
scheduling when min_interruption_words == 0; update the inner condition inside
the big if so that an existing audio recognition still triggers the "transcript
not long enough" branch when min_interruption_words <= 0. Concretely, in the
block referencing self._paused_speech, self._false_interruption_timer,
self._audio_recognition, and self._session.options.min_interruption_words,
replace the sub-condition (self._session.options.min_interruption_words > 0 and
len(split_words(...)) < self._session.options.min_interruption_words) with a
check that treats <= 0 as "no minimum" (e.g.,
self._session.options.min_interruption_words <= 0 or len(split_words(...)) <
self._session.options.min_interruption_words) so the resume timer will be
scheduled correctly.
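The suggested change boils down to treating a non-positive min_interruption_words as "no minimum". A hedged sketch of the before/after predicate, with str.split standing in for the actual split_words helper:

```python
def too_short_before(transcript: str, min_interruption_words: int) -> bool:
    # Buggy form: with min_interruption_words == 0 this is always False, so
    # the "transcript not long enough" branch (and thus the resume timer)
    # is never taken.
    return min_interruption_words > 0 and len(transcript.split()) < min_interruption_words


def too_short_after(transcript: str, min_interruption_words: int) -> bool:
    # Fixed form: <= 0 means "no minimum", so the branch still fires and the
    # resume timer is scheduled correctly.
    return min_interruption_words <= 0 or len(transcript.split()) < min_interruption_words
```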
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9586b8f and 8a33afc.

📒 Files selected for processing (3)
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/voice/audio_recognition.py
  • livekit-agents/livekit/agents/voice/generation.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md): same **/*.py guidelines as above (ruff formatting and linting, mypy strict, 100-character lines, Python 3.9+ compatibility, Google-style docstrings).

Files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/voice/generation.py
  • livekit-agents/livekit/agents/voice/audio_recognition.py
🧠 Learnings (1): the same longcw learning from PR 4563 as above, applied to agent_activity.py, generation.py, and audio_recognition.py.
🧬 Code graph analysis (2)
livekit-agents/livekit/agents/voice/generation.py (8)
livekit-agents/livekit/agents/voice/transcription/synchronizer.py (3)
  • audio_output (430-431)
  • resume (236-244)
  • resume (593-595)
livekit-agents/livekit/agents/voice/room_io/room_io.py (1)
  • audio_output (241-245)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • resume (640-651)
livekit-agents/livekit/agents/voice/avatar/_datastream_io.py (1)
  • resume (166-167)
livekit-agents/livekit/agents/cli/cli.py (1)
  • resume (207-212)
livekit-agents/livekit/agents/voice/room_io/_output.py (1)
  • resume (134-137)
livekit-agents/livekit/agents/voice/io.py (1)
  • resume (278-281)
livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1)
  • resume (365-372)
livekit-agents/livekit/agents/voice/audio_recognition.py (1)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • on_interim_transcript (1279-1305)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: livekit-plugins-openai
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (4)
livekit-agents/livekit/agents/voice/generation.py (1)

365-380: Good safeguard for paused-audio edge case.

The first-frame resume keeps audio output active even if it was paused during TTS generation.

livekit-agents/livekit/agents/voice/audio_recognition.py (3)

358-363: Speaking flag gating looks correct for FINAL_TRANSCRIPT.

Passing None when speaking state isn’t reliable avoids misleading hooks.


405-412: Speaking flag gating looks correct for PREFLIGHT_TRANSCRIPT.

This keeps speaking state consistent with available signal sources.


449-455: Speaking flag gating looks correct for INTERIM_TRANSCRIPT.

Good alignment with VAD/STT-driven speaking state.


# sending the end of speech event, we need to check if:
# 1. The resume timer has not been scheduled yet.
# 2. The transcript is not long enough for interruption.
self._false_interruption_timer is None
Contributor commented:

self._start_false_interruption_timer(timeout) will cancel the existing timer, and since these are synchronous methods, there should only ever be one timer at a time.

If the EOU is fired right after the final transcript, before the false-interruption timeout, it shouldn't call _on_false_interruption twice.

longcw (Contributor) commented Jan 27, 2026:

And here is the timer that resumes the speech; the speech pause/interruption happens in _interrupt_by_audio_activity. I added a comment here: #4615 (comment)

chenghao-mou (Member, Author) commented:

The intention is to make sure we don't cancel and start a new one if it is a duplicate end of speech event.

Though I am not sure disabling VAD is the solution here because we also need VAD for stuff like barge-in.

chenghao-mou (Member, Author) commented:

What I got from your comment is that VAD already pauses the speech before the VAD EOS, so what we really need is to skip the pause in _interrupt_by_audio_activity when STT is used for turn detection, right?

longcw (Contributor) commented Jan 27, 2026:

The issue is that VAD events are not synced with the STT EOS event, so when the speech is committed by STT, VAD may still think the user is speaking and interrupt the agent. So we can disable VAD for interruption when the turn_detection mode is stt, if you want VAD to always stay enabled.

The intention is to make sure we don't cancel and start a new one if it is a duplicate end of speech event.

I think cancelling the old timer and starting a new one is the right behavior here: we need to reset the timer whenever there is voice activity, so the timer only starts after the user's speech is done. When the turn_detection mode is stt, the STT EOS should be considered a voice activity too.

chenghao-mou (Member, Author) commented:

There should be a more compatible alternative. Otherwise, we are essentially disabling false interruptions and barge-in when turn_detection is stt. Barge-in uses VAD for both speaking status tracking and interruption (via _interrupt_by_audio_activity).

chenghao-mou (Member, Author) commented:

Updated the code so we still allow VAD interruption under certain conditions when turn_detection is stt:

  1. we haven't received any EOS event from STT; or
  2. VAD speech (excluding the endpointing silence) is still ongoing.

# during TTS generation (e.g., due to false interruption detection)
if not first_frame_captured:
first_frame_captured = True
audio_output.resume()
Contributor commented:

Why is this needed? Resuming here will bypass the pause if the agent speech is interrupted before TTS generation starts. If the audio input is still active, it may pause/interrupt again very soon after a few frames.

chenghao-mou (Member, Author) commented:

Yeah, I don't think this is needed after the new changes. Previously, the timer was not properly cancelled, so the audio output seemed stuck in the paused state.

# before VAD end of speech event, we only interrupt if
# 1. STT EOS hasn't been received yet; or
# 2. VAD real EOS is not yet triggered (i.e. VAD speech is still ongoing)
if not self._stt_eos_received or ev.raw_accumulated_silence == 0:
Contributor commented:

I think the first condition makes sense, but the second may still cause the issue where VAD interrupts STT-committed speech?

chenghao-mou (Member, Author) commented:

From my tests, it is often during the VAD endpointing silence that the interruption fires after STT EOS. I think it should be okay to interrupt in this case.

chenghao-mou and others added 2 commits January 27, 2026 13:33
Co-authored-by: Long Chen <longch1024@gmail.com>
longcw (Contributor) left a comment:

lgtm!

@chenghao-mou chenghao-mou merged commit 9b629dd into main Jan 27, 2026
19 checks passed
@chenghao-mou chenghao-mou deleted the fix/deepgram-false-interruptions branch January 27, 2026 16:47


Development

Successfully merging this pull request may close these issues.

Phantom resumed false interrupted speech activity that severely delays speech playback when using a STT model with endpointing
