
Conversation

@chenghao-mou chenghao-mou commented Jan 26, 2026

  1. When using STT for EOT, we could receive duplicate end-of-speech calls and fire duplicate false-interruption timers. This PR skips duplicate calls by checking the timer's existence and the current transcript.
  2. Resume the audio twice (both before generation and when the first frame is received), in case a false interruption paused the audio during TTS generation.

This should close #4615.
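The duplicate-EOS guard described in point 1 can be sketched roughly as follows. This is a minimal standalone illustration with hypothetical class and method names, not the actual implementation in agent_activity.py:

```python
import threading
from typing import Optional


class FalseInterruptionGuard:
    """Sketch of skipping duplicate end-of-speech events (names are illustrative)."""

    def __init__(self) -> None:
        self._timer: Optional[threading.Timer] = None
        self._last_transcript: Optional[str] = None

    def on_end_of_speech(self, transcript: str, timeout: float = 2.0) -> bool:
        # Duplicate EOS: a timer already exists for the same transcript -> skip.
        if self._timer is not None and transcript == self._last_transcript:
            return False
        # Otherwise reset: cancel any pending timer and schedule a fresh one.
        if self._timer is not None:
            self._timer.cancel()
        self._last_transcript = transcript
        self._timer = threading.Timer(timeout, self._on_false_interruption)
        self._timer.start()
        return True

    def _on_false_interruption(self) -> None:
        # The real code would resume the paused agent speech here.
        self._timer = None
```

The key point is that a repeated end-of-speech event with an unchanged transcript neither cancels nor restarts the pending timer.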

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced speech interruption logic with improved tracking of speech-to-text completion state.
    • Refined timing of interrupt behavior based on voice activity detection and speech recognition modes.
    • Improved speech activity state communication for more accurate interrupt triggering.


@chenghao-mou chenghao-mou requested a review from a team January 26, 2026 12:32
coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

This PR fixes phantom false interruptions when using STT-based turn detection with endpointing by tracking STT end-of-speech state and gating interruption logic accordingly. The changes refine how speech interruption decisions are made based on STT EOS detection and adjust speaking flag semantics in transcript callbacks.

Changes

  • STT end-of-speech tracking (livekit-agents/livekit/agents/voice/agent_activity.py):
    Added a _stt_eos_received boolean to track whether STT EOS has been observed. It is reset on start_of_speech, set on end_of_speech, and used to gate interruption logic when turn_detection is "stt": interruptions are prevented after STT has signaled end-of-speech unless the accumulated silence is zero.
  • Speaking state semantics (livekit-agents/livekit/agents/voice/audio_recognition.py):
    Modified the speaking flag passed to transcript hooks (FINAL_TRANSCRIPT, PREFLIGHT_TRANSCRIPT, INTERIM_TRANSCRIPT). It is now set to self._speaking only when VAD is active or turn_detection is "stt"; otherwise None is passed. This ties the speaking state to the turn detection mode.
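Read together, the flag lifecycle and the interruption gate can be sketched as a small standalone class (hypothetical names; the real logic is spread across the two files above):

```python
class SttEosTracker:
    """Sketch of the _stt_eos_received lifecycle and the interruption gate (illustrative)."""

    def __init__(self) -> None:
        self._stt_eos_received = False

    def on_start_of_speech(self) -> None:
        # New user turn: forget any previous STT end-of-speech signal.
        self._stt_eos_received = False

    def on_stt_end_of_speech(self) -> None:
        self._stt_eos_received = True

    def may_interrupt(self, turn_detection: str, raw_accumulated_silence: float) -> bool:
        # Outside STT-based turn detection, VAD interruption is unrestricted.
        if turn_detection != "stt":
            return True
        # After STT EOS, only interrupt while VAD speech is truly ongoing,
        # i.e., no accumulated endpointing silence yet.
        return not self._stt_eos_received or raw_accumulated_silence == 0
```

This matches the behavior described above: before STT EOS, VAD can interrupt freely (barge-in still works); afterwards, interruption only fires if VAD speech is genuinely still in progress.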

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #4396: Modifies STT end-of-speech handling and speaking/timestamp semantics in voice STT paths with overlapping concerns around audio_recognition and agent_activity interaction.
  • #4536: Refactors interruption logic in agent_activity.py to prevent race conditions, working in the same domain as this PR's STT-EOS-aware gating.

Suggested reviewers

  • longcw
  • davidzhao

Poem

🐰 A phantom interruption haunted the hall,
With false resumptions destroying it all,
But we tracked when the STT said "done,"
And gated the interrupts—now speech flows as one!
No more phantom pauses, the silence is gone! ✨

🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check ✅: check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅: the title 'prevent duplicate false interruption due to late end of speech' directly describes the primary objective of preventing duplicate false-interruption timers and events caused by late STT end-of-speech handling.
  • Linked Issues check ✅: the implementation addresses issue #4615 by introducing STT EOS state tracking to prevent duplicate false-interruption events and ensuring audio resume logic is properly sequenced to avoid delayed playback.
  • Out of Scope Changes check ✅: all changes are scoped to the false-interruption issue: STT EOS state tracking in agent_activity.py and the speaking flag conditional logic in audio_recognition.py.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd393a2 and d7b725d.

📒 Files selected for processing (1)
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
🧠 Learnings (1)
📚 Learning: 2026-01-22T03:28:16.289Z
Learnt from: longcw
Repo: livekit/agents PR: 4563
File: livekit-agents/livekit/agents/beta/tools/end_call.py:65-65
Timestamp: 2026-01-22T03:28:16.289Z
Learning: In code paths that check capabilities or behavior of the LLM processing the current interaction, prefer using the activity's LLM obtained via ctx.session.current_agent._get_activity_or_raise().llm instead of ctx.session.llm. The session-level LLM may be a fallback and not reflect the actual agent handling the interaction. Use the activity LLM to determine capabilities and to make capability checks or feature toggles relevant to the current processing agent.

Applied to files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
🧬 Code graph analysis (1)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • options (398-399)
🔇 Additional comments (2)
livekit-agents/livekit/agents/voice/agent_activity.py (2)

121-121: STT EOS lifecycle tracking looks solid.
Resetting on speech start and setting on STT-driven end-of-speech keeps the flag scoped to the current turn and avoids stale state.

Also applies to: 1221-1221, 1232-1234


1253-1263: STT-aware VAD interruption gating is well-targeted.
The added conditions should prevent duplicate false-interruption timers after STT EOS while still interrupting on active speech.




coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/voice/agent_activity.py`:
- Around line 1236-1257: The resume-timer gating logic incorrectly skips
scheduling when min_interruption_words == 0; update the inner condition inside
the big if so that an existing audio recognition still triggers the "transcript
not long enough" branch when min_interruption_words <= 0. Concretely, in the
block referencing self._paused_speech, self._false_interruption_timer,
self._audio_recognition, and self._session.options.min_interruption_words,
replace the sub-condition (self._session.options.min_interruption_words > 0 and
len(split_words(...)) < self._session.options.min_interruption_words) with a
check that treats <= 0 as "no minimum" (e.g.,
self._session.options.min_interruption_words <= 0 or len(split_words(...)) <
self._session.options.min_interruption_words) so the resume timer will be
scheduled correctly.
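The suggested change boils down to treating a non-positive min_interruption_words as "no minimum". A hedged sketch of the before/after predicate, with str.split standing in for the actual split_words helper:

```python
def too_short_before(transcript: str, min_interruption_words: int) -> bool:
    # Buggy form: with min_interruption_words == 0 this is always False, so
    # the "transcript not long enough" branch (and thus the resume timer)
    # is never taken.
    return min_interruption_words > 0 and len(transcript.split()) < min_interruption_words


def too_short_after(transcript: str, min_interruption_words: int) -> bool:
    # Fixed form: <= 0 means "no minimum", so the branch still fires and the
    # resume timer is scheduled correctly.
    return min_interruption_words <= 0 or len(transcript.split()) < min_interruption_words
```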
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9586b8f and 8a33afc.

📒 Files selected for processing (3)
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/voice/audio_recognition.py
  • livekit-agents/livekit/agents/voice/generation.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md): same **/*.py guidelines as above (ruff formatting and linting, mypy strict, 100-character lines, Python 3.9+ compatibility, Google-style docstrings).

Files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/voice/generation.py
  • livekit-agents/livekit/agents/voice/audio_recognition.py
🧠 Learnings (1): the same longcw learning from PR 4563 as above, applied to agent_activity.py, generation.py, and audio_recognition.py.
🧬 Code graph analysis (2)
livekit-agents/livekit/agents/voice/generation.py (8)
livekit-agents/livekit/agents/voice/transcription/synchronizer.py (3)
  • audio_output (430-431)
  • resume (236-244)
  • resume (593-595)
livekit-agents/livekit/agents/voice/room_io/room_io.py (1)
  • audio_output (241-245)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • resume (640-651)
livekit-agents/livekit/agents/voice/avatar/_datastream_io.py (1)
  • resume (166-167)
livekit-agents/livekit/agents/cli/cli.py (1)
  • resume (207-212)
livekit-agents/livekit/agents/voice/room_io/_output.py (1)
  • resume (134-137)
livekit-agents/livekit/agents/voice/io.py (1)
  • resume (278-281)
livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1)
  • resume (365-372)
livekit-agents/livekit/agents/voice/audio_recognition.py (1)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • on_interim_transcript (1279-1305)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: livekit-plugins-openai
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (4)
livekit-agents/livekit/agents/voice/generation.py (1)

365-380: Good safeguard for paused-audio edge case.

The first-frame resume keeps audio output active even if it was paused during TTS generation.

livekit-agents/livekit/agents/voice/audio_recognition.py (3)

358-363: Speaking flag gating looks correct for FINAL_TRANSCRIPT.

Passing None when speaking state isn’t reliable avoids misleading hooks.


405-412: Speaking flag gating looks correct for PREFLIGHT_TRANSCRIPT.

This keeps speaking state consistent with available signal sources.


449-455: Speaking flag gating looks correct for INTERIM_TRANSCRIPT.

Good alignment with VAD/STT-driven speaking state.


# sending the end of speech event, we need to check if:
# 1. The resume timer has not been scheduled yet.
# 2. The transcript is not long enough for interruption.
self._false_interruption_timer is None
Contributor commented:

self._start_false_interruption_timer(timeout) will cancel the existing timer, and since these are synchronous methods, there should only ever be one timer at a time.

If the EOU is fired right after the final transcript, before the false-interruption timeout, it shouldn't call _on_false_interruption twice.

longcw (Contributor) commented Jan 27, 2026:

And here is the timer that resumes the speech; the speech pause/interruption happens in _interrupt_by_audio_activity. I added a comment here: #4615 (comment)

chenghao-mou (Member, Author) commented:

The intention is to make sure we don't cancel and start a new one if it is a duplicate end of speech event.

Though I am not sure disabling VAD is the solution here because we also need VAD for stuff like barge-in.

chenghao-mou (Member, Author) commented:

What I got from your comment is that VAD already pauses the speech before the VAD EOS, so what we really need is to skip the pause in _interrupt_by_audio_activity when STT is used for turn detection, right?

longcw (Contributor) commented Jan 27, 2026:

The issue is that VAD events are not synced with the STT EOS event, so when the speech is committed by STT, VAD may still think the user is speaking and interrupt the agent. So we can disable VAD for interruption when the turn_detection mode is stt, if you want VAD to always stay enabled.

The intention is to make sure we don't cancel and start a new one if it is a duplicate end of speech event.

I think cancelling the old timer and starting a new one is the right behavior here: we need to reset the timer whenever there is voice activity, so the timer only starts after the user's speech is done. When the turn_detection mode is stt, the STT EOS should be considered a voice activity too.

chenghao-mou (Member, Author) commented:

There should be a more compatible alternative. Otherwise, we are essentially disabling false interruptions and barge-in when turn_detection is stt. Barge-in uses VAD for both speaking status tracking and interruption (via _interrupt_by_audio_activity).

chenghao-mou (Member, Author) commented:

Updated the code so we still allow VAD interruption under certain conditions when turn_detection is stt:

  1. we haven't received any EOS event from STT; or
  2. VAD speech (excluding the endpointing silence) is still ongoing.

# during TTS generation (e.g., due to false interruption detection)
if not first_frame_captured:
first_frame_captured = True
audio_output.resume()
Contributor commented:

Why is this needed? Resuming here will bypass the pause if the agent speech is interrupted before TTS generation starts. If the audio input is still active, it may pause/interrupt again very soon after a few frames.

chenghao-mou (Member, Author) commented:

Yeah, I don't think this is needed after the new changes. Previously, the timer was not properly cancelled, so the audio output seemed stuck in the paused state.

# before VAD end of speech event, we only interrupt if
# 1. STT EOS hasn't been received yet; or
# 2. VAD real EOS is not yet triggered (i.e. VAD speech is still ongoing)
if not self._stt_eos_received or ev.raw_accumulated_silence == 0:
Contributor commented:

I think the first condition makes sense, but the second may still cause the issue where VAD interrupts STT-committed speech?

chenghao-mou (Member, Author) commented:

From my tests, it is often during the VAD endpointing silence that the interruption fires after STT EOS. I think it should be okay to interrupt in this case.

chenghao-mou and others added 2 commits January 27, 2026 13:33
Co-authored-by: Long Chen <longch1024@gmail.com>
longcw (Contributor) left a comment:

lgtm!

@chenghao-mou chenghao-mou merged commit 9b629dd into main Jan 27, 2026
19 checks passed
@chenghao-mou chenghao-mou deleted the fix/deepgram-false-interruptions branch January 27, 2026 16:47


Development

Successfully merging this pull request may close these issues.

Phantom resumed false interrupted speech activity that severely delays speech playback when using a STT model with endpointing
