prevent duplicate false interruption due to late end of speech #4621
Conversation
📝 Walkthrough
This PR fixes phantom false interruptions when using STT-based turn detection with endpointing by tracking STT end-of-speech state and gating interruption logic accordingly. The changes refine how speech interruption decisions are made based on STT EOS detection and adjust speaking-flag semantics in transcript callbacks.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/voice/agent_activity.py`:
- Around lines 1236-1257: the resume-timer gating logic incorrectly skips scheduling when min_interruption_words == 0. Update the inner condition inside the big if so that an existing audio recognition still triggers the "transcript not long enough" branch when min_interruption_words <= 0. Concretely, in the block referencing self._paused_speech, self._false_interruption_timer, self._audio_recognition, and self._session.options.min_interruption_words, replace the sub-condition (self._session.options.min_interruption_words > 0 and len(split_words(...)) < self._session.options.min_interruption_words) with a check that treats <= 0 as "no minimum" (e.g., self._session.options.min_interruption_words <= 0 or len(split_words(...)) < self._session.options.min_interruption_words) so the resume timer will be scheduled correctly.
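Below is a minimal, self-contained Python sketch of the suggested condition. The transcript_too_short helper and the stand-in split_words are hypothetical; the real logic sits inside a larger if in agent_activity.py and reads these values from the session options.

```python
from typing import List


def split_words(text: str) -> List[str]:
    # stand-in for the real split_words helper used in agent_activity.py
    return text.split()


def transcript_too_short(transcript: str, min_interruption_words: int) -> bool:
    # Suggested form: treat a non-positive minimum as "no minimum", so the
    # "transcript not long enough" branch (which schedules the resume timer)
    # still runs when the option is 0.
    return (
        min_interruption_words <= 0
        or len(split_words(transcript)) < min_interruption_words
    )


print(transcript_too_short("uh", 0))                       # True (previously skipped)
print(transcript_too_short("uh huh", 3))                   # True
print(transcript_too_short("please stop talking now", 3))  # False
```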
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- livekit-agents/livekit/agents/voice/agent_activity.py
- livekit-agents/livekit/agents/voice/audio_recognition.py
- livekit-agents/livekit/agents/voice/generation.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings
Files:
- livekit-agents/livekit/agents/voice/agent_activity.py
- livekit-agents/livekit/agents/voice/generation.py
- livekit-agents/livekit/agents/voice/audio_recognition.py
🧠 Learnings (1)
📚 Learning: 2026-01-22T03:28:16.289Z
Learnt from: longcw
Repo: livekit/agents PR: 4563
File: livekit-agents/livekit/agents/beta/tools/end_call.py:65-65
Timestamp: 2026-01-22T03:28:16.289Z
Learning: In code paths that check capabilities or behavior of the LLM processing the current interaction, prefer using the activity's LLM obtained via ctx.session.current_agent._get_activity_or_raise().llm instead of ctx.session.llm. The session-level LLM may be a fallback and not reflect the actual agent handling the interaction. Use the activity LLM to determine capabilities and to make capability checks or feature toggles relevant to the current processing agent.
Applied to files:
- livekit-agents/livekit/agents/voice/agent_activity.py
- livekit-agents/livekit/agents/voice/generation.py
- livekit-agents/livekit/agents/voice/audio_recognition.py
🧬 Code graph analysis (2)
livekit-agents/livekit/agents/voice/generation.py (8)
- livekit-agents/livekit/agents/voice/transcription/synchronizer.py (3): audio_output (430-431), resume (236-244), resume (593-595)
- livekit-agents/livekit/agents/voice/room_io/room_io.py (1): audio_output (241-245)
- livekit-agents/livekit/agents/voice/agent_activity.py (1): resume (640-651)
- livekit-agents/livekit/agents/voice/avatar/_datastream_io.py (1): resume (166-167)
- livekit-agents/livekit/agents/cli/cli.py (1): resume (207-212)
- livekit-agents/livekit/agents/voice/room_io/_output.py (1): resume (134-137)
- livekit-agents/livekit/agents/voice/io.py (1): resume (278-281)
- livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1): resume (365-372)
livekit-agents/livekit/agents/voice/audio_recognition.py (1)
- livekit-agents/livekit/agents/voice/agent_activity.py (1): on_interim_transcript (1279-1305)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: livekit-plugins-openai
- GitHub Check: livekit-plugins-deepgram
- GitHub Check: unit-tests
- GitHub Check: type-check (3.13)
- GitHub Check: type-check (3.9)
🔇 Additional comments (4)
livekit-agents/livekit/agents/voice/generation.py (1)
365-380: Good safeguard for paused-audio edge case. The first-frame resume keeps audio output active even if it was paused during TTS generation.
livekit-agents/livekit/agents/voice/audio_recognition.py (3)
358-363: Speaking flag gating looks correct for FINAL_TRANSCRIPT. Passing None when speaking state isn't reliable avoids misleading hooks.
405-412: Speaking flag gating looks correct for PREFLIGHT_TRANSCRIPT. This keeps speaking state consistent with available signal sources.
449-455: Speaking flag gating looks correct for INTERIM_TRANSCRIPT. Good alignment with VAD/STT-driven speaking state.
# sending the end of speech event, we need to check if:
# 1. The resume timer has not been scheduled yet.
# 2. The transcript is not long enough for interruption.
self._false_interruption_timer is None
self._start_false_interruption_timer(timeout) will cancel the existing timer, and since these are synced methods, there should always be only one timer at a time.
If the EOU is fired right after the final transcript, before the false interruption timeout, it shouldn't call _on_false_interruption twice.
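For reference, here is a minimal asyncio sketch of that cancel-before-restart pattern; the class, method names, and timeout are illustrative assumptions, not the actual agent_activity.py implementation.

```python
import asyncio
from typing import Optional


class FalseInterruptionGuard:
    """Illustration only: keeps at most one pending false-interruption timer."""

    def __init__(self) -> None:
        self._timer: Optional[asyncio.TimerHandle] = None

    def start_timer(self, timeout: float) -> None:
        # Cancel any pending timer first, so only one timer exists at a time.
        if self._timer is not None:
            self._timer.cancel()
        loop = asyncio.get_running_loop()
        self._timer = loop.call_later(timeout, self._on_false_interruption)

    def cancel_timer(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None

    def _on_false_interruption(self) -> None:
        # Fires once per scheduled timer; a restarted timer cannot double-fire.
        self._timer = None
        print("false interruption: resume the paused agent speech")


async def main() -> None:
    guard = FalseInterruptionGuard()
    guard.start_timer(2.0)
    guard.start_timer(2.0)  # duplicate EOS: old timer is cancelled, only one fires
    await asyncio.sleep(2.5)


asyncio.run(main())
```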
And this is the timer that resumes the speech; the speech pause/interruption happens in _interrupt_by_audio_activity. I added a comment here: #4615 (comment)
The intention is to make sure we don't cancel and start a new one if it is a duplicate end of speech event.
Though I am not sure disabling VAD is the solution here because we also need VAD for stuff like barge-in.
What I got from your comment is that VAD already pauses the speech before the VAD EOS, so what we really need is to skip the pause in _interrupt_by_audio_activity when STT is used for turn detection, right?
The issue is that VAD events are not synced with the STT EOS event, so when the speech is committed by STT, VAD may still think the user is speaking and interrupt the agent. So we can disable VAD for interruption if the turn_detection mode is stt, if you want VAD to stay always enabled.
The intention is to make sure we don't cancel and start a new one if it is a duplicate end of speech event.
I think cancelling the old timer and starting a new one is the right behavior here; we need to reset the timer whenever there is voice activity, to make sure the timer only starts after the user's speech is done. When the turn_detection mode is stt, the STT EOS should be considered a voice activity too.
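A small follow-on sketch of that reset-on-voice-activity idea, reusing the hypothetical FalseInterruptionGuard from the sketch above; the event names and timeout are illustrative, not real SDK types.

```python
RESUME_TIMEOUT = 2.0  # assumed value, not a library default


def on_activity_event(
    guard: FalseInterruptionGuard, event: str, turn_detection: str
) -> None:
    # Any voice activity restarts the resume timer so it only elapses once the
    # user has actually finished speaking. With STT-based turn detection, the
    # STT end-of-speech event counts as voice activity as well.
    if event == "vad_speech" or (turn_detection == "stt" and event == "stt_end_of_speech"):
        guard.start_timer(RESUME_TIMEOUT)
```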
There should be a more compatible alternative. Otherwise, we are essentially disabling false interruptions and barge-in when turn_detection is stt. Barge-in uses VAD for both speaking status tracking and interruption (via _interrupt_by_audio_activity).
Updated the code so now we still allow VAD interruption under certain conditions when turn_detection is stt (see the sketch after this list):
- we haven't received any EOS event from STT;
- VAD speech (without the endpointing silence) is still ongoing.
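A minimal sketch of that gating under the stated conditions; stt_eos_received, raw_accumulated_silence, and the turn_detection string are taken from the discussion and the diff below, but the function and dataclass here are hypothetical, not the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class VADEvent:
    # silence accumulated without the endpointing padding; 0 means speech is ongoing
    raw_accumulated_silence: float


def should_interrupt_on_vad(
    ev: VADEvent, stt_eos_received: bool, turn_detection: str
) -> bool:
    if turn_detection != "stt":
        # outside STT-based turn detection, keep the usual VAD interruption behavior
        return True
    # with STT turn detection, interrupt only if the turn has not been committed
    # by an STT EOS yet, or VAD still sees ongoing speech
    return not stt_eos_received or ev.raw_accumulated_silence == 0


# speech still ongoing -> interruption allowed even after STT EOS
print(should_interrupt_on_vad(VADEvent(0.0), True, "stt"))  # True
# STT committed the turn and VAD is in endpointing silence -> no interruption
print(should_interrupt_on_vad(VADEvent(0.4), True, "stt"))  # False
```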
# during TTS generation (e.g., due to false interruption detection)
if not first_frame_captured:
    first_frame_captured = True
    audio_output.resume()
Why is this needed? resume here will bypass the pause if the agent speech is interrupted before the TTS generation started. If the audio input is still active, it may pause/interrupt again very soon after a few frames.
Yeah, I don't think this is needed after the new changes. Previously, the timer was not properly cancelled, and therefore the audio output seemed stuck in a paused state.
# before VAD end of speech event, we only interrupt if
# 1. STT EOS hasn't been received yet; or
# 2. VAD real EOS is not yet triggered (i.e. VAD speech is still ongoing)
if not self._stt_eos_received or ev.raw_accumulated_silence == 0:
I think the first condition makes sense, but the second may still cause the issue where VAD interrupts STT-committed speech?
From my tests, it is often during the VAD endpointing silence that the interruption triggers after STT EOS. I think it should be okay to interrupt in this case.
Co-authored-by: Long Chen <longch1024@gmail.com>
longcw left a comment
lgtm!
This should close #4615.