Stabilize Route Learning with Active/Backup Path Failover #1777

Open
robekl wants to merge 1 commit into meshcore-dev:dev from robekl:mitigate_first_packet_wins

robekl commented Feb 21, 2026

Summary

This change replaces the effective “first packet wins” route behavior with a simple, embedded-safe active/backup path strategy for contacts. It adds bounded failover logic for direct routing and reduces unnecessary path-update churn.

Problem

Current route learning is vulnerable to path churn:

  • newly learned paths can displace working routes too aggressively,
  • transient RF conditions can cause repeated direct failures,
  • fallback behavior can oscillate between direct and flood without enough local stabilization.

In practice, this can make delivery quality inconsistent even when the mesh is otherwise healthy.

Impact When It Happens

When path churn occurs, users can see:

  • intermittent direct-send failures,
  • repeated timeout/retry cycles before recovery,
  • temporary reliability drops in mobility-heavy or noisy RF conditions,
  • extra airtime consumed by recovery traffic and repeated attempts.

Scope of the Problem

This primarily affects contact-based direct messaging/request flows in dynamic conditions:

  • moving nodes/repeaters,
  • changing link quality,
  • multipath/interference scenarios where “first seen path” is not consistently best.

Static/small meshes are less affected but can still hit this during topology changes.

Description of the Change

The implementation introduces a bounded per-contact routing state and simple switching rules:

  • active path + backup path storage per contact,
  • direct-path failure counter,
  • path-switch cooldown window,
  • backup path age limit,
  • temporary direct-block window when no usable backup exists,
  • path-update callback suppression unless the active path actually changed.
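
To make the list above concrete, here is a minimal sketch of what the fixed-size per-contact state could look like. All type names, field names, and the path-length bound are hypothetical and chosen for illustration; they are not taken from the actual diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical upper bound on stored path bytes (an assumption).
constexpr uint8_t kMaxPathLen = 64;

// One stored route: a fixed-size buffer, no dynamic allocation.
struct StoredPath {
  bool     valid = false;             // false means "no path stored"
  uint8_t  len = 0;                   // number of path bytes in use
  uint8_t  bytes[kMaxPathLen] = {};
  uint32_t learned_at = 0;            // seconds; input to the backup age limit

  void set(const uint8_t* src, uint8_t n, uint32_t now) {
    assert(n <= kMaxPathLen);         // caller must respect the bound
    std::memcpy(bytes, src, n);
    len = n;
    learned_at = now;
    valid = true;
  }
};

// Bounded routing state kept per contact (illustrative layout).
struct ContactRouteState {
  StoredPath active;                  // path used for direct sends
  StoredPath backup;                  // fallback if the active path keeps failing
  uint8_t    direct_failures = 0;     // direct-path failure counter
  uint32_t   last_switch_at = 0;      // start of the path-switch cooldown window
  uint32_t   direct_blocked_until = 0; // end of the temporary direct-block window
};
```

The point of the layout is that memory cost per contact is constant: two path buffers plus a handful of counters and timestamps.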

Behavioral updates:

  • new inbound path candidates no longer always replace the active path,
  • better candidates (for example, shorter paths) can promote to active under simple rules,
  • repeated direct failures trigger backup activation,
  • if no backup is available, direct is temporarily blocked and flood is used to relearn current conditions.
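
The failover and blocking rules can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the threshold values, the type and function names, and the choice to discard the failed path on failover are all hypothetical. Timestamps are plain seconds and counter wraparound is ignored for brevity.

```cpp
#include <cstdint>

// Hypothetical thresholds; real values may need tuning per deployment.
constexpr uint8_t  kMaxDirectFailures = 3;
constexpr uint32_t kMaxBackupAgeSecs  = 600;
constexpr uint32_t kDirectBlockSecs   = 60;

struct Path {
  bool     valid = false;
  uint8_t  hops = 0;
  uint32_t learned_at = 0;
};

struct RouteState {
  Path     active, backup;
  uint8_t  direct_failures = 0;
  uint32_t direct_blocked_until = 0;
};

enum class Route { Direct, Flood };

// A backup is usable only if it exists and is within the age limit.
inline bool backupUsable(const RouteState& s, uint32_t now) {
  return s.backup.valid && (now - s.backup.learned_at) <= kMaxBackupAgeSecs;
}

// Call when a direct send to this contact times out.
inline void onDirectTimeout(RouteState& s, uint32_t now) {
  if (!s.active.valid) return;
  if (++s.direct_failures < kMaxDirectFailures) return;  // not yet "repeated"
  s.direct_failures = 0;
  if (backupUsable(s, now)) {
    s.active = s.backup;   // fail over to the backup path
    s.backup = Path{};     // assumption: the failed path is discarded
  } else {
    s.active = Path{};     // assumption: forget the failing path entirely
    s.backup = Path{};     // any stale backup is dropped as well
    s.direct_blocked_until = now + kDirectBlockSecs;  // force flood for a while
  }
}

// Call when a direct send is acknowledged.
inline void onDirectSuccess(RouteState& s) { s.direct_failures = 0; }

// Decide how the next send to this contact should be routed.
inline Route chooseRoute(const RouteState& s, uint32_t now) {
  if (!s.active.valid || now < s.direct_blocked_until) return Route::Flood;
  return Route::Direct;
}
```

In this sketch the block window still matters after the active path is cleared: a path relearned during the window is stored, but sends keep flooding until the window expires.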

How This Addresses the Problem

The change adds local route stability without protocol changes:

  • avoids immediate path replacement from transient arrivals,
  • provides deterministic failover to known backup routes,
  • prevents rapid re-entry into failing direct paths when no backup exists,
  • reduces update noise from backup-only changes.

This shifts behavior from reactive single-route churn to controlled two-route resilience.

Scope of the Fix

In scope:

  • contact route learning/update behavior,
  • direct-failure tracking and failover,
  • cooldown/blocking safeguards,
  • path-update notification/write churn reduction.

Out of scope:

  • protocol/wire-format changes,
  • multi-path scoring frameworks,
  • broad telemetry/analytics expansion,
  • non-contact routing architecture changes.

Benefits

  • More stable direct delivery under changing conditions.
  • Faster recovery from bad active paths.
  • Lower oscillation between direct and flood.
  • Fewer unnecessary path update notifications/writes.
  • Protocol-compatible and embedded-friendly (fixed-size state, simple logic).

Drawbacks / Tradeoffs

  • Additional per-contact state fields.
  • Slightly more control-flow complexity in path handling.
  • Heuristic thresholds (failure count, cooldown, block duration, backup age) may need tuning by deployment profile.
  • Not a full route-quality scoring system; intentionally simple.

New Complexity Introduced

  • Moderate but bounded:
    • one backup path and a few counters/timestamps per contact,
    • small state machine for promotion/failover/blocking,
    • no dynamic allocation, no protocol changes, no unbounded structures.

ROI

High.

  • Small implementation footprint.
  • No protocol migration cost.
  • Directly improves reliability in scenarios where users feel instability most.
  • Reduces operational pain from path churn while preserving maintainability and embedded constraints.

Replace first-packet-wins route replacement with a simple active+backup model per contact.

New path candidates are evaluated conservatively: shorter paths can promote to active, while others remain backup candidates. Direct path failures are counted, and repeated timeouts trigger backup activation when available.

If no backup is usable, direct routing is temporarily blocked for that contact so sends fall back to flood and relearn under current conditions. Path-update callbacks now fire only when the active path changes, reducing unnecessary write/notify churn.
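
A rough sketch of the candidate handling and callback gating described above. The names, the cooldown value, and the hop-count-only comparison are assumptions for illustration; a real implementation would also compare path contents to recognize a candidate that is identical to the active path.

```cpp
#include <cstdint>

// Hypothetical cooldown between active-path switches.
constexpr uint32_t kSwitchCooldownSecs = 30;

struct Path {
  bool     valid = false;
  uint8_t  hops = 0;
  uint32_t learned_at = 0;
};

struct RouteState {
  Path     active, backup;
  uint32_t last_switch_at = 0;
};

// Evaluate a newly learned path. Returns true only if the ACTIVE path changed.
inline bool onPathCandidate(RouteState& s, uint8_t hops, uint32_t now) {
  Path cand{true, hops, now};
  if (!s.active.valid) {               // nothing stored yet: adopt immediately
    s.active = cand;
    s.last_switch_at = now;
    return true;
  }
  bool cooled = (now - s.last_switch_at) >= kSwitchCooldownSecs;
  if (hops < s.active.hops && cooled) { // shorter path, outside the cooldown
    s.backup = s.active;                // demote the old active path to backup
    s.active = cand;
    s.last_switch_at = now;
    return true;
  }
  s.backup = cand;                      // otherwise keep it as the backup
  return false;
}

// Hypothetical callback type standing in for the path-update notification.
using PathUpdatedFn = void (*)(void* ctx);

// Notify (and, in a real node, persist) only when the active path changed.
inline void learnPath(RouteState& s, uint8_t hops, uint32_t now,
                      PathUpdatedFn cb, void* ctx) {
  if (onPathCandidate(s, hops, now) && cb) cb(ctx);
}
```

With this shape, backup-only changes never reach the callback, which is where the reduction in write/notify churn comes from.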

robekl changed the base branch from main to dev on February 21, 2026, 17:06