-
Notifications
You must be signed in to change notification settings - Fork 68
feat: add nearest DC detection with TCP race #763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
1cced85 to
139dde2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds an opt-in “nearest datacenter” selection mechanism for endpoint discovery, using a TCP connection race to prefer endpoints in the lowest-latency location when DriverConfig(detect_local_dc=True) is enabled.
Changes:
- Add sync + async
nearest_dcmodules implementing TCP-race-based location detection. - Wire DC detection into sync and async discovery pools to mark endpoints in the detected location as preferred.
- Add
detect_local_dcflag toDriverConfigand introduce unit tests for both sync and async implementations.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/pool.py | Uses detected local DC (when enabled) to set endpoint preference during discovery. |
| ydb/nearest_dc.py | Implements synchronous TCP race DC detection helpers. |
| ydb/driver.py | Adds detect_local_dc configuration flag to DriverConfig. |
| ydb/aio/pool.py | Async discovery now prefers endpoints in the detected local DC. |
| ydb/aio/nearest_dc.py | Implements asynchronous TCP race DC detection helpers. |
| tests/test_nearest_dc.py | Adds unit tests for sync nearest DC detection behavior. |
| tests/aio/test_nearest_dc.py | Adds unit tests for async nearest DC detection behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if not self._ssl_required and resolved_endpoint.ssl: | ||
| continue | ||
|
|
||
| preferred = resolve_details.self_location == resolved_endpoint.location | ||
| preferred = local_dc == resolved_endpoint.location | ||
|
|
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detect_local_dc only influences the preferred flag, but ConnectionsCache ignores the preferred list when DriverConfig.use_all_nodes is True (which is the default). As a result, enabling detect_local_dc=True may have no effect on endpoint selection for the default configuration. Consider either (a) updating the selection logic so preferred endpoints are prioritized even when use_all_nodes=True (while still falling back to others), or (b) documenting that detect_local_dc is only effective when use_all_nodes=False.
| if not self._ssl_required and resolved_endpoint.ssl: | ||
| continue | ||
|
|
||
| preferred = resolve_details.self_location == resolved_endpoint.location | ||
| preferred = local_dc == resolved_endpoint.location | ||
|
|
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as sync pool: the preferred flag set from local_dc is only used by ConnectionsCache when use_all_nodes is False. With the default use_all_nodes=True, detect_local_dc=True is unlikely to change endpoint selection. Either prioritize preferred endpoints even in use_all_nodes mode, or clarify in config/docs that detect_local_dc requires use_all_nodes=False to have an effect.
| if len(endpoints_by_location) == 1: | ||
| location = list(endpoints_by_location.keys())[0] | ||
| logger.info("Only one location found: %s", location) | ||
| return location | ||
|
|
||
| endpoints_to_test = [] | ||
| for location, location_endpoints in endpoints_by_location.items(): | ||
| sample = _get_random_endpoints(location_endpoints, max_per_location) | ||
| endpoints_to_test.extend(sample) | ||
| logger.debug( | ||
| "Selected %d/%d endpoints from location '%s' for testing", | ||
| len(sample), | ||
| len(location_endpoints), | ||
| location, | ||
| ) | ||
|
|
||
| fastest_endpoint = _check_fastest_endpoint(endpoints_to_test, timeout=timeout) | ||
|
|
||
| if fastest_endpoint is None: | ||
| logger.warning("Failed to detect local DC via TCP race: no endpoint connected in time") | ||
| return None | ||
|
|
||
| detected_location = fastest_endpoint.location | ||
| logger.info("Detected local DC: %s", detected_location) | ||
|
|
||
| return detected_location |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helper logs at INFO/WARNING for common outcomes (single-location, success, and “no endpoint connected”), but callers (pool.py / aio/pool.py) also log these events, leading to duplicate log lines on every discovery cycle. Consider downgrading these to DEBUG here (and letting the pool layer handle INFO/WARNING), or add a way for the caller to control logging verbosity.
| if len(endpoints_by_location) == 1: | ||
| location = list(endpoints_by_location.keys())[0] | ||
| logger.info("Only one location found: %s", location) | ||
| return location | ||
|
|
||
| endpoints_to_test = [] | ||
| for location, location_endpoints in endpoints_by_location.items(): | ||
| sample = _get_random_endpoints(location_endpoints, max_per_location) | ||
| endpoints_to_test.extend(sample) | ||
| logger.debug( | ||
| "Selected %d/%d endpoints from location '%s' for testing", | ||
| len(sample), | ||
| len(location_endpoints), | ||
| location, | ||
| ) | ||
|
|
||
| fastest_endpoint = await _check_fastest_endpoint(endpoints_to_test, timeout=timeout) | ||
|
|
||
| if fastest_endpoint is None: | ||
| logger.warning("Failed to detect local DC via TCP race: no endpoint connected in time") | ||
| return None | ||
|
|
||
| detected_location = fastest_endpoint.location | ||
| logger.info("Detected local DC: %s", detected_location) | ||
|
|
||
| return detected_location |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detect_local_dc logs INFO/WARNING in this helper, and the pool layer also logs success/failure, which can duplicate log lines during discovery. Consider making these logs DEBUG-only here (or making logging optional) and keeping user-facing logs in the pool layer where server-reported vs detected location is already included.
| :param grpc_lb_policy_name: A load balancing policy to be used for discovery channel construction. Default value is `round_round` | ||
| :param discovery_request_timeout: A default timeout to complete the discovery. The default value is 10 seconds. | ||
| :param disable_discovery: If True, endpoint discovery is disabled and only the start endpoint is used for all requests. | ||
| :param detect_local_dc: If True, detect nearest datacenter using TCP latency measurement instead of using server-provided self_location. |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring for detect_local_dc suggests it can be enabled independently, but with the current pool/cache logic it only affects routing when use_all_nodes=False (since preferred endpoints are ignored when using all nodes). Please either update the docstring to mention this interaction, or adjust the implementation so detect_local_dc=True has an effect even with use_all_nodes=True.
| finally: | ||
| sock.close() | ||
|
|
||
| except (OSError, socket.timeout): |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| except (OSError, socket.timeout): | |
| except (OSError, socket.timeout): | |
| # Ignore expected connection errors; endpoints that fail simply lose the TCP race. |
Add automatic nearest datacenter detection for connection pooling
Pull request type
What is the current behavior?
The SDK uses round-robin load balancing across all discovered endpoints without considering network latency, which can lead to suboptimal performance in multi-datacenter deployments.
Fixes: #446
What is the new behavior?
Adds opt-in nearest datacenter detection (enabled via
DriverConfig(detect_local_dc=True)) that uses TCP connection racing to identify and prioritize the fastest-responding datacenter, reducing request latency in multi-DC deployments.Other information