-
Notifications
You must be signed in to change notification settings - Fork 71
🐛 Add immediate fallback dialer to eliminate Happy Eyeballs delay #2464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2464 +/- ##
==========================================
- Coverage 69.87% 69.54% -0.33%
==========================================
Files 101 102 +1
Lines 7991 8292 +301
==========================================
+ Hits 5584 5767 +183
- Misses 1968 2066 +98
- Partials 439 459 +20
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This appears to address the issue, but it doesn't entirely sit well with me. Ideally, I'd like to see a top-level instantiable which handles the dial context, or we would see a fix in the default HTTP service mechanism to robustly perform v4/v6 evaluation and fallback, that this code would just enjoy. |
|
Yeah this seems like a misconfigured environment. I'd like to hear from the metal team what's going on here before we consider client side preferences for IPv4 over IPv6. |
|
|
||
| tlsConfig := &tls.Config{ | ||
| // Clone the default transport to inherit IPv4 preference and other defaults | ||
| tlsTransport := http.DefaultTransport.(*http.Transport).Clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if type casting fails?
| // Initialize klog flags for testing | ||
| klog.InitFlags(nil) | ||
| if err := flag.Set("v", "4"); err != nil { | ||
| t.Fatalf("Failed to set v flag: %v", err) | ||
| } | ||
| if err := flag.Set("logtostderr", "true"); err != nil { | ||
| t.Fatalf("Failed to set logtostderr flag: %v", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be in some sort of TestMain? do we really need this?
| // We expect connection refused or similar, not a DNS error | ||
| if err != nil { | ||
| t.Logf("Connection failed as expected: %v", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err == nil, should we fail the test?
| // Only IPv6 available | ||
| t.Logf("Testing IPv6-only: should use available IPv6 address(es)") | ||
| t.Logf("✓ Will use %d IPv6 address(es)", len(ipv6Addrs)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in what cases we are failing the tests?
@sdodson Who would we reach out to on the metal environment? |
|
Devan / Catherine are going to get a bug open against metal platform. This seems to happen in a test that configures an environment that's specifically IPv4 but may for some reason behave in an unexpected manner when the host OS is RHCOS10 rather than RHCOS9, assuming this has worked in the past. I think we should also ask the networking team to define a policy for what clients should do in this scenario. Quick searching seems to indicate that prevailing wisdom is to prefer IPv6 but fail fast and latch that context to IPv4 if IPv6 fails and IPv4 works. |
The failure seems to be that we aren't falling over to IPv4, because IPv6 is failing fairly quickly, so the 300ms delay that would move us over to the IPv4 protocol isn't happening. EDIT: I'm a bit concerned about latching to a particular address type; different servers may support different address types. |
cmd/operator-controller/main.go
Outdated
| if transport, ok := http.DefaultTransport.(*http.Transport); ok { | ||
| transport.DialContext = httputil.ImmediateFallbackDialContext | ||
| } else { | ||
| setupLog.Error(errors.New("http.DefaultTransport type assertion failed"), "Failed to configure custom dialer: http.DefaultTransport is not *http.Transport") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we deduplicate this snippet, given that appear twice in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the size of this snippet, I'm not sure it would actually reduce the code size that much.
i.e. we define a new function of at ~5 lines, and then the snippet in each main.go remains at 3 lines to handle errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I did think about it, since the. code is common, but it's such a small amount of code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I did take this out and make it a common function.
| } | ||
|
|
||
| // tryAddresses attempts to connect to a list of addresses sequentially | ||
| tryAddresses := func(addresses []net.IP, addressType, category, dialNetwork string) (net.Conn, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems that category arg is just passed to be used in log message. Given that this is not a standalone function (closure), I would suggest that we do not log here, instead let's return properly crafted error message, with the details we would like to log, and log it at the caller side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to disagree here, since we are logging multiple attempts and failures. There may be multiple failures due to multiple addresses being available, and we would lose that information when an address fails. All we'd get is the logging of the final failure, and not the attempts in between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, I now question the need to separate out primary and fallback addresses - this could be just one list.
| wantFail bool | ||
| minExpectedAddrs int // minimum addresses we expect to find |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets group expectations from the test input, so that we can read them better:
want struct {
fail bool
minExpectedAddrs int
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure adding a new structure inside another structure is of much benefit, TBH.
| } | ||
|
|
||
| tlsConfig := &tls.Config{ | ||
| // Clone the default transport to inherit custom dialer and other defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like bellow. By updating BuildHTTPClient to avoid assuming http.DefaultTransport is always *http.Transport. Right now it can panic or fail if something replaces the default transport.
A simple fix is: try to clone http.DefaultTransport when it’s a *http.Transport, otherwise fall back to a new http.Transport with basic defaults. Then set TLSClientConfig on that single transport.
This also seems to removes the duplicate tlsTransport := &http.Transport{...} block, so we don’t overwrite the cloned transport.
base, ok := http.DefaultTransport.(*http.Transport)
var tlsTransport *http.Transport
if ok && base != nil {
tlsTransport = base.Clone()
} else {
// Fallback: avoid panic if someone replaced http.DefaultTransport.
tlsTransport = &http.Transport{
Proxy: http.ProxyFromEnvironment,
// DialContext left as default unless your code sets it elsewhere
ForceAttemptHTTP2: true,
MaxIdleConns: 100,
IdleConnTimeout: 90 * time.Second,
TLSHandshakeTimeout: 10 * time.Second,
ExpectContinueTimeout: 1 * time.Second,
}
}
tlsTransport.TLSClientConfig = &tls.Config{
RootCAs: pool,
MinVersion: tls.VersionTLS12,
}
httpClient.Transport = tlsTransport
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot more code, for not much benefit, and I'm concerned with us setting all those http.Transport{} values to something fixed. Better to indicate an error, IMHO, since it would be a coding error on our part if the original code does not work.
There should be NO reason why the casting fails without a golang code change
| transport.DialContext = httputil.ImmediateFallbackDialContext | ||
| } else { | ||
| setupLog.Error(errors.New("http.DefaultTransport type assertion failed"), "Failed to configure custom dialer: http.DefaultTransport is not *http.Transport") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about we wrapper this one to avoid pacnic?
Something like
// dialerRoundTripper injects DialContext into any *http.Transport it sees.
// If the underlying RT isn't a *http.Transport, it just passes through.
type dialerRoundTripper struct {
base http.RoundTripper
}
func (d dialerRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {
if tr, ok := d.base.(*http.Transport); ok && tr != nil {
cl := tr.Clone()
cl.DialContext = httputil.IPv4PreferringDialContext
return cl.RoundTrip(r)
}
return d.base.RoundTrip(r)
}
func init() {
if http.DefaultTransport == nil {
http.DefaultTransport = http.Transport{}
}
http.DefaultTransport = dialerRoundTripper{base: http.DefaultTransport}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, a panic indicates that we did something unexpected (wrong) that caused this to happen, so we want the panic to occur. See my other comment:
That's a lot more code, for not much benefit, and I'm concerned with us setting all those http.Transport{} values to something fixed. Better to indicate an error, IMHO, since it would be a coding error on our part if the original code does not work.
There should be NO reason why the casting fails without a golang code change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to a common function.
| @@ -274,7 +284,10 @@ func run(ctx context.Context) error { | |||
| } | |||
|
|
|||
| // Create manager | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a feature gate for this one ?
OR falg option to be default ipv6 and then with flag ipv4 something like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It's not feature-gate "worthy", since it's attempting to fix a bug. The behavior is mimicking existing behavior without any explicit default of IPv4 or IPv6. So, there's really no choice between the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @tmshort has addressed all the comments. Answered them out.
There’s a bug we need to fix soon, so I am ok with forward.
We can always follow up to improve the implementation.
Perfection shouldn’t block progress.
/approved
|
@sdodson The latest changes should at least give us additional insight into what is happening, without significantly changing the behavior. |
Implements ImmediateFallbackDialContext that removes the 300ms delay from Go's Happy Eyeballs algorithm by trying addresses sequentially in the order returned by DNS, without racing or artificial delays. This respects DNS server address ordering (which already optimizes for the local network environment) while eliminating the delay that causes IPv6 "network is unreachable" failures in dual-stack environments where IPv6 has internal-only routing. All network clients (HTTP, Kubernetes REST, image pulls) now use the immediate fallback dialer. Signed-off-by: Todd Short <tshort@redhat.com> Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/hold Until we know if we really want this, maybe it's an environment problem? |
Implements ImmediateFallbackDialContext that removes the 300ms delay from Go's Happy Eyeballs algorithm by trying addresses sequentially in the order returned by DNS, without racing or artificial delays.
This respects DNS server address ordering (which already optimizes for the local network environment) while eliminating the delay that causes IPv6 "network is unreachable" failures in dual-stack environments where IPv6 has internal-only routing.
All network clients (HTTP, Kubernetes REST, image pulls) now use the immediate fallback dialer.
Assisted-By: Claude
Description
Reviewer Checklist