Skip to content

Conversation

@tmshort
Copy link
Contributor

@tmshort tmshort commented Jan 28, 2026

Implements ImmediateFallbackDialContext that removes the 300ms delay from Go's Happy Eyeballs algorithm by trying addresses sequentially in the order returned by DNS, without racing or artificial delays.

This respects DNS server address ordering (which already optimizes for the local network environment) while eliminating the delay that causes IPv6 "network is unreachable" failures in dual-stack environments where IPv6 has internal-only routing.

All network clients (HTTP, Kubernetes REST, image pulls) now use the immediate fallback dialer.

Assisted-By: Claude

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot requested review from anik120 and trgeiger January 28, 2026 19:28
@openshift-ci
Copy link

openshift-ci bot commented Jan 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tmshort for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify
Copy link

netlify bot commented Jan 28, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 65efd5d
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/697bb8c74539e900088b0ca8
😎 Deploy Preview https://deploy-preview-2464--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@codecov
Copy link

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 73.84615% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.54%. Comparing base (7a60e71) to head (65efd5d).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
internal/shared/util/http/httputil.go 77.19% 8 Missing and 5 partials ⚠️
cmd/catalogd/main.go 60.00% 1 Missing and 1 partial ⚠️
cmd/operator-controller/main.go 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2464      +/-   ##
==========================================
- Coverage   69.87%   69.54%   -0.33%     
==========================================
  Files         101      102       +1     
  Lines        7991     8292     +301     
==========================================
+ Hits         5584     5767     +183     
- Misses       1968     2066      +98     
- Partials      439      459      +20     
Flag Coverage Δ
e2e 46.97% <58.46%> (-0.14%) ⬇️
experimental-e2e 13.77% <52.30%> (+0.31%) ⬆️
unit 57.52% <50.76%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@grokspawn
Copy link
Contributor

This appears to address the issue, but it doesn't entirely sit well with me. Ideally, I'd like to see a top-level instantiable which handles the dial context, or we would see a fix in the default HTTP service mechanism to robustly perform v4/v6 evaluation and fallback, that this code would just enjoy.
And if we have to do a thing as fundamental as this in our codebase, I would really like to see capture of this approach in the v1 design decisions, since we're diverging from what is evidently a core library default behavior.

@sdodson
Copy link

sdodson commented Jan 29, 2026

Yeah this seems like a misconfigured environment. I'd like to hear from the metal team what's going on here before we consider client side preferences for IPv4 over IPv6.


tlsConfig := &tls.Config{
// Clone the default transport to inherit IPv4 preference and other defaults
tlsTransport := http.DefaultTransport.(*http.Transport).Clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if type casting fails?

Comment on lines 14 to 21
// Initialize klog flags for testing
klog.InitFlags(nil)
if err := flag.Set("v", "4"); err != nil {
t.Fatalf("Failed to set v flag: %v", err)
}
if err := flag.Set("logtostderr", "true"); err != nil {
t.Fatalf("Failed to set logtostderr flag: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be in some sort of TestMain? do we really need this?

// We expect connection refused or similar, not a DNS error
if err != nil {
t.Logf("Connection failed as expected: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err == nil, should we fail the test?

// Only IPv6 available
t.Logf("Testing IPv6-only: should use available IPv6 address(es)")
t.Logf("✓ Will use %d IPv6 address(es)", len(ipv6Addrs))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in what cases we are failing the tests?

@tmshort
Copy link
Contributor Author

tmshort commented Jan 29, 2026

Yeah this seems like a misconfigured environment. I'd like to hear from the metal team what's going on here before we consider client side preferences for IPv4 over IPv6.

@sdodson Who would we reach out to on the metal environment?

@sdodson
Copy link

sdodson commented Jan 29, 2026

Devan / Catherine are going to get a bug open against metal platform. This seems to happen in a test that configures an environment that's specifically IPv4 but may for some reason behave in an unexpected manner when the host OS is RHCOS10 rather than RHCOS9, assuming this has worked in the past.

I think we should also ask the networking team to define a policy for what clients should do in this scenario. Quick searching seems to indicate that prevailing wisdom is to prefer IPv6 but fail fast and latch that context to IPv4 if IPv6 fails and IPv4 works.

@tmshort
Copy link
Contributor Author

tmshort commented Jan 29, 2026

Devan / Catherine are going to get a bug open against metal platform. This seems to happen in a test that configures an environment that's specifically IPv4 but may for some reason behave in an unexpected manner when the host OS is RHCOS10 rather than RHCOS9, assuming this has worked in the past.

I think we should also ask the networking team to define a policy for what clients should do in this scenario. Quick searching seems to indicate that prevailing wisdom is to prefer IPv6 but fail fast and latch that context to IPv4 if IPv6 fails and IPv4 works.

The failure seems to be that we aren't falling over to IPv4, because IPv6 is failing fairly quickly, so the 300ms delay that would move us over to the IPv4 protocol isn't happening.
So, it still sounds as though we might need something like what we have here, but with slightly different logic.

EDIT: I'm a bit concerned about latching to a particular address type; different servers may support different address types.

@tmshort tmshort changed the title 🐛 Configure IPv4 preference for all network clients 🐛 Add immediate fallback dialer to eliminate Happy Eyeballs delay Jan 29, 2026
Comment on lines 205 to 209
if transport, ok := http.DefaultTransport.(*http.Transport); ok {
transport.DialContext = httputil.ImmediateFallbackDialContext
} else {
setupLog.Error(errors.New("http.DefaultTransport type assertion failed"), "Failed to configure custom dialer: http.DefaultTransport is not *http.Transport")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we deduplicate this snippet, given that appear twice in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the size of this snippet, I'm not sure it would actually reduce the code size that much.
i.e. we define a new function of at ~5 lines, and then the snippet in each main.go remains at 3 lines to handle errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I did think about it, since the. code is common, but it's such a small amount of code)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I did take this out and make it a common function.

}

// tryAddresses attempts to connect to a list of addresses sequentially
tryAddresses := func(addresses []net.IP, addressType, category, dialNetwork string) (net.Conn, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that category arg is just passed to be used in log message. Given that this is not a standalone function (closure), I would suggest that we do not log here, instead let's return properly crafted error message, with the details we would like to log, and log it at the caller side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to disagree here, since we are logging multiple attempts and failures. There may be multiple failures due to multiple addresses being available, and we would lose that information when an address fails. All we'd get is the logging of the final failure, and not the attempts in between.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I now question the need to separate out primary and fallback addresses - this could be just one list.

Comment on lines +13 to +14
wantFail bool
minExpectedAddrs int // minimum addresses we expect to find
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets group expectations from the test input, so that we can read them better:

want struct {
  fail bool
  minExpectedAddrs int
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure adding a new structure inside another structure is of much benefit, TBH.

}

tlsConfig := &tls.Config{
// Clone the default transport to inherit custom dialer and other defaults
Copy link
Contributor

@camilamacedo86 camilamacedo86 Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like bellow. By updating BuildHTTPClient to avoid assuming http.DefaultTransport is always *http.Transport. Right now it can panic or fail if something replaces the default transport.

A simple fix is: try to clone http.DefaultTransport when it’s a *http.Transport, otherwise fall back to a new http.Transport with basic defaults. Then set TLSClientConfig on that single transport.

This also seems to removes the duplicate tlsTransport := &http.Transport{...} block, so we don’t overwrite the cloned transport.

base, ok := http.DefaultTransport.(*http.Transport)

var tlsTransport *http.Transport
if ok && base != nil {
    tlsTransport = base.Clone()
} else {
    // Fallback: avoid panic if someone replaced http.DefaultTransport.
    tlsTransport = &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        // DialContext left as default unless your code sets it elsewhere
        ForceAttemptHTTP2:     true,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
    }
}

tlsTransport.TLSClientConfig = &tls.Config{
    RootCAs:    pool,
    MinVersion: tls.VersionTLS12,
}

httpClient.Transport = tlsTransport

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot more code, for not much benefit, and I'm concerned with us setting all those http.Transport{} values to something fixed. Better to indicate an error, IMHO, since it would be a coding error on our part if the original code does not work.
There should be NO reason why the casting fails without a golang code change

transport.DialContext = httputil.ImmediateFallbackDialContext
} else {
setupLog.Error(errors.New("http.DefaultTransport type assertion failed"), "Failed to configure custom dialer: http.DefaultTransport is not *http.Transport")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about we wrapper this one to avoid pacnic?
Something like

// dialerRoundTripper injects DialContext into any *http.Transport it sees.
// If the underlying RT isn't a *http.Transport, it just passes through.
type dialerRoundTripper struct {
	base http.RoundTripper
}

func (d dialerRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {
	if tr, ok := d.base.(*http.Transport); ok && tr != nil {
		cl := tr.Clone()
		cl.DialContext = httputil.IPv4PreferringDialContext
		return cl.RoundTrip(r)
	}
	return d.base.RoundTrip(r)
}

func init() {
	if http.DefaultTransport == nil {
		http.DefaultTransport = http.Transport{}
	}
	http.DefaultTransport = dialerRoundTripper{base: http.DefaultTransport}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, a panic indicates that we did something unexpected (wrong) that caused this to happen, so we want the panic to occur. See my other comment:

That's a lot more code, for not much benefit, and I'm concerned with us setting all those http.Transport{} values to something fixed. Better to indicate an error, IMHO, since it would be a coding error on our part if the original code does not work.
There should be NO reason why the casting fails without a golang code change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to a common function.

@@ -274,7 +284,10 @@ func run(ctx context.Context) error {
}

// Create manager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a feature gate for this one ?
OR falg option to be default ipv6 and then with flag ipv4 something like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It's not feature-gate "worthy", since it's attempting to fix a bug. The behavior is mimicking existing behavior without any explicit default of IPv4 or IPv6. So, there's really no choice between the two.

Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @tmshort has addressed all the comments. Answered them out.
There’s a bug we need to fix soon, so I am ok with forward.

We can always follow up to improve the implementation.
Perfection shouldn’t block progress.

/approved

@tmshort
Copy link
Contributor Author

tmshort commented Jan 29, 2026

@sdodson The latest changes should at least give us additional insight into what is happening, without significantly changing the behavior.
For now, we'll keep this in our back pocket in case no issues are found with the environment.

Implements ImmediateFallbackDialContext that removes the 300ms delay
from Go's Happy Eyeballs algorithm by trying addresses sequentially
in the order returned by DNS, without racing or artificial delays.

This respects DNS server address ordering (which already optimizes for
the local network environment) while eliminating the delay that causes
IPv6 "network is unreachable" failures in dual-stack environments where
IPv6 has internal-only routing.

All network clients (HTTP, Kubernetes REST, image pulls) now use the
immediate fallback dialer.

Signed-off-by: Todd Short <tshort@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tmshort
Copy link
Contributor Author

tmshort commented Jan 29, 2026

/hold

Until we know if we really want this, maybe it's an environment problem?

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants