-
Notifications
You must be signed in to change notification settings - Fork 62
Add environment variables to control container resource constraints (COMPLEMENT_CONTAINER_CPU_CORES, COMPLEMENT_CONTAINER_MEMORY)
#827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
`go run ./cmd/gendoc --config config/config.go > ENVIRONMENT.md`
| constraintStrings = append(constraintStrings, fmt.Sprintf("%.1f CPU cores", cfg.ContainerCPUCores)) | ||
| } | ||
| if cfg.ContainerMemoryBytes > 0 { | ||
| // TODO: It would be nice to pretty print this in MB/GB etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a future TODO (not for this PR)
| // Valid units are "B", (decimal: "KB", "MB", "GB, "TB, "PB"), (binary: "KiB", "MiB", | ||
| // "GiB", "TiB", "PiB") or no units (bytes). We also support "K", "M", "G" as per | ||
| // Docker's CLI. | ||
| func parseByteSizeString(inputString string) (int64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just based off glancing a few implementations and trying to make it as simple as possible for our usage here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With some hindsight, we could go even simpler and only support the same units that the Docker CLI has:
Most of these options take a positive integer, followed by a suffix of
b,k,m,g, to indicate bytes, kilobytes, megabytes, or gigabytes.-- https://docs.docker.com/engine/containers/resource_constraints/
I built this the other way around though. I wanted to be able to pass in COMPLEMENT_CONTAINER_MEMORY=1GB and wrote this out, then only later added in the Docker CLI unit variants to come to this realization 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'll be nice dev UX to be this flexible.
| // These are also supported to match Docker's CLI | ||
| "k": 1024, | ||
| "m": intPow(1024, 2), | ||
| "g": intPow(1024, 3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See docker/cli -> cli/command/container/opts_test.go#L376-L377 to cross-check decimal vs binary.
Documented as:
Most of these options take a positive integer, followed by a suffix of
b,k,m,g, to indicate bytes, kilobytes, megabytes, or gigabytes.-- https://docs.docker.com/engine/containers/resource_constraints/
…t` to account for slow servers (de-flake) (#830) Follow-up to #829 Part of element-hq/synapse#18537 ### What does this PR do? Fix `TestDelayedEvents/delayed_state_events_are_kept_on_server_restart` to account for slow servers (de-flake). Previously, this was naively using a single delayed event with a 10 second delay. But because we're stopping and starting servers here, it could take up `deployment.GetConfig().SpawnHSTimeout` (defaults to 30 seconds) for the server to start up again so by the time the server is back up, the delayed event may have already been sent, invalidating our assertions below (which expect some delayed events to still be pending and then see one of them be sent after the server is back up). We could account for this by setting the delayed event delay to be longer than `deployment.GetConfig().SpawnHSTimeout` but that would make the test suite take longer to run in all cases even for homeservers that are quick to restart because we have to wait for that large delay. We instead account for this by scheduling many delayed events at short intervals (we chose 10 seconds because that's what the test naively chose before). Then whenever the servers comes back, we can just check until it decrements by 1. ### Experiencing the flaky failure As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, element-hq/synapse-rust-apps#344 (comment). We probably experience this heavily in the private project because GitHub runners are less than half as powerful as those for public projects and that single container with a share of the 2 CPU cores available is just not powerful enough to run all 16 workers effectively. <details> <summary>For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.</summary> > #### Standard GitHub-hosted runners for public repositories > > Virtual machine | Processor (CPU) | Memory (RAM) | Storage (SSD) | Architecture | Workflow label > --- | --- | --- | --- | --- | --- > Linux | 4 | 16 GB | 14 GB | x64 | ubuntu-latest, ubuntu-24.04, ubuntu-22.04 > > *-- [Standard GitHub-hosted runners for public repositories](https://docs.github.com/en/actions/reference/runners/github-hosted-runners#standard-github-hosted-runners-for-public-repositories)* --- > #### Standard GitHub-hosted runners for private repositories > > Virtual Machine | Processor (CPU) | Memory (RAM) | Storage (SSD) | Architecture | Workflow label > --- | --- | --- | --- | --- | --- > Linux | 2 | 7 GB | 14 GB | x64 | ubuntu-latest, ubuntu-24.04, ubuntu-22.04 > > *-- [Standard GitHub-hosted runners for private repositories](https://docs.github.com/en/actions/reference/runners/github-hosted-runners#standard-github-hosted-runners-for-public-repositories)* </details> And for the same slow reasons, why we're also experiencing this as an occasional [flake](element-hq/synapse#18537) with `(workers, postgres)` in the public Synapse CI as well. ### Reproduction instructions I can easily reproduce this problem if I use #827 to limit the number of CPU's available for the homeserver containers to use: `COMPLEMENT_CONTAINER_CPUS=0.5`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just a couple small things before it's ready to go.
config/config.go
Outdated
| // each iteration had a 50ms sleep between tries so the timeout is 50 * iteration ms | ||
| cfg.SpawnHSTimeout = time.Duration(50*parseEnvWithDefault("COMPLEMENT_VERSION_CHECK_ITERATIONS", 100)) * time.Millisecond | ||
| } | ||
| cfg.ContainerCPUCores, _ = strconv.ParseFloat(os.Getenv("COMPLEMENT_CONTAINER_CPUS"), 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be:
| cfg.ContainerCPUCores, _ = strconv.ParseFloat(os.Getenv("COMPLEMENT_CONTAINER_CPUS"), 64) | |
| cfg.ContainerCPUCores, _ = strconv.ParseFloat(os.Getenv("COMPLEMENT_CONTAINER_CPUS"), 0) |
so it is also unlimited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64 is the bitsize of the float:
https://pkg.go.dev/strconv#ParseFloat
// ParseFloat converts the string s to a floating-point number
// with the precision specified by bitSize: 32 for float32, or 64 for float64.
// When bitSize=32, the result still has type float64, but it will be
// convertible to float32 without changing its value.
| // Valid units are "B", (decimal: "KB", "MB", "GB, "TB, "PB"), (binary: "KiB", "MiB", | ||
| // "GiB", "TiB", "PiB") or no units (bytes). We also support "K", "M", "G" as per | ||
| // Docker's CLI. | ||
| func parseByteSizeString(inputString string) (int64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'll be nice dev UX to be this flexible.
cmd/gendoc/ENVIRONMENT.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add this to .gitignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 nahh, we can actually notice this way. I've added some example usage to the top of cmd/gendoc/main.go but ideally, the script should just do the right thing without all of the options and piping.
COMPLEMENT_CONTAINER_CPUS, COMPLEMENT_CONTAINER_MEMORY)COMPLEMENT_CONTAINER_CPU_CORES, COMPLEMENT_CONTAINER_MEMORY)
|
Thanks for the review @anoadragon453 🐇 |
Add environment variables to control container resource constraints (
COMPLEMENT_CONTAINER_CPU_CORES,COMPLEMENT_CONTAINER_MEMORY). This is useful to mimic a resource-constrained environment, like a CI environment.This is spawning from some consistent flaky tests I'm seeing when running the Complement test suite with some GitHub runners in a private project.
For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.
I'm now able to reproduce the same failures locally when I constrain the CPU to less than a single core
COMPLEMENT_CONTAINER_CPU_CORES=0.5Dev notes
Docker resource constraints: https://docs.docker.com/engine/containers/resource_constraints/
How to generate
ENVIRONMENT.md:go run ./cmd/gendoc --config config/config.go > ENVIRONMENT.mdPull Request Checklist