Better assert when `TestDelayedEvents/delayed_state_events_are_kept_on_server_restart` takes too long to run for our assumptions to be correct
#829
Conversation
| "Test took too long to run, cannot guarantee delayed event timings. " + | ||
| "More than 10 seconds elapsed between scheduling the delayed event and now when we're about to check for it.", | ||
| ) | ||
| } |
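(Not the actual diff from this PR, just a minimal Go sketch of the kind of guard those message strings belong to; the helper name `assertScheduledRecently`, the `scheduledAt` parameter, and the use of `t.Fatal` are assumptions for illustration.)

```go
import (
	"testing"
	"time"
)

// Hypothetical sketch (not the PR's exact code): bail out if too much
// wall-clock time has passed since the delayed event was scheduled, because
// the 10-second timing assumption no longer holds.
func assertScheduledRecently(t *testing.T, scheduledAt time.Time) {
	t.Helper()
	if time.Since(scheduledAt) > 10*time.Second {
		t.Fatal(
			"Test took too long to run, cannot guarantee delayed event timings. " +
				"More than 10 seconds elapsed between scheduling the delayed event and now when we're about to check for it.",
		)
	}
}
```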
Actually fixing this and making this kind of test consistent is tricky.
If we just increase the delay time, then the test will just take longer to run even in the fast scenarios. Ideally, the second delayed event should be delayed by at least `COMPLEMENT_SPAWN_HS_TIMEOUT_SECS` (defaults to 30 seconds) since we're stopping and starting servers here 🤔 - although we use `COMPLEMENT_SPAWN_HS_TIMEOUT_SECS=120` with the Synapse workers variant and two minutes sounds painful to wait for.
We could set up a series of delayed events at 10-second intervals up to `COMPLEMENT_SPAWN_HS_TIMEOUT_SECS`, then just look for a decrement in the pending count after the server restart (roughly sketched below). This keeps the test as fast as before while being lenient about however long it takes to restart.
I'm going to address this in a follow-up PR.
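A rough Go sketch of the ladder idea described above, not the code that landed in #830; `scheduleDelayedEvent`, `countPendingDelayedEvents`, and `restartHomeserver` are hypothetical helpers standing in for what the real Complement test does.

```go
import (
	"testing"
	"time"
)

// waitForDelayedEventDecrement sketches the "ladder" approach: schedule many
// delayed events at short intervals, restart the homeserver, then only assert
// that the pending count eventually drops, however long the restart took.
func waitForDelayedEventDecrement(t *testing.T, spawnHSTimeout time.Duration) {
	t.Helper()

	// Schedule one delayed event every 10 seconds, enough to cover the whole
	// window in which the homeserver might still be restarting.
	for delay := 10 * time.Second; delay <= spawnHSTimeout+10*time.Second; delay += 10 * time.Second {
		scheduleDelayedEvent(t, delay) // hypothetical helper
	}
	pendingBefore := countPendingDelayedEvents(t) // hypothetical helper

	restartHomeserver(t) // hypothetical helper; blocks until the server is back

	// The events are spaced 10 seconds apart, so once the server is back the
	// pending count should drop within roughly that interval; allow some slack.
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		if countPendingDelayedEvents(t) < pendingBefore {
			return // a delayed event was sent after the restart, as expected
		}
		time.Sleep(time.Second)
	}
	t.Fatal("pending delayed-event count never decremented after the homeserver restart")
}
```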
Updated in #830
Co-authored-by: Devon Hudson <devon.dmytro@gmail.com>
Thanks for the review @devonh 🦘
Fix `TestDelayedEvents/delayed_state_events_are_kept_on_server_restart` to account for slow servers (de-flake) (#830)

Follow-up to #829

Part of element-hq/synapse#18537

### What does this PR do?

Fix `TestDelayedEvents/delayed_state_events_are_kept_on_server_restart` to account for slow servers (de-flake).

Previously, this was naively using a single delayed event with a 10 second delay. But because we're stopping and starting servers here, it could take up to `deployment.GetConfig().SpawnHSTimeout` (defaults to 30 seconds) for the server to start up again, so by the time the server is back up, the delayed event may have already been sent, invalidating our assertions below (which expect some delayed events to still be pending and then see one of them be sent after the server is back up).

We could account for this by setting the delayed event delay to be longer than `deployment.GetConfig().SpawnHSTimeout`, but that would make the test suite take longer to run in all cases, even for homeservers that are quick to restart, because we have to wait for that large delay.

We instead account for this by scheduling many delayed events at short intervals (we chose 10 seconds because that's what the test naively chose before). Then whenever the server comes back, we can just check until the pending count decrements by 1.

### Experiencing the flaky failure

As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, element-hq/synapse-rust-apps#344 (comment). We probably experience this heavily in the private project because GitHub runners are less than half as powerful as those for public projects, and that single container with a share of the 2 CPU cores available is just not powerful enough to run all 16 workers effectively.

<details>
<summary>For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.</summary>

> #### Standard GitHub-hosted runners for public repositories
>
> Virtual machine | Processor (CPU) | Memory (RAM) | Storage (SSD) | Architecture | Workflow label
> --- | --- | --- | --- | --- | ---
> Linux | 4 | 16 GB | 14 GB | x64 | ubuntu-latest, ubuntu-24.04, ubuntu-22.04
>
> *-- [Standard GitHub-hosted runners for public repositories](https://docs.github.com/en/actions/reference/runners/github-hosted-runners#standard-github-hosted-runners-for-public-repositories)*

---

> #### Standard GitHub-hosted runners for private repositories
>
> Virtual Machine | Processor (CPU) | Memory (RAM) | Storage (SSD) | Architecture | Workflow label
> --- | --- | --- | --- | --- | ---
> Linux | 2 | 7 GB | 14 GB | x64 | ubuntu-latest, ubuntu-24.04, ubuntu-22.04
>
> *-- [Standard GitHub-hosted runners for private repositories](https://docs.github.com/en/actions/reference/runners/github-hosted-runners#standard-github-hosted-runners-for-public-repositories)*

</details>

And for the same slow reasons, we're also experiencing this as an occasional [flake](element-hq/synapse#18537) with `(workers, postgres)` in the public Synapse CI as well.

### Reproduction instructions

I can easily reproduce this problem if I use #827 to limit the number of CPUs available for the homeserver containers to use: `COMPLEMENT_CONTAINER_CPUS=0.5`
Better assert when `TestDelayedEvents/delayed_state_events_are_kept_on_server_restart` takes too long to run for our assumptions to be correct.

As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, https://github.com/element-hq/synapse-rust-apps/pull/344#discussion_r2620714306. We probably experience this heavily in the private project because GitHub runners are less than half as powerful as those for public projects, and that single container with a share of the 2 CPU cores available is just not powerful enough to run all 16 workers effectively.

For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.

And for the same slow reasons, we're also experiencing this as an occasional flake with `(workers, postgres)` in the public Synapse CI as well.

### Reproduction instructions

I can easily reproduce this problem if I use #827 to limit the number of CPUs available for the homeserver containers to use: `COMPLEMENT_CONTAINER_CPUS=0.5`

### Pull Request Checklist