Pull based config changes from admin to runtime #8548
base: main
Conversation
Co-authored-by: Benjamin Egelund-Müller <b@egelund-muller.com>
        return nil, status.Error(codes.InvalidArgument, err.Error())
    }

    err = s.runtime.ReloadConfig(ctx, req.InstanceId)
The reload happens after an instance has been started; if this call fails, the controller will be running without annotations or frontendurl set (assuming vars can't be set at that point anyway).
This feels problematic as it can give an initial flash of failed reconciles after the first deploy, which may be quite confusing when looking at logs/telemetry or the status page (where I believe old errors are retained until after the next successful reconcile).
Can you see a simple way to pull the config before creating the instance here? Otherwise, is there a simple way to pull config before starting the controller the first time? (This would not block runtime server start since controllers are started in the background, and if we persist and check updated_on, it might also only need to be done after creation.)
Otherwise we can also just revert to the old way of sending it manually in the CreateInstance call.
Can missing annotations or frontendurl lead to reconcile failures?
I think pulling the config before adding the instance will require many changes. We can take one of the following options:
- Revert to the old way of sending it manually in the CreateInstance call (most preferable).
- Retry config reloads (a rough sketch follows below).
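A minimal sketch of what the retry option might look like, assuming a hypothetical `reloadWithRetry` helper wrapped around the existing `s.runtime.ReloadConfig(ctx, req.InstanceId)` call (the attempt count and backoff are illustrative, not part of the PR):

```go
package server // hypothetical placement next to the instance-creation handler

import (
	"context"
	"time"
)

// reloadWithRetry retries a config reload a few times so that a transient
// failure right after instance creation doesn't leave the controller running
// without annotations or frontendurl. Sketch only, not part of the PR.
func reloadWithRetry(ctx context.Context, reload func(context.Context) error) error {
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		if err = reload(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(attempt) * time.Second): // simple linear backoff
		}
	}
	return err
}
```

The handler would then call `reloadWithRetry(ctx, func(ctx context.Context) error { return s.runtime.ReloadConfig(ctx, req.InstanceId) })` in place of the direct call shown above.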
runtime/runtime.go
Outdated
        return nil, err
    }

    rt.configReloader = newConfigReloader(rt)
This can lead to the controller restarting twice for an instance that missed its variable updates.
This is difficult to fix if we want config reloads to not block runtime starts.
I think that's okay – it would be quite an edge case, right?
Yeah should be an edge case.
    // do not trigger reconcile for stopped deployments
    if d.Status == database.DeploymentStatusDeleting || d.Status == database.DeploymentStatusStopped || d.Status == database.DeploymentStatusStopping {
        continue
    }
Not sure this is safe – what if it's running at this point, but is changed to deleting afterwards (but before the call to UpdateDeployment)?
Isn't the real problem just that UpdateDeployment changes the desired status? It seems like it shouldn't do that (since only start/stop/redeploy RPCs should change the provisioned status).
Isn't the real problem just that UpdateDeployment changes the desired status? It seems like it shouldn't do that
Yeah, true. Basically all updates to a project are handled in the same way, whereas only some, like slots and version changes, require a redeployment and thus a desired status change.
Do you think we should revisit my earlier solution, i.e. just directly trigger config reloads on all deployments if vars/annotations change? We could also differentiate between project updates, i.e. a slots change triggers a reconcile but a branch update does not.
We don't need to trigger the full reconcile deployment flow for just a variable change.
only some, like slots and version changes, require a redeployment and thus a desired status change.
I don't think slots or version changes require a redeployment or status change? Or am I missing something? They just require the reconcile job to run to propagate the changes.
Can this issue be fixed just by removing the change of the desired status in UpdateDeployment? Or if that's tricky, since starts/stops are rare, maybe using depl.DesiredStatus instead of database.DeploymentStatusRunning might also be okay.
We don't need to trigger the full reconcile deployment flow for just a variable change.
It's pretty cheap to run – any specific concerns about running the job? The idea with a "reconcile" job is that it provides a single point for config changes to a deployment (which gives nice guarantees about config changes not being made in parallel).
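To make the `depl.DesiredStatus` fallback concrete, a rough self-contained sketch (all type and function names here are stand-ins for illustration, not the actual admin database API):

```go
package sketch

// Stand-ins for the admin database types, for illustration only.
type DeploymentStatus string

const DeploymentStatusRunning DeploymentStatus = "running"

type Deployment struct {
	ID            string
	DesiredStatus DeploymentStatus
}

// desiredStatusForUpdate sketches the fallback suggested above: a plain project
// update persists the deployment's current desired status unchanged, so only
// the start/stop/redeploy RPCs can flip a deployment to or from running.
func desiredStatusForUpdate(d *Deployment) DeploymentStatus {
	return d.DesiredStatus // was effectively hard-coded to DeploymentStatusRunning
}
```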
    }

    func (r *Runtime) ReloadConfig(ctx context.Context, instanceID string) error {
        return r.configReloader.reloadConfig(ctx, instanceID)
- Need to check `if r.configReloader != nil`
- Also, shouldn't it run in the background? Since the admin service may have to call this on many runtimes at a time, and since the runtime is already capable of pulling config in the background, it'd be nice for it to just be a quick op.
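For illustration, a sketch combining both points – the nil guard and a background pull – built on the method shown in the diff (the timeout, the detached context, and the use of `r.Logger` are my assumptions, not part of the PR; imports of `context`, `time`, and `zap` are assumed):

```go
// ReloadConfig asks the config reloader to re-pull this instance's config.
// Sketch only: guard against a nil reloader and keep the admin-facing call
// cheap by doing the actual pull in the background.
func (r *Runtime) ReloadConfig(_ context.Context, instanceID string) error {
	if r.configReloader == nil {
		return nil
	}
	go func() {
		// Detach from the caller's context so the pull outlives the RPC,
		// but bound it so a stuck pull can't leak the goroutine forever.
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if err := r.configReloader.reloadConfig(ctx, instanceID); err != nil {
			r.Logger.Error("config reload failed", zap.String("instance_id", instanceID), zap.Error(err))
		}
	}()
	return nil
}
```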
    // 1. via rt.ReloadConfig whenever admin wants runtime to reload its configs
    // 2. whenever runtime is started to pick up any configs changed while it was down
    // 3. periodically every hour to pick up any changes done outside of runtime. This just adds extra resilience
    type configReloader struct {
Maybe move to a dedicated file runtime/config_reloader.go?
    func (r *configReloader) reloadConfig(ctx context.Context, instanceID string) error {
        _, err := r.group.Do(ctx, instanceID, func(ctx context.Context) (string, error) {
This singleflight deduplicates calls, so I don't think it's safe. For example, if I add two variables rapidly after each other, the second call may get de-duplicated and only the first new variable will be pulled.
I think this use case requires a non-deduplicating singleflight (basically a keyed mutex). Realistically, given there usually aren't many instances, it's probably sufficient to just use ctxsync.RWMutex.
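For reference, a non-deduplicating keyed lock can be quite small. This is a generic, self-contained sketch (not the ctxsync API): a one-slot channel per instance ID acts as a context-aware mutex, so concurrent reloads for the same instance serialize instead of being deduplicated:

```go
package sketch

import (
	"context"
	"sync"
)

// keyedMutex gives each key (e.g. an instance ID) its own context-aware lock.
// Entries are never removed, which is fine for a small, stable set of instances.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]chan struct{}
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locks: map[string]chan struct{}{}}
}

// lock blocks until the key's lock is acquired or the context is cancelled.
func (k *keyedMutex) lock(ctx context.Context, key string) error {
	k.mu.Lock()
	ch, ok := k.locks[key]
	if !ok {
		ch = make(chan struct{}, 1)
		k.locks[key] = ch
	}
	k.mu.Unlock()
	select {
	case ch <- struct{}{}: // acquired
		return nil
	case <-ctx.Done(): // caller gave up waiting
		return ctx.Err()
	}
}

// unlock releases the key's lock; it must only be called after a successful lock.
func (k *keyedMutex) unlock(key string) {
	k.mu.Lock()
	ch := k.locks[key]
	k.mu.Unlock()
	<-ch
}
```

With something like this, each `reloadConfig` call would `lock(ctx, instanceID)`, pull the latest config, and `unlock`, so two variables added in quick succession both result in a pull instead of the second one being deduplicated away.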
Good catch. I totally overlooked it.
Realistically, given there usually aren't many instances, it's probably sufficient to just use ctxsync.RWMutex
Do we really need an RWMutex, or is a simple mutex enough? If it's for the cancellation feature then we can use it, but just clarifying whether we need read locks here.
I just suggested that for the cancellation, yes – isn't it considered bad practice to use normal mutexes for long-running operations (since callers become unresponsive to ctx cancellations)?
closes https://linear.app/rilldata/issue/PLAT-324/switch-to-pull-instead-of-push-for-config-changes-from-admin-to
Checklist: