Conversation

@k-anshul (Member) commented Dec 19, 2025

closes https://linear.app/rilldata/issue/PLAT-324/switch-to-pull-instead-of-push-for-config-changes-from-admin-to

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@k-anshul k-anshul self-assigned this Dec 19, 2025
@k-anshul k-anshul marked this pull request as draft December 19, 2025 10:49
@k-anshul k-anshul changed the title from "Pull configs" to "Pull based config changes from admin to runtime" Dec 19, 2025
@k-anshul k-anshul marked this pull request as ready for review December 23, 2025 06:40
k-anshul and others added 4 commits December 29, 2025 17:38
Co-authored-by: Benjamin Egelund-Müller <b@egelund-muller.com>
return nil, status.Error(codes.InvalidArgument, err.Error())
}

err = s.runtime.ReloadConfig(ctx, req.InstanceId)
@k-anshul (Member Author):

The reload happens after an instance has been started; if this call fails, the controller will be running without annotations or frontendurl set (assuming vars cannot be set at this point anyway).

@begelundmuller (Contributor):

This feels problematic, as it can cause an initial flash of failed reconciles after the first deploy, which may be quite confusing when looking at logs/telemetry or the status page (where I believe old errors are retained until the next successful reconcile).

Can you see a simple way to pull the config before creating the instance here? Otherwise, is there a simple way to pull the config before starting the controller for the first time? This would not block runtime server start since controllers are started in the background, and if we persist and check updated_on, it might only need to be done right after creation.

Otherwise we can also just revert to the old way of sending it manually in the CreateInstance call.
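Roughly what I have in mind for the second option, as a minimal sketch. All names here (AdminClient, GetInstanceConfig, ConfigUpdatedOn, apply) are hypothetical stand-ins, not our actual types or RPCs:

```go
package sketch

import (
	"context"
	"time"
)

// Hypothetical shapes for illustration only; not Rill's actual types or RPCs.
type InstanceConfig struct {
	Annotations map[string]string
	FrontendURL string
	Variables   map[string]string
	UpdatedOn   time.Time
}

type AdminClient interface {
	GetInstanceConfig(ctx context.Context, instanceID string) (*InstanceConfig, error)
}

type Instance struct {
	ID              string
	ConfigUpdatedOn time.Time
}

// ensureFreshConfig pulls config before the instance's controller starts, but
// only when the persisted updated_on is older than the admin service's copy,
// so existing instances don't pay the cost on every runtime start.
func ensureFreshConfig(ctx context.Context, admin AdminClient, inst *Instance, apply func(*InstanceConfig) error) error {
	cfg, err := admin.GetInstanceConfig(ctx, inst.ID)
	if err != nil {
		return err
	}
	if !cfg.UpdatedOn.After(inst.ConfigUpdatedOn) {
		return nil // already current, nothing to do
	}
	if err := apply(cfg); err != nil {
		return err
	}
	inst.ConfigUpdatedOn = cfg.UpdatedOn
	return nil
}
```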

@k-anshul (Member Author):

Can missing annotations or frontendurl lead to reconcile failures?

I think pulling the config before adding the instance will require many changes. We can take one of the following options:

  1. Revert to the old way of sending it manually in the CreateInstance call (most preferable).
  2. Retry config reloads (rough sketch below).
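For option 2, something like this bounded retry (a sketch only; reload here stands in for the actual ReloadConfig call):

```go
package sketch

import (
	"context"
	"time"
)

// retryReload retries a failed config reload a few times with exponential
// backoff, so a transient admin-service error right after instance creation
// does not leave the controller running without annotations or frontend URL.
func retryReload(ctx context.Context, reload func(context.Context) error) error {
	backoff := time.Second
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = reload(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return err
}
```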

return nil, err
}

rt.configReloader = newConfigReloader(rt)
@k-anshul (Member Author), Dec 30, 2025:

This can lead to the controller restarting twice for an instance that missed its variable updates.
This is difficult to fix if we want config reloads to not block runtime starts.

@begelundmuller (Contributor):

I think that's okay – it would be quite an edge case, right?

@k-anshul (Member Author):

Yeah should be an edge case.

Comment on lines +147 to +150
// do not trigger reconcile for stopped deployments
if d.Status == database.DeploymentStatusDeleting || d.Status == database.DeploymentStatusStopped || d.Status == database.DeploymentStatusStopping {
	continue
}
@begelundmuller (Contributor):

Not sure this is safe – what if it's running at this point, but is changed to deleting afterwards (but before the call to UpdateDeployment)?

Isn't the real problem just that UpdateDeployment changes the desired status? It seems like it shouldn't do that (since only start/stop/redeploy RPCs should change the provisioned status).

@k-anshul (Member Author), Dec 30, 2025:

Isn't the real problem just that UpdateDeployment changes the desired status? It seems like it shouldn't do that

Yeah, true. Basically all updates to a project are handled in the same way, whereas only some, like slots and version changes, require a redeployment and thus a desired status change.

Do you think we should revisit my earlier solution, i.e. directly trigger config reloads on all deployments when vars or annotations change? We could also differentiate between project updates, e.g. a slots change triggers a reconcile but a branch update does not.
We don't need to trigger the full reconcile deployment flow for just a variable change.
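Roughly what I had in mind earlier, with placeholder types (RuntimeClient and Deployment here are illustrative, not our actual admin code):

```go
package sketch

import "context"

// Placeholder shapes for illustration only.
type Deployment struct {
	RuntimeInstanceID string
	Running           bool
}

type RuntimeClient interface {
	ReloadConfig(ctx context.Context, instanceID string) error
}

// reloadDeploymentConfigs asks each running deployment's runtime to re-pull
// its config after a variable/annotation change, without going through the
// full reconcile-deployment flow.
func reloadDeploymentConfigs(ctx context.Context, rt RuntimeClient, depls []*Deployment) error {
	for _, d := range depls {
		if !d.Running {
			continue
		}
		if err := rt.ReloadConfig(ctx, d.RuntimeInstanceID); err != nil {
			return err
		}
	}
	return nil
}
```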

@begelundmuller (Contributor), Dec 30, 2025:

only some like slots and version change require a redeployment and thus require a desired status change.

I don't think slots or version changes require a redeployment or status change? Or am I missing something? They just require the reconcile job to run to propagate the changes.

Can this issue be fixed just by removing the change of the desired status in UpdateDeployment? Or if that's tricky, since starts/stops are rare, maybe using depl.DesiredStatus instead of database.DeploymentStatusRunning might also be okay.

We don't need to trigger the full reconcile deployment flow for just a variable change.

It's pretty cheap to run – any specific concerns about running the job? The idea with a "reconcile" job is that it provides a single point for config changes to a deployment (which gives nice guarantees about config changes not being made in parallel).
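To illustrate the depl.DesiredStatus idea, a minimal sketch with stand-in types (not the actual database package):

```go
package sketch

type DeploymentStatus int

const (
	DeploymentStatusPending DeploymentStatus = iota
	DeploymentStatusRunning
	DeploymentStatusStopping
	DeploymentStatusStopped
	DeploymentStatusDeleting
)

type Deployment struct {
	Status        DeploymentStatus // current status, can change concurrently
	DesiredStatus DeploymentStatus // only changed by start/stop/redeploy RPCs
}

// shouldReconcile skips deployments that are not desired to be running, which
// avoids racing on the current status between this check and UpdateDeployment.
func shouldReconcile(d *Deployment) bool {
	return d.DesiredStatus == DeploymentStatusRunning
}
```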

}

func (r *Runtime) ReloadConfig(ctx context.Context, instanceID string) error {
	return r.configReloader.reloadConfig(ctx, instanceID)
@begelundmuller (Contributor):

  1. Need to check if r.configReloader != nil here.
  2. Also, shouldn't it run in the background? Since the admin service may have to call this on many runtimes at a time, and since the runtime is already capable of pulling config in the background, it'd be nice for this to just be a quick op (rough sketch below).
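Something along these lines (a sketch with placeholder types, not the actual runtime code):

```go
package sketch

import "context"

type configReloader struct{}

func (c *configReloader) reloadConfig(ctx context.Context, instanceID string) error {
	// ... pull the instance's config from the admin service and apply it ...
	return nil
}

type Runtime struct {
	configReloader *configReloader
}

// ReloadConfig returns quickly: it is a no-op when the reloader is not
// configured, and otherwise kicks the reload off in the background so a single
// admin call fanning out to many runtimes stays cheap.
func (r *Runtime) ReloadConfig(instanceID string) {
	if r.configReloader == nil {
		return
	}
	go func() {
		// use a background context so the reload survives the admin request;
		// errors would be logged here in a real implementation
		_ = r.configReloader.reloadConfig(context.Background(), instanceID)
	}()
}
```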

// 1. via rt.ReloadConfig whenever admin wants runtime to reload its configs
// 2. whenever runtime is started to pick up any configs changed while it was down
// 3. periodically every hour to pick up any changes done outside of runtime. This just adds extra resilience
type configReloader struct {
@begelundmuller (Contributor):

Maybe move to a dedicated file runtime/config_reloader.go?
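Also, for the periodic trigger (point 3 in the quoted doc comment), I assume it's roughly a loop like this (illustrative only, not the actual implementation):

```go
package sketch

import (
	"context"
	"time"
)

// runPeriodicReload is the hourly resilience trigger: it re-pulls config for
// all instances so changes made while a targeted reload was missed still get
// picked up eventually.
func runPeriodicReload(ctx context.Context, instanceIDs func() []string, reload func(context.Context, string) error) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, id := range instanceIDs() {
				// errors are non-fatal here; the next tick (or an explicit
				// reload request) will retry
				_ = reload(ctx, id)
			}
		}
	}
}
```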

Comment on lines +277 to +278
func (r *configReloader) reloadConfig(ctx context.Context, instanceID string) error {
	_, err := r.group.Do(ctx, instanceID, func(ctx context.Context) (string, error) {
@begelundmuller (Contributor):

This singleflight deduplicates calls, so I don't think it's safe. For example, if I add two variables in rapid succession, the second call may get de-duplicated and only the first new variable will be pulled.

I think this use case requires a non-deduplicating singleflight (basically a keyed mutex). Realistically, given there usually aren't many instances, it's probably sufficient to just use ctxsync.RWMutex.
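For example, a minimal per-instance keyed mutex that stays responsive to ctx cancellation (a sketch of the idea, not ctxsync's actual API):

```go
package sketch

import (
	"context"
	"sync"
)

// keyedMutex serializes reloads per instance without deduplicating them, so a
// reload requested while another is in flight still runs afterwards.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]chan struct{}
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locks: map[string]chan struct{}{}}
}

func (k *keyedMutex) lock(ctx context.Context, key string) error {
	for {
		k.mu.Lock()
		ch, held := k.locks[key]
		if !held {
			k.locks[key] = make(chan struct{})
			k.mu.Unlock()
			return nil
		}
		k.mu.Unlock()
		// wait for the current holder to release, staying responsive to ctx
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ch:
		}
	}
}

// unlock must only be called by the goroutine that currently holds the key.
func (k *keyedMutex) unlock(key string) {
	k.mu.Lock()
	ch := k.locks[key]
	delete(k.locks, key)
	k.mu.Unlock()
	close(ch)
}
```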

@k-anshul (Member Author):

Good catch. I totally overlooked it.

Realistically, given there usually aren't many instances, it's probably sufficient to just use ctxsync.RWMutex

Do we really need an RWMutex, or is a simple mutex enough? If it's for the cancellation feature then we can use it – just clarifying whether we need read locks here.

@begelundmuller (Contributor), Dec 30, 2025:

I just suggested that for the cancellation, yes – isn't it considered bad practice to use normal mutexes for long-running operations (since callers become unresponsive to ctx cancellation)?
