
Cache dag access and teams check to be reused in grid ti_summary API calls #61623

Open

tirkarthi wants to merge 1 commit into apache:main from tirkarthi:gh61485

Conversation

@tirkarthi
Contributor

@tirkarthi tirkarthi commented Feb 8, 2026

requires_access_dag is used in the ti_summary endpoint. Once this method returns a result, it can be cached and reused for other API calls, since access to a dag doesn't change per task instance. Similarly, for dags that change rarely, the serialized dag entry also remains the same.

This PR adds caching to airflow-core, which is independent of the fab provider. Hence the fab-related TTL cannot always be used, and this might need a new config option if the approach is accepted.
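
The shape of the idea, as a minimal sketch (not the PR's actual code: the function names, the cachetools dependency, and the 2-second TTL are illustrative assumptions). Repeated authorization checks for the same user/dag within a short window hit a small TTL cache instead of the database:

```python
from cachetools import TTLCache, cached

# Hypothetical helper, not Airflow's API: stands in for the DB-backed
# permission lookup done by the auth manager's is_authorized_dag.
def _lookup_dag_access(user_id: str, dag_id: str, method: str) -> bool:
    print(f"db lookup: {user_id} {method} {dag_id}")
    return True

# Short-lived cache: the ~10 ti_summaries calls a grid load fires within
# a second or two reuse one lookup, while the small TTL limits staleness.
_access_cache = TTLCache(maxsize=1024, ttl=2)

# The cache key is the (user_id, dag_id, method) argument tuple.
@cached(_access_cache)
def check_dag_access(user_id: str, dag_id: str, method: str) -> bool:
    return _lookup_dag_access(user_id, dag_id, method)

# Ten grid calls for the same user/dag trigger a single DB lookup.
for _ in range(10):
    check_dag_access("alice", "asset_produces_2", "GET")
```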

Benchmark command using hey with 10 concurrent requests, since the grid loads 10 dag runs by default:

hey -c 10 -H 'Cookie: _token=<token>' 'http://localhost:8000/ui/grid/ti_summaries/asset_produces_2/manual__2026-01-31T06:15:30.690694+00:00'

Main branch

Summary:
  Total:	1.0294 secs
  Slowest:	0.0801 secs
  Fastest:	0.0160 secs
  Average:	0.0460 secs
  Requests/sec:	194.2854
 
Latency distribution:
  10% in 0.0335 secs
  25% in 0.0385 secs
  50% in 0.0438 secs
  75% in 0.0537 secs
  90% in 0.0626 secs
  95% in 0.0667 secs
  99% in 0.0758 secs

Cache only is_authorized_dag in the fab auth manager, used in the requires_access_dag check

Summary:
  Total:	0.9620 secs
  Slowest:	0.0741 secs
  Fastest:	0.0163 secs
  Average:	0.0410 secs
  Requests/sec:	207.8980
  
Latency distribution:
  10% in 0.0295 secs
  25% in 0.0328 secs
  50% in 0.0403 secs
  75% in 0.0496 secs
  90% in 0.0537 secs
  95% in 0.0560 secs
  99% in 0.0630 secs

Cache get_team_name and is_authorized_dag in the fab auth manager, used in the requires_access_dag check

Summary:
  Total:	0.8245 secs
  Slowest:	0.0813 secs
  Fastest:	0.0115 secs
  Average:	0.0304 secs
  Requests/sec:	242.5708
  
Latency distribution:
  10% in 0.0204 secs
  25% in 0.0234 secs
  50% in 0.0262 secs
  75% in 0.0365 secs
  90% in 0.0497 secs
  95% in 0.0531 secs
  99% in 0.0628 secs

Cache get_team_name and is_authorized_dag in the fab auth manager (used in the requires_access_dag check), plus _get_serdag.

Summary:
  Total:	0.6000 secs
  Slowest:	0.0532 secs
  Fastest:	0.0100 secs
  Average:	0.0249 secs
  Requests/sec:	333.3144
  
Latency distribution:
  10% in 0.0176 secs
  25% in 0.0198 secs
  50% in 0.0240 secs
  75% in 0.0305 secs
  90% in 0.0341 secs
  95% in 0.0367 secs
  99% in 0.0402 secs

related: #61485

Was generative AI tooling used to co-author this PR?

No

requires_access_dag is used in the ti_summary endpoint. When this
method returns a result then this can be cached and reused for other
API calls since access to a dag doesn't change based on taskinstance.
@tirkarthi
Contributor Author

As per my understanding, there are only 5 connections in the pool by default. With concurrent requests, and with db calls being made synchronously both in requires_access_dag and in the endpoint, other requests might need to wait depending on max_overflow.

Prompt: If I have a connection pool of 5 in SQLAlchemy and create 10 sessions concurrently, do the last 5 sessions wait for the initial 5 sessions to be returned to the pool?

Gemini output:

The short answer is: not necessarily. It depends entirely on your max_overflow setting. By default, 
SQLAlchemy doesn't just make those extra 5 sessions sit in a queue; it often creates "overflow" connections 
to handle the spike.

Here is the breakdown of how SQLAlchemy manages this scenario:
1. The pool_size vs. max_overflow

When you define a pool, SQLAlchemy uses two primary parameters to determine how to handle concurrent requests:

    pool_size (5 in your case): The number of "permanent" connections kept open in the pool.

    max_overflow (Default is 10): The number of additional connections the pool is allowed to create if the pool_size is exhausted.

2. What happens to your 10 sessions?

If you are using the default settings (pool_size=5, max_overflow=10):

    Sessions 1–5: Grab the 5 "checked-in" connections from the pool immediately.

    Sessions 6–10: Since the pool is full, the engine looks at max_overflow. Because 5 is less than the default limit of 10, it creates 5 new temporary connections.

    Result: All 10 sessions run concurrently. No one waits.

3. When do sessions actually wait?

The "waiting" behavior only kicks in if you exceed pool_size + max_overflow.

If you set max_overflow=0 or if you tried to open 16 sessions with default settings:

    The 16th session would hit the limit.

    It would wait for a connection to be returned to the pool.

    This wait is governed by pool_timeout (default is 30 seconds). If no connection becomes free in that time, SQLAlchemy throws a TimeoutError.
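
A small self-contained script exercising the behaviour described above (not from the PR; the SQLite file, sleep, and pool settings are illustrative, and the default QueuePool here assumes SQLAlchemy 1.4+):

```python
import time
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, text

# File-based SQLite so SQLAlchemy uses its default QueuePool;
# check_same_thread=False lets pooled connections move between threads.
engine = create_engine(
    "sqlite:///pool_demo.db",
    pool_size=5,       # permanent connections kept in the pool
    max_overflow=10,   # extra temporary connections allowed under load
    pool_timeout=30,   # seconds to wait before raising TimeoutError
    connect_args={"check_same_thread": False},
)

def run_query(i: int) -> None:
    # Hold the connection briefly so all 10 checkouts overlap.
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
        time.sleep(0.1)

# 10 concurrent sessions: 5 use pooled connections, 5 use overflow,
# so none of them block on pool_timeout.
with ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(run_query, range(10)))

print(engine.pool.status())  # reports pool size, overflow and checked-out counts
```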

Member

@pierrejeambrun pierrejeambrun left a comment


We should also think about the cache invalidation cycle.

For instance, if I update a user's permissions, the is_authorized_dag cache isn't invalidated, so the user won't get the new access until the cache expires (I don't know when that is in the current implementation).

Also, different workers can have different versions of the cache in this specific case: one worker might grant access while another blocks it. I think this is a problem.

@tirkarthi
Copy link
Contributor Author

Thanks @pierrejeambrun. I updated the signature used for the caching key and still see permission-related failures, which I assume are due to cache invalidation. The primary intention of the PR was the recurring dag access checks that happen within the 1-2 seconds it takes the grid to load. I will see if I can come up with an approach, or use a shorter TTL of around 1-2 seconds to cover the grid loading cycle.
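
To make the trade-off concrete, a toy illustration (not the PR's code; the fake clock and in-memory permission table are contrivances for determinism) of how a short TTL bounds the staleness window raised in the review:

```python
from cachetools import TTLCache

# Fake clock so expiry is deterministic in this example.
now = [0.0]
cache = TTLCache(maxsize=128, ttl=2, timer=lambda: now[0])

# Stand-in for the permission store.
permissions = {("alice", "asset_produces_2"): True}

def can_read(user: str, dag_id: str) -> bool:
    key = (user, dag_id)
    if key not in cache:                            # expired or never looked up
        cache[key] = permissions.get(key, False)    # the "expensive" lookup
    return cache[key]

assert can_read("alice", "asset_produces_2")        # cached on first call
permissions[("alice", "asset_produces_2")] = False  # admin revokes access
assert can_read("alice", "asset_produces_2")        # stale: still True
now[0] += 2.1                                       # ...until the TTL passes
assert can_read("alice", "asset_produces_2") is False
```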
