25.8 Antalya backport of #90825: Add role-based access to Glue catalog#1428

Open
zvonand wants to merge 7 commits into antalya-25.8 from backports/antalya-25.8/90825
Conversation


@zvonand zvonand commented Feb 18, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add role-based access to Glue catalog. Use settings aws_role_arn and, optionally, aws_role_session_name. (ClickHouse#90825 by @antonio2368)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)


github-actions bot commented Feb 18, 2026

Workflow [PR], commit [d81516c]

zvonand force-pushed the backports/antalya-25.8/90825 branch from cef9232 to fba0893 on February 20, 2026 at 09:54
zvonand force-pushed the backports/antalya-25.8/90825 branch from fba0893 to c96e9ea on February 20, 2026 at 10:05

Selfeer commented Feb 20, 2026

Integration Test Failure

All 12 test_database_glue tests fail during fixture setup, before any test logic runs. The 1 other failure (test_storage_s3_queue::test_list_and_delete_race) is unrelated and flaky.

Root Cause

The new run_s3_mocks() function added in this PR passes a 4-tuple to start_mock_servers():

start_mock_servers(
    started_cluster, script_dir,
    [("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", args)],  # 4 elements
)

But helpers/mock_servers.py on the antalya-25.8 branch only supports 3-tuples:

for server_name, container, port in mocks:  # ValueError: too many values to unpack

The upstream repo has an updated mock_servers.py that accepts the 4th args element, but that change was not included in the backport.

Traceback

test_database_glue/test.py:280 → run_s3_mocks(cluster)
test_database_glue/test.py:39  → start_mock_servers(...)
helpers/mock_servers.py:23     → ValueError: too many values to unpack (expected 3)
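The unpacking mismatch above suggests a backward-compatible shim on the branch: accept both the old 3-tuples and the new 4-tuples. The sketch below only mirrors the tuple shapes quoted in this comment; it is an assumption, not the actual upstream `mock_servers.py` fix:

```python
def normalize_mock_spec(mock):
    """Accept both (script, container, port) and (script, container, port, args)
    tuple shapes, defaulting the extra args to an empty list.

    Sketch only: names mirror the snippets quoted above, not the upstream helper.
    """
    if len(mock) == 3:
        server_name, container, port = mock
        args = []
    else:
        server_name, container, port, args = mock
    return server_name, container, port, args


# A start_mock_servers() loop could then unpack uniformly:
# for server_name, container, port, args in map(normalize_mock_spec, mocks):
#     ...launch the mock script inside the container, passing `args`...
```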

Impact

This is directly caused by this PR. Since the fixture crashes, all 12 test_database_glue tests (both existing and the new test_sts_smoke) fail without executing.


zvonand commented Feb 20, 2026

@Selfeer I saw that, I'll fix it.

The remaining problem is the regression failures.


Selfeer commented Feb 20, 2026

Regression aarch64 swarms - Failure Analysis

PR: #1428 | Workflow: 22222536503 | Version: 25.8.16.20001.altinityantalya | Arch: aarch64

Reason: Not related to this specific PR

Report: report.html

Summary

The Regression aarch64 swarms job failed with 6 failed scenarios out of 20 (14 OK). All 6 failures are in the node failure feature.

Failed Scenarios & Historical Flakiness

| Scenario | Pass Rate | Fail | OK | Flaky? |
| --- | --- | --- | --- | --- |
| node failure/network failure | 79.9% | 80 | 318 | Yes (highest flakiness) |
| node failure/check restart swarm node | 80.2% | 77 | 319 | Yes |
| node failure/check restart clickhouse on swarm node | 85.9% | 52 | 342 | Yes |
| node failure/swarm out of disk space | 91.7% | 30 | 363 | Yes |
| node failure/cpu overload | 91.9% | 29 | 364 | Yes |
| node failure/initiator out of disk space | 95.7% | 14 | 379 | Yes (least flaky) |

Root Cause Analysis

All 6 failures originate from swarms/tests/node_failure.py and share a common error pattern:

  1. Primary failure (4 scenarios): Test expects DB::Exception: Query was cancelled. but receives DB::Exception: Query is killed in pending state. (QUERY_WAS_CANCELLED) — an assertion message mismatch where the error code 394 is correct but the message text differs.

  2. Secondary failure (2 scenarios — disk space): check_preprocessed_config_is_updated() assertion fails (exitcode == 1 instead of 0), and subsequent query gets ALL_CONNECTION_TRIES_FAILED (Code: 279) — a timing/infrastructure issue where ClickHouse config reload doesn't complete in time.
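Since the error code 394 is stable while the message text differs, one way to make such test assertions robust is to key on the code rather than the exact message. A minimal sketch (the helper name is hypothetical, and it assumes the framework exposes the raw error string):

```python
import re


def is_query_cancelled(error_text: str) -> bool:
    """Treat any QUERY_WAS_CANCELLED (code 394) error as a cancellation,
    whether the message reads 'Query was cancelled.' or
    'Query is killed in pending state.'.

    Hypothetical helper, not part of the existing test suite."""
    return bool(re.search(r"Code:\s*394\b|QUERY_WAS_CANCELLED", error_text))
```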

Conclusion

  • Not a regression. These failures have a long history across multiple PRs and versions (25.8.14, 25.8.16).

Appendix: Queries Used

All queries below were run against the gh-data database at github-checks.tenant-a.staging.altinity.cloud:8443.

1. Historical pass/fail ratios per scenario (all-time)

This query computes the historical pass/fail percentage for each of the failed scenarios across all recorded runs.

SELECT
    test_name,
    result,
    count() AS cnt,
    round(count() * 100.0 / sum(count()) OVER (PARTITION BY test_name), 1) AS pct
FROM `gh-data`.clickhouse_regression_results
WHERE test_name IN (
    '/swarms/feature/node failure/check restart clickhouse on swarm node',
    '/swarms/feature/node failure/check restart swarm node',
    '/swarms/feature/node failure/cpu overload',
    '/swarms/feature/node failure/initiator out of disk space',
    '/swarms/feature/node failure/network failure',
    '/swarms/feature/node failure/swarm out of disk space',
    '/swarms/feature/swarm joins',
    '/swarms/feature/swarm union'
)
GROUP BY test_name, result
ORDER BY test_name, result;

Result:

| test_name | result | cnt | pct |
| --- | --- | --- | --- |
| node failure/check restart clickhouse on swarm node | Error | 2 | 0.5% |
| node failure/check restart clickhouse on swarm node | Fail | 52 | 13.1% |
| node failure/check restart clickhouse on swarm node | OK | 342 | 85.9% |
| node failure/check restart clickhouse on swarm node | Skip | 2 | 0.5% |
| node failure/check restart swarm node | Fail | 77 | 19.3% |
| node failure/check restart swarm node | OK | 319 | 80.2% |
| node failure/check restart swarm node | Skip | 2 | 0.5% |
| node failure/cpu overload | Error | 1 | 0.3% |
| node failure/cpu overload | Fail | 29 | 7.3% |
| node failure/cpu overload | OK | 364 | 91.9% |
| node failure/cpu overload | Skip | 2 | 0.5% |
| node failure/initiator out of disk space | Error | 1 | 0.3% |
| node failure/initiator out of disk space | Fail | 14 | 3.5% |
| node failure/initiator out of disk space | OK | 379 | 95.7% |
| node failure/initiator out of disk space | Skip | 2 | 0.5% |
| node failure/network failure | Fail | 80 | 20.1% |
| node failure/network failure | OK | 318 | 79.9% |
| node failure/swarm out of disk space | Error | 1 | 0.3% |
| node failure/swarm out of disk space | Fail | 30 | 7.6% |
| node failure/swarm out of disk space | OK | 363 | 91.7% |
| node failure/swarm out of disk space | Skip | 2 | 0.5% |
| swarm joins | Error | 6 | 1.4% |
| swarm joins | Fail | 151 | 35.1% |
| swarm joins | OK | 270 | 62.8% |
| swarm joins | Skip | 3 | 0.7% |
| swarm union | Error | 2 | 0.5% |
| swarm union | Fail | 23 | 5.6% |
| swarm union | OK | 385 | 93.2% |
| swarm union | Skip | 3 | 0.7% |

2. Recent timeline for disk space scenarios

This query shows the recent pass/fail timeline for the initiator out of disk space and swarm out of disk space scenarios, to confirm that failures are intermittent and not trending.

SELECT
    test_name,
    result,
    start_time,
    clickhouse_version,
    job_url
FROM `gh-data`.clickhouse_regression_results
WHERE test_name LIKE '%swarm%'
  AND (
    test_name LIKE '%kill swarm%'
    OR test_name LIKE '%out of disk%'
    OR test_name LIKE '%initiator%'
    OR test_name LIKE '%kill initiator%'
    OR test_name LIKE '%node failure%feature%'
  )
ORDER BY start_time DESC
LIMIT 200;

Key observation: The most recent ~30 entries for both initiator out of disk space and swarm out of disk space show predominantly OK results, with only sporadic Fail entries, confirming the flaky nature rather than a consistent regression.


Selfeer commented Feb 20, 2026

'/tiered storage/with s3gcs/background move/max move factor': https://github.com/Altinity/ClickHouse/actions/runs/22222536503/job/64284577377

This one is also unrelated and looks like an infra failure. The test queries system.part_log for the path_on_disk of moved parts, and the output contains an empty line (between the paths of a previous table and those of the current table). The test iterates over all lines and asserts each starts with /var/lib/clickhouse/disks/external/; the empty string '' fails that assertion. This is a timing/cleanup artifact, not a logic error.
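A defensive fix for that kind of assertion is to discard blank lines before checking the path prefix. A sketch with hypothetical names (the actual test code is not shown here):

```python
def moved_part_paths(query_output: str):
    """Split system.part_log output into lines, dropping blank lines that
    can appear between the paths of different tables.

    Hypothetical helper illustrating the fix, not the existing test code."""
    return [line for line in query_output.splitlines() if line.strip()]


# Each surviving path can then be asserted safely:
# for path in moved_part_paths(output):
#     assert path.startswith("/var/lib/clickhouse/disks/external/")
```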
