25.8 Antalya backport of #90825: Add role-based access to Glue catalog#1428

Open
zvonand wants to merge 7 commits into antalya-25.8 from backports/antalya-25.8/90825
Conversation


@zvonand zvonand commented Feb 18, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add role-based access to Glue catalog. Use settings aws_role_arn and, optionally, aws_role_session_name. (ClickHouse#90825 by @antonio2368)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)


github-actions bot commented Feb 18, 2026

Workflow [PR], commit [d81516c]

zvonand force-pushed the backports/antalya-25.8/90825 branch from cef9232 to fba0893 on February 20, 2026 at 09:54
zvonand force-pushed the backports/antalya-25.8/90825 branch from fba0893 to c96e9ea on February 20, 2026 at 10:05

Selfeer commented Feb 20, 2026

Integration Test Failure

All 12 test_database_glue tests fail during fixture setup, before any test logic runs. The 1 other failure (test_storage_s3_queue::test_list_and_delete_race) is unrelated and flaky.

Root Cause

The new run_s3_mocks() function added in this PR passes a 4-tuple to start_mock_servers():

start_mock_servers(
    started_cluster, script_dir,
    [("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", args)],  # 4 elements
)

But helpers/mock_servers.py on the antalya-25.8 branch only supports 3-tuples:

for server_name, container, port in mocks:  # ValueError: too many values to unpack

The upstream repo has an updated mock_servers.py that accepts the 4th args element, but that change was not included in the backport.

Traceback

test_database_glue/test.py:280 → run_s3_mocks(cluster)
test_database_glue/test.py:39  → start_mock_servers(...)
helpers/mock_servers.py:23     → ValueError: too many values to unpack (expected 3)
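The unpacking mismatch above suggests a backward-compatible shim on the branch: accept both the old 3-tuples and the new 4-tuples. The sketch below only mirrors the tuple shapes quoted in this comment; it is an assumption, not the actual upstream `mock_servers.py` fix:

```python
def normalize_mock_spec(mock):
    """Accept both (script, container, port) and (script, container, port, args)
    tuple shapes, defaulting the extra args to an empty list.

    Sketch only: names mirror the snippets quoted above, not the upstream helper.
    """
    if len(mock) == 3:
        server_name, container, port = mock
        args = []
    else:
        server_name, container, port, args = mock
    return server_name, container, port, args


# A start_mock_servers() loop could then unpack uniformly:
# for server_name, container, port, args in map(normalize_mock_spec, mocks):
#     ...launch the mock script inside the container, passing `args`...
```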

Impact

This is directly caused by this PR. Since the fixture crashes, all 12 test_database_glue tests (both existing and the new test_sts_smoke) fail without executing.


zvonand commented Feb 20, 2026

@Selfeer I saw that, I'll fix it.

The remaining problem is the regression failures.


Selfeer commented Feb 20, 2026

Regression aarch64 swarms - Failure Analysis

PR: #1428 | Workflow: 22222536503 | Version: 25.8.16.20001.altinityantalya | Arch: aarch64

Reason: Not related to this specific PR

Report: report.html

Summary

The Regression aarch64 swarms job failed with 6 failed scenarios out of 20 (14 OK). All 6 failures are in the node failure feature.

Failed Scenarios & Historical Flakiness

| Scenario | Pass Rate | Fail | OK | Flaky? |
| --- | --- | --- | --- | --- |
| node failure/network failure | 79.9% | 80 | 318 | Yes (highest flakiness) |
| node failure/check restart swarm node | 80.2% | 77 | 319 | Yes |
| node failure/check restart clickhouse on swarm node | 85.9% | 52 | 342 | Yes |
| node failure/swarm out of disk space | 91.7% | 30 | 363 | Yes |
| node failure/cpu overload | 91.9% | 29 | 364 | Yes |
| node failure/initiator out of disk space | 95.7% | 14 | 379 | Yes (least flaky) |

Root Cause Analysis

All 6 failures originate from swarms/tests/node_failure.py and share a common error pattern:

  1. Primary failure (4 scenarios): Test expects DB::Exception: Query was cancelled. but receives DB::Exception: Query is killed in pending state. (QUERY_WAS_CANCELLED) — an assertion message mismatch where the error code 394 is correct but the message text differs.

  2. Secondary failure (2 scenarios — disk space): check_preprocessed_config_is_updated() assertion fails (exitcode == 1 instead of 0), and subsequent query gets ALL_CONNECTION_TRIES_FAILED (Code: 279) — a timing/infrastructure issue where ClickHouse config reload doesn't complete in time.
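Since the error code 394 is stable while the message text differs, one way to make such test assertions robust is to key on the code rather than the exact message. A minimal sketch (the helper name is hypothetical, and it assumes the framework exposes the raw error string):

```python
import re


def is_query_cancelled(error_text: str) -> bool:
    """Treat any QUERY_WAS_CANCELLED (code 394) error as a cancellation,
    whether the message reads 'Query was cancelled.' or
    'Query is killed in pending state.'.

    Hypothetical helper, not part of the existing test suite."""
    return bool(re.search(r"Code:\s*394\b|QUERY_WAS_CANCELLED", error_text))
```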

Conclusion

  • Not a regression. These failures have a long history across multiple PRs and versions (25.8.14, 25.8.16).

Appendix: Queries Used

All queries below were run against the gh-data database at github-checks.tenant-a.staging.altinity.cloud:8443.

1. Historical pass/fail ratios per scenario (all-time)

This query computes the historical pass/fail percentage for each of the failed scenarios across all recorded runs.

SELECT
    test_name,
    result,
    count() AS cnt,
    round(count() * 100.0 / sum(count()) OVER (PARTITION BY test_name), 1) AS pct
FROM `gh-data`.clickhouse_regression_results
WHERE test_name IN (
    '/swarms/feature/node failure/check restart clickhouse on swarm node',
    '/swarms/feature/node failure/check restart swarm node',
    '/swarms/feature/node failure/cpu overload',
    '/swarms/feature/node failure/initiator out of disk space',
    '/swarms/feature/node failure/network failure',
    '/swarms/feature/node failure/swarm out of disk space',
    '/swarms/feature/swarm joins',
    '/swarms/feature/swarm union'
)
GROUP BY test_name, result
ORDER BY test_name, result;

Result:

| test_name | result | cnt | pct |
| --- | --- | --- | --- |
| node failure/check restart clickhouse on swarm node | Error | 2 | 0.5% |
| node failure/check restart clickhouse on swarm node | Fail | 52 | 13.1% |
| node failure/check restart clickhouse on swarm node | OK | 342 | 85.9% |
| node failure/check restart clickhouse on swarm node | Skip | 2 | 0.5% |
| node failure/check restart swarm node | Fail | 77 | 19.3% |
| node failure/check restart swarm node | OK | 319 | 80.2% |
| node failure/check restart swarm node | Skip | 2 | 0.5% |
| node failure/cpu overload | Error | 1 | 0.3% |
| node failure/cpu overload | Fail | 29 | 7.3% |
| node failure/cpu overload | OK | 364 | 91.9% |
| node failure/cpu overload | Skip | 2 | 0.5% |
| node failure/initiator out of disk space | Error | 1 | 0.3% |
| node failure/initiator out of disk space | Fail | 14 | 3.5% |
| node failure/initiator out of disk space | OK | 379 | 95.7% |
| node failure/initiator out of disk space | Skip | 2 | 0.5% |
| node failure/network failure | Fail | 80 | 20.1% |
| node failure/network failure | OK | 318 | 79.9% |
| node failure/swarm out of disk space | Error | 1 | 0.3% |
| node failure/swarm out of disk space | Fail | 30 | 7.6% |
| node failure/swarm out of disk space | OK | 363 | 91.7% |
| node failure/swarm out of disk space | Skip | 2 | 0.5% |
| swarm joins | Error | 6 | 1.4% |
| swarm joins | Fail | 151 | 35.1% |
| swarm joins | OK | 270 | 62.8% |
| swarm joins | Skip | 3 | 0.7% |
| swarm union | Error | 2 | 0.5% |
| swarm union | Fail | 23 | 5.6% |
| swarm union | OK | 385 | 93.2% |
| swarm union | Skip | 3 | 0.7% |

2. Recent timeline for disk space scenarios

This query shows the recent pass/fail timeline for the initiator out of disk space and swarm out of disk space scenarios, to confirm that failures are intermittent and not trending.

SELECT
    test_name,
    result,
    start_time,
    clickhouse_version,
    job_url
FROM `gh-data`.clickhouse_regression_results
WHERE test_name LIKE '%swarm%'
  AND (
    test_name LIKE '%kill swarm%'
    OR test_name LIKE '%out of disk%'
    OR test_name LIKE '%initiator%'
    OR test_name LIKE '%kill initiator%'
    OR test_name LIKE '%node failure%feature%'
  )
ORDER BY start_time DESC
LIMIT 200;

Key observation: The most recent ~30 entries for both initiator out of disk space and swarm out of disk space show predominantly OK results, with only sporadic Fail entries, confirming the flaky nature rather than a consistent regression.


Selfeer commented Feb 20, 2026

'/tiered storage/with s3gcs/background move/max move factor': https://github.com/Altinity/ClickHouse/actions/runs/22222536503/job/64284577377

This one is also unrelated and looks like an infra failure. The test queries system.part_log for the path_on_disk of moved parts, and the output contains an empty line (between the paths of a previous table and those of the current table). The test iterates over all lines and asserts each starts with /var/lib/clickhouse/disks/external/; the empty string '' fails that assertion. This is a timing/cleanup artifact, not a logic error.
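A defensive fix for that kind of assertion is to discard blank lines before checking the path prefix. A sketch with hypothetical names (the actual test code is not shown here):

```python
def moved_part_paths(query_output: str):
    """Split system.part_log output into lines, dropping blank lines that
    can appear between the paths of different tables.

    Hypothetical helper illustrating the fix, not the existing test code."""
    return [line for line in query_output.splitlines() if line.strip()]


# Each surviving path can then be asserted safely:
# for path in moved_part_paths(output):
#     assert path.startswith("/var/lib/clickhouse/disks/external/")
```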
