Spoc 388: Z0DAN should quickly fail instead of waiting timeout by danolivo · Pull Request #330 · pgEdge/spock

danolivo · 2026-02-04T13:20:48Z

Fix Z0DAN error handling: Change RAISE NOTICE to RAISE EXCEPTION across all error handlers in zodan.sql so that add_node fails immediately on error instead of silently logging and eventually timing out. Affected operations: node creation, sync event triggering/waiting, replication slot creation/advancement, subscription creation/enabling, and replication lag monitoring.
Fix subscription name assignment: Move sub_name assignment to the correct location inside the subscription creation block where it is actually used.
Fix Z0DAN initialization: Add COMMIT in present_final_cluster_state to let newly created subscriptions update their state; check all subscriptions across the cluster, not only the new node.
Fix status label: Correct [INITIALIZING] to [SYNCING] for subscriptions in syncing state.
Allow removing the last Spock node via remove_node.
Remove unreachable code: Drop CONTINUE/RAISE statements after RAISE EXCEPTION.
Add 3-node cluster TAP test (016_zodan_3n_timeout).

coderabbitai · 2026-02-04T13:21:10Z

📝 Walkthrough

Walkthrough

The PR modifies ZODAN cluster node procedures to enhance error handling by converting notices to exceptions, removes the last-node removal guard, adds source DSN parameter for remote status checks, and introduces a new test validating quick failure when critical replication hooks are missing.

Changes

Cohort / File(s)	Summary
ZODAN Procedures `samples/Z0DAN/zodan.sql`, `samples/Z0DAN/zodremove.sql`, `samples/Z0DAN/zodan_cleanup.sql`	Modified error handling to throw exceptions instead of notices in exception blocks; updated `present_final_cluster_state` signature to accept `src_dsn` as first parameter for remote status probing; added per-subscription status checks and early exit on timeout; removed guard preventing removal of final cluster node.
Test Scheduling `tests/tap/schedule`	Replaced scheduled test `014_rolling_upgrade` with new test `017_zodan_3n_timeout`.
New Timeout Test `tests/tap/t/017_zodan_3n_timeout.pl`	Added comprehensive test validating that `add_node` fails promptly when critical hooks (sync_event, pg_replication_slot_advance, sub_create) are missing, rather than waiting for timeout; includes cluster setup, function renaming to simulate missing hooks, failure assertions, and final consistency checks.

Poem

🐰 A rabbit hops through SQL so bright,
Exceptions now buzz where notices did write,
The final node flies free at last,
Fast failures catch the missing cast!
With DSN in hand, we probe the way,
And test the timeout's swift ballet. ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: Z0DAN should fail quickly instead of waiting for timeout, which aligns with the primary objective across all file modifications.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing multiple specific fixes and improvements across zodan.sql, zodremove.sql, and the new test file.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch spoc-388

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@samples/Z0DAN/zodan.sql`:
- Around line 1844-1895: Update the DROP PROCEDURE statement for
spock.present_final_cluster_state to match the new signature that includes
src_dsn as the first parameter: change the old DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(integer, boolean) to DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(text, integer, boolean) so the cleanup matches
the revised procedure signature.

In `@tests/tap/t/016_zodan_3n_timeout.pl`:
- Around line 43-78: Capture and assert the add_node call actually failed
instead of only checking elapsed time: wrap the CALL spock.add_node invocation
(where scalar_query is used) so you record any error/exception (e.g. via eval {
scalar_query(...) } and check $@ or by using a scalar_query variant that returns
status), then add an assertion that the call failed (e.g. ok(!$result || $@,
"...add_node failed as expected") and optionally assert the error message
contains "spock.sync_event" or "function ... does not exist"); keep the existing
timing check (ok($elapsed_time < 600,...)). Apply the same pattern to the other
add_node tests referenced (the blocks calling scalar_query for spock.add_node at
the other locations) and leave the cleanup calls (CALL spock.remove_node)
unchanged.
- Around line 80-117: The timer currently starts before the successful add_node
for n2 so $elapsed_time includes that operation; move the start_time = time()
assignment to immediately before the failing CALL spock.add_node for n3 (i.e.
right before the scalar_query(3, qq{ CALL spock.add_node ... }) invocation) so
the measured duration only covers the failing attempt, then compute
$elapsed_time and assert it is < 600 as before; remove or relocate the earlier
$start_time assignment placed just after renaming pg_replication_slot_advance to
avoid double-timing.

samples/Z0DAN/zodan.sql

tests/tap/t/017_zodan_3n_timeout.pl

mason-sharp · 2026-02-05T18:16:42Z

unfortunately needs a file rename and rebase

Procedure present_final_cluster_state should execute COMMIT command to let newly created subscriptions from working to the new node update their state. Also, final cluster state should check all subscriptions around the cluster, not only the new node.

- Change RAISE NOTICE to RAISE EXCEPTION in error handlers to properly propagate errors instead of silently continuing - Move sub_name assignment to the correct location inside the subscription creation block where it is actually used - Add new test 016_zodan_2n_timeout for 2-node timeout scenario Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Change RAISE NOTICE to RAISE EXCEPTION in zodan.sql error handlers to ensure add_node fails immediately rather than continuing silently and eventually timing out. Affected operations: - Replication lag monitoring - Creating replication slots on other nodes - Enabling subscriptions - Creating subscriptions on other nodes - Triggering and waiting for sync events - Checking commit LSN for slot advancement - Advancing replication slots - Triggering sync_event on source node Also remove unreachable code (CONTINUE/RAISE statements after RAISE EXCEPTION). Add comprehensive 3-node cluster test that verifies quick failure by: - spock.sync_event() is missing - pg_replication_slot_advance() is missing - spock.sub_create() is missing Test also verifies successful cluster formation with 3 nodes and 2 enabled subscriptions per node. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

scalar_query() uses backticks which don't throw Perl exceptions on error - they just return empty output and set $? with the exit code. Now properly check psql exit code ($? >> 8) after each add_node call that is expected to fail, and assert non-zero exit code to verify the operation actually failed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

samples/Z0DAN/zodan.sql (2)
300-311: ⚠️ Potential issue | 🔴 Critical

create_sub silently swallows exceptions when verb=false — fails to propagate errors.

The exception handler at lines 303–309 only raises when verb is true. When verb=false (which is how the test and likely production calls use it), a failure in spock.sub_create via dblink is silently caught and the procedure returns as if it succeeded. This directly undermines the PR's goal of fast failure: errors from subscription creation on remote nodes won't propagate, and the caller (e.g., create_sub_on_new_node_to_src_node) will continue as though the subscription was created successfully.
Suggested fix — always re-raise
     EXCEPTION
         WHEN OTHERS THEN
-            IF verb THEN
-                RAISE EXCEPTION E'
-            [STEP 3] Subscription "%" creation failed on remote node! Error: %
-            ', subscription_name, SQLERRM;
-            END IF;
+            RAISE EXCEPTION E'
+        [STEP 3] Subscription "%" creation failed on remote node! Error: %
+        ', subscription_name, SQLERRM;
     END;
547-568: ⚠️ Potential issue | 🔴 Critical

Same verb-gated exception issue in create_node.

Both the NULL-joinid check (line 556) and the WHEN OTHERS handler (line 563) only raise exceptions when verb=true. When verb=false, node creation failures are silently ignored.
Suggested fix — always re-raise
         IF joinid IS NOT NULL THEN
             IF verb THEN
                 RAISE NOTICE 'Node % created remotely with id % and DSN: %', node_name, joinid, dsn;
             END IF;
         ELSE
-            IF verb THEN
-                RAISE EXCEPTION E'
-            [STEP 3] Node "%" creation failed remotely!
-            ', node_name;
-            END IF;
+            RAISE EXCEPTION E'
+        [STEP 3] Node "%" creation failed remotely!
+        ', node_name;
         END IF;
     EXCEPTION
         WHEN OTHERS THEN
-            IF verb THEN
-                RAISE EXCEPTION E'
-            [STEP 3] Node "%" creation failed remotely! Error: %
-            ', node_name, SQLERRM;
-            END IF;
+            RAISE EXCEPTION E'
+        [STEP 3] Node "%" creation failed remotely! Error: %
+        ', node_name, SQLERRM;
     END;

🤖 Fix all issues with AI agents

In `@samples/Z0DAN/zodan.sql`:
- Around line 1284-1288: The EXCEPTION block raises an exception then has an
unreachable CONTINUE; to allow the loop to continue on error, replace the RAISE
EXCEPTION with a non-fatal logger (e.g., RAISE WARNING or RAISE NOTICE) that
includes SQLERRM and rec.node_name, or alternatively remove the CONTINUE if you
intend to abort; update the EXCEPTION handler around the block that references
SQLERRM and rec.node_name so the error is logged with the same message but
without terminating execution.
- Around line 1141-1145: The EXCEPTION block contains an unreachable "RAISE;"
after a terminating "RAISE EXCEPTION" call; remove the redundant "RAISE;"
statement so the handler only logs/raises once. Locate the EXCEPTION section
that includes the RAISE EXCEPTION '    ✗ %', rpad('Creating new node ' ||
new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ') and delete the
subsequent standalone "RAISE;" line (inside the same block) to eliminate dead
code.

In `@tests/tap/t/017_zodan_3n_timeout.pl`:
- Around line 75-80: The remove_node calls are building DSN strings that include
user=$db_user but omit password=$db_password, so scalar_query("CALL
spock.remove_node(...)") will fail under non-trust auth; update each remove_node
invocation (the scalar_query calls invoking spock.remove_node in this file) to
include password=$db_password in the target_node_dsn argument (matching the
add_node DSN format used elsewhere) so the cleanup executes with proper
credentials; ensure all occurrences (the remove_node calls near the scalar_query
invocations) are updated consistently.

coderabbitai · 2026-02-06T14:00:35Z

samples/Z0DAN/zodan.sql

    EXCEPTION
        WHEN OTHERS THEN
-            RAISE NOTICE '    ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
+            RAISE EXCEPTION '    ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
            RAISE;
    END;


⚠️ Potential issue | 🟡 Minor

Unreachable RAISE; after RAISE EXCEPTION.

Line 1144 is dead code — RAISE EXCEPTION on line 1143 already terminates execution. The PR objectives explicitly call out removing such unreachable statements.

Suggested fix

EXCEPTION WHEN OTHERS THEN RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' '); - RAISE; END;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

EXCEPTION

WHEN OTHERS THEN

RAISE NOTICE ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');

RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');

RAISE;

END;

EXCEPTION

WHEN OTHERS THEN

RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');

END;

🤖 Prompt for AI Agents

In `@samples/Z0DAN/zodan.sql` around lines 1141 - 1145, The EXCEPTION block contains an unreachable "RAISE;" after a terminating "RAISE EXCEPTION" call; remove the redundant "RAISE;" statement so the handler only logs/raises once. Locate the EXCEPTION section that includes the RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ') and delete the subsequent standalone "RAISE;" line (inside the same block) to eliminate dead code.

coderabbitai · 2026-02-06T14:00:35Z

samples/Z0DAN/zodan.sql

        EXCEPTION
            WHEN OTHERS THEN
-                RAISE NOTICE '    ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
+                RAISE EXCEPTION '    ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
                CONTINUE;
        END;


⚠️ Potential issue | 🟡 Minor

Unreachable CONTINUE after RAISE EXCEPTION.

Line 1287 can never execute — RAISE EXCEPTION on line 1286 already terminates the block. This is the same dead-code pattern the PR objectives intend to eliminate.

Suggested fix

EXCEPTION WHEN OTHERS THEN RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' '); - CONTINUE; END;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

EXCEPTION

WHEN OTHERS THEN

RAISE NOTICE ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');

RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');

CONTINUE;

END;

EXCEPTION

WHEN OTHERS THEN

RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');

END;

🤖 Prompt for AI Agents

In `@samples/Z0DAN/zodan.sql` around lines 1284 - 1288, The EXCEPTION block raises an exception then has an unreachable CONTINUE; to allow the loop to continue on error, replace the RAISE EXCEPTION with a non-fatal logger (e.g., RAISE WARNING or RAISE NOTICE) that includes SQLERRM and rec.node_name, or alternatively remove the CONTINUE if you intend to abort; update the EXCEPTION handler around the block that references SQLERRM and rec.node_name so the error is logged with the same message but without terminating execution.

coderabbitai · 2026-02-06T14:00:35Z

tests/tap/t/017_zodan_3n_timeout.pl

+scalar_query(2, qq{
+	CALL spock.remove_node(
+		target_node_name := 'n2',
+		target_node_dsn := 'host=$host dbname=$dbname port=$node_ports->[1] user=$db_user',
+		verbose_mode := true)
+});


⚠️ Potential issue | 🟡 Minor

remove_node DSN strings omit password=… unlike add_node calls.

All remove_node invocations (lines 78, 120, 151, 157) use user=$db_user without password=$db_password, while every add_node call includes the password. If the test environment doesn't use trust authentication, these cleanup calls will silently fail (since they use scalar_query), potentially leaving stale cluster state that affects subsequent test scenarios.

🤖 Prompt for AI Agents

In `@tests/tap/t/017_zodan_3n_timeout.pl` around lines 75 - 80, The remove_node calls are building DSN strings that include user=$db_user but omit password=$db_password, so scalar_query("CALL spock.remove_node(...)") will fail under non-trust auth; update each remove_node invocation (the scalar_query calls invoking spock.remove_node in this file) to include password=$db_password in the target_node_dsn argument (matching the add_node DSN format used elsewhere) so the cleanup executes with proper credentials; ensure all occurrences (the remove_node calls near the scalar_query invocations) are updated consistently.

danolivo requested a review from ibrarahmad February 4, 2026 13:20

danolivo self-assigned this Feb 4, 2026

danolivo added the enhancement New feature or request label Feb 4, 2026

coderabbitai bot reviewed Feb 4, 2026

View reviewed changes

samples/Z0DAN/zodan.sql Show resolved Hide resolved

tests/tap/t/017_zodan_3n_timeout.pl Show resolved Hide resolved

tests/tap/t/017_zodan_3n_timeout.pl Show resolved Hide resolved

danolivo and others added 5 commits February 6, 2026 14:40

Fix Z0DAN initialization issue.

567ce38

Procedure present_final_cluster_state should execute COMMIT command to let newly created subscriptions from working to the new node update their state. Also, final cluster state should check all subscriptions around the cluster, not only the new node.

Allow the remove_node routine remove the last Spock node

7d6216a

danolivo force-pushed the spoc-388 branch from 8667884 to 6b7c54b Compare February 6, 2026 13:54

coderabbitai bot reviewed Feb 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spoc 388: Z0DAN should quickly fail instead of waiting timeout#330

Spoc 388: Z0DAN should quickly fail instead of waiting timeout#330
danolivo wants to merge 5 commits intomainfrom
spoc-388

danolivo commented Feb 4, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Feb 4, 2026 •

edited

Loading

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mason-sharp commented Feb 5, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 6, 2026

Uh oh!

coderabbitai bot Feb 6, 2026

Uh oh!

coderabbitai bot Feb 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

danolivo commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mason-sharp commented Feb 5, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danolivo commented Feb 4, 2026 •

edited

Loading

coderabbitai bot commented Feb 4, 2026 •

edited

Loading