Skip to content

Spoc 388: Z0DAN should quickly fail instead of waiting timeout#330

Open
danolivo wants to merge 5 commits intomainfrom
spoc-388
Open

Spoc 388: Z0DAN should quickly fail instead of waiting timeout#330
danolivo wants to merge 5 commits intomainfrom
spoc-388

Conversation

@danolivo
Copy link
Contributor

@danolivo danolivo commented Feb 4, 2026

  • Fix Z0DAN error handling: Change RAISE NOTICE to RAISE EXCEPTION across all error handlers in zodan.sql so that add_node fails immediately on error instead of silently logging and eventually timing out. Affected operations: node creation, sync event triggering/waiting, replication slot creation/advancement, subscription creation/enabling, and replication lag monitoring.

  • Fix subscription name assignment: Move sub_name assignment to the correct location inside the subscription creation block where it is actually used.

  • Fix Z0DAN initialization: Add COMMIT in present_final_cluster_state to let newly created subscriptions update their state; check all subscriptions across the cluster, not only the new node.

  • Fix status label: Correct [INITIALIZING] to [SYNCING] for subscriptions in syncing state.

  • Allow removing the last Spock node via remove_node.

  • Remove unreachable code: Drop CONTINUE/RAISE statements after RAISE EXCEPTION.

  • Add 3-node cluster TAP test (016_zodan_3n_timeout).

@danolivo danolivo requested a review from ibrarahmad February 4, 2026 13:20
@danolivo danolivo self-assigned this Feb 4, 2026
@danolivo danolivo added the enhancement New feature or request label Feb 4, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

The PR modifies ZODAN cluster node procedures to enhance error handling by converting notices to exceptions, removes the last-node removal guard, adds source DSN parameter for remote status checks, and introduces a new test validating quick failure when critical replication hooks are missing.

Changes

Cohort / File(s) Summary
ZODAN Procedures
samples/Z0DAN/zodan.sql, samples/Z0DAN/zodremove.sql, samples/Z0DAN/zodan_cleanup.sql
Modified error handling to throw exceptions instead of notices in exception blocks; updated present_final_cluster_state signature to accept src_dsn as first parameter for remote status probing; added per-subscription status checks and early exit on timeout; removed guard preventing removal of final cluster node.
Test Scheduling
tests/tap/schedule
Replaced scheduled test 014_rolling_upgrade with new test 017_zodan_3n_timeout.
New Timeout Test
tests/tap/t/017_zodan_3n_timeout.pl
Added comprehensive test validating that add_node fails promptly when critical hooks (sync_event, pg_replication_slot_advance, sub_create) are missing, rather than waiting for timeout; includes cluster setup, function renaming to simulate missing hooks, failure assertions, and final consistency checks.

Poem

🐰 A rabbit hops through SQL so bright,
Exceptions now buzz where notices did write,
The final node flies free at last,
Fast failures catch the missing cast!
With DSN in hand, we probe the way,
And test the timeout's swift ballet. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: Z0DAN should fail quickly instead of waiting for timeout, which aligns with the primary objective across all file modifications.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing multiple specific fixes and improvements across zodan.sql, zodremove.sql, and the new test file.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch spoc-388

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.sql`:
- Around line 1844-1895: Update the DROP PROCEDURE statement for
spock.present_final_cluster_state to match the new signature that includes
src_dsn as the first parameter: change the old DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(integer, boolean) to DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(text, integer, boolean) so the cleanup matches
the revised procedure signature.

In `@tests/tap/t/016_zodan_3n_timeout.pl`:
- Around line 43-78: Capture and assert the add_node call actually failed
instead of only checking elapsed time: wrap the CALL spock.add_node invocation
(where scalar_query is used) so you record any error/exception (e.g. via eval {
scalar_query(...) } and check $@ or by using a scalar_query variant that returns
status), then add an assertion that the call failed (e.g. ok(!$result || $@,
"...add_node failed as expected") and optionally assert the error message
contains "spock.sync_event" or "function ... does not exist"); keep the existing
timing check (ok($elapsed_time < 600,...)). Apply the same pattern to the other
add_node tests referenced (the blocks calling scalar_query for spock.add_node at
the other locations) and leave the cleanup calls (CALL spock.remove_node)
unchanged.
- Around line 80-117: The timer currently starts before the successful add_node
for n2 so $elapsed_time includes that operation; move the start_time = time()
assignment to immediately before the failing CALL spock.add_node for n3 (i.e.
right before the scalar_query(3, qq{ CALL spock.add_node ... }) invocation) so
the measured duration only covers the failing attempt, then compute
$elapsed_time and assert it is < 600 as before; remove or relocate the earlier
$start_time assignment placed just after renaming pg_replication_slot_advance to
avoid double-timing.

@mason-sharp
Copy link
Member

unfortunately needs a file rename and rebase

danolivo and others added 5 commits February 6, 2026 14:40
Procedure present_final_cluster_state should execute COMMIT command to let
newly created subscriptions from working to the new node update their state.

Also, final cluster state should check all subscriptions around the cluster,
not only the new node.
- Change RAISE NOTICE to RAISE EXCEPTION in error handlers to properly
  propagate errors instead of silently continuing
- Move sub_name assignment to the correct location inside the
  subscription creation block where it is actually used
- Add new test 016_zodan_2n_timeout for 2-node timeout scenario

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change RAISE NOTICE to RAISE EXCEPTION in zodan.sql error handlers to
ensure add_node fails immediately rather than continuing silently and
eventually timing out. Affected operations:

- Replication lag monitoring
- Creating replication slots on other nodes
- Enabling subscriptions
- Creating subscriptions on other nodes
- Triggering and waiting for sync events
- Checking commit LSN for slot advancement
- Advancing replication slots
- Triggering sync_event on source node

Also remove unreachable code (CONTINUE/RAISE statements after
RAISE EXCEPTION).

Add comprehensive 3-node cluster test that verifies quick failure by:
- spock.sync_event() is missing
- pg_replication_slot_advance() is missing
- spock.sub_create() is missing

Test also verifies successful cluster formation with 3 nodes and
2 enabled subscriptions per node.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
scalar_query() uses backticks which don't throw Perl exceptions on
error - they just return empty output and set $? with the exit code.

Now properly check psql exit code ($? >> 8) after each add_node call
that is expected to fail, and assert non-zero exit code to verify
the operation actually failed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
samples/Z0DAN/zodan.sql (2)

300-311: ⚠️ Potential issue | 🔴 Critical

create_sub silently swallows exceptions when verb=false — fails to propagate errors.

The exception handler at lines 303–309 only raises when verb is true. When verb=false (which is how the test and likely production calls use it), a failure in spock.sub_create via dblink is silently caught and the procedure returns as if it succeeded. This directly undermines the PR's goal of fast failure: errors from subscription creation on remote nodes won't propagate, and the caller (e.g., create_sub_on_new_node_to_src_node) will continue as though the subscription was created successfully.

Suggested fix — always re-raise
     EXCEPTION
         WHEN OTHERS THEN
-            IF verb THEN
-                RAISE EXCEPTION E'
-            [STEP 3] Subscription "%" creation failed on remote node! Error: %
-            ', subscription_name, SQLERRM;
-            END IF;
+            RAISE EXCEPTION E'
+        [STEP 3] Subscription "%" creation failed on remote node! Error: %
+        ', subscription_name, SQLERRM;
     END;

547-568: ⚠️ Potential issue | 🔴 Critical

Same verb-gated exception issue in create_node.

Both the NULL-joinid check (line 556) and the WHEN OTHERS handler (line 563) only raise exceptions when verb=true. When verb=false, node creation failures are silently ignored.

Suggested fix — always re-raise
         IF joinid IS NOT NULL THEN
             IF verb THEN
                 RAISE NOTICE 'Node % created remotely with id % and DSN: %', node_name, joinid, dsn;
             END IF;
         ELSE
-            IF verb THEN
-                RAISE EXCEPTION E'
-            [STEP 3] Node "%" creation failed remotely!
-            ', node_name;
-            END IF;
+            RAISE EXCEPTION E'
+        [STEP 3] Node "%" creation failed remotely!
+        ', node_name;
         END IF;
     EXCEPTION
         WHEN OTHERS THEN
-            IF verb THEN
-                RAISE EXCEPTION E'
-            [STEP 3] Node "%" creation failed remotely! Error: %
-            ', node_name, SQLERRM;
-            END IF;
+            RAISE EXCEPTION E'
+        [STEP 3] Node "%" creation failed remotely! Error: %
+        ', node_name, SQLERRM;
     END;
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.sql`:
- Around line 1284-1288: The EXCEPTION block raises an exception then has an
unreachable CONTINUE; to allow the loop to continue on error, replace the RAISE
EXCEPTION with a non-fatal logger (e.g., RAISE WARNING or RAISE NOTICE) that
includes SQLERRM and rec.node_name, or alternatively remove the CONTINUE if you
intend to abort; update the EXCEPTION handler around the block that references
SQLERRM and rec.node_name so the error is logged with the same message but
without terminating execution.
- Around line 1141-1145: The EXCEPTION block contains an unreachable "RAISE;"
after a terminating "RAISE EXCEPTION" call; remove the redundant "RAISE;"
statement so the handler only logs/raises once. Locate the EXCEPTION section
that includes the RAISE EXCEPTION '    ✗ %', rpad('Creating new node ' ||
new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ') and delete the
subsequent standalone "RAISE;" line (inside the same block) to eliminate dead
code.

In `@tests/tap/t/017_zodan_3n_timeout.pl`:
- Around line 75-80: The remove_node calls are building DSN strings that include
user=$db_user but omit password=$db_password, so scalar_query("CALL
spock.remove_node(...)") will fail under non-trust auth; update each remove_node
invocation (the scalar_query calls invoking spock.remove_node in this file) to
include password=$db_password in the target_node_dsn argument (matching the
add_node DSN format used elsewhere) so the cleanup executes with proper
credentials; ensure all occurrences (the remove_node calls near the scalar_query
invocations) are updated consistently.

Comment on lines 1141 to 1145
EXCEPTION
WHEN OTHERS THEN
RAISE NOTICE ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
RAISE;
END;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Unreachable RAISE; after RAISE EXCEPTION.

Line 1144 is dead code — RAISE EXCEPTION on line 1143 already terminates execution. The PR objectives explicitly call out removing such unreachable statements.

Suggested fix
     EXCEPTION
         WHEN OTHERS THEN
             RAISE EXCEPTION '    ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
-            RAISE;
     END;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
EXCEPTION
WHEN OTHERS THEN
RAISE NOTICE ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
RAISE;
END;
EXCEPTION
WHEN OTHERS THEN
RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
END;
🤖 Prompt for AI Agents
In `@samples/Z0DAN/zodan.sql` around lines 1141 - 1145, The EXCEPTION block
contains an unreachable "RAISE;" after a terminating "RAISE EXCEPTION" call;
remove the redundant "RAISE;" statement so the handler only logs/raises once.
Locate the EXCEPTION section that includes the RAISE EXCEPTION '    ✗ %',
rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')',
120, ' ') and delete the subsequent standalone "RAISE;" line (inside the same
block) to eliminate dead code.

Comment on lines 1284 to 1288
EXCEPTION
WHEN OTHERS THEN
RAISE NOTICE ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
CONTINUE;
END;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Unreachable CONTINUE after RAISE EXCEPTION.

Line 1287 can never execute — RAISE EXCEPTION on line 1286 already terminates the block. This is the same dead-code pattern the PR objectives intend to eliminate.

Suggested fix
         EXCEPTION
             WHEN OTHERS THEN
                 RAISE EXCEPTION '    ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
-                CONTINUE;
         END;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
EXCEPTION
WHEN OTHERS THEN
RAISE NOTICE ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
CONTINUE;
END;
EXCEPTION
WHEN OTHERS THEN
RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
END;
🤖 Prompt for AI Agents
In `@samples/Z0DAN/zodan.sql` around lines 1284 - 1288, The EXCEPTION block raises
an exception then has an unreachable CONTINUE; to allow the loop to continue on
error, replace the RAISE EXCEPTION with a non-fatal logger (e.g., RAISE WARNING
or RAISE NOTICE) that includes SQLERRM and rec.node_name, or alternatively
remove the CONTINUE if you intend to abort; update the EXCEPTION handler around
the block that references SQLERRM and rec.node_name so the error is logged with
the same message but without terminating execution.

Comment on lines +75 to +80
scalar_query(2, qq{
CALL spock.remove_node(
target_node_name := 'n2',
target_node_dsn := 'host=$host dbname=$dbname port=$node_ports->[1] user=$db_user',
verbose_mode := true)
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

remove_node DSN strings omit password=… unlike add_node calls.

All remove_node invocations (lines 78, 120, 151, 157) use user=$db_user without password=$db_password, while every add_node call includes the password. If the test environment doesn't use trust authentication, these cleanup calls will silently fail (since they use scalar_query), potentially leaving stale cluster state that affects subsequent test scenarios.

🤖 Prompt for AI Agents
In `@tests/tap/t/017_zodan_3n_timeout.pl` around lines 75 - 80, The remove_node
calls are building DSN strings that include user=$db_user but omit
password=$db_password, so scalar_query("CALL spock.remove_node(...)") will fail
under non-trust auth; update each remove_node invocation (the scalar_query calls
invoking spock.remove_node in this file) to include password=$db_password in the
target_node_dsn argument (matching the add_node DSN format used elsewhere) so
the cleanup executes with proper credentials; ensure all occurrences (the
remove_node calls near the scalar_query invocations) are updated consistently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants