Conversation
📝 WalkthroughWalkthroughThe PR modifies ZODAN cluster node procedures to enhance error handling by converting notices to exceptions, removes the last-node removal guard, adds source DSN parameter for remote status checks, and introduces a new test validating quick failure when critical replication hooks are missing. Changes
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.sql`:
- Around line 1844-1895: Update the DROP PROCEDURE statement for
spock.present_final_cluster_state to match the new signature that includes
src_dsn as the first parameter: change the old DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(integer, boolean) to DROP PROCEDURE IF EXISTS
spock.present_final_cluster_state(text, integer, boolean) so the cleanup matches
the revised procedure signature.
In `@tests/tap/t/016_zodan_3n_timeout.pl`:
- Around line 43-78: Capture and assert the add_node call actually failed
instead of only checking elapsed time: wrap the CALL spock.add_node invocation
(where scalar_query is used) so you record any error/exception (e.g. via eval {
scalar_query(...) } and check $@ or by using a scalar_query variant that returns
status), then add an assertion that the call failed (e.g. ok(!$result || $@,
"...add_node failed as expected") and optionally assert the error message
contains "spock.sync_event" or "function ... does not exist"); keep the existing
timing check (ok($elapsed_time < 600,...)). Apply the same pattern to the other
add_node tests referenced (the blocks calling scalar_query for spock.add_node at
the other locations) and leave the cleanup calls (CALL spock.remove_node)
unchanged.
- Around line 80-117: The timer currently starts before the successful add_node
for n2 so $elapsed_time includes that operation; move the start_time = time()
assignment to immediately before the failing CALL spock.add_node for n3 (i.e.
right before the scalar_query(3, qq{ CALL spock.add_node ... }) invocation) so
the measured duration only covers the failing attempt, then compute
$elapsed_time and assert it is < 600 as before; remove or relocate the earlier
$start_time assignment placed just after renaming pg_replication_slot_advance to
avoid double-timing.
|
unfortunately needs a file rename and rebase |
Procedure present_final_cluster_state should execute COMMIT command to let newly created subscriptions from working to the new node update their state. Also, final cluster state should check all subscriptions around the cluster, not only the new node.
- Change RAISE NOTICE to RAISE EXCEPTION in error handlers to properly propagate errors instead of silently continuing - Move sub_name assignment to the correct location inside the subscription creation block where it is actually used - Add new test 016_zodan_2n_timeout for 2-node timeout scenario Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change RAISE NOTICE to RAISE EXCEPTION in zodan.sql error handlers to ensure add_node fails immediately rather than continuing silently and eventually timing out. Affected operations: - Replication lag monitoring - Creating replication slots on other nodes - Enabling subscriptions - Creating subscriptions on other nodes - Triggering and waiting for sync events - Checking commit LSN for slot advancement - Advancing replication slots - Triggering sync_event on source node Also remove unreachable code (CONTINUE/RAISE statements after RAISE EXCEPTION). Add comprehensive 3-node cluster test that verifies quick failure by: - spock.sync_event() is missing - pg_replication_slot_advance() is missing - spock.sub_create() is missing Test also verifies successful cluster formation with 3 nodes and 2 enabled subscriptions per node. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
scalar_query() uses backticks which don't throw Perl exceptions on error - they just return empty output and set $? with the exit code. Now properly check psql exit code ($? >> 8) after each add_node call that is expected to fail, and assert non-zero exit code to verify the operation actually failed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
samples/Z0DAN/zodan.sql (2)
300-311:⚠️ Potential issue | 🔴 Critical
create_subsilently swallows exceptions whenverb=false— fails to propagate errors.The exception handler at lines 303–309 only raises when
verbis true. Whenverb=false(which is how the test and likely production calls use it), a failure inspock.sub_createvia dblink is silently caught and the procedure returns as if it succeeded. This directly undermines the PR's goal of fast failure: errors from subscription creation on remote nodes won't propagate, and the caller (e.g.,create_sub_on_new_node_to_src_node) will continue as though the subscription was created successfully.Suggested fix — always re-raise
EXCEPTION WHEN OTHERS THEN - IF verb THEN - RAISE EXCEPTION E' - [STEP 3] Subscription "%" creation failed on remote node! Error: % - ', subscription_name, SQLERRM; - END IF; + RAISE EXCEPTION E' + [STEP 3] Subscription "%" creation failed on remote node! Error: % + ', subscription_name, SQLERRM; END;
547-568:⚠️ Potential issue | 🔴 CriticalSame verb-gated exception issue in
create_node.Both the NULL-joinid check (line 556) and the
WHEN OTHERShandler (line 563) only raise exceptions whenverb=true. Whenverb=false, node creation failures are silently ignored.Suggested fix — always re-raise
IF joinid IS NOT NULL THEN IF verb THEN RAISE NOTICE 'Node % created remotely with id % and DSN: %', node_name, joinid, dsn; END IF; ELSE - IF verb THEN - RAISE EXCEPTION E' - [STEP 3] Node "%" creation failed remotely! - ', node_name; - END IF; + RAISE EXCEPTION E' + [STEP 3] Node "%" creation failed remotely! + ', node_name; END IF; EXCEPTION WHEN OTHERS THEN - IF verb THEN - RAISE EXCEPTION E' - [STEP 3] Node "%" creation failed remotely! Error: % - ', node_name, SQLERRM; - END IF; + RAISE EXCEPTION E' + [STEP 3] Node "%" creation failed remotely! Error: % + ', node_name, SQLERRM; END;
🤖 Fix all issues with AI agents
In `@samples/Z0DAN/zodan.sql`:
- Around line 1284-1288: The EXCEPTION block raises an exception then has an
unreachable CONTINUE; to allow the loop to continue on error, replace the RAISE
EXCEPTION with a non-fatal logger (e.g., RAISE WARNING or RAISE NOTICE) that
includes SQLERRM and rec.node_name, or alternatively remove the CONTINUE if you
intend to abort; update the EXCEPTION handler around the block that references
SQLERRM and rec.node_name so the error is logged with the same message but
without terminating execution.
- Around line 1141-1145: The EXCEPTION block contains an unreachable "RAISE;"
after a terminating "RAISE EXCEPTION" call; remove the redundant "RAISE;"
statement so the handler only logs/raises once. Locate the EXCEPTION section
that includes the RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' ||
new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ') and delete the
subsequent standalone "RAISE;" line (inside the same block) to eliminate dead
code.
In `@tests/tap/t/017_zodan_3n_timeout.pl`:
- Around line 75-80: The remove_node calls are building DSN strings that include
user=$db_user but omit password=$db_password, so scalar_query("CALL
spock.remove_node(...)") will fail under non-trust auth; update each remove_node
invocation (the scalar_query calls invoking spock.remove_node in this file) to
include password=$db_password in the target_node_dsn argument (matching the
add_node DSN format used elsewhere) so the cleanup executes with proper
credentials; ensure all occurrences (the remove_node calls near the scalar_query
invocations) are updated consistently.
| EXCEPTION | ||
| WHEN OTHERS THEN | ||
| RAISE NOTICE ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | ||
| RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | ||
| RAISE; | ||
| END; |
There was a problem hiding this comment.
Unreachable RAISE; after RAISE EXCEPTION.
Line 1144 is dead code — RAISE EXCEPTION on line 1143 already terminates execution. The PR objectives explicitly call out removing such unreachable statements.
Suggested fix
EXCEPTION
WHEN OTHERS THEN
RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
- RAISE;
END;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| EXCEPTION | |
| WHEN OTHERS THEN | |
| RAISE NOTICE ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | |
| RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | |
| RAISE; | |
| END; | |
| EXCEPTION | |
| WHEN OTHERS THEN | |
| RAISE EXCEPTION ' ✗ %', rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | |
| END; |
🤖 Prompt for AI Agents
In `@samples/Z0DAN/zodan.sql` around lines 1141 - 1145, The EXCEPTION block
contains an unreachable "RAISE;" after a terminating "RAISE EXCEPTION" call;
remove the redundant "RAISE;" statement so the handler only logs/raises once.
Locate the EXCEPTION section that includes the RAISE EXCEPTION ' ✗ %',
rpad('Creating new node ' || new_node_name || ' (error: ' || SQLERRM || ')',
120, ' ') and delete the subsequent standalone "RAISE;" line (inside the same
block) to eliminate dead code.
| EXCEPTION | ||
| WHEN OTHERS THEN | ||
| RAISE NOTICE ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | ||
| RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | ||
| CONTINUE; | ||
| END; |
There was a problem hiding this comment.
Unreachable CONTINUE after RAISE EXCEPTION.
Line 1287 can never execute — RAISE EXCEPTION on line 1286 already terminates the block. This is the same dead-code pattern the PR objectives intend to eliminate.
Suggested fix
EXCEPTION
WHEN OTHERS THEN
RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
- CONTINUE;
END;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| EXCEPTION | |
| WHEN OTHERS THEN | |
| RAISE NOTICE ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | |
| RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | |
| CONTINUE; | |
| END; | |
| EXCEPTION | |
| WHEN OTHERS THEN | |
| RAISE EXCEPTION ' ✗ %', rpad('Triggering sync event on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' '); | |
| END; |
🤖 Prompt for AI Agents
In `@samples/Z0DAN/zodan.sql` around lines 1284 - 1288, The EXCEPTION block raises
an exception then has an unreachable CONTINUE; to allow the loop to continue on
error, replace the RAISE EXCEPTION with a non-fatal logger (e.g., RAISE WARNING
or RAISE NOTICE) that includes SQLERRM and rec.node_name, or alternatively
remove the CONTINUE if you intend to abort; update the EXCEPTION handler around
the block that references SQLERRM and rec.node_name so the error is logged with
the same message but without terminating execution.
| scalar_query(2, qq{ | ||
| CALL spock.remove_node( | ||
| target_node_name := 'n2', | ||
| target_node_dsn := 'host=$host dbname=$dbname port=$node_ports->[1] user=$db_user', | ||
| verbose_mode := true) | ||
| }); |
There was a problem hiding this comment.
remove_node DSN strings omit password=… unlike add_node calls.
All remove_node invocations (lines 78, 120, 151, 157) use user=$db_user without password=$db_password, while every add_node call includes the password. If the test environment doesn't use trust authentication, these cleanup calls will silently fail (since they use scalar_query), potentially leaving stale cluster state that affects subsequent test scenarios.
🤖 Prompt for AI Agents
In `@tests/tap/t/017_zodan_3n_timeout.pl` around lines 75 - 80, The remove_node
calls are building DSN strings that include user=$db_user but omit
password=$db_password, so scalar_query("CALL spock.remove_node(...)") will fail
under non-trust auth; update each remove_node invocation (the scalar_query calls
invoking spock.remove_node in this file) to include password=$db_password in the
target_node_dsn argument (matching the add_node DSN format used elsewhere) so
the cleanup executes with proper credentials; ensure all occurrences (the
remove_node calls near the scalar_query invocations) are updated consistently.
Fix Z0DAN error handling: Change RAISE NOTICE to RAISE EXCEPTION across all error handlers in zodan.sql so that add_node fails immediately on error instead of silently logging and eventually timing out. Affected operations: node creation, sync event triggering/waiting, replication slot creation/advancement, subscription creation/enabling, and replication lag monitoring.
Fix subscription name assignment: Move sub_name assignment to the correct location inside the subscription creation block where it is actually used.
Fix Z0DAN initialization: Add COMMIT in present_final_cluster_state to let newly created subscriptions update their state; check all subscriptions across the cluster, not only the new node.
Fix status label: Correct [INITIALIZING] to [SYNCING] for subscriptions in syncing state.
Allow removing the last Spock node via remove_node.
Remove unreachable code: Drop CONTINUE/RAISE statements after RAISE EXCEPTION.
Add 3-node cluster TAP test (016_zodan_3n_timeout).