
Deterministic state syncs#2177

Open
marcello33 wants to merge 30 commits into develop from mardizzone/POS-3441_deterministic-ss

Conversation

@marcello33
Collaborator

@marcello33 marcello33 commented Mar 31, 2026

Description

Implements the bor side of deterministic state syncs. Post-fork, Bor switches from the non-deterministic (from_id, to_time) state sync query pattern to the combined GetStateSyncsByTime endpoint that resolves the Heimdall height internally and returns events in a single call. This ensures all validators derive identical state-sync sets regardless of when they query Heimdall.

Changes

  • Fork gating: Added DeterministicStateSyncBlock to BorConfig with IsDeterministicStateSync() check
  • Consensus logic: CommitStates branches on fork activation — post-fork uses the combined StateSyncEventsByTime endpoint. On errors, logs and returns empty (matching pre-fork resilience) instead of crashing block production
  • Heimdall clients: Added StateSyncEventsByTime to all three transports (HTTP, gRPC, in-process app). Consistent sorting and retry behavior across all transports.
  • Failover: MultiHeimdallClient wraps all new methods with the existing failover mechanism. Fixed flaky failover tests that used global hit counters instead of method-specific tracking.
  • Fork ID: DeterministicStateSyncBlock included in GatherForks() for peer compatibility
  • Startup banner: Fork block printed in ChainConfig.Description()

Activation

  • DeterministicStateSyncBlock: nil (disabled) on mainnet/amoy until HF is scheduled
  • Requires corresponding Heimdall-v2 changes (heimdall-v2#572) deployed first. Heimdall HF also needs to be triggered first.

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

HF needed.

Checklist

  • I have added at least 2 reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Tested on Kurtosis devnets with chaos testing (2-vs-2 network partition) and self-heal recovery. Many rounds of chaos with zero errors; state roots match Erigon.

Copilot AI review requested due to automatic review settings March 31, 2026 18:53

@claude claude bot left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.


@marcello33 marcello33 changed the title Deterministic state syncs. Deterministic state syncs Mar 31, 2026
Contributor

Copilot AI left a comment


Pull request overview

Introduces a fork-gated “deterministic state sync” flow for Bor by resolving a cutoff-time → Heimdall height, then querying state-sync events “visible at” that fixed Heimdall height to avoid non-determinism from querying a moving Heimdall head.

Changes:

  • Added Bor config fork flag DeterministicStateSyncBlock and plumbing to include it in fork ID gathering.
  • Extended Heimdall client interfaces (HTTP/gRPC/in-process app + failover wrapper) with GetBlockHeightByTime and StateSyncEventsAtHeight, and switched Bor.CommitStates to use them post-fork.
  • Added/updated tests and mocks to cover fork switching and URL formatting.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
consensus/bor/bor.go Switches state sync fetching logic to deterministic (height-anchored) mode post-fork.
consensus/bor/heimdall.go Extends IHeimdallClient with the new deterministic state-sync APIs.
consensus/bor/heimdall/client.go Implements the new endpoints for the HTTP Heimdall client + URL builders.
consensus/bor/heimdallgrpc/state_sync.go Implements gRPC equivalents for height-anchored state sync and height-by-time lookup.
consensus/bor/heimdallapp/state_sync.go Implements in-process HeimdallApp equivalents using keeper/query server.
consensus/bor/heimdall/failover_client.go Extends failover endpoint interface and wrapper calls for the new APIs.
consensus/bor/heimdall/metrics.go Adds request-type metrics for the new endpoints.
consensus/bor/heimdall/state_sync_url_test.go Adds URL-format tests for the new HTTP paths (and validates RFC3339 formatting).
params/config.go Adds DeterministicStateSyncBlock and IsDeterministicStateSync.
params/config_test.go Unit tests for IsDeterministicStateSync.
core/forkid/forkid.go Includes the new fork block in Bor fork gathering for fork ID computation.
consensus/bor/bor_test.go Adds tests verifying pre-/post-fork behavior and failure semantics post-fork.
tests/bor/mocks/IHeimdallClient.go Regenerated gomock for updated IHeimdallClient.
consensus/bor/span_store_test.go Updates test mocks to satisfy the extended Heimdall client interface.
eth/handler_bor_test.go Updates test mock to satisfy the extended Heimdall client interface.
eth/ethconfig/config_test.go Updates test mock to satisfy the extended Heimdall client interface.
consensus/bor/heimdall/failover_client_test.go Updates mock endpoint used by failover tests to satisfy the extended interface.
go.mod Pins/upgrades github.com/0xPolygon/heimdall-v2 to a newer pseudo-version (via replace).
go.sum Updates checksums for the new heimdall-v2 version.


@marcello33
Collaborator Author

@claude review

@claude

claude bot commented Mar 31, 2026

Claude finished @marcello33's task in 1m 50s —— View job


Code Review: Deterministic State Syncs

  • Gather context and read changed files
  • Review consensus logic changes (bor.go)
  • Review Heimdall client changes (HTTP, gRPC, app)
  • Review config and fork ID changes
  • Review tests
  • Post final review

Overall this is a well-structured PR. The fork-gating approach is correct, the error handling post-fork is properly strict (returning errors instead of silently swallowing), and the test coverage for the fork switch is good. Below are findings organized by severity.


HIGH: HTTP client does client-side filtering that gRPC/app clients do not

In consensus/bor/heimdall/client.go:329-343, the HTTP StateSyncEventsAtHeight applies a client-side filter:

if e.Id >= fromID && e.RecordTime.Before(time.Unix(toTime, 0)) {

But the gRPC implementation (heimdallgrpc/state_sync.go:125-137) and the in-process app implementation (heimdallapp/state_sync.go) trust the server response and append all returned events without filtering.

This means the three transport implementations have different behavior if the Heimdall endpoint returns events outside the expected range. For a deterministic state sync feature, all paths must produce identical results. Either:

  1. All clients should apply the same client-side filter (defensive), or
  2. All clients should trust the server (if the server guarantees correct filtering)

The existing StateSyncEvents has the same asymmetry (HTTP filters, gRPC doesn't), but for a feature explicitly targeting determinism, this inconsistency is worth resolving.



MEDIUM: queryCtx variable is unnecessary

In consensus/bor/bor.go:1763, queryCtx is assigned c.ctx and then used everywhere instead of c.ctx directly. This adds no value — it's not a derived context (no timeout, no cancellation). Consider removing the alias to reduce noise, or if the intent is to add a timeout later, add a TODO.


MEDIUM: log.Info in hot pagination loop

consensus/bor/heimdall/client.go:319:

log.Info("Fetching state sync events at height", "queryParams", u.RawQuery)

This logs at Info level inside the pagination loop, so it fires on every page. For large state sync batches this will be noisy. Consider log.Debug for per-page logs, keeping Info for the initial call only (similar to how the gRPC path logs once before the loop).


MEDIUM: TODO comments left in consensus-critical code

There are two identical TODO comments in bor.go:1769-1771 and bor.go:1779-1781:

// TODO marcello double check: if we silently return "no state syncs" here,
//  different validators could end up deriving different state sync sets...

The code already does the right thing (returns an error instead of silently swallowing). The TODOs are now misleading since they describe a concern that the code already addresses. These should be removed or replaced with a brief comment confirming the intentional design choice (e.g., // Post-fork: fail hard to preserve determinism).


LOW: Typo in test comment

consensus/bor/bor_test.go:5369:

// deterministi state sync

Should be deterministic state sync. (Also flagged by Copilot.)



LOW: to_time query parameter is not URL-encoded

In visibleAtHeightURL (client.go:537-540), the RFC3339Nano-formatted time is inserted into the query string via fmt.Sprintf without URL-encoding. RFC3339 timestamps contain + characters for positive timezone offsets (though UTC uses Z). Since the code explicitly uses .UTC(), this is safe today, but it's fragile — if the .UTC() call were ever removed, + would be interpreted as a space by the server. Consider using url.QueryEscape or url.Values for robustness.


LOW: No retry on the events fetch in the HTTP client

StateSyncEventsAtHeight uses bare Fetch (no retry), while GetBlockHeightByTime uses FetchWithRetry. Retrying the single height lookup is a reasonable choice, but it means a transient failure on the events endpoint (e.g. a 503 from Heimdall) fails immediately. That said, the existing StateSyncEvents also uses bare Fetch rather than FetchWithRetry, so the new code is consistent with current behavior; no change required.


Observation: go.mod replace directive

The replace directive for heimdall-v2 pins a pseudo-version. As Copilot noted, if this is temporary (waiting for a tagged release), it would be good to track when this can be cleaned up. This is fine for development but should not ship to mainnet without a proper version pin.


Summary

The core design is sound — anchoring state sync queries to a resolved Heimdall height is the right approach for determinism. The error handling post-fork correctly fails loudly rather than silently returning empty results. The main actionable finding is the client-side filtering inconsistency across HTTP/gRPC/app transports, which should be resolved to ensure all paths produce identical event sets.

@claude

claude bot commented Mar 31, 2026

test

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 24.71910% with 134 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.89%. Comparing base (a1c5b22) to head (85521b2).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines Patch % Lines
consensus/bor/heimdallgrpc/state_sync.go 0.00% 53 Missing ⚠️
consensus/bor/heimdall/client.go 14.28% 42 Missing ⚠️
consensus/bor/heimdallapp/state_sync.go 0.00% 25 Missing ⚠️
tests/bor/mocks/IHeimdallClient.go 0.00% 11 Missing ⚠️
params/config.go 40.00% 3 Missing ⚠️

❌ Your patch check has failed because the patch coverage (24.71%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2177      +/-   ##
===========================================
- Coverage    51.92%   51.89%   -0.03%     
===========================================
  Files          884      884              
  Lines       155431   155607     +176     
===========================================
+ Hits         80701    80752      +51     
- Misses       69524    69647     +123     
- Partials      5206     5208       +2     
Files with missing lines Coverage Δ
consensus/bor/bor.go 83.87% <100.00%> (+0.23%) ⬆️
consensus/bor/heimdall/failover_client.go 95.74% <100.00%> (+0.21%) ⬆️
consensus/bor/heimdall/metrics.go 12.50% <ø> (ø)
core/forkid/forkid.go 76.69% <100.00%> (+0.11%) ⬆️
params/config.go 36.84% <40.00%> (+0.02%) ⬆️
tests/bor/mocks/IHeimdallClient.go 0.00% <0.00%> (ø)
consensus/bor/heimdallapp/state_sync.go 0.00% <0.00%> (ø)
consensus/bor/heimdall/client.go 45.05% <14.28%> (-4.79%) ⬇️
consensus/bor/heimdallgrpc/state_sync.go 0.00% <0.00%> (ø)

... and 20 files with indirect coverage changes



@marcello33
Collaborator Author

@claude addressed your points.
Here's a recap.

  • HTTP client-side filtering inconsistency: since the server already filters correctly, removed the client-side filter for consistency across all three transports.
  • Unnecessary queryCtx alias: removed, using c.ctx directly.
  • log.Info in pagination loop: changed to log.Debug to avoid noise.
  • Misleading TODO comments: removed.
  • to_time not URL-encoded in visibleAtHeightURL: switched from fmt.Sprintf to url.Values for proper encoding.
  • Typo in test comment: fixed.

Re-review.

@claude

claude bot commented Apr 1, 2026

Claude finished @marcello33's task in 1m 14s —— View job


Re-review: Deterministic State Syncs

  • Verify previous review fixes
  • Review full diff for new issues
  • Post final review

All six items from the previous review have been addressed correctly:

  1. Client-side filtering removed from HTTP client: client.go:328-340 now appends all server-returned events without filtering, matching the gRPC and app paths. ✅
  2. queryCtx alias removed: bor.go:1764 uses c.ctx directly. ✅
  3. log.Info changed to log.Debug in the pagination loop: client.go:318. ✅
  4. TODO comments removed: replaced with clear // Post-fork: fail hard to preserve determinism comments at bor.go:1766,1774. ✅
  5. url.Values used for URL encoding: visibleAtHeightURL (client.go:536-541) now builds the query with url.Values. ✅
  6. Typo fixed: bor_test.go:5370. ✅

Remaining finding: Sort inconsistency across transports

The HTTP client sorts events after fetching (client.go:350-352):

sort.SliceStable(eventRecords, func(i, j int) bool {
    return eventRecords[i].ID < eventRecords[j].ID
})

But neither the gRPC client (heimdallgrpc/state_sync.go:90-148) nor the app client (heimdallapp/state_sync.go:42-69) sort the results. If the server returns events in a different order across transports — or if pagination produces different orderings — this breaks the determinism guarantee that is the entire point of this PR.

Either all three should sort, or none should (if the server guarantees ordering). Given this feature targets determinism, I'd recommend all three apply the same sort for defense-in-depth.



Everything else looks good. The core design is sound and the error handling is correct.


Copilot AI review requested due to automatic review settings April 1, 2026 09:37
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.



@marcello33
Collaborator Author

@claude addressed your points. Review once more and resolve the related open conversations if the comments are addressed.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.



Comment on lines +1763 to +1773
if c.config.IsDeterministicStateSync(header.Number) {
log.Info("Using deterministic state sync", "cutoff", to.Unix())

eventRecords, err = c.HeimdallClient.StateSyncEventsByTime(c.ctx, from, to.Unix())
if err != nil {
// Match pre-fork resilience: log and return empty on transient errors.
// Determinism is preserved because all validators independently skip
// the same sprint, and events will be picked up in the next sprint.
log.Error("Error fetching deterministic state sync events", "fromID", from, "to", to.Unix(), "err", err)

stateSyncs := make([]*types.StateSyncData, 0)
return stateSyncs, nil

Copilot AI Apr 4, 2026


Post-fork deterministic mode currently logs and returns an empty state-sync set when StateSyncEventsByTime errors. That outcome depends on each node’s local Heimdall connectivity; if some validators succeed and others hit a transient error, they will apply different state updates and compute different state roots for the same block. For consensus-critical finalization, this should fail block processing (return the error) or use a deterministic fallback that guarantees all nodes derive the same result (e.g., only treat specific, globally-observable errors as skippable).

Collaborator Author


The pre-fork path has the same behavior (log + return empty). I'll probably keep this comment open and discuss with the team.
Anyway, the risk is mitigated by multiple factors:

  1. FetchWithRetry retries aggressively
  2. waitUntilHeimdallIsSynced gates CommitStates until heimdall is caught up
  3. CommitStates only runs at sprint boundaries

Copilot AI review requested due to automatic review settings April 4, 2026 18:58
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.




@claude claude bot left a comment



Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 consensus/bor/heimdall/failover_client.go:257-271 — The defer cancel() inside the nested for loop in cascadeClients (failover_client.go:261-265) is the classic Go "defer in loop" anti-pattern introduced by this PR — the original code called cancel() immediately after each fn() invocation, correctly limiting each child context's lifetime, but the new code defers all cancels until cascadeClients returns. As a result, each loop iteration that satisfies the !hasDeadline condition accumulates a deferred cancel from context.WithTimeout, meaning timer goroutines from earlier cascade attempts run concurrently with later ones until the function exits. The fix is to call cancel() directly after fn() rather than using defer.

    Extended reasoning...

    What the bug is and how it manifests

    In cascadeClients (failover_client.go), the PR changed from the original pattern of calling cancel() immediately after each fn() call to using defer cancel() inside the nested for loop. The relevant code (lines 257-271) iterates over candidate endpoint passes and, for each candidate that satisfies !hasDeadline, creates a child context via context.WithTimeout(ctx, f.attemptTimeout) and then defers the corresponding cancel: defer cancel(). In Go, defer statements are function-scoped, not block- or loop-scoped, so each iteration appends a new cancel to the deferred list rather than calling it at the end of the loop body.

    The specific code path that triggers it

    When cascadeClients iterates through multiple failover candidates — three passes of cooled, uncooled, and unhealthy endpoints — each candidate that lacks a parent deadline creates a new context.WithTimeout. The original code in the diff shows the pre-PR pattern with cancel() called right after fn(subCtx, f.clients[i]). The new code replaces this with defer cancel(), which defers execution until cascadeClients returns entirely.

    Why existing code does not prevent it

    Go's compiler does not warn about deferred functions inside loops. The change looks superficially similar to the correct pattern used in callWithFailover (where defer cancel() is acceptable because there is no loop), but the semantics differ critically in the loop context. Additionally, go vet with the -shadow flag and staticcheck would normally flag defer in a loop, but the build may not enforce this.

    What the impact would be

    For each cascade attempt through multiple unhealthy endpoints, a timer goroutine is created and kept alive until cascadeClients returns rather than being immediately released after each failed attempt. If cascading through N candidates, N timer goroutines run concurrently during the cascade. Since cascadeClients is typically short-lived (connection-refused errors return in milliseconds) and the number of endpoints is small (2-3 in practice), the practical severity is low — all contexts are cleaned up within seconds and no permanent leak occurs. However, this is a real anti-pattern regression from the original code.

    Step-by-step proof

    1. cascadeClients iterates: pass 1 (cooled), pass 2 (uncooled), pass 3 (unhealthy).
    2. For each candidate i where ctx has no deadline: subCtx, cancel = context.WithTimeout(ctx, f.attemptTimeout) is called.
    3. defer cancel() appends to the function's defer stack (not executed at loop iteration end).
    4. fn(subCtx, f.clients[i]) runs; if it fails, the next iteration begins with a new context.WithTimeout and another defer cancel().
    5. After all iterations, cascadeClients returns and all deferred cancels fire — but only then.
    6. Contrast with the pre-PR code: cancel() was called immediately after fn(), so each timer goroutine was released as soon as that attempt completed.

    How to fix it

    Replace defer cancel() with an immediate cancel() call after fn() returns (both on success and failure paths), matching the original code pattern. Alternatively, extract the inner loop body into a helper function so that defer cancel() is function-scoped to that helper and fires correctly at the end of each attempt.


@claude

claude bot commented Apr 4, 2026

Code Review

1 issue found. Checked for bugs and CLAUDE.md compliance.


Bug: Missing global timeout on StateSyncEventsAtHeight HTTP client pagination loop

File: consensus/bor/heimdall/client.go L310-313

StateSyncEventsAtHeight has no context.WithTimeout bounding its pagination loop. Every sibling method added in this PR wraps its loop with a 1-minute timeout (HTTP StateSyncEventsByTime at L365-368, gRPC StateSyncEventsAtHeight at state_sync.go L95-97, gRPC StateSyncEventsByTime at state_sync.go L161-163). This method is the only one missing it. If the Heimdall server consistently returns full pages, the loop runs indefinitely. The comment on L364-365 of StateSyncEventsByTime states the intent to match the gRPC implementation, suggesting this was an oversight.

Suggested fix: Add a global timeout at the top of StateSyncEventsAtHeight: ctx, cancel := context.WithTimeout(ctx, 1*time.Minute) / defer cancel()

Flagged per consensus-security.md (timeout handling for Heimdall RPC calls) and security-common.md (context propagation for operations that can block).

Copilot AI review requested due to automatic review settings April 4, 2026 20:07
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.



@claude

claude bot commented Apr 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@claude claude bot left a comment


LGTM — all review feedback addressed including the failover pagination timeout fix (1-minute budget set at MultiHeimdallClient level before callWithFailover, per the recommendation in the prior inline comments).

Extended reasoning...

Overview

19-file PR introducing deterministic state syncs for the Bor consensus layer. Core change: post-fork, CommitStates calls the new combined StateSyncEventsByTime endpoint (which resolves Heimdall height internally) instead of the non-deterministic (fromID, toTime) pattern. Changes span config (DeterministicStateSyncBlock), consensus logic, all three Heimdall transports (HTTP, gRPC, in-process app), the failover client, fork ID, and tests.

Security risks

No auth or crypto changes. The new API calls go to Heimdall endpoints (trusted internal dependency). The main risk is liveness: post-fork, Heimdall connectivity failures cause block proposals with empty state sync sets rather than hard crashes, matching the pre-fork resilience posture. This was an explicit design decision documented in code comments and accepted by the team.

Level of scrutiny

High scrutiny was appropriate and applied across 5+ review rounds. All raised issues were addressed:

  • Sort consistency across HTTP/gRPC/app transports
  • URL encoding with url.Values (not fmt.Sprintf)
  • 1-minute global pagination timeout at MultiHeimdallClient level (set before callWithFailover applies its 30s per-attempt cap)
  • StateSyncEventsByTime method in all mocks
  • Comprehensive failover tests for all three new methods
  • DeterministicStateSyncBlock in startup banner, fork ID, and JSON serialization
  • err shadowing, context chain growth, and other code quality issues

Other factors

The feature is gated by DeterministicStateSyncBlock: nil on mainnet/amoy, so it cannot activate until explicitly scheduled. The author confirmed manual testing on kurtosis devnets with chaos testing (2-vs-2 partition) with zero errors and state roots matching Erigon. The 19:53 inline comments from the previous approver run concerned an intermediate implementation; the final diff implements the fix those comments recommended (the timeout at the MultiHeimdallClient level).


@claude claude bot left a comment


My previous comment flagging a missing timeout in StateSyncEventsAtHeight was incorrect — the 1-minute context.WithTimeout is already present in the current diff at client.go:313. This PR is ready for human sign-off.

Extended reasoning...

Overview

This PR adds deterministic state sync support to Bor's consensus engine. It introduces DeterministicStateSyncBlock fork gating, a new StateSyncEventsByTime combined endpoint, StateSyncEventsAtHeight and GetBlockHeightByTime for backward-compatibility, and wires all three through the HTTP/gRPC/in-process-app transports plus the MultiHeimdallClient failover wrapper. 19 files changed.

Security risks

The PR touches consensus-critical code (CommitStates), Heimdall client authentication, and fork-ID peer compatibility. The error-handling choice (log + return empty post-fork, matching pre-fork behavior) was explicitly discussed and accepted. No auth bypass or injection risks identified.

Level of scrutiny

This is a high-scrutiny change. It modifies how validators derive state sync event sets — any divergence between validators produces consensus failures. The PR has gone through six rounds of review; all previously raised issues (filtering inconsistency, sort ordering, URL encoding, context shadowing, timeout propagation through callWithFailover, missing mock stubs) have been addressed. My last inline comment was a false positive — the fix was already in place.

Other factors

  • Codecov shows 11.8% patch coverage on the new transport code; the gRPC and app-client paths are 0% covered by automated tests. SonarQube flagged 26.4% duplication, primarily from the repeated mock stubs across test files.
  • The pre-existing dead code in internalFetchWithTimeout (unconditional 30s override) was acknowledged but not fixed in this PR.
  • go.mod uses a pseudo-version replace directive for heimdall-v2; a TODO marks it for cleanup post-release.
  • DeterministicStateSyncBlock is nil on all live networks, so the new path is not yet active in production.

A human reviewer should confirm the overall design and sign off before merge.

Copilot AI review requested due to automatic review settings April 6, 2026 09:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 1 comment.




claude bot commented Apr 6, 2026

Code Review

1 issue found. Checked for bugs and CLAUDE.md compliance.

Bug: Per-attempt timeout (30s) silently caps the intended 1-minute pagination budget

File: consensus/bor/heimdall/failover_client.go:113-122

MultiHeimdallClient.StateSyncEventsByTime sets a 1-minute global timeout (line 116), intending it to bound the entire paginated fetch. However, callWithFailover (line 183) applies attemptTimeout (default 30s, line 20) as a child context on each attempt. Since context.WithTimeout uses min(parent_deadline, new_timeout), the entire paginated loop within each attempt is capped at 30s, not 1 minute.

Additionally, HeimdallClient.StateSyncEventsByTime (client.go:278) sets its own redundant 1-minute timeout inside the already-tighter 30s context; this inner timeout can never fire.

Effect: With default settings, the paginated fetch for all pages must complete within 30s per failover attempt, not the intended 1 minute. During periods with many queued state sync events requiring multiple pages, this could cause premature timeouts, triggering unnecessary failover cascading and ultimately returning empty state syncs (the proposer misses a slot).

Note: The test TestMultiFailover_StateSyncEventsByTime_CapsAttemptTimeoutWithGlobalPaginationDeadline only validates the case where attemptTimeout (set to 2 min) > global timeout (1 min). It does not test the default case where attemptTimeout (30s) < global timeout (1 min), which is the actual production behavior.

Suggested fix: Remove the 1-minute timeout from HeimdallClient.StateSyncEventsByTime (redundant), and either increase attemptTimeout for this method or pass the 1-minute context directly to the fn closure bypassing the per-attempt cap.


sonarqubecloud bot commented Apr 6, 2026

Quality Gate failed

Failed conditions
17.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Comment on lines +52 to +57
func (m *mockHeimdallClient) GetBlockHeightByTime(context.Context, int64) (int64, error) {
return 0, nil
}
func (m *mockHeimdallClient) StateSyncEventsAtHeight(context.Context, uint64, int64, int64) ([]*clerk.EventRecordWithTime, error) {
return nil, nil
}


🟡 Multiple test mock structs across 5 files implement GetBlockHeightByTime and StateSyncEventsAtHeight as extra methods that are declared in neither the IHeimdallClient nor the Endpoint interface, making these stubs unreachable dead code. They are leftovers from an earlier two-step design iteration (get the Heimdall height, then fetch events by height) that was replaced by the unified StateSyncEventsByTime call; removing them would avoid falsely implying that failover test coverage exists for an AtHeight/ByHeight path that no longer exists in any interface.

Extended reasoning...

What the bug is and how it manifests

The PR refactored the deterministic state sync path to use a single unified StateSyncEventsByTime call instead of the previous two-step approach (GetBlockHeightByTime + StateSyncEventsAtHeight). The final IHeimdallClient interface in consensus/bor/heimdall.go and the Endpoint interface in consensus/bor/heimdall/failover_client.go both only declare StateSyncEventsByTime as the new method — neither interface contains GetBlockHeightByTime or StateSyncEventsAtHeight. Despite this, all test mock structs added or modified by this PR still implement both of those methods as extra struct methods:

  • eth/ethconfig/config_test.go lines 52-57: mockHeimdallClient.GetBlockHeightByTime and mockHeimdallClient.StateSyncEventsAtHeight
  • eth/handler_bor_test.go lines 68-73: mockHeimdall.GetBlockHeightByTime and mockHeimdall.StateSyncEventsAtHeight
  • consensus/bor/bor_test.go lines 122-127 and 2986-2991: same pattern on failingHeimdallClient and mockHeimdallClient
  • consensus/bor/span_store_test.go at multiple locations: MockHeimdallClient, MockOverlappingHeimdallClient, dynamicHeimdallClient, MockSyncStatusClient, and TimeoutHeimdallClient
  • consensus/bor/heimdall/failover_client_test.go lines 35-37: struct fields stateSyncEventsAtHeightFn and getBlockHeightByTimeFn, plus method implementations at lines 128-145

The specific code path that triggers it

In Go, interface satisfaction is structural — the compiler only checks that required methods are present. Extra struct methods not listed in the interface are completely invisible when using the interface type. Since GetBlockHeightByTime and StateSyncEventsAtHeight appear in neither IHeimdallClient nor Endpoint, no call site using either interface type can ever dispatch to these methods. They can only be called via concrete type, and no call site does that.

Why existing code does not prevent it

Go silently accepts extra methods on a struct beyond what an interface requires, and the compiler emits no warning. The tests all compile and pass correctly. This class of dead code can only be identified by cross-referencing interface definitions against struct method sets, which no automated tool enforces here.

Addressing the refutation

The sole refutation notes that "having extra methods on a struct beyond what an interface requires is valid Go code that compiles and runs correctly." This is entirely accurate. The code is not broken. However, correctness and the absence of dead code are separate concerns. The specific harm here is that failover_client_test.go carries two configurable function fields (stateSyncEventsAtHeightFn, getBlockHeightByTimeFn) and their corresponding method implementations, while the PR adds 5 failover tests each for GetBlockHeightByTime and StateSyncEventsAtHeight. This combination creates the strong false impression that there are working failover tests exercising those paths — but since neither method is in any interface, no production code dispatches through them. A reviewer auditing test coverage for the new deterministic sync path would count these as coverage that does not exist.

What the impact would be

No functional impact — all tests compile and pass. The issue is purely one of code clarity and reviewer confidence: dead stubs waste struct layout, make the interface contract harder to read at a glance, and the configurable function fields in failover_client_test.go create a false sense of test coverage for a code path (GetBlockHeightByTime + StateSyncEventsAtHeight) that was explicitly replaced by this PR.

How to fix it

Remove all GetBlockHeightByTime and StateSyncEventsAtHeight method implementations from the test mocks across all affected files, and remove the stateSyncEventsAtHeightFn and getBlockHeightByTimeFn fields from the mockHeimdallClient struct in failover_client_test.go. These methods are not required by any interface and will not break compilation or any test when removed.
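The structural-satisfaction point underlying this comment can be shown in a few lines. The interface and method names below are cut-down stand-ins for the real Endpoint interface and mocks, not the actual Bor code:

```go
package main

import "fmt"

// Endpoint declares only the unified method, mirroring the final design:
// the two-step AtHeight/ByHeight methods are absent from the interface.
type Endpoint interface {
	StateSyncEventsByTime(toTime int64) string
}

type mockClient struct{}

func (mockClient) StateSyncEventsByTime(int64) string { return "by-time" }

// Extra method: valid Go that compiles and runs, but it is invisible
// through the Endpoint interface type, so no interface-typed call site
// can ever dispatch to it. This is the "dead stub" pattern flagged above.
func (mockClient) StateSyncEventsAtHeight(uint64) string { return "at-height" }

func main() {
	var e Endpoint = mockClient{} // satisfied structurally; extra method ignored
	fmt.Println(e.StateSyncEventsByTime(0))
	// e.StateSyncEventsAtHeight(0) // would not compile: not in the interface
}
```

Because the compiler neither rejects nor warns about the extra method, only a manual cross-reference of interface definitions against mock method sets (or deleting the stubs and recompiling) surfaces this kind of dead code.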
