
Deterministic state syncs#2177

Open
marcello33 wants to merge 30 commits into develop from mardizzone/POS-3441_deterministic-ss

Conversation

@marcello33
Collaborator

@marcello33 marcello33 commented Mar 31, 2026

Description

Implements the bor side of deterministic state syncs. Post-fork, Bor switches from the non-deterministic (from_id, to_time) state sync query pattern to the combined GetStateSyncsByTime endpoint that resolves the Heimdall height internally and returns events in a single call. This ensures all validators derive identical state-sync sets regardless of when they query Heimdall.

Changes

  • Fork gating: Added DeterministicStateSyncBlock to BorConfig with IsDeterministicStateSync() check
  • Consensus logic: CommitStates branches on fork activation — post-fork uses the combined StateSyncEventsByTime endpoint. On errors, logs and returns empty (matching pre-fork resilience) instead of crashing block production
  • Heimdall clients: Added StateSyncEventsByTime to all three transports (HTTP, gRPC, in-process app). Consistent sorting and retry behavior across all transports.
  • Failover: MultiHeimdallClient wraps all new methods with the existing failover mechanism. Fixed flaky failover tests that used global hit counters instead of method-specific tracking.
  • Fork ID: DeterministicStateSyncBlock included in GatherForks() for peer compatibility
  • Startup banner: Fork block printed in ChainConfig.Description()

Activation

  • DeterministicStateSyncBlock: nil (disabled) on mainnet/amoy until HF is scheduled
  • Requires corresponding Heimdall-v2 changes (heimdall-v2#572) deployed first. Heimdall HF also needs to be triggered first.

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

HF needed.

Checklist

  • I have added at least 2 reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Tested on Kurtosis devnets with chaos testing (2-vs-2 network partition) and self-heal recovery. Many rounds of chaos with zero errors; state roots match Erigon.

Copilot AI review requested due to automatic review settings March 31, 2026 18:53

@claude claude bot left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.


@marcello33 marcello33 changed the title Deterministic state syncs. Deterministic state syncs Mar 31, 2026
Contributor

Copilot AI left a comment


Pull request overview

Introduces a fork-gated “deterministic state sync” flow for Bor by resolving a cutoff-time → Heimdall height, then querying state-sync events “visible at” that fixed Heimdall height to avoid non-determinism from querying a moving Heimdall head.

Changes:

  • Added Bor config fork flag DeterministicStateSyncBlock and plumbing to include it in fork ID gathering.
  • Extended Heimdall client interfaces (HTTP/gRPC/in-process app + failover wrapper) with GetBlockHeightByTime and StateSyncEventsAtHeight, and switched Bor.CommitStates to use them post-fork.
  • Added/updated tests and mocks to cover fork switching and URL formatting.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
consensus/bor/bor.go Switches state sync fetching logic to deterministic (height-anchored) mode post-fork.
consensus/bor/heimdall.go Extends IHeimdallClient with the new deterministic state-sync APIs.
consensus/bor/heimdall/client.go Implements the new endpoints for the HTTP Heimdall client + URL builders.
consensus/bor/heimdallgrpc/state_sync.go Implements gRPC equivalents for height-anchored state sync and height-by-time lookup.
consensus/bor/heimdallapp/state_sync.go Implements in-process HeimdallApp equivalents using keeper/query server.
consensus/bor/heimdall/failover_client.go Extends failover endpoint interface and wrapper calls for the new APIs.
consensus/bor/heimdall/metrics.go Adds request-type metrics for the new endpoints.
consensus/bor/heimdall/state_sync_url_test.go Adds URL-format tests for the new HTTP paths (and validates RFC3339 formatting).
params/config.go Adds DeterministicStateSyncBlock and IsDeterministicStateSync.
params/config_test.go Unit tests for IsDeterministicStateSync.
core/forkid/forkid.go Includes the new fork block in Bor fork gathering for fork ID computation.
consensus/bor/bor_test.go Adds tests verifying pre-/post-fork behavior and failure semantics post-fork.
tests/bor/mocks/IHeimdallClient.go Regenerated gomock for updated IHeimdallClient.
consensus/bor/span_store_test.go Updates test mocks to satisfy the extended Heimdall client interface.
eth/handler_bor_test.go Updates test mock to satisfy the extended Heimdall client interface.
eth/ethconfig/config_test.go Updates test mock to satisfy the extended Heimdall client interface.
consensus/bor/heimdall/failover_client_test.go Updates mock endpoint used by failover tests to satisfy the extended interface.
go.mod Pins/upgrades github.com/0xPolygon/heimdall-v2 to a newer pseudo-version (via replace).
go.sum Updates checksums for the new heimdall-v2 version.


@marcello33
Collaborator Author

@claude review

@claude

claude bot commented Mar 31, 2026

Claude finished @marcello33's task in 1m 50s —— View job


Code Review: Deterministic State Syncs

  • Gather context and read changed files
  • Review consensus logic changes (bor.go)
  • Review Heimdall client changes (HTTP, gRPC, app)
  • Review config and fork ID changes
  • Review tests
  • Post final review

Overall this is a well-structured PR. The fork-gating approach is correct, the error handling post-fork is properly strict (returning errors instead of silently swallowing), and the test coverage for the fork switch is good. Below are findings organized by severity.


HIGH: HTTP client does client-side filtering that gRPC/app clients do not

In consensus/bor/heimdall/client.go:329-343, the HTTP StateSyncEventsAtHeight applies a client-side filter:

if e.Id >= fromID && e.RecordTime.Before(time.Unix(toTime, 0)) {

But the gRPC implementation (heimdallgrpc/state_sync.go:125-137) and the in-process app implementation (heimdallapp/state_sync.go) trust the server response and append all returned events without filtering.

This means the three transport implementations have different behavior if the Heimdall endpoint returns events outside the expected range. For a deterministic state sync feature, all paths must produce identical results. Either:

  1. All clients should apply the same client-side filter (defensive), or
  2. All clients should trust the server (if the server guarantees correct filtering)

The existing StateSyncEvents has the same asymmetry (HTTP filters, gRPC doesn't), but for a feature explicitly targeting determinism, this inconsistency is worth resolving.



MEDIUM: queryCtx variable is unnecessary

In consensus/bor/bor.go:1763, queryCtx is assigned c.ctx and then used everywhere instead of c.ctx directly. This adds no value — it's not a derived context (no timeout, no cancellation). Consider removing the alias to reduce noise, or if the intent is to add a timeout later, add a TODO.


MEDIUM: log.Info in hot pagination loop

consensus/bor/heimdall/client.go:319:

log.Info("Fetching state sync events at height", "queryParams", u.RawQuery)

This logs at Info level inside the pagination loop, so it fires on every page. For large state sync batches this will be noisy. Consider log.Debug for per-page logs, keeping Info for the initial call only (similar to how the gRPC path logs once before the loop).


MEDIUM: TODO comments left in consensus-critical code

There are two identical TODO comments in bor.go:1769-1771 and bor.go:1779-1781:

// TODO marcello double check: if we silently return "no state syncs" here,
//  different validators could end up deriving different state sync sets...

The code already does the right thing (returns an error instead of silently swallowing). The TODOs are now misleading since they describe a concern that the code already addresses. These should be removed or replaced with a brief comment confirming the intentional design choice (e.g., // Post-fork: fail hard to preserve determinism).


LOW: Typo in test comment

consensus/bor/bor_test.go:5369:

// deterministi state sync

Should be deterministic state sync. (Also flagged by Copilot.)



LOW: to_time query parameter is not URL-encoded

In visibleAtHeightURL (client.go:537-540), the RFC3339Nano-formatted time is inserted into the query string via fmt.Sprintf without URL-encoding. RFC3339 timestamps contain + characters for positive timezone offsets (though UTC uses Z). Since the code explicitly uses .UTC(), this is safe today, but it's fragile — if the .UTC() call were ever removed, + would be interpreted as a space by the server. Consider using url.QueryEscape or url.Values for robustness.


LOW: No retry on the events fetch in the HTTP client

StateSyncEventsAtHeight uses bare Fetch (no retry), while GetBlockHeightByTime uses FetchWithRetry. Retrying the single height lookup is a reasonable choice, but it means a transient failure on the events endpoint (e.g. a 503 from Heimdall) fails immediately. That said, the existing StateSyncEvents also uses bare Fetch rather than FetchWithRetry, so the new code is consistent with current behavior; no change required.


Observation: go.mod replace directive

The replace directive for heimdall-v2 pins a pseudo-version. As Copilot noted, if this is temporary (waiting for a tagged release), it would be good to track when this can be cleaned up. This is fine for development but should not ship to mainnet without a proper version pin.


Summary

The core design is sound — anchoring state sync queries to a resolved Heimdall height is the right approach for determinism. The error handling post-fork correctly fails loudly rather than silently returning empty results. The main actionable finding is the client-side filtering inconsistency across HTTP/gRPC/app transports, which should be resolved to ensure all paths produce identical event sets.

@claude

claude bot commented Mar 31, 2026

test

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 24.71910% with 134 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.89%. Comparing base (a1c5b22) to head (85521b2).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines Patch % Lines
consensus/bor/heimdallgrpc/state_sync.go 0.00% 53 Missing ⚠️
consensus/bor/heimdall/client.go 14.28% 42 Missing ⚠️
consensus/bor/heimdallapp/state_sync.go 0.00% 25 Missing ⚠️
tests/bor/mocks/IHeimdallClient.go 0.00% 11 Missing ⚠️
params/config.go 40.00% 3 Missing ⚠️

❌ Your patch check has failed because the patch coverage (24.71%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2177      +/-   ##
===========================================
- Coverage    51.92%   51.89%   -0.03%     
===========================================
  Files          884      884              
  Lines       155431   155607     +176     
===========================================
+ Hits         80701    80752      +51     
- Misses       69524    69647     +123     
- Partials      5206     5208       +2     
Files with missing lines Coverage Δ
consensus/bor/bor.go 83.87% <100.00%> (+0.23%) ⬆️
consensus/bor/heimdall/failover_client.go 95.74% <100.00%> (+0.21%) ⬆️
consensus/bor/heimdall/metrics.go 12.50% <ø> (ø)
core/forkid/forkid.go 76.69% <100.00%> (+0.11%) ⬆️
params/config.go 36.84% <40.00%> (+0.02%) ⬆️
tests/bor/mocks/IHeimdallClient.go 0.00% <0.00%> (ø)
consensus/bor/heimdallapp/state_sync.go 0.00% <0.00%> (ø)
consensus/bor/heimdall/client.go 45.05% <14.28%> (-4.79%) ⬇️
consensus/bor/heimdallgrpc/state_sync.go 0.00% <0.00%> (ø)

... and 20 files with indirect coverage changes



@marcello33
Collaborator Author

@claude addressed your points.
Here's a recap.

  • HTTP client-side filtering inconsistency: since the server already filters correctly, removed the client-side filter for consistency across all three transports.
  • Unnecessary queryCtx alias: removed, using c.ctx directly.
  • log.Info in pagination loop: changed to log.Debug to avoid noise.
  • Misleading TODO comments: removed.
  • to_time not URL-encoded in visibleAtHeightURL: switched from fmt.Sprintf to url.Values for proper encoding.
  • Typo in test comment: fixed.

Re-review.

@claude

claude bot commented Apr 1, 2026

Claude finished @marcello33's task in 1m 14s —— View job


Re-review: Deterministic State Syncs

  • Verify previous review fixes
  • Review full diff for new issues
  • Post final review

All six items from the previous review have been addressed correctly:

  1. Client-side filtering removed from HTTP client: client.go:328-340 now appends all server-returned events without filtering, matching the gRPC and app paths. ✅
  2. queryCtx alias removed: bor.go:1764 uses c.ctx directly. ✅
  3. log.Info changed to log.Debug in the pagination loop: client.go:318. ✅
  4. TODO comments removed: replaced with clear // Post-fork: fail hard to preserve determinism comments at bor.go:1766,1774. ✅
  5. url.Values used for URL encoding: visibleAtHeightURL (client.go:536-541) now builds the query with url.Values. ✅
  6. Typo fixed: bor_test.go:5370. ✅

Remaining finding: Sort inconsistency across transports

The HTTP client sorts events after fetching (client.go:350-352):

sort.SliceStable(eventRecords, func(i, j int) bool {
    return eventRecords[i].ID < eventRecords[j].ID
})

But neither the gRPC client (heimdallgrpc/state_sync.go:90-148) nor the app client (heimdallapp/state_sync.go:42-69) sort the results. If the server returns events in a different order across transports — or if pagination produces different orderings — this breaks the determinism guarantee that is the entire point of this PR.

Either all three should sort, or none should (if the server guarantees ordering). Given this feature targets determinism, I'd recommend all three apply the same sort for defense-in-depth.



Everything else looks good. The core design is sound and the error handling is correct.


Copilot AI review requested due to automatic review settings April 1, 2026 09:37
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.



@marcello33
Collaborator Author

@claude addressed your points. Review once more and resolve the related open conversations if the comments are addressed.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.



Comment on lines +1763 to +1773
if c.config.IsDeterministicStateSync(header.Number) {
log.Info("Using deterministic state sync", "cutoff", to.Unix())

eventRecords, err = c.HeimdallClient.StateSyncEventsByTime(c.ctx, from, to.Unix())
if err != nil {
// Match pre-fork resilience: log and return empty on transient errors.
// Determinism is preserved because all validators independently skip
// the same sprint, and events will be picked up in the next sprint.
log.Error("Error fetching deterministic state sync events", "fromID", from, "to", to.Unix(), "err", err)

stateSyncs := make([]*types.StateSyncData, 0)
return stateSyncs, nil

Copilot AI Apr 4, 2026


Post-fork deterministic mode currently logs and returns an empty state-sync set when StateSyncEventsByTime errors. That outcome depends on each node’s local Heimdall connectivity; if some validators succeed and others hit a transient error, they will apply different state updates and compute different state roots for the same block. For consensus-critical finalization, this should fail block processing (return the error) or use a deterministic fallback that guarantees all nodes derive the same result (e.g., only treat specific, globally-observable errors as skippable).

Collaborator Author


The pre-fork path has the same behavior (log + return empty). I'll probably keep this comment open and discuss with the team.
Anyway, the risk is mitigated by multiple factors:

  1. FetchWithRetry retries aggressively
  2. waitUntilHeimdallIsSynced gates CommitStates until heimdall is caught up
  3. CommitStates only runs at sprint boundaries

Copilot AI review requested due to automatic review settings April 4, 2026 18:58
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.




@claude claude bot left a comment



Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 consensus/bor/heimdall/failover_client.go:257-271 — The defer cancel() inside the nested for loop in cascadeClients (failover_client.go:261-265) is the classic Go "defer in loop" anti-pattern introduced by this PR — the original code called cancel() immediately after each fn() invocation, correctly limiting each child context's lifetime, but the new code defers all cancels until cascadeClients returns. As a result, each loop iteration that satisfies the !hasDeadline condition accumulates a deferred cancel from context.WithTimeout, meaning timer goroutines from earlier cascade attempts run concurrently with later ones until the function exits. The fix is to call cancel() directly after fn() rather than using defer.

    Extended reasoning...

    What the bug is and how it manifests

    In cascadeClients (failover_client.go), the PR changed from the original pattern of calling cancel() immediately after each fn() call to using defer cancel() inside the nested for loop. The relevant code (lines 257-271) iterates over candidate endpoint passes and, for each candidate that satisfies !hasDeadline, creates a child context via context.WithTimeout(ctx, f.attemptTimeout) and then defers the corresponding cancel: defer cancel(). In Go, defer statements are function-scoped, not block- or loop-scoped, so each iteration appends a new cancel to the deferred list rather than calling it at the end of the loop body.

    The specific code path that triggers it

    When cascadeClients iterates through multiple failover candidates — three passes of cooled, uncooled, and unhealthy endpoints — each candidate that lacks a parent deadline creates a new context.WithTimeout. The original code in the diff shows the pre-PR pattern with cancel() called right after fn(subCtx, f.clients[i]). The new code replaces this with defer cancel(), which defers execution until cascadeClients returns entirely.

    Why existing code does not prevent it

    Go's compiler does not warn about deferred functions inside loops. The change looks superficially similar to the correct pattern used in callWithFailover (where defer cancel() is acceptable because there is no loop), but the semantics differ critically in the loop context. Additionally, go vet with the -shadow flag and staticcheck would normally flag defer in a loop, but the build may not enforce this.

    What the impact would be

    For each cascade attempt through multiple unhealthy endpoints, a timer goroutine is created and kept alive until cascadeClients returns rather than being immediately released after each failed attempt. If cascading through N candidates, N timer goroutines run concurrently during the cascade. Since cascadeClients is typically short-lived (connection-refused errors return in milliseconds) and the number of endpoints is small (2-3 in practice), the practical severity is low — all contexts are cleaned up within seconds and no permanent leak occurs. However, this is a real anti-pattern regression from the original code.

    Step-by-step proof

    1. cascadeClients iterates: pass 1 (cooled), pass 2 (uncooled), pass 3 (unhealthy).
    2. For each candidate i where ctx has no deadline: subCtx, cancel = context.WithTimeout(ctx, f.attemptTimeout) is called.
    3. defer cancel() appends to the function's defer stack (not executed at loop iteration end).
    4. fn(subCtx, f.clients[i]) runs; if it fails, the next iteration begins with a new context.WithTimeout and another defer cancel().
    5. After all iterations, cascadeClients returns and all deferred cancels fire — but only then.
    6. Contrast with the pre-PR code: cancel() was called immediately after fn(), so each timer goroutine was released as soon as that attempt completed.

    How to fix it

    Replace defer cancel() with an immediate cancel() call after fn() returns (both on success and failure paths), matching the original code pattern. Alternatively, extract the inner loop body into a helper function so that defer cancel() is function-scoped to that helper and fires correctly at the end of each attempt.


@claude

claude bot commented Apr 4, 2026

Code Review

1 issue found. Checked for bugs and CLAUDE.md compliance.


Bug: Missing global timeout on StateSyncEventsAtHeight HTTP client pagination loop

File: consensus/bor/heimdall/client.go L310-313

StateSyncEventsAtHeight has no context.WithTimeout bounding its pagination loop. Every sibling method added in this PR wraps its loop with a 1-minute timeout (HTTP StateSyncEventsByTime at L365-368, gRPC StateSyncEventsAtHeight at state_sync.go L95-97, gRPC StateSyncEventsByTime at state_sync.go L161-163). This method is the only one missing it. If the Heimdall server consistently returns full pages, the loop runs indefinitely. The comment on L364-365 of StateSyncEventsByTime states the intent to match the gRPC implementation, suggesting this was an oversight.

Suggested fix: Add a global timeout at the top of StateSyncEventsAtHeight: ctx, cancel := context.WithTimeout(ctx, 1*time.Minute) / defer cancel()

Flagged per consensus-security.md (timeout handling for Heimdall RPC calls) and security-common.md (context propagation for operations that can block).

Copilot AI review requested due to automatic review settings April 4, 2026 20:07
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.



@claude

claude bot commented Apr 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@claude claude bot left a comment


LGTM — all review feedback addressed including the failover pagination timeout fix (1-minute budget set at MultiHeimdallClient level before callWithFailover, per the recommendation in the prior inline comments).

Extended reasoning...

Overview

19-file PR introducing deterministic state syncs for the Bor consensus layer. Core change: post-fork, CommitStates calls the new combined StateSyncEventsByTime endpoint (which resolves Heimdall height internally) instead of the non-deterministic (fromID, toTime) pattern. Changes span config (DeterministicStateSyncBlock), consensus logic, all three Heimdall transports (HTTP, gRPC, in-process app), the failover client, fork ID, and tests.

Security risks

No auth or crypto changes. The new API calls go to Heimdall endpoints (trusted internal dependency). The main risk is liveness: post-fork, Heimdall connectivity failures cause block proposals with empty state sync sets rather than hard crashes, matching the pre-fork resilience posture. This was an explicit design decision documented in code comments and accepted by the team.

Level of scrutiny

High scrutiny was appropriate and applied across 5+ review rounds. All raised issues were addressed:

  • Sort consistency across HTTP/gRPC/app transports
  • URL encoding with url.Values (not fmt.Sprintf)
  • 1-minute global pagination timeout at MultiHeimdallClient level (set before callWithFailover applies its 30s per-attempt cap)
  • StateSyncEventsByTime method in all mocks
  • Comprehensive failover tests for all three new methods
  • DeterministicStateSyncBlock in startup banner, fork ID, and JSON serialization
  • err shadowing, context chain growth, and other code quality issues

Other factors

The feature is gated by DeterministicStateSyncBlock: nil on mainnet/amoy, so it cannot activate until explicitly scheduled. The author confirmed manual testing on kurtosis devnets with chaos testing (2-vs-2 partition) with zero errors and state roots matching Erigon. The 19:53 inline comments from the previous approver run concerned an intermediate implementation; the final diff implements the fix those comments recommended (the timeout at the MultiHeimdallClient level).


@claude claude bot left a comment


My previous comment flagging a missing timeout in StateSyncEventsAtHeight was incorrect — the 1-minute context.WithTimeout is already present in the current diff at client.go:313. This PR is ready for human sign-off.

Extended reasoning...

Overview

This PR adds deterministic state sync support to Bor's consensus engine. It introduces DeterministicStateSyncBlock fork gating, a new StateSyncEventsByTime combined endpoint, StateSyncEventsAtHeight and GetBlockHeightByTime for backward-compatibility, and wires all three through the HTTP/gRPC/in-process-app transports plus the MultiHeimdallClient failover wrapper. 19 files changed.

Security risks

The PR touches consensus-critical code (CommitStates), Heimdall client authentication, and fork-ID peer compatibility. The error-handling choice (log + return empty post-fork, matching pre-fork behavior) was explicitly discussed and accepted. No auth bypass or injection risks identified.

Level of scrutiny

This is a high-scrutiny change. It modifies how validators derive state sync event sets — any divergence between validators produces consensus failures. The PR has gone through six rounds of review; all previously raised issues (filtering inconsistency, sort ordering, URL encoding, context shadowing, timeout propagation through callWithFailover, missing mock stubs) have been addressed. My last inline comment was a false positive — the fix was already in place.

Other factors

  • Codecov shows 11.8% patch coverage on the new transport code; the gRPC and app-client paths are 0% covered by automated tests. SonarQube flagged 26.4% duplication, primarily from the repeated mock stubs across test files.
  • The pre-existing dead code in internalFetchWithTimeout (unconditional 30s override) was acknowledged but not fixed in this PR.
  • go.mod uses a pseudo-version replace directive for heimdall-v2; a TODO marks it for cleanup post-release.
  • DeterministicStateSyncBlock is nil on all live networks, so the new path is not yet active in production.

A human reviewer should confirm the overall design and sign off before merge.

Copilot AI review requested due to automatic review settings April 6, 2026 09:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 1 comment.




claude bot commented Apr 6, 2026

Code Review

1 issue found. Checked for bugs and CLAUDE.md compliance.

Bug: Per-attempt timeout (30s) silently caps the intended 1-minute pagination budget

File: consensus/bor/heimdall/failover_client.go:113-122

MultiHeimdallClient.StateSyncEventsByTime sets a 1-minute global timeout (line 116), intending it to bound the entire paginated fetch. However, callWithFailover (line 183) applies attemptTimeout (default 30s, line 20) as a child context on each attempt. Since context.WithTimeout uses min(parent_deadline, new_timeout), the entire paginated loop within each attempt is capped at 30s, not 1 minute.

Additionally, HeimdallClient.StateSyncEventsByTime (client.go:278) sets its own redundant 1-minute timeout inside the already-tighter 30s context; this inner timeout can never fire.

Effect: With default settings, the paginated fetch for all pages must complete within 30s per failover attempt, not the intended 1 minute. During periods with many queued state sync events requiring multiple pages, this could cause premature timeouts, triggering unnecessary failover cascading and ultimately returning empty state syncs (the proposer misses a slot).

Note: The test TestMultiFailover_StateSyncEventsByTime_CapsAttemptTimeoutWithGlobalPaginationDeadline only validates the case where attemptTimeout (set to 2 min) > global timeout (1 min). It does not test the default case where attemptTimeout (30s) < global timeout (1 min), which is the actual production behavior.

Suggested fix: Remove the 1-minute timeout from HeimdallClient.StateSyncEventsByTime (redundant), and either increase attemptTimeout for this method or pass the 1-minute context directly to the fn closure bypassing the per-attempt cap.


sonarqubecloud bot commented Apr 6, 2026

Quality Gate failed

Failed conditions
17.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Comment on lines +52 to +57
func (m *mockHeimdallClient) GetBlockHeightByTime(context.Context, int64) (int64, error) {
return 0, nil
}
func (m *mockHeimdallClient) StateSyncEventsAtHeight(context.Context, uint64, int64, int64) ([]*clerk.EventRecordWithTime, error) {
return nil, nil
}


🟡 Multiple test mock structs across 5 files implement GetBlockHeightByTime and StateSyncEventsAtHeight as extra methods that are declared in neither the IHeimdallClient nor the Endpoint interface, making these stubs unreachable dead code. They are leftovers from an earlier two-step design iteration (get the Heimdall height, then fetch events by height) that was replaced by the unified StateSyncEventsByTime call; removing them would avoid falsely implying that failover test coverage exists for an AtHeight/ByHeight path that no longer exists in any interface.

Extended reasoning...

What the bug is and how it manifests

The PR refactored the deterministic state sync path to use a single unified StateSyncEventsByTime call instead of the previous two-step approach (GetBlockHeightByTime + StateSyncEventsAtHeight). The final IHeimdallClient interface in consensus/bor/heimdall.go and the Endpoint interface in consensus/bor/heimdall/failover_client.go both only declare StateSyncEventsByTime as the new method — neither interface contains GetBlockHeightByTime or StateSyncEventsAtHeight. Despite this, all test mock structs added or modified by this PR still implement both of those methods as extra struct methods:

  • eth/ethconfig/config_test.go lines 52-57: mockHeimdallClient.GetBlockHeightByTime and mockHeimdallClient.StateSyncEventsAtHeight
  • eth/handler_bor_test.go lines 68-73: mockHeimdall.GetBlockHeightByTime and mockHeimdall.StateSyncEventsAtHeight
  • consensus/bor/bor_test.go lines 122-127 and 2986-2991: same pattern on failingHeimdallClient and mockHeimdallClient
  • consensus/bor/span_store_test.go at multiple locations: MockHeimdallClient, MockOverlappingHeimdallClient, dynamicHeimdallClient, MockSyncStatusClient, and TimeoutHeimdallClient
  • consensus/bor/heimdall/failover_client_test.go lines 35-37: struct fields stateSyncEventsAtHeightFn and getBlockHeightByTimeFn, plus method implementations at lines 128-145

The specific code path that triggers it

In Go, interface satisfaction is structural — the compiler only checks that required methods are present. Extra struct methods not listed in the interface are completely invisible when using the interface type. Since GetBlockHeightByTime and StateSyncEventsAtHeight appear in neither IHeimdallClient nor Endpoint, no call site using either interface type can ever dispatch to these methods. They can only be called via concrete type, and no call site does that.

Why existing code does not prevent it

Go silently accepts extra methods on a struct beyond what an interface requires, and the compiler emits no warning. The tests all compile and pass correctly. This class of dead code can only be identified by cross-referencing interface definitions against struct method sets, which no automated tool enforces here.

Addressing the refutation

The sole refutation notes that "having extra methods on a struct beyond what an interface requires is valid Go code that compiles and runs correctly." This is entirely accurate. The code is not broken. However, correctness and the absence of dead code are separate concerns. The specific harm here is that failover_client_test.go carries two configurable function fields (stateSyncEventsAtHeightFn, getBlockHeightByTimeFn) and their corresponding method implementations, while the PR adds 5 failover tests each for GetBlockHeightByTime and StateSyncEventsAtHeight. This combination creates the strong false impression that there are working failover tests exercising those paths — but since neither method is in any interface, no production code dispatches through them. A reviewer auditing test coverage for the new deterministic sync path would count these as coverage that does not exist.

What the impact would be

No functional impact — all tests compile and pass. The issue is purely one of code clarity and reviewer confidence: dead stubs waste struct layout, make the interface contract harder to read at a glance, and the configurable function fields in failover_client_test.go create a false sense of test coverage for a code path (GetBlockHeightByTime + StateSyncEventsAtHeight) that was explicitly replaced by this PR.

How to fix it

Remove all GetBlockHeightByTime and StateSyncEventsAtHeight method implementations from the test mocks across all affected files, and remove the stateSyncEventsAtHeightFn and getBlockHeightByTimeFn fields from the mockHeimdallClient struct in failover_client_test.go. These methods are not required by any interface and will not break compilation or any test when removed.
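The structural-satisfaction point underlying this comment can be shown in a few lines. The interface and method names below are cut-down stand-ins for the real Endpoint interface and mocks, not the actual Bor code:

```go
package main

import "fmt"

// Endpoint declares only the unified method, mirroring the final design:
// the two-step AtHeight/ByHeight methods are absent from the interface.
type Endpoint interface {
	StateSyncEventsByTime(toTime int64) string
}

type mockClient struct{}

func (mockClient) StateSyncEventsByTime(int64) string { return "by-time" }

// Extra method: valid Go that compiles and runs, but it is invisible
// through the Endpoint interface type, so no interface-typed call site
// can ever dispatch to it. This is the "dead stub" pattern flagged above.
func (mockClient) StateSyncEventsAtHeight(uint64) string { return "at-height" }

func main() {
	var e Endpoint = mockClient{} // satisfied structurally; extra method ignored
	fmt.Println(e.StateSyncEventsByTime(0))
	// e.StateSyncEventsAtHeight(0) // would not compile: not in the interface
}
```

Because the compiler neither rejects nor warns about the extra method, only a manual cross-reference of interface definitions against mock method sets (or deleting the stubs and recompiling) surfaces this kind of dead code.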
