Skip to content

chore: Better check for stored meta not stale#3221

Open
alpe wants to merge 2 commits intomainfrom
alex/exec_check
Open

chore: Better check for stored meta not stale#3221
alpe wants to merge 2 commits intomainfrom
alex/exec_check

Conversation

@alpe
Copy link
Copy Markdown
Contributor

@alpe alpe commented Mar 31, 2026

Overview

Enhanced stale block detection during failover to prevent resuming outdated execution states.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced validation of execution metadata to prevent resumption of stale in-progress executions with mismatched transaction inputs or timestamps.

@alpe alpe requested a review from julienrbrt March 31, 2026 15:22
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedMar 31, 2026, 3:24 PM

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 31, 2026

Claude finished @alpe's task in 1m 53s —— View job


Code Review

  • Gather context and read changed files
  • Review execution/evm/execution.go
  • Review execution/evm/execution_reconcile_test.go
  • Summarize findings

Summary

This PR tightens the staleness check for in-progress (ExecStageStarted) execution metadata before resuming from a cached payloadID. Previously, the code would attempt to resume any stored payload at a given height without verifying that the stored timestamp and transaction set match the current request — a subtle correctness bug that could cause resume attempts with mismatched block inputs. The fix is correct and well-scoped.

The hashTxs helper extraction is a clean refactor with no behavior change.


Issues

1. Missing test: inputs match but EL payload is stale (execution_reconcile_test.go)

The resume_when_inputs_match case only tests the happy path where GetPayload succeeds. There's no test for the branch where inputs match but GetPayload returns an error (stale payload after node restart). That path should return found=false and payloadID=nil while logging a debug message. Currently, that branch is covered only by existing tests elsewhere (if any) — consider adding:

"stale_payload_when_inputs_match": {
    execMetaTimestamp: 1700000012,
    execMetaTxs:       [][]byte{[]byte("tx-1")},
    requestedTxs:      [][]byte{[]byte("tx-1")},
    requestedTime:     time.Unix(1700000012, 0),
    expectFound:       false,
    expectPayloadID:   false,
    expectGetPayloads: 1, // tried but failed
    // requires payloads map to NOT contain the payloadID
},

Fix this →

2. Log message missing tx hash context (execution.go:761-765)

When the mismatch is due to differing tx hashes (not timestamps), the log field names are misleading — only execmeta_timestamp and requested_timestamp are logged, implying the mismatch is always a timestamp issue. Consider adding a field or adjusting the message to reflect when it's a tx hash mismatch:

c.logger.Debug().
    Uint64("height", height).
    Int64("execmeta_timestamp", execMeta.Timestamp).
    Int64("requested_timestamp", timestamp.Unix()).
    Bool("timestamp_match", execMeta.Timestamp == timestamp.Unix()).
    Bool("txhash_match", bytes.Equal(execMeta.TxHash, requestedTxHash)).
    Msg("ExecuteTxs: ignoring stale in-progress execution for different block inputs")

Fix this →


Observations (non-blocking)

hashTxs with nil vs empty: hashTxs(nil) and hashTxs([][]byte{}) both return nil. Since bytes.Equal(nil, nil) == true, two empty-transaction blocks will correctly match. This is consistent between saveExecMeta and the comparison site. Worth noting that existing ExecMeta entries saved before this field existed will have TxHash == nil, which will match any empty-tx request — but that's an existing edge case, not introduced here.

Test isolation: mockReconcileEngineRPCClient uses a non-atomic getPayloadCalls counter. Since tests are sequential this is fine, but worth noting for future parallel test refactors.

Overall: This is a focused, well-reasoned correctness fix. The refactored hashTxs helper is cleaner than the inline version. The table-driven tests cover the three primary scenarios clearly.

@alpe alpe changed the title Better check for store meta not stale chore: Better check for store meta not stale Mar 31, 2026
@alpe alpe changed the title chore: Better check for store meta not stale chore: Better check for stored meta not stale Mar 31, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

The changes add improved validation for cached execution metadata during reconciliation. When resuming a started execution, the system now verifies that the execution timestamp and transaction set match the stored metadata before attempting to retrieve the payload, reducing the risk of stale cached data being used. Supporting changes include a refactored transaction hash computation helper and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Documentation Update
CHANGELOG.md
Reformatted changelog entries to use consistent bullet-point notation across all version sections; added entry for stale metadata check improvement.
Execution Metadata Reconciliation
execution/evm/execution.go
Added pre-check validation when reconciling ExecStageStarted metadata by comparing stored metadata's timestamp and transaction hash against current inputs before attempting payload retrieval. Refactored transaction hash computation into new hashTxs() helper function that returns nil for empty transaction sets. Updated control-flow logging to handle stale execution cases.
Reconciliation Testing
execution/evm/execution_reconcile_test.go
Added new test TestReconcileExecutionAtHeight_StartedExecMeta covering three scenarios: matching inputs (returns found with payload ID and one GetPayload call), mismatched timestamp, and mismatched transactions (both return not found with no payload calls). Includes local mock implementations for engine and Ethereum RPC clients.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested reviewers

  • tac0turtle
  • randygrok

Poem

🐰 Hops of joy through execution's flow,
Stale metadata? No, no, no!
With timestamps and hashes we now align,
Fresh payloads only—such fine design! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: improving validation of stored execution metadata to prevent using stale cached state during failover scenarios.
Description check ✅ Passed The description is minimal but covers the essential context—it explains the goal and rationale, though without explicit linking to issue #3221 or detailed technical explanation of the validation improvements.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch alex/exec_check

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.43%. Comparing base (4a70e0b) to head (0d5a6f5).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3221   +/-   ##
=======================================
  Coverage   61.43%   61.43%           
=======================================
  Files         120      120           
  Lines       12504    12504           
=======================================
  Hits         7682     7682           
  Misses       3960     3960           
  Partials      862      862           
Flag Coverage Δ
combined 61.43% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! this makes #3172 more readable as well so great

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
execution/evm/execution_reconcile_test.go (1)

19-57: Use a slice-backed table so the subtests run in a stable order.

range over a map randomizes the case order, which makes failures a bit noisier to reproduce. A slice keeps this table-driven without losing determinism.

As per coding guidelines, "Use table-driven tests in Go unit tests" and "Ensure tests are deterministic in Go."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@execution/evm/execution_reconcile_test.go` around lines 19 - 57, The test
currently builds specs as a map (variable specs) and ranges over it, which
randomizes subtest order; change specs to a slice of structs that include a name
field (e.g., type case struct { name string; execMetaTimestamp int64;
execMetaTxs [][]byte; requestedTxs [][]byte; requestedTime time.Time;
expectFound bool; expectPayloadID bool; expectGetPayloads int }) and populate it
in the desired stable order, then iterate over that slice (for _, spec := range
cases { t.Run(spec.name, func(t *testing.T) { ... }) }) instead of ranging the
map so subtests run deterministically while keeping the same assertions and
using the same symbols (specs -> cases, t.Run).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 333: Remove the orphan list marker '*' from the links footer in
CHANGELOG.md (the stray asterisk that renders as an empty bullet); open
CHANGELOG.md, locate the lone '*' in the footer area and delete that line so the
list renders correctly without the empty bullet.
- Around line 12-14: Move the entry "chore: Better check for stored meta not
stale" out of the "### Added" section into the "### Fixed" section and rewrite
it in user-facing language that explains the bug and the user impact (e.g.,
"Fixed stale execution-resume behavior by improving validation of stored
metadata to prevent resuming with outdated state"). Update the changelog entry
to remove commit-style phrasing and use a concise, user-first sentence
describing the fix and its effect.

In `@execution/evm/execution.go`:
- Around line 758-785: The resume path for ExecStageStarted currently only
checks timestamp and tx hash (execMeta.Timestamp, execMeta.TxHash via
hashTxs(txs)) before returning the stored PayloadID (pid); you must also
validate that the current parent hash matches execMeta.ParentHash to avoid
resuming on a changed parent. Retrieve the current parent hash for the block
(height-1) from the chain/headers API available in this package and compare it
to execMeta.ParentHash; if they differ, treat the payload as stale (log and fall
through to fresh execution) rather than returning pid even if
c.engineClient.GetPayload(ctx, pid) succeeds. Ensure you still keep the existing
GetPayload existence check (c.engineClient.GetPayload) but only return pid when
both the payload exists and the parent hash matches execMeta.ParentHash.
- Around line 1048-1059: The hashTxs function currently concatenates raw tx
bytes which can produce ambiguous collisions; modify hashTxs to make the tx-list
encoding unambiguous by prefixing each tx with its length (or hashing each tx
separately with a domain separator) before writing into the sha256 hasher: e.g.,
for each tx in hashTxs write the tx length as a fixed-size varint/uint64 or a
clear separator and then write the tx bytes (or write sha256(tx) with a
separator), then return h.Sum(nil) so the digest uniquely represents the ordered
list.

---

Nitpick comments:
In `@execution/evm/execution_reconcile_test.go`:
- Around line 19-57: The test currently builds specs as a map (variable specs)
and ranges over it, which randomizes subtest order; change specs to a slice of
structs that include a name field (e.g., type case struct { name string;
execMetaTimestamp int64; execMetaTxs [][]byte; requestedTxs [][]byte;
requestedTime time.Time; expectFound bool; expectPayloadID bool;
expectGetPayloads int }) and populate it in the desired stable order, then
iterate over that slice (for _, spec := range cases { t.Run(spec.name, func(t
*testing.T) { ... }) }) instead of ranging the map so subtests run
deterministically while keeping the same assertions and using the same symbols
(specs -> cases, t.Run).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e12b9912-7b63-430f-b6da-57e2d9c96b2c

📥 Commits

Reviewing files that changed from the base of the PR and between 4a70e0b and 0d5a6f5.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • execution/evm/execution.go
  • execution/evm/execution_reconcile_test.go

Comment on lines +12 to +14
### Added

* chore: Better check for stored meta not stale [#3221](https://github.com/evstack/ev-node/pull/3221)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Move this entry under Fixed and rewrite it user-first.

This change hardens stale execution-resume behavior, so it reads like a bug fix rather than a new feature, and chore: Better check for stored meta not stale is still commit-title/internal wording instead of changelog wording.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 12 - 14, Move the entry "chore: Better check for
stored meta not stale" out of the "### Added" section into the "### Fixed"
section and rewrite it in user-facing language that explains the bug and the
user impact (e.g., "Fixed stale execution-resume behavior by improving
validation of stored metadata to prevent resuming with outdated state"). Update
the changelog entry to remove commit-style phrasing and use a concise,
user-first sentence describing the fix and its effect.

<!-- Links -->

-
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove the orphan list marker in the links footer.

Line 333 renders as an empty bullet and looks like a leftover from the list-marker normalization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` at line 333, Remove the orphan list marker '*' from the links
footer in CHANGELOG.md (the stray asterisk that renders as an empty bullet);
open CHANGELOG.md, locate the lone '*' in the footer area and delete that line
so the list renders correctly without the empty bullet.

Comment on lines 758 to 785
if execMeta.Stage == ExecStageStarted && len(execMeta.PayloadID) == 8 {
var pid engine.PayloadID
copy(pid[:], execMeta.PayloadID)
requestedTxHash := hashTxs(txs)
if execMeta.Timestamp != timestamp.Unix() || !bytes.Equal(execMeta.TxHash, requestedTxHash) {
c.logger.Debug().
Uint64("height", height).
Int64("execmeta_timestamp", execMeta.Timestamp).
Int64("requested_timestamp", timestamp.Unix()).
Msg("ExecuteTxs: ignoring stale in-progress execution for different block inputs")
} else {
var pid engine.PayloadID
copy(pid[:], execMeta.PayloadID)

// Validate payload still exists by attempting to retrieve it
if _, err = c.engineClient.GetPayload(ctx, pid); err == nil {
c.logger.Info().
// Validate payload still exists by attempting to retrieve it
if _, err = c.engineClient.GetPayload(ctx, pid); err == nil {
c.logger.Info().
Uint64("height", height).
Str("stage", execMeta.Stage).
Msg("ExecuteTxs: found in-progress execution with payloadID, returning payloadID for resume")
return nil, &pid, true, nil
}
// Payload is stale (expired or node restarted) - proceed with fresh execution
c.logger.Debug().
Uint64("height", height).
Str("stage", execMeta.Stage).
Msg("ExecuteTxs: found in-progress execution with payloadID, returning payloadID for resume")
return nil, &pid, true, nil
Str("payloadID", pid.String()).
Err(err).
Msg("ExecuteTxs: stale ExecMeta payloadID no longer valid in EL, will re-execute")
// Don't return - fall through to fresh execution
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

ExecStageStarted resumes still ignore parent-chain changes.

This gate only checks timestamp and tx hash before reusing pid. After a rollback/reorg, the parent at height-1 can change while both still match, so we can resume and later promote a payload that was built on the old parent. Please also validate the current parent hash before returning pid; ExecMeta already has a ParentHash field you can use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@execution/evm/execution.go` around lines 758 - 785, The resume path for
ExecStageStarted currently only checks timestamp and tx hash
(execMeta.Timestamp, execMeta.TxHash via hashTxs(txs)) before returning the
stored PayloadID (pid); you must also validate that the current parent hash
matches execMeta.ParentHash to avoid resuming on a changed parent. Retrieve the
current parent hash for the block (height-1) from the chain/headers API
available in this package and compare it to execMeta.ParentHash; if they differ,
treat the payload as stale (log and fall through to fresh execution) rather than
returning pid even if c.engineClient.GetPayload(ctx, pid) succeeds. Ensure you
still keep the existing GetPayload existence check (c.engineClient.GetPayload)
but only return pid when both the payload exists and the parent hash matches
execMeta.ParentHash.

Comment on lines +1048 to +1059
func hashTxs(txs [][]byte) []byte {
if len(txs) == 0 {
return nil
}

h := sha256.New()
for _, tx := range txs {
h.Write(tx)
}

return h.Sum(nil)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make the tx-list hash unambiguous.

hashTxs hashes the raw concatenation of every tx, so different [][]byte values can collide when they flatten to the same byte stream. That weakens the stale-meta guard this PR is adding. Hash each element independently or include a separator/length per element.

Possible fix
 func hashTxs(txs [][]byte) []byte {
 	if len(txs) == 0 {
 		return nil
 	}
 
 	h := sha256.New()
 	for _, tx := range txs {
-		h.Write(tx)
+		txHash := sha256.Sum256(tx)
+		h.Write(txHash[:])
 	}
 
 	return h.Sum(nil)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func hashTxs(txs [][]byte) []byte {
if len(txs) == 0 {
return nil
}
h := sha256.New()
for _, tx := range txs {
h.Write(tx)
}
return h.Sum(nil)
}
func hashTxs(txs [][]byte) []byte {
if len(txs) == 0 {
return nil
}
h := sha256.New()
for _, tx := range txs {
txHash := sha256.Sum256(tx)
h.Write(txHash[:])
}
return h.Sum(nil)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@execution/evm/execution.go` around lines 1048 - 1059, The hashTxs function
currently concatenates raw tx bytes which can produce ambiguous collisions;
modify hashTxs to make the tx-list encoding unambiguous by prefixing each tx
with its length (or hashing each tx separately with a domain separator) before
writing into the sha256 hasher: e.g., for each tx in hashTxs write the tx length
as a fixed-size varint/uint64 or a clear separator and then write the tx bytes
(or write sha256(tx) with a separator), then return h.Sum(nil) so the digest
uniquely represents the ordered list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants