feat(pkg/p2p): reconnect on disconnected peers #3212
Conversation
📝 Walkthrough

Implements automatic reconnection and periodic peer discovery for seed peers, re-enables local module replace directives for development builds, and updates retry-loop syntax in cloud KMS signers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as P2P Client
    participant Host as libp2p Host
    participant DHT as DHT/Discovery
    participant Notifee as Disconnect Notifier
    participant Maintenance as Maintenance Loop
    Client->>Host: startWithHost()
    Client->>Notifee: register disconnect handler
    Client->>Maintenance: start maintenance goroutine
    Note over Host,Notifee: network event
    Host-->>Notifee: peer disconnected
    Notifee->>Maintenance: send peerID on reconnectCh
    Maintenance->>Maintenance: apply per-peer backoff
    Maintenance->>Host: tryConnect(peerID) (limited by connectSem)
    Host-->>Maintenance: connection result
    Maintenance->>DHT: periodic refreshPeerDiscovery()
    DHT->>Host: re-advertise & FindPeers()
    DHT-->>Maintenance: peer list
    Maintenance->>Host: tryConnect(discovered peers)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (inconclusive)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude finished @julienrbrt's task in 12m 10s — Code Review
Overall the PR introduces a well-structured reconnection mechanism. The design is sound (background maintenance goroutine, exponential backoff, semaphore-bounded peer discovery).

**Critical: Reconnect never retries after a failed connect attempt**

```go
go func(info peer.AddrInfo) {
	if err := c.host.Connect(ctx, info); err != nil && ctx.Err() == nil {
		c.logger.Warn()... // ← missing: schedule a retry
	}
}(sp)
```

Fix: re-enqueue the peer ID after the backoff delay on failure, e.g.:

```go
go func(info peer.AddrInfo, delay time.Duration) {
	if err := c.host.Connect(ctx, info); err != nil && ctx.Err() == nil {
		c.logger.Warn()...
		time.AfterFunc(delay, func() {
			select {
			case <-ctx.Done():
			case c.reconnectCh <- info.ID:
			default:
			}
		})
	}
}(sp, backoff)
```

**Critical: Disconnect events arriving during backoff window are silently dropped**
```go
if time.Now().Before(st.lastAttempt.Add(backoff)) {
	continue // ← event lost forever
}
```

If a disconnect fires during the cooldown window (e.g. a flapping peer), the event is discarded and no retry is scheduled. After the window expires there's nothing left to trigger a reconnect. Should schedule the retry for when the backoff expires rather than dropping it:

```go
if wait := time.Until(st.lastAttempt.Add(backoff)); wait > 0 {
	time.AfterFunc(wait, func() {
		select {
		case c.reconnectCh <- pid:
		default:
		}
	})
	continue
}
```

**Major: Goroutines from**
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```diff
@@           Coverage Diff           @@
##             main    #3212   +/-  ##
=======================================
- Coverage   61.43%   61.23%   -0.21%
=======================================
  Files         120      120
  Lines       12504    12591     +87
=======================================
+ Hits         7682     7710     +28
- Misses       3960     4017     +57
- Partials      862      864      +2
```

Flags with carried forward coverage won't be shown.
Actionable comments posted: 3
🧹 Nitpick comments (1)
apps/testapp/go.mod (1)
5-5: Keep this `replace` local to development.

This ties the module to `../../.` being present at build time. That works inside this checkout, but not when the submodule is built outside the monorepo layout. Prefer a local `go.work` or an uncommitted dev override instead. Please verify this module is not expected to build outside the repository layout; if `apps/testapp` is intentionally monorepo-only, feel free to ignore this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/testapp/go.mod` at line 5, The committed go.mod contains a local replace directive ("replace github.com/evstack/ev-node => ../../.") that forces a monorepo layout; remove this replace from the tracked go.mod and instead keep it as an uncommitted developer-only override (or move the replacement into a go.work used only in development), or if this module is intended to be monorepo-only, document that restriction; update the repository by deleting the replace line from the committed go.mod and adding a dev-only go.work or .gitignore'd override as appropriate so builds outside the repo layout are not broken.
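The `go.work` alternative the comment suggests could look like the following sketch at the repository root. The Go version and module paths are assumptions, not taken from the repo:

```
go 1.24

use (
	.
	./apps/testapp
)
```

With a workspace file in place, `go` commands run inside the checkout resolve `github.com/evstack/ev-node` to the local module listed in `use`, so the committed `go.mod` needs no `replace` line. Because `go.work` is never consulted when the module is consumed downstream, developers keep their local override without shipping it.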
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: The go.mod contains committed relative replace directives for
modules "github.com/evstack/ev-node" and
"github.com/evstack/ev-node/execution/evm" (the two replace (...) entries) which
will break consumers outside the monorepo; remove these relative replace lines
from apps/evm/go.mod before committing and, for local development only, move
them into a go.work file or use developer instructions to run `go work use` so
local overrides remain available but are not shipped in the module's go.mod.
In `@pkg/p2p/client.go`:
- Around line 343-355: The reconnect loop currently only receives signals from
reconnectCh (fed by Disconnected) so a failed host.Connect leaves the peer
without future triggers, and the current logic drops events that arrive during
the backoff window; update the reconnect handling (references: reconnectCh,
Disconnected, host.Connect, st.attempts, st.lastAttempt, reconnectCooldown,
maxReconnectCooldown) so that on a connect failure you re-schedule another retry
(e.g., push a retry event into reconnectCh or start a time.AfterFunc/goroutine
to re-enqueue after computed backoff) and when an event arrives inside the
backoff window do not drop it but schedule it to be processed after the
remaining backoff interval (use a timer to requeue into reconnectCh); apply the
same scheduling change to the other similar block (lines around the second
instance handling backoff) so transient outages cause repeated retries rather
than a single attempt.
- Line 199: The maintenance goroutine is started with the caller-supplied ctx so
timeouts/cancels on Start() prematurely stop reconnection/discovery; change this
so maintenance uses its own background context owned by the client and only
stops when Close() is called. Create a maintenance context via
context.WithCancel(context.Background()) (store the cancel func on the client
struct), pass that maintenance context into startConnectionMaintenance (and the
other maintenance/startDiscovery loops currently using the Start() ctx around
lines 317-320), and invoke the stored cancel in Close(); do not use the Start()
ctx for these long-lived loops.
---
Nitpick comments:
In `@apps/testapp/go.mod`:
- Line 5: The committed go.mod contains a local replace directive ("replace
github.com/evstack/ev-node => ../../.") that forces a monorepo layout; remove
this replace from the tracked go.mod and instead keep it as an uncommitted
developer-only override (or move the replacement into a go.work used only in
development), or if this module is intended to be monorepo-only, document that
restriction; update the repository by deleting the replace line from the
committed go.mod and adding a dev-only go.work or .gitignore'd override as
appropriate so builds outside the repo layout are not broken.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1c692055-7514-4e34-ad88-0d6d76c89f7d
⛔ Files ignored due to path filters (2)

- `apps/evm/go.sum` is excluded by `!**/*.sum`
- `apps/testapp/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (6)

- `CHANGELOG.md`
- `apps/evm/go.mod`
- `apps/testapp/go.mod`
- `pkg/p2p/client.go`
- `pkg/signer/aws/signer.go`
- `pkg/signer/gcp/signer.go`
```go
if time.Since(st.lastAttempt) > maxReconnectCooldown {
	st.attempts = 0
}

backoff := reconnectCooldown * time.Duration(1<<min(st.attempts, 6))
if backoff > maxReconnectCooldown {
	backoff = maxReconnectCooldown
}
if time.Now().Before(st.lastAttempt.Add(backoff)) {
	continue
}
st.lastAttempt = time.Now()
```
Backoff never produces a second reconnect attempt.
`reconnectCh` is only fed by `Disconnected`, so a failed `host.Connect` leaves the peer with no future trigger. Line 352 also drops any event that arrives inside the current backoff window instead of scheduling it for later. That turns transient outages into a single reconnect attempt.
🔧 One way to keep the backoff loop alive:

```diff
+func (c *Client) scheduleReconnect(ctx context.Context, pid peer.ID, delay time.Duration) {
+	time.AfterFunc(delay, func() {
+		select {
+		case <-ctx.Done():
+		case c.reconnectCh <- pid:
+		default:
+		}
+	})
+}
+
 func (c *Client) startConnectionMaintenance(parentCtx context.Context) {
 	ctx, cancel := context.WithCancel(parentCtx)
 	c.maintenanceCancel = cancel
@@
-	if time.Now().Before(st.lastAttempt.Add(backoff)) {
-		continue
-	}
+	if wait := time.Until(st.lastAttempt.Add(backoff)); wait > 0 {
+		c.scheduleReconnect(ctx, pid, wait)
+		continue
+	}
@@
-	go func(info peer.AddrInfo) {
+	go func(info peer.AddrInfo, retryAfter time.Duration) {
 		if err := c.host.Connect(ctx, info); err != nil && ctx.Err() == nil {
 			c.logger.Warn().Str("peer", info.ID.String()).Err(err).Msg("failed to reconnect to seed peer")
+			c.scheduleReconnect(ctx, info.ID, retryAfter)
 		}
-	}(sp)
+	}(sp, backoff)
```

Also applies to: 364-370
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/p2p/client.go` around lines 343 - 355, The reconnect loop currently only
receives signals from reconnectCh (fed by Disconnected) so a failed host.Connect
leaves the peer without future triggers, and the current logic drops events that
arrive during the backoff window; update the reconnect handling (references:
reconnectCh, Disconnected, host.Connect, st.attempts, st.lastAttempt,
reconnectCooldown, maxReconnectCooldown) so that on a connect failure you
re-schedule another retry (e.g., push a retry event into reconnectCh or start a
time.AfterFunc/goroutine to re-enqueue after computed backoff) and when an event
arrives inside the backoff window do not drop it but schedule it to be processed
after the remaining backoff interval (use a timer to requeue into reconnectCh);
apply the same scheduling change to the other similar block (lines around the
second instance handling backoff) so transient outages cause repeated retries
rather than a single attempt.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/p2p/client.go (1)
351-352: ⚠️ Potential issue | 🟠 Major

Backoff still turns a transient seed outage into a single retry.

At line 351 the event is dropped if the peer is still inside its backoff window, and at line 367 a failed `host.Connect` only logs. After that first failed dial there is no future trigger for that seed peer, so it can remain disconnected indefinitely.

Also applies to: 367-369
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/p2p/client.go` around lines 351 - 352, The code currently skips processing a seed peer if time.Now().Before(st.lastAttempt.Add(backoff)) and only logs on host.Connect failure, which means a transient outage may never be retried; change this so that when a peer is within backoff you requeue/schedule it for a future attempt instead of dropping the event (use a timer or push it back onto the attempt queue and keep state in st), and when host.Connect(peer) fails update st.lastAttempt and also schedule a follow-up retry (rather than only logging) so the peer will be retried after backoff; make these changes around the references to st.lastAttempt.Add(backoff) and the host.Connect call to ensure failed dials trigger future retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/p2p/client.go`:
- Around line 317-321: startConnectionMaintenance currently only handles future
Disconnected events so seed peers that are already disconnected never get
enqueued; before the maintenance select loop begins (inside the maintenanceWg.Go
started by startConnectionMaintenance), iterate c.seedPeers and for each peer
that is not currently connected (use existing connection state check used
elsewhere in the file), send that peer into c.reconnectCh once so they will be
retried; ensure sends won't block the worker (use a non-blocking select with
default or spawn a short goroutine per send) and rely on
refreshPeerDiscovery/c.getNamespace as before for discovery updates.
- Around line 388-389: The current loop repeatedly calls c.advertise(ctx) which
spawns a background re-advertisement goroutine each time (discutil.Advertise
runs until ctx.Done()), causing a goroutine leak; fix by invoking
c.advertise(ctx) exactly once from the maintenance context (e.g., start it when
maintenanceCtx is created) and change the ticker loop to only call peer
discovery (e.g., c.FindPeers or the method that refreshes discovery) on each
tick, or alternatively modify c.advertise to provide a single-shot advertise API
and call that from the ticker; locate and update the code paths referencing
c.advertise(ctx) and the ticker loop so no repeated long-lived advertise
goroutines are started.
---
Duplicate comments:
In `@pkg/p2p/client.go`:
- Around line 351-352: The code currently skips processing a seed peer if
time.Now().Before(st.lastAttempt.Add(backoff)) and only logs on host.Connect
failure, which means a transient outage may never be retried; change this so
that when a peer is within backoff you requeue/schedule it for a future attempt
instead of dropping the event (use a timer or push it back onto the attempt
queue and keep state in st), and when host.Connect(peer) fails update
st.lastAttempt and also schedule a follow-up retry (rather than only logging) so
the peer will be retried after backoff; make these changes around the references
to st.lastAttempt.Add(backoff) and the host.Connect call to ensure failed dials
trigger future retries.
```go
func (c *Client) startConnectionMaintenance() {
	ctx, cancel := context.WithCancel(context.Background())
	c.maintenanceCancel = cancel

	c.maintenanceWg.Go(func() {
```
Queue currently disconnected seed peers when maintenance starts.
This worker only reacts to future `Disconnected` events. A seed peer that never connected during startup, or dropped before the notifee was registered, never enters `reconnectCh`, and `refreshPeerDiscovery` only searches `c.getNamespace()`. Seed any currently disconnected `c.seedPeers` once before the select loop begins.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/p2p/client.go` around lines 317 - 321, startConnectionMaintenance
currently only handles future Disconnected events so seed peers that are already
disconnected never get enqueued; before the maintenance select loop begins
(inside the maintenanceWg.Go started by startConnectionMaintenance), iterate
c.seedPeers and for each peer that is not currently connected (use existing
connection state check used elsewhere in the file), send that peer into
c.reconnectCh once so they will be retried; ensure sends won't block the worker
(use a non-blocking select with default or spawn a short goroutine per send) and
rely on refreshPeerDiscovery/c.getNamespace as before for discovery updates.
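The startup seeding described above reduces to a loop with a non-blocking send. This stdlib sketch uses `string` peer IDs and an injected `isConnected` predicate in place of the host's connectedness check; all names are illustrative:

```go
package main

import "fmt"

// seedInitialReconnects enqueues every seed peer that is not currently
// connected, using a non-blocking send so the maintenance worker is never
// blocked at startup. isConnected stands in for the host connectedness check.
func seedInitialReconnects(seeds []string, isConnected func(string) bool, ch chan<- string) int {
	queued := 0
	for _, pid := range seeds {
		if isConnected(pid) {
			continue
		}
		select {
		case ch <- pid:
			queued++
		default:
			// channel full; periodic discovery will pick the peer up later
		}
	}
	return queued
}

func main() {
	ch := make(chan string, 8)
	n := seedInitialReconnects(
		[]string{"seed-a", "seed-b", "seed-c"},
		func(pid string) bool { return pid == "seed-b" }, // only seed-b is up
		ch,
	)
	fmt.Println(n)
}
```

The `default` arm is what keeps the worker from deadlocking if the channel is already full; dropped seeds are not lost because the discovery ticker retries them later.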
```go
_ = c.advertise(ctx)
```
🧩 Analysis chain
🌐 Web query:
For github.com/libp2p/go-libp2p v0.48.0, what does p2p/discovery/util.Advertise do? Does it start a background re-advertisement loop tied to ctx.Done(), or is it a single synchronous advertise call?
💡 Result:
In github.com/libp2p/go-libp2p/p2p/discovery/util v0.48.0, util.Advertise(ctx, a, ns, ...) starts a background goroutine that re-advertises in a loop until ctx.Done() is closed—it is not a single synchronous advertise call. (raw.githubusercontent.com)
Behavior details:
- Calls `a.Advertise(ctx, ns, opts...)` repeatedly, receiving a `ttl`. (raw.githubusercontent.com)
- Waits `7*ttl/8` before the next advertise (or exits early on `ctx.Done()`). (raw.githubusercontent.com)
- On advertise error, logs, then (if context isn't canceled) waits 2 minutes and retries (or exits on `ctx.Done()`). (raw.githubusercontent.com)
Citations:
- https://raw.githubusercontent.com/libp2p/go-libp2p/v0.48.0/p2p/discovery/util/util.go
Fix goroutine leak in advertise refresh loop.
Calling `c.advertise(ctx)` every 5 minutes at line 388 spawns a new background goroutine with each call. In libp2p v0.48.0, `discutil.Advertise` starts a re-advertisement loop that runs until `ctx.Done()` — it is not a single synchronous call. Stacking these goroutines will leak resources for the client's lifetime.
Start `c.advertise(ctx)` once from the maintenance context and let the ticker only refresh peer discovery via `FindPeers`, or switch to a single-shot advertise call that does not spawn a background loop.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/p2p/client.go` around lines 388 - 389, The current loop repeatedly calls
c.advertise(ctx) which spawns a background re-advertisement goroutine each time
(discutil.Advertise runs until ctx.Done()), causing a goroutine leak; fix by
invoking c.advertise(ctx) exactly once from the maintenance context (e.g., start
it when maintenanceCtx is created) and change the ticker loop to only call peer
discovery (e.g., c.FindPeers or the method that refreshes discovery) on each
tick, or alternatively modify c.advertise to provide a single-shot advertise API
and call that from the ticker; locate and update the code paths referencing
c.advertise(ctx) and the ticker loop so no repeated long-lived advertise
goroutines are started.
Attempts to reconnect to peers when there are connectivity issues.
PR made by AI. Still testing.