
[fluss-server] Fix replicasOnOffline deadlock with enhanced diagnostics #3012

Open
platinumhamburg wants to merge 1 commit into apache:main from platinumhamburg:periodic-election-fix


@platinumhamburg
Contributor

Root cause: ReplicaManager.addFetcherForReplicas() misclassified "no leader" (leaderId=null) as a STORAGE_EXCEPTION, which triggered permanent offline marking via replicasOnOffline. Once a replica was marked offline, it was excluded from liveReplicas during election, causing the bucket to remain leaderless indefinitely.

Fix: Replace the error branch with a simple guard (leaderId != null && leaderId >= 0) that skips fetcher setup for replicas without a valid leader, allowing them to recover naturally on the next LeaderAndIsr notification.
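As a rough illustration of the guard described above, here is a minimal, self-contained sketch. It is not the actual `ReplicaManager` code: the class name `FetcherGuardSketch`, the helper `bucketsToFetch`, and the use of plain strings as bucket identifiers are all hypothetical. Only the condition `leaderId != null && leaderId >= 0` is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the fetcher setup path; not the Fluss implementation.
final class FetcherGuardSketch {

    /** The guard quoted in the description: a leader must be present and non-negative. */
    static boolean hasValidLeader(Integer leaderId) {
        return leaderId != null && leaderId >= 0;
    }

    /**
     * Returns the buckets for which a fetcher would be started.
     * Buckets without a valid leader are skipped instead of being
     * reported as a storage failure, so nothing marks them offline.
     */
    static List<String> bucketsToFetch(Map<String, Integer> leaderByBucket) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : leaderByBucket.entrySet()) {
            if (hasValidLeader(entry.getValue())) {
                result.add(entry.getKey());
            }
            // No else branch: a leaderless bucket is left alone and is
            // expected to be picked up on a later LeaderAndIsr notification.
        }
        return result;
    }
}
```

With a map such as `{bucket-0 -> 2, bucket-1 -> null, bucket-2 -> -1}`, only `bucket-0` would be returned; the other two are skipped without producing an error.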

Additional improvements:

  • Enhanced initCoordinatorContext logging with server load stats and skip reasons (null registration, missing endpoint)
  • Added diagnostic logging when ISR members are excluded from liveReplicas during election
  • Added warn logging for NotifyLeaderAndIsr failures before marking replicas offline
  • Extracted electAndNotifyForBucket() to reduce duplication in TableBucketStateMachine state transitions
  • Added CoordinatorContext.isReplicaInOfflineSet() for diagnostics
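To show why an ISR member missing from liveReplicas matters for election, the following sketch models the situation under simplifying assumptions. The class `ElectionSketch`, the method `electLeader`, and the rule "pick the first live ISR member" are hypothetical and are not taken from `TableBucketStateMachine`. The assumption that a replica is live only when its server is alive and it is not in the offline set follows the root-cause description above.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of leader election for a single bucket; not the Fluss implementation.
final class ElectionSketch {

    /**
     * Picks the first ISR member that is considered live.
     * Assumption: "live" means the server is alive AND the replica
     * is not in the offline set for this bucket.
     */
    static Optional<Integer> electLeader(
            List<Integer> isr, Set<Integer> aliveServers, Set<Integer> offlineReplicas) {
        for (Integer replica : isr) {
            boolean live = aliveServers.contains(replica) && !offlineReplicas.contains(replica);
            if (live) {
                return Optional.of(replica);
            }
            // This is the point where a diagnostic log line would explain
            // why an ISR member was excluded from the candidates.
        }
        return Optional.empty();
    }
}
```

In this model, if every ISR member has been placed in the offline set, `electLeader` returns `Optional.empty()` even though all servers are alive, which corresponds to the bucket staying leaderless.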


Purpose

Linked issue: close #3011

Brief change log

Tests

API and Format

Documentation

@platinumhamburg platinumhamburg force-pushed the periodic-election-fix branch 5 times, most recently from d109537 to 720122f on April 8, 2026 02:01
Only fatal errors (STORAGE_EXCEPTION, LOG_STORAGE_EXCEPTION,
KV_STORAGE_EXCEPTION, UNKNOWN_SERVER_ERROR) mark replicas offline.
All other NotifyLeaderAndIsr errors are transient and do not affect
leader election eligibility.

Also fixes ReplicaManager to return LeaderNotAvailableException
instead of StorageException when leaderId is null or negative.
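The classification described in this commit message could be sketched roughly as follows. The four fatal error names come from the message itself; the enum `ErrorCode`, the constants `NONE` and `LEADER_NOT_AVAILABLE_EXCEPTION`, and both method names are hypothetical and only illustrate the idea, not the actual Fluss error types.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the fatal vs. transient split; not the Fluss implementation.
final class OfflineClassificationSketch {

    /** Illustrative error codes; only the four fatal names appear in the commit message. */
    enum ErrorCode {
        NONE,
        STORAGE_EXCEPTION,
        LOG_STORAGE_EXCEPTION,
        KV_STORAGE_EXCEPTION,
        UNKNOWN_SERVER_ERROR,
        LEADER_NOT_AVAILABLE_EXCEPTION
    }

    private static final Set<ErrorCode> FATAL = EnumSet.of(
            ErrorCode.STORAGE_EXCEPTION,
            ErrorCode.LOG_STORAGE_EXCEPTION,
            ErrorCode.KV_STORAGE_EXCEPTION,
            ErrorCode.UNKNOWN_SERVER_ERROR);

    /** Only fatal errors should cause a replica to be marked offline. */
    static boolean shouldMarkOffline(ErrorCode error) {
        return FATAL.contains(error);
    }

    /** A null or negative leader id maps to a transient "leader not available" error. */
    static ErrorCode errorForLeader(Integer leaderId) {
        return (leaderId == null || leaderId < 0)
                ? ErrorCode.LEADER_NOT_AVAILABLE_EXCEPTION
                : ErrorCode.NONE;
    }
}
```

Under this sketch, `shouldMarkOffline(errorForLeader(null))` is `false`, so a missing leader no longer removes the replica from election eligibility.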
@platinumhamburg platinumhamburg force-pushed the periodic-election-fix branch from 720122f to 2af4779 on April 8, 2026 03:10

Development

Successfully merging this pull request may close these issues.

[fluss-server] Fix replicasOnOffline deadlock in addFetcherForReplicas
