[fluss-server] Fix replicasOnOffline deadlock with enhanced diagnostics by platinumhamburg · Pull Request #3012 · apache/fluss

platinumhamburg · 2026-04-07T02:43:36Z

Root cause: ReplicaManager.addFetcherForReplicas() misclassified "no leader" (leaderId=null) as a STORAGE_EXCEPTION, which triggered permanent offline marking via replicasOnOffline. Once a replica was marked offline, it was excluded from liveReplicas during election, causing the bucket to remain leaderless indefinitely.

Fix: Replace the error branch with a simple guard (leaderId != null && leaderId >= 0) that skips fetcher setup for replicas without a valid leader, allowing them to recover naturally on the next LeaderAndIsr notification.
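For reviewers, the guard can be pictured roughly as follows. This is a minimal sketch, not the actual `ReplicaManager` code: the class `FetcherGuardSketch`, the methods `hasValidLeader` and `bucketsToFetch`, and the use of plain strings for buckets are all hypothetical names chosen for illustration. Only the condition `leaderId != null && leaderId >= 0` comes from this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and types here are assumptions, not Fluss APIs. */
final class FetcherGuardSketch {

    /** The guard from this PR: a leader id is usable only if it is non-null and non-negative. */
    static boolean hasValidLeader(Integer leaderId) {
        return leaderId != null && leaderId >= 0;
    }

    /**
     * Returns the buckets that should get a fetcher. Buckets without a valid
     * leader are skipped silently instead of being reported as a storage error,
     * so nothing marks them offline.
     */
    static List<String> bucketsToFetch(Map<String, Integer> leaderByBucket) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : leaderByBucket.entrySet()) {
            if (hasValidLeader(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> leaders = new LinkedHashMap<>();
        leaders.put("bucket-0", 2);    // valid leader: fetcher is set up
        leaders.put("bucket-1", null); // no leader yet: skipped
        leaders.put("bucket-2", -1);   // negative sentinel: skipped
        System.out.println(bucketsToFetch(leaders)); // prints [bucket-0]
    }
}
```

Skipped buckets are simply left alone in this sketch; per the description above, they are expected to pick up a leader on the next LeaderAndIsr notification.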

Additional improvements:

Enhanced initCoordinatorContext logging with server load stats and skip reasons (null registration, missing endpoint)
Added diagnostic logging when ISR members are excluded from liveReplicas during election
Added warn logging for NotifyLeaderAndIsr failures before marking replicas offline
Extracted electAndNotifyForBucket() to reduce duplication in TableBucketStateMachine state transitions
Added CoordinatorContext.isReplicaInOfflineSet() for diagnostics
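To make the deadlock and the new election diagnostics concrete, here is a small sketch of how an offline set can leave a bucket leaderless. It assumes the election picks the first ISR member that is both on a live server and not in the offline set; that selection rule, the class `ElectionDiagnosticsSketch`, and both method names are assumptions for illustration and do not mirror the real `TableBucketStateMachine` or `CoordinatorContext` code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Illustrative sketch only; the election rule and all names are assumptions. */
final class ElectionDiagnosticsSketch {

    /** Picks the first ISR member that is alive and not marked offline, if any. */
    static Optional<Integer> electLeader(
            List<Integer> isr, Set<Integer> liveServers, Set<Integer> offlineReplicas) {
        for (Integer id : isr) {
            if (liveServers.contains(id) && !offlineReplicas.contains(id)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }

    /** Explains why each excluded ISR member was not eligible (the diagnostic part). */
    static List<String> excludedIsrReasons(
            List<Integer> isr, Set<Integer> liveServers, Set<Integer> offlineReplicas) {
        List<String> reasons = new ArrayList<>();
        for (Integer id : isr) {
            if (!liveServers.contains(id)) {
                reasons.add("replica " + id + ": server not alive");
            } else if (offlineReplicas.contains(id)) {
                reasons.add("replica " + id + ": in offline set");
            }
        }
        return reasons;
    }

    public static void main(String[] args) {
        List<Integer> isr = Arrays.asList(1, 2, 3);
        Set<Integer> live = new HashSet<>(Arrays.asList(1, 2, 3));
        // Every replica was wrongly marked offline, so no candidate remains.
        Set<Integer> offline = new HashSet<>(Arrays.asList(1, 2, 3));

        System.out.println("leader elected: " + electLeader(isr, live, offline).isPresent());
        for (String reason : excludedIsrReasons(isr, live, offline)) {
            System.out.println(reason);
        }
    }
}
```

With all three servers alive but all three replicas in the offline set, no leader is elected and every ISR member is reported as excluded, which is the situation the added logging is meant to surface.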


Purpose

Linked issue: close #3011

Brief change log

Tests

API and Format

Documentation

Only fatal errors (STORAGE_EXCEPTION, LOG_STORAGE_EXCEPTION, KV_STORAGE_EXCEPTION, UNKNOWN_SERVER_ERROR) mark replicas offline. All other NotifyLeaderAndIsr errors are transient and do not affect leader election eligibility. Also fixes ReplicaManager to return LeaderNotAvailableException instead of StorageException when leaderId is null or negative.
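The classification rule can be sketched as a simple set-membership check. The four fatal names are taken from the description above. Everything else here is an assumption made for illustration: the enum `ErrorKind`, the class `OfflineClassificationSketch`, the method names, and the two transient entries (`LEADER_NOT_AVAILABLE` is a guess at the error matching `LeaderNotAvailableException`, and `REQUEST_TIMEOUT` is purely hypothetical). The real code presumably uses the project's own error enum.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; the enum and method names are assumptions, not Fluss APIs. */
final class OfflineClassificationSketch {

    enum ErrorKind {
        STORAGE_EXCEPTION,
        LOG_STORAGE_EXCEPTION,
        KV_STORAGE_EXCEPTION,
        UNKNOWN_SERVER_ERROR,
        LEADER_NOT_AVAILABLE, // assumed name for the transient "no leader" error
        REQUEST_TIMEOUT       // hypothetical transient error, for illustration
    }

    /** The fatal errors listed in the PR description. */
    static final Set<ErrorKind> FATAL_ERRORS =
            Collections.unmodifiableSet(
                    EnumSet.of(
                            ErrorKind.STORAGE_EXCEPTION,
                            ErrorKind.LOG_STORAGE_EXCEPTION,
                            ErrorKind.KV_STORAGE_EXCEPTION,
                            ErrorKind.UNKNOWN_SERVER_ERROR));

    /** Only fatal errors mark a replica offline; everything else is treated as transient. */
    static boolean shouldMarkOffline(ErrorKind error) {
        return FATAL_ERRORS.contains(error);
    }

    /** A null or negative leader id means "no leader", which is not a storage failure. */
    static boolean isLeaderMissing(Integer leaderId) {
        return leaderId == null || leaderId < 0;
    }

    public static void main(String[] args) {
        for (ErrorKind kind : ErrorKind.values()) {
            System.out.println(kind + " -> markOffline=" + shouldMarkOffline(kind));
        }
    }
}
```

Keeping the fatal set explicit means any error not listed stays transient by default and cannot remove a replica from election eligibility.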

platinumhamburg force-pushed the periodic-election-fix branch 5 times, most recently from d109537 to 720122f on April 8, 2026 02:01

platinumhamburg force-pushed the periodic-election-fix branch from 720122f to 2af4779 on April 8, 2026 03:10

platinumhamburg wants to merge 1 commit into apache:main from platinumhamburg:periodic-election-fix
