
HBASE-29039 Seek past delete markers instead of skipping one at a time #8001

Open
junegunn wants to merge 3 commits into apache:master from junegunn:HBASE-29039-alt

Conversation


@junegunn junegunn commented Mar 29, 2026

Context

HBASE-30036 (#7993) consolidates redundant delete markers on flush, preventing them from growing without bound in HFiles. However, markers still accumulate in the memstore before flush, degrading read performance. HBASE-29039 addresses this from the read-path side; both are needed for full coverage. There is an open PR (#6557), but the review process has stalled. This is an alternative approach with fewer code changes, which will hopefully make it easier to reach consensus.

Test result

Using the test code in HBASE-30036.

DeleteFamily

[image: DeleteFamily benchmark chart]
  • Substantial read performance improvement before flushes.
  • Without HBASE-30036, delete markers still accumulate in store files.

DeleteColumnContiguous

[image: DeleteColumnContiguous benchmark chart]
  • Substantial read performance improvement before flushes.
  • Without HBASE-30036, delete markers still accumulate in store files.

DeleteColumnInterleaved

[image: DeleteColumnInterleaved benchmark chart]
  • No difference, as expected. Already triggers SEEK_NEXT_COL via the masked put.

Description

When a DeleteColumn or DeleteFamily marker is encountered during a normal user scan, the matcher currently returns SKIP, forcing the scanner to advance one cell at a time. This causes read latency to degrade linearly with the number of accumulated delete markers for the same row or column.

Since these are range deletes that mask all remaining versions of the column, the matcher can seek past the entire column immediately via columns.getNextRowOrNextColumn(). This is safe because cells arrive in timestamp-descending order, so any puts newer than the delete have already been processed.

For DeleteFamily, also fix getKeyForNextColumn in ScanQueryMatcher to bypass the empty-qualifier guard (HBASE-18471) when the cell is a DeleteFamily marker. Without this, the seek barely advances past the current cell instead of jumping to the first real qualified column.

The optimization is skipped when:

  • seePastDeleteMarkers is true (KEEP_DELETED_CELLS)
  • newVersionBehavior is enabled (sequence IDs determine visibility)
  • the delete marker is not tracked (visibility labels)
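To make the guard conditions concrete, here is a minimal plain-Ruby sketch of the decision for a range delete marker. All names (match_code_for_delete, tracked, etc.) are illustrative assumptions, not the real HBase API; the actual logic lives in the Java matcher classes.

```ruby
# Sketch: decide SKIP vs SEEK for a DeleteColumn/DeleteFamily marker.
# Hypothetical names; mirrors the guard conditions listed above.
def match_code_for_delete(marker, opts)
  return :skip unless %i[delete_column delete_family].include?(marker[:type])
  return :skip if opts[:see_past_delete_markers]  # KEEP_DELETED_CELLS
  return :skip if opts[:new_version_behavior]     # sequence IDs determine visibility
  return :skip unless opts[:tracked]              # e.g. visibility labels in use
  # Safe to jump: newer puts were already emitted (timestamp-descending order).
  :seek_next_col
end
```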

@junegunn junegunn marked this pull request as draft March 29, 2026 03:08
When a DeleteColumn or DeleteFamily marker is encountered during a normal
user scan, the matcher currently returns SKIP, forcing the scanner to
advance one cell at a time. This causes read latency to degrade linearly
with the number of accumulated delete markers for the same row or column.

Since these are range deletes that mask all remaining versions of the
column, seek past the entire column immediately via
columns.getNextRowOrNextColumn(). This is safe because cells arrive in
timestamp descending order, so any puts newer than the delete have
already been processed.

For DeleteFamily, also fix getKeyForNextColumn in ScanQueryMatcher to
bypass the empty-qualifier guard (HBASE-18471) when the cell is a
DeleteFamily marker. Without this, the seek barely advances past the
current cell instead of jumping to the first real qualified column.

The optimization is only applied with plain ScanDeleteTracker, and
skipped when:
- seePastDeleteMarkers is true (KEEP_DELETED_CELLS)
- newVersionBehavior is enabled (sequence IDs determine visibility)
- visibility labels are in use (delete/put label mismatch)
@junegunn junegunn marked this pull request as ready for review March 29, 2026 03:42

junegunn commented Mar 30, 2026

I found a regression with this patch. When scanning across many rows where each row has only one DeleteFamily (or DeleteColumn) marker, scan performance degrades by ~50% compared to master. The seek triggered by this optimization is more expensive than a simple skip when there's nothing to skip over.

The optimization helps when multiple delete markers accumulate for the same row or column. But for the common case of one delete per row, the seek is wasted and the overhead adds up across many rows.

Benchmark data (scan time at 300K iterations, DeleteFamily on different rows):

benchmark(:DeleteFamilyDifferentRows) do |i|
  row = i.to_s.to_java_bytes
  T.put(Put.new(row).addColumn(CF, CQ, VALUE))
  T.delete(Delete.new(row))
end
[image: DeleteFamilyDifferentRows benchmark chart]

One approach: only seek on the N-th contiguous delete marker. The first N-1 markers SKIP as before; reaching N contiguous markers signals accumulation and triggers a seek. This way:

  1. One delete per row (common case): always skips, no regression (base case)
  2. Accumulated redundant delete markers: first N-1 skips, then 1 seek (best case)
  3. Accumulated non-redundant delete markers: an unnecessary seek happens every N delete markers (worst case)

Would this kind of heuristic make sense? A higher N reduces the relative overhead in the worst case, but delays the benefit in the best case.
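The heuristic above can be simulated in a few lines of plain Ruby. This is only a flat sketch under assumed names (decide, :delete_marker, threshold 3): in the real scanner a seek would jump past the masked cells rather than visit them, so here we simply reset the streak counter after each seek.

```ruby
# Simulate: SKIP the first n-1 contiguous range delete markers,
# SEEK on the n-th, and reset the streak on any non-marker cell.
def decide(cells, n = 3)
  count = 0
  cells.map do |cell|
    if cell == :delete_marker
      count += 1
      if count >= n
        count = 0  # after a seek the scanner jumps; reset the streak
        :seek
      else
        :skip
      end
    else
      count = 0
      :next  # regular cell handling
    end
  end
end
```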

Note

This patch does not compare qualifiers of contiguous delete markers. Doing so (e.g. exposing a method on ScanDeleteTracker) would prevent cross-column false positives but not eliminate them entirely. Even with qualifier comparison, if a column has exactly N DeleteColumn markers, the seek at the N-th is still a false positive. e.g.

DC(q1) --skip--> DC(q1) --skip--> DC(q1) --seek--> DC(q2) --skip--> DC(q2) --skip--> DC(q2) --seek--> DC(q3)

@junegunn junegunn force-pushed the HBASE-29039-alt branch 2 times, most recently from 6be48a0 to e7dc782 Compare March 30, 2026 23:17
@junegunn junegunn marked this pull request as draft March 30, 2026 23:24
Seeking is ~50% more expensive than skipping. When each row has only one
DeleteFamily or DeleteColumn marker (common case), the seek overhead
adds up across many rows, causing ~50% scan regression.

Introduce a counter that tracks consecutive range delete markers per row.
Only switch from SKIP to SEEK after seeing SEEK_ON_DELETE_MARKER_THRESHOLD
(default 3) markers, indicating actual accumulation. This preserves skip
performance for the common case while still optimizing the accumulation
case.
@junegunn junegunn marked this pull request as ready for review March 31, 2026 00:17

junegunn commented Apr 2, 2026

I ran another test to directly measure the SEEK overhead. In this case, we generate many DeleteColumn markers for different qualifiers:

benchmark(:DeleteColumnFalsePositive) do |i|
  T.put(PUT) if i.zero?

  dc = Delete.new(ROW).addColumns(CF, i.to_s.to_java_bytes)
  T.delete(dc)

  # Let's manually flush after every 100,000 operations because it's hard to
  # fill up the memstore only with delete markers.
  flush 't' if (i % 100_000).zero? && i.positive?
end
Resulting cell layout:

  • DC Q1
  • DC Q2
  • DC Q3
  • DC Q4
  • DC Q5
  • ...
  • Put Q0

SEEK can only advance the pointer by one cell, providing no advantage over SKIP. This is the worst case for this optimization.

Here are the results with different N values:

[image: worst-case benchmark chart]

Higher N reduces overhead, as expected. At N=100, overhead is negligible. Yet the best-case benefit still holds at N=100.

[image: best-case benchmark charts]


junegunn commented Apr 2, 2026

I updated the patch to compare qualifiers of contiguous delete markers, so the counter only increments for consecutive markers targeting the same column. With this, we don't need such a large N value to avoid the regression in the worst case.

N=3 works correctly with this approach:

  • Same-column accumulation (the real problem): seeks after 3 markers. Fast kick-in.
  • Different-column DCs (false positive case): counter resets on qualifier change. All skip as before. No overhead.
  • One delete per row (common case): counter never reaches 3. Zero overhead.

Even with qualifier comparison, false positives remain: a run of exactly N consecutive redundant DCs on the same qualifier still triggers a seek that gains nothing. e.g.

DC(q1) -skip-> DC(q1) -skip-> DC(q1) -seek-> DC(q2) -skip-> DC(q2) -skip-> DC(q2) -seek-> DC(q3)

This should be rare in practice. If overhead is a concern, increasing N is the only alternative.
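The qualifier-aware counter can be sketched in plain Ruby as follows. All names are illustrative assumptions; the test trace mirrors the DC(q1)/DC(q2)/DC(q3) example above, including the false-positive seek at the end of each run of exactly N markers.

```ruby
# Counter that only increments for contiguous delete markers on the SAME
# qualifier, resetting when the qualifier changes. Threshold n = 3.
def decide_with_qualifier(marker_qualifiers, n = 3)
  count = 0
  last_q = nil
  marker_qualifiers.map do |q|
    count = 0 if q != last_q  # qualifier changed: streak resets
    last_q = q
    count += 1
    if count >= n
      count = 0  # seek issued; start a new streak
      :seek
    else
      :skip
    end
  end
end
```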

Here are the results.

  • Regression in non-redundant DeleteFamily markers is fixed.
    • [image: benchmark chart]
  • No overhead in the worst case.
    • [image: worst-case benchmark chart]
  • The best-case benefit still holds.
    • [image: benchmark chart]
    • [image: benchmark chart]
