HBASE-29039 Seek past delete markers instead of skipping one at a time (#8001)
junegunn wants to merge 3 commits into apache:master
Conversation
When a DeleteColumn or DeleteFamily marker is encountered during a normal user scan, the matcher currently returns SKIP, forcing the scanner to advance one cell at a time. This causes read latency to degrade linearly with the number of accumulated delete markers for the same row or column.

Since these are range deletes that mask all remaining versions of the column, seek past the entire column immediately via columns.getNextRowOrNextColumn(). This is safe because cells arrive in timestamp-descending order, so any puts newer than the delete have already been processed.

For DeleteFamily, also fix getKeyForNextColumn in ScanQueryMatcher to bypass the empty-qualifier guard (HBASE-18471) when the cell is a DeleteFamily marker. Without this, the seek barely advances past the current cell instead of jumping to the first real qualified column.

The optimization is only applied with plain ScanDeleteTracker, and skipped when:

- seePastDeleteMarkers is true (KEEP_DELETED_CELLS)
- newVersionBehavior is enabled (sequence IDs determine visibility)
- visibility labels are in use (delete/put label mismatch)
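To illustrate the gap, here is a toy model of why SEEK beats SKIP once a range delete masks many versions: SKIP still visits every masked cell, while SEEK jumps to the next column in one step. This is not HBase code; all names are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one row's cells in scanner order (column by column,
// newest timestamp first). A DeleteColumn marker masks every remaining
// version of its column. Hypothetical sketch, not the real matcher.
class SeekSketch {
    static class Cell {
        final String qualifier;
        final boolean isDeleteColumn;
        Cell(String qualifier, boolean isDeleteColumn) {
            this.qualifier = qualifier;
            this.isDeleteColumn = isDeleteColumn;
        }
    }

    // A column "a" with a DeleteColumn marker on top of `maskedVersions`
    // older puts, followed by a live column "b".
    static List<Cell> rowWithDeletedColumn(int maskedVersions) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("a", true));
        for (int k = 0; k < maskedVersions; k++) cells.add(new Cell("a", false));
        cells.add(new Cell("b", false));
        return cells;
    }

    // SKIP: the matcher returns SKIP for each masked cell, so the
    // scanner examines all of them one at a time.
    static int cellsExaminedWithSkip(List<Cell> cells) {
        return cells.size();
    }

    // SEEK: on a DeleteColumn marker, jump straight to the next
    // qualifier (columns.getNextRowOrNextColumn() in the real matcher).
    static int cellsExaminedWithSeek(List<Cell> cells) {
        int examined = 0;
        int i = 0;
        while (i < cells.size()) {
            Cell c = cells.get(i);
            examined++;
            if (c.isDeleteColumn) {
                String q = c.qualifier;
                while (i < cells.size() && cells.get(i).qualifier.equals(q)) i++;
            } else {
                i++;
            }
        }
        return examined;
    }
}
```

With 100 masked versions, SKIP examines 102 cells while SEEK examines 2 — the linear-vs-constant gap the description refers to.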
(force-pushed from 018a268 to 3a87682)
I found a regression with this patch. When scanning across many rows where each row has only one delete marker, scans get slower. The optimization helps when multiple delete markers accumulate for the same row or column, but for the common case of one delete per row, the seek is wasted and the overhead adds up across many rows. Benchmark data (scan time at 300K iterations):
```ruby
# Each row gets a single Put immediately followed by a row-level Delete
# (a DeleteFamily marker), i.e. exactly one delete marker per row -- the
# worst case for the seek optimization. T, CF, CQ, and VALUE come from
# the benchmark harness in HBASE-30036.
benchmark(:DeleteFamilyDifferentRows) do |i|
  row = i.to_s.to_java_bytes
  T.put(Put.new(row).addColumn(CF, CQ, VALUE))
  T.delete(Delete.new(row))  # row delete => DeleteFamily marker
end
```
One approach: only seek on the N-th contiguous delete marker. The first N-1 markers are handled with SKIP as before.

Would this kind of heuristic make sense? A higher N reduces the relative overhead in the worst case, but delays the benefit in the best case.

Note: this patch does not compare qualifiers of contiguous delete markers. Doing so (e.g. exposing a method on …) would let the counter apply only to markers for the same column, but is not done in this version of the patch.
(force-pushed from 6be48a0 to e7dc782)
Seeking is ~50% more expensive than skipping. When each row has only one DeleteFamily or DeleteColumn marker (common case), the seek overhead adds up across many rows, causing ~50% scan regression. Introduce a counter that tracks consecutive range delete markers per row. Only switch from SKIP to SEEK after seeing SEEK_ON_DELETE_MARKER_THRESHOLD (default 3) markers, indicating actual accumulation. This preserves skip performance for the common case while still optimizing the accumulation case.
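A minimal sketch of the counter described in this commit message (illustrative names only; the real logic lives in ScanQueryMatcher and ScanDeleteTracker):

```java
// Sketch of the SKIP-until-threshold heuristic: return the cheap SKIP
// for the first few range delete markers and only switch to SEEK once
// markers have demonstrably accumulated. Names are illustrative.
class DeleteMarkerCounter {
    static final int SEEK_ON_DELETE_MARKER_THRESHOLD = 3;

    private int consecutiveMarkers = 0;

    // Called for each DeleteColumn/DeleteFamily marker the matcher sees.
    String onRangeDeleteMarker() {
        consecutiveMarkers++;
        return consecutiveMarkers >= SEEK_ON_DELETE_MARKER_THRESHOLD
                ? "SEEK" : "SKIP";
    }

    // A non-marker cell or a new row breaks the run of markers.
    void reset() {
        consecutiveMarkers = 0;
    }
}
```

For a row with a single marker (the regression case) the answer stays SKIP; only the third consecutive marker flips to SEEK.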
(force-pushed from e7dc782 to 59ad767)
I updated the patch to compare qualifiers of contiguous delete markers, so the counter only increments for consecutive markers targeting the same column. With this, we don't need such a large N value to avoid the regression in the worst case; N=3 works correctly with this approach.

Even with qualifier comparison, false positives remain: exactly N consecutive redundant DeleteColumn markers for the same qualifier still trigger a seek. This should be rare in practice; if the overhead is a concern, increasing N is the only alternative. Here are the results.
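The qualifier comparison described above could be sketched like this (hypothetical names; DeleteFamily markers are modeled with an empty qualifier):

```java
import java.util.Objects;

// Qualifier-aware variant: the run of markers only grows while
// consecutive markers target the same column, so unrelated markers
// no longer push the counter toward SEEK. Illustrative sketch.
class QualifierAwareCounter {
    static final int THRESHOLD = 3; // the "N" discussed above

    private int run = 0;
    private String lastQualifier = null;

    // Returns true when the matcher should SEEK instead of SKIP.
    boolean onDeleteMarker(String qualifier) {
        if (Objects.equals(qualifier, lastQualifier)) {
            run++;
        } else {
            run = 1;
            lastQualifier = qualifier;
        }
        return run >= THRESHOLD;
    }

    // A non-marker cell or a new row resets the run.
    void reset() {
        run = 0;
        lastQualifier = null;
    }
}
```

A marker for a different qualifier restarts the run, which is why N=3 no longer misfires on rows that merely contain several unrelated deletes.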
Context
HBASE-30036 (#7993) consolidates redundant delete markers on flush, preventing them from growing unbounded in HFiles. However, markers still accumulate in the memstore before flush, degrading read performance. HBASE-29039 addresses this from the read-path side; both are needed for full coverage.

There is an open PR (#6557), but its review has stalled. This is an alternative approach with fewer code changes, hopefully making it easier to reach consensus.
Test result
Using the test code in HBASE-30036.
[Benchmark result tables: DeleteFamily, DeleteColumnContiguous, DeleteColumnInterleaved]