[client] Support log scanner scan to arrow record batch #2995

Draft
luoyuxia wants to merge 1 commit into apache:main from luoyuxia:log-scanner-support-scan-log-record-batch

luoyuxia (Contributor) commented Apr 3, 2026

Purpose

Linked issue: close #2965

Support LogScanner to scan log data as Arrow record batches (ArrowBatchData) instead of row-by-row ScanRecord, enabling consumers like the tiering service to process columnar Arrow batches directly.

Brief change log

  • Add LogScannerImpl.pollRecordBatch(Duration) returning ArrowScanRecords for ARROW-format append-only log tables.
  • Add ArrowBatchData to hold scanned Arrow VectorSchemaRoot with log metadata and memory ownership.
  • Add LogRecordBatch.loadArrowBatch(ReadContext) to load Arrow batch directly from batch memory.
  • Introduce ArrowRecordBatchContext / UnshadedArrowBatchAccess to bridge shaded/unshaded Arrow types with batch-scoped child allocators.
  • Add unshaded Arrow compression support and UnshadedArrowReadUtils for the unshaded read path with schema evolution support.
  • Extract AbstractLogFetchCollector<T, R> to share fetch-collection logic between row-based and Arrow batch paths.
  • Extract CompletedFetch.nextFetchedBatch() / finishFetchedBatches() and generalize LogScannerImpl.doPoll() for code reuse.

Tests

  • LogScannerITCase.testPollArrowBatchesWithSchemaEvolution
  • FileLogInputStreamTest.testLoadArrowBatchFromFileLogInputStream

API and Format

  • New @Internal API: LogScannerImpl.pollRecordBatch(Duration), ArrowBatchData, ArrowScanRecords.
  • No storage format changes.

Documentation

N/A

luoyuxia force-pushed the log-scanner-support-scan-log-record-batch branch from bfa46ef to e29e8c5 on April 3, 2026, 13:18
luoyuxia requested a review from Copilot on April 3, 2026, 13:19

Copilot AI left a comment

Pull request overview

This PR adds an Arrow-native scan path so Fluss clients can poll log data as Arrow VectorSchemaRoot batches (instead of iterating row-by-row), including support for schema evolution (missing columns filled with nulls) and projection reordering.
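The two behaviours named here, null-filling columns that an older batch does not contain and emitting columns in the requested projection order, can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the PR's implementation: `ProjectionSketch` and `project` are hypothetical names, and plain `Object[]` arrays keyed by column name take the place of Arrow vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: Object[] columns stand in for Arrow vectors.
final class ProjectionSketch {

    private ProjectionSketch() {}

    // Returns the columns in the order given by `requested`. A requested column
    // that the batch does not contain (for example, one added to the table after
    // the batch was written) is emitted as a column of rowCount nulls.
    static List<Object[]> project(
            Map<String, Object[]> batchColumns, List<String> requested, int rowCount) {
        List<Object[]> out = new ArrayList<>();
        for (String name : requested) {
            Object[] column = batchColumns.get(name);
            // new Object[rowCount] is initialised to all nulls by the JVM.
            out.add(column != null ? column : new Object[rowCount]);
        }
        return out;
    }
}
```

For a batch holding `id` and `name` and a request for `name, age, id`, the sketch yields the `name` values, then a null column for `age`, then the `id` values.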

Changes:

  • Introduces ArrowLogScanner / ArrowScanRecords and the ArrowBatchData container API for polling Arrow record batches from the client.
  • Adds unshaded-Arrow batch loading + client-side projection utilities (UnshadedArrowReadUtils, UnshadedFlussVectorLoader, unshaded compression codecs/factory).
  • Refactors existing log scanner/collector code to share implementation via new abstract base classes and adds tests covering Arrow-batch loading and schema evolution.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| fluss-common/src/main/java/org/apache/fluss/record/LogRecordBatch.java | Adds loadArrowBatch(...) API and Arrow-specific methods to ReadContext. |
| fluss-common/src/main/java/org/apache/fluss/record/LogRecordReadContext.java | Adds selected row type + Arrow column projection calculation; makes Arrow resources lazily created. |
| fluss-common/src/main/java/org/apache/fluss/record/DefaultLogRecordBatch.java | Implements loadArrowBatch(...) for Arrow log batches, including optional client-side projection. |
| fluss-common/src/main/java/org/apache/fluss/record/ArrowBatchData.java | New public container holding VectorSchemaRoot + log metadata + change types; closeable. |
| fluss-common/src/main/java/org/apache/fluss/utils/UnshadedArrowReadUtils.java | New utilities to deserialize unshaded Arrow batches and project/reorder vectors. |
| fluss-common/src/main/java/org/apache/fluss/record/UnshadedFlussVectorLoader.java | New unshaded variant of FlussVectorLoader for scanner/read path. |
| fluss-common/src/main/java/org/apache/fluss/compression/Unshaded*ArrowCompressionCodec.java | New unshaded Arrow compression codecs for LZ4/Zstd. |
| fluss-common/src/main/java/org/apache/fluss/compression/UnshadedArrowCompressionFactory.java | Factory providing unshaded compression codecs for read path. |
| fluss-common/src/main/java/org/apache/fluss/row/ProjectedRow.java | Exposes index mapping via new accessor. |
| fluss-common/src/main/java/org/apache/fluss/record/FileLogInputStream.java | Delegates loadArrowBatch() for file-backed batches. |
| fluss-common/src/test/java/org/apache/fluss/record/MemoryLogRecordsArrowBuilderTest.java | Adds schema-evolution test for missing projected columns in Arrow batches. |
| fluss-common/src/test/java/org/apache/fluss/record/FileLogInputStreamTest.java | Adds test to load Arrow batch from file log input stream. |
| fluss-common/pom.xml | Adds unshaded Arrow dependencies (currently provided). |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/Scan.java | Adds createArrowLogScanner() API. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/TableScan.java | Implements createArrowLogScanner() with log-format/limit validation. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/ArrowLogScanner*.java | New Arrow scanner interfaces + implementation. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/ArrowScanRecords.java | New container for scanned Arrow batches per bucket. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogScannerImpl.java | Refactors to share logic via AbstractLogScanner. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/AbstractLogScanner.java | New shared scanner implementation (poll/subscribe/unsubscribe/wakeup/close). |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java | Adds Arrow fetch collection path. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/ArrowLogFetchCollector.java | New fetch collector producing ArrowScanRecords. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/AbstractLogFetchCollector.java | New shared fetch-collection implementation. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetchCollector.java | Refactors to extend AbstractLogFetchCollector. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/CompletedFetch.java | Adds fetchArrowBatches(...) path that loads batches via loadArrowBatch(...). |
| fluss-client/src/test/java/org/apache/fluss/client/table/scanner/log/LogScannerITCase.java | Adds IT coverage for Arrow scanning, schema evolution, and projection reorder. |
| fluss-client/pom.xml | Adds unshaded Arrow dependencies (currently provided). |

luoyuxia force-pushed the log-scanner-support-scan-log-record-batch branch 7 times, most recently from b865197 to 1432afb on April 7, 2026, 07:42
luoyuxia requested a review from Copilot on April 7, 2026, 07:48

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.

luoyuxia force-pushed the log-scanner-support-scan-log-record-batch branch from 1432afb to e897604 on April 7, 2026, 08:46
luoyuxia marked this pull request as ready for review on April 7, 2026, 08:47
luoyuxia force-pushed the log-scanner-support-scan-log-record-batch branch from e897604 to 1daf809 on April 7, 2026, 08:55
luoyuxia force-pushed the log-scanner-support-scan-log-record-batch branch from 1daf809 to 433fd85 on April 7, 2026, 08:57
luoyuxia marked this pull request as draft on April 9, 2026, 03:45

Development

Successfully merging this pull request may close these issues.

support log scanner scan data as arrow record batch