feat(puffin): add format constants, utilities, and JSON serialization#603
feat(puffin): add format constants, utilities, and JSON serialization#603zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds core Puffin file-format helpers and JSON (de)serialization support for Puffin metadata, plus unit tests and build-system wiring to compile/run the new components.
Changes:
- Introduce
puffin_formatconstants/utilities (footer flags, little-endian I/O, compression dispatch stubs). - Add
puffin_json_internalJSON serialization/deserialization forBlobMetadataandFileMetadata. - Add
puffin_testand register it in both CMake and Meson test targets.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/iceberg/puffin/puffin_format.h |
Declares Puffin format constants and helper APIs (flags, endian I/O, compress/decompress). |
src/iceberg/puffin/puffin_format.cc |
Implements flags/endian helpers and compression dispatch stubs. |
src/iceberg/puffin/puffin_json_internal.h |
Declares JSON (de)serialization APIs for Puffin metadata. |
src/iceberg/puffin/puffin_json_internal.cc |
Implements JSON (de)serialization and string parsing for FileMetadata. |
src/iceberg/test/puffin_test.cc |
Adds unit tests for endian/flags/compression stubs and JSON round-trips. |
src/iceberg/test/CMakeLists.txt |
Registers puffin_test in CMake-based test build. |
src/iceberg/test/meson.build |
Registers puffin_test in Meson-based test build. |
src/iceberg/meson.build |
Adds new Puffin .cc sources to the main Meson library build. |
src/iceberg/CMakeLists.txt |
Adds new Puffin .cc sources to the main CMake library build. |
src/iceberg/puffin/meson.build |
Updates Puffin header installation list. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| /// Default compression codec for footer payload. | ||
| static constexpr PuffinCompressionCodec kFooterCompressionCodec = | ||
| PuffinCompressionCodec::kLz4; |
There was a problem hiding this comment.
PuffinFormat::kFooterCompressionCodec is set to kLz4, but Compress/Decompress currently return NotSupported for LZ4. This constant is likely to be used as the default when writing/reading footer payloads, and will cause immediate failures once that code lands. Consider defaulting to kNone until LZ4 support is implemented, or clearly documenting that the default codec is not yet supported and must not be used for writing.
| PuffinCompressionCodec::kLz4; | |
| PuffinCompressionCodec::kNone; |
There was a problem hiding this comment.
Keeping kLz4 as it reflects the Puffin spec's default footer compression codec, consistent with the Java implementation. This constant is a spec-level declaration — actual usage in the writer will check the Result from Compress(), so unsupported codecs surface as explicit errors.
There was a problem hiding this comment.
I think this is actually a real compatibility concern where java puffin readers would likely break if a compressed footer was ever properly written. Java has never implemented lz4 compression: https://github.com/apache/iceberg/blob/8a51a685894eace51474f512b46a187566f27202/core/src/main/java/org/apache/iceberg/puffin/PuffinFormat.java#L112 which IIUC means anything written with a compressed footer would not be readable from Java. If you agree with this assessment it might be misreading the code but might be worth a discussion on the mailing list.
There was a problem hiding this comment.
Looking more closely it looks like java would just throw unsupported codec for compressed footers
There was a problem hiding this comment.
I agree the spec calls for lz4, so there is maybe nothing to be done here
18a4dcd to
39d4a91
Compare
src/iceberg/puffin/puffin_format.h
Outdated
| ICEBERG_EXPORT void WriteInt32LittleEndian(int32_t value, std::span<uint8_t, 4> output); | ||
|
|
||
| /// \brief Read a 32-bit integer from a fixed-size span in little-endian format. | ||
| ICEBERG_EXPORT int32_t ReadInt32LittleEndian(std::span<const uint8_t, 4> input); |
There was a problem hiding this comment.
There is very code for this in the roaring bitmap code it should probably standardized in a common place?
There was a problem hiding this comment.
Done. Added WriteLittleEndian/ReadLittleEndian templates to endian.h as public API. Updated both puffin_format.cc and roaring_position_bitmap.cc to use them, removing the duplicated local helpers.
src/iceberg/puffin/puffin_format.h
Outdated
|
|
||
| /// \brief Decompress data using the specified codec. | ||
| ICEBERG_EXPORT Result<std::vector<uint8_t>> Decompress(PuffinCompressionCodec codec, | ||
| std::span<const uint8_t> input); |
There was a problem hiding this comment.
Roaring bitmap is using string_view as collection of bytes, we should probably standardize on one of them.
There was a problem hiding this comment.
I used span here since it feels more natural for binary data. But I can see the argument for string_view given that CRoaring already uses std::string for serialization. Do you have a preference on which one the project should standardize on?
There was a problem hiding this comment.
We're on C++23 so std::byte is definitely available. Together with my previous comment for a generic codec interface/implementation, we can use std::span<const std::byte> as input and std::vector<std::byte> as output but with overloads that accepting std::string_view and returning std::string, etc.
There was a problem hiding this comment.
@wgtmac I guess my main concern here would be to try to maintain internal consistency within all the code in the project. For consuming use-cases the overloads might be worth it. I wonder if it would be too premature to consider having a Buffer object?
There was a problem hiding this comment.
I'm hesitant to introduce a Buffer (and corresponding MemoryPool) at this moment. They are only used as ephemeral input/output buffers for CompressionCodec and FileIO.
|
|
||
| /// \brief Deserialize a FileMetadata from a JSON string. | ||
| ICEBERG_EXPORT Result<FileMetadata> FileMetadataFromJsonString( | ||
| const std::string& json_string); |
There was a problem hiding this comment.
nit: std::string_view?
There was a problem hiding this comment.
seems like tests should be seperated per .cc?
2ec6aa5 to
abc7aa8
Compare
6163111 to
7c823ec
Compare
7c823ec to
a60eb24
Compare
| } | ||
| } | ||
|
|
||
| /// \brief Write a value in little-endian format to a buffer. |
There was a problem hiding this comment.
It would be good to add a comment to say that it is caller's responsibility to guarantee that output has sufficient length, otherwise it is a UB.
|
|
||
| /// \brief Read a value in little-endian format from a buffer. | ||
| template <EndianConvertible T> | ||
| T ReadLittleEndian(const void* input) { |
|
|
||
| ICEBERG_EXPORT std::string ToString(const BlobMetadata& blob_metadata); | ||
|
|
||
| inline std::ostream& operator<<(std::ostream& os, const BlobMetadata& b) { |
There was a problem hiding this comment.
We don't have any precedence to add this. It is preferable to use std::format wherever possible.
| #include <nlohmann/json_fwd.hpp> | ||
|
|
||
| #include "iceberg/iceberg_export.h" | ||
| #include "iceberg/puffin/file_metadata.h" |
There was a problem hiding this comment.
Can we remove this and add a iceberg/puffin/type_fwd.h for forward declarations?
|
|
||
| /// \brief Read a 32-bit integer from a buffer at the given offset in little-endian | ||
| /// format. | ||
| ICEBERG_EXPORT Result<int32_t> ReadInt32LittleEndian(std::span<const uint8_t> data, |
There was a problem hiding this comment.
Should we move this to endian.h? If not, I'd be inclined to remove this for now and then add it back as an internal function in the source file where it is called.
| ICEBERG_EXPORT Result<FileMetadata> FileMetadataFromJson(const nlohmann::json& json); | ||
|
|
||
| /// \brief Serialize a FileMetadata to a JSON string. | ||
| ICEBERG_EXPORT std::string ToJsonString(const FileMetadata& file_metadata, |
There was a problem hiding this comment.
Do we really need to specialize this?
| namespace iceberg::puffin { | ||
|
|
||
| TEST(PuffinFormatTest, ByteOrderRoundTrip) { | ||
| std::array<uint8_t, 4> buf{}; |
There was a problem hiding this comment.
This test seems unrelated to puffin. Move it to endian_test.cc?
| INSTANTIATE_TEST_SUITE_P( | ||
| PuffinJson, InvalidBlobMetadataJsonTest, | ||
| ::testing::Values( | ||
| InvalidBlobMetadataJsonParam{.name = "MissingType", |
There was a problem hiding this comment.
Add a MissingSequenceNumber test case for completeness.
Second PR in the Puffin series (following #588).
puffin_format.h/.cc: format constants, footer flag operations, little-endian I/O, compress/decompress dispatch (LZ4/Zstd stubbed as NotSupported)
puffin_json_internal.h/.cc: JSON serialization for BlobMetadata and FileMetadata