[C++] feat: add streaming Snappy codec using official framing format #49183
Open
taiyang-li wants to merge 3 commits into apache:main from
## Summary

Implement a streaming Snappy compressor/decompressor for Arrow C++ using the official Snappy framing format, including per-chunk masked CRC-32C verification, and enable the existing streaming tests for Snappy.

## Details

- Add a small `crc32c_masked` helper in `arrow::util` to compute the masked CRC-32C checksum as defined by the Snappy framing specification.
- Extend the C++ util build to compile `crc32c.cc` and link it into the main util library.
- Reimplement the Snappy codec streaming layer in `compression_snappy.cc`:
  - Keep one-shot `Codec::Compress/Decompress` based on raw Snappy bitstreams (RawCompress/RawUncompress).
  - Implement `SnappyFramedCompressor`, which emits the official stream identifier chunk and splits the uncompressed stream into 64 KiB chunks, each wrapped as a framed chunk with a per-chunk masked CRC-32C checksum.
  - Implement `SnappyFramedDecompressor` as a stateful parser for Snappy framed streams that validates the stream identifier, handles compressed/uncompressed/skippable chunks, verifies the masked CRC-32C of the uncompressed payload, and supports incremental output via the `Decompress` API.
  - Wire `Codec::MakeCompressor` / `Codec::MakeDecompressor` for `Compression::SNAPPY` to the new framed implementations.
- Generalize the streaming compression/decompression tests in `compression_test.cc` so that they:
  - Validate streaming compressor output using the streaming decompressor instead of the one-shot codec, aligning with codecs where streaming and one-shot formats differ.
  - Generate inputs for `CheckStreamingDecompressor` using the streaming compressor rather than one-shot compression.
- Remove the Snappy-specific skips in `StreamingCompressor`, `StreamingDecompressor`, `StreamingRoundtrip`, `StreamingDecompressorReuse`, and `StreamingMultiFlush`, so streaming tests now cover Snappy as well as the existing codecs.
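For reviewers unfamiliar with the framing format, here is a minimal sketch of the masked checksum and chunk layout described above. It is not the code in this PR: `Crc32c`, `MaskCrc32c`, `AppendLittleEndian`, `StreamIdentifier`, and `FrameUncompressedChunk` are hypothetical names chosen for illustration, and the bitwise CRC loop is a deliberately simple stand-in for whatever `crc32c.cc` actually does. The constants (stream identifier bytes, chunk type `0x01`, the rotate-by-15 mask with `0xA282EAD8`, the 64 KiB limit) follow the Snappy framing specification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Slow but dependency-free; real code would use a table or hardware CRC.
inline uint32_t Crc32c(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // XOR in the polynomial only when the low bit is set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Masking step from the framing format: rotate right by 15 bits,
// then add a constant (wrapping modulo 2^32).
inline uint32_t MaskCrc32c(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + 0xA282EAD8u;
}

// Append the low `nbytes` bytes of `value` in little-endian order.
inline void AppendLittleEndian(std::vector<uint8_t>* out, uint32_t value,
                               int nbytes) {
  for (int i = 0; i < nbytes; ++i) {
    out->push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFFu));
  }
}

// Stream identifier chunk: type 0xFF, 3-byte length 6, payload "sNaPpY".
inline std::vector<uint8_t> StreamIdentifier() {
  return {0xFF, 0x06, 0x00, 0x00, 0x73, 0x4E, 0x61, 0x50, 0x70, 0x59};
}

// Frame up to 64 KiB of input as an uncompressed data chunk (type 0x01):
// [type:1][length:3 LE][masked CRC-32C of the data:4 LE][data].
// The length field counts the checksum plus the data.
inline std::vector<uint8_t> FrameUncompressedChunk(const uint8_t* data,
                                                   size_t len) {
  assert(len <= 65536);  // per-chunk limit on uncompressed bytes
  std::vector<uint8_t> out;
  out.reserve(8 + len);
  out.push_back(0x01);
  AppendLittleEndian(&out, static_cast<uint32_t>(len + 4), 3);
  AppendLittleEndian(&out, MaskCrc32c(Crc32c(data, len)), 4);
  out.insert(out.end(), data, data + len);
  return out;
}
```

A compressed data chunk (type `0x00`) has the same header, but the payload is the raw Snappy bitstream while the checksum still covers the uncompressed bytes. This is why the decompressor can only verify the CRC after decoding a chunk. As a sanity check on the CRC routine, the standard CRC-32C check value for the ASCII string `123456789` is `0xE3069283`.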
## Testing

The sandbox used for this change has no configured C/C++ toolchain or Ninja, so a local CMake/Ninja build with `ARROW_WITH_SNAPPY=ON` and `ARROW_BUILD_TESTS=ON` could not be completed. The changes are limited to the C++ util layer and its unit tests; they should be validated by running the standard C++ test suite (in particular `util-compression-test`) in a fully provisioned Arrow development environment.

Co-Authored-By: Aime <aime@bytedance.com>
Change-Id: I97c877d81959c13578c6f251cb6c8a8141297d6a
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: