
[C++] feat: add streaming Snappy codec using official framing format #49183

Open
taiyang-li wants to merge 3 commits into apache:main from taiyang-li:aime/1769980800-snappy-streaming-official


@taiyang-li taiyang-li commented Feb 8, 2026

## Summary

Implement streaming Snappy compressor/decompressor for Arrow C++ using the official Snappy framing format, including per-chunk masked CRC-32C verification, and enable the existing streaming tests for Snappy.

## Details

- Add a small `crc32c_masked` helper in `arrow::util` to compute the masked CRC-32C checksum as defined by the Snappy framing specification.
- Extend the C++ util build to compile `crc32c.cc` and link it into the main util library.
- Reimplement the Snappy codec streaming layer in `compression_snappy.cc`:
  - Keep one-shot `Codec::Compress/Decompress` based on raw Snappy bitstreams (RawCompress/RawUncompress).
  - Implement `SnappyFramedCompressor`, which emits the official stream identifier chunk and splits the uncompressed stream into 64 KiB chunks, each wrapped as a framed chunk with a per-chunk masked CRC-32C checksum.
  - Implement `SnappyFramedDecompressor` as a stateful parser for Snappy framed streams that validates the stream identifier, handles compressed/uncompressed/skippable chunks, verifies the masked CRC-32C of the uncompressed payload, and supports incremental output via the `Decompress` API.
- Wire `Codec::MakeCompressor` / `Codec::MakeDecompressor` for `Compression::SNAPPY` to the new framed implementations.
- Generalize the streaming compression/decompression tests in `compression_test.cc` so that they:
  - Validate streaming compressor output using the streaming decompressor instead of the one-shot codec, aligning with codecs where streaming and one-shot formats differ.
  - Generate inputs for `CheckStreamingDecompressor` using the streaming compressor rather than one-shot compression.
  - Remove the Snappy-specific skips in `StreamingCompressor`, `StreamingDecompressor`, `StreamingRoundtrip`, `StreamingDecompressorReuse`, and `StreamingMultiFlush`, so streaming tests now cover Snappy as well as the existing codecs.
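
For reviewers unfamiliar with the framing format, the chunk layout described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: the names `Crc32c`, `MaskCrc32c`, `AppendUncompressedChunk`, and `FrameUncompressed` are hypothetical, the CRC is a slow bitwise version, and the sketch emits uncompressed data chunks (type `0x01`) so that it does not depend on the Snappy library, whereas the compressor in this PR would normally emit compressed chunks (type `0x00`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t Crc32c(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right; XOR in the polynomial when the bit shifted out was set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Masking from the Snappy framing format: rotate right by 15 bits, add a constant.
uint32_t MaskCrc32c(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + 0xA282EAD8u;
}

// Each chunk carries at most 64 KiB of uncompressed data.
constexpr size_t kMaxChunkPayload = 65536;

// Stream identifier chunk: type 0xFF, 3-byte little-endian length 6, then "sNaPpY".
void AppendStreamIdentifier(std::vector<uint8_t>* out) {
  static const uint8_t kId[] = {0xFF, 0x06, 0x00, 0x00, 0x73,
                                0x4E, 0x61, 0x50, 0x70, 0x59};
  out->insert(out->end(), kId, kId + sizeof(kId));
}

// Uncompressed data chunk: type 0x01, 3-byte little-endian body length
// (checksum + payload), 4-byte little-endian masked CRC-32C, then the payload.
void AppendUncompressedChunk(const uint8_t* data, size_t len,
                             std::vector<uint8_t>* out) {
  assert(len <= kMaxChunkPayload);
  const uint32_t body_len = static_cast<uint32_t>(len + 4);
  const uint32_t masked = MaskCrc32c(Crc32c(data, len));
  out->push_back(0x01);
  for (int shift = 0; shift < 24; shift += 8) {
    out->push_back(static_cast<uint8_t>((body_len >> shift) & 0xFF));
  }
  for (int shift = 0; shift < 32; shift += 8) {
    out->push_back(static_cast<uint8_t>((masked >> shift) & 0xFF));
  }
  out->insert(out->end(), data, data + len);
}

// Frame a whole buffer: stream identifier first, then one chunk per 64 KiB slice.
std::vector<uint8_t> FrameUncompressed(const std::vector<uint8_t>& input) {
  std::vector<uint8_t> out;
  AppendStreamIdentifier(&out);
  for (size_t off = 0; off < input.size(); off += kMaxChunkPayload) {
    const size_t n = std::min(kMaxChunkPayload, input.size() - off);
    AppendUncompressedChunk(input.data() + off, n, &out);
  }
  return out;
}
```

A decompressor following the same layout would read the 4-byte chunk header, recompute the masked CRC-32C over the uncompressed payload, and reject the chunk on mismatch.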

## Testing

No local build or test run was completed: this sandbox lacks a configured C/C++ toolchain and Ninja, so a CMake/Ninja build with `ARROW_WITH_SNAPPY=ON` and `ARROW_BUILD_TESTS=ON` could not be run. The changes are limited to the C++ util layer and its unit tests; they should be validated by running the standard C++ test suite (in particular `util-compression-test`) in a fully provisioned Arrow development environment.

Co-Authored-By: Aime <aime@bytedance.com>
Change-Id: I97c877d81959c13578c6f251cb6c8a8141297d6a

github-actions bot commented Feb 8, 2026

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@taiyang-li taiyang-li marked this pull request as draft February 8, 2026 15:04
@taiyang-li taiyang-li changed the title from "[wip] util: add streaming Snappy codec using official framing format" to "[C++] feat: add streaming Snappy codec using official framing format" Feb 8, 2026
@taiyang-li taiyang-li marked this pull request as ready for review February 8, 2026 15:08