GH-48986: [C++][Dataset] Add lazy evaluation infrastructure for ORC predicate pushdown (3/14)#49011
Draft
cbb330 wants to merge 3 commits into apache:main
Add internal utilities for extracting min/max statistics from ORC stripe metadata. This establishes the foundation for statistics-based stripe filtering in predicate pushdown.

Changes:
- Add MinMaxStats struct to hold extracted statistics
- Add ExtractStripeStatistics() function for INT64 columns
- Statistics extraction returns std::nullopt for missing/invalid data
- Validates statistics integrity (min <= max)

This is an internal-only change with no public API modifications. Part of incremental ORC predicate pushdown implementation (PR1/15).
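The extraction contract described in this commit message (return std::nullopt for missing or invalid data, reject min > max) can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's code: RawIntStats is a hypothetical stand-in for whatever the ORC reader exposes per column, and the field names of MinMaxStats are assumptions.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the integer statistics a stripe exposes for one
// column. The real ORC reader types are not shown in this PR description.
struct RawIntStats {
  bool has_minimum = false;
  bool has_maximum = false;
  int64_t minimum = 0;
  int64_t maximum = 0;
  bool has_null = false;
};

// Extracted statistics. Field names are assumptions for illustration.
struct MinMaxStats {
  int64_t min;
  int64_t max;
  bool has_null;
};

// Returns std::nullopt when statistics are missing or fail the integrity
// check (min <= max), so callers fall back to reading the stripe.
std::optional<MinMaxStats> ExtractStripeStatistics(const RawIntStats& raw) {
  if (!raw.has_minimum || !raw.has_maximum) return std::nullopt;
  if (raw.minimum > raw.maximum) return std::nullopt;
  return MinMaxStats{raw.minimum, raw.maximum, raw.has_null};
}
```

Returning an empty optional rather than an error keeps missing statistics a non-fatal condition: a stripe without usable statistics simply cannot be skipped.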
Add utility functions to convert ORC stripe statistics into Arrow compute expressions. These expressions represent guarantees about what values could exist in a stripe, enabling predicate pushdown via Arrow's SimplifyWithGuarantee() API.

Changes:
- Add BuildMinMaxExpression() for creating range expressions
- Support null handling with OR is_null(field) when nulls present
- Add convenience overload accepting MinMaxStats directly
- Expression format: (field >= min AND field <= max) [OR is_null(field)]

This is an internal-only utility with no public API changes. Part of incremental ORC predicate pushdown implementation (PR2/15).
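To illustrate why a guarantee of the form (field >= min AND field <= max) [OR is_null(field)] allows stripes to be skipped, here is a minimal standalone sketch. It does not use Arrow's expression machinery or SimplifyWithGuarantee(); RangeGuarantee and the MayMatch* helpers are hypothetical names that hand-code the same reasoning for three simple predicates.

```cpp
#include <cstdint>

// Hypothetical guarantee mirroring the expression format
// (field >= min AND field <= max) [OR is_null(field)].
struct RangeGuarantee {
  int64_t min;
  int64_t max;
  bool may_be_null;  // true when the stripe statistics report nulls
};

// Could any row in the stripe satisfy `field > value`?
// Nulls never satisfy a comparison, so only the range matters.
bool MayMatchGreaterThan(const RangeGuarantee& g, int64_t value) {
  return g.max > value;
}

// Could any row in the stripe satisfy `field == value`?
bool MayMatchEquals(const RangeGuarantee& g, int64_t value) {
  return g.min <= value && value <= g.max;
}

// Could any row in the stripe satisfy `is_null(field)`?
bool MayMatchIsNull(const RangeGuarantee& g) { return g.may_be_null; }
```

A stripe can be skipped only when the check returns false. A true result means "possibly", so the stripe still has to be read and filtered row by row.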
Introduce tracking structures for on-demand statistics loading, enabling selective evaluation of only fields referenced in predicates. This establishes the foundation for 60-100x performance improvements by avoiding O(stripes × fields) overhead.

Changes:
- Add OrcFileFragment class extending FileFragment
- Add statistics_expressions_ vector (per-stripe guarantee tracking)
- Add statistics_expressions_complete_ vector (per-field completion tracking)
- Initialize structures in EnsureMetadataCached() with mutex protection
- Add FoldingAnd() helper for efficient expression accumulation

Pattern follows Parquet's proven lazy evaluation approach. This is infrastructure-only with no public API exposure yet. Part of incremental ORC predicate pushdown implementation (PR3/15).
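The tracking structures described in this commit message can be sketched as below. This is a simplified standalone model, not the PR's OrcFileFragment: plain strings stand in for Arrow expressions, and LazyStripeStatistics, EnsureFieldsLoaded(), the Loader callback, and the loads() counter are hypothetical names introduced for illustration. Only statistics_expressions_, statistics_expressions_complete_, and FoldingAnd() are names taken from the commit message.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Strings stand in for Arrow expressions; "true" is the neutral guarantee.
using Expr = std::string;

// Fold `r` into `*l` without accumulating redundant "true AND ..." terms.
void FoldingAnd(Expr* l, Expr r) {
  if (*l == "true") {
    *l = std::move(r);
  } else {
    *l = "(" + *l + " AND " + r + ")";
  }
}

class LazyStripeStatistics {
 public:
  // Hypothetical callback: reads the statistics of one field in one stripe
  // and returns the corresponding guarantee.
  using Loader = std::function<Expr(std::size_t stripe, std::size_t field)>;

  LazyStripeStatistics(std::size_t num_stripes, std::size_t num_fields,
                       Loader loader)
      : statistics_expressions_(num_stripes, "true"),
        statistics_expressions_complete_(num_fields, false),
        loader_(std::move(loader)) {}

  // Loads statistics only for fields that have not been processed yet.
  void EnsureFieldsLoaded(const std::vector<std::size_t>& fields) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (std::size_t field : fields) {
      if (statistics_expressions_complete_[field]) continue;
      statistics_expressions_complete_[field] = true;
      for (std::size_t s = 0; s < statistics_expressions_.size(); ++s) {
        FoldingAnd(&statistics_expressions_[s], loader_(s, field));
        ++loads_;
      }
    }
  }

  // Accumulated guarantee for one stripe.
  Expr guarantee(std::size_t stripe) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return statistics_expressions_[stripe];
  }

  // Number of (stripe, field) statistics reads performed so far.
  std::size_t loads() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return loads_;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<Expr> statistics_expressions_;           // one per stripe
  std::vector<bool> statistics_expressions_complete_;  // one per field
  Loader loader_;
  std::size_t loads_ = 0;
};
```

In this model, a fragment with 4 stripes and 100 fields that is filtered on 2 fields performs 8 statistics reads instead of 400, and a repeated scan over the same fields performs none.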
Summary
Part 3/14 of ORC predicate pushdown implementation.
Adds lazy evaluation infrastructure to OrcFileFragment:
Changes
- Add statistics_expressions_ cache to OrcFileFragment
- statistics_expressions_complete_ tracking
- EnsureMetadataCached() for lazy loading
Performance Impact
For a file with 100 columns and a predicate on 2 columns:
Part of stacked PR series. Review after PR 2.