GH-3601: Cache shouldIgnoreStatistics version parsing result by yadavay-amzn · Pull Request #3607 · apache/parquet-java

yadavay-amzn · 2026-06-07T02:37:49Z

Closes #3601.

Summary

Caches the result of CorruptStatistics.shouldIgnoreStatistics to avoid redundant version string parsing. The createdBy string is constant per file but was parsed R×C times (row groups × columns) during footer reading.

Changes

Added a bounded ConcurrentHashMap<String, Boolean> cache (max 64 entries) keyed on createdBy
Cheap type check (BINARY/FIXED_LEN_BYTE_ARRAY) still runs before cache lookup
When cache is full, computes directly without storing (best-effort cap)
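
The three bullets above could be sketched roughly as follows. This is an illustration of the best-effort cap only, not the code from this PR: the class name `BoundedResultCache`, the `lookup` method, and the injected `parser` predicate (a stand-in for the version-string parsing) are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch of a best-effort bounded cache keyed on createdBy. */
final class BoundedResultCache {
  private static final int MAX_ENTRIES = 64;

  private final ConcurrentHashMap<String, Boolean> cache = new ConcurrentHashMap<>();
  // Stand-in for the expensive version-string parsing (an assumption for this sketch).
  private final Predicate<String> parser;

  BoundedResultCache(Predicate<String> parser) {
    this.parser = parser;
  }

  boolean lookup(String createdBy) {
    if (createdBy == null) {
      // ConcurrentHashMap rejects null keys, so bypass the cache entirely.
      return parser.test(null);
    }
    Boolean cached = cache.get(createdBy);
    if (cached != null) {
      return cached;
    }
    if (cache.size() >= MAX_ENTRIES) {
      // Cache is full: compute directly without storing.
      return parser.test(createdBy);
    }
    // The size check and the insert are not atomic, so the cap is best-effort:
    // concurrent callers may push the map slightly past MAX_ENTRIES.
    return cache.computeIfAbsent(createdBy, parser::test);
  }

  int size() {
    return cache.size();
  }
}
```

Even on a cache hit, each call still pays for a hash and an equals on the createdBy string.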

Testing

testCachingBehavior: verifies cache is populated (cacheSize grows from 0→1→2)
testCorrectnessWhenCacheIsFull: fills cache to 64 entries, verifies correct results on the cache-bypass path (65th+ distinct strings)
All existing CorruptStatisticsTest tests pass (correctness preserved)

asifsmohammed

I generally like the approach, @yadavay-amzn, and it is better than the current behavior of running the check RGxCC times. As I mentioned in the issue, though, here is why I would not rely on the cache alone.

In high-workload scenarios we would still call shouldIgnoreStatistics RGxCC times, which means calling .size() and .computeIfAbsent() (hash + equals) that many times. That is still not ideal.

The cleaner fix would be at the caller side in ParquetMetadataConverter.fromParquetMetadata(): compute shouldIgnoreStatistics once before the row group loop using the file-level createdBy, then pass the pre-computed boolean through buildColumnChunkMetaData → fromParquetStatisticsInternal. This eliminates all per-column overhead — no hash, no lookup, just a boolean flowing through the call chain.

This removes the need to call shouldIgnoreStatistics for every row group and column chunk. Both approaches could coexist (the cache helps other callers), but the caller-side fix is where we would see the real improvement.
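
A minimal sketch of the hoisting described in this comment, under simplifying assumptions: `HoistedCheckSketch`, its `ColumnType` enum (a stand-in for PrimitiveTypeName), and the `fileCheck` predicate (a stand-in for the createdBy parsing) are hypothetical names, not parquet-java API. It only contrasts where the file-level check is evaluated.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for the footer conversion loop; not parquet-java code. */
final class HoistedCheckSketch {
  /** Simplified stand-in for PrimitiveTypeName. */
  enum ColumnType { BINARY, FIXED_LEN_BYTE_ARRAY, INT32 }

  static boolean isAffectedType(ColumnType type) {
    return type == ColumnType.BINARY || type == ColumnType.FIXED_LEN_BYTE_ARRAY;
  }

  /** Before: the createdBy check runs inside the loop, once per affected column chunk. */
  static int countIgnoredPerColumn(
      List<List<ColumnType>> rowGroups, String createdBy, Predicate<String> fileCheck) {
    int ignored = 0;
    for (List<ColumnType> rowGroup : rowGroups) {
      for (ColumnType type : rowGroup) {
        if (isAffectedType(type) && fileCheck.test(createdBy)) {
          ignored++;
        }
      }
    }
    return ignored;
  }

  /** After: the createdBy check runs once, and only a boolean flows into the loop. */
  static int countIgnoredHoisted(
      List<List<ColumnType>> rowGroups, String createdBy, Predicate<String> fileCheck) {
    boolean fileHasCorruptStats = fileCheck.test(createdBy);
    int ignored = 0;
    for (List<ColumnType> rowGroup : rowGroups) {
      for (ColumnType type : rowGroup) {
        if (fileHasCorruptStats && isAffectedType(type)) {
          ignored++;
        }
      }
    }
    return ignored;
  }
}
```

Both methods return the same count; the difference is that the second evaluates `fileCheck` exactly once regardless of how many row groups and columns the file has.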

yadavay-amzn · 2026-06-08T20:29:13Z

@asifsmohammed Thanks for reviewing! I am in agreement with your suggestions.

Removed the global static cache entirely. Now computes shouldIgnoreStatistics once per file in fromParquetMetadata before the row-group loop and passes the pre-computed flag through buildColumnChunkMetaData to fromParquetStatisticsInternal.

cc @wgtmac

Copilot

Pull request overview

Optimizes footer metadata conversion by avoiding repeated CorruptStatistics.shouldIgnoreStatistics(created_by, ...) evaluation during row-group/column iteration, by precomputing a boolean once and threading it through column-chunk metadata/statistics conversion.

Changes:

Added a new internal statistics conversion overload that accepts a precomputed shouldIgnoreCorruptStats boolean.
Added an overloaded buildColumnChunkMetaData method that accepts the precomputed flag and routes footer-reading through it.
Precomputes shouldIgnoreCorruptStats once in fromParquetMetadata(...) and passes it into the column loop.

+    boolean shouldIgnoreCorruptStats =
+        CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY);
+    return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats);


asif-moh · 2026-06-12T03:24:24Z

+    // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
+    // (the only types affected by PARQUET-251), and always false for other types.
+    boolean shouldIgnoreCorruptStats =
+        CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY);


This calls shouldIgnoreStatistics with PrimitiveTypeName.BINARY hardcoded, which is incorrect.

Instead, we can refactor shouldIgnoreStatistics by extracting public methods:

    public static boolean shouldIgnoreStatistics(String createdBy, PrimitiveTypeName columnType) {
      if (!isCorruptStatisticsColumnType(columnType)) {
        // the bug only applies to binary columns
        return false;
      }
      return fileHasCorruptStatistics(createdBy);
    }

    public static boolean isCorruptStatisticsColumnType(PrimitiveTypeName columnType) {
      return columnType == PrimitiveTypeName.BINARY
          || columnType == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
    }

    public static boolean fileHasCorruptStatistics(String createdBy) {
      // rest of the logic from shouldIgnoreStatistics
    }
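
One way to convince ourselves that this split preserves behavior is a small runnable model of it. The sketch below is an assumption-laden stand-in, not CorruptStatistics itself: `SplitCheckSketch`, its `ColumnType` enum, and the injected `fileLevelCheck` predicate (replacing the real createdBy parsing, which is not reproduced here) are hypothetical.

```java
import java.util.function.Predicate;

/** Hypothetical model of the proposed split; the real parsing logic is injected. */
final class SplitCheckSketch {
  /** Simplified stand-in for PrimitiveTypeName. */
  enum ColumnType { BINARY, FIXED_LEN_BYTE_ARRAY, INT32, DOUBLE }

  // Placeholder for "rest of the logic from shouldIgnoreStatistics".
  private final Predicate<String> fileLevelCheck;

  SplitCheckSketch(Predicate<String> fileLevelCheck) {
    this.fileLevelCheck = fileLevelCheck;
  }

  /** Type-only question: can this column type be affected at all? */
  static boolean isCorruptStatisticsColumnType(ColumnType columnType) {
    return columnType == ColumnType.BINARY || columnType == ColumnType.FIXED_LEN_BYTE_ARRAY;
  }

  /** File-only question: answered once per file, with no dummy column type. */
  boolean fileHasCorruptStatistics(String createdBy) {
    return fileLevelCheck.test(createdBy);
  }

  /** The original entry point, now just the conjunction of the two questions. */
  boolean shouldIgnoreStatistics(String createdBy, ColumnType columnType) {
    if (!isCorruptStatisticsColumnType(columnType)) {
      return false;
    }
    return fileHasCorruptStatistics(createdBy);
  }
}
```

With this shape, a caller that wants the per-file answer can ask `fileHasCorruptStatistics` directly instead of passing BINARY as a placeholder type.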

asif-moh · 2026-06-12T03:21:05Z

  }

+  // Overload that uses a pre-computed shouldIgnoreCorruptStats flag to avoid redundant parsing
+  private org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(


Instead of duplicating the entire fromParquetStatisticsInternal body, the existing method can simply delegate to a new overload, which eliminates the duplicated code:

    static org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(
        String createdBy, Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder) {
      return fromParquetStatisticsInternal(
          formatStats,
          type,
          typeSortOrder,
          CorruptStatistics.fileHasCorruptStatistics(createdBy) // This is a new method in CorruptStatistics
          );
    }

    // overloaded method
    static org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(
        Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder, boolean fileHasCorruptStats) {

asif-moh · 2026-06-12T03:31:08Z

            }
          }

          String createdBy = parquetMetadata.getCreated_by();


We no longer need to pass createdBy downstream

asif-moh · 2026-06-12T03:38:53Z

+    return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats);
+  }
+
+  ColumnChunkMetaData buildColumnChunkMetaData(


buildColumnChunkMetaData can delegate to a package-private overload that takes the boolean, similar to what you have done but with a few changes:

    public ColumnChunkMetaData buildColumnChunkMetaData(
        ColumnMetaData metaData, ColumnPath columnPath, PrimitiveType type, String createdBy) {
      return buildColumnChunkMetaData(
          metaData, columnPath, type, CorruptStatistics.fileHasCorruptStatistics(createdBy));
    }

    ColumnChunkMetaData buildColumnChunkMetaData(
        ColumnMetaData metaData, ColumnPath columnPath, PrimitiveType type, boolean fileHasCorruptStats) {
      SortOrder expectedOrder = overrideSortOrderToSigned(type) ? SortOrder.SIGNED : sortOrder(type);
      return ColumnChunkMetaData.get(
          ...,
          fromParquetStatisticsInternal(metaData.statistics, type, expectedOrder, fileHasCorruptStats),
          ...);
    }

There is no need to pass createdBy downstream; the boolean is all the internal overload needs. The SortOrder computation moves here because we bypass fromParquetStatistics to avoid re-parsing createdBy, as you have already done by calling fromParquetStatisticsInternal instead of fromParquetStatistics.

Also notice how the new public methods we extracted in CorruptStatistics are used in each delegating method here.

asif-moh · 2026-06-12T03:44:36Z

+        boolean ignoreForThisColumn = shouldIgnoreCorruptStats
+            && (primitiveTypeName == PrimitiveTypeName.BINARY
+                || primitiveTypeName == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY);
+        if (!ignoreForThisColumn && (sortOrdersMatch || maxEqualsMin)) {


With the current changes, this check in shouldIgnoreStatistics is dead code, because we always pass BINARY:

    if (columnType != PrimitiveTypeName.BINARY && columnType != PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)

We can use the new methods here instead (see my other comments for context):

    if (!(fileHasCorruptStats && CorruptStatistics.isCorruptStatisticsColumnType(type.getPrimitiveTypeName())))

This replaces the per-column call

    if (!CorruptStatistics.shouldIgnoreStatistics(createdBy, type.getPrimitiveTypeName()))

and, rather than moving the PrimitiveTypeName checks into ParquetMetadataConverter, leaves them as the responsibility of CorruptStatistics.

wgtmac reviewed Jun 8, 2026

Comment thread parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java Outdated

asifsmohammed reviewed Jun 8, 2026

GH-3601: Compute shouldIgnoreStatistics once per file in ParquetMetadataConverter

cbd31f7

yadavay-amzn force-pushed the fix/3601-shouldIgnoreStatistics-cache branch from e1ce8ad to cbd31f7 on June 8, 2026 20:29

wgtmac requested a review from Copilot June 9, 2026 05:24

Copilot started reviewing on behalf of wgtmac June 9, 2026 05:24

Copilot AI reviewed Jun 9, 2026

asif-moh reviewed Jun 12, 2026

yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/3601-shouldIgnoreStatistics-cache

5 participants
