Skip to content

GH-3601: Cache shouldIgnoreStatistics version parsing result#3607

Open
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:fix/3601-shouldIgnoreStatistics-cache
Open

GH-3601: Cache shouldIgnoreStatistics version parsing result#3607
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:fix/3601-shouldIgnoreStatistics-cache

Conversation

@yadavay-amzn

Copy link
Copy Markdown
Contributor

Closes #3601.

Summary

Caches the result of CorruptStatistics.shouldIgnoreStatistics to avoid redundant version string parsing. The createdBy string is constant per file but was parsed R×C times (row groups × columns) during footer reading.

Changes

  • Added a bounded ConcurrentHashMap<String, Boolean> cache (max 64 entries) keyed on createdBy
  • Cheap type check (BINARY/FIXED_LEN_BYTE_ARRAY) still runs before cache lookup
  • When cache is full, computes directly without storing (best-effort cap)

Testing

  • testCachingBehavior: verifies cache is populated (cacheSize grows from 0→1→2)
  • testCorrectnessWhenCacheIsFull: fills cache to 64 entries, verifies correct results on the cache-bypass path (65th+ distinct strings)
  • All existing CorruptStatisticsTest tests pass (correctness preserved)

Comment thread parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java Outdated

@asifsmohammed asifsmohammed left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like the approach @yadavay-amzn, here's why I would not use cache alone as I mentioned in the issue. It's better than the current approach of checking it RGxCC times

  • In high workload scenarios we are going to call shouldIgnoreStatistics RGxCC times i.e. we are going to call .size and .computeIfAbsent(hash + equals) that many times which is still not ideal.

The cleaner fix would be at the caller side in ParquetMetadataConverter.fromParquetMetadata(): compute shouldIgnoreStatistics once before the row group loop using the file-level createdBy, then pass the pre-computed boolean through buildColumnChunkMetaDatafromParquetStatisticsInternal. This eliminates all per-column overhead — no hash, no lookup, just a boolean flowing through the call chain.

This eliminates the need to check shouldIgnoreStatistics for every RGxCC. Both approaches could coexist (cache helps other callers), but the caller-side fix is where we see real improvements

@yadavay-amzn yadavay-amzn force-pushed the fix/3601-shouldIgnoreStatistics-cache branch from e1ce8ad to cbd31f7 Compare June 8, 2026 20:29
@yadavay-amzn

yadavay-amzn commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@asifsmohammed Thanks for reviewing! I am an agreement with your suggestions.

Removed the global static cache entirely. Now computes shouldIgnoreStatistics once per file in fromParquetMetadata before the row-group loop and passes the pre-computed flag through buildColumnChunkMetaData to fromParquetStatisticsInternal.

cc @wgtmac

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimizes footer metadata conversion by avoiding repeated CorruptStatistics.shouldIgnoreStatistics(created_by, ...) evaluation during row-group/column iteration, by precomputing a boolean once and threading it through column-chunk metadata/statistics conversion.

Changes:

  • Added a new internal statistics conversion overload that accepts a precomputed shouldIgnoreCorruptStats boolean.
  • Added an overloaded buildColumnChunkMetaData method that accepts the precomputed flag and routes footer-reading through it.
  • Precomputes shouldIgnoreCorruptStats once in fromParquetMetadata(...) and passes it into the column loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1837 to +1839
boolean shouldIgnoreCorruptStats =
CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY);
return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats);
Comment on lines +1883 to +1886
// Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
// (the only types affected by PARQUET-251), and always false for other types.
boolean shouldIgnoreCorruptStats =
CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is calling shouldIgnoreStatistics with PrimitiveTypeName.BINARY hardcoded which is incorrect.

Instead we can refactor shouldIgnoreStatistics by adding public methods.

public static boolean shouldIgnoreStatistics(String createdBy, PrimitiveTypeName columnType) {
    if (!isCorruptStatisticsColumnType(columnType)) {
      // the bug only applies to binary columns
      return false;
    }
    return fileHasCorruptStatistics(createdBy);
}

public static boolean isCorruptStatisticsColumnType(PrimitiveTypeName columnType) {
    return columnType == PrimitiveTypeName.BINARY || columnType == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
}

public static boolean fileHasCorruptStatistics(String createdBy) {
    // rest of the logic from shouldIgnoreStatistics
}

Comment on lines +1883 to +1884
// Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
// (the only types affected by PARQUET-251), and always false for other types.
}

// Overload that uses a pre-computed shouldIgnoreCorruptStats flag to avoid redundant parsing
private org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of duplicating the entire fromParquetStatisticsInternal body, the existing method can simply delegate to a new overload and this eliminates duplicate code

static org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(
      String createdBy, Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder) {
    return fromParquetStatisticsInternal(
        formatStats, 
        type, 
        typeSortOrder, 
        CorruptStatistics.fileHasCorruptStatistics(createdBy) // This is a new method in CorruptStatistics
    );
}


// overloaded method
static org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(
      Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder, boolean fileHasCorruptStats) {

Comment on lines +1883 to +1886
// Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
// (the only types affected by PARQUET-251), and always false for other types.
boolean shouldIgnoreCorruptStats =
CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is calling shouldIgnoreStatistics with PrimitiveTypeName.BINARY hardcoded which is incorrect.

Instead we can refactor shouldIgnoreStatistics by adding public methods.

public static boolean shouldIgnoreStatistics(String createdBy, PrimitiveTypeName columnType) {
    if (!isCorruptStatisticsColumnType(columnType)) {
      // the bug only applies to binary columns
      return false;
    }
    return fileHasCorruptStatistics(createdBy);
}

public static boolean isCorruptStatisticsColumnType(PrimitiveTypeName columnType) {
    return columnType == PrimitiveTypeName.BINARY || columnType == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
}

public static boolean fileHasCorruptStatistics(String createdBy) {
    // rest of the logic from shouldIgnoreStatistics
}

}
}

String createdBy = parquetMetadata.getCreated_by();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to pass createdBy downstream

return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats);
}

ColumnChunkMetaData buildColumnChunkMetaData(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildColumnChunkMetaData can delegate to a package-private overload that takes the boolean similar to what you have done but with few changes,

public ColumnChunkMetaData buildColumnChunkMetaData(
      ColumnMetaData metaData, ColumnPath columnPath, PrimitiveType type, String createdBy) {
    return buildColumnChunkMetaData(
        metaData, columnPath, type, CorruptStatistics.fileHasCorruptStatistics(createdBy));
}

ColumnChunkMetaData buildColumnChunkMetaData(
      ColumnMetaData metaData, ColumnPath columnPath, PrimitiveType type, boolean fileHasCorruptStats) {
    SortOrder expectedOrder = overrideSortOrderToSigned(type) ? SortOrder.SIGNED : sortOrder(type);
    return ColumnChunkMetaData.get(...,
        fromParquetStatisticsInternal(metaData.statistics, type, expectedOrder, fileHasCorruptStats), ...);
}

No need to pass createdBy downstream, the boolean is all the internal overload needs. SortOrder computation moves here since we bypass fromParquetStatistics to avoid re-parsing createdBy as you have already done by replacing fromParquetStatisticsInternal with fromParquetStatistics.

Also notice how the new public methods we extracted in CorruptStatistics are being used in each delegate method here

Comment on lines +995 to +998
boolean ignoreForThisColumn = shouldIgnoreCorruptStats
&& (primitiveTypeName == PrimitiveTypeName.BINARY
|| primitiveTypeName == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY);
if (!ignoreForThisColumn && (sortOrdersMatch || maxEqualsMin)) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check in shouldIgnoreStatistics is dead code with current changes as we always pass BINARY

if (columnType != PrimitiveTypeName.BINARY && columnType !=PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)

We can utilize the new methods here. Please refer to below comments for context.

if (!(fileHasCorruptStats && CorruptStatistics.isCorruptStatisticsColumnType(type.getPrimitiveTypeName()))

Instead of calling or moving the PrimitiveTypeName checks to ParquetMetadataConverter and leave it as the responsibility of CorruptStatistics

if (!CorruptStatistics.shouldIgnoreStatistics(createdBy, type.getPrimitiveTypeName())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize repeated shouldIgnoreStatistics calls during footer reading in ParquetMetadataConverter

5 participants