GH-3601: Cache shouldIgnoreStatistics version parsing result#3607
GH-3601: Cache shouldIgnoreStatistics version parsing result#3607yadavay-amzn wants to merge 1 commit into
Conversation
asifsmohammed
left a comment
There was a problem hiding this comment.
I generally like the approach @yadavay-amzn, here's why I would not use cache alone as I mentioned in the issue. It's better than the current approach of checking it RGxCC times
- In high workload scenarios we are going to call
shouldIgnoreStatisticsRGxCC times i.e. we are going to call.sizeand.computeIfAbsent(hash + equals) that many times which is still not ideal.
The cleaner fix would be at the caller side in ParquetMetadataConverter.fromParquetMetadata(): compute shouldIgnoreStatistics once before the row group loop using the file-level createdBy, then pass the pre-computed boolean through buildColumnChunkMetaData → fromParquetStatisticsInternal. This eliminates all per-column overhead — no hash, no lookup, just a boolean flowing through the call chain.
This eliminates the need to check shouldIgnoreStatistics for every RGxCC. Both approaches could coexist (cache helps other callers), but the caller-side fix is where we see real improvements
…tMetadataConverter
e1ce8ad to
cbd31f7
Compare
|
@asifsmohammed Thanks for reviewing! I am an agreement with your suggestions. Removed the global static cache entirely. Now computes cc @wgtmac |
There was a problem hiding this comment.
Pull request overview
Optimizes footer metadata conversion by avoiding repeated CorruptStatistics.shouldIgnoreStatistics(created_by, ...) evaluation during row-group/column iteration, by precomputing a boolean once and threading it through column-chunk metadata/statistics conversion.
Changes:
- Added a new internal statistics conversion overload that accepts a precomputed
shouldIgnoreCorruptStatsboolean. - Added an overloaded
buildColumnChunkMetaDatamethod that accepts the precomputed flag and routes footer-reading through it. - Precomputes
shouldIgnoreCorruptStatsonce infromParquetMetadata(...)and passes it into the column loop.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| boolean shouldIgnoreCorruptStats = | ||
| CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY); | ||
| return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats); |
| // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY | ||
| // (the only types affected by PARQUET-251), and always false for other types. | ||
| boolean shouldIgnoreCorruptStats = | ||
| CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY); |
There was a problem hiding this comment.
This is calling shouldIgnoreStatistics with PrimitiveTypeName.BINARY hardcoded which is incorrect.
Instead we can refactor shouldIgnoreStatistics by adding public methods.
public static boolean shouldIgnoreStatistics(String createdBy, PrimitiveTypeName columnType) {
if (!isCorruptStatisticsColumnType(columnType)) {
// the bug only applies to binary columns
return false;
}
return fileHasCorruptStatistics(createdBy);
}
public static boolean isCorruptStatisticsColumnType(PrimitiveTypeName columnType) {
return columnType == PrimitiveTypeName.BINARY || columnType == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
}
public static boolean fileHasCorruptStatistics(String createdBy) {
// rest of the logic from shouldIgnoreStatistics
}
| // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY | ||
| // (the only types affected by PARQUET-251), and always false for other types. |
| } | ||
|
|
||
| // Overload that uses a pre-computed shouldIgnoreCorruptStats flag to avoid redundant parsing | ||
| private org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal( |
There was a problem hiding this comment.
Instead of duplicating the entire fromParquetStatisticsInternal body, the existing method can simply delegate to a new overload and this eliminates duplicate code
static org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(
String createdBy, Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder) {
return fromParquetStatisticsInternal(
formatStats,
type,
typeSortOrder,
CorruptStatistics.fileHasCorruptStatistics(createdBy) // This is a new method in CorruptStatistics
);
}
// overloaded method
static org.apache.parquet.column.statistics.Statistics fromParquetStatisticsInternal(
Statistics formatStats, PrimitiveType type, SortOrder typeSortOrder, boolean fileHasCorruptStats) {
| // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY | ||
| // (the only types affected by PARQUET-251), and always false for other types. | ||
| boolean shouldIgnoreCorruptStats = | ||
| CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY); |
There was a problem hiding this comment.
This is calling shouldIgnoreStatistics with PrimitiveTypeName.BINARY hardcoded which is incorrect.
Instead we can refactor shouldIgnoreStatistics by adding public methods.
public static boolean shouldIgnoreStatistics(String createdBy, PrimitiveTypeName columnType) {
if (!isCorruptStatisticsColumnType(columnType)) {
// the bug only applies to binary columns
return false;
}
return fileHasCorruptStatistics(createdBy);
}
public static boolean isCorruptStatisticsColumnType(PrimitiveTypeName columnType) {
return columnType == PrimitiveTypeName.BINARY || columnType == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;
}
public static boolean fileHasCorruptStatistics(String createdBy) {
// rest of the logic from shouldIgnoreStatistics
}
| } | ||
| } | ||
|
|
||
| String createdBy = parquetMetadata.getCreated_by(); |
There was a problem hiding this comment.
We no longer need to pass createdBy downstream
| return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats); | ||
| } | ||
|
|
||
| ColumnChunkMetaData buildColumnChunkMetaData( |
There was a problem hiding this comment.
buildColumnChunkMetaData can delegate to a package-private overload that takes the boolean similar to what you have done but with few changes,
public ColumnChunkMetaData buildColumnChunkMetaData(
ColumnMetaData metaData, ColumnPath columnPath, PrimitiveType type, String createdBy) {
return buildColumnChunkMetaData(
metaData, columnPath, type, CorruptStatistics.fileHasCorruptStatistics(createdBy));
}
ColumnChunkMetaData buildColumnChunkMetaData(
ColumnMetaData metaData, ColumnPath columnPath, PrimitiveType type, boolean fileHasCorruptStats) {
SortOrder expectedOrder = overrideSortOrderToSigned(type) ? SortOrder.SIGNED : sortOrder(type);
return ColumnChunkMetaData.get(...,
fromParquetStatisticsInternal(metaData.statistics, type, expectedOrder, fileHasCorruptStats), ...);
}
No need to pass createdBy downstream, the boolean is all the internal overload needs. SortOrder computation moves here since we bypass fromParquetStatistics to avoid re-parsing createdBy as you have already done by replacing fromParquetStatisticsInternal with fromParquetStatistics.
Also notice how the new public methods we extracted in CorruptStatistics are being used in each delegate method here
| boolean ignoreForThisColumn = shouldIgnoreCorruptStats | ||
| && (primitiveTypeName == PrimitiveTypeName.BINARY | ||
| || primitiveTypeName == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY); | ||
| if (!ignoreForThisColumn && (sortOrdersMatch || maxEqualsMin)) { |
There was a problem hiding this comment.
This check in shouldIgnoreStatistics is dead code with current changes as we always pass BINARY
if (columnType != PrimitiveTypeName.BINARY && columnType !=PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
We can utilize the new methods here. Please refer to below comments for context.
if (!(fileHasCorruptStats && CorruptStatistics.isCorruptStatisticsColumnType(type.getPrimitiveTypeName()))
Instead of calling or moving the PrimitiveTypeName checks to ParquetMetadataConverter and leave it as the responsibility of CorruptStatistics
if (!CorruptStatistics.shouldIgnoreStatistics(createdBy, type.getPrimitiveTypeName())
Closes #3601.
Summary
Caches the result of
CorruptStatistics.shouldIgnoreStatisticsto avoid redundant version string parsing. ThecreatedBystring is constant per file but was parsed R×C times (row groups × columns) during footer reading.Changes
ConcurrentHashMap<String, Boolean>cache (max 64 entries) keyed oncreatedByTesting
testCachingBehavior: verifies cache is populated (cacheSize grows from 0→1→2)testCorrectnessWhenCacheIsFull: fills cache to 64 entries, verifies correct results on the cache-bypass path (65th+ distinct strings)CorruptStatisticsTesttests pass (correctness preserved)