GH-3530: Eagerly release column buffers during row group flush by iemejia · Pull Request #3571 · apache/parquet-java

iemejia · 2026-05-17T22:39:06Z

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Eagerly release column buffers during row group flush for correct resource management.

Call pageWriter.close() after each column in flushToFileWriter() instead of holding all buffers until the entire flush completes. Add writeAllToAndRelease() to ConcatenatingByteBufferCollector for progressive slab-by-slab memory release during write. Make close() idempotent (safe after eager release or double-close).

Changes

Call pageWriter.close() after each column in flushToFileWriter()
Add writeAllToAndRelease() to ConcatenatingByteBufferCollector for progressive slab-by-slab memory release during write
Make close() idempotent (safe to call after eager release or double-close)
Add RowGroupFlushBenchmark (20-column wide schema, PeakTrackingAllocator) and BlackHoleOutputFile for measuring flush performance and peak memory
Add tests for eager release, double-close safety, and output equivalence

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64, -wi 3 -i 5 -f 2, -Xms512m -Xmx1g.

RowGroupFlushBenchmark (100K rows, 20 BINARY columns, 200 bytes each, UNCOMPRESSED):

Row Group Size	Metric	Baseline	Optimized	Delta
8 MB	Flush time (ms/op)	319 ± 11	322 ± 10	~0% (noise)
8 MB	Peak allocator (MB)	9.4	9.4	same
64 MB	Flush time (ms/op)	330 ± 12	327 ± 11	~0% (noise)
64 MB	Peak allocator (MB)	66.0	66.0	same

No throughput regression. Peak memory is byte-for-byte identical with and without this change — the peak is reached during page compression (as pages are accumulated into ConcatenatingByteBufferCollector), not during flush. This is correct resource management that makes buffers GC-eligible sooner and guarantees release on error paths via try-with-resources, but is not a memory optimization.
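The reasoning above can be illustrated with a toy model. This is only a sketch: `PeakTracker` and `simulate` are hypothetical names, not the benchmark's `PeakTrackingAllocator`, and the model assumes every column buffer is fully allocated before the flush starts. Under that assumption the peak is fixed before any release happens, so the release order cannot change it.

```java
class PeakTrackingSketch {

  /** Hypothetical stand-in for a peak-tracking allocator: counts live and peak bytes. */
  static class PeakTracker {
    long live = 0;
    long peak = 0;

    void allocate(long bytes) {
      live += bytes;
      peak = Math.max(peak, live);
    }

    void release(long bytes) {
      live -= bytes;
    }
  }

  /** Returns the peak live bytes for one simulated row group. */
  static long simulate(int columns, long bytesPerColumn, boolean eagerRelease) {
    PeakTracker tracker = new PeakTracker();

    // Write phase: pages for every column accumulate before the flush begins.
    for (int i = 0; i < columns; i++) {
      tracker.allocate(bytesPerColumn);
    }

    // Flush phase: release per column (eager) or all at once at the end.
    for (int i = 0; i < columns; i++) {
      if (eagerRelease) {
        tracker.release(bytesPerColumn);
      }
    }
    if (!eagerRelease) {
      tracker.release(columns * bytesPerColumn);
    }
    return tracker.peak;
  }

  public static void main(String[] args) {
    System.out.println("eager peak:    " + simulate(20, 1000L, true));
    System.out.println("deferred peak: " + simulate(20, 1000L, false));
  }
}
```

Both calls report the same peak because the high-water mark is set at the end of the write phase. Eager release only changes how quickly the live byte count falls during the flush.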

Release each column's compressed page buffers immediately after writing to disk in flushToFileWriter(), rather than holding all buffers until the entire flush completes. This is correct resource management that makes buffers GC-eligible sooner, though benchmarking with a PeakTrackingAllocator confirms it does not reduce peak memory: the peak is reached during the write phase (as pages are compressed), not during flush.

Changes:
- Call pageWriter.close() after each column in flushToFileWriter()
- Add writeAllToAndRelease() to ConcatenatingByteBufferCollector for progressive slab-by-slab memory release during write
- Make close() idempotent (safe to call after eager release or double-close)
- Add RowGroupFlushBenchmark (20-column wide schema, PeakTrackingAllocator) and BlackHoleOutputFile for measuring flush performance and peak memory
- Add tests for eager release, double-close safety, and output equivalence

Fokko · 2026-05-22T21:30:20Z

+   * @throws IOException if an I/O error occurs
+   */
+  public void writeAllToAndRelease(OutputStream out) throws IOException {
+    WritableByteChannel channel = Channels.newChannel(out);


Should we wrap this channel in a try-with-resources block?

No, closing the channel would close the underlying OutputStream which is owned by the caller. The channel from Channels.newChannel(out) is just a stateless wrapper with no independent resources to release. I've added a comment explaining this in the updated code.
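The ownership point in this reply can be checked with a small standalone sketch. `ChannelOwnershipSketch`, `TrackingStream`, and both write methods are hypothetical names made up for illustration; only `Channels.newChannel(OutputStream)` and the `WritableByteChannel` interface come from the JDK. The sketch writes the same buffers twice, once leaving the channel open and once closing it with try-with-resources, and records whether the caller's stream was closed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.Arrays;
import java.util.List;

class ChannelOwnershipSketch {

  /** In-memory stream that records whether close() was called on it. */
  static class TrackingStream extends ByteArrayOutputStream {
    boolean closed = false;

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  /** Writes every buffer fully, without consuming the caller's buffer positions. */
  private static void drain(WritableByteChannel channel, List<ByteBuffer> buffers)
      throws IOException {
    for (ByteBuffer buffer : buffers) {
      ByteBuffer view = buffer.duplicate();
      while (view.hasRemaining()) {
        channel.write(view);
      }
    }
  }

  /** Leaves 'out' open: the channel wrapper is deliberately not closed. */
  static void writeAll(List<ByteBuffer> buffers, OutputStream out) throws IOException {
    WritableByteChannel channel = Channels.newChannel(out);
    drain(channel, buffers);
  }

  /** Counter-example: closing the channel also closes the caller's stream. */
  static void writeAllThenCloseChannel(List<ByteBuffer> buffers, OutputStream out)
      throws IOException {
    try (WritableByteChannel channel = Channels.newChannel(out)) {
      drain(channel, buffers);
    }
  }

  public static void main(String[] args) throws IOException {
    List<ByteBuffer> parts =
        Arrays.asList(ByteBuffer.wrap(new byte[] {1, 2}), ByteBuffer.wrap(new byte[] {3}));

    TrackingStream keptOpen = new TrackingStream();
    writeAll(parts, keptOpen);
    System.out.println("closed after writeAll: " + keptOpen.closed);

    TrackingStream shut = new TrackingStream();
    writeAllThenCloseChannel(parts, shut);
    System.out.println("closed after try-with-resources: " + shut.closed);
  }
}
```

Both variants write identical bytes. The difference is only in who ends up closing the stream, which is why a method that does not own `out` would leave the channel unclosed.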

Fokko · 2026-05-22T21:35:06Z

+   * @param out the output stream to write to
+   * @throws IOException if an I/O error occurs
+   */
+  public void writeAllToAndRelease(OutputStream out) throws IOException {


Wait, are we just using this in tests?

Good catch — it was indeed only used in tests. I've merged writeAllToAndRelease into writeAllTo so the progressive buffer release is now the production implementation (called via ParquetFileWriter.writeColumnChunk → bytes.writeAllTo(out)). The separate method is gone.

Fokko · 2026-05-22T21:37:03Z

@@ -692,6 +692,11 @@ public void flushToFileWriter(ParquetFileWriter writer) throws IOException {
    for (ColumnDescriptor path : schema.getColumns()) {
      ColumnChunkPageWriter pageWriter = writers.get(path);


Should we use the try-with-resources pattern here? (Fanboy of the pattern talking here)

Done. I am also a big fan of try-with-resources. This also ensures the column's buffers are released even if writeToFileWriter throws.
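A minimal sketch of the per-column pattern discussed here, not the actual flushToFileWriter() code: `EagerReleaseSketch`, `ColumnBuffer`, and its `writeTo` method are hypothetical stand-ins for the page writer and file writer. It shows that a column's buffer is released as soon as its write finishes, that the release still happens when the write throws, and that a second close() is harmless.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EagerReleaseSketch {

  /** Hypothetical stand-in for a per-column page writer holding buffered bytes. */
  static class ColumnBuffer implements AutoCloseable {
    private final String name;
    private final boolean failOnWrite;
    private byte[] data;

    ColumnBuffer(String name, int size, boolean failOnWrite) {
      this.name = name;
      this.failOnWrite = failOnWrite;
      this.data = new byte[size];
    }

    boolean isReleased() {
      return data == null;
    }

    void writeTo(List<String> sink) throws IOException {
      if (failOnWrite) {
        throw new IOException("simulated write failure for " + name);
      }
      sink.add(name + ":" + data.length);
    }

    /** Idempotent: dropping an already-dropped reference is a no-op. */
    @Override
    public void close() {
      data = null;
    }
  }

  /** Flushes columns in order, releasing each one right after its write. */
  static void flush(List<ColumnBuffer> columns, List<String> sink) throws IOException {
    for (ColumnBuffer column : columns) {
      try (ColumnBuffer current = column) {
        current.writeTo(sink);
      }
    }
  }

  public static void main(String[] args) {
    ColumnBuffer a = new ColumnBuffer("a", 4, false);
    ColumnBuffer b = new ColumnBuffer("b", 8, true);
    ColumnBuffer c = new ColumnBuffer("c", 2, false);
    List<String> sink = new ArrayList<>();
    try {
      flush(Arrays.asList(a, b, c), sink);
    } catch (IOException e) {
      System.out.println("flush failed: " + e.getMessage());
    }
    System.out.println("a released: " + a.isReleased());
    System.out.println("b released: " + b.isReleased());
    System.out.println("c released: " + c.isReleased());
  }
}
```

In this sketch the column that fails is still released, while columns the loop never reached keep their buffers until their owner closes them separately.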

…n writeAllTo

- Merge writeAllToAndRelease into writeAllTo so progressive buffer release is used in production (via ParquetFileWriter.writeColumnChunk).
- Add comment explaining why the WritableByteChannel is intentionally not closed (closing it would close the caller's OutputStream).
- Use try-with-resources in flushToFileWriter for idiomatic cleanup.
- Capture buf.size() before write in writeToFileWriter since writeAllTo now releases buffers progressively.

Fokko · 2026-06-02T20:16:19Z

        LOG.debug(String.format(
                "written %,dB for %s: %,d values, %,dB raw, %,dB comp, %d pages, encodings: %s",
-                buf.size(),
+                bytesWritten,


Why this change? We only need this value when debug logging is enabled.

buf.size() is captured before the write because writeColumnChunk internally calls buf.writeAllTo(out) which progressively releases each slab and resets size to 0 (see ConcatenatingByteBufferCollector.writeAllTo, line 103). If we called buf.size() inside the debug block after the write, it would always report 0.
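The ordering issue can be sketched with a toy collector. This assumes a simplified model and is not the ConcatenatingByteBufferCollector code: `SlabCollector` and `writeChunk` are hypothetical names, and the sketch decrements the size per slab rather than resetting it in one step, which ends at the same value of 0.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.Deque;

class ProgressiveReleaseSketch {

  /** Hypothetical collector that drops each slab as soon as it has been written. */
  static class SlabCollector {
    private final Deque<byte[]> slabs = new ArrayDeque<>();
    private long size = 0;

    void collect(byte[] slab) {
      slabs.add(slab);
      size += slab.length;
    }

    long size() {
      return size;
    }

    void writeAllTo(OutputStream out) throws IOException {
      byte[] slab;
      while ((slab = slabs.poll()) != null) {
        out.write(slab);
        // The collector no longer references this slab, so it can be reclaimed.
        size -= slab.length;
      }
    }
  }

  /** Writes the collector and returns the byte count, read before the write drains it. */
  static long writeChunk(SlabCollector buf, OutputStream out) throws IOException {
    long bytesWritten = buf.size();
    buf.writeAllTo(out);
    return bytesWritten;
  }

  public static void main(String[] args) throws IOException {
    SlabCollector buf = new SlabCollector();
    buf.collect(new byte[3]);
    buf.collect(new byte[5]);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long bytesWritten = writeChunk(buf, out);
    System.out.println("captured before write: " + bytesWritten);
    System.out.println("size() after write:    " + buf.size());
  }
}
```

Reading the size into a local costs one field read per chunk even when debug logging is off, in exchange for a log line that stays accurate once the write drains the collector.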

iemejia · 2026-06-12T11:25:57Z

@Fokko All review comments addressed. Added baseline vs optimized benchmark comparison to the PR description — no throughput regression and identical peak memory. Ready for another look.

This was referenced May 17, 2026

GH-3522: Reduce peak memory during row group flush by eagerly releasing column buffers #3537

Closed

GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar #3512

Closed

Apache Parquet Java Performance Improvements #3530

Open


GH-3530: Eagerly release column buffers during row group flush #3571
iemejia wants to merge 2 commits into apache:master from iemejia:parquet-perf-v2-par8-rowgroup-flush
