
Write Deletion Vectors#2822

Closed
rambleraptor wants to merge 6 commits into
apache:mainfrom
rambleraptor:deletion_vector_write

Conversation

@rambleraptor

Collaborator

Part of #2261

Rationale for this change

This adds a PuffinWriter for writing deletion vectors.

Right now, it's just the writer class + some round trip tests (where we read + write the same file) to sanity check that the PuffinWriter works as expected. Writing Puffin files is very complex, so I wanted to make sure we all agreed on the writing semantics before using this elsewhere.

Let me know your thoughts on this (or if it's too granular)

Are these changes tested?

Unit tests included

Are there any user-facing changes?

Comment thread pyiceberg/table/puffin.py
Comment thread tests/table/test_puffin.py Outdated
Comment thread pyiceberg/table/puffin.py Outdated
Comment thread pyiceberg/table/puffin.py
@geruh geruh mentioned this pull request Dec 10, 2025
14 tasks
Comment thread pyiceberg/table/puffin.py Outdated
bitmaps: dict[int, BitMap] = {}
cardinality = 0
for pos in positions:
    cardinality += 1

@geruh geruh Dec 10, 2025

Member


The cardinality could be incorrect if the same position is passed in more than once. We can probably use the pyroaring stats to get this.

Collaborator Author


I think I made the proper change, let me know if you're thinking differently.
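
To illustrate the concern in this thread, here is a minimal sketch of deriving the cardinality from what the bitmaps actually hold rather than from a loop counter. It is not the code from this PR: `build_bitmaps` is a hypothetical helper, plain Python sets stand in for pyroaring `BitMap` objects, and splitting each 64-bit position into a high 32-bit key and a low 32-bit value is an assumption suggested by the `dict[int, BitMap]` type in the excerpt above.

```python
from collections import defaultdict


def build_bitmaps(positions):
    """Group 64-bit positions by their high 32 bits, deduplicating as we go.

    Hypothetical helper for illustration only; sets stand in for
    pyroaring BitMap objects.
    """
    bitmaps = defaultdict(set)
    for pos in positions:
        key = pos >> 32          # assumed: high 32 bits select the bitmap
        low = pos & 0xFFFFFFFF   # assumed: low 32 bits are stored inside it
        bitmaps[key].add(low)    # adding a duplicate is a no-op
    # Count what was actually stored, not how many inputs were seen.
    cardinality = sum(len(bitmap) for bitmap in bitmaps.values())
    return dict(bitmaps), cardinality


bitmaps, cardinality = build_bitmaps([5, 5, 7, (1 << 32) + 5])
print(cardinality)  # 3, although four positions were passed in
```

Incrementing a counter once per input would report 4 here, because position 5 appears twice; summing the sizes of the bitmaps reports the number of distinct deleted positions.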

@glesperance

glesperance commented Dec 11, 2025


Hey @rambleraptor, I was working on a DV implementation before discovering this PR. Since review is already underway, I'd rather contribute here than duplicate effort.

I've added a Spark interoperability test: glesperance/iceberg-python@c25fe312

This verifies that pyiceberg can read Spark-written DVs. Combined with your existing round-trip tests, this confirms format compatibility, i.e. if the same reader handles both, Spark can read ours too.

This may be redundant with your existing .bin fixture tests, though I believe those test the raw bitmap format rather than full Puffin DVs with the Java wrapper (length + magic + CRC). Let me know if I'm wrong on that.

Happy to PR to your fork if you think it's pertinent -or- feel free to cherry pick the commit as you see fit.
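
For readers unfamiliar with the "length + magic + CRC" wrapper mentioned above, the following is a rough sketch of that framing, not the PuffinWriter code from this PR. It assumes the layout as I understand it from the Puffin spec's `deletion-vector-v1` blob description: a 4-byte big-endian length covering magic plus vector, a 4-byte magic sequence, the serialized vector, and a 4-byte big-endian CRC-32 over magic plus vector. The function names are hypothetical, and the magic bytes should be checked against the spec before relying on them.

```python
import struct
import zlib

# Assumed magic sequence for deletion-vector blobs; verify against the Puffin spec.
DV_MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])


def frame_dv(vector: bytes) -> bytes:
    """Wrap an already-serialized bitmap in length + magic + CRC framing."""
    body = DV_MAGIC + vector
    length = struct.pack(">I", len(body))        # big-endian length of magic + vector
    checksum = struct.pack(">I", zlib.crc32(body))  # CRC-32 over magic + vector
    return length + body + checksum


def unframe_dv(blob: bytes) -> bytes:
    """Validate the framing and return the serialized bitmap bytes."""
    if len(blob) < 12:
        raise ValueError("blob too short for length, magic, and CRC")
    (length,) = struct.unpack(">I", blob[:4])
    if len(blob) != 4 + length + 4:
        raise ValueError("length field does not match blob size")
    body = blob[4:4 + length]
    if body[:4] != DV_MAGIC:
        raise ValueError("unexpected magic bytes")
    (expected_crc,) = struct.unpack(">I", blob[4 + length:])
    if zlib.crc32(body) != expected_crc:
        raise ValueError("CRC mismatch")
    return body[4:]


payload = b"\x01\x02\x03"  # placeholder bytes, not a real roaring bitmap
blob = frame_dv(payload)
assert unframe_dv(blob) == payload
```

A fixture that contains only the raw bitmap bytes would exercise neither the length check nor the CRC check, which is the gap a test against a full Spark-written Puffin file is meant to close.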

Verify pyiceberg's PuffinFile reader can parse deletion vectors written
by Spark. Uses coalesce(1) to force Spark to create DVs instead of COW.
@rambleraptor

Collaborator Author

@glesperance Thanks so much! I patched in your commit and I'll push it up along with my changes. Your name should appear in the commit log + PR. Let me know if you don't see it.

@rambleraptor

Collaborator Author

PR comments have been addressed.

@geruh it looks like your work on DeleteFileIndexes will be very useful for determining offsets + lengths on the blobs!

@rambleraptor rambleraptor requested review from ebyhr and geruh December 12, 2025 00:46
@glesperance

glesperance commented Dec 16, 2025


Really excited to see this moving forward.

@rambleraptor Thanks for the opportunity to contribute and for handling the updates. On full DV support, have you started on the delete/manifest writers for v3 and the MOR logic?

I’ve got a working PoC with some tests but it’ll certainly need more polish before it's PR-ready.
I also need to rebase it on top of #2822 (this) and #2180

@github-actions


This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 18, 2026
@github-actions


This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@moomindani


Hi @rambleraptor @glesperance — I'd like to help get deletion-vector write support across the line.

To check it's still viable, I locally applied this PR's PuffinWriter changes on top of the current main (no push here): they apply cleanly with no conflicts, and tests/table/test_puffin.py passes (7/7).

The DV-write pieces seem to have stalled on inactivity rather than merit — this PR, #2193 (roaring bitmap), and #2180 (v3 metadata write) were all auto-closed by the stale bot, and @kevinjqliu had said "let's get this in" on #2193.

@rambleraptor @glesperance are either of you still planning to drive this? I don't want to duplicate your effort. If it's open, I'd be glad to carry it forward — e.g. open a fresh PR that continues this PuffinWriter work (crediting you both), then layer the v3 delete/manifest writers + MoR on top.

(Tracking issue for context: #2261.)

@glesperance


@moomindani I largely stepped back from the initiative because progress on the pieces I was interested in seemed quite slow from my perspective.

My reading of the repository activity is that development is increasingly happening in iceberg-rust, and my assumption is that this is where a lot of future Iceberg-related work will land.

I may be missing some context, or perhaps decisions are being discussed elsewhere, but from the repository alone it's difficult for me to understand why the PRs were closed and what the current roadmap is.

The move towards v3 has been ongoing for over a year now (#1818), so I'll let others form their own view on the pace of that effort.

Personally, the current rate of progress gives me some concern, and it's caused me to evaluate other options for my long-term data lake format choice.

@moomindani


Thanks @glesperance for the candid update, and totally understand stepping back.

@rambleraptor — gentle nudge on my note above: are you still planning to drive this, or is it open for someone to carry forward? Happy to pick it up (crediting you both) if you're tied up — just don't want to step on your effort.

@moomindani


Picked this up as #3474 — it keeps your commits as-is (@rambleraptor, @glesperance) and adds unit tests for a couple of agreed review items (@ebyhr's field id, and the DV blob framing) that weren't yet covered by a test. Continuing the deletion-vector write work from there. Thanks both!


5 participants