Write Deletion Vectors #2822
bitmaps: dict[int, BitMap] = {}
cardinality = 0
for pos in positions:
    cardinality += 1
cardinality could be incorrect if the same position is passed in more than once. We can probably use the pyroaring stats to get this instead.
I think I made the proper change, let me know if you're thinking differently.
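For anyone following the thread, here is a minimal sketch of the cardinality concern raised above. It is not the code in this PR: plain Python sets stand in for pyroaring `BitMap` objects so the example has no third-party dependency, and `group_positions` and `cardinality` are hypothetical helper names. The point is only that counting inputs overcounts duplicates, while asking the bitmaps for their size does not.

```python
from collections import defaultdict


def group_positions(positions):
    """Group 64-bit row positions by their high 32 bits.

    Assumption: plain sets stand in for pyroaring BitMap objects.
    """
    bitmaps = defaultdict(set)
    for pos in positions:
        if pos < 0:
            raise ValueError(f"position must be non-negative, got {pos}")
        key = pos >> 32         # high 32 bits select the bitmap
        low = pos & 0xFFFFFFFF  # low 32 bits are stored inside it
        bitmaps[key].add(low)   # adding a duplicate is a no-op
    return dict(bitmaps)


def cardinality(bitmaps):
    # Count what the bitmaps hold, not how many inputs were seen.
    return sum(len(bm) for bm in bitmaps.values())


positions = [0, 5, 5, 2**32 + 1, 5]
bitmaps = group_positions(positions)
print(len(positions))        # number of inputs, duplicates included
print(cardinality(bitmaps))  # number of distinct positions
```

With real pyroaring bitmaps the same idea should apply, since the count comes from the container rather than from a loop counter.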
|
Hey @rambleraptor, I was working on a DV implementation before discovering this PR. Since review is already underway, I'd rather contribute here than duplicate effort. I've added a Spark interoperability test: glesperance/iceberg-python@c25fe312 This verifies that pyiceberg can read Spark-written DVs. Combined with your existing round-trip tests, this confirms format compatibility, i.e. if the same reader handles both, Spark can read ours too. This may be redundant with your existing .bin fixture tests, though I believe those test the raw bitmap format rather than full Puffin DVs with the Java wrapper (length + magic + CRC). Let me know if I'm wrong on that. Happy to open a PR against your fork if you think it's pertinent, or feel free to cherry-pick the commit as you see fit. |
Verify pyiceberg's PuffinFile reader can parse deletion vectors written by Spark. Uses coalesce(1) to force Spark to create DVs instead of COW.
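To make the "length + magic + CRC" wrapper mentioned above concrete, here is a rough sketch of that framing. It rests on my reading of the `deletion-vector-v1` blob layout in the Puffin spec (4-byte big-endian length of magic plus vector, a 4-byte magic sequence, the serialized vector, then a 4-byte big-endian CRC-32 over magic plus vector), so the byte values and layout should be checked against the spec before relying on them. `frame_dv_blob` and `unframe_dv_blob` are hypothetical names, not pyiceberg's API, and the vector is treated as opaque bytes.

```python
import struct
import zlib

# Assumption: magic bytes as I understand the spec; verify before use.
DV_MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])


def frame_dv_blob(vector: bytes) -> bytes:
    """Wrap an already-serialized bitmap in length + magic + CRC framing."""
    body = DV_MAGIC + vector
    length = struct.pack(">I", len(body))      # covers magic + vector
    crc = struct.pack(">I", zlib.crc32(body))  # checksum over magic + vector
    return length + body + crc


def unframe_dv_blob(blob: bytes) -> bytes:
    """Validate the framing and return the serialized bitmap bytes."""
    if len(blob) < 12:
        raise ValueError("blob too short to hold length, magic, and CRC")
    (length,) = struct.unpack(">I", blob[:4])
    if len(blob) != 4 + length + 4:
        raise ValueError("length field does not match blob size")
    body = blob[4:4 + length]
    if body[:4] != DV_MAGIC:
        raise ValueError("unexpected magic bytes")
    (expected_crc,) = struct.unpack(">I", blob[4 + length:])
    if zlib.crc32(body) != expected_crc:
        raise ValueError("CRC mismatch")
    return body[4:]


payload = b"\x01\x02\x03"  # placeholder for a serialized roaring bitmap
blob = frame_dv_blob(payload)
print(unframe_dv_blob(blob) == payload)
```

A test that only round-trips the raw bitmap bytes would never exercise the length, magic, or CRC checks, which is why a fixture written by Spark adds coverage.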
|
@glesperance Thanks so much! I patched in your commit and I'll push it up along with my changes. Your name should appear in the commit log + PR. Let me know if you don't see it. |
|
PR comments have been addressed. @geruh it looks like your work on DeleteFileIndexes will be very useful for determining offsets + lengths on the blobs! |
|
Really excited to see this moving forward. @rambleraptor Thanks for the opportunity to contribute and for handling the updates. On full DV support, have you started on the delete/manifest writers for v3 and the MOR logic? I’ve got a working PoC with some tests but it’ll certainly need more polish before it's PR-ready. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
Hi @rambleraptor @glesperance — I'd like to help get deletion-vector write support across the line. To check it's still viable, I locally applied this PR's The DV-write pieces seem to have stalled on inactivity rather than merit — this PR, #2193 (roaring bitmap), and #2180 (v3 metadata write) were all auto-closed by the stale bot, and @kevinjqliu had said "let's get this in" on #2193. @rambleraptor @glesperance are either of you still planning to drive this? I don't want to duplicate your effort. If it's open, I'd be glad to carry it forward — e.g. open a fresh PR that continues this (Tracking issue for context: #2261.) |
|
@moomindani I largely stepped back from the initiative because progress on the pieces I was interested in seemed quite slow from my perspective. My reading of the repository activity is that development is increasingly happening in iceberg-rust, and my assumption is that this is where a lot of future Iceberg-related work will land. I may be missing some context, or perhaps decisions are being discussed elsewhere, but from the repository alone it's difficult for me to understand why the PRs were closed and what the current roadmap is. The move towards v3 has been ongoing for over a year now (#1818), so I'll let others form their own view on the pace of that effort. Personally, the current rate of progress gives me some concern, and it's caused me to evaluate other options for my long-term data lake format choice. |
|
Thanks @glesperance for the candid update, and totally understand stepping back. @rambleraptor — gentle nudge on my note above: are you still planning to drive this, or is it open for someone to carry forward? Happy to pick it up (crediting you both) if you're tied up — just don't want to step on your effort. |
|
Picked this up as #3474 — it keeps your commits as-is (@rambleraptor, @glesperance) and adds unit tests for a couple of agreed review items (@ebyhr's field id, and the DV blob framing) that weren't yet covered by a test. Continuing the deletion-vector write work from there. Thanks both! |
Part of #2261
Rationale for this change
This adds a PuffinWriter for writing deletion vectors.
Right now, it's just the writer class + some round trip tests (where we read + write the same file) to sanity check that the PuffinWriter works as expected. Writing Puffin files is very complex, so I wanted to make sure we all agreed on the writing semantics before using this elsewhere.
Let me know your thoughts on this (or if it's too granular)
Are these changes tested?
Unit tests included
Are there any user-facing changes?