Add table statistics #1285

Merged
Fokko merged 6 commits into apache:main from ndrluis:add-statistics on Jan 16, 2025

Conversation

@ndrluis ndrluis commented Nov 4, 2024

Collaborator

The Java expire-snapshots process also expires table statistics and partition statistics. I am implementing table statistics so that our expire-snapshots implementation is compatible with the Java implementation.
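
To make the goal concrete, here is a minimal sketch, not the code in this PR, of what dropping statistics together with expired snapshots could look like. `StatisticsEntry` and `retain_statistics` are hypothetical names used only for illustration, and the paths are made up; the real metadata entries carry more fields.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class StatisticsEntry:
    # Hypothetical stand-in for one statistics-file entry in table metadata.
    snapshot_id: int
    statistics_path: str


def retain_statistics(
    entries: List[StatisticsEntry], expired_snapshot_ids: Set[int]
) -> List[StatisticsEntry]:
    """Keep only the entries whose snapshot is not being expired."""
    return [e for e in entries if e.snapshot_id not in expired_snapshot_ids]


entries = [
    StatisticsEntry(snapshot_id=1, statistics_path="s3://bucket/stats-1.puffin"),
    StatisticsEntry(snapshot_id=2, statistics_path="s3://bucket/stats-2.puffin"),
]

# Expiring snapshot 1 should leave only the statistics for snapshot 2.
remaining = retain_statistics(entries, expired_snapshot_ids={1})
```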

@ndrluis ndrluis changed the title Add table statistics update Add table statistics Nov 4, 2024

ndrluis commented Nov 4, 2024

Collaborator Author

I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation.

@Fokko @sungwy @kevinjqliu

@ndrluis ndrluis changed the title Add table statistics WIP: Add table statistics Nov 4, 2024

@kevinjqliu kevinjqliu left a comment

Contributor

Thanks for the PR! I added a few comments. I think it would also be helpful to include integration tests.

Comment thread pyiceberg/table/metadata.py
Comment thread pyiceberg/table/statistics.py
Comment thread pyiceberg/table/__init__.py Outdated
Comment thread pyiceberg/table/update/__init__.py Outdated
Comment thread tests/conftest.py
@ndrluis ndrluis force-pushed the add-statistics branch 2 times, most recently from 9b15c86 to d16ef47 Compare November 10, 2024 23:30
@ndrluis ndrluis requested a review from kevinjqliu November 10, 2024 23:42
@ndrluis ndrluis changed the title WIP: Add table statistics Add table statistics Nov 10, 2024
@ndrluis ndrluis marked this pull request as ready for review November 10, 2024 23:43

ndrluis commented Nov 10, 2024

Collaborator Author

@kevinjqliu could you please review it once more?

@kevinjqliu kevinjqliu left a comment

Contributor

Added a few comments.

Do you know which engines can currently generate Puffin files? It would be great to add an integration test with a Spark-generated Puffin file.

Comment thread mkdocs/docs/api.md
Comment thread mkdocs/docs/api.md Outdated
Comment thread pyiceberg/table/statistics.py Outdated

ndrluis commented Nov 12, 2024

Collaborator Author

Do you know which engines can currently generate Puffin files? It would be great to add an integration test with a Spark-generated Puffin file.

@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR. If PyIceberg could generate or read Puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test Puffin files when reading them, as testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update.

What do you think?

@kevinjqliu kevinjqliu left a comment

Contributor

LGTM! Thanks for working on this!

Regarding the integration tests: since we're manipulating table metadata to add and remove table stats, it would be great to verify that another engine can interact with these stats. This is not a hard blocker.

@ndrluis ndrluis mentioned this pull request Nov 24, 2024
@kevinjqliu

Contributor

@ndrluis do you mind resolving the merge conflict here?

ndrluis commented Jan 8, 2025

Collaborator Author

@kevinjqliu Done!

@Fokko Fokko left a comment

Contributor

Hey @ndrluis, thanks for working on this, and sorry for leaving this hanging for so long. I have some small comments, but it looks good to me 👍

Comment thread dev/provision.py Outdated
Comment thread mkdocs/docs/api.md Outdated
Comment thread mkdocs/docs/api.md Outdated
Comment thread pyiceberg/table/statistics.py Outdated
Comment on lines +494 to +495
    if update.snapshot_id != update.statistics.snapshot_id:
        raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")

Contributor

It's a bit of an awkward check, but something that we have to live with I guess.
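
For context, here is a minimal sketch of why the check is there, assuming the update carries its own snapshot id next to a statistics entry that carries one as well, so the two can disagree. The class and function names below are hypothetical stand-ins, not PyIceberg's actual classes, and the path is made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StatisticsFile:
    # Hypothetical, simplified statistics entry.
    snapshot_id: int
    statistics_path: str


@dataclass(frozen=True)
class SetStatisticsUpdate:
    # Hypothetical update: the snapshot id appears here and inside `statistics`.
    snapshot_id: int
    statistics: StatisticsFile


def apply_set_statistics(
    current: Dict[int, StatisticsFile], update: SetStatisticsUpdate
) -> Dict[int, StatisticsFile]:
    """Return a copy of `current` with the statistics for one snapshot set or replaced."""
    if update.snapshot_id != update.statistics.snapshot_id:
        raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")
    updated = dict(current)  # leave the caller's mapping untouched
    updated[update.snapshot_id] = update.statistics
    return updated


stats = StatisticsFile(snapshot_id=7, statistics_path="s3://bucket/stats-7.puffin")
result = apply_set_statistics({}, SetStatisticsUpdate(snapshot_id=7, statistics=stats))

# A mismatched pair is rejected instead of being stored under the wrong snapshot.
try:
    apply_set_statistics({}, SetStatisticsUpdate(snapshot_id=8, statistics=stats))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Because the id is stored twice, the redundancy has to be validated somewhere, which is the awkwardness noted above.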

@ndrluis ndrluis force-pushed the add-statistics branch 3 times, most recently from 2034836 to ba64764 Compare January 15, 2025 21:19
@ndrluis ndrluis force-pushed the add-statistics branch 2 times, most recently from 8fe9992 to 217bb95 Compare January 15, 2025 21:43
ndrluis and others added 5 commits January 16, 2025 10:13
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>

@Fokko Fokko left a comment

Contributor

Looking good, thanks @ndrluis 🙌

@Fokko Fokko merged commit 0a3a886 into apache:main Jan 16, 2025
@ndrluis ndrluis deleted the add-statistics branch January 16, 2025 14:14