[SPARK-57676][PYTHON] Refactor SQL_GROUPED_AGG_PANDAS_ITER_UDF#56754
Closed
Yicong-Huang wants to merge 2 commits into
Conversation
uros-b
reviewed
Jun 24, 2026
uros-b
left a comment
Member
Thank you @Yicong-Huang, left a few comments, but mostly lgtm!
HyukjinKwon
approved these changes
Jun 25, 2026
Yicong-Huang
added a commit
that referenced
this pull request
Jun 25, 2026
### What changes were proposed in this pull request?

Refactor `SQL_GROUPED_AGG_PANDAS_ITER_UDF` to use `ArrowStreamGroupSerializer` as a pure I/O layer, moving all processing logic into a dedicated function in `read_udfs()` in `worker.py`. This mirrors the Arrow path done in SPARK-56123 for `SQL_GROUPED_AGG_ARROW_ITER_UDF`. The now-unused `ArrowStreamAggPandasUDFSerializer` and `wrap_grouped_agg_pandas_iter_udf` are removed from `worker.py` (the serializer class itself is removed in a follow-up, SPARK-57680).

### Why are the changes needed?

Part of SPARK-55388. `SQL_GROUPED_AGG_PANDAS_ITER_UDF` was the last consumer of `ArrowStreamAggPandasUDFSerializer`; this migration unblocks deleting that serializer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. No behavior change.

ASV `GroupedAggPandasIterUDFTimeBench` (`-a repeat=3`, `--python=same`), one representative run per side; conclusion consistent across multiple runs.

```text
scenario        udf         before        after         diff
few_groups_sm   sum         40.6+-0.2ms   39.4+-0.1ms   -3%
few_groups_sm   mean_multi  46.3+-0.2ms   45.6+-0.3ms   -2%
few_groups_lg   sum         67.2+-0.5ms   65.6+-0.1ms   -2%
few_groups_lg   mean_multi  74.5+-0.4ms   74.6+-2ms      0%
many_groups_sm  sum         1.53+-0s      1.49+-0.01s   -3%
many_groups_sm  mean_multi  1.77+-0.01s   1.72+-0s      -3%
many_groups_lg  sum         421+-2ms      411+-2ms      -2%
many_groups_lg  mean_multi  489+-2ms      474+-1ms      -3%
wide_cols       sum         403+-0.7ms    393+-0.7ms    -2%
wide_cols       mean_multi  423+-0.2ms    414+-1ms      -2%
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #56754 from Yicong-Huang/SPARK-57676.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit d52093a)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
### What changes were proposed in this pull request?

Refactor `SQL_GROUPED_AGG_PANDAS_ITER_UDF` to use `ArrowStreamGroupSerializer` as a pure I/O layer, moving all processing logic into a dedicated function in `read_udfs()` in `worker.py`. This mirrors the Arrow path done in SPARK-56123 for `SQL_GROUPED_AGG_ARROW_ITER_UDF`. The now-unused `ArrowStreamAggPandasUDFSerializer` and `wrap_grouped_agg_pandas_iter_udf` are removed from `worker.py` (the serializer class itself is removed in a follow-up, SPARK-57680).

### Why are the changes needed?

Part of SPARK-55388. `SQL_GROUPED_AGG_PANDAS_ITER_UDF` was the last consumer of `ArrowStreamAggPandasUDFSerializer`; this migration unblocks deleting that serializer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. No behavior change.

ASV `GroupedAggPandasIterUDFTimeBench` (`-a repeat=3`, `--python=same`), one representative run per side; conclusion consistent across multiple runs.

### Was this patch authored or co-authored using generative AI tooling?
No.
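
For readers unfamiliar with the pattern, the "serializer as a pure I/O layer" split described in the PR summary can be illustrated with a minimal, library-free sketch. This is not the code in `worker.py`: the names `read_groups`, `apply_grouped_agg_iter`, and `sum_udf` are hypothetical, and plain Python lists stand in for Arrow batches and pandas Series. It only shows the shape of the idea, where the reader yields raw per-group batches and a separate function applies the iterator-style aggregate.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Assumption: a plain list of floats stands in for one Arrow batch of a column.
Batch = List[float]


def read_groups(stream: Iterable[Sequence[Batch]]) -> Iterator[Iterator[Batch]]:
    """Hypothetical I/O layer: yield each group as an iterator of raw batches.

    It performs no conversion and no UDF logic, only grouping of the input.
    """
    for group in stream:
        yield iter(group)


def apply_grouped_agg_iter(
    groups: Iterator[Iterator[Batch]],
    udf: Callable[[Iterator[Batch]], float],
) -> Iterator[float]:
    """Hypothetical processing layer: call the aggregate once per group."""
    for batches in groups:
        yield udf(batches)


def sum_udf(batches: Iterator[Batch]) -> float:
    """Example iterator-style aggregate that consumes batches lazily."""
    total = 0.0
    for batch in batches:
        total += sum(batch)
    return total


# Two groups, each split across two batches.
stream = [
    [[1.0, 2.0], [3.0]],
    [[10.0], [20.0, 30.0]],
]
results = list(apply_grouped_agg_iter(read_groups(stream), sum_udf))
```

Keeping the reader free of processing logic means the same I/O function could serve several UDF types, which is the kind of reuse the PR description points to when it says this migration unblocks deleting the dedicated serializer.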