[SPARK-57676][PYTHON] Refactor SQL_GROUPED_AGG_PANDAS_ITER_UDF by Yicong-Huang · Pull Request #56754 · apache/spark

Yicong-Huang · 2026-06-24T21:41:53Z

What changes were proposed in this pull request?

Refactor SQL_GROUPED_AGG_PANDAS_ITER_UDF to use ArrowStreamGroupSerializer as a pure I/O layer, moving all processing logic into a dedicated function in read_udfs() in worker.py. This mirrors the refactoring done for the Arrow path in SPARK-56123 for SQL_GROUPED_AGG_ARROW_ITER_UDF. The now-unused references to ArrowStreamAggPandasUDFSerializer and the wrap_grouped_agg_pandas_iter_udf wrapper are removed from worker.py; the serializer class itself is removed in a follow-up, SPARK-57680.
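
For readers unfamiliar with the pattern, here is a minimal sketch of the "serializer as pure I/O, processing in a dedicated function" split. It is not the code in worker.py: the names read_groups, apply_grouped_agg_iter, and total are hypothetical, and plain Python lists stand in for the Arrow batches and pandas Series that the real path handles.

```python
from typing import Callable, Iterable, Iterator, List

# Stand-in for one batch of a single column (assumption: real code uses
# Arrow record batches converted to pandas Series).
Batch = List[float]


def read_groups(stream: Iterable[List[Batch]]) -> Iterator[Iterator[Batch]]:
    """I/O layer: yield one iterator of batches per group, with no UDF logic."""
    for group in stream:
        yield iter(group)


def apply_grouped_agg_iter(
    groups: Iterator[Iterator[Batch]],
    udf: Callable[[Iterator[Batch]], float],
) -> Iterator[float]:
    """Processing layer: call the iterator-style aggregate once per group."""
    for batches in groups:
        yield udf(batches)


def total(batches: Iterator[Batch]) -> float:
    """Example aggregate: sum a column that arrives in several batches."""
    acc = 0.0
    for batch in batches:
        acc += sum(batch)
    return acc


# Two groups, each split across two batches.
stream = [[[1.0, 2.0], [3.0]], [[10.0], [20.0, 30.0]]]
results = list(apply_grouped_agg_iter(read_groups(stream), total))
print(results)
```

Keeping the reader free of UDF-specific logic is what lets one serializer serve several eval types, which is the stated motivation for retiring the dedicated serializer.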

Why are the changes needed?

Part of SPARK-55388. SQL_GROUPED_AGG_PANDAS_ITER_UDF was the last consumer of ArrowStreamAggPandasUDFSerializer; this migration unblocks deleting that serializer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV GroupedAggPandasIterUDFTimeBench (-a repeat=3, --python=same), one representative run per side; conclusion consistent across multiple runs.

scenario         udf           before        after         diff
few_groups_sm    sum           40.6+-0.2ms   39.4+-0.1ms   -3%
few_groups_sm    mean_multi    46.3+-0.2ms   45.6+-0.3ms   -2%
few_groups_lg    sum           67.2+-0.5ms   65.6+-0.1ms   -2%
few_groups_lg    mean_multi    74.5+-0.4ms   74.6+-2ms      0%
many_groups_sm   sum           1.53+-0s      1.49+-0.01s   -3%
many_groups_sm   mean_multi    1.77+-0.01s   1.72+-0s      -3%
many_groups_lg   sum           421+-2ms      411+-2ms      -2%
many_groups_lg   mean_multi    489+-2ms      474+-1ms      -3%
wide_cols        sum           403+-0.7ms    393+-0.7ms    -2%
wide_cols        mean_multi    423+-0.2ms    414+-1ms      -2%
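
The diff column is consistent with a relative change of the central timings, rounded to the nearest percent. The sketch below checks a few rows under that assumption; relative_diff_percent is a hypothetical helper, not part of ASV.

```python
def relative_diff_percent(before: float, after: float) -> int:
    """Relative change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (before, after, reported diff) taken from the table above; units cancel out.
rows = {
    "few_groups_sm/sum": (40.6, 39.4, -3),
    "few_groups_lg/mean_multi": (74.5, 74.6, 0),
    "many_groups_sm/sum": (1.53, 1.49, -3),
    "wide_cols/mean_multi": (423.0, 414.0, -2),
}

computed = {
    name: relative_diff_percent(before, after)
    for name, (before, after, _reported) in rows.items()
}
print(computed)
```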

Was this patch authored or co-authored using generative AI tooling?

No.

uros-b

Thank you @Yicong-Huang, left a few comments, but mostly lgtm!

Closes #56754 from Yicong-Huang/SPARK-57676.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit d52093a)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>

refactor: migrate SQL_GROUPED_AGG_PANDAS_ITER_UDF to ArrowStreamGroupSerializer

0e7d3e0

uros-b reviewed Jun 24, 2026

Comment thread python/pyspark/worker.py

uros-b reviewed Jun 24, 2026

Comment thread python/pyspark/worker.py

uros-b reviewed Jun 24, 2026

Comment thread python/pyspark/worker.py Outdated

uros-b reviewed Jun 24, 2026

HyukjinKwon approved these changes Jun 25, 2026

refactor: address review nits for grouped agg pandas iter migration

33f57fe

Yicong-Huang closed this in d52093a Jun 25, 2026

Yicong-Huang wants to merge 2 commits into apache:master from Yicong-Huang:SPARK-57676
