[SPARK-57676][PYTHON] Refactor SQL_GROUPED_AGG_PANDAS_ITER_UDF #56754

Closed
Yicong-Huang wants to merge 2 commits into apache:master from Yicong-Huang:SPARK-57676

@Yicong-Huang (Contributor) commented Jun 24, 2026

What changes were proposed in this pull request?

Refactor `SQL_GROUPED_AGG_PANDAS_ITER_UDF` to use `ArrowStreamGroupSerializer` as a pure I/O layer, moving all processing logic into a dedicated function in `read_udfs()` in `worker.py`. This mirrors the change made in SPARK-56123 for the Arrow path, `SQL_GROUPED_AGG_ARROW_ITER_UDF`. The now-unused references to `ArrowStreamAggPandasUDFSerializer` and the `wrap_grouped_agg_pandas_iter_udf` function are removed from `worker.py`; the serializer class itself is removed in a follow-up, SPARK-57680.
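For readers unfamiliar with the pattern, the split between an I/O-only serializer and a separate processing function can be sketched roughly as below. This is a stdlib-only illustration, not the code in `worker.py`: `read_groups`, `grouped_agg_iter`, `sum_udf`, and the `Batch` alias are hypothetical names, and plain lists of floats stand in for Arrow batches and pandas Series.

```python
from typing import Callable, Iterable, Iterator, List

# Stand-in for one column of an Arrow record batch (hypothetical, stdlib only).
Batch = List[float]


def read_groups(stream: Iterable[List[Batch]]) -> Iterator[Iterator[Batch]]:
    """I/O layer: yield one iterator of raw batches per group, with no UDF logic."""
    for group in stream:
        yield iter(group)


def grouped_agg_iter(
    groups: Iterable[Iterator[Batch]],
    udf: Callable[[Iterator[Batch]], float],
) -> Iterator[float]:
    """Processing layer: call the UDF once per group and emit one scalar per group."""
    for batches in groups:
        result = udf(batches)
        # Assumption: a sequential stream needs each group fully consumed
        # before the next one is read, so drain whatever the UDF left behind.
        for _ in batches:
            pass
        yield result


def sum_udf(batches: Iterator[Batch]) -> float:
    """Example iterator-style aggregate: sum every value across a group's batches."""
    return sum(sum(batch) for batch in batches)


# Two groups: the first arrives as two batches, the second as one.
stream = [[[1.0, 2.0], [3.0]], [[10.0]]]
results = list(grouped_agg_iter(read_groups(stream), sum_udf))
print(results)
```

In this sketch the reader knows nothing about the UDF and the processing function knows nothing about the wire format, which is the kind of separation the description refers to.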

Why are the changes needed?

Part of SPARK-55388. SQL_GROUPED_AGG_PANDAS_ITER_UDF was the last consumer of ArrowStreamAggPandasUDFSerializer; this migration unblocks deleting that serializer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV GroupedAggPandasIterUDFTimeBench (-a repeat=3, --python=same), one representative run per side; conclusion consistent across multiple runs.

scenario         udf           before        after         diff
few_groups_sm    sum           40.6+-0.2ms   39.4+-0.1ms   -3%
few_groups_sm    mean_multi    46.3+-0.2ms   45.6+-0.3ms   -2%
few_groups_lg    sum           67.2+-0.5ms   65.6+-0.1ms   -2%
few_groups_lg    mean_multi    74.5+-0.4ms   74.6+-2ms      0%
many_groups_sm   sum           1.53+-0s      1.49+-0.01s   -3%
many_groups_sm   mean_multi    1.77+-0.01s   1.72+-0s      -3%
many_groups_lg   sum           421+-2ms      411+-2ms      -2%
many_groups_lg   mean_multi    489+-2ms      474+-1ms      -3%
wide_cols        sum           403+-0.7ms    393+-0.7ms    -2%
wide_cols        mean_multi    423+-0.2ms    414+-1ms      -2%
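
As a sanity check on the `diff` column, the percentages can be recomputed from the `before` and `after` central values. The snippet below is a small sketch that assumes `diff` is the relative change of the central value rounded to a whole percent; the `rows` list and `relative_diff_percent` are illustrative names, and the error margins are ignored.

```python
# (scenario, udf, before, after): central values copied from the table,
# each pair kept in the unit the table shows (ms or s).
rows = [
    ("few_groups_sm", "sum", 40.6, 39.4),
    ("few_groups_sm", "mean_multi", 46.3, 45.6),
    ("few_groups_lg", "sum", 67.2, 65.6),
    ("few_groups_lg", "mean_multi", 74.5, 74.6),
    ("many_groups_sm", "sum", 1.53, 1.49),
    ("many_groups_sm", "mean_multi", 1.77, 1.72),
    ("many_groups_lg", "sum", 421, 411),
    ("many_groups_lg", "mean_multi", 489, 474),
    ("wide_cols", "sum", 403, 393),
    ("wide_cols", "mean_multi", 423, 414),
]


def relative_diff_percent(before: float, after: float) -> int:
    """Relative change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


diffs = [relative_diff_percent(before, after) for _, _, before, after in rows]
for (scenario, udf, _, _), diff in zip(rows, diffs):
    print(f"{scenario:<16} {udf:<12} {diff:+d}%")
```

Under that assumption the recomputed values agree with the reported column for all ten rows.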

Was this patch authored or co-authored using generative AI tooling?

No.

Comment thread python/pyspark/worker.py
Comment thread python/pyspark/worker.py
Comment thread python/pyspark/worker.py Outdated

@uros-b (Member) left a comment

Thank you @Yicong-Huang, left a few comments, but mostly lgtm!

Yicong-Huang added a commit that referenced this pull request Jun 25, 2026

Closes #56754 from Yicong-Huang/SPARK-57676.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit d52093a)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>