Skip to content

Consolidate MMM data input into unified xr.Dataset representation#2596

Open
williambdean wants to merge 2 commits into
mainfrom
feat/dataset-input-support
Open

Consolidate MMM data input into unified xr.Dataset representation#2596
williambdean wants to merge 2 commits into
mainfrom
feat/dataset-input-support

Conversation

@williambdean
Copy link
Copy Markdown
Contributor

@williambdean williambdean commented May 18, 2026

Summary

Consolidate all MMM data input paths into a single to_mmm_dataset entry point, using xr.Dataset as the canonical internal representation. Removes self.X/self.y from MMM, eliminates ~700 lines of scattered type-dispatching code, and opens the door to different granularities of data sources.

Context

This refactoring standardizes on xr.Dataset as the single internal data format. The _channel, _target, and _control variables now have consistent dims and coordinates, regardless of whether input comes from pandas, xarray, or numpy. This means:

  • Different data granularities: Daily, weekly, or custom-frequency data all normalize to the same internal variable structure
  • Panel/multi-market data: The dims parameter threads through naturally — adding dims=("country",) or dims=("region", "product") just adds extra coordinates to the Dataset
  • Flexible input types: build_model and fit now accept pd.DataFrame, xr.Dataset, or xr.DataArray — users can pass data directly from xarray-based pipelines
  • Cleaner extension path: Future features (custom effects, control transformations, time-varying parameters) inherit the standardized data without duplicating conversion logic

Changes

pymc_marketing/mmm/_data_conversion.py (new, 423 lines)

  • to_mmm_dataset(X, y, ...) — single singledispatch entry point that normalizes pd.DataFrame, xr.Dataset, and xr.DataArray inputs into a canonical xr.Dataset
  • Handles multi-dimensional (panel) data via dims parameter
  • MultiIndex + sort_index + to_xarray alignment for proper coordinate ordering

pymc_marketing/mmm/mmm.py (-364 net)

  • Removed: _normalize_target, _create_xarray_from_pandas + 5 sub-methods, duplicate create_fit_data logic
  • _generate_and_preprocess_model_data — single to_mmm_dataset call, no self.X/self.y
  • _posterior_predictive_data_transformation — uses to_mmm_dataset
  • _apply_budget_distribution_pattern / _apply_carryover_effect — operate on xr.Dataset directly
  • build_from_idata — sets self.idata = idata so sample_posterior_predictive works after load
  • build_model / fit — accept pd.DataFrame, xr.Dataset, or xr.DataArray
  • create_fit_data — handles embedded target in xr.Dataset, cleans up underscore columns

pymc_marketing/mmm/utils.py (-102 net)

  • create_zero_dataset — returns xr.Dataset with _channel/_control variables
  • add_noise_to_channel_allocation — supports both pd.DataFrame and xr.Dataset

pymc_marketing/model_builder.py (+1 net)

  • RegressionModelBuilder.build_from_idata — sets self.idata = idata

tests/mmm/test_utils.py — Updated fake model classes and assertions for xr.Dataset return type

Testing

  • All 790+ tests in tests/mmm/ pass (230 in test_mmm.py, 151 in budget/optimizer/cost/utils, 774 in remaining suites)
  • Pre-commit (ruff, mypy, format, API docs) all green
  • Net: +779 / -744 lines (+35 net)

📚 Documentation preview 📚: https://pymc-marketing--2596.org.readthedocs.build/en/2596/

@github-actions github-actions Bot added MMM tests ModelBuilder Related to the ModelBuilder class and its children labels May 18, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 70.40359% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.10%. Comparing base (a2d8c1d) to head (2983b50).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pymc_marketing/mmm/mmm.py 51.72% 28 Missing ⚠️
pymc_marketing/mmm/_data_conversion.py 77.19% 26 Missing ⚠️
pymc_marketing/mmm/utils.py 76.00% 12 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (a2d8c1d) and HEAD (2983b50). Click for more details.

HEAD has 20 uploads less than BASE
Flag BASE (a2d8c1d) HEAD (2983b50)
28 8
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2596       +/-   ##
===========================================
- Coverage   93.96%   73.10%   -20.87%     
===========================================
  Files          95       96        +1     
  Lines       14371    14432       +61     
===========================================
- Hits        13504    10550     -2954     
- Misses        867     3882     +3015     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@juanitorduz juanitorduz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great initiative! Just a simple comment as the test are failing :)

Comment on lines +77 to +79
# ---------------------------------------------------------------------------
# Public API — single entry point
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove these comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

MMM ModelBuilder Related to the ModelBuilder class and its children tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants