Add ExperimentDesigner: posterior-aware experiment design for lift tests#2356
Add ExperimentDesigner: posterior-aware experiment design for lift tests#2356drbenvincent wants to merge 25 commits into
Conversation
Implements a posterior-aware experiment designer that recommends which marketing experiment to run based on a fitted MMM's uncertainty about channel response functions. Computes adstock-aware lift predictions, Bayesian assurance (posterior-predictive power), and weighted composite scores across candidate experiments. Includes ExperimentDesigner class with recommend() and 5 plotting methods, ExperimentRecommendation dataclass, numpy response functions, fixture generator, and 65 tests. Closes #2355 Made-with: Cursor
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2356 +/- ##
==========================================
+ Coverage 92.99% 93.05% +0.06%
==========================================
Files 82 86 +4
Lines 13256 13884 +628
==========================================
+ Hits 12327 12920 +593
- Misses 929 964 +35 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Add end-to-end walkthrough notebook (docs/source/notebooks/mmm/) - Add gallery entry under "Experiment Design" section - Ship pre-built InferenceData fixture (simulated_3channel.nc) - Add slow simulation-based assurance calibration tests - Add tests for scoring weight redistribution and channel ranking - Register 'slow' pytest marker in pyproject.toml Made-with: Cursor
Made-with: Cursor
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
…ansformer Eliminate experiment_design/functions.py by delegating to pymc_marketing.mmm.transformers.logistic_saturation via a new _eval_saturation() static method that calls .eval() on the PyTensor result. Made-with: Cursor
…c-labs/pymc-marketing into fix/2355-experiment-designer
|
Some review comments received from @carlosagostini, but restated in my own words Point 1: The posterior-as-design-prior is epistemically fragileThe reviewer likes the concept of Bayesian assurance but raises a fundamental concern about where the design prior comes from. In the classical assurance literature (O'Hagan et al., 2005), the design prior is independently elicited — it represents genuine uncertainty about the effect. In your implementation, the "design prior" is the MMM's posterior, which may suffer from the very identifiability problems that motivate running experiments in the first place. The dangerous case: if a channel appears effective due to confounding (e.g., TV spend correlates with seasonal demand), the posterior will be confidently wrong — concentrated around a large positive effect. Assurance will then be confidently high, and the tool will enthusiastically recommend an experiment designed around a misleading belief. The tool works well when the model is already well-identified, which is precisely when you need it least. When identification is weakest (the case where experiments matter most), the tool's recommendations are least trustworthy. Point 2: Channels the model believes are null will never be recommended for testingIf a channel's posterior is concentrated near zero (the model thinks it does nothing), assurance is mathematically bounded by P(effect > 0 | posterior). No matter how long or large you make the experiment, assurance stays low because the posterior says the effect is probably zero. So the tool will never recommend testing these channels. But these are exactly the channels you might most want to test — to confirm the model's belief and potentially stop wasting budget on them. The tool has a structural blind spot for "validating the null." If a channel truly is ineffective, that's enormously valuable to know with certainty, but the scoring system will always deprioritise it. Point 3: The IID residuals assumption likely leads to systematically overestimated assuranceThe noise model uses Point 4: The weighted scoring system feels arbitrary and gameableThe 5-dimensional scoring system (uncertainty, correlation, gradient, assurance, cost efficiency) with configurable weights adds a layer of complexity that most users won't be able to navigate. The reviewer's concerns are:
|
The upstream `logistic_saturation` transformer now requires xtensor dims via `as_xtensor()`, which rejects raw numpy arrays. Replace the PyTensor round-trip with the equivalent numpy formula so `_eval_saturation` works with plain arrays. Made-with: Cursor
|
All good points. Basically these are by-products of this being a very early MVP. Goal is to determine level of external interest in order to evaluate it this is something worth sinking extra developer time on. Mulling over whether to just deal with some of the bigger problems at this point. Fair enough coming up with an MVP, but it does need to be actually useful. |
…s, simplified scoring, identification warnings - Fix systematic overestimation of assurance by adding AR(1) autocorrelation correction to the measurement noise model (sigma_d) - Report null-confirmation candidates: channels the model believes have near-zero effect are flagged so they aren't silently ignored - Simplify default scoring to two transparent dimensions (assurance + cost efficiency); advanced dimensions remain available via score_weights - Add identification warning in rationale when spend correlation exceeds 0.7 - Add prominent docstring note that assurance is conditional on model identification quality - 12 new tests (77 total), notebook updated Made-with: Cursor
- Remove stale 5-dimension score_weights, use new 2-dimension defaults - Add AR(1) correction visibility cell after fixture creation - Add "Identification Safeguards" section with pathological fixture demonstrating null-confirmation candidates and correlation warnings - Update recommendation table description to mention null candidates Made-with: Cursor
|
Response to @carlosagostini's review feedback Thank you for the thorough and insightful review. All four points are well-taken. We've addressed each one in this PR, balancing pragmatic improvements now with a clear path to deeper solutions as the feature matures. Point 1: Posterior-as-design-prior is epistemically fragileWhat we did:
Deferred to v2: Prior sensitivity diagnostic (assurance under a skeptical alternative prior), which would directly quantify how much the recommendations change under different beliefs. Fisher Information-based scoring is also on the roadmap. Point 2: Channels the model believes are null will never be recommendedWhat we did:
Deferred to v2: A proper "confirmation value" scoring dimension and a two-mode recommendation system (discover vs. confirm). Point 3: IID residuals assumption overestimates assuranceWhat we did:
To directly answer the reviewer's question: the IID assumption was a deliberate simplifying assumption in the initial MVP, not an oversight. We've now replaced it with the AR(1) correction as the default. Deferred to v2: Simulation-based sigma estimation (placebo-in-time), geo-specific sigma from synthetic control fit quality. Point 4: Scoring system feels arbitrary and gameableWhat we did:
Deferred to v2: Fisher Information-based information gain as a single principled objective; Pareto frontier of information gain vs. cost; potentially a single-parameter tradeoff control. All changes include unit tests and are demonstrated end-to-end in the updated notebook. Happy to discuss any of these further. |
Improve patch coverage from ~82% to ~87% by testing: - from_idata without residual_autocorr / spend_correlation - normalize=False branches (steady-state, adstock ramp) - recommend() with defaults and decrease direction - single-channel correlation info - ExperimentRecommendations equality protocol - fixture default parameters - plotting: existing axes, spend_levels, single-row/col layouts, no-correlation diagnostics, decrease direction markers - null-confirmation HTML with non-empty recommendations Made-with: Cursor
|
Failing test notebooks nothing to do with me. This was introduced by PR #2361 (merged to main) |
Add TestInitFromMockMMM class that exercises the __init__ path using lightweight stub classes (LogisticSaturation, GeometricAdstock) and a minimal mock MMM object, covering lines 106-217 in designer.py. Also add TestFindChannelDim for the static helper, fixture edge-case tests (_simulate_spend default rng, short-series autocorrelation), and edge-case tests for predict length mismatch and spend correlation failure. designer.py now at 100% coverage. Made-with: Cursor
… fix construction Address 7 of 11 issues from code review: - Extract _channel_metrics(), _ramp_fractions_matrix(), _evaluate_candidates() to eliminate duplicated computation across scoring and plotting - Simplify scoring to only assurance + cost_efficiency (remove unused dimensions) - Add _init_common() so __init__ and from_idata share attribute setup - Replace silent exception swallowing with UserWarning - Remove dead code: unused scalers, dead _SUPPORTED_SATURATION entry, unused recommendations param on plot_adstock_ramp - Add 14 golden regression tests pinning public API numerical output All 128 tests pass with zero numerical change (verified by golden tests). Made-with: Cursor
ExperimentDesigner Refactoring — Summary of ChangesThis commit addresses the main structural issues identified during code review. All 14 golden regression tests pass, confirming zero numerical change. Changes Made1. Eliminated computation duplication
2. Simplified scoring systemRemoved the three unused scoring dimensions ( The weight redistribution logic (for redistributing correlation weight when spend correlation is unavailable) has been removed. Unknown dimension keys in 3. Fixed
|
| Category | Result |
|---|---|
Golden tests (test_golden.py, 14 tests) |
All pass — zero numerical change |
| Public-API tests | All pass unchanged |
| Private-method tests | 3 scaler tests removed, 2 exception tests updated to verify warnings |
| Total | 128 passed, 2 skipped (slow) |
Files changed: designer.py (+333 −313), test_designer.py (+40 −40), test_golden.py (new, 241 lines).
…iency only) Remove four references to the dropped scoring dimensions (uncertainty, correlation, gradient) that no longer exist after the scoring simplification. Made-with: Cursor
| mmm = MMM(...) | ||
| mmm.fit(X, y) | ||
|
|
||
| designer = ExperimentDesigner(mmm) |
There was a problem hiding this comment.
Is it possible to just take actuals and posterior predictives? Or just the residuals ?
There was a problem hiding this comment.
Addressed in b36929c. We kept the y - mmm.predict(mmm.X) approach and added a docstring explaining why: predict() is the public API for point predictions, it avoids duplicating training data the model already holds, and we only need a point estimate of residual noise — not the full posterior predictive distribution. Falls back to residual_std=1.0 with a warning if predict() fails.
|
|
||
| channel_columns = list(mmm.channel_columns) | ||
| for channel in channel_columns: | ||
| sel = {channel_dim: channel} if channel_dim else {} | ||
| posterior_samples[channel] = { | ||
| "lam": posterior[sat_var_map["lam"]] | ||
| .sel(**sel) | ||
| .values.astype(np.float64), | ||
| "beta": posterior[sat_var_map["beta"]] | ||
| .sel(**sel) | ||
| .values.astype(np.float64), | ||
| "alpha": posterior[ads_var_map["alpha"]] | ||
| .sel(**sel) | ||
| .values.astype(np.float64), | ||
| } |
There was a problem hiding this comment.
What is actually needed from the media transformation? Maybe there is a way to leverage
the graph directly.
There was a problem hiding this comment.
Addressed in b36929c. The entire parameter-extraction approach is gone. We now call extract_response_distribution to trace the model's channel_contribution graph, substitute posterior samples, and compile it into a reusable PyTensor function via _build_eval_fn_from_model. All evaluation (lift prediction, ramp, diagnostics, plotting) goes through this compiled graph — no manual access to lam, beta, or alpha needed.
| @staticmethod | ||
| def _eval_saturation( | ||
| x: np.ndarray | float, | ||
| lam: np.ndarray | float, | ||
| beta: np.ndarray | float, | ||
| ) -> np.ndarray: | ||
| """Evaluate the logistic saturation response in pure numpy. | ||
|
|
||
| Uses the same formula as | ||
| :func:`pymc_marketing.mmm.transformers.logistic_saturation` | ||
| but avoids a PyTensor round-trip so callers can pass raw arrays. | ||
| """ | ||
| x = np.asarray(x) | ||
| lam = np.asarray(lam) | ||
| sat = (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x)) | ||
| return np.asarray(beta) * sat |
There was a problem hiding this comment.
We shouldn't have to do this.
There was a problem hiding this comment.
Addressed in b36929c. This method (_eval_saturation) and the companion _compute_steady_state_spend / _compute_adstock_ramp are all deleted. Everything now evaluates through the compiled PyTensor graph — no reimplementation of transformation formulas.
…dback Replace manual NumPy reimplementations of adstock/saturation with a compiled PyTensor graph (via extract_response_distribution), making the designer honest for any adstock/saturation combination. - Replace geometric-only ramp helpers with graph-based ramp computation - Replace "Adstock α" diagnostic with "Ramp @ Xw" (adstock-agnostic) - Add __init__(mmm) test coverage (happy path, error, and warning paths) - Add docstring justification for residual computation approach - Update golden test ramp values (intentional math change) Made-with: Cursor
Addressing review feedbackGraph-based ramp fraction (scope/contract fix)Deleted the three geometric-adstock-specific helpers (
Golden test ramp values updated (lift, assurance, and SNR are unchanged; the small ramp shift reflects the graph now accounting for saturation nonlinearity jointly with adstock).
|
Update two cells that accessed the deleted _posterior_samples dict to use the xarray posterior Dataset directly. Made-with: Cursor
|
@BugBot review |
PR SummaryMedium Risk Overview Exports the new types from Reviewed by Cursor Bugbot for commit 0fa1336. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 0fa1336. Configure here.
juanitorduz
left a comment
There was a problem hiding this comment.
Some initial comments :)
| def _geometric_adstock_np( | ||
| x: np.ndarray, alpha: float, l_max: int, normalize: bool = True | ||
| ) -> np.ndarray: | ||
| """Apply geometric adstock to a 1-D series (numpy).""" | ||
| n = len(x) | ||
| weights = alpha ** np.arange(l_max) | ||
| if normalize: | ||
| weights = weights / weights.sum() | ||
|
|
||
| out = np.zeros(n) | ||
| for t in range(n): | ||
| for lag in range(min(l_max, t + 1)): | ||
| out[t] += weights[lag] * x[t - lag] | ||
| return out | ||
|
|
||
|
|
||
| def _logistic_saturation_np(x: np.ndarray, lam: float) -> np.ndarray: | ||
| """Numpy logistic saturation (without beta scaling).""" | ||
| return (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x)) |
There was a problem hiding this comment.
This seems redundant, cant't we use https://github.com/pymc-labs/pymc-marketing/blob/main/pymc_marketing/mmm/transformers.py ?
| return (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x)) | ||
|
|
||
|
|
||
| def generate_experiment_fixture( |
There was a problem hiding this comment.
Do we need this for the code or just for tests?
There was a problem hiding this comment.
Could we re=use some of https://www.pymc-marketing.io/en/stable/notebooks/mmm/mmm_data_generator.html ?
| parts.append(")") | ||
| return "".join(parts) | ||
|
|
||
| # -- Jupyter rendering --------------------------------------------------- |
| return pd.DataFrame(records) | ||
|
|
||
|
|
||
| _HIGH_CORRELATION_THRESHOLD = 0.7 |
There was a problem hiding this comment.
at the top of the file?
Summary
ExperimentDesignerclass that recommends which marketing experiment to run (channel, spend level, duration) based on a fitted MMM's posterior uncertainty about channel response functionsDetails
Core computation: For each candidate experiment design (channel × spend change × duration), evaluates the posterior-predicted lift accounting for geometric adstock ramp-up, computes measurement noise σ = σ_residual · √T, and derives Bayesian assurance (expected power over the posterior distribution of the true effect).
Scoring dimensions: Channels are ranked by a configurable weighted sum of: posterior uncertainty, spend correlation, saturation gradient, assurance, and cost efficiency — all min-max normalised.
v1 scope: National-level experiments with
LogisticSaturation+GeometricAdstock(adstock_first=True). Geo-level designs, pulse/switchback experiments, and Fisher Information are deferred to v2.New files
pymc_marketing/mmm/experiment_design/designer.py—ExperimentDesignerclasspymc_marketing/mmm/experiment_design/recommendation.py—ExperimentRecommendationdataclasspymc_marketing/mmm/experiment_design/functions.py— Numpylogistic_saturationpymc_marketing/mmm/experiment_design/fixture.py—generate_experiment_fixture()Test plan
Towards #2355
Made with Cursor
📚 Documentation preview 📚: https://pymc-marketing--2356.org.readthedocs.build/en/2356/