diff --git a/vignettes/loo2-lfo.Rmd b/vignettes/loo2-lfo.Rmd new file mode 100644 index 00000000..97b8499f --- /dev/null +++ b/vignettes/loo2-lfo.Rmd @@ -0,0 +1,804 @@ +--- +title: "Approximate leave-future-out cross-validation for time series models" +author: Paul Bürkner, Aki Vehtari, Jonah Gabry +date: "`r Sys.Date()`" +output: + html_vignette: + toc: yes +encoding: "UTF-8" +params: + EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true") +--- + + + +```{r, child="children/SETTINGS-knitr.txt"} +``` + +```{r more-knitr-ops, include=FALSE} +knitr::opts_chunk$set( + cache = TRUE, + message = FALSE, + warning = FALSE +) +``` + +## Introduction + +One of the most common goals of a time series analysis is to use the observed +series to inform predictions for future observations. We will refer to this task +of predicting a sequence of $M$ future observations as $M$-step-ahead prediction +($M$-SAP). Fortunately, once we have fit a model and can sample from the +posterior predictive distribution, it is straightforward to generate predictions +as far into the future as we want. It is also straightforward to evaluate the +$M$-SAP performance of a time series model by comparing the predictions to the +observed sequence of $M$ future data points once they become available. + +Unfortunately, we are often in the position of having to use a model to inform +decisions _before_ we can collect the future observations required for assessing +the predictive performance. If we have many competing models we may also need to +first decide which of the models (or which combination of the models) we should +rely on for predictions. In these situations the best we can do is to use +methods for approximating the expected predictive performance of our models +using only the observations of the time series we already have. + +If there were no time dependence in the data or if the focus is to assess +the non-time-dependent part of the model, we could use methods like +leave-one-out cross-validation (LOO-CV). For a data set with $N$ observations, +we refit the model $N$ times, each time leaving out one of the $N$ observations +and assessing how well the model predicts the left-out observation. LOO-CV is +very expensive computationally in most realistic settings, but the Pareto +smoothed importance sampling (PSIS, Vehtari et al, 2017ab) algorithm provided by +the *loo* package allows for approximating exact LOO-CV with PSIS-LOO-CV. +PSIS-LOO-CV requires only a single fit of the full model and comes with +diagnostics for assessing the validity of the approximation. + +With a time series we can do something similar to LOO-CV but, except in a few +cases, it does not make sense to leave out observations one at a time because +then we are allowing information from the future to influence predictions of the +past (i.e., times $t + 1, t+2, \ldots$ should not be used to predict for time +$t$). To apply the idea of cross-validation to the $M$-SAP case, instead of +leave-*one*-out cross-validation we need some form of leave-*future*-out +cross-validation (LFO-CV). As we will demonstrate in this case study, LFO-CV +does not refer to one particular prediction task but rather to various possible +cross-validation approaches that all involve some form of prediction for new +time series data. Like exact LOO-CV, exact LFO-CV requires refitting the model +many times to different subsets of the data, which is computationally very +costly for most +nontrivial examples, in particular for Bayesian analyses where refitting the +model means estimating a new posterior distribution rather than a point +estimate. + +Although PSIS-LOO-CV provides an efficient approximation to exact LOO-CV, until +now there has not been an analogous approximation to exact LFO-CV that +drastically reduces the computational burden while also providing informative +diagnostics about the quality of the approximation. In this case study we +present PSIS-LFO-CV, an algorithm that typically only requires refitting the +time-series model a small number times and will make LFO-CV tractable for many +more realistic applications than previously possible. + + + +## $M$-step-ahead predictions + +Assume we have a time series of observations +$\mathbf{y} = (y_1, y_2, \ldots, y_N)$ +and let $L$ be the _minimum_ number of observations from the series that +we will require before making predictions for future data. Depending on the +application and how informative the data is, it may not be possible to make +reasonable predictions for $y_{i}$ based on $(y_1, \dots, y_{i-1})$ until $i$ is +large enough so that we can learn enough about the time series to predict future +observations. Setting $L=10$, for example, means that we will only assess +predictive performance starting with observation $y_{11}$, so that we +always have at least 10 previous observations to condition on. + +In order to assess $M$-SAP performance we would like to compute the +predictive densities + +$$ +p(y_{i \text{threshold}$) until we +arrive at $i = L+1$. Recall that $L$ is the minimum number of observations we +have deemed acceptable for making predictions (setting $L=0$ means predictions +of all observations should be computed). + +For LOO, we recommend to use a threshold of $0.7$ (Vehtari et al, 2017ab), but +for LFO, the threshold should be smaller as influential observations will affect +all pointwise LFO approximations until the next refit and thus much stronger +than in LOO. For the present case study, we choose a threshold of $0.6$, but +more experiments are needed to give general recommendations about the threshold +in LFO. + + +## Autoregressive models + +Autoregressive (AR) models are some of the most commonly used time-series models. +An AR(p) model ---an autoregressive model of order $p$--- can be defined as + +$$ +y_i = \eta_i + \sum_{k = 1}^p \varphi_k y_{i - k} + \varepsilon_i, +$$ + +where $\eta_i$ is the linear predictor for the $i$th observation, $\phi_k$ are +the autoregressive parameters and $\varepsilon_i$ are pairwise independent +errors, which are usually assumed to be normally distributed with equal variance +$\sigma^2$. The model implies a recursive formula that allows for computing the +right-hand side of the above equation for observation $i$ based on the values of +the equations for previous observations. + +## Case Study: Annual measurements of the level of Lake Huron + +To illustrate the application of PSIS-LFO-CV for estimating expected $M$-SAP +performance, we will fit a model for 98 annual measurements of the water level +(in feet) of [Lake Huron](https://en.wikipedia.org/wiki/Lake_Huron) from the +years 1875--1972. This data set is found in the **datasets** R package, which is +installed automatically with **R**. + +In addition to the **loo** package, for this analysis we will use the **brms** +interface to Stan to generate a Stan program and fit the model, and also the +**bayesplot** and **ggplot2** packages for plotting. + +```{r pkgs, cache=FALSE} +library("loo") +library("brms") +library("bayesplot") +library("ggplot2") +color_scheme_set("brightblue") +theme_set(theme_default()) + +CHAINS <- 4 +SEED <- 1234 +set.seed(SEED) +``` + +Before fitting a model, we will first put the data into a data frame and then +look at the time series. + +```{r} +N <- length(LakeHuron) +df <- data.frame( + y = as.numeric(LakeHuron), + year = as.numeric(time(LakeHuron)), + time = 1:N +) + +# save plot labels to reuse them +plot_labs <- labs( + y = "Water Level (ft)", + x = "Year", + title = "Water Level in Lake Huron (1875-1972)" +) + +ggplot(df, aes(x = year, y = y)) + + geom_point(size = 1) + + plot_labs +``` + +The above plot shows rather strong autocorrelation of the time-series as well as +some trend towards lower levels for later points in time. + +We can specify an AR(4) model for these data using the **brms** package as +follows: + +```{r fit, results = "hide"} +control <- list(adapt_delta = 0.95) +fit <- brm( + y ~ 1, + data = df, + autocor = cor_ar(~time, p = 4), + prior = prior(normal(0, 0.5), class = "ar"), + control = control, + seed = SEED, + chains = CHAINS +) +``` + +The model implied predictions along with the observed values can be plotted, +which reveals a rather good fit to the data. + +```{r, cache = FALSE} +preds <- posterior_predict(fit) +preds <- cbind( + Estimate = colMeans(preds), + Q5 = apply(preds, 2, quantile, probs = 0.05), + Q95 = apply(preds, 2, quantile, probs = 0.95) +) + +ggplot(cbind(df, preds), aes(x = year, y = Estimate)) + + geom_smooth(aes(ymin = Q5, ymax = Q95), stat = "identity", size = 0.5) + + geom_point(aes(y = y)) + + labs(subtitle = "Mean (blue) and 90% predictive intervals (gray) vs. observed data (black)") + + plot_labs +``` + +To allow for reasonable predictions of future values, we will require at least +$L = 20$ historical observations (20 years) to make predictions. + +```{r} +L <- 20 +``` + +We first perform approximate leave-one-out cross-validation (LOO-CV) for the +purpose of later comparison with exact and approximate LFO-CV for the 1-SAP +case. + +```{r, cache = FALSE} +loo_cv <- loo(log_lik(fit)[, (L + 1):N]) +print(loo_cv) +``` + +## 1-step-ahead predictions leaving out all future values + +The most basic version of $M$-SAP is 1-SAP, in which we predict only one step +ahead. In this case, $y_{i 0.7$. This is indicated by the vertical line in the plot below, which +corresponds to the point $i^\star_1$ discussed in Section +[Approximate $M$-SAP using importance-sampling](#approximate_MSAP) above. + +```{r plot_ks} +plot_ks <- function(ks, ids, thres = 0.7) { + dat_ks <- data.frame(ks = ks, ids = ids) + ggplot(dat_ks, aes(x = ids, y = ks)) + + geom_point(aes(color = ks > thres), shape = 3, show.legend = FALSE) + + geom_hline(yintercept = thres, linetype = 2, color = "red2") + + scale_color_manual(values = c("cornflowerblue", "darkblue")) + + labs(x = "Data point", y = "Pareto k") + + ylim(-0.5, 1.5) +} +``` + +```{r, cache = FALSE} +plot_ks(ks, ids) +``` + +Consequently, it is not surprising that the resulting approximate ELPD for 1-SAP +is far away from the exact ELPD we computed above: + +```{r, cache = FALSE} +approx_elpd_1sap_no_refit <- + c(ELPD = sum(approx_elpds_1sap_no_refit, na.rm = TRUE)) +rbind_print( + "approx LFO (no refit)" = approx_elpd_1sap_no_refit, + "exact LFO" = exact_elpd_1sap +) +``` + +Next, we compute approximate 1-SAP with refit at observations where +the Pareto $k$ estimate exceeds the threshold of $0.6$. + +```{r} +k_thres <- 0.6 +``` + +The code becomes a little bit more involved to handle the refitting procedure. +Note that we can compute exact 1-SAP at the refitting points, which comes with +no additional computational costs since we had to refit the model anyway. + +```{r refit_loglik, results="hide"} +loglik <- matrix(nrow = nsamples(fit), ncol = N) +approx_elpds_1sap <- rep(NA, N) + +fit_part <- fit +ids <- N:(L + 1) +i_refit <- N +refits <- NULL +ks <- NULL + +for (i in ids) { + loglik[, i] <- log_lik(fit_part)[, i] + logratio <- sum_log_ratios(loglik, i:i_refit) + psis_part <- suppressWarnings(psis(logratio)) + k <- pareto_k_values(psis_part) + ks <- c(ks, k) + if (k > k_thres) { + # refit the model based on the first i-1 observations + i_refit <- i + refits <- c(refits, i) + fit_part <- update(fit_part, newdata = df[1:(i - 1), ], recompile = FALSE) + loglik[, i] <- log_lik(fit_part, newdata = df[1:i, ])[, i] + approx_elpds_1sap[i] <- log_mean_exp(loglik[, i]) + } else { + lw_i <- weights(psis_part, normalize = TRUE)[, 1] + approx_elpds_1sap[i] <- log_sum_exp(lw_i + loglik[, i]) + } +} +``` + +We see that the final Pareto-$k$-estimates are mostly well below the threshold +of `r k_thres` and that we only needed to refit the model `r length(refits)` times at +observations `r refits`. + +```{r, cache=FALSE} +plot_ks(ks, ids) +``` + +The approximate 1-SAP ELPD is remarkably similar to the exact 1-SAP ELPD +computed above, which indicates our algorithm to compute approximate 1-SAP +worked well for the present data and model. + +```{r, cache = FALSE} +approx_elpd_1sap <- sum(approx_elpds_1sap, na.rm = TRUE) +rbind_print( + "approx LFO (no refits)" = approx_elpd_1sap_no_refit, + "approx LFO (with refits)" = approx_elpd_1sap, + "exact LFO" = exact_elpd_1sap +) +``` + +Plotting exact against approximate predictions, we see that no approximation +value deviates far from its exact counterpart, providing further evidence for +the good quality of our approximation. + +```{r, cache = FALSE} +dat_elpd <- data.frame( + approx_elpd = approx_elpds_1sap, + exact_elpd = exact_elpds_1sap +) + +ggplot(dat_elpd, aes(x = approx_elpd, y = exact_elpd)) + + geom_abline(color = "gray30") + + geom_point(size = 2) + + labs(x = "Approximate ELPDs", y = "Exact ELPDs") +``` + +```{r, echo=FALSE} +max_diff <- with(dat_elpd, max(abs(approx_elpd - exact_elpd), na.rm = TRUE)) +mean_diff <- with(dat_elpd, mean(abs(approx_elpd - exact_elpd), na.rm = TRUE)) +``` + +No observation shows a difference between the exact and approximate ELPD +calculations of more than `r round(max_diff, 2)` and the average difference is +`r round(mean_diff, 3)`, resulting in a close approximation. + +## $M$-step-ahead predictions leaving out all future values + +To illustrate the application of $M$-SAP for $M > 1$, we compute next exact and +approximate LFO-CV for the 4-SAP case. + +### Exact $M$-step-ahead predictions + +The necessary steps are the same as for 1-SAP with the exception that the +log-density values of interest are now the sums of the log predictive densities +of four consequtive observations. Further, the stability of the PSIS +approximation actually stays the same for all $M$ as it only depends on the +number of observations we leave out, not on the number of observations we +predict. + +```{r exact_loglikm, results="hide"} +M <- 4 +loglikm <- matrix(nrow = nsamples(fit), ncol = N) +for (i in (N - M + 1):max(L + 1, 2)) { + fit_i <- update(fit, newdata = df[-(i:N), ], recompile = FALSE) + ll <- log_lik(fit_i, newdata = df[1:(i + M - 1), ]) + loglikm[, i] <- rowSums(ll[, i:(i + M - 1)]) +} +``` + +```{r, cache = FALSE} +exact_elpds_4sap <- apply(loglikm, 2, log_mean_exp) +(exact_elpd_4sap <- c(ELPD = sum(exact_elpds_4sap, na.rm = TRUE))) +``` + +### Approximate $M$-step-ahead predictions + +Computing the approximate PSIS-LFO-CV for the 4-SAP case is a little bit more +involved than the approximate version for the 1-SAP case, although the +underlying principles remain the same. + +```{r refit_loglikm, results="hide"} +loglikm <- loglik <- matrix(nrow = nsamples(fit), ncol = N) +approx_elpds_4sap <- rep(NA, N) +fit_part <- fit +ids <- (N - M + 1):(L + 1) +i_refit <- N - M + 1 +refits <- NULL +ks <- NULL + +loglik[, (N - M + 2):N] <- log_lik(fit_part)[, (N - M + 2):N] + +for (i in ids) { + ll <- log_lik(fit_part, newdata = df[1:(i + M - 1), ]) + loglikm[, i] <- rowSums(ll[, i:(i + M - 1)]) + loglik[, i] <- ll[, i] + logratio <- sum_log_ratios(loglik, i:i_refit) + psis_part <- suppressWarnings(psis(logratio)) + k <- pareto_k_values(psis_part) + ks <- c(ks, k) + if (k > k_thres) { + # refit the model based on the first i-1 observations + i_refit <- i + refits <- c(refits, i) + fit_part <- update( + fit_part, newdata = df[1:(i - 1), ], + recompile = FALSE + ) + ll <- log_lik(fit_part, newdata = df[1:(i + M - 1), ]) + loglik[, i] <- ll[, i] + loglikm[, i] <- rowSums(ll[, i:(i + M - 1)]) + approx_elpds_4sap[i] <- log_mean_exp(loglikm[, i]) + } else { + lw_i <- weights(psis_part, normalize = TRUE)[, 1] + approx_elpds_4sap[i] <- log_sum_exp(lw_i + loglikm[, i]) + } +} +``` + +Again, we see that the final Pareto-$k$-estimates are mostly well below the +threshold and that we only needed to refit the model `r length(refits)` times at +observations `r refits`. + +```{r, cache = FALSE} +plot_ks(ks, ids) +``` + +The approximate ELPD computed for the 4-SAP case is not as close to its exact +counterpart as in the 1-SAP case. Likely, we need to lower the threshold +for the Pareto-k-values from `r k_thres` to `r k_thres - 0.05` or even +`r k_thres - 0.1` in order to achieve better results. + +```{r, cache = FALSE} +approx_elpd_4sap <- sum(approx_elpds_4sap, na.rm = TRUE) +rbind_print( + "Approx LFO" = approx_elpd_4sap, + "Exact LFO" = exact_elpd_4sap +) +``` + +Plotting exact against approximate pointwise predictions confirms that, for a +few specific data points, the approximate predictions underestimate the exact +predictions. + +```{r, cache = FALSE} +dat_elpd_4sap <- data.frame( + approx_elpd = approx_elpds_4sap, + exact_elpd = exact_elpds_4sap +) + +ggplot(dat_elpd_4sap, aes(x = approx_elpd, y = exact_elpd)) + + geom_abline(color = "gray30") + + geom_point(size = 2) + + labs(x = "Approximate ELPDs", y = "Exact ELPDs") +``` + +## 1-step-ahead predictions leaving out blocks of future values + +Depending on the particular time-series model and data, the Pareto $k$ estimates +may exceed `r k_thres` rather quickly (i.e., after only few observations) and so +many refits may be required even when carrying out the PSIS approximation +to LFO-CV. In this case, another option is to exclude only the block of $B$ +future values that directly follow the observations to be predicted while +retaining all of the more distant values $y_{i>B} = (y_{i + B}, \ldots, y_N)$. +This will usually result in lower Pareto $k$ estimates and thus less refitting. + +This block-$M$-SAP version closely resembles the basic $M$-SAP only if values in +the distant future, $y_{>B}$, contain little information about the current +observations being predicted. Whether this assumption is justified will depend +on the data and model. + +We will demonstrate the LFO-CV approach for the block-$M$-SAP case using +block-1-SAP. + +### Exact LFO-CV for block-1-SAP + +The differences between the code for LFO-CV for 1-SAP and the code for +block-1-SAP are very small. We now just need to define a block size $B$ of +left-out observations, which we set to $B = 10$ for the present case study. + +```{r exact_loglik_block, results="hide"} +B <- 10 +loglik <- matrix(nrow = nsamples(fit), ncol = N) +for (i in N:(L + 1)) { + to <- min(i + B - 1, N) + fit_i <- update(fit, newdata = df[-(i:to), ], recompile = FALSE) + loglik[, i] <- log_lik(fit_i, newdata = df[1:i, ])[, i] +} +``` + +Comparing the exact block-1-SAP to the exact 1-SAP reveals no substantial +difference for this particular data set and model, but this may very well be +different for other data sets and models. + +```{r, cache = FALSE} +exact_elpds_1sap_block <- apply(loglik, 2, log_mean_exp) +exact_elpd_1sap_block <- c(ELPD = sum(exact_elpds_1sap_block, na.rm = TRUE)) +rbind_print( + "Exact LFO-CV (1-SAP)" = exact_elpd_1sap, + "Exact LFO-CV (block-1-SAP)" = exact_elpd_1sap_block +) +``` + +If exact block-1-SAP and exact 1-SAP were different, it would not necessarily +imply that observations depend on distant future values, but could instead be +due to the fact that the model parameters were estimated more precisely in the +case of block-1-SAP since more data are available, in particular for predicting +observations early in the time series. If this is the case we would expect the +differences between the pointwise ELPDs for the two cases to be larger for +earlier observations. To some degree, this is also what we see when plotting the +differences for the present example, although these differences happen to cancel +each other out in this case: + +```{r, cache = FALSE} +dat_elpd_diff <- data.frame( + elpd_diff = exact_elpds_1sap_block - exact_elpds_1sap, + i = seq_along(exact_elpds_1sap_block) +) + +ggplot(dat_elpd_diff[(L + 1):N, ], aes(x = i, y = elpd_diff)) + + geom_point(size = 2) + + labs(x = "Data Point", y = "ELPD diff") +``` + +### Approximate 1-step-ahead predictions + +We compute approximate PSIS-LFO-CV for block-1-SAP as follows. + +```{r refit_loglik_block, results="hide"} +loglik <- matrix(nrow = nsamples(fit), ncol = N) +approx_elpds_1sap_block <- rep(NA, N) +fit_part <- fit +ids <- N:(L + 1) +i_refit <- N +refits <- NULL +ks <- NULL + +for (i in ids) { + loglik[, i] <- log_lik(fit_part)[, i] + to <- min(i + B - 1, i_refit) + logratio <- sum_log_ratios(loglik, i:to) + psis_part <- suppressWarnings(psis(logratio)) + k <- pareto_k_values(psis_part) + ks <- c(ks, k) + if (k > k_thres) { + # refit the model without a block of B observations + i_refit <- i + refits <- c(refits, i) + fit_part <- update(fit_part, newdata = df[-(i:to), ], recompile = FALSE) + loglik[, i] <- log_lik(fit_part, newdata = df[1:i, ])[, i] + approx_elpds_1sap_block[i] <- log_mean_exp(loglik[, i]) + } else { + lw_i <- weights(psis_part, normalize = TRUE)[, 1] + approx_elpds_1sap_block[i] <- log_sum_exp(lw_i + loglik[, i]) + } +} +``` + +Here we needed `r length(refits)` refit(s), which is remarkable given that we +have a time-series of 98 observations and are predicting for 98 - L = `r 98 - L` +of them. + +```{r, cache = FALSE} +plot_ks(ks, ids) +``` + +Consequently, it is not surprising that PSIS-LFO-CV for block-1-SAP matches the +exact version closely. + +```{r, cache = FALSE} +approx_elpd_1sap_block <- c(ELPD = sum(approx_elpds_1sap_block, na.rm = TRUE)) +rbind_print( + "Exact LFO-CV (block-1-SAP)" = exact_elpd_1sap_block, + "Approx LFO-CV (block-1-SAP)" = approx_elpd_1sap_block +) +``` + +This is also visible in the plot of approximate against exact predictions. + +```{r, cache = FALSE} +dat_elpd_block <- data.frame( + approx_elpd = approx_elpds_1sap_block, + exact_elpd = exact_elpds_1sap_block +) + +ggplot(dat_elpd_block, aes(x = approx_elpd, y = exact_elpd)) + + geom_abline(color = "gray30") + + geom_point(size = 2) + + labs(x = "Approximate ELPDs", y = "Exact ELPDs") +``` + +When increasing $M$ in block-M-SAP, we have to make sure to simultaneously +increase the block size $B$, as well, in order to keep the assumptions of +block-M-SAP satisfied. This also implies that the stability of the PSIS +approximation will decrease as we increase $M$ in block-M-SAP because we have +to leave out more observations. + +## Conclusion + +In this case study we have shown how to do carry out exact and approximate +leave-future-out cross-validation for $M$-step-ahead prediction tasks. For the +data and model used in our example the PSIS-LFO-CV algorithm provides reasonably +stable and accurate results depite not requiring us to refit the model nearly as +many times. For many other examples we expect similar performance, but the +results will always vary from case to case depending on the particulars of the +data and the time series model. + +
+ +## References + +Vehtari A., Gelman A., & Gabry J. (2017a). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. *Statistics and Computing*, 27(5), 1413--1432. doi:10.1007/s11222-016-9696-4. [Online](http://link.springer.com/article/10.1007/s11222-016-9696-4). [arXiv preprint arXiv:1507.04544](https://arxiv.org/abs/1507.04544). + +Vehtari A., Gelman A., & Gabry J. (2017b). Pareto smoothed importance sampling. [arXiv preprint arXiv:1507.02646](https://arxiv.org/abs/1507.02646). + +
+ +## Appendix + +### Appendix: Session information + +```{r} +sessionInfo() +``` + +### Appendix: Licenses + +* Code © 2018, Paul Bürkner, Aki Vehtari, Jonah Gabry (licensed under BSD-3). +* Text © 2018, Paul Bürkner, Aki Vehtari, Jonah Gabry (licensed under CC-BY-NC 4.0).