(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

There is a recent pre-print Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection by Quentin Gronau and Eric-Jan Wagenmakers. Wagenmakers asked for comments and so here are my comments.

**Short version:** They report a known limitation of LOO when it’s used in a non-recommended way for model selection. They report that their experiments show that LOO is worse than expected as it doesn’t favor the simpler true model strongly enough. I think their experiments show that the specific non-recommended LOO comparison is favoring the simpler model more than expected. I enjoyed the clarity of the writing and the experiments, but they are missing many limitations of LOO and alternative methods, and I disagree on conclusions made on the experiments.

**Long version:**

The paper focuses on model comparison. I’d like to remind that cross-validation overall and LOO are also useful for estimating the predictive performance and checking of a single model. You can find some examples in Gabry, Simpson, Vehtari, Betancourt, and Gelman (2018), loo package vignettes, and my model selection tutorial notebooks, but I’ll focus here on model comparison.

Cross-validation and LOO have many limitations. The paper focuses on a variation of one known limitation: “idealized case where there exist a data set of infinite size that is perfectly consistent with the simple model M_S, LOO will nevertheless fail to strongly endorse M_S.” The paper uses one specific variant of LOO model selection. The further variation in the paper is the use of idealized data in the experiments.

The strongest assumption in the paper is that true model is one of the models compared. This case is often called M-closed case (Bernardo and Smith, 1994). Cross-validation is usually recommended for M-open case where the true model is not included in the set of models compared (Bernardo and Smith, 1994), and thus we might say that the paper is using cross-validation for something that it is not recommended.

When we estimate the predictive performance of the model for the future data which is not yet available, we need to integrate over the future data distribution p_t(y), which we usually don’t know. Vehtari and Ojanen (2012) provide extensive survey of alternative scenarios. If we assume M-closed case, then we know that one of the models is p_t(y), but we don’t know which one. In that case, it is sensible to model p_t(y) as a posterior weighted average of the all models under consideration, that is, replace p_t(y) with Bayesian model averaging (BMA) model as proposed by San Martini and Spezzaferri (1984). This approach has the same model selection consistent behavior as Bayes factor. Thus if the assumption is that one of the models is true, BMA reference model approach by San Martini and Spezzaferri could be used, although asymptotic model selection consistency doesn’t guarantee good finite sample performance, and BMA may have worse performance than cross-validation for small sample sizes even if one of the models is true.

LOO is known to be model selection inconsistent in M-closed case, but it is good to remember that there are also cross-validation hold-out rules, which are model selection consistent (Shao, 1992). I don’t think these versions of cross-validation are that important, since there are better options for predictive performance estimation in M-closed case as mentioned above and below.

If the methods, that are model selection consistent in M-closed, are used in M-open case they will eventually give weight 1 to one model that is closest to the truth, but which can still be arbitrarily far from truth. In M-open case it can be better if the model weights stay away from 0 and 1 as demonstrated also by Yao, Vehtari, Simpson, and Gelman, 2018.

In M-open case, we assume that none of the models is true one, and we don’t trust any of our models to be even very close to true model. Due to the lack of trust, we make minimal assumptions for p_t(y) and instead of any explicit model, we re-use data as pseudo Monte Carlo draws from the unknown distribution. Alternatively we can also say that we are using Dirichlet distribution with unknown probability mass at the observed locations as a non-parametric model of p_t(y). This step assumes that we believe that this simple non-parametric assumption can represent the relevant properties of the true distribution, which is one of the big limitations of cross-validation. The edge case of theta_t being very close to 0 or 1 and respectively all observations being 0 or 1, is known to be very sensitive to prior assumptions in any inference (see, e.g., Bernardo and Juárez, 2003) and using all 1 observations for a Dirichlet distribution is clearly missing information about the possible variability in the future. In this kind of very weak data for some important features of the future distribution, I would not use Dirichlet distribution, but instead would insist that stronger model information is included in p_t(y). Classic example is the rare disease prevalence estimation, where we might observe hundreds of healthy persons and no persons with disease. Without good prior information, the posterior uncertainty remains large on how far from 0 probability we are and the results are necessarily sensitive to prior information (as reflected also in experiment 1). For experiments 2 and 3 the Dirichlet distribution approximation is likely to work better.

Non-monotonicity observed in some weight curves is likely due to the fact that idealized data is symmetric and centered, and when leaving one observation out, this symmetry and centering doesn’t hold anymore. I guess that in scenario 2, leave-two-out approach by leaving a pair of 0 and 1 at the same time wouldn’t have non-monotonicity in curves. The same holds for Example 3, and it would be better to generate the idealized data as pairs which have the same absolute value but different signs and leave two out at the same time. This way the examples would be even more idealized and still showing the limitations of LOO.

In M-closed case, it is known that LOO is not model selection consistent (e.g., Shao, 1992), which means that the weight for the true model never goes to 1. Gronau and Wagenmakers write that they were surprised that model weights stayed so far from 1. I was surprised that the model weights were so close to 1. If asymptotically the simpler and more complex model are giving practically the same predictions (except in example 1, H_0 model is not giving any probability for a 0 event), then I would assume the weights to be closer to 0.5. I can think two reasons why the weights are not closer to 0.5:

- Idealized data makes also LOO to favor the simpler model more and if the data would be generated from the true generating process in examples 2 and 3, I would expect the weights to be closer to 0.5.
- Gronau and Wagenmakers are computing the weights directly from LOO expectations (Pseudo BF), although at least since Breiman et al. (1984, ch. 11), it has been recommended that the uncertainty in cross-validation model comparison should also be taken into account. That is, the model with higher estimated performance is not automatically selected, but uncertainty in the estimates should be considered, too (see also, Vehtari, Gelman, and Gabry, 2017).

My recommendation is that if LOO comparison taking into account the uncertainties says that there is no clear winner, then neither of the models is selected and model averaging is used instead. In case of two models, if both models give very similar predictions, then it doesn’t matter which one is used. In case of nested models I would choose the more complex model to be certain that uncertainties are not underestimated (which happens in the simpler model as some parameters are fixed compare to the more complex model) and then make strict model checking and calibration, and then proceed with M-completed approach to decide if some parts can be dropped as discussed below.

Yao, Vehtari, Simpson, and Gelman (2018) propose Pseudo-BMA+ (for model weights this could be called Pseudo-BF+ weights) which take the relevant uncertainty in the account and produces weights which are further away from 0 and 1.

Yao, Vehtari, Simpson, and Gelman (2018) propose also Bayesian stacking which uses LOO as part of the approach. The Pseudo-BMA(+) has the limitations that it’s seeing only scalar value of the predictive performance, while stacking has the benefit that it sees also how similar or different the predictive distributions of different models are and thus it can jointly optimize better weights (Yao, Vehtari, Simpson, and Gelman, 2018).

Yuling Yao commented in an email:

A quick note is that in the beta-bernoulli examples, the stacking weights can be derived analytically. It is 1 for H_0 and 0 for H_1. Surprisingly it is independent of both the prior distribution and sample size n. The independence of n may look suspicious. But intuitively when each data point (example 1) or each pair of data point (example 2) uniformly supports H_0 more than H_1, it does not take n \to infinity to conclude H_0 dominates H_1. Also only when n goes to infinity shall we observe perfect 0-1 pair in example 2 and exact sample mean= 0 in example 3, so the stacking weighting/selection depends on sample size implicitly.

Although the stacking seems to produce something Gronau and Wagenmakers desire, I would not trust stacking in the edge case with all 1 observations for the reasons I mentioned above. I guess that in this idealized case, stacking with symmetric-leave-two-out would also converge faster.

Gronau and Wagenmakers focused on the case of the simpler model being true, but to better illustrate the limitation of the LOO, it would be good to consider also the case where the more complex model is true one. Consider following alternative experiments where the more complex model is true one.

- In example 1, let the true theta_t= 1-epsilon, but for the simpler model keep theta_0=1.
- In example 2, let the true theta_t= 1/2+epsilon, but for the simpler model keep theta_0=1/2.
- In example 3, let the true theta_t= epsilon, but for the simpler model keep theta_0.

If we choose epsilon very small but within the limits of the floating point accuracy for the experiments, we should see the same weights as as in the original experiments as long as we observe the same data, and only when we occasionally observe one extra 0 in example 1, one extra 1 in example 2, or extra positive value in 3 we would see differences. In the example 1, even one 0 will make theta=1 to have zero probability.

Another experiment to illustrate the limitations of LOO (and cross-validation and information criteria in general) would be to vary epsilon from very small to much larger, plotting how large the epsilon needs to be before we see that the more complex model is strongly favored. I’m not certain what happens for the Pseudo-BF version with no proper uncertainty handling used by Gronau and Wagenmakers, but one of the big limitations of the cross-validation is that the uncertainty about the future when not making any model assumptions is so big, that for being able to make confident choice between the model the epsilon needs to be larger than what would be needed if we just look at the posterior distribution of theta in this kind of simple models (see demonstration in a notebook, and related results by Wang and Gelman, 2014). This is the price we pay for not trusting any model and thus not getting benefits of reduced variance through the proper modeling of the future data distribution! This variability makes LOO bad for model selection when the differences between the model are small, and it just gets worse in case of a large number of models with similar true performance (see, e.g. Piironen and Vehtari, 2017). Even with just two models to compare, cross-validation has also a limitation that the simple variance estimate tend to be optimistic (Bengio and Grandvalet, 2004).

If we are brave enough to assume M-closed or M-completed case, then we can reduce the uncertainty in the model comparison by using a reference model for p_t(y) as demonstrated by Piironen and Vehtari (2017) (see also demonstration in a notebook). In M-completed case, we assume that none of the models is true one, but there is one model which in the best way presents all the uncertainties we can think of (see more in Vehtari and Ojanen, 2012). M-completed case is close to San-Martini and Spezzaferri’s M-closed approach, but with BMA model replaced with just some model we trust. For example, in variable selection the reference model in the M-completed case can be a model with all variables and a sensible prior on coefficients, and which has survived through the model checking and assessment (which may involve cross-validation for that single model). In such case we have started with M-open assumption, but after model checking and assessment we trust one of the models enough that we can switch to M-completed case for certain model selection tasks. In case of the reference model, we can further reduce the variance in the model selection by using the projection predictive approach as demonstrated by Piironen and Vehtari (2017) (with a reference implementation in projpred package). In M-closed case and BMA reference model, projection predictive approach is also model selection consistent, but more importantly it has very low variance in model selection in finite case.

Often the discussion on cross-validation vs. Bayes factors focuses on arguments whether M-closed is sensible assumption. I think M-closed case is rare, but if you insist on M-closed, then you can still use a predictive approach (Vehtari and Ojanen, 2012). If BMA is easy to compute and stable use that as the reference model and then do the model selection as San Martini and Spezzaferri (1984) or even better with the projection predictive approach (Piironen and Vehtari, 2017). If BMA is difficult to compute or unstable, I would recommend trying Bayesian stacking (Yao, Vehtari, Simpson, and Gelman, 2018).

Mostly I don’t trust any model, and I assume M-open case and that it’s possible that models are badly mis-specified. Before any model selection I can discard models based on prior, posterior and cross-validation predictive checking (see, e.g., Gabry, Simpson, Vehtari, Betancourt, and Gelman, 2018). For M-open model selection with a small number of models I use cross-validation with uncertainty estimates (Vehtari, Gelman, and Gabry, 2017). If there is no clear winner, then I recommend model averaging with Bayesian stacking (Yao, Vehtari, Simpson, and Gelman, 2018). In case of large number of models, I recommend trying to convert the problem to M-completed, and to use reference predictive approach or even better the projection predictive approach (Vehtari and Ojanen, 2012; Piironen and Vehtari, 2017).

Since the above is a long discussion, here is my final recommendations how to revise the paper if I would be a hypothetical reviewer

- Since the paper is citing Vehtari, Gelman, & Gabry (2017) and Yao, Vehtari, Simpson, & Gelman (in press), the minimal requirement would be to mention that these papers explicitly recommend to compute the uncertainties due to not knowing the feature data, and they do not recommend LOO weights as used in the experiments in the paper. In the current form, the paper gives misleading impression about the recommended use of LOO.
- Even better would be to add also experiments using the other LOO based model weights (Pseudo-BMA+ and Bayesian stacking) introduced in Yao, Vehtari, Simpson, & Gelman (in press). Or at least mention in the discussion that it would be better to make additional experiments with these. I would love to see these results.
- Even better would be to add small epsilon and varying epsilon experiments described above. Or at least mention in the discussion that it would be better to make additional experiments with these. I would love to see these results.
- Even better would be to add projection predictive experiments. I would love to see research on the limitations of the projection predictive approach. I know this one requires much more work, and thus I understand if the authors are not willing to go that far.
- Minor comment: For the earlier use of Bayesian LOO see Geisser (1975).

The post Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**