The house is stronger than the foundations

October 9, 2017

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Oliver Maclaren writes:

Regarding the whole ‘double use of data’ issue with posterior predictive checks [see here and, for a longer discussion, here], I just wanted to note that David Cox describes the ‘Fisherian reduction’ as (I’ve summarised slightly; see p. 24 of ‘Principles of Statistical Inference’):

– Find the likelihood function
– Reduce to a sufficient statistic S of the same dimension as theta
– Estimate theta based on the sufficient statistic
– Use the conditional distribution of the data given S=s informally or formally to assess the adequacy of the formulation

Your conception of posterior predictive checks seems to me to be essentially the same
– Find a likelihood and prior
– Use Bayes to estimate the parameters.
– The data enters this estimation procedure only via the sufficient statistics (i.e. ‘the likelihood principle as applied within the model’)
– There is thus a ‘leftover’ part of the data y|S(y)
– You can use this to check the adequacy of the formulation
– Do this by conditioning on the sufficient statistics, i.e. using the posterior predictive distribution which was fit using the sufficient statistics

Formally, I think, the posterior predictive distribution is essentially p(y|S(y)) since Bayes only uses S(y) rather than the ‘full’ data.

Thus there is no ‘double use of data’ when checking the parts of the data corresponding to the ‘residual’ y|S(y).

On the other hand the aspects corresponding to the sufficient statistics are essentially ‘automatically fit’ by Bayes (to use your description in the PSA abstract).
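A minimal simulation sketch of this distinction, under assumptions that are not part of the original exchange: a toy model y_i ~ Normal(theta, 1) with a flat prior on theta, so that the sample mean is the sufficient statistic and theta | y ~ Normal(ybar, 1/n). The data are deliberately generated with standard deviation 2, so the model is wrong only in its ‘residual’ part. The function name ppc_pvalue and all numerical settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): the model says
# y_i ~ Normal(theta, 1), but the data actually have standard deviation 2.
n = 50
y = rng.normal(0.0, 2.0, size=n)

def ppc_pvalue(y, stat, n_rep=4000):
    """Monte Carlo estimate of Pr(stat(y_rep) >= stat(y) | y)."""
    n = len(y)
    # Posterior under a flat prior depends on y only through the mean
    theta = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=n_rep)
    # One replicated data set per posterior draw
    y_rep = rng.normal(theta[:, None], 1.0, size=(n_rep, n))
    return float(np.mean(stat(y_rep) >= stat(y)))

# Check based on the sufficient statistic: 'automatically fit'
p_mean = ppc_pvalue(y, lambda a: a.mean(axis=-1))

# Check based on residual information: the spread around the mean
p_sd = ppc_pvalue(y, lambda a: a.std(axis=-1, ddof=1))

print(f"p-value using the mean: {p_mean:.3f}")
print(f"p-value using the sd:   {p_sd:.3f}")
```

In this sketch the check on the mean should sit near 0.5 whatever the data look like, while the check on the standard deviation is the one that can reveal the misfit.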

You are probably aware of all of this, but it may help some conceptually.

I personally found it useful to make this connection in order to resolve (at least some parts of) the conflict between my intuitive understanding of why PPP are good and some of the formal objections raised.

My reply: When I did my formalization of predictive checks in the 1990s, it was really for non-Bayesian purposes: I had seen problems where I wanted to test a model and summarize that test, but the p-value depended on unknown parameters, so it made sense to integrate them out. Since then, posterior predictive checks have become popular among Bayesians, but I’ve been disappointed that non-Bayesians have not been making use of this tool. The non-Bayesians seem obsessed with the uniform distribution of the p-value, a property that makes no sense to me.
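For reference, the integration over unknown parameters can be written out as follows. This uses the standard notation for posterior predictive p-values, which is assumed here rather than taken from the exchange: T is a test quantity, y^rep is a replicated data set, and the indicator is 1 when its condition holds.

```latex
p_B \;=\; \Pr\!\left( T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \,\middle|\, y \right)
    \;=\; \iint \mathbf{1}\!\left[ T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \right]
          p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\,
          \mathrm{d}y^{\mathrm{rep}}\, \mathrm{d}\theta
```

A classical p-value would fix theta at a single value; here the dependence on theta is averaged over the posterior p(theta | y).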

The following papers might be relevant here:

Two simple examples for understanding posterior p-values whose distributions are far from uniform

Section 2.3 of A Bayesian formulation of exploratory data analysis and goodness-of-fit testing

Maclaren responded:

It seems to me that a relevant division of non-Bayesians is into something like

– Fisherians, e.g. David Cox and those who emphasise likelihood, conditioning, ‘information’ and ‘inference’. If they are interested in coverage it is usually conditional coverage with respect to the appropriate situation. Quite similar to your ideas on defining the appropriate ‘replications’ of interest.

– Neymanians, i.e. those with a more ‘pure’ Frequentist bent who emphasise optimality, decisions, coverage (often unconditional) etc.

I think the former are/would be much more sympathetic to your approach. For example, as noted I think Cox basically advocates the same thing in the simple case. Lindsey, Sprott etc also all emphasise the perspective of ‘information division’ which I think addresses at least some concerns with double use of data in simple cases.

With regard to having the ‘residual’ dependent on the parameters: presumably there is some intuitive notion here of a ‘weak’ or ‘local’ dependence on the fitted parameters (or something similar)? Or some kind of ‘inferential separation’? Perhaps an unusual model structure?

I’m trying to think of the ‘logic’ of information separation here.

For example, I can imagine a factorisation something like

P(Y|θ) = P(Y|S,α(λ))P(S|λ)


P(S|λ) gives the likelihood for fitting λ
P(Y|S,α(λ)) gives the residual for model checking, now depending on λ but via α(λ)

In this case the components of θ = (λ,α(λ)) seem to provide the needed separation, but they are not (variation) independent.

So it still makes sense to use your best estimate of λ in model checking to make sure you use a relevant α (i.e. average over λ’s posterior).

Something like a curved exponential model might fit this case.
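For contrast, here is the simple case in which the residual factor does not depend on the parameter at all. This is a sketch under an assumed toy model that is not in the original message: y_1, ..., y_n independent Normal(λ, σ²) with σ known and S equal to the sample mean.

```latex
P(y \mid \lambda) \;=\; P(y \mid \bar{y}) \, P(\bar{y} \mid \lambda),
\qquad
\bar{y} \mid \lambda \;\sim\; \mathrm{N}\!\left(\lambda, \tfrac{\sigma^2}{n}\right),
\qquad
y \mid \bar{y} \;\sim\; \mathrm{N}\!\left(\bar{y}\,\mathbf{1},\;
    \sigma^2\!\left(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right)\right)
```

Here the second factor carries all the information about λ and the first is free of it. In the factorisation considered above, the first factor would instead pick up a dependence on λ through α(λ).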

Just thinking out loud, really.

Me again: Sure, but also there’s all the regularization and machine learning stuff. Take, for example, the Stanford school of statistics: Efron, Hastie, Tibshirani, Donoho, etc. They use what I (and they) would call modern methods which I think of as Bayesian and they think of as regularized likelihood or whatever, but I think we all worship the same god even if we give it different names. When it comes to foundations, I’m pretty sure that the Stanford crew think in a so-called Neyman-Pearson framework with null hypotheses and error rates. There’s no doubt that they’ve had real success, both methodological and applied, with that false discovery rate approach, even though I still find it lacking as to me it’s based on a foundation of null hypotheses that is in my opinion worse than rickety.

In any case, I have mixed feelings about the relevance of posterior predictive p-values for these people. I would definitely like them to do some model checks, and I continue to feel that some posterior predictive distribution is the best way to get a reference set to use to compare observed data in a model check. But I think more and more that p-values are a dead end. I guess what I’d really like of non-Bayesian statisticians is for them to make their assumptions more explicit: to express their assumptions in the form of generative models, so that then these models can be checked and improved. Right now things are so indirect: the method is implicitly based on assumptions (or, to put it another way, the method will be most effective when averaging over data generating processes that are close to some ideal) but these assumptions are not stated clearly or always well understood, which I think makes it difficult to choose among methods or to improve them in light of data.

I’ve been thinking this a long time. I have a discussion of a paper of Donoho et al. from, ummm, 1990 or 1992, making some of the above points (in proto-fashion). But I don’t think I explained myself clearly enough: in their rejoinder, Donoho et al. saw that I was drawing a mathematical equivalence between their estimators and Bayesian priors, but I hadn’t been so clear on the positive virtues of making assumptions that can be rejected, with that rejection providing a direction for improvement.


Maclaren: There’s a lot here I agree with of course.

And yes, the cultures of statistics, and quantitative modeling generally, are pretty variable and it can be difficult to bridge gaps in perspective.

Now, some more overly long comments from me.

As some context for the different cultures aspect, I’ve bounced around maths and engineering departments while working on biological problems, industrial problems, working with physicists, mathematicians, statisticians, engineers, computer scientists, biologists etc. It has of course been very rewarding but the biggest barriers are usually basic ‘philosophical’ or cultural differences in how people see and formulate the main questions and methods of addressing these. These are much more entrenched than you realise until you try to actually bridge these gaps.

I wouldn’t really describe myself as a Bayesian, Frequentist, Likelihoodist, Machine Learner etc, despite seeing plenty of value in each approach. The more I read on foundations the more I find myself – to my surprise, since I used to view them as old-fashioned – quite sympathetic with Fisher, Barnard, Cox, Sprott, Barndorff-Nielsen etc. In particular on the organisation, reduction, splitting, combining etc of ‘information’ and the geometric perspective on this.

Hence me trying to understand PPC from this point of view. I think the simple point that you can for example base estimation on part of the data and checking on another part, and in simple cases represent this as a factorisation, clears a few things up for me. It also explains some (retrospectively) obvious results I saw when using PPC eg the difference between checks based directly on fitted stats vs those based on ‘residual’ information.

But even then I have plenty of disagreements with the Fisherian school, and would like to see it extended to more complex problems. Bayes in the Jeffreys, Jaynes vein is of course similar to this ‘organisation of information’ perspective, but I find Jaynes in particular tends to often make overly strong claims while ignoring mathematical and philosophical subtleties. Classic physicist style of course!

(I started writing some notes on a reformulation of this sort of perspective in terms of category theory but I doubt I’ll ever finish them or anyone would read them if I did! Sander did, surprisingly, offer some encouragement on this – the DAG people are probably more open to general abstract nonsense in diagram form!).

RE: The Stanford school. Yes they seem a somewhat strange mix of decision theory, optimisation and function approximation. (Though again, not an unfamiliar mix to me – I spend a fair amount of time around operations research people, and minored in it in undergrad. Everything is rewritten as an optimal decision problem.) And yes the statistical aspect seems to come from NP (Neyman-Pearson) origins.

The models are often implicit while primary focus is given to the ‘fitting procedure’. And to them, likelihood is mainly just an objective function to be maximised to get estimators to evaluate Frequentist style.

(Of course this connects to the two big misconceptions about likelihood analysis from both Bayes and Freq: one, that it’s just for getting Frequentist estimators, usually via maximisation; two, that it can’t handle nuisance parameters systematically.)

Bayes of course tends to blend modeling and inference. Both have pros and cons to me – I think there is benefit to separating a model from its analysis (think for example finding weak solutions to differential equations) while there is also benefit in seeing this in turn as a modified model (think for example rewriting a differential equation as an integral equation – again leads to weak solutions, but from a more model-based perspective).

Some people love to think in terms of models, some in terms of procedures. This is a difficult gap to bridge sometimes, particularly between eg stats/comp sci vs scientists. I think the idea of ‘measurement’ is important here. ‘Framework theories’ like quantum mechanics and thermodynamics provide a good guide to me, but of course there is no shortage of arguments over how to think about these subjects either!

In terms of p-values for model checking, I definitely prefer graphical checks. In terms of Frequentist parameter inference I differ from you I think in that I see value in seeing confidence intervals as inverted hypothesis tests. I prefer however to see them as something like an inverse image of the data in parameter space rather than as a measure of uncertainty or even as a measure of error rates.
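A small sketch of what ‘confidence intervals as inverted hypothesis tests’ means in practice, under assumptions that are not from the original exchange: made-up data, a Normal(theta, sigma²) model with sigma known, and a two-sided z-test at the 5% level. The interval is the set of candidate theta values that the test does not reject, which is one way to picture the ‘inverse image of the data in parameter space’.

```python
import numpy as np

# Hypothetical data, assumed Normal(theta, sigma^2) with sigma known
y = np.array([4.1, 5.3, 4.8, 5.9, 5.1, 4.6, 5.4, 5.0])
sigma = 1.0
n = len(y)
z_crit = 1.959964  # two-sided 5% critical value of the standard normal

# Candidate parameter values to test one at a time
theta_grid = np.linspace(3.0, 7.0, 4001)

# z statistic for the null hypothesis theta = theta0 at each grid point
z = (y.mean() - theta_grid) / (sigma / np.sqrt(n))

# Keep the theta0 values that the test does not reject
accepted = theta_grid[np.abs(z) <= z_crit]
ci_inverted = (accepted.min(), accepted.max())

# Closed-form interval for comparison
half_width = z_crit * sigma / np.sqrt(n)
ci_closed = (y.mean() - half_width, y.mean() + half_width)

print("inverted test:", ci_inverted)
print("closed form:  ", ci_closed)
```

The two intervals should agree up to the grid spacing. The grid version makes the construction explicit and would carry over to tests without a closed-form inverse.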

Me: I think it’s great when people come up with effective methods. What irritates me is when people tie themselves into knots trying to solve problems that in my opinion aren’t real. For example there’s this huge literature on simulation from the distribution of a discrete contingency table with known margins. But I think that’s all a waste of time because the whole point is to compute a p-value with respect to a completely uninteresting model in which margins are fixed, which corresponds to a design that’s just about never used. For another example, Efron etc. wasted who knows how many journal pages and man-years of effort on the problem of bootstrap confidence intervals. But I think the whole confidence intervals thing is a waste of time. (I think uncertainty intervals are great; what I specifically don’t like are those inferences that are supposed to have specified coverage conditional on any value of the unknown parameters, and which are defined by inverting hypothesis tests.)

There’s an expression I sometimes use with this work, which is that the house is stronger than the foundations.
