“Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science”

As promised, let’s continue yesterday’s discussion of Christopher Tong’s article, “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science.”

First, the title, which makes an excellent point. It can be valuable to think about measurement, comparison, and variation, even if commonly used statistical methods can mislead.

This reminds me of the idea in decision analysis that the most important thing is not the solution of the decision tree but rather what you decide to put in the tree in the first place, or even, stepping back, what your goals are. The idea is that the threat of decision analysis is more powerful than its execution (as Chrissy Hesse might say): the decision-analytic thinking pushes you to think about costs and uncertainties and alternatives and opportunity costs, and that’s all valuable even if you never get around to performing the formal analysis. Similarly, I take Tong’s point that statistical thinking motivates you to consider design, data quality, bias, variance, conditioning, causal inference, and other concerns that will be relevant, whether or not they all go into a formal analysis.
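Just to make the analogy concrete, here’s a minimal sketch, with made-up options, probabilities, and payoffs, of how solving a decision tree is just expected-value arithmetic once you’ve decided what goes in it. The hard part is everything before that step:

```python
# Toy decision tree: the options, probabilities, and payoffs are invented
# for illustration. Solving the tree is trivial arithmetic; deciding what
# branches, probabilities, and payoffs belong in it is the real work.

options = {
    "run_experiment": {
        # (probability, payoff) pairs for the uncertain outcome
        "outcomes": [(0.3, 100), (0.7, -20)],
        "cost": 10,
    },
    "do_nothing": {
        "outcomes": [(1.0, 0)],
        "cost": 0,
    },
}

def expected_value(option):
    ev = sum(p * payoff for p, payoff in option["outcomes"])
    return ev - option["cost"]

best = max(options, key=lambda name: expected_value(options[name]))
for name, option in options.items():
    print(f"{name}: EV = {expected_value(option):.1f}")
print("Best option by EV:", best)
```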

That said, I have one concern, which is that “the threat is more powerful than the execution” only works if the threat is plausible. If you rule out the possibility of the execution, then the threat is empty. Similarly, while I understand the appeal of “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science,” I think this might be good static advice, applicable right now, but not good dynamic advice: if we do away with statistical inference entirely (except in the very rare cases when no external assumptions are required to perform statistical modeling), then there may be less of a sense of the need for statistical thinking.

Overall, though, I agree with Tong’s message, and I think everybody should read his article.

Now let me go through some points where I disagree, or where I feel I can add something.

– Tong discusses “exploratory versus confirmatory analysis.” I prefer to think of exploratory and confirmatory analysis as two aspects of the same thing. (See also here.)

In short: exploratory data analysis is all about learning the unexpected. This is relative to “the expected,” that is, some existing model. So, exploratory data analysis is most effective when done in the context of sophisticated models. Conversely, exploratory data analysis is a sort of safety valve that can catch problems with your model, thus making confirmatory data analysis more effective.

Here, I think of “confirmatory data analysis” not as significance testing and the rejection of straw-man null hypotheses, but rather as inference conditional on models of substantive interest.
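Here’s a minimal sketch of what I mean, using simulated data and a deliberately too-simple linear model: the “unexpected” shows up as structure in the residuals, that is, in the data relative to what the fitted model expects.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: the true relation is curved, but our working model is linear.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2 + 0.5 * x + 0.08 * x**2 + rng.normal(0, 0.5, size=x.size)

# "Confirmatory" step: fit the model of substantive interest (here, a line).
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

# "Exploratory" step: look at the data relative to the model's expectations.
# Systematic structure in the residuals is the "unexpected," which a raw
# scatterplot alone would not frame as sharply.
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=10)
ax1.plot(x, fitted, color="red")
ax1.set_title("Data with fitted linear model")
ax2.scatter(x, residuals, s=10)
ax2.axhline(0, color="red")
ax2.set_title("Residuals: the unexpected, relative to the model")
plt.tight_layout()
plt.show()
```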

– Tong:

There is, of course, one arena of science where the exploratory/confirmatory distinction is clearly made, and attitudes toward statistical inferences are sound: the phased experimentation of medical clinical trials.

I think this is a bit optimistic, for two reasons. First, I doubt the uncertainty in exploratory, pre-clinical analyses is correctly handled when it comes time to make decisions in designing clinical trials. Second, I don’t see statistical significance thresholds in clinical trials as being appropriate for deciding drug approval.

– Tong:

Medicine is a conservative science and behavior usually does not change on the basis of one study.

Sure, but the flip side of formal conservatism is that lots of informal decisions will be made based on noisy data. Waiting for conclusive results from a series of studies . . . that’s fine, but in the meantime, decisions need to be made, and are being made, every day. This is related to the Chestertonian principle that extreme skepticism is a form of credulity.

– Tong quotes Freedman (1995):

I wish we could learn to look at the data more directly, without the fictional models and priors. On the same wish list: We should stop pretending to fix bad designs and inadequate measurements by modeling.

I have no problem with this statement as literally construed: it represents someone’s wish. But to the extent it is taken as a prescription or recommendation for action, I have problems with it. First, in many cases it’s essentially impossible to look at the data without “fictional models.” For example, suppose you are doing a psychiatric study of depression: “the data” will strongly depend on whatever “fictional models” are used to construct the depression instrument. Similarly for studies of economic statistics, climate reconstruction, etc. I strongly do believe that looking at the data is important—indeed, I’m on record as saying I don’t believe statistical claims when their connection to the data is unclear—but, rather than wishing we could look at the data without models (just about all of which are “fictional”), I’d prefer to look at the data alongside, and informed by, our models.

Regarding the second wish (“stop pretending to fix bad designs and inadequate measurements by modeling”), I guess I might agree with this sentiment, depending on what is meant by “pretend” and “fix”—but I do think it’s a good idea to adjust bad designs and inadequate measurements by modeling. Indeed, if you look carefully, all designs are bad and all measurements are inadequate, so we should adjust as well as we can.

To paraphrase Bill James, the alternative to “inference using adjustment” is not “no inference,” it’s “inference not using adjustment.” Or, to put it in specific terms, if people don’t use methods such as our survey adjustment here, they’ll just use something cruder. I wouldn’t want criticism of the real flaws of useful models to be taken as a motivation for using worse models.
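To give an oversimplified sense of what adjustment buys you (this toy example, with invented numbers, is nothing like the full survey adjustment linked above): if the sample over-represents one group, even a simple poststratification by known population shares does better than the raw mean.

```python
# Toy poststratification with invented numbers, not the actual survey
# adjustment linked above. The sample over-represents group_A, so the raw
# mean is biased; weighting group means by known population shares adjusts.

# Known population shares for two groups (e.g., from a census).
population_share = {"group_A": 0.30, "group_B": 0.70}

# A sample that happens to over-represent group_A; second entry is the outcome.
sample = [("group_A", 1), ("group_A", 1), ("group_A", 0), ("group_A", 1),
          ("group_B", 0), ("group_B", 1)]

# Raw (unadjusted) estimate: just the sample mean.
raw_mean = sum(y for _, y in sample) / len(sample)

# Poststratified estimate: group means, weighted by population shares.
group_means = {}
for g in population_share:
    ys = [y for grp, y in sample if grp == g]
    group_means[g] = sum(ys) / len(ys)
adjusted_mean = sum(population_share[g] * group_means[g] for g in population_share)

print(f"raw mean:      {raw_mean:.2f}")    # 0.67, pulled toward group_A
print(f"adjusted mean: {adjusted_mean:.2f}")  # 0.57, reweighted to the population
```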

– Tong quotes Feller (1969):

The purpose of statistics in laboratories should be to save labor, time, and expense by efficient experimental designs.

Design is one purpose of statistics in laboratories, but I wouldn’t say it’s the purpose of statistics in laboratories. In addition to design, there’s analysis. A good design can be made even more effective with a good analysis. And, conversely, the existence of a good analysis can motivate a more effective design. This is not a new point; it dates back at least to split-plot, fractional factorial, and other complex designs in classical statistics.

– Tong quotes Mallows (1983):

A good descriptive technique should be appropriate for its purpose; effective as a mode of communication, accurate, complete, and resistant.

I agree, except possibly for the word “complete.” In complex problems, it can be asking too much to expect any single technique to give the whole picture.

– Tong writes:

Formal statistical inference may only be used in a confirmatory setting where the study design and statistical analysis plan are specified prior to data collection, and adhered to during and after it.

I get what he’s saying, but this just pushes the problem back, no? Take a field such as survey sampling, where formal statistical inference is useful for obtaining standard errors (which underestimate total survey error, but an underestimate can still be useful as a starting point), for adjusting for nonresponse (a huge issue in any polling), and for small-area estimation (as here). It’s fair for Tong to say that all this is exploratory, not confirmatory. These formal tools are still useful, though. So I think it’s important to recognize that “exploratory statistics” is not just looking at raw data; it also can include all sorts of statistical analysis that is, in turn, relevant for real decision making.

– Tong writes:

A counterargument to our position is that inferential statistics (p-values, confidence intervals, Bayes factors, and so on) could still be used, but considered as just elaborate descriptive statistics, without inferential implications (e.g., Berry 2016, Lew 2016). We do not find this a compelling way to salvage the machinery of statistical inference. Divorced from the probability claims attached to such quantities (confidence levels, nominal Type I errors, and so on), there is no longer any reason to privilege such quantities over descriptive statistics that more directly characterize the data at hand.

I’ll just say, it depends on the context. Again, in survey research, there are good empirical and theoretical reasons for model-based adjustment as an alternative to just looking at the raw data. I do want to see the data, but if I want to learn about the population, I will do my best to adjust for known problems with the sample. I won’t just say that, because my models aren’t perfect, I shouldn’t use them at all.

To put it another way, I agree with Tong that there’s no reason to privilege such quantities as “p-values, confidence intervals, Bayes factors, . . . confidence levels, nominal Type I errors, and so on,” but I wouldn’t take this as a reason to throw away “the machinery of statistical inference.” Statistical inference gives us all sorts of useful estimates and data adjustments. Please don’t restrict “statistical inference” to the particular tools listed in the paragraph quoted above!

– Tong writes:

A second counterargument is that, as George Box (1999) reminded us, “All models are wrong, but some are useful.” Statistical inferences may be biased per the Optimism Principle, but they are reasonably approximate (it might be claimed), and paraphrasing John Tukey (1962), we are concerned with approximate answers to the right questions, not exact answers to the wrong ones. This line of thinking also fails to be compelling, because we cannot safely estimate how large such approximation errors can be.

I think the secret weapon is helpful here. You can use inferences as they come up, but it’s hard to interpret them one at a time. Much better to see a series of estimates as they vary over space or time, as that’s the right “denominator” (as we used to say in the context of classical Anova) for comparison.
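For what it’s worth, here’s a rough sketch of the sort of display I mean, using simulated data: fit the same simple model separately to each year and line up the estimates with their uncertainty intervals, rather than staring at any one of them in isolation.

```python
import numpy as np
import matplotlib.pyplot as plt

# The "secret weapon": fit the same model separately to each year's
# (simulated) data and display the estimates side by side, so each one
# is seen in the context of the whole series.
rng = np.random.default_rng(1)
years = np.arange(2000, 2011)
estimates, std_errors = [], []

for i, year in enumerate(years):
    n = 100
    x = rng.normal(size=n)
    # Simulated outcome whose true slope drifts slowly over time.
    true_slope = 0.5 + 0.05 * i
    y = true_slope * x + rng.normal(scale=1.0, size=n)

    # Least-squares slope (regression through the origin) and its standard error.
    slope = np.sum(x * y) / np.sum(x**2)
    resid = y - slope * x
    se = np.sqrt(np.sum(resid**2) / (n - 1) / np.sum(x**2))
    estimates.append(slope)
    std_errors.append(se)

plt.errorbar(years, estimates, yerr=2 * np.array(std_errors), fmt="o")
plt.xlabel("year")
plt.ylabel("estimated slope (+/- 2 s.e.)")
plt.title("Same model fit to each year: estimates in context")
plt.show()
```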

Summary

I like Tong’s article. The above discussion is intended to offer some modifications or clarifications of his good ideas.

Tomorrow’s post: “Superior: The Return of Race Science,” by Angela Saini