This post is not by Andrew.
It was spurred by Andrew’s recent post on Statistical Thinking enabling good science.
The day of that post, I happened to look in my email’s trash and noticed that it went back to 2011. One email way back then had an attachment entitled Learning Priorities of RCT versus Non-RCTs. I had forgotten about it. It was one of the last things I had worked on when I last worked in drug regulation.
It was a draft of summary points I was putting together for clinical reviewers (clinicians and biologists working in a regulatory agency) to give them a sense of (hopefully good) statistical thinking in reviewing clinical trials for drug approval. I thought it brought out many of the key points that were in Andrew’s post and in the paper by Tong that Andrew was discussing.
Now my summary points are in terms of statistical significance, type one error and power, but that was 2011. Additionally, I do believe (along with David Spiegelhalter) that regulatory agencies do need to have lines drawn in the sand or set cut points. They have to approve or not approve. As the seriousness of the approval increases, arguably these set cut points should move from being almost automatic defaults to inputs into a weight-of-evidence evaluation that may overturn them. Now I am working on a post to give an outline of what usually happens in drug regulation. I have received some links to material from a former colleague to help update my 2011 experience base.
In this post I have made some minor edits; it is not meant to be polished prose but simply summary notes. I thought it might be of interest to some, and hey, I have not posted in over a year and this one was quick and easy.
What can you learn from randomized versus non-randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)
What can you learn from randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)
The crucial uncertainties with randomized comparisons are:
1. With perfect execution, there are just two: the variation of covariate distribution imbalance, and the size of the signal of interest relative to it. The first, covariate distribution imbalance, is extra-sample or counterfactual in that with randomization you are assured balance in distribution, but any given randomization favours treatment or control by some amount; it is just not recognizable, given there are unobserved covariates. However, it does not systematically favour either treatment or control, and leads to statistical significance only 5% of the time (i.e. the type one error rate). As for the size of the signal of interest, which determines power (with bigger signals having higher power), it is never known but only conjectured, often from limited and faulty historical data.
This is unfortunate, as it is critical to get the ratio of power to type one error high (e.g. 80% to 5%), as this better separates null signals from real signals (when it is unknown which are which). One way to see the problem of low power, say 20% power, is that when there is a real signal (of just the right size to give 20% power), only 20% of the signals investigated will have much attention paid to them. This subset will be highly biased upwards, being a small, non-random subset of all the studies that could have been done: the ones where the observed treatment comparison was very large (most likely because the covariate imbalance already by chance favoured treatment over control by more than a trivial amount). This results in exaggerated treatment effect estimates (the simulation sketch after this list gives a sense of the size of the exaggeration). The amount of exaggeration goes away quickly as power increases, and even 50% power is often enough to make the exaggeration unimportant.
2. The execution of the randomized comparison was not perfect, but actually flawed to a degree that invalidates the above reasoning. There will always be uncertainty about how likely it is that these flaws would be noticed and exactly how much they impaired the comparisons, i.e. increased the type one error above 5% or decreased power.
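To make the exaggeration point concrete, here is a minimal simulation sketch in Python; the sample size and effect sizes are hypothetical, chosen only so that a simple two-arm comparison has roughly 20–25%, 50% and 80% power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2011)
n_per_arm, n_sims = 50, 10_000

# Hypothetical effect sizes (in SD units) giving roughly 20-25%, 50% and 80%
# power for a two-sample t-test at alpha = 0.05 with 50 patients per arm.
for true_effect in (0.25, 0.40, 0.57):
    estimates, significant = [], []
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(true_effect, 1.0, n_per_arm)
        t, p = stats.ttest_ind(treated, control)
        estimates.append(treated.mean() - control.mean())
        significant.append(p < 0.05 and t > 0)   # "attention paid" subset
    estimates = np.array(estimates)
    significant = np.array(significant)
    power = significant.mean()
    exaggeration = estimates[significant].mean() / true_effect
    print(f"true effect {true_effect:.2f}: power {power:.2f}, "
          f"significant estimates exaggerate by a factor of {exaggeration:.2f}")
```

In runs like this, the estimates that reach significance at the lowest power are roughly double the true effect, while at around 80% power the exaggeration is modest.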
What You Can’t Learn (WYCL):
1. How to further decrease type one error or increase power.
2. Almost anything about the treatment mechanism for the signal detected.
3. How to credibly generalize the signal much beyond the randomized study itself.
4. How to get a good type one error rate and a good power/type one error balance for all, most or even a few sub-groups.
How/Why That’s Critical (HWTC):
1. Even with perfection, you are still sometimes wrong (at least 5% of the time and often more) about benefits exceeding risks, and always uncertain about the precise Benefit/Harm ratio (there just are no good confidence interval methods that give error rates even close to 5% in ethical randomized comparisons in humans).
2. Wrong much more than 5% of the time about benefits exceeding risks for subgroups, and almost always wrong about the Benefit/Harm ratio (the power sketch after this list illustrates why subgroups fare so much worse).
3. There is almost always low power for detecting operational bias, testing assumptions, dealing with non-compliance or, more generally, finding out how things were not perfect and made the 5% actually 5%+++.
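A back-of-the-envelope power calculation shows why subgroups fare so much worse; the trial size and effect size here are hypothetical, just to make the drop-off visible.

```python
from scipy.stats import norm

def approx_power(effect_sd_units, n_per_arm, alpha=0.05):
    """Normal-approximation power for a two-arm comparison of means
    (ignoring the tiny chance of significance in the wrong direction)."""
    se = (2.0 / n_per_arm) ** 0.5            # SE of the difference, in SD units
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - effect_sd_units / se)

# A hypothetical trial sized for ~80% power overall.
effect, n_total = 0.4, 200                    # 100 patients per arm
print(f"overall power:           {approx_power(effect, n_total // 2):.2f}")

# The same effect in a subgroup that is half the trial.
print(f"power in a 50% subgroup: {approx_power(effect, n_total // 4):.2f}")

# And in a quarter-size subgroup.
print(f"power in a 25% subgroup: {approx_power(effect, n_total // 8):.2f}")
```

A trial sized for about 80% power overall has only about 50% power in a half-size subgroup and under 30% in a quarter-size subgroup, putting those comparisons squarely in the exaggeration zone discussed above.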
Anticipate How To Lessen these limitations (AHTL)
1. Nothing cures error better than replication and the never ending hope of getting close to perfection next time (that “guarantees” the just 5% error rate).
2. Transparent displays of all the pieces and processes that go into learning (and can go wrong).
3. More focus on errors in estimating the Benefit/Harm ratio, rather than just whether Benefit > Harm, and on sub-groups and generalization.
4. Lots of sensitivity analyses that start an endless loop of WYCL; HWTC; AHTL;…
What can you learn from non-randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)
The crucial uncertainties with non-randomized comparisons are:
1. The compared groups will be initially non-comparable in ways that cannot be fully appreciated or noticed, and how well the steps to lessen this (restriction, matching, stratification, adjustment, weighting and various combinations of these and others) will work is not known with any assurance (say, working even 50% of the time). Anticipating further steps will be difficult and tenuous, and though the resulting non-comparability is sometimes known to be less than it was initially, the remaining degree of non-comparability will be largely unknown (the toy sketch after this list illustrates this).
2. In the almost-never-occurring case of actually getting comparable groups, all the uncertainties of randomized comparisons remain. Although these uncertainties are likely smaller than in randomized comparisons (especially if large groups were used to try to make the groups less non-comparable), they can still be important.
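As a toy sketch of that residual non-comparability (everything here is hypothetical and just for illustration): treatment assignment depends on an observed covariate x and an unobserved one u, the treatment truly does nothing, and adjusting for x removes only part of the bias while nothing in the data reveals how much remains.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_effect = 0.0                       # the treatment truly does nothing

x = rng.normal(size=n)                  # observed covariate
u = rng.normal(size=n)                  # unobserved covariate (never recorded)

# Sicker patients (higher x and u) are more likely to be treated.
p_treat = 1 / (1 + np.exp(-(x + u)))
treat = rng.binomial(1, p_treat)

# Outcome depends on both covariates but not on treatment.
y = true_effect * treat + 0.5 * x + 0.5 * u + rng.normal(size=n)

# Naive comparison of treated vs untreated means.
naive = y[treat == 1].mean() - y[treat == 0].mean()

# "Adjusted" estimate: regress y on treatment and the observed covariate only.
X = np.column_stack([np.ones(n), treat, x])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"true effect:          {true_effect:.2f}")
print(f"naive estimate:       {naive:.2f}")     # biased by both x and u
print(f"adjusted for x only:  {adjusted:.2f}")  # smaller bias, but u still lurks
```

The adjusted estimate is closer to the truth than the naive one, but with only the data in hand there is no way to tell how far from zero it still sits, which is the sense in which the remaining non-comparability is largely unknown.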
What You Can’t Learn (WYCL):
1. How to get a good sense of how unequal the compared groups were made, or could be made.
2. Very general or generic methods for noticing non-comparability, recognizing how to make groups less non-comparable and doing it – at least with data in hand. It is always very situation specific.
How/Why That’s Critical (HWTC):
1. Never know if you would be better off ignoring the data completely. Never!
2. Unlike randomized comparisons, it is very possible that more, even a lot more, of the same type of data collection will help very little, if at all.
3. Anticipating what kind of different data would help and how to obtain it is very important but difficult.
4. Carefully and fully evaluating the current data for clues as to what this may be is absolutely necessary, though not very rewarding in the short run. Almost never are there any quick, visible successes, but rather just clearer understandings of how unlikely success ever is and how much work and uncertainty remain.
Anticipate How To Lessen these limitations (AHTL)
1. Identify key barriers to getting less unequal groups and ways to lessen these.
2. Communicate those clearly and widely.
3. Get the academic, pharmaceutical and regulatory communities to repeatedly do this, realizing there are few rewards for academics and even fewer for pharmaceutical firms (unless their products are currently being threatened).