Daniel Kapitan writes:
We are in the process of writing a paper on the outcome of cataract surgery. A (very rough!) draft can be found here, to provide you with some context: https://www.overleaf.com/read/wvnwzjmrffmw.
Using standard classification methods (Python sklearn, with synthetic oversampling to address the class imbalance), we are able to predict a poor outcome with sufficient sensitivity (>60%) and specificity (>95%) to be of practical use at our clinics as a clinical decision support tool. As we write up our findings and methodology, we are having an interesting debate about how to identify the most relevant features (i.e., patient characteristics).
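To make the setup concrete, here is a minimal sketch of that kind of pipeline; SMOTE as the oversampler and the simulated data below are stand-ins, not the actual code:

```python
# Minimal sketch (not the actual pipeline): a classifier with synthetic
# oversampling applied inside cross-validation, so the oversampler only
# ever sees the training folds. SMOTE is an assumed choice of oversampler;
# the data are simulated.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Stand-in for the cataract data: imbalanced binary outcome (1 = poor outcome)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=0)

pipe = Pipeline([
    ("oversample", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_validate(pipe, X, y, cv=5, scoring=["recall", "roc_auc"])
print("sensitivity (recall):", scores["test_recall"].mean())
print("AUC:", scores["test_roc_auc"].mean())
```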
My colleagues, who are trained as epidemiologists/doctors, have been taught to do standard univariate testing, using a p-value threshold to identify statistically significant features.
Those of us who come from machine learning (including myself) are more inclined to just feed all the data into an algorithm (we’re comparing logistic regression and random forest), and then evaluate feature importance a posteriori.
The results from the two approaches are substantially different. Comparing the first approach (using sklearn SelectKBest) with the second (using sklearn's random forest), the variable 'age', for example, ends up roughly halfway down the ranking in the first (p-value 0.005 with f_classif) but in the top six in the second (feature importance from the random forest).
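For reference, here is a rough sketch of how the two rankings are produced, on simulated data rather than ours: f_classif supplies the univariate F-test p-values that SelectKBest ranks by, and the forest supplies impurity-based importances.

```python
# Sketch of the two rankings being compared (simulated data, not the
# clinic's data).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
cols = [f"x{i}" for i in range(X.shape[1])]

# Approach 1: univariate F-test (what SelectKBest ranks by)
f_scores, p_values = f_classif(X, y)

# Approach 2: impurity-based importances from a random forest
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

ranking = pd.DataFrame({"feature": cols,
                        "p_value": p_values,
                        "rf_importance": rf.feature_importances_})
print(ranking.sort_values("rf_importance", ascending=False))
```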
As a regular reader of your blog, I am aware of the ongoing debate regarding p-values, reproducible science, etc. Although I get the gist of it, my understanding of statistics is too limited to convincingly argue for or against either approach. Googling the subject, I have come across only partial answers.
I would appreciate it if you could provide some feedback and/or suggestions on how to address this question. It will help us gain confidence in applying machine learning in day-to-day clinical practice.
First, I think it would help to define what you mean by "most relevant features" in a predictive model. That is, before deciding on a procedure for estimating relevance and declaring, based on the data, which features are most relevant, first figure out how you would define relevance. As Rubin puts it: What would you do if you had all the data?
I don’t mind looking at classification error etc., but I think it’s hard to make any progress at all here without some idea of your goals.
Why do you want to evaluate the importance of predictors in your model?
You might have a ready answer to this question, and that’s fine—it’s not supposed to be a trick. Once we better understand the goals, it might be easier to move to questions of estimation and inference.
My aim in understanding the importance of predictors is to support clinical reasoning. Ideally, the results of the predictor should be 'understandable', so that the surgeon can explain why a patient is classified as high risk. That is, I would like to combine clinical reasoning (inference, as evidenced in 'classical' clinical studies) with the observed patterns (correlation). Perhaps this is a tall order, but I think it is worth trying. This is one of the reasons I prefer using tree-based algorithms (rather than neural networks): they are less of a black box.
To give a specific example: patients with multiple ocular co-morbidities are expected to have a high risk of poor outcome. Various clinical studies have tried to 'prove' this, but never in relation to the patterns (i.e., feature importances) obtained from machine learning. Now, the current model tells us that co-morbidities are not that important (relative to the other features).
Another example: laterality ends up as the second most important feature in the random forest model. Looking at the data, it may be the case that left eyes have a higher risk of poor outcome. Talking to doctors, this could be explained by the fact that, since most surgeons are right-handed, operating on a left eye is slightly more complex. But looking at the data naively (histograms on subpopulations), the difference does not seem significant. Laterality ends up in the bottom range with univariate testing.
I understand that the underlying statistics are different (linear vs. nonlinear), and intuitively I tend to 'believe' the results from the random forest more. What I'm looking for are sound arguments and reasoning as to whether and why this is indeed the case.
To start with, you should forget about statistical significance and start thinking about uncertainty. For example, if your estimated coefficient is 200 with a standard error of 300, on a scale where 200 is a big effect, then all you can say is that you're uncertain: maybe it's a good predictor in the population, maybe not.
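In code, the shift from significance to uncertainty just means looking at each estimate next to its standard error. Here's a minimal sketch with simulated data; statsmodels is used here simply because it reports standard errors directly, which is my choice and not anything from the analysis above.

```python
# Sketch with simulated data: report each logistic-regression coefficient
# with its standard error, rather than thresholding on p < 0.05.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.1 * X[:, 1]))))

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
for name, est, se in zip(["const", "x1", "x2", "x3"], fit.params, fit.bse):
    # e.g., an estimate of 0.4 with a standard error of 0.4 mostly says "uncertain"
    print(f"{name}: {est:.2f} (se {se:.2f})")
```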
Next, try to answer questions as directly as possible. For example, "patients with multiple ocular co-morbidities are expected to have high risk of poor outcome." To start with, look at the data. Look at the average outcome as a function of the number of ocular co-morbidities. It should be possible to look at this directly. Here's another example: "it may be the case that left eyes have a higher risk of poor outcome." Can you look at this directly? A statement such as "Laterality ends up in the bottom range with univariate testing" does not seem interesting to me; it's an indirect question framed in statistical terms ("the bottom range," "univariate testing"), and I think it's better to try to ask the question more directly.
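That direct look amounts to a couple of group-by tables. In this sketch the column names and data are made up:

```python
# The "direct look" in code (hypothetical column names, simulated data):
# rate of poor outcome by number of co-morbidities and by eye.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "n_comorbidities": rng.integers(0, 4, size=1000),
    "eye": rng.choice(["left", "right"], size=1000),
    "poor_outcome": rng.binomial(1, 0.05, size=1000),
})

print(df.groupby("n_comorbidities")["poor_outcome"].agg(["mean", "count"]))
print(df.groupby("eye")["poor_outcome"].agg(["mean", "count"]))
```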
Another tip is that different questions can require different analyses. Instead of fitting one model and trying to tell a story with each coefficient, list your questions one at a time and try to answer each one using the data. Kinda like Bill James: he didn’t throw all his baseball data into a single analysis and then sit there reading off conclusions; no, he looked at his questions one at a time.