Jonathan Hughes writes:
I am an engineering doctoral student. As part of my dissertation I’m proposing a way to adapt a predictive system to subgroup-specific streams of data, each of which comes from one subgroup of a mixture population distribution. During my proposal presentation someone referenced your work and believed that you might have addressed the problem described below. I have read many of your academic writings, but I don’t know whether that is the case, and I haven’t been able to find it.
I will explain the problem briefly:
Let M_p be a logistic regression model that assumes a single homogeneous population, logit(pi) = beta + beta_1*x + noise, when in fact there are latent subgroups in the population whose distributions vary (but have the same form), i.e. the true case is modeled by
M_s := logit(pi) = beta_pop + beta_subgroup*indicator + beta_1*x + beta_1_subgroup*indicator*x + noise.
What is the expected gain in the area under the ROC curve (AUC) from including the subgroup information? That is, what is E[AUC(M_s) - AUC(M_p)] under some reasonable assumptions? I would like to incorporate some theoretical results about this, either those I have been deriving myself or others’ that have priority.
My reply: This looks like a varying-intercept, varying-slope logistic regression of the sort that is described in various places including my book with Jennifer Hill, with the twist that the groups are unknown. I have no results on area under the curve or the ROC curve more generally, so I suggest you explore this using fake-data simulation. For your data at hand, you can evaluate how much gain you’re getting by using leave-one-out cross-validation.
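Here is a minimal sketch of what such a fake-data simulation could look like, in Python with only NumPy. Everything specific in it is an assumption made for illustration: the coefficient values, the standard normal x, the 50/50 subgroup split, the sample sizes, and the helper names (simulate, fit_logistic, auc, auc_gain). It also assumes that the subgroup indicator is observed when fitting M_s, which sidesteps the harder problem of unknown groups, and it treats the "noise" as nothing more than Bernoulli sampling variation. AUC is computed on a fresh simulated test set rather than by cross-validation.

```python
import numpy as np

def simulate(n, rng, beta_pop=-0.5, beta_sub=1.0, beta_1=1.0,
             beta_1_sub=-1.5, p_sub=0.5):
    """Draw fake data from M_s; all parameter values are illustrative assumptions."""
    g = rng.binomial(1, p_sub, size=n)   # subgroup indicator (assumed observed)
    x = rng.normal(size=n)               # single continuous predictor
    eta = beta_pop + beta_sub * g + beta_1 * x + beta_1_sub * g * x
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return x, g, y

def design_pooled(x, g):
    # M_p: intercept and slope only, ignoring the subgroup
    return np.column_stack([np.ones_like(x), x])

def design_subgroup(x, g):
    # M_s: intercept, subgroup shift, slope, subgroup-by-slope interaction
    return np.column_stack([np.ones_like(x), g, x, g * x])

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Maximum-likelihood logistic regression via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    return beta

def auc(scores, y):
    """AUC from the Mann-Whitney rank-sum statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_gain(n_train=2000, n_test=2000, n_sims=200, seed=0):
    """Monte Carlo estimate of E[AUC(M_s) - AUC(M_p)] and its standard error."""
    rng = np.random.default_rng(seed)
    gains = np.empty(n_sims)
    for s in range(n_sims):
        x_tr, g_tr, y_tr = simulate(n_train, rng)
        x_te, g_te, y_te = simulate(n_test, rng)
        aucs = []
        for design in (design_pooled, design_subgroup):
            beta = fit_logistic(design(x_tr, g_tr), y_tr)
            aucs.append(auc(design(x_te, g_te) @ beta, y_te))
        gains[s] = aucs[1] - aucs[0]
    return gains.mean(), gains.std(ddof=1) / np.sqrt(n_sims)

if __name__ == "__main__":
    mean_gain, se = auc_gain()
    print(f"estimated AUC gain: {mean_gain:.3f} (se {se:.3f})")
```

The size of the gain will depend heavily on how different the subgroups are, so the useful exercise is to rerun this over a grid of plausible values for beta_subgroup and beta_1_subgroup and see how the gain changes.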