Here’s question 4 of our exam:
4. A researcher is imputing missing responses for income in a social survey of American households, using for the imputation a regression model given demographic variables. Which of the following two statements is basically true?
(a) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as rich or poor: A deterministic procedure overstates your certainty, making you more likely to impute extreme values.
(b) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as middle class: By not using the error term, you’ll impute too many values in the middle of the distribution.
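If you'd like to check your reasoning before looking at the answer, here is a minimal simulation sketch. Everything in it is hypothetical: the data are simulated, there is a single made-up predictor x standing in for the demographic variables, the outcome is on a log-income-like scale, and missingness is completely at random. It fits a least-squares line to the observed cases, imputes the missing ones both ways, and prints the spread of each set of imputations next to the spread of the true values.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical data: one predictor x and a log-income-like outcome y.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [10 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# Assumption: about 30% of responses are missing completely at random.
missing = [random.random() < 0.3 for _ in range(n)]
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Least-squares fit of y on x, using the observed cases only.
mx, my = statistics.mean(x_obs), statistics.mean(y_obs)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y_obs)) / sum(
    (xi - mx) ** 2 for xi in x_obs
)
intercept = my - slope * mx
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x_obs, y_obs)]
sigma = math.sqrt(sum(r ** 2 for r in resid) / (len(resid) - 2))

# Impute the missing cases two ways: X*beta, and X*beta + epsilon.
x_mis = [xi for xi, m in zip(x, missing) if m]
y_true = [yi for yi, m in zip(y, missing) if m]
deterministic = [intercept + slope * xi for xi in x_mis]
with_error = [d + random.gauss(0, sigma) for d in deterministic]

print("sd of true values:            ", round(statistics.stdev(y_true), 2))
print("sd of deterministic imputes:  ", round(statistics.stdev(deterministic), 2))
print("sd of imputes with error term:", round(statistics.stdev(with_error), 2))
```

Compare the three printed spreads and ask which imputation method reproduces the tails of the distribution.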
And the solution to question 3:
Here is a fitted model from the Bangladesh analysis predicting whether a person with high-arsenic drinking water will switch wells, given the arsenic level in their existing well and the distance to the nearest safe well.
glm(formula = switch ~ dist100 + arsenic, family=binomial(link="logit"))
            coef.est coef.se
(Intercept)   0.00     0.08
dist100      -0.90     0.10
arsenic       0.46     0.04
n = 3020, k = 3
Compare two people who live the same distance from the nearest well but whose arsenic levels differ, with one person having an arsenic level of 0.5 and the other person having a level of 1.0. Approximately how much more likely is this second person to switch wells? Give an approximate estimate, standard error, and 95% interval.
Using the divide-by-4 rule, the expected difference in Pr(switch), per unit change in arsenic level, is approximately 0.46/4 = 0.11 (recall that with the divide-by-4 rule we round down) with standard error 0.04/4 = 0.01. But we’re looking at a difference of 0.5 in arsenic level, so we need to multiply both the estimate and its standard error by 0.5, thus 0.055 with standard error 0.005, and a 95% interval of [0.055 +/- 2*0.005] = [0.045, 0.065].
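The arithmetic can be sketched in a few lines. This is only an illustration, not part of the exam solution: the variable names are mine, and the comparison at the end uses a hypothetical person with dist100 = 0, a value chosen purely for convenience.

```python
import math

def invlogit(x):
    """Inverse logit, treated as a function in its own right."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficient and standard error for arsenic, from the fitted model above.
b_arsenic, se_arsenic = 0.46, 0.04
delta = 1.0 - 0.5  # difference in arsenic level between the two people

# Divide-by-4 rule, rounding the slope down to two decimals as in the text.
slope = math.floor(b_arsenic / 4 * 100) / 100  # 0.11
slope_se = se_arsenic / 4                      # 0.01

estimate = slope * delta
se = slope_se * delta
interval = (estimate - 2 * se, estimate + 2 * se)

# For comparison: the difference in fitted probabilities for a
# hypothetical person at dist100 = 0 (an assumption for illustration).
b0, b_dist, dist100 = 0.00, -0.90, 0.0
exact = (invlogit(b0 + b_dist * dist100 + b_arsenic * 1.0)
         - invlogit(b0 + b_dist * dist100 + b_arsenic * 0.5))

print(f"estimate {estimate:.3f}, se {se:.3f}, "
      f"95% interval [{interval[0]:.3f}, {interval[1]:.3f}]")
print(f"difference in fitted probabilities at dist100 = 0: {exact:.3f}")
```

The second printed line is just a sanity check that the shortcut lands close to what the fitted curve gives for one particular person.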
The divide-by-4 rule works when the predicted probabilities are near the middle of the range, that is, near 50/50. The arsenic example was in the textbook and students should be able to recall that the probabilities of switching are indeed not far from 50%.
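To see why the rule needs probabilities near 50/50: the slope of the logistic curve at a point where the predicted probability is p equals β·p·(1 − p), which peaks at β/4 when p = 0.5. A small sketch, using the arsenic coefficient from above (the function name logistic_slope is my own):

```python
def logistic_slope(p, beta):
    """Slope of invlogit(a + beta*x) in x, where the predicted probability is p."""
    return beta * p * (1 - p)

beta = 0.46  # arsenic coefficient from the fitted model
for p in (0.5, 0.6, 0.7, 0.9):
    print(f"Pr(switch) = {p:.1f}: slope = {logistic_slope(p, beta):.3f}"
          f"  (divide-by-4 gives {beta / 4:.3f})")
```

The slope stays close to β/4 for probabilities around 0.5 and falls off as the probability moves toward 0 or 1, so the rule is best read as an upper bound.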
Common mistakes
Most of the students had no problem with this one. The ones who made mistakes did so by trying to apply the logistic formula directly. Please please please please please: Invlogit is invlogit. Do not write it as exp(x)/(1 + exp(x)) or as 1/(1 + exp(-x)). Logit is its own function, with as much integrity as log or exp. Understand what logit looks like and you’ll be fine.