We’re done with our exam.
And the solution to question 15:
15. Consider the following procedure.
• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.
• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.
• Repeat the above two steps 1000 times.
(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).
(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).
Both (a) and (b) are true.
(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)
And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is an average from 100 data points.
Common mistakes
Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).