Question 2 of our Applied Regression final exam (and solution to question 1)

Here’s question 2 of our exam:

2. A multiple-choice test item has four options. Assume that a student taking this question either knows the answer or does a pure guess. A random sample of 100 students take the item. 60% get it correct. Give an estimate and 95% confidence interval for the percentage in the population who know the answer.

And the solution to question 1:

1. A randomized experiment is performed within a survey. 1000 people are contacted. Half the people contacted are promised a \$5 incentive to participate, and half are not promised an incentive. The result is a 50% response rate among the treated group and 40% response rate among the control group.

(a) Give an estimate and standard error of the average treatment effect.

(b) Give code to fit a logistic regression of response on the treatment indicator. Give the complete code, including assigning the data, setting up the variables, etc. It is not enough to simply give the one line of code for running the logistic regression.

(a) The estimate is 0.5 – 0.4 = 0.1, and the standard error is sqrt(0.5^2/500 + 0.5^2/500) = 0.03.

(b) Here’s some code:

```n <- 1000
z <- rep(c(1,0), c(n/2, n/2))
y <- rep(c(1,0,1,0), c(0.5, 0.5, 0.4, 0.6)*n/2)
library("rstanarm")
fit <- stan_glm(y ~ z, family=binomial(link="logit"))
print(fit)
```

The estimated logistic regression coefficient is approximately 0.4 with a standard error of approximately 0.12, which is consistent with the answer from (a) after applying the divide-by-4 rule.

Common mistakes

Most of the students in the class got part (a) correct, but some were sloppy and did some sort of sqrt(p*(1-p)/n) calculation with n=1000, not recognizing that the estimate was a difference between proportions.

For (b), I was expecting some syntax errors---it was an in-class exam without computers, so I was just asking them to write some code as best they could---but a common mistake was that many, actually most, students did it by simulating fake data with probabilities 0.5 and 0.4, rather than by entering the actual data as in the code above. Perhaps I needed to write the question better to make it more clear what was required.