# Using Stan for week-by-week updating of estimated soccer team abilites

May 17, 2017
By

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Milad Kharratzadeh shares this analysis of the English Premier League during last year’s famous season. He fit a Bayesian model using Stan, and the R markdown file is here.

The analysis has three interesting features:

1. Team ability is allowed to continuously vary throughout the season; thus, once the season is over, you can see an estimate of which teams were improving or declining.

2. But that’s not what is shown in the plot above. Rather, the plot above shows estimated team abilities after the model was fit to prior information plus week 1 data alone; prior information plus data from weeks 1 and 2; prior information plus data from weeks 1, 2, and 3; etc. For example, look at the plot for surprise victor Leicester City: after a few games, the team is already estimated to be in the middle of the pack; then throughout the season the team’s estimated ability gradually increases. This does not necessarily mean that we think the team improved during the season; it’s a graph showing the accumulation of information.

3. The graphs showing parameter estimates and raw data on the same scale, which I find helpful for understanding the information that goes into the fit.

From a modeling standpoint, item 1 above is the most important: with Bayesian inference and Stan, we can estimate continuously varying parameters even though we only get one game per week from each team.

From the standpoint of interpreting the results, I like item 2 in that it addresses the question of when those notorious 5000-1 odds should’ve been lowered. Based on the above graph, it looks like the answer is: Right away, after the very first game. It’s not that the first game of the season necessarily offers any useful clue to the future, but it does inform a bit about the possibilities. The point is not that Leicester City’s early performance signaled that they were a top team; it’s that the data did not rule out that possibility at 5000:1 odds.

Also, Figure 2 is kinda counterintuitive, in that you might think that if you already have a time-varying parameter, you could just plot its estimate as a function of time and you’d be done. But, no, if you want to see how inferences develop during the season, you need to re-fit the model after each new week of data (or use some sort of particle-filtering, importance-sampling approach, but in this case it’s easier to use the hammer and just re-fit the entire model each time in Stan.

P.S. The R code is in the markdown file. Here’s the Stan program:

```data {
int nteams; // number of teams (20)
int ngames; // number of games
int nweeks; // number of weeks
int home_week[ngames]; // week number for the home team
int away_week[ngames]; // week number for the away team
int home_team[ngames]; // home team ID (1, ..., 20)
int away_team[ngames]; // away team ID (1, ..., 20)
vector[ngames] score_diff;    // home_goals - away_goals
row_vector[nteams] prev_perf; // a score between -1 and +1
}
parameters {
real b_home; // the effect of hosting the game in mean of score_diff dist.
real b_prev;                         // regression coefficient of prev_perf
real sigma_a0;              // teams ability variation
real tau_a;                 // hyper-param for game-to-game variation
real nu;                    // t-dist degree of freedom
real sigma_y;               // score_diff variation
row_vector[nteams] sigma_a_raw; // game-to-game variation
matrix[nweeks,nteams] eta_a;         // random component
}
transformed parameters {
matrix[nweeks, nteams] a;                        // team abilities
row_vector[nteams] sigma_a; // game-to-game variation
a[1] = b_prev * prev_perf + sigma_a0 * eta_a[1]; // initial abilities (at week 1)
sigma_a = tau_a * sigma_a_raw;
for (w in 2:nweeks) {
a[w] = a[w-1] + sigma_a .* eta_a[w];           // evolution of abilities
}
}
model {
vector[ngames] a_diff;
// Priors
nu ~ gamma(2,0.1);
b_prev ~ normal(0,1);
sigma_a0 ~ normal(0,1);
sigma_y ~ normal(0,5);
b_home ~ normal(0,1);
sigma_a_raw ~ normal(0,1);
tau_a ~ cauchy(0,1);
to_vector(eta_a) ~ normal(0,1);
// Likelihood
for (g in 1:ngames) {
a_diff[g] = a[home_week[g],home_team[g]] - a[away_week[g],away_team[g]];
}
score_diff ~ student_t(nu, a_diff + b_home, sigma_y);
}
generated quantities {
vector[ngames] score_diff_rep;
for (g in 1:ngames)
score_diff_rep[g] = student_t_rng(nu, a[home_week[g],home_team[g]] -
a[away_week[g],away_team[g]]+b_home, sigma_y);
}
```

As explained in the markdown file, the generated quantities are there for posterior predictive checks.

I really think Stan is a key part of reproducible science in that the models are so clear and portable.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

Tags: , , , ,

 Tweet

Email: