## 23.4 Example: Hierarchical Logistic Regression

Consider a hierarchical model of American presidential voting behavior
based on state of residence.^{44}

Each of the fifty states \(k \in \{1,\dotsc,50\}\) will have its own slope \(\beta_k\) and intercept \(\alpha_k\) to model the log odds of voting for the Republican candidate as a function of income. Suppose there are \(N\) voters and with voter \(n \in 1{:}N\) being in state \(s[n]\) with income \(x_n\). The likelihood for the vote \(y_n \in \{ 0, 1 \}\) is \[ y_n \sim \textsf{Bernoulli} \Big( \operatorname{logit}^{-1}\left( \alpha_{s[n]} + \beta_{s[n]} \, x_n \right) \Big). \]

The slopes and intercepts get hierarchical priors, \[\begin{align*} \alpha_k &\sim \textsf{normal}(\mu_{\alpha}, \sigma_{\alpha}) \\ \beta_k &\sim \textsf{normal}(\mu_{\beta}, \sigma_{\beta}) \end{align*}\]

### Unmapped Implementation

This model can be coded up in Stan directly as follows.

```
data {
int<lower = 0> K;
int<lower = 0> N;
int<lower = 1, upper = K> kk[N];
vector[N] x;
int<lower = 0, upper = 1> y[N];
}
parameters {
matrix[K,2] beta;
vector[2] mu;
vector<lower=0>[2] sigma;
}
model {
mu ~ normal(0, 2);
sigma ~ normal(0, 2);
for (i in 1:2)
beta[ , i] ~ normal(mu[i], sigma[i]);
y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}
```

For this model the vector of predictors `x`

is coded as a vector,
corresponding to how it is used in the likelihood.
The priors for `mu`

and `sigma`

are vectorized. The priors
on the two components of `beta`

(intercept and slope,
respectively) are stored in a \(K \times 2\) matrix.

The likelihood is also
vectorized using multi-indexing with index `kk`

for the states
and elementwise multiplication (`.*`

) for the income `x`

.
The vectorized likelihood works out to the same thing as the following
less efficient looped form.

```
for (n in 1:N)
y[n] ~ bernoulli_logit(beta[kk[n], 1] + beta[kk[n], 2] * x[n]);
```

### Mapped Implementation

The mapped version of the model will map over the states `K`

.
This means the group-level parameters, real data, and integer-data
must be arrays of the same size.

The mapped implementation requires a function to be mapped. The
following function evaluates both the likelihood for the data observed
for a group as well as the prior for the group-specific parameters
(the name `bl_glm`

derives from the fact that it’s a generalized
linear model with a Bernoulli likelihood and logistic link function).

```
functions {
vector bl_glm(vector mu_sigma, vector beta,
real[] x, int[] y) {
vector[2] mu = mu_sigma[1:2];
vector[2] sigma = mu_sigma[3:4];
real lp = normal_lpdf(beta | mu, sigma);
real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
return [lp + ll]';
}
}
```

The shared parameter `mu_sigma`

contains the locations
(`mu_sigma[1:2]`

) and scales (`mu_sigma[3:4]`

) of the
priors, which are extracted in the first two lines of the program.
The variable `lp`

is assigned the log density of the prior on
`beta`

. The vector `beta`

is of size two, as are the
vectors `mu`

and `sigma`

, so everything lines up for the
vectorization. Next, the variable `ll`

is assigned to the log
likelihood contribution for the group. Here `beta[1]`

is the
intercept of the regression and `beta[2]`

the slope. The
predictor array `x`

needs to be converted to a vector allow the
multiplication.

The data block is identical to that of the previous program, but repeated here for convenience. A transformed data block computes the data structures needed for the mapping by organizing the data into arrays indexed by group.

```
data {
int<lower = 0> K;
int<lower = 0> N;
int<lower = 1, upper = K> kk[N];
vector[N] x;
int<lower = 0, upper = 1> y[N];
}
transformed data {
int<lower = 0> J = N / K;
real x_r[K, J];
int<lower = 0, upper = 1> x_i[K, J];
{
int pos = 1;
for (k in 1:K) {
int end = pos + J - 1;
x_r[k] = to_array_1d(x[pos:end]);
x_i[k] = to_array_1d(y[pos:end]);
pos += J;
}
}
}
```

The integer `J`

is set to the number of observations per group.^{45}

The real data array `x_r`

holds the predictors and the integer
data array `x_i`

holds the outcomes. The grouped data arrays
are constructed by slicing the predictor vector `x`

(and
converting it to an array) and slicing the outcome array `y`

.

Given the transformed data with groupings, the parameters are the same
as the previous program. The model has the same priors for the
hyperparameters `mu`

and `sigma`

, but moves the prior for
`beta`

and the likelihood to the mapped function.

```
parameters {
vector[2] beta[K];
vector[2] mu;
vector<lower=0>[2] sigma;
}
model {
mu ~ normal(0, 2);
sigma ~ normal(0, 2);
target += sum(map_rect(bl_glm, append_row(mu, sigma), beta, x_r, x_i));
}
```

The model as written here computes the priors for each group’s parameters along with the likelihood contribution for the group. An alternative mapping would leave the prior in the model block and only map the likelihood. In a serial setting this shouldn’t make much of a difference, but with parallelization, there is reduced communication (the prior’s parameters need not be transmitted) and also reduced parallelization with the version that leaves the prior in the model block.

This makes the strong assumption that each group has the same number of observations!↩︎