Hierarchical Bayes
Sometimes, data contain multiple observations from each of several units. For example, researchers interested in the effectiveness of medical treatments may test several treatments in several different cities. It may then make sense to assume that the effectiveness is broadly consistent across cities but varies slightly between them, since population characteristics differ from city to city; one city may have a younger population than another, for instance. In these cases, hierarchical Bayes can help. In this blog post, I review hierarchical Bayes.
For illustration purposes, let’s say we want to fit a linear regression: [ y \sim \mathcal{N}(X \beta, \epsilon). ] In the medical-treatment example above, each row in $X$ corresponds to one observation: $X$ indexes the type of treatment received, and $y$ is a measure of its effectiveness.
For convenience, I leave the conditioning on $X$ implicit throughout this blog post.
Complete pooling
Before diving into hierarchical Bayes, it is useful to discuss simpler approaches. One approach is to ignore the possibility that the effectiveness of treatments may vary between the cities, and assume that the effectiveness is identical across the cities. This approach is called complete pooling.
First, we define the prior distribution for the parameter: [ \beta \sim \mathcal{N}(\mu, \sigma). ] Here, $\mu$ and $\sigma$ are hyperparameters: values you have to specify. How to choose hyperparameters is a topic for another blog post.
After specifying the prior distribution, we can optimise $\beta$ according to its posterior distribution: [ p(\beta \vert y) = \frac{p(\beta) \; p(y \vert \beta)}{p(y)}. ] The denominator, $p(y)$, is usually difficult to compute, but it does not depend on $\beta$, so often we just consider the unnormalised posterior: [ p(\beta \vert y) \propto p(\beta) \; p(y \vert \beta). ]
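As a concrete sketch, the unnormalised posterior can be evaluated directly on the log scale. The noise scale `eps` and the hyperparameter values below are illustrative assumptions, not part of the model above:

```python
import numpy as np
from scipy import stats

# Unnormalised log posterior for complete pooling: log prior + log likelihood.
# The noise scale `eps` and hyperparameters `mu`, `sigma` are assumed values.
def log_unnorm_posterior(beta, X, y, mu=0.0, sigma=10.0, eps=1.0):
    log_prior = stats.norm.logpdf(beta, loc=mu, scale=sigma).sum()
    log_lik = stats.norm.logpdf(y, loc=X @ beta, scale=eps).sum()
    return log_prior + log_lik

# Toy data pooled across all cities.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=20)
print(log_unnorm_posterior(beta_true, X, y))
```

Working on the log scale avoids numerical underflow, and the constant $\log p(y)$ simply drops out of any comparison between candidate values of $\beta$.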
No pooling
Another simpler approach is to acknowledge the possibility that the effectiveness of treatments varies between the cities and allow the effectiveness to vary as much as possible between the cities. Thus, this approach fits the model independently for each city. This approach is called no pooling.
Here the model for the $i$th city is [ y^{(i)} \sim \mathcal{N}(X^{(i)} \beta^{(i)}, \epsilon^{(i)}). ]
We specify the prior as before: [ \beta^{(i)} \sim \mathcal{N}(\mu, \sigma). ] And then, optimise $\beta^{(i)}$ according to the posterior $p(\beta^{(i)} \vert y^{(i)})$, using only that city’s data.
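To make the contrast with complete pooling concrete, here is a minimal sketch that fits each city independently. It uses a scalar $\beta^{(i)}$, an assumed noise scale, and hypothetical hyperparameter values, and finds each city’s MAP estimate on a grid:

```python
import numpy as np
from scipy import stats

# No pooling: each city gets its own unnormalised log posterior,
# maximised independently. `mu`, `sigma`, `eps` are assumed values.
def log_unnorm_posterior(beta, X, y, mu=0.0, sigma=10.0, eps=1.0):
    return (stats.norm.logpdf(beta, mu, sigma)
            + stats.norm.logpdf(y, X * beta, eps).sum())

# Toy data: three cities with slightly different treatment effects.
rng = np.random.default_rng(1)
cities = []
for true_beta in [0.8, 1.0, 1.2]:
    X = rng.normal(size=30)
    y = true_beta * X + 0.1 * rng.normal(size=30)
    cities.append((X, y))

# Grid-search the MAP estimate separately for each city.
grid = np.linspace(-2, 2, 2001)
maps = []
for X, y in cities:
    scores = [log_unnorm_posterior(b, X, y) for b in grid]
    maps.append(grid[int(np.argmax(scores))])
print(maps)  # one independent estimate per city
```

Because nothing ties the cities together, each estimate is driven entirely by that city’s data; with few observations per city, the estimates can be noisy.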
Partial pooling (hierarchical Bayes)
In some cases, it makes sense to allow the effectiveness to vary between the cities while also assuming that it is somewhat consistent across them. This is where the third approach, partial pooling, comes in. Partial pooling is often called hierarchical Bayes.
As before, let’s say the model for the $i$th city is [ y^{(i)} \sim \mathcal{N}(X^{(i)} \beta^{(i)}, \epsilon^{(i)}), ] and the prior is [ \beta^{(i)} \sim \mathcal{N}(\mu, \sigma). ]
The trick is to treat $\mu$ and $\sigma$ here as parameters, as opposed to hyperparameters. That is, we let the data inform us about $\mu$ and $\sigma$. For this, we need to specify priors for $\mu$ and $\sigma$. My usual choice is [ \mu \sim \text{Uniform}(-\infty, \infty) ] and [ \sigma \sim \text{Uniform}(0, \infty). ] These are non-informative distributions. They are also improper, as they do not integrate to one. While the priors are improper, the resulting posterior usually still integrates to one, so this does not cause much practical trouble.
Then, we can work out the joint posterior distribution.
[
\begin{align}
p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}, \mu, \sigma \vert y)
&= \frac{p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}, \mu, \sigma) \; p(y \vert \beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}, \mu, \sigma)}{p(y)} \\
&= \frac{p(\mu, \sigma) \; p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)} \vert \mu, \sigma) \; p(y \vert \beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)})}{p(y)}.
\end{align}
]
As before, we often only need to consider the unnormalised posterior:
[
p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}, \mu, \sigma \vert y) \propto
p(\mu, \sigma) \; p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)} \vert \mu, \sigma) \; p(y \vert \beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}).
]
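This factorisation translates directly into code. Below is a sketch of the unnormalised log joint posterior under the flat priors, assuming a known noise scale `eps` and a scalar $\beta^{(i)}$ per city; all names are illustrative:

```python
import numpy as np
from scipy import stats

# Unnormalised log joint posterior for partial pooling.
# `eps` (the noise scale) is an assumed value.
def log_joint(betas, mu, sigma, data, eps=1.0):
    if sigma <= 0:
        return -np.inf  # sigma must be positive under Uniform(0, inf)
    lp = 0.0            # log p(mu, sigma) is constant under the flat priors
    for beta, (X, y) in zip(betas, data):
        lp += stats.norm.logpdf(beta, mu, sigma)         # p(beta^(i) | mu, sigma)
        lp += stats.norm.logpdf(y, X * beta, eps).sum()  # p(y^(i) | beta^(i))
    return lp

# Toy data: two cities with similar treatment effects.
rng = np.random.default_rng(2)
data = []
for b in [0.9, 1.1]:
    X = rng.normal(size=10)
    data.append((X, b * X + 0.1 * rng.normal(size=10)))
print(log_joint([0.9, 1.1], 1.0, 0.5, data))
```

Note that the improper uniform priors contribute only a constant (zero on the log scale) whenever $\sigma > 0$, so they appear in the code only as the positivity check.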
The overall effectiveness of treatments can be inferred by integrating out $\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}$ from the above equation: [ p(\mu, \sigma \vert y) = \int p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}, \mu, \sigma \vert y) \, d\beta^{(1)} \, d\beta^{(2)} \cdots d\beta^{(n)}. ] Similarly, the effectiveness for each city can be inferred by integrating out $\mu$ and $\sigma$: [ p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)} \vert y) = \int \int p(\beta^{(1)}, \beta^{(2)}, \dots \beta^{(n)}, \mu, \sigma \vert y) \, d\mu \, d\sigma. ]