Module aliases imported by init_notebook.py:
--------------------------------------------
from cmdstanpy import CmdStanModel
import bridgestan as bs
import numpy as np
import pandas as pd
import arviz as az
import utils as utils
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
Watermark:
----------
Last updated: 2025-03-17T14:45:10.382615+04:00
Python implementation: CPython
Python version : 3.13.2
IPython version : 9.0.2
Compiler : Clang 16.0.0 (clang-1600.0.26.6)
OS : Darwin
Release : 24.4.0
Machine : arm64
Processor : arm
CPU cores : 8
Architecture: 64bit
scipy : 1.15.2
matplotlib: 3.10.1
arviz : 0.21.0
bridgestan: 2.6.1
pandas : 2.2.3
re : 2.2.1
numpy : 2.2.4
seaborn : 0.13.2
watermark : 2.5.0
cmdstanpy : 1.2.5
d = pd.read_csv("data/Howell1.csv", sep=';')
d2 = d[d['age'] >= 18]
1 Linear Prediction
What we’ve done above is a Gaussian model of height in a population of adults. But it doesn’t really have the usual feel of “regression” to it. Typically, we are interested in modeling how an outcome is related to some other variable, a predictor variable. If the predictor variable has any statistical association with the outcome variable, then we can use it to predict the outcome. When the predictor variable is built inside the model in a particular way, we’ll have linear regression.
So now let’s look at how height in these Kalahari foragers (the outcome variable) covaries with weight (the predictor variable). We’ll reconsider this example later from a more causal perspective. Right now, let’s focus on the mechanics of estimating an association between two variables.
Go ahead and plot adult height and weight against one another:
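A minimal plotting sketch (matplotlib, using the adult subset d2 defined above; axis labels are assumed):

plt.scatter(d2['weight'], d2['height'], facecolor="none", edgecolor="C0")
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()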
There’s obviously a relationship: Knowing a person’s weight helps you predict height.
To make this vague observation into a more precise quantitative model that relates values of weight to plausible values of height, we need some more technology. How do we take our Gaussian model from the previous section and incorporate predictor variables?
What is “regression”?
Many diverse types of models are called “regression.” The term has come to mean using one or more predictor variables to model the distribution of one or more outcome variables. The original use of the term, however, arose from anthropologist Francis Galton’s (1822–1911) observation that the sons of tall and short men tended to be more similar to the population mean, hence regression to the mean.
The causal reasons for regression to the mean are diverse. In the case of height, the causal explanation is a key piece of the foundation of population genetics. But this phenomenon arises statistically whenever individual measurements are assigned a common distribution, leading to shrinkage as each measurement informs the others. In the context of Galton’s height data, attempting to predict each son’s height on the basis of only his father’s height is folly. Better to use the population of fathers. This leads to a prediction for each son which is similar to each father but “shrunk” towards the overall mean. Such predictions are routinely better. This same regression/shrinkage phenomenon applies at higher levels of abstraction and forms one basis of multilevel modeling.
1.1 The Linear Model Strategy
The strategy is to make the parameter for the mean of a Gaussian distribution, \(μ\), into a linear function of the predictor variable and other, new parameters that we invent. This strategy is often simply called the linear model. The linear model strategy instructs the model to assume that the predictor variable has a constant and additive relationship to the mean of the outcome. The model then computes the posterior distribution of this constant relationship.
What this means, recall, is that the machine considers every possible combination of the parameter values. With a linear model, some of the parameters now stand for the strength of association between the mean of the outcome, \(μ\), and the value of some other variable. For each combination of values, the machine computes the posterior probability, which is a measure of relative plausibility, given the model and data. So the posterior distribution ranks the infinite possible combinations of parameter values by their logical plausibility. As a result, the posterior distribution provides relative plausibilities of the different possible strengths of association, given the assumptions you programmed into the model. We ask the model: “Consider all the lines that relate one variable to the other. Rank all of these lines by plausibility, given these data.” The model answers with a posterior distribution.
Here’s how it works, in the simplest case of only one predictor variable. We’ll wait until the next chapter to confront more than one predictor. Recall the basic Gaussian model:
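As a reminder, the earlier Gaussian model (a sketch using the priors seen elsewhere in this notebook, where the mean was still the parameter \(\mu\)) was:

\[
\begin{aligned}
h_i &\sim \text{Normal}(\mu, \sigma) \\
\mu &\sim \text{Normal}(178, 20) \\
\sigma &\sim \text{Uniform}(0, 50)
\end{aligned}
\]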
Now how do we get weight into a Gaussian model of height? Let \(x\) be the name for the column of weight measurements, d2['weight']. Let the average of the \(x\) values be \(\bar{x}\), “ex bar”. Now we have a predictor variable \(x\), which is a list of measures of the same length as \(h\). To get weight into the model, we define the mean \(μ\) as a function of the values in \(x\). This is what it looks like, with explanation to follow:
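Written out (a sketch; the wide \(\text{Normal}(0, 10)\) prior on \(\beta\) matches the prior simulated just below and is revised later):

\[
\begin{aligned}
h_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta(x_i - \bar{x}) \\
\alpha &\sim \text{Normal}(178, 20) \\
\beta &\sim \text{Normal}(0, 10) \\
\sigma &\sim \text{Uniform}(0, 50)
\end{aligned}
\]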
Let’s begin with just the probability of the observed height, the first line of the model. This is nearly identical to before, except now there is a little index \(i\) on the \(μ\) as well as the \(h\). You can read \(h_i\) as “each \(h\)” and \(μ_i\) as “each \(μ\).” The mean \(μ\) now depends upon unique values on each row \(i\). So the little \(i\) on \(μ_i\) indicates that the mean depends upon the row.
Linear model
The mean \(μ\) is no longer a parameter to be estimated. Rather, as seen in the second line of the model, \(μ_i\) is constructed from other parameters, \(α\) and \(β\), and the observed variable \(x\). This line is not a stochastic relationship—there is no \(\sim\) in it, but rather an \(=\) in it—because the definition of \(μ_i\) is deterministic. That is to say that, once we know \(α\) and \(β\) and \(x_i\), we know \(μ_i\) with certainty.
The value \(x_i\) is just the weight value on row \(i\). It refers to the same individual as the height value, \(h_i\), on the same row. The parameters \(α\) and \(β\) are more mysterious. Where did they come from? We made them up. The parameters \(μ\) and \(σ\) are necessary and sufficient to describe a Gaussian distribution. But \(α\) and \(β\) are instead devices we invent for manipulating \(μ\), allowing it to vary systematically across cases in the data.
You’ll be making up all manner of parameters as your skills improve. One way to understand these made-up parameters is to think of them as targets of learning. Each parameter is something that must be described in the posterior distribution. So when you want to know something about the data, you ask your model by inventing a parameter for it. This will make more and more sense as you progress. Here’s how it works in this context. The second line of the model definition is just:
\[
\mu_i = \alpha + \beta(x_i - \bar{x})
\]
What this tells the regression golem is that you are asking two questions about the mean of the outcome.
What is the expected height when \(x_i = \bar{x}\)? The parameter \(α\) answers this question, because when \(x_i = \bar{x}\), \(μ_i = α\). For this reason, \(α\) is often called the intercept. But we should think not in terms of some abstract line, but rather in terms of the meaning with respect to the observable variables.
What is the change in expected height, when \(x_i\) changes by 1 unit? The parameter \(β\) answers this question. It is often called a “slope,” again because of the abstract line. Better to think of it as a rate of change in expectation.
Jointly these two parameters ask the model to find a line that relates \(x\) to \(h\), a line that passes through \(α\) when \(x_i = \bar{x}\) and has slope \(β\). That is a task that models are very good at. It’s up to you, though, to be sure it’s a good question.
Nothing special or natural about linear models
Note that there’s nothing special about the linear model, really. You can choose a different relationship between \(α\) and \(β\) and \(μ\). For example, the following is a perfectly legitimate definition for \(μ_i\):
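For instance, an exponential form (shown here only as an illustrative possibility, not necessarily the example originally given) is a legitimate, non-linear definition of the mean:

\[
\mu_i = \alpha \exp(-\beta x_i)
\]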
This does not define a linear regression, but it does define a regression model. The linear relationship we are using instead is conventional, but nothing requires that you use it. It is very common in some fields, like ecology and demography, to use functional forms for \(μ\) that come from theory, rather than the geocentrism of linear models. Models built out of substantive theory can dramatically outperform linear models of the same phenomena. We’ll revisit this point later.
Units and regression models
Readers who had a traditional training in physical sciences will know how to carry units through equations of this kind. For their benefit, here’s the model again (omitting priors for brevity), now with units of each symbol added.
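A sketch of the annotated model (the units in parentheses are inferred from the discussion that follows):

\[
\begin{aligned}
h_i \,(\text{cm}) &\sim \text{Normal}\big(\mu_i \,(\text{cm}),\ \sigma \,(\text{cm})\big) \\
\mu_i \,(\text{cm}) &= \alpha \,(\text{cm}) + \beta \,(\text{cm/kg}) \times \big(x_i \,(\text{kg}) - \bar{x} \,(\text{kg})\big)
\end{aligned}
\]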
So you can see that \(β\) must have units of cm/kg in order for the mean \(μ_i\) to have units of cm. One of the facts that labeling with units clears up is that a parameter like \(β\) is a kind of rate—centimeters per kilogram. There’s also a tradition called dimensionless analysis that advocates constructing variables so that they are unit-less ratios. In this context, for example, we might divide height by a reference height, removing its units. Measurement scales are arbitrary human constructions, and sometimes the unit-less analysis is more natural and general.
Priors
The remaining lines in the model define distributions for the unobserved variables. These variables are commonly known as parameters, and their distributions as priors. There are three parameters: \(α\), \(β\), and \(σ\). You’ve seen priors for \(α\) and \(σ\) before, although \(α\) was called \(μ\) back then.
The prior for \(β\) deserves explanation. Why have a Gaussian prior with mean zero? This prior places just as much probability below zero as it does above zero, and when \(β=0\), weight has no relationship to height. To figure out what this prior implies, we have to simulate the prior predictive distribution. There is no other reliable way to understand.
The goal is to simulate heights from the model, using only the priors. First, let’s consider a range of weight values to simulate over. The range of observed weights will do fine. Then we need to simulate a bunch of lines, the lines implied by the priors for \(α\) and \(β\). Here’s how to do it:
Now we plot the lines corresponding to 100 pairs of \(α\) and \(β\) values:
def plot_prior_pred():
    N = 100  # 100 lines
    x = np.linspace(np.min(d2['weight']), np.max(d2['weight']), 100)
    xbar = d2['weight'].mean()
    a = stats.norm.rvs(loc=178, scale=20, size=N)
    b = stats.norm.rvs(loc=0, scale=10, size=N)
    fig, axs = plt.subplots(ncols=2, figsize=(9, 4))
    for i in range(N):
        axs[0].plot(x, a[i] + b[i] * (x - xbar), color="black", alpha=0.2, linewidth=0.5)
    axs[0].set_title(r'$\beta \sim \text{Normal}(0,10)$')
    # Sample from new prior
    a = stats.norm.rvs(loc=178, scale=20, size=N)
    b = stats.lognorm.rvs(scale=np.exp(0), s=1, size=N)
    for i in range(N):
        axs[1].plot(x, a[i] + b[i] * (x - xbar), color="black", alpha=0.2, linewidth=0.5)
    axs[1].set_title(r'$\text{log}(\beta) \sim \text{Normal}(0,1)$')
    for ax in axs:
        ax.axhline(272, color="black", linewidth=0.5)
        ax.axhline(0, color="black", linestyle="--", linewidth=0.5)
        ax.set(xlabel="Weight", ylabel="Height", ylim=(-100, 400))
    fig.tight_layout()

plot_prior_pred()
Prior predictive simulation for the height and weight model. Left: Simulation using the \(β \sim \text{Normal}(0,10)\) prior. Right: A more sensible \(\text{log}(β) \sim \text{Normal}(0,1)\) prior.
For reference, I’ve added a dashed line at zero—no one is shorter than zero—and the “Wadlow” line at 272 cm for the world’s tallest person. The pattern doesn’t look like any human population at all. It essentially says that the relationship between weight and height could be absurdly positive or negative. Before we’ve even seen the data, this is a bad model. Can we do better?
We know that average height increases with average weight, at least up to a point. Let’s try restricting it to positive values. The easiest way to do this is to define the prior as Log-Normal instead.
Defining \(β\) as Log-Normal(0,1) means to claim that the logarithm of \(β\) has a Normal(0,1) distribution. Plainly:
\[
\beta \sim \text{Log-Normal}(0,1)
\]
SciPy provides stats.lognorm.rvs (random draws) and stats.lognorm.pdf (the density) for working with log-normal distributions. We can simulate from this prior to see what it implies for \(\beta\):
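A minimal sketch of that simulation (drawing many values of \(\beta\) and plotting their density; the sample size of 10,000 is arbitrary):

# Draw from the Log-Normal(0, 1) prior: s is the sd of log(beta),
# scale = exp(mean of log(beta))
b = stats.lognorm.rvs(s=1, scale=np.exp(0), size=10_000)
az.plot_kde(b)
plt.xlabel(r'$\beta$')
plt.show()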
If the logarithm of \(β\) is normal, then \(β\) itself is strictly positive. The reason is that exp(x) is greater than zero for any real number \(x\). This is the reason that Log-Normal priors are commonplace. They are an easy way to enforce positive relationships. So what does this earn us? Do the prior predictive simulation again, now with the Log-Normal prior:
N = 100
a = stats.norm.rvs(loc=178, scale=20, size=N)
b = stats.lognorm.rvs(scale=np.exp(0), s=1, size=N)
Plotting as before produces the right-hand plot above. This is much more sensible. There is still a rare impossible relationship. But nearly all lines in the joint prior for \(α\) and \(β\) are now within human reason.
We’re fussing about this prior, even though as you’ll see in the next section there is so much data in this example that the priors end up not mattering. We fuss for two reasons. First, there are many analyses in which no amount of data makes the prior irrelevant. In such cases, non-Bayesian procedures are no better off. They also depend upon structural features of the model. Paying careful attention to those features is essential. Second, thinking about the priors helps us develop better models, maybe even eventually going beyond geocentrism.
What’s the correct prior?
People commonly ask what the correct prior is for a given analysis. The question sometimes implies that for any given set of data, there is a uniquely correct prior that must be used, or else the analysis will be invalid. This is a mistake. There is no more a uniquely correct prior than there is a uniquely correct likelihood. Statistical models are machines for inference. Many machines will work, but some work better than others. Priors can be wrong, but only in the same sense that a kind of hammer can be wrong for building a table.
In choosing priors, there are simple guidelines to get you started. Priors encode states of information before seeing data. So priors allow us to explore the consequences of beginning with different information. In cases in which we have good prior information that discounts the plausibility of some parameter values, like negative associations between height and weight, we can encode that information directly into priors. When we don’t have such information, we still usually know enough about the plausible range of values. And you can vary the priors and repeat the analysis in order to study how different states of initial information influence inference. Frequently, there are many reasonable choices for a prior, and all of them produce the same inference. And conventional Bayesian priors are conservative, relative to conventional non-Bayesian approaches. We’ll see how this conservatism arises later.
Making choices tends to make novices nervous. There’s an illusion sometimes that default procedures are more objective than procedures that require user choice, such as choosing priors. If that’s true, then all “objective” means is that everyone does the same thing. It carries no guarantees of realism or accuracy.
Prior predictive simulation and p-hacking
A serious problem in contemporary applied statistics is “p-hacking,” the practice of adjusting the model and the data to achieve a desired result. The desired result is usually a p-value less than 5%. The problem is that when the model is adjusted in light of the observed data, then p-values no longer retain their original meaning. False results are to be expected. We don’t pay any attention to p-values in this book. But the danger remains, if we choose our priors conditional on the observed sample, just to get some desired result. The procedure we’ve performed here is to choose priors conditional on pre-data knowledge of the variables—their constraints, ranges, and theoretical relationships. This is why the actual data are not shown in the earlier section. We are judging our priors against general facts, not the sample. We’ll look at how the model performs against the real data next.
1.2 Finding The Posterior Distribution
The code needed to approximate the posterior is a straightforward modification of the kind of code you’ve already seen. All we have to do is incorporate our new model for the mean into the model specification inside the Stan code and be sure to add a prior for the new parameter, \(β\). Let’s repeat the model definition:
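The fitting code itself is not reproduced above, but a sketch consistent with the m4_3b variant and the prediction model shown later in this notebook (the 'stan_models/m4_3' path and the StanQuap helper from utils are the ones used elsewhere here) looks like this:

m4_3 = '''
data {
    int<lower=0> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
}
parameters {
    real a;
    real<lower=0> b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * (weight - xbar);
    // Likelihood Function
    height ~ normal(mu, sigma);
    // Priors
    a ~ normal(178, 20);
    b ~ lognormal(0, 1);
    sigma ~ uniform(0, 50);
}
'''

data = {'N': len(d2), 'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(), 'weight': d2['weight'].tolist()}

m4_3_model = utils.StanQuap('stan_models/m4_3', m4_3, data, algorithm='Newton')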
Everything that depends upon parameters has a posterior distribution
In the model above, the parameter \(μ\) is no longer a parameter, since it has become a function of the parameters \(α\) and \(β\). But since the parameters \(α\) and \(β\) have a joint posterior, so too does \(μ\). Later in the chapter, you’ll work directly with the posterior distribution of \(μ\), even though it’s not a parameter anymore. Since parameters are uncertain, everything that depends upon them is also uncertain. This includes statistics like \(μ\), as well as model-based predictions, measures of fit, and everything else that uses parameters. By working with samples from the posterior, all you have to do to account for posterior uncertainty in any quantity is to compute that quantity for each sample from the posterior. The resulting quantities, one for each posterior sample, will approximate the quantity’s posterior distribution.
Logs and exps, oh my
My experience is that many natural and social scientists have naturally forgotten whatever they once knew about logarithms. Logarithms appear all the time in applied statistics. You can usefully think of \(y=log(x)\) as assigning to \(y\) the order of magnitude of \(x\). The function \(x=exp(y)\) is the reverse, turning a magnitude into a value. These definitions will make a mathematician shriek. But much of our computational work relies only on these intuitions.
These definitions allow the Log-Normal prior for \(β\) to be coded another way. Instead of defining a parameter \(β\), we define a parameter that is the logarithm of \(β\) and then assign it a normal distribution. Then we can reverse the logarithm inside the linear model. It looks like this:
m4_3b = '''
data {
    int<lower=0> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
}
parameters {
    real a;
    real log_b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + exp(log_b) * (weight - xbar);
    // Likelihood Function
    height ~ normal(mu, sigma);
    // Priors
    a ~ normal(178, 20);
    log_b ~ normal(0, 1);
    sigma ~ uniform(0, 50);
}
'''

data = {'N': len(d2), 'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(), 'weight': d2['weight'].tolist()}

m4_3b_model = utils.StanQuap('stan_models/m4_3b', m4_3b, data, algorithm='Newton')
m4_3b_model.precis().round(2)
m4_3b_model.vcov_matrix().round(3)
Note the exp(log_b) in the definition of mu. This is the same model as m4_3. It will make the same predictions. But instead of \(β\) in the posterior distribution, you get \(\text{log}(β)\). It is easy to translate between the two, because \(β=\text{exp}(\text{log}(β))\). In code form: b = np.exp(log_b).
1.3 Interpreting the Posterior Distribution
One trouble with statistical models is that they are hard to understand. Once you’ve fit the model, it can only report the posterior distribution. This is the right answer to the question you asked. But it’s your responsibility to process the answer and make sense of it.
There are two broad categories of processing: (1) reading tables and (2) plotting simulations. For some simple questions, it’s possible to learn a lot just from tables of marginal values. But most models are very hard to understand from tables of numbers alone. A major difficulty with tables alone is their apparent simplicity compared to the complexity of the model and data that generated them. Once you have more than a couple of parameters in a model, it is very hard to figure out from numbers alone how all of them act to influence prediction. This is also the reason we simulate from priors. Once you begin adding interaction terms or polynomials, it may not even be possible to guess the direction of influence a predictor variable has on an outcome.
So throughout this book, I emphasize plotting posterior distributions and posterior predictions, instead of attempting to understand a table. Plotting the implications of your models will allow you to inquire about things that are hard to read from tables:
Whether or not the model fitting procedure worked correctly
The absolute magnitude, rather than merely relative magnitude, of a relationship between outcome and predictor
The uncertainty surrounding an average relationship
The uncertainty surrounding the implied predictions of the model, as these are distinct from mere parameter uncertainty
In addition, once you get the hang of processing posterior distributions into plots, you can ask any question you can think of, for any model type. And readers of your results will appreciate a figure much more than they will a table of estimates.
So in the remainder of this section, I first spend a little time talking about tables of estimates. Then I move on to show how to plot estimates that always incorporate information from the full posterior distribution, including correlations among parameters.
What do parameters mean?
A basic issue with interpreting model-based estimates is in knowing the meaning of parameters. There is no consensus about what a parameter means, however, because different people take different philosophical stances towards models, probability, and prediction. The perspective in this book is a common Bayesian perspective: Posterior probabilities of parameter values describe the relative compatibility of different states of the world with the data, according to the model. These are small world numbers. So reasonable people may disagree about the large world meaning, and the details of those disagreements depend strongly upon context. Such disagreements are productive, because they lead to model criticism and revision, something that models cannot do for themselves. Later, we’ll see that parameters can refer to observable quantities—data—as well as unobservable values. This makes parameters even more useful and their interpretation even more context dependent.
Tables of Marginal Distributions
With the new linear regression trained on the Kalahari data, we inspect the marginal posterior distributions of the parameters:
m4_3_model.precis().round(2)
Parameter     Mean  StDev    5.5%   94.5%
a           154.60   0.27  154.17  155.03
b             0.90   0.04    0.84    0.97
sigma         5.07   0.19    4.77    5.38
The first row gives the quadratic approximation for \(α\), the second the approximation for \(β\), and the third the approximation for \(σ\). Let’s try to make some sense of them.
Let’s focus on b (\(β\)), because it’s the new parameter. Since \(β\) is a slope, the value 0.90 can be read as a person 1 kg heavier is expected to be 0.90 cm taller. 89% of the posterior probability lies between 0.84 and 0.97. That suggests that \(β\) values close to zero or greatly above one are highly incompatible with these data and this model. It is most certainly not evidence that the relationship between weight and height is linear, because the model only considered lines. It just says that, if you are committed to a line, then lines with a slope around 0.9 are plausible ones.
Remember, the numbers in the default precis output aren’t sufficient to describe the quadratic posterior completely. For that, we also require the variance-covariance matrix. You can see the covariances among the parameters with vcov:
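For example (using the same vcov_matrix method called on m4_3b earlier):

m4_3_model.vcov_matrix().round(3)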
Very little covariation among the parameters in this case. Constructing a pairs plot shows both the marginal posteriors and the covariance. Later, you’ll see that the lack of covariance among the parameters results from centering.
It’s almost always much more useful to plot the posterior inference against the data. Not only does plotting help in interpreting the posterior, but it also provides an informal check on model assumptions. When the model’s predictions don’t come close to key observations or patterns in the plotted data, then you might suspect the model either did not fit correctly or is rather badly specified. But even if you only treat plots as a way to help in interpreting the posterior, they are invaluable. For simple models like this one, it is possible (but not always easy) to just read the table of numbers and understand what the model says. But for even slightly more complex models, especially those that include interaction effects, interpreting posterior distributions is hard. Combine with this the problem of incorporating the information in vcov into your interpretations, and the plots are irreplaceable.
We’re going to start with a simple version of that task, superimposing just the posterior mean values over the height and weight data. Then we’ll slowly add more and more information to the prediction plots, until we’ve used the entire posterior distribution.
We’ll start with just the raw data and a single line. The code below plots the raw data, computes the posterior mean values for a and b, then draws the implied line:
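A sketch of that plot (posterior means taken from extract_samples, as elsewhere in this notebook):

post = pd.DataFrame(m4_3_model.extract_samples())
a_map, b_map = post['a'].mean(), post['b'].mean()
xbar = d2['weight'].mean()

plt.scatter(d2['weight'], d2['height'], facecolor="none", edgecolor="C0")
x = np.linspace(d2['weight'].min(), d2['weight'].max(), 50)
plt.plot(x, a_map + b_map * (x - xbar), "C0-")
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()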
Height in centimeters (vertical) plotted against weight in kilograms (horizontal), with the line at the posterior mean plotted in blue.
Each point in this plot is a single individual. The blue line is defined by the mean slope \(β\) and mean intercept \(α\). This is not a bad line. It certainly looks highly plausible. But there are an infinite number of other highly plausible lines near it. Let’s draw those too.
Adding Uncertainty Around the Mean
The posterior mean line is just the posterior mean, the most plausible line in the infinite universe of lines the posterior distribution has considered. Plots of the average line, like the plot above, are useful for getting an impression of the magnitude of the estimated influence of a variable. But they do a poor job of communicating uncertainty. Remember, the posterior distribution considers every possible regression line connecting height to weight. It assigns a relative plausibility to each. This means that each combination of \(α\) and \(β\) has a posterior probability. It could be that there are many lines with nearly the same posterior probability as the average line. Or it could be instead that the posterior distribution is rather narrow near the average line.
So how can we get that uncertainty onto the plot? Together, a combination of \(α\) and \(β\) define a line. And so we could sample a bunch of lines from the posterior distribution. Then we could display those lines on the plot, to visualize the uncertainty in the regression relationship.
To better appreciate how the posterior distribution contains lines, we work with all of the samples from the model. Let’s take a closer look at the samples now:
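For example:

post = pd.DataFrame(m4_3_model.extract_samples())
post.head()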
Each row is a correlated random sample from the joint posterior of all three parameters, using the covariances provided by the model. The paired values of a and b on each row define a line. The average of very many of these lines is the posterior mean line. But the scatter around that average is meaningful, because it alters our confidence in the relationship between the predictor and the outcome.
So now let’s display a bunch of these lines, so you can see the scatter. This lesson will be easier to appreciate if we use only some of the data to begin. Then you can see how adding in more data changes the scatter of the lines. So we’ll begin with just the first 10 cases in d2. The following code extracts the first 10 cases and re-estimates the model:
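A sketch of that step (reusing the m4_3 Stan program; dN and mN are the names used in the plotting code below):

N = 10
dN = d2[:N]
data = {'N': len(dN), 'xbar': dN['weight'].mean(),
        'height': dN['height'].tolist(), 'weight': dN['weight'].tolist()}
mN = utils.StanQuap('stan_models/m4_3', m4_3, data, algorithm='Newton')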
Now let’s plot 20 of these lines, to see what the uncertainty looks like.
# extract 20 samples from the posterior
post = pd.DataFrame(mN.extract_samples(n=20))

def plot_post_infer():
    plt.scatter(dN['weight'], dN['height'], facecolor="none", edgecolor="C0")
    for i in range(20):
        a, b = post['a'][i], post['b'][i]
        x = np.linspace(d2['weight'].min(), d2['weight'].max(), 10)
        y = a + b * (x - dN['weight'].mean())
        plt.plot(x, y, "C1-", alpha=0.5)
    plt.ylabel("Height (cm)")
    plt.xlabel("Weight (kg)")
    plt.tight_layout()

plot_post_infer()
By plotting multiple regression lines, sampled from the posterior, it is easy to see both the highly confident aspects of the relationship and the less confident aspects. The cloud of regression lines displays greater uncertainty at extreme values for weight.
The other plots below show the same relationships, but for increasing amounts of data. Just re-use the code from before, but change N = 10 to some other value. Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean.
def plot_reg_rel():
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9), sharey=True, sharex=True)
    for N, ax in zip((10, 50, 150, 352), axes.flatten()):
        weight, height = d2['weight'][:N], d2['height'][:N]
        xbar = np.mean(weight)
        data = {'N': len(height), 'xbar': xbar,
                'height': height.tolist(), 'weight': weight.tolist()}
        mN = utils.StanQuap('stan_models/m4_3', m4_3, data, algorithm='Newton')
        post = pd.DataFrame(mN.extract_samples(n=20))
        ax.scatter(weight, height, facecolor="none", edgecolor="C0")
        for i in range(20):
            a, b = post['a'][i], post['b'][i]
            x = np.linspace(d2['weight'].min(), d2['weight'].max(), 10)
            y = a + b * (x - xbar)
            ax.plot(x, y, "C1-", alpha=0.5)
        ax.set(xlabel="Weight (kg)", ylabel="Height (cm)", title=f"N = {N}")
    fig.tight_layout()

plot_reg_rel()
Plotting Regression Intervals and Contours
The cloud of regression lines in the plot above is an appealing display, because it communicates uncertainty about the relationship in a way that many people find intuitive. But it’s more common, and often much clearer, to see the uncertainty displayed by plotting an interval or contour around the average regression line. In this section, I’ll walk you through how to compute any arbitrary interval you like, using the underlying cloud of regression lines embodied in the posterior distribution.
Focus for the moment on a single weight value, say 50 kilograms. You can quickly make a list of 10,000 values of \(μ\) for an individual who weighs 50 kilograms, by using your samples from the posterior:
post = pd.DataFrame(m4_3_model.extract_samples())
mu_at_50 = post["a"] + post["b"] * (50 - d2.weight.mean())
The code above takes its form from the equation for \(μ_i\):
\[
\mu_i = \alpha + \beta(x_i - \bar{x})
\]
The value of \(x_i\) in this case is 50. Go ahead and take a look inside the result, mu_at_50. It’s a vector of predicted means, one for each random sample from the posterior. Since the jointly sampled values of a and b went into computing each, the variation across those means incorporates the uncertainty in and correlation between both parameters. It might be helpful at this point to actually plot the density for this vector of means:
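For example, with a kernel density estimate:

az.plot_kde(mu_at_50.to_numpy())
plt.xlabel(r'$\mu \mid \mathrm{weight} = 50$')
plt.show()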
The quadratic approximate posterior distribution of the mean height, \(μ\), when weight is 50 kg. This distribution represents the relative plausibility of different values of the mean.
Since the components of \(μ\) have distributions, so too does \(μ\). And since the distributions of \(α\) and \(β\) are Gaussian, so too is the distribution of \(μ\) (adding Gaussian distributions always produces a Gaussian distribution).
Since the posterior for \(μ\) is a distribution, you can find intervals for it, just like for any posterior distribution. To find the 89% compatibility interval of \(μ\) at 50 kg, we can use the quantile or precis function as usual:
utils.precis(pd.DataFrame({'mu': mu_at_50}))
Parameter        Mean     StDev       5.5%       94.5%
mu         159.130667  0.342725  158.58562  159.681541
What these numbers mean is that the central 89% of the ways for the model to produce the data place the average height between about 159 cm and 160 cm (conditional on the model and data), assuming the weight is 50 kg.
That’s good so far, but we need to repeat the above calculation for every weight value on the horizontal axis, not just when it is 50 kg. We want to draw 89% intervals around the average slope.
This is made simple by strategic use of the link method. What link will do is take our StanQuap approximation, sample from the posterior distribution, and then pass the parameter samples to a custom-defined linear model function (one that mirrors the linear model inside our Stan program) to compute \(μ\) for each case in the data, for each sample from the posterior. Here’s what it looks like for the data you used to fit the model:
def lm(post, data):
    weight = data
    xbar = d2['weight'].mean()  # ensure that xbar is based on the dataset supplied during fit
    a, b = post['a'].reshape(-1, 1), post['b'].reshape(-1, 1)
    return a + b * (weight - xbar)

mu = m4_3_model.link(lm_func=lm, predictor=d2['weight'].to_numpy())
print(mu)
We end up with a big matrix of values of \(μ\). Each row is a sample from the posterior distribution. The default is 1000 samples, but you can use as many or as few as you like. Each column is a case (row) in the data. There are 352 rows in d2, corresponding to 352 individuals. So there are 352 columns in the matrix mu above.
Now what can we do with this big matrix? Lots of things. The function link provides a posterior distribution of \(μ\) for each case we feed it. So above we have a distribution of \(μ\) for each individual in the original data. We actually want something slightly different: a distribution of \(μ\) for each unique weight value on the horizontal axis. It’s only slightly harder to compute that, by just passing link some new data:
# define sequence of weights to compute predictions for
# these values will be on the horizontal axis
weight_seq = np.arange(25, 71)

# use link to compute mu
# for each sample from the posterior
# and for each weight in weight_seq
mu = m4_3_model.link(lm_func=lm, predictor=weight_seq)
mu
And now there are only 46 columns in mu, because we fed it 46 different values for weight. To visualize what you’ve got here, let’s plot the distribution of \(μ\) values at each weight.
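A sketch of that plot, showing the first 100 sampled values of \(\mu\) at each weight:

plt.scatter(np.repeat(weight_seq, 100), mu[:100, :].T.flatten(),
            alpha=0.1, edgecolor="none")
plt.xlabel("Weight (kg)")
plt.ylabel(r'$\mu$ (height in cm)')
plt.show()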
The first 100 values in the distribution of \(μ\) at each weight value.
At each weight value in weight_seq, a pile of computed \(μ\) values are shown. Each of these piles is a Gaussian distribution, like that in the density plot for mu_at_50. You can see now that the amount of uncertainty in \(μ\) depends upon the value of weight. And this is the same fact we saw in the 2x2 plot of uncertainty in the regression relationship.
The final step is to summarize the distribution for each weight value:
# summarize the distribution of mu
mu_df = pd.DataFrame(
    {str(weight): mu[:, i] for i, weight in enumerate(weight_seq)}
)
mu_summary = utils.precis(mu_df, index_name='Weight')
mu_summary
Weight        Mean     StDev        5.5%       94.5%
25      136.550323  0.878703  135.070529  137.909084
26      137.453266  0.838840  136.034951  138.758116
27      138.356209  0.799198  137.010952  139.602213
28      139.259151  0.759810  138.001088  140.443884
29      140.162094  0.720718  138.963611  141.283895
30      141.065037  0.681973  139.935747  142.129740
31      141.967980  0.643639  140.906037  142.975584
32      142.870923  0.605792  141.875437  143.820467
33      143.773865  0.568530  142.852814  144.667197
34      144.676808  0.531977  143.829398  145.512716
35      145.579751  0.496289  144.799501  146.371581
36      146.482694  0.461665  145.760978  147.229268
37      147.385637  0.428366  146.725594  148.077106
38      148.288579  0.396724  147.676114  148.924816
39      149.191522  0.367167  148.618854  149.789554
40      150.094465  0.340241  149.558762  150.645807
41      150.997408  0.316616  150.503683  151.506332
42      151.900351  0.297081  151.432825  152.376239
43      152.803293  0.282486  152.370539  153.247281
44      153.706236  0.273624  153.291584  154.134467
45      154.609179  0.271056  154.176898  155.032607
46      155.512122  0.274959  155.057455  155.930824
47      156.415065  0.285067  155.952303  156.843423
48      157.318007  0.300756  156.828689  157.781192
49      158.220950  0.321208  157.690513  158.695296
50      159.123893  0.345579  158.557402  159.656285
51      160.026836  0.373102  159.411816  160.606398
52      160.929779  0.403132  160.276651  161.547324
53      161.832721  0.435150  161.132537  162.510581
54      162.735664  0.468749  161.973220  163.459873
55      163.638607  0.503612  162.803465  164.427431
56      164.541550  0.539495  163.654450  165.387635
57      165.444493  0.576207  164.508352  166.352003
58      166.347436  0.613599  165.347721  167.315219
59      167.250378  0.651554  166.190272  168.271753
60      168.153321  0.689980  167.033532  169.239802
61      169.056264  0.728801  167.879005  170.220827
62      169.959207  0.767958  168.729773  171.173091
63      170.862150  0.807402  169.581060  172.148897
64      171.765092  0.847093  170.422054  173.125730
65      172.668035  0.886998  171.269175  174.090472
66      173.570978  0.927088  172.102070  175.061397
67      174.473921  0.967342  172.942478  176.039323
68      175.376864  1.007739  173.777626  177.012274
69      176.279806  1.048263  174.612753  177.988837
70      177.182749  1.088900  175.446181  178.964710
Now mu_summary contains the average \(μ\) at each weight value and 89% lower and upper bounds for each weight value. They are just different kinds of summaries of the distributions in mu, with each column being for a different weight value. These summaries are only summaries. The “estimate” is the entire distribution.
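A sketch of the interval plot described in the caption below (summarizing mu from link):

mu_mean = mu.mean(axis=0)
mu_ci = np.quantile(mu, [0.055, 0.945], axis=0)

plt.scatter(d2['weight'], d2['height'], facecolor="none", edgecolor="C0", alpha=0.5)
plt.plot(weight_seq, mu_mean, "C0-")
plt.fill_between(weight_seq, mu_ci[0], mu_ci[1], color="C0", alpha=0.3)
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()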
The !Kung height data again, now with 89% compatibility interval of the mean indicated by the shaded region. Compare this region to the distributions of points in the above plot.
Using this approach, you can derive and plot posterior prediction means and intervals for quite complicated models, for any data you choose. It’s true that it is possible to use analytical formulas to compute intervals like this. But without much training in probability theory and integral calculus, this would be challenging to do. We can quickly learn to generate and summarize samples derived from the posterior distribution. So while the mathematics would be a more elegant approach, and there is some additional insight that comes from knowing the mathematics, the pseudo-empirical approach presented here is very flexible and allows a much broader audience to pull insight from their statistical modeling. And again, when you start estimating models with MCMC, this is really the only approach available. So it’s worth learning now.
To summarize, here’s the recipe for generating predictions and intervals from the posterior of a fit model.
Use link to generate distributions of posterior values for \(μ\). The default behavior of link is to use the original data, so if you want posterior predictions across other values of the predictor, pass it a list of new horizontal axis values to plot across.
Use summary functions like numpy.mean, numpy.quantile, arviz.hdi or precis to find averages and lower and upper bounds of \(μ\) for each value of the predictor variable.
Finally, use plotting functions to draw the lines and intervals. Or you might plot the distributions of the predictions, or do further numerical calculations with them.
This recipe works for every model we fit in this repo. As long as you know the structure of the model—how parameters relate to the data—you can use samples from the posterior to describe any aspect of the model’s behavior.
Overconfident intervals
The compatibility interval for the regression line in the plot above clings tightly to the MAP line. Thus there is very little uncertainty about the average height as a function of average weight. But you have to keep in mind that these inferences are always conditional on the model. Even a very bad model can have very tight compatibility intervals. It may help if you think of the regression line as saying: Conditional on the assumption that height and weight are related by a straight line, then this is the most plausible line, and these are its plausible bounds.
How link works
The link method is not really very sophisticated. All it is doing is using the linear model function you provide as an argument to compute the value of the linear model. It does this for each sample from the posterior distribution, for each case in the data. You could accomplish the same thing for any model, fit by any means, by performing these steps yourself. This is how it’d look for m4_3_model.
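A sketch of those manual steps (mirroring the lm function defined earlier; mu_mean and mu_CI are the summaries referenced in the next sentence):

post = m4_3_model.extract_samples()
xbar = d2['weight'].mean()

def mu_link(weight):
    # same linear model as inside the Stan program
    return post['a'].reshape(-1, 1) + post['b'].reshape(-1, 1) * (weight - xbar)

weight_seq = np.arange(25, 71)
mu = mu_link(weight_seq)
mu_mean = mu.mean(axis=0)
mu_CI = np.quantile(mu, [0.055, 0.945], axis=0)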
And the values in mu_mean and mu_CI should be very similar (allowing for simulation variance) to what you got the automated way, using link.
Knowing this manual method is useful both for (1) understanding and (2) sheer power. Whatever the model you find yourself with, this approach can be used to generate posterior predictions for any component of it. Automated tools like link save effort, but they are never as flexible as the code you can write yourself.
Prediction Intervals
Now let’s walk through generating an 89% prediction interval for actual heights, not just the average height, \(μ\). This means we’ll incorporate the standard deviation \(σ\) and its uncertainty as well. Remember, the first line of the statistical model here is:
\[
h_i \sim \text{Normal}(\mu_i, \sigma)
\]
What you’ve done so far is just use samples from the posterior to visualize the uncertainty in \(μ_i\), the linear model of the mean. But actual predictions of heights depend also upon the distribution in the first line. The Gaussian distribution on the first line tells us that the model expects observed heights to be distributed around \(μ\), not right on top of it. And the spread around \(μ\) is governed by \(σ\). All of this suggests we need to incorporate \(σ\) in the predictions somehow.
Here’s how you do it. Imagine simulating heights. For any unique weight value, you sample from a Gaussian distribution with the correct mean \(μ\) for that weight, using the correct value of \(σ\) sampled from the same posterior distribution. If you do this for every sample from the posterior, for every weight value of interest, you end up with a collection of simulated heights that embody the uncertainty in the posterior as well as the uncertainty in the Gaussian distribution of heights. This is how we can do it:
m4_3_pred = '''
data {
    int<lower=1> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;
}
parameters {
    real a;
    real<lower=0> b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * (weight - xbar);
    // Likelihood Function
    height ~ normal(mu, sigma);
    // Priors
    a ~ normal(178, 20);
    b ~ lognormal(0, 1);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde;
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * (weight_tilde[i] - xbar), sigma);
    }
}
'''

data = {'N': len(d2), 'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(), 'weight': d2['weight'].tolist(),
        'N_tilde': len(weight_seq),
        'weight_tilde': weight_seq.tolist()}

m4_3_pred_model = utils.StanQuap('stan_models/m4_3_pred', m4_3_pred, data,
                                 algorithm='Newton', generated_var=['y_tilde'])
sim_height = m4_3_pred_model.sim(data=data)['y_tilde']
sim_height
This matrix is much like the earlier one, mu, but it contains simulated heights, not distributions of plausible average height, µ.
We can summarize these simulated heights in the same way we summarized the distributions of µ:
sim_height_df = pd.DataFrame(
    {str(weight): sim_height[:, i] for i, weight in enumerate(weight_seq)}
)
sim_height_df = utils.precis(sim_height_df, index_name='Weight')
sim_height_df
Weight        Mean     StDev        5.5%       94.5%
25      136.663695  5.370075  127.959990  144.573050
26      137.290701  4.986234  129.205560  145.673660
27      138.236428  5.243790  129.741350  146.514055
28      139.435145  4.994606  131.487505  147.316055
29      140.117355  5.146806  131.487120  148.065310
30      141.234863  5.199098  133.326405  149.682605
31      142.210937  4.901433  134.211020  149.744220
32      142.903343  5.198310  134.455395  151.009100
33      143.628855  5.074116  135.805395  151.474115
34      144.759202  5.145957  136.506955  153.331880
35      145.828686  5.045021  137.812670  153.675715
36      146.586450  5.080876  138.459900  154.762815
37      147.260928  5.136748  138.886800  155.239365
38      148.327942  5.265762  139.607390  156.541475
39      149.343324  4.987560  141.525205  157.163080
40      150.212610  5.084759  142.047560  158.550825
41      150.907793  5.109059  142.735840  158.993950
42      151.793248  5.321290  143.309955  160.744280
43      152.678431  5.152787  144.496340  161.027320
44      153.587360  5.026441  145.568325  161.505810
45      154.914438  5.158792  146.214230  163.205905
46      155.793384  5.183644  147.762200  164.542460
47      156.440156  5.050642  148.618940  164.625275
48      157.483916  4.964179  149.247035  165.290475
49      157.830968  5.264286  149.112855  166.188660
50      158.970304  5.131684  150.847185  166.844840
51      160.014394  5.060669  152.100560  168.552005
52      160.707716  5.088063  152.471305  168.658660
53      161.841900  5.043431  153.885835  170.291105
54      162.620316  5.300371  154.420930  170.684225
55      163.631253  5.139409  155.659900  171.987165
56      164.579445  5.073314  156.523525  172.811330
57      165.354755  4.919371  157.168615  173.013550
58      166.348509  5.183715  158.263460  174.491595
59      167.348257  5.072691  159.412130  175.534160
60      168.038060  5.102002  159.832525  176.365745
61      169.043566  5.312162  160.572965  176.976705
62      170.152270  5.133405  162.096015  178.054740
63      170.940040  4.992685  163.313490  179.320275
64      171.704648  5.144238  163.706515  179.834850
65      172.789544  5.434792  164.021450  181.372010
66      173.474984  5.132387  165.476250  181.859155
67      174.490067  5.013133  166.276775  182.355980
68      175.348226  5.157084  167.235790  183.364330
69      176.035404  5.312805  167.688775  184.350170
70      177.375047  5.200763  169.124950  185.726055
Now sim_height_df contains the 89% posterior prediction interval of observable (according to the model) heights, across the values of weight in weight_seq.
Let’s plot everything we’ve built up: (1) the average line, (2) the shaded region of 89% plausible \(μ\), and (3) the boundaries of the simulated heights the model expects.
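A sketch of that combined plot (assuming mu from link and sim_height from the sim call above):

mu_mean = mu.mean(axis=0)
mu_ci = np.quantile(mu, [0.055, 0.945], axis=0)
height_ci = np.quantile(sim_height, [0.055, 0.945], axis=0)

plt.scatter(d2['weight'], d2['height'], facecolor="none", edgecolor="C0", alpha=0.5)
plt.plot(weight_seq, mu_mean, "C0-")                                             # average line
plt.fill_between(weight_seq, mu_ci[0], mu_ci[1], color="C0", alpha=0.4)          # 89% interval of mu
plt.fill_between(weight_seq, height_ci[0], height_ci[1], color="C0", alpha=0.2)  # 89% prediction interval
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()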
89% prediction interval for height, as a function of weight. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of \(μ\). The wider shaded region represents the region within which the model expects to find 89% of actual heights in the population, at each weight.
The wide shaded region in the figure represents the area within which the model expects to find 89% of actual heights in the population, at each weight. There is nothing special about the value 89% here. You could plot the boundary for other percents, such as 67% and 97% (also both primes), and add those to the plot. Doing so would help you see more of the shape of the predicted distribution of heights.
Notice that the outline for the wide shaded interval is a little rough. This is the simulation variance in the tails of the sampled Gaussian values. If it really bothers you, increase the number of samples you take from the posterior distribution.
With extreme percentiles, it can be very hard to get out all of the roughness. Luckily, it hardly matters, except for aesthetics. Moreover, it serves to remind us that all statistical inference is approximate. The fact that we can compute an expected value to the 10th decimal place does not imply that our inferences are precise to the 10th decimal place.
Two kinds of uncertainty
In the procedure above, we encountered both uncertainty in parameter values and uncertainty in a sampling process. These are distinct concepts, even though they are processed much the same way and end up blended together in the posterior predictive simulation. The posterior distribution is a ranking of the relative plausibilities of every possible combination of parameter values. The distribution of simulated outcomes, like height, is instead a distribution that includes sampling variation from some process that generates Gaussian random variables. This sampling variation is still a model assumption. It’s no more or less objective than the posterior distribution. Both kinds of uncertainty matter, at least sometimes. But it’s important to keep them straight, because they depend upon different model assumptions. Furthermore, it’s possible to view the Gaussian likelihood as a purely epistemological assumption (a device for estimating the mean and variance of a variable), rather than an ontological assumption about what future data will look like. In that case, it may not make complete sense to simulate outcomes.
Rolling your own sim
Just like with link, it’s useful to know a little about how sim operates. For every distribution like stats.norm.pdf, there is a companion simulation (random number generator rng) function. For the Gaussian distribution, the companion is stats.norm.rvs, and it simulates sampling from a Gaussian distribution. What we want Python to do is simulate a height for each set of samples, and to do this for each value of weight. The following will do it:
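A sketch of that manual simulation (one Gaussian draw per posterior sample, per weight value):

post = m4_3_model.extract_samples()
xbar = d2['weight'].mean()
weight_seq = np.arange(25, 71)

# For each weight, draw one height per posterior sample of (a, b, sigma)
sim_height = np.array([
    stats.norm.rvs(loc=post['a'] + post['b'] * (w - xbar), scale=post['sigma'])
    for w in weight_seq
]).T  # rows: posterior samples, columns: weight values

height_CI = np.quantile(sim_height, [0.055, 0.945], axis=0)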
The values in sim_height are practically identical to the ones computed in the main text and displayed in the plot above.
2 Curves from Lines
In the next notebook, you’ll see how to use linear models to build regressions with more than one predictor variable. But before then, it helps to see how to model the outcome as a curved function of a predictor. The models so far all assume that a straight line describes the relationship. But there’s nothing special about straight lines, aside from their simplicity.
We’ll consider two commonplace methods that use linear regression to build curves. The first is polynomial regression. The second is B-splines. Both approaches work by transforming a single predictor variable into several synthetic variables. But splines have some clear advantages. Neither approach aims to do more than describe the function that relates one variable to another. Causal inference, which we’ll consider much more beginning in the next notebook, wants more.
2.1 Polynomial Regression
Polynomial regression uses powers of a variable—squares and cubes—as extra predictors. This is an easy way to build curved associations. Polynomial regressions are very common, and understanding how they work will help scaffold later models. To understand how polynomial regression works, let’s work through an example, using the full !Kung data, not just the adults:
d = pd.read_csv("data/Howell1.csv", sep=';')
d.head()

     height     weight   age  male
0   151.765  47.825606  63.0     1
1   139.700  36.485807  63.0     0
2   136.525  31.864838  65.0     0
3   156.845  53.041914  41.0     1
4   145.415  41.276872  51.0     0
Go ahead and plot height and weight. The relationship is visibly curved, now that we’ve included the non-adult individuals.
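For example:

plt.scatter(d['weight'], d['height'], facecolor="none", edgecolor="C0", alpha=0.5)
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()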
The most common polynomial regression is a parabolic model of the mean. Let \(x\) be standardized body weight. Then the parabolic equation for the mean height is:
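In symbols (with \(x_i\) the standardized weight):

\[
\mu_i = \alpha + \beta_1 x_i + \beta_2 x_i^2
\]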
The above is a parabolic (second order) polynomial. The \(α+β_1 x_i\) part is the same linear function of x in a linear regression, just with a little “1” subscript added to the parameter name, so we can tell it apart from the new parameter. The additional term uses the square of \(x_i\) to construct a parabola, rather than a perfectly straight line. The new parameter \(β_2\) measures the curvature of the relationship.
Fitting these models to data is easy. Interpreting them can be hard. We’ll begin with the easy part, fitting a parabolic model of height on weight. The first thing to do is to standardize the predictor variable. We’ve done this in previous examples. But this is especially helpful for working with polynomial models. When predictor variables have very large values in them, there are sometimes numerical glitches. Even well-known statistical software can suffer from these glitches, leading to mistaken estimates. These problems are very common for polynomial regression, because the square or cube of a large number can be truly massive. Standardizing largely resolves this issue. It should be your default behavior.
To define the parabolic model, just modify the definition of \(μ_i\). Here’s the model:
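A sketch of the full model (the \(\alpha\) and \(\sigma\) priors carry over from the earlier height model; the \(\text{Normal}(0, 1)\) prior on \(\beta_2\) is an assumption, and it is exactly the choice discussed just below):

\[
\begin{aligned}
h_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 \\
\alpha &\sim \text{Normal}(178, 20) \\
\beta_1 &\sim \text{Log-Normal}(0, 1) \\
\beta_2 &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Uniform}(0, 50)
\end{aligned}
\]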
The confusing issue here is assigning a prior for \(β_2\), the parameter on the squared value of \(x\). Unlike \(β_1\), we don’t want a positive constraint. In the practice problems at the end of the chapter, you’ll use prior predictive simulation to understand why. These polynomial parameters are in general very difficult to understand. But prior predictive simulation does help a lot.
Approximating the posterior is straightforward. Just modify the definition of mu so that it contains both the linear and quadratic terms. But in general it is better to pre-process any variable transformations—you don’t need the computer to recalculate the transformations on every iteration of the fitting procedure. So I’ll also build the square of weight_s as a separate variable and place weight_s2 inside Stan’s transformed data block:
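A sketch of that approach (the parameter names b1 and b2 match the precis table below; the transformed data block and prior values are assumptions consistent with the rest of this notebook):

d['weight_s'] = (d['weight'] - d['weight'].mean()) / d['weight'].std()

m4_5 = '''
data {
    int<lower=0> N;
    vector[N] height;
    vector[N] weight_s;
}
transformed data {
    vector[N] weight_s2 = square(weight_s);  // pre-computed square of standardized weight
}
parameters {
    real a;
    real<lower=0> b1;
    real b2;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu = a + b1 * weight_s + b2 * weight_s2;
    height ~ normal(mu, sigma);
    a ~ normal(178, 20);
    b1 ~ lognormal(0, 1);
    b2 ~ normal(0, 1);
    sigma ~ uniform(0, 50);
}
'''

data = {'N': len(d), 'height': d['height'].tolist(), 'weight_s': d['weight_s'].tolist()}
m4_5_model = utils.StanQuap('stan_models/m4_5', m4_5, data, algorithm='Newton')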
Now, since the relationship between the outcome height and the predictor weight depends upon two slopes, b1 and b2, it isn’t so easy to read the relationship off a table of coefficients:
m4_5_model.precis().round(2)
Parameter    Mean  StDev    5.5%   94.5%
a          146.06   0.37  145.47  146.65
b1          21.73   0.29   21.27   22.19
b2          -7.80   0.27   -8.24   -7.37
sigma        5.77   0.18    5.49    6.06
The parameter \(α\) (a) is still the intercept, so it tells us the expected value of height when weight is at its mean value. But it is no longer equal to the mean height in the sample, since there is no guarantee it should in a polynomial regression. And those \(β_1\) and \(β_2\) parameters are the linear and square components of the curve. But that doesn’t make them transparent.
You have to plot these model fits to understand what they are saying. So let’s do that. We’ll calculate the mean relationship and the 89% intervals of the mean and the predictions, like in the previous section. Here’s the working code:
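A sketch using the same link machinery as before (lm_poly is a hypothetical helper mirroring the quadratic linear model; a prediction-interval version would add a generated quantities block as in m4_3_pred):

weight_seq_s = np.linspace(d['weight_s'].min(), d['weight_s'].max(), 30)

def lm_poly(post, data):
    weight_s = data
    a = post['a'].reshape(-1, 1)
    b1 = post['b1'].reshape(-1, 1)
    b2 = post['b2'].reshape(-1, 1)
    return a + b1 * weight_s + b2 * weight_s**2

mu = m4_5_model.link(lm_func=lm_poly, predictor=weight_seq_s)
mu_mean = mu.mean(axis=0)
mu_ci = np.quantile(mu, [0.055, 0.945], axis=0)

plt.scatter(d['weight_s'], d['height'], facecolor="none", edgecolor="C0", alpha=0.5)
plt.plot(weight_seq_s, mu_mean, "C0-")
plt.fill_between(weight_seq_s, mu_ci[0], mu_ci[1], color="C0", alpha=0.3)
plt.xlabel("Weight (standardized)")
plt.ylabel("Height (cm)")
plt.show()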
Computing the curve and intervals is similarly a small modification of the previous code. This cubic curve is even more flexible than the parabola, so it fits the data even better.
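For reference, the cubic mean adds one more term (the parameter name \(\beta_3\) is assumed):

\[
\mu_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3
\]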
But it’s not clear that any of these models make a lot of sense. They are good geocentric descriptions of the sample, yes. But there are two problems. First, a better fit to the sample might not actually be a better model. Second, the model contains no biological information. We aren’t learning any causal relationship between height and weight.
Design Matrix: Stan Tips
We can combine the 3 separate Stan model definitions into one by turning weight (and its transformations as regressors), its associated \(\beta_{j}\) coefficients, and the \(\alpha\) intercept into a matrix.1
The left panel of the figure shows the familiar linear regression from earlier in the chapter, but now with the standardized predictor and full data with both adults and non-adults. The linear model makes some spectacularly poor predictions, at both very low and middle weights. Compare this to the middle panel, our new quadratic regression. The curve does a better job of finding a central path through the data. The right panel shows a higher-order polynomial regression, a cubic regression on weight.
Linear, additive, funky
The parabolic model of \(μ_i\) above is still a “linear model” of the mean, even though the equation is clearly not of a straight line. Unfortunately, the word “linear” means different things in different contexts, and different people use it differently in the same context. What “linear” means in this context is that \(μ_i\) is a linear function of any single parameter. Such models have the advantage of being easier to fit to data. They are also often easier to interpret, because they assume that parameters act independently on the mean. They have the disadvantage of being used thoughtlessly. When you have expert knowledge, it is often easy to do better than a linear model. These models are geocentric devices for describing partial correlations. We should feel embarrassed to use them, just so we don’t become satisfied with the phenomenological explanations they provide.
Converting back to natural scale
The plots above have standard units on the horizontal axis. These units are sometimes called z-scores. But suppose you fit the model using standardized variables, but want to plot the estimates on the original scale. All that’s really needed is to transform the xticks when you plot the raw data:
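A sketch of that trick (the tick locations are chosen at round natural-scale weights; the specific values are arbitrary):

plt.scatter(d['weight_s'], d['height'], facecolor="none", edgecolor="C0", alpha=0.5)
at = np.array([5, 20, 35, 50, 65])  # natural-scale weights to label
plt.xticks((at - d['weight'].mean()) / d['weight'].std(), at)
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()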
The xticks at the end there overrides the horizontal axis. Then you explicitly construct the axis, by reversing the (d.weight.values - d.weight.mean())/d.weight.std() transformation back to original scale.
2.2 Splines
The second way to introduce a curve is to construct something known as a spline. The word spline originally referred to a long, thin piece of wood or metal that could be anchored in a few places in order to aid drafters or designers in drawing curves. In statistics, a spline is a smooth function built out of smaller, component functions. There are actually many types of splines. The B-spline we’ll look at here is commonplace. The “B” stands for “basis,” which here just means “component.” B-splines build up wiggly functions from simpler, less-wiggly components. Those components are called basis functions. While there are fancier splines, we want to start with B-splines because they force you to make a number of choices that other types of splines automate. You’ll need to understand B-splines before you can understand fancier splines.
To see how B-splines work, we’ll need an example that is much wigglier—that’s a scientific term—than the !Kung stature data. Cherry trees blossom all over Japan in the spring each year, and the tradition of flower viewing (Hanami 花見) follows. The timing of the blossoms can vary a lot by year and century. Let’s load a thousand years of blossom dates:
d = pd.read_csv("data/cherry_blossoms.csv", sep=';')
utils.precis(d, index_name='Columns').round(2)
Columns        Mean   StDev    5.5%    94.5%
year        1408.00  350.88  867.77  1948.23
doy          104.54    6.41   94.43   115.00
temp           6.14    0.66    5.15     7.29
temp_upper     7.19    0.99    5.90     8.90
temp_lower     5.10    0.85    3.79     6.37
We’re going to work with the historical record of first day of blossom, doy, for now. It ranges from 86 (late March) to 124 (early May). The years with recorded blossom dates run from 812 CE to 2015 CE. You should go ahead and plot doy against year to see. There might be some wiggly trend in that cloud. It’s hard to tell.
Let’s try extracting a trend with a B-spline. The short explanation of B-splines is that they divide the full range of some predictor variable, like year, into parts. Then they assign a parameter to each part. These parameters are gradually turned on and off in a way that makes their sum into a fancy, wiggly curve. The long explanation contains lots more details. But all of those details just exist to achieve this goal of building up a big, curvy function from individually less curvy local functions.
Here’s a longer explanation, with visual examples. Our goal is to approximate the blossom trend with a wiggly function. With B-splines, just like with polynomial regression, we do this by generating new predictor variables and using those in the linear model, \(μ_i\). Unlike polynomial regression, B-splines do not directly transform the predictor by squaring or cubing it. Instead they invent a series of entirely new, synthetic predictor variables. Each of these synthetic variables exists only to gradually turn a specific parameter on and off within a specific range of the real predictor variable. Each of the synthetic variables is called a basis function. The linear model ends up looking very familiar:
\[ \mu_i = \alpha + w_1 B_{i,1} + w_2 B_{i,2} + w_3 B_{i,3} + \dots \]
where \(B_{i,n}\) is the n-th basis function’s value on row \(i\), and the \(w\) parameters are corresponding weights for each. The parameters act like slopes, adjusting the influence of each basis function on the mean \(μ_i\). So really this is just another linear regression, but with some fancy, synthetic predictor variables. These synthetic variables do some really elegant descriptive (geocentric) work for us.
How do we construct these basis variables \(B\)? We display the simplest case below, in which we approximate the blossom date data with a combination of linear approximations. First, I divide the full range of the horizontal axis into four parts, using pivot points called knots. The knots are shown by the + symbols in the top plot. I’ve placed the knots at even quantiles of the blossom data. In the blossom data, there are fewer recorded blossom dates deep in the past. So using even quantiles does not produce evenly spaced knots. This is why the second knot is so far from the first knot. Don’t worry right now about the code to make these knots. You’ll see it later.
Focus for now just on the picture. The knots act as pivots for five different basis functions, our B variables. These synthetic variables are used to gently transition from one region of the horizontal axis to the next. Essentially, these variables tell you which knot you are close to. Beginning on the left of the top plot, basis function 1 has value 1 and all of the others are set to zero. As we move rightwards towards the second knot, basis 1 declines and basis 2 increases. At knot 2, basis 2 has value 1, and all of the others are set to zero.
The nice feature of these basis functions is that they make the influence of each parameter quite local. At any point on the horizontal axis in the plot below, only two basis functions have non-zero values. For example, the dashed blue line in the top plot shows the year 1200. Basis functions 1 and 2 are non-zero for that year. So the parameters for basis functions 1 and 2 are the only parameters influencing prediction for the year 1200. This is quite unlike polynomial regression, where parameters influence the entire shape of the curve.
In the middle plot, we show each basis function multiplied by its corresponding weight parameter. I got these weights by fitting the model to the data. I’ll show you how to do that in a moment. Again, focus on the figure for now. Weight parameters can be positive or negative. So for example basis function 5 ends up below the zero line. It has negative weight. To construct a prediction for any given year, say for example 1200 again, we just add up these weighted basis functions at that year. In the year 1200, only basis functions 1 and 2 influence prediction. Their sum is slightly above zero (the mean).
Finally, in the bottom plot, we display the spline, as a 97% posterior interval for \(μ\), over the raw blossom date data. All the spline seems to pick up is a change in trend around 1800. You can probably guess which global climate trend this reflects. But there is more going on in the data, before 1800. To see it, we can do two things. First, we can use more knots. The more knots, the more flexible the spline. Second, instead of linear approximations, we can use higher-degree polynomials.
Using B-splines to make local, linear approximations. Top: Each basis function is a variable that turns on specific ranges of the predictor variable. At any given value on the horizontal axis, e.g. 1200, only two have non-zero values. Middle: Parameters called weights multiply the basis functions. The spline at any given point is the sum of these weighted basis functions. Bottom: The resulting B-spline shown against the data. Each weight parameter determines the slope in a specific range of the predictor variable.
Let’s build up the code that will let you reproduce the plots above, but also let you change the knots and degree to anything you like. First, we choose the knots. Remember, the knots are just values of year that serve as pivots for our spline. Where should the knots go? There are different ways to answer this question. You can, in principle, put the knots wherever you like. Their locations are part of the model, and you are responsible for them. Let’s do what we did in the simple example above and place the knots at evenly spaced quantiles of the predictor variable. This gives you more knots where there are more observations. We used only 5 knots in the first example. Now let’s go for 15:
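A minimal sketch of that step, assuming d2 holds the rows with a recorded doy (an assumption consistent with the 827 rows mentioned below):

# Keep only the years with a recorded blossom date, then place knots at even quantiles.
d2 = d[d['doy'].notna()]
num_knots = 15
knot_list = np.quantile(d2['year'], np.linspace(0, 1, num_knots))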
The next choice is polynomial degree. This determines how basis functions combine, which determines how the parameters interact to produce the spline. For degree 1, as in the plot above, two basis functions combine at each point. For degree 2, three functions combine at each point. For degree 3, four combine. Python formula libraries already provide a function that will build basis functions for any list of knots and degree. The code below constructs the necessary basis functions for a degree 3 (cubic) spline:
Here we will use formulaic as a simple way of building the b-spline matrix.2
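As an alternative sketch (not the notebook's formulaic call), scipy's BSpline.design_matrix can build an equivalent cubic basis; B-splines require each boundary knot to be repeated so that it appears degree + 1 times in the full knot vector:

from scipy.interpolate import BSpline

degree = 3
# Repeat the boundary knots so each end appears degree + 1 times.
knots_full = np.concatenate((
    np.repeat(knot_list[0], degree),
    knot_list,
    np.repeat(knot_list[-1], degree),
))
B_mat = BSpline.design_matrix(d2['year'].to_numpy(float), knots_full, degree).toarray()
B_mat.shape  # 827 rows (years) by 17 columns (15 knots + degree - 1 basis functions)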
The matrix B should have 827 rows and 17 columns. Each row is a year, corresponding to the rows in the d2 data frame. Each column is a basis function, one of our synthetic variables defining a span of years within which a corresponding parameter will influence prediction.
The plot below displays the basis functions against year:
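A sketch of that plot, one curve per column of the basis matrix (assuming it is available as B_mat, as built above and used in the model code below):

fig, ax = plt.subplots(figsize=(10, 3))
for k in range(B_mat.shape[1]):
    ax.plot(d2['year'], B_mat[:, k], color='black', alpha=0.5)
ax.set_xlabel('year')
ax.set_ylabel('basis value')
plt.show()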
Now to get the parameter weights for each basis function, we need to actually define the model and make it run. The model is just a linear regression. The synthetic basis functions do all the work. We’ll use each column of the matrix B as a variable. We’ll also have an intercept to capture the average blossom day. This will make it easier to define priors on the basis weights, because then we can just conceive of each as a deviation from the intercept. In mathematical form, we start with the probability of the data, the linear model, and the priors:
\[
\begin{aligned}
D_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \sum_{k=1}^{K} w_k B_{i,k} \\
\alpha &\sim \text{Normal}(100, 10) \\
w_k &\sim \text{Normal}(0, 10) \\
\sigma &\sim \text{Exponential}(1)
\end{aligned}
\]
That linear model might look weird. But all it is doing is multiplying each basis value by a corresponding parameter \(w_k\) and then adding up all \(K\) of those products. This is just a compact way of writing a linear model. The rest should be familiar, although I will ask you to simulate from those priors in the practice problems at the end of the chapter. You might guess already that the \(w\) priors influence how wiggly the spline can be.
This is also the first time we’ve used an exponential distribution as a prior. Exponential distributions are useful priors for scale parameters, parameters that must be positive. The prior for \(σ\) is exponential with a rate of 1. The way to read an exponential distribution is to think of it as containing no more information than an average deviation. That average is the inverse of the rate. So in this case it is \(1/1 = 1\). If the rate were 0.5, the mean would be \(1/0.5 = 2\). We’ll use exponential priors for the rest of the notebooks, in place of uniform priors. It is much more common to have a sense of the average deviation than of the maximum.
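A quick numerical check of that inverse relationship, using scipy's parameterization in which scale = 1/rate:

# Mean of an Exponential prior is 1 / rate.
print(stats.expon(scale=1 / 1.0).mean())   # rate 1   -> mean 1.0
print(stats.expon(scale=1 / 0.5).mean())   # rate 0.5 -> mean 2.0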
To build this model in StanQuap, we just need a way to do that sum. The easiest way is to use matrix multiplication3. If you aren’t familiar with linear algebra in this context, that’s fine. There is a box at the end with some more detail about why this works.
m4_7a = '''
data {
    int N;                    // number of years
    int K;                    // number of basis functions
    array[N] int doy;         // day of first blossom
    matrix[N, K] B;           // basis matrix
}
parameters {
    real a;                   // intercept: average blossom day
    vector[K] w;              // basis weights
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = a + B * w;           // weighted sum of basis functions, per year
}
model {
    doy ~ normal(mu, sigma);
    a ~ normal(100, 10);
    w ~ normal(0, 10);
    sigma ~ exponential(1);
}
'''

data = {
    'N': B_mat.shape[0],
    'K': B_mat.shape[1],
    'doy': d2['doy'].astype(int).tolist(),
    'B': B_mat.tolist(),
}

m4_7a_model = utils.StanQuap('stan_models/m4_7a', m4_7a, data,
                             algorithm='Newton', generated_var=['mu'])
You can look at the posterior means if you like. But it won’t reveal much. You should see 17 w parameters. But you can’t tell what the model thinks from the parameter summaries. Instead we need to plot the posterior predictions. First, here are the weighted basis functions:
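A sketch of the plotting step; how you pull the posterior-mean weights out of the StanQuap fit depends on the utils helper, so w_mean below is a placeholder to be filled with the 17 fitted means:

w_mean = np.zeros(B_mat.shape[1])   # placeholder: replace with the posterior means of w

fig, ax = plt.subplots(figsize=(10, 3))
for k in range(B_mat.shape[1]):
    ax.plot(d2['year'], w_mean[k] * B_mat[:, k], color='black', alpha=0.5)
ax.set_xlabel('year')
ax.set_ylabel('weighted basis value')
plt.show()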
A cubic spline with 15 knots. The top plot is, just like in the previous figure, the basis functions. However now more of these overlap. The middle plot is again each basis weighted by its corresponding parameter. And the sum of these weighted basis functions, at each point, produces the spline shown at the bottom, displayed as a 97% posterior interval of \(\mu\).
The spline is much wigglier now. Something happened around 1500, for example. If you add more knots, you can make this even wigglier. You might wonder how many knots is correct. We’ll be ready to address that question in a few more chapters. Really we’ll answer it by changing the question. So hang on to the question, and we’ll turn to it later.
Distilling the trend across years provides a lot of information. But year is not really a causal variable, only a proxy for features of each year. In the practice problems below, you’ll compare this trend to the temperature record, in an attempt to explain those wiggles.
Matrix multiplication in the spline model
Matrix algebra is a stressful topic for many scientists. If you have had a course in it, it’s obvious what it does. But if you haven’t, it is mysterious. Matrix algebra is just a new way to represent ordinary algebra. It is often much more compact. So to make model m4_7 easier to program, we used a matrix multiplication of the basis matrix B by the vector of parameters w: mu = a + B * w;. B is a matrix of N rows and K columns, and w is a column vector of length K. Here Stan (1) multiplies each column of B by the corresponding element of w and then (2) sums those products within each row.
This is how we end up with exactly what we need: a summed linear predictor for each year (row). If you haven’t worked with much linear algebra, matrix notation can be intimidating. It is useful to remember that it is nothing more than the mathematics you already know, but expressed in a highly compressed form that is convenient when working with repeated calculations on lists of numbers.
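A tiny numeric illustration of that equivalence:

# B @ w gives the same result as an explicit weighted sum across each row.
B_demo = np.array([[1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.0, 0.8, 0.2]])
w_demo = np.array([2.0, -1.0, 3.0])

print(B_demo @ w_demo)                 # [ 2.   0.5 -0.2]
print((B_demo * w_demo).sum(axis=1))   # same values, computed as explicit row sums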
2.3 Smooth Functions for a Rough World
The splines in the previous section are just the beginning. An entire class of models, generalized additive models (GAMs), focuses on predicting an outcome variable using smooth functions of some predictor variables. The topic is deep enough to deserve its own book.
3 Summary
This chapter introduced the simple linear regression model, a framework for estimating the association between a predictor variable and an outcome variable. The Gaussian distribution comprises the likelihood in such models, because it counts up the relative numbers of ways different combinations of means and standard deviations can produce an observation. To fit these models to data, the notebook introduced quadratic approximation of the posterior distribution via the StanQuap helper built on Stan. It also introduced new procedures for visualizing prior and posterior distributions.
The next notebook expands on these concepts by introducing regression models with more than one predictor variable.