03c - Geocentric Models

Author

Abdullah Mahmood

Published

March 17, 2025

# ruff: noqa: F405, F403
from init import * 
from formulaic import Formula
%config InlineBackend.figure_formats = ['svg']
Module aliases imported by init_notebook.py:
--------------------------------------------
from cmdstanpy import CmdStanModel
import bridgestan as bs
import numpy as np
import pandas as pd
import arviz as az
import utils as utils
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

Watermark:
----------
Last updated: 2025-03-17T14:47:38.253725+04:00

Python implementation: CPython
Python version       : 3.13.2
IPython version      : 9.0.2

Compiler    : Clang 16.0.0 (clang-1600.0.26.6)
OS          : Darwin
Release     : 24.4.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit

bridgestan: 2.6.1
numpy     : 2.2.4
pandas    : 2.2.3
re        : 2.2.1
watermark : 2.5.0
matplotlib: 3.10.1
seaborn   : 0.13.2
arviz     : 0.21.0
scipy     : 1.15.2
cmdstanpy : 1.2.5

1 Practice Problems

Source:

Linear Models & Causal Inference - Jake Thompson

1.1 4E1

In the model definition below, which line is the likelihood?

\[ \begin{align*} y_{i} &\sim \text{Normal}(\mu, \sigma)\\ \mu &\sim \text{Normal}(0,10)\\ \sigma &\sim \text{Exponential}(1)\\ \end{align*} \]

The first line \(y_{i} \sim \text{Normal}(\mu, \sigma)\) is the likelihood. The second line is very similar, but is instead the prior for the parameter \(μ\). The third line is the prior for the parameter \(σ\). Likelihoods and priors can look very similar, because a likelihood is effectively a prior for the residuals.

1.2 4E2

In the model definition just above, how many parameters are in the posterior distribution?

There are two parameters in the posterior distribution: \(\mu\) and \(\sigma\).

1.3 4E3

Using the model definition above, write down the appropriate form of Bayes’ theorem that includes the proper likelihood and priors.

Ignoring the specific distributions for the moment:

\[ Pr(\mu, \sigma \mid y) = \frac{Pr(y \mid \mu, \sigma) Pr(\mu) Pr(\sigma)}{\int\int Pr(y \mid \mu, \sigma) Pr(\mu) Pr(\sigma) d\mu d\sigma} \]

Now inserting the distributional assumptions:

\[ Pr(\mu, \sigma \mid y) = \frac{\prod_i \text{Normal}(y_i \mid \mu, \sigma) \, \text{Normal}(\mu \mid 0,10) \, \text{Exponential}(\sigma \mid 1) }{\int\int\prod_i \text{Normal}(y_i \mid \mu, \sigma) \, \text{Normal}(\mu \mid 0,10) \, \text{Exponential}(\sigma \mid 1) \, d\mu \, d\sigma} \]

\[ \begin{align} Pr(\mu,\sigma \mid \underline{x}) &= \frac{Pr(\underline{x} \mid \mu,\sigma)\,Pr(\mu)\,Pr(\sigma)}{Pr(\underline{x})} \propto \frac{1}{\sigma^{n}}\prod_i \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right) \times \exp\left(-\frac{1}{2}\left(\frac{\mu}{10}\right)^2\right) \times \exp(-\sigma) \\ \implies Pr(\mu,\sigma \mid \underline{x}) &= \frac{\frac{1}{\sigma^{n}}\prod_i \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right)\exp\left(-\frac{1}{2}\left(\frac{\mu}{10}\right)^2\right)\exp(-\sigma)}{\int\int \frac{1}{\sigma^{n}}\prod_i \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right)\exp\left(-\frac{1}{2}\left(\frac{\mu}{10}\right)^2\right)\exp(-\sigma)\, d\mu\, d\sigma} \end{align} \]

1.4 4E4

In the model definition below, which line is the linear model?

\[ \begin{align*} y_{i} &\sim \text{Normal}(\mu, \sigma)\\ \mu_{i} &= \alpha + \beta x_i\\ \alpha &\sim \text{Normal}(0,10)\\ \beta &\sim \text{Normal}(0,1)\\ \sigma &\sim \text{Exponential}(2)\\ \end{align*} \]

The second line \(\mu_{i} = \alpha + \beta x_i\) is the linear model.

1.5 4E5

In the model definition just above, how many parameters are in the posterior distribution?

There are three parameters in the posterior distribution: \(\alpha\), \(\beta\), and \(\sigma\). The symbol \(\mu\) is no longer a parameter in the posterior, because it is entirely determined by \(\alpha\), \(\beta\), and \(x\).

1.6 4M1

For the model definition below, simulate observed \(y\) values from the prior (not the posterior).

\[ \begin{align*} y_{i} &\sim \text{Normal}(\mu, \sigma)\\ \mu &\sim \text{Normal}(0,10)\\ \sigma &\sim \text{Exponential}(1)\\ \end{align*} \]

To sample from the prior distribution of y, we use stats.norm.rvs to simulate, while averaging over the prior distributions of \(μ\) and \(σ\). The easiest way to do this is to sample from the priors and then pass those samples to stats.norm.rvs to simulate y. This code will sample from the priors:

mu_prior = stats.norm.rvs(loc=0, scale=10, size=int(1e4))
sigma_prior = stats.expon.rvs(scale=1, size=int(1e4))  # Exponential(1): scale = 1/rate

You may want to visualize these samples, just to help school your intuition for the priors.
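
For instance, a quick look at the prior draws (a sketch, reusing the kernel-density helpers from this notebook):

def plot_prior_dens():
    _, axes = plt.subplots(1, 2, figsize=(8, 3))
    for ax, samples, label in zip(axes, [mu_prior, sigma_prior], [r'$\mu$', r'$\sigma$']):
        az.plot_kde(samples, bw=utils.bw_nrd0(samples), ax=ax)
        ax.set(xlabel=f'{label} prior sample', ylabel='Density')

plot_prior_dens()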

Now to simulate heights that average over these prior distributions of parameters:

y_sim = stats.norm.rvs(loc=mu_prior, scale=sigma_prior, size=int(1e4))
def plot_dens():
    bw = utils.bw_nrd0(y_sim)
    az.plot_kde(y_sim, bw=bw*0.5)
    plt.xlabel(r"Sample $y$")
    plt.ylabel("Density")

plot_dens()

Note that the prior distribution of y is centered on zero. If that strikes you as odd, remember that a continuous variable like y can always be re-centered to any value you like, without changing any of the information in the data.

1.7 4M2

Translate the model just above into a Stan model.

m4_4m1 = '''
data {
  int<lower=0> N;
}
generated quantities {
  real mu = normal_rng(0, 10);
  real sigma = exponential_rng(1);
  real y_sim = normal_rng(mu, sigma);
}
'''

m4_4m1_model = utils.Stan('stan_models/m4_4m1', m4_4m1)
y_sim = m4_4m1_model.laplace_sample(data={'N': 1}, draws=int(1e4)).stan_variable('y_sim')

def plot_dens():
    bw = utils.bw_nrd0(y_sim)
    az.plot_kde(y_sim, bw=bw*0.5)
    plt.xlabel(r"Sample $y$")
    plt.ylabel("Density")

plot_dens()

1.8 4M3

Translate the quap model formula below into a mathematical model definition.

y ~ dnorm( mu , sigma ),
mu <- a + b*x,
a ~ dnorm( 0 , 10 ),
b ~ dunif( 0 , 1 ),
sigma ~ dexp( 1 )

\[ \begin{align*} y_{i} &\sim \text{Normal}(\mu_i, \sigma)\\ \mu_{i} &= \alpha + \beta x_i\\ \alpha &\sim \text{Normal}(0,10)\\ \beta &\sim \text{Uniform}(0,1)\\ \sigma &\sim \text{Exponential}(1)\\ \end{align*} \]

1.9 4M4

A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.

\[ \begin{align*} h_{i} &\sim \text{Normal}(\mu_i, \sigma)\\ \mu_{i} &= \alpha + \beta y_i\\ \alpha &\sim \text{Normal}(0,100)\\ \beta &\sim \text{Log-Normal}(0,1)\\ \sigma &\sim \text{Uniform}(0,50)\\ \end{align*} \]

Where \(h\) is height and \(y\) is year. These priors aren’t great, but they’ll do. The prior on the intercept \(α\) is effectively uninformative; the problem didn’t say what scale height is measured on, so it’s hard to do much else here. A prior on \(β\) centered on zero would correspond to no impact of year, which doesn’t seem right, because surely height increases each year or at worst stays the same. The Log-Normal prior above instead restricts the slope to positive values.

  • \(\mu = \alpha + \beta y\) <- Assume the mean has some linear dependence on time.

  • \(\beta \sim \text{Log-Normal}(0,1)\) <- It seems perfectly sensible that the amount an average student’s height changes per year should only be positive. Additionally, I have no idea what the growth rate is so this encodes ignorance about the rate of change of height per year.

There are other options, such as truncated distributions, but those haven’t been introduced yet.
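
For illustration, a half-normal prior on \(\beta\) (a normal truncated below at zero) could be sampled like this; the scale of 5 cm per year is just an assumption for this sketch:

# stats.truncnorm takes its bounds in standard-deviation units relative to loc,
# so a=0, b=inf with loc=0, scale=5 is a Normal(0, 5) truncated at zero
beta_trunc = stats.truncnorm.rvs(a=0, b=np.inf, loc=0, scale=5, size=int(1e4))
np.quantile(beta_trunc, [0.055, 0.945])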

a = stats.norm.rvs(0, 100, size=50)
b = stats.lognorm.rvs(s=1, scale=np.exp(0), size=50)  # Log-Normal(0, 1): s is the SD on the log scale
sigma = stats.uniform.rvs(0, 50, size=50)

year = (np.arange(3)+1).reshape(-1,1)

height_pred = stats.norm.rvs(a + (b * year), sigma)

def plot_ppc():
    plt.plot(year, height_pred, color='k', alpha=0.2)
    plt.xlabel('Year')
    plt.ylabel('Height')

plot_ppc()

\[ \begin{align} h_{ij} &\sim \text{Normal}(\mu_{ij}, \sigma) \\ \mu_{ij} &= \alpha + \beta(y_j - \bar{y}) \\ \alpha &\sim \text{Normal}(100, 10) \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{align} \]

Because year is centered, \(\alpha\) represents the average height in the average year (i.e., year 2). The prior of \(\text{Normal}(100, 10)\) was chosen assuming that height is measured in centimeters and that the sample is of children who are still growing.

The prior on the slope is quite vague: it is centered on zero, and its standard deviation of 10 allows a wide range of possible growth (or shrinkage). During growth spurts, height gain averages roughly 6–13 cm per year, so a standard deviation of 10 encompasses the range we might expect to see if growth were occurring at a high rate.

Finally, the \(\text{Exponential}(1)\) prior on \(\sigma\) implies a prior mean of 1 cm for the residual standard deviation.

Prior predictive simulations also appear to give reasonably plausible regression lines, given our current assumptions.

a = stats.norm.rvs(100, 10, size=50)
b = stats.norm.rvs(0, 10, size=50)
sigma = stats.expon.rvs(scale=1, size=50)  # Exponential(1): scale = 1/rate

year = (np.arange(3)+1).reshape(-1,1)

height_pred = stats.norm.rvs(a + b * (year - np.mean(year)), sigma)

def plot_ppc():
    plt.plot(year, height_pred, color='k', alpha=0.2)
    plt.xlabel('Year')
    plt.ylabel('Height')

plot_ppc()

1.10 4M5

Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How?

This one is subtle. On the one hand, having information on the measurement scale lets us set a more sensible prior for the intercept, \(α\). You might center it on 120 now, for example. On the other hand, for the prior to really be “prior,” it needs to reflect ignorance of the actual data values. Since the observed mean is a feature of the sample, not of the measurement scale, basing the prior on it could get you into trouble. What kind of trouble? It could lead to an illusory fit to the data, since essentially you’ve used the data twice: once in setting the prior and once in conditioning on the data (computing the posterior).

Much later, we’ll see how constraints on the data can be used to design models, without running the risk of using the data twice.

Because we know that an increase in year will always lead to increased height, we know that \(\beta\) will be positive. Therefore, our prior should reflect this by using, for example, a log-normal distribution.

\[ \beta \sim \text{Log-Normal}(1,0.5) \]

This prior gives an expectation of about 3cm per year, with the 89% highest density interval between 0.87cm and 5.18cm per year.
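
The expectation follows from the mean of a log-normal distribution, \(\exp(\mu + \sigma^2/2)\):

np.exp(1 + 0.5**2 / 2)  # ~3.08 cm per year for Log-Normal(1, 0.5)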

def plot_dens():
    x = np.linspace(0,11,1000)
    samples = stats.lognorm.pdf(x, s=0.5, scale=np.exp(1))
    
    plt.plot(x, samples)
    plt.xlim(-1,11)
    plt.xlabel(r"$\beta$")
    plt.ylabel("Density")

plot_dens()

prior_lnorm = stats.lognorm.rvs(s=0.5, scale=np.exp(1), size=int(1e5))

np.quantile(prior_lnorm, 0.055)
np.quantile(prior_lnorm, 1-0.055)

az.hdi(prior_lnorm, hdi_prob=0.89)
array([0.85529522, 5.16528497])

Prior predictive simulations of plausible lines using this new log-normal prior indicate that these priors still represent plausible values. Most of the lines are positive, due to the prior constraint. However, because of variation around the mean, some simulated trajectories do show a decrease in height. If it is truly impossible for students to shrink, then data like this might arise from measurement error.

a = stats.norm.rvs(100, 10, size=50)
b = stats.lognorm.rvs(s=0.5, scale=np.exp(1), size=50)
sigma = stats.expon.rvs(scale=1, size=50)  # Exponential(1): scale = 1/rate

year = (np.arange(3)+1).reshape(-1,1)

height_pred = stats.norm.rvs(a + b * (year - np.mean(year)), sigma)

def plot_ppc():
    plt.plot(year, height_pred, color='k', alpha=0.2)
    plt.xlabel('Year')
    plt.ylabel('Height')

plot_ppc()

1.11 4M6

Now suppose I tell you that the variance among heights for students of the same age is never more than 64cm. How does this lead you to revise your priors?

Again, this is a subtle issue, and all the advice from just above applies. Note also that the problem states a variance, not a standard deviation: a variance of 64 cm\(^2\) corresponds to a standard deviation of \(\sqrt{64} = 8\) cm. It’s also not clear whether “variance among heights for students of the same age is never more than 64” is a feature of the sample or of the population. If it’s of the population, then setting a prior for it may be okay. In that case:

\[ \sigma \sim \text{Uniform}(0,8) \]

would make sense.
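
A quick prior predictive check of this choice (a sketch; the group mean of 130 cm is an arbitrary placeholder):

sigma_prior = stats.uniform.rvs(0, 8, size=int(1e4))     # sigma never exceeds 8 cm
h_same_age = stats.norm.rvs(loc=130, scale=sigma_prior)  # heights within one age group
np.quantile(h_same_age, [0.055, 0.945])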

1.12 4M7

Refit model m4_3 from the notebook, but omit the mean weight xbar this time. Compare the new model’s posterior to that of the original model. In particular, look at the covariance among the parameters. What is different? Then compare the posterior predictions of both models.

d = pd.read_csv("data/Howell1.csv", sep=';')
d2 = d[d['age'] >= 18]
m4_3_pred = '''
data {
    int<lower=1> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
    
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;
}
parameters {
    real a;
    real<lower=0> b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * (weight - xbar);
    
    // Likelihood Function
    height ~ normal(mu, sigma);
    
    // Priors
    a ~ normal(178, 20);
    b ~ lognormal(0, 1);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde; 
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * (weight_tilde[i] - xbar), sigma);
    }
}
'''
weight_seq = np.arange(25, 71)
data = {'N': len(d2), 
        'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(), 
        'weight': d2['weight'].tolist(),
        'N_tilde': len(weight_seq),
        'weight_tilde': weight_seq.tolist()}

m4_3_model = utils.StanQuap('stan_models/m4_3_pred', m4_3_pred, 
                            data, algorithm = 'Newton', 
                            generated_var=['y_tilde'])
m4_3_model.precis().round(2)
m4_3_model.vcov_matrix().round(3)
array([[ 0.073, -0.   ,  0.   ],
       [-0.   ,  0.002, -0.   ],
       [ 0.   , -0.   ,  0.037]])
def lm(post, data):
    weight = data
    xbar = d2['weight'].mean() # ensure that xbar is based on the dataset supplied during fit 
    a, b = post['a'].reshape(-1,1), post['b'].reshape(-1,1)
    return a + b * (weight - xbar)

mu = m4_3_model.link(lm_func=lm, predictor=weight_seq)
mu_mean, mu_plo, mu_phi = utils.precis(mu)
sim = m4_3_model.sim(data=data)['y_tilde']
_, sim_plo, sim_phi = utils.precis(sim)

def plot_post_sim():
    plt.plot(d2.weight, d2.height, 'o', fillstyle='none', label='Observed Data', color='k')
    plt.plot(weight_seq, mu_mean, 'k', label='Mean')
    plt.fill_between(weight_seq, mu_plo, mu_phi, alpha=0.2, label='Uncertainty in Mean')
    plt.fill_between(weight_seq, sim_plo, sim_phi, alpha=0.2, label='Posterior Predictive Samples')
    plt.legend()
    plt.xlabel('Weight (kg)')
    plt.ylabel('Height (cm)')    

plot_post_sim()

m4_3c = '''
data {
    int<lower=0> N;
    vector[N] height;
    vector[N] weight;
    
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;    
}
parameters {
    real a;
    real<lower=0> b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * weight;
    
    // Likelihood Function
    height ~ normal(mu, sigma);
    
    // Priors
    a ~ normal(178, 20);
    b ~ lognormal(0, 1);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde; 
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * weight_tilde[i], sigma);
    }
}
'''

weight_seq = np.arange(25, 71)
data = {'N': len(d2), 
        'height': d2['height'].tolist(), 
        'weight': d2['weight'].tolist(),
        'N_tilde': len(weight_seq),
        'weight_tilde': weight_seq.tolist()}

m4_3c_model = utils.StanQuap('stan_models/m4_3c', m4_3c, 
                            data, algorithm = 'Newton', 
                            generated_var=['y_tilde'])
m4_3c_model.precis().round(2)
m4_3c_model.vcov_matrix().round(3)
array([[ 3.6e+00, -7.8e-02,  9.0e-03],
       [-7.8e-02,  2.0e-03, -0.0e+00],
       [ 9.0e-03, -0.0e+00,  3.7e-02]])
def lm(post, data):
    weight = data
    a, b = post['a'].reshape(-1,1), post['b'].reshape(-1,1)
    return a + b * weight

mu = m4_3c_model.link(lm_func=lm, predictor=weight_seq)
mu_mean, mu_plo, mu_phi = utils.precis(mu)
sim = m4_3c_model.sim(data=data)['y_tilde']
_, sim_plo, sim_phi = utils.precis(sim)

def plot_post_sim():
    plt.plot(d2.weight, d2.height, 'o', fillstyle='none', label='Observed Data', color='k')
    plt.plot(weight_seq, mu_mean, 'k', label='Mean')
    plt.fill_between(weight_seq, mu_plo, mu_phi, alpha=0.2, label='Uncertainty in Mean')
    plt.fill_between(weight_seq, sim_plo, sim_phi, alpha=0.2, label='Posterior Predictive Samples')
    plt.legend()
    plt.xlabel('Weight (kg)')
    plt.ylabel('Height (cm)')    

plot_post_sim()

The posterior summaries of the two models look rather similar. However, the entries in the variance-covariance matrix (particularly for the intercept) are a couple of orders of magnitude larger when one doesn’t subtract the mean weight from the predictor. Why is this? It’s because the parameters signify different things in the two models.

Inspect the model again:

Originally we had

\[ \mu_i = \alpha + \beta(x_i - \bar{x}) \]

and in the second run we are instead deciding to fit

\[ \mu_i = \alpha' + \beta x_i \]

We can see that doing this in no way changes the model being fit to the data, since one parameterization can be derived from the other:

\[ \mu_i = \alpha + \beta(x_i - \bar{x}) = \alpha - \beta \bar{x} + \beta x_i = \alpha' + \beta x_i \]

That is to say

\[ \alpha' = \alpha - \beta \bar{x} \]

So as we can see, we can literally derive one parameter from the other. Why, then, are the variance and covariance for the posterior of \(\alpha'\) so much larger than for \(\alpha\)? It’s because of what these parameters represent. For \(\alpha\), note that \(\mu = \alpha\) when \(x = \bar{x}\); that is, \(\alpha\) represents the mean height at the mean weight, so the uncertainty in \(\alpha\) is the uncertainty in \(\mu\) right where the data are most informative. On the other hand, \(\mu = \alpha'\) when \(x = 0\); that is, \(\alpha'\) represents the y-intercept, the predicted height at zero weight, far outside the range of the observed data, so it is estimated with much more uncertainty and its estimate is strongly correlated with the slope.
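
A quick check of this (a sketch that reuses the centered model’s posterior draws via the extract_samples convenience method): derive \(\alpha'\) from \(\alpha\) and \(\beta\) and compare their spreads.

post = m4_3_model.extract_samples(n=10_000)
xbar = d2['weight'].mean()
a_prime = post['a'] - post['b'] * xbar   # alpha' = alpha - beta * xbar
post['a'].std(), a_prime.std()           # alpha' is far more uncertain than alpha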

1.13 4M8

In the chapter, we used 15 knots with the cherry blossom spline. Increase the number of knots and observe what happens to the resulting spline. Then adjust also the width of the prior on the weights—change the standard deviation of the prior and watch what happens. What do you think the combination of knot number and the prior on the weights controls?

d = pd.read_csv("data/cherry_blossoms.csv", sep=';')
d2 = d[d[['doy']].notna().all(axis=1)]

m4_7a = '''
data {
    int N;
    int K;
    array[N] int doy;
    matrix[N, K] B;
}
parameters {
    real a;
    vector[K] w;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = a + B * w;
}
model {
    doy ~ normal(mu, sigma);
    a ~ normal(100, 10);
    w ~ normal(0, 10);
    sigma ~ exponential(1);
}
'''

m4_7a_mod = '''
data {
    int N;
    int K;
    array[N] int doy;
    matrix[N, K] B;
}
parameters {
    real a;
    vector[K] w;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = a + B * w;
}
model {
    doy ~ normal(mu, sigma);
    a ~ normal(100, 10);
    w ~ normal(0, 30);
    sigma ~ exponential(1);
}
'''

def bspline_regression_plots(data, num_knots, degree, model='m4_7a'):
    model_var = globals().get(model)
    knot_list = np.quantile(data.year, np.linspace(0, 1, num_knots))[1:-1].tolist()
    bspline_formula = Formula(f'bs(year, knots={knot_list}, degree={degree}, include_intercept=True)-1')
    B_mat = bspline_formula.get_model_matrix(data).to_numpy()

    input_data = {'N': B_mat.shape[0], 
                  'K': B_mat.shape[1],
                  'doy': data['doy'].astype(int).tolist(), 
                  'B': B_mat.tolist()}

    model = utils.StanQuap(f'stan_models/{model}',
                           model_var, input_data, algorithm = 'Newton',
                           generated_var=['mu'])
    post = model.extract_samples(n=1000, drop=[])
    w = post['w'].mean(axis=0)
    
    fig, axs = plt.subplots(3, 1, figsize=(8, 8))
    axs[0].plot(data.year, B_mat)
    axs[0].set(xlabel='Year', ylabel='Basis Value')
    
    axs[1].plot(data.year, B_mat * w)
    axs[1].set(xlabel='Year', ylabel='Basis * Weight')
    
    _, mu_plo, mu_phi = utils.precis(post['mu'])    
    axs[2].plot(data.year, data.doy, 'o', color='pink', alpha=0.6, zorder=2)
    axs[2].fill_between(data.year, mu_plo, mu_phi, alpha=0.6, zorder=3)
    axs[2].set(xlabel='Year', ylabel='Day in Year')
    
    fig.tight_layout()

num_knots = 20
bspline_regression_plots(data=d2, num_knots=num_knots, degree=3)

num_knots = 20
bspline_regression_plots(data=d2, num_knots=num_knots, degree=3, model='m4_7a_mod')

num_knots = 30
bspline_regression_plots(data=d2, num_knots=num_knots, degree=3)

Increasing the number of knots increases the ‘wiggliness’ of the spline, and increasing the standard deviation of the prior on the weights widens the HDI band around \(\mu\). Together, the number of knots and the prior on the weights control how flexible the spline is allowed to be.

1.14 4H1

The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% intervals for each of these individuals. That is, fill in the table below, using model-based predictions.

Individual  Weight  Expected Height  89% Interval
1            46.95
2            43.72
3            64.78
4            32.59
5            54.63

The question is ambiguous as to whether it wants \(µ\) only or the distribution of individual height measurements (using \(σ\)). I’ll show the harder approach, which uses \(σ\) and simulates individual heights.

I’ll use quite flat priors here, but if you used stronger priors, that’s fine. What matters is that the procedure be coherent.

Method 1: Using StanQuap convenience methods

d = pd.read_csv("data/Howell1.csv", sep=';')
d2 = d[d['age'] >= 18]

m4_3d = '''
data {
    int<lower=1> N;
    vector[N] height;
    vector[N] weight;
    
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;
}
parameters {
    real a;
    real b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * weight;
    
    // Likelihood Function
    height ~ normal(mu, sigma);
    
    // Priors
    a ~ normal(100, 100);
    b ~ normal(0, 10);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde; 
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * weight_tilde[i], sigma);
    }
}
'''

missing_heights = np.array([46.95, 43.72, 64.78, 32.59, 54.63])
data = {'N': len(d2), 
        'height': d2['height'].tolist(), 
        'weight': d2['weight'].tolist(),
        'N_tilde': len(missing_heights),
        'weight_tilde': missing_heights.tolist()}

m4_3d_model = utils.StanQuap('stan_models/m4_3d', m4_3d,
                            data, algorithm = 'Newton',
                            generated_var=['y_tilde'])

sim = m4_3d_model.sim(data=data, n=10_000)['y_tilde']
sim_mean = sim.mean(axis=0)
sim_hdi = az.hdi(sim, hdi_prob=0.89)
missing_data = pd.DataFrame({'Individual': np.arange(5)+1,
                             'Weight': missing_heights,
                             'Expected Height': sim_mean,
                             '89% Interval': [f"{x[0]} - {x[1]}" for x in sim_hdi]})
                             
missing_data.set_index('Individual')
            Weight  Expected Height       89% Interval
Individual
1            46.95       156.320006  148.058 - 164.309
2            43.72       153.433954  145.363 - 161.324
3            64.78       172.527412  164.582 - 181.329
4            32.59       143.378311  135.276 - 151.409
5            54.63       163.333577  154.817 - 171.227

Method 2: Using SciPy

We first produce samples from the quadratic approximate posterior. We could take the model’s means and variance-covariance matrix and pass them to scipy.stats.multivariate_normal directly, or use the convenient extract_samples method:

post = m4_3d_model.extract_samples(n=10_000)

Now we want to plug the weights in the table into this model and average over the posterior to compute predictions for each individual’s height. As before, I’ll take the harder approach that uses \(σ\) and simulates individual heights.

For the first individual, the code might look like this:

y = stats.norm.rvs(loc=post['a'] + post['b'] * 46.95, scale=post['sigma'])
y.mean()
az.hdi(y, hdi_prob=0.89)
array([148.3268378 , 164.53844283])

How does the code work? The first line, which includes stats.norm.rvs, simulates 10,000 heights, using the samples from the posterior and an assumed weight of 46.95 kg. The second line then computes the average of these simulated heights, which gives the expected (mean) height. The third line computes the 89% HDI of height.

Do the above for each row in the table, and you should get numbers close to the table produced earlier.
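
For instance, looping over the table’s weights with the same posterior samples (a sketch):

for w in [46.95, 43.72, 64.78, 32.59, 54.63]:
    y = stats.norm.rvs(loc=post['a'] + post['b'] * w, scale=post['sigma'])
    print(w, y.mean().round(2), az.hdi(y, hdi_prob=0.89).round(2))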

mean_array = np.array(list(m4_3d_model.opt_params.values()))
cov = m4_3d_model.vcov_matrix()
post = stats.multivariate_normal(mean_array, cov).rvs(10_000)

y = stats.norm.rvs(loc=post[:,0] + post[:,1] * 46.95, scale=post[:,2])
y.mean()
az.hdi(y, hdi_prob=0.89)
array([147.90758214, 164.41048719])

We could have also computed the expected heights straight from the MAP. That approach is fine, and will give nearly the same answer.
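
For instance (a sketch; this assumes opt_params stores the parameters in the order a, b, sigma, as in the code above):

a_map, b_map, sigma_map = m4_3d_model.opt_params.values()
a_map + b_map * missing_heights  # expected heights at the MAP estimates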

1.15 4H2

Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.

  1. Fit a linear regression to these data, using StanQuap. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?
  2. Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Super-impose the MAP regression line and 89% interval for the mean. Also superimpose the 89% interval for predicted heights.
  3. What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.
d = pd.read_csv("data/Howell1.csv", sep=';')
d2 = d[d['age'] < 18]
def plot_data():
    plt.plot(d2.weight, d2.height, 'o', fillstyle='none')
    plt.ylabel("Height (cm)")
    plt.xlabel("Weight (kg)")

plot_data()

  1. Fit a linear regression to these data, using StanQuap. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?

Now find the quadratic approximate posterior. The code is the same as before. The data frame just changes.

m4_3d = '''
data {
    int<lower=1> N;
    vector[N] height;
    vector[N] weight;
    
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;
}
parameters {
    real a;
    real b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * weight;
    
    // Likelihood Function
    height ~ normal(mu, sigma);
    
    // Priors
    a ~ normal(100, 100);
    b ~ normal(0, 10);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde; 
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * weight_tilde[i], sigma);
    }
}
'''

weight_seq = np.linspace(1,45,50)
data = {'N': len(d2), 
        'height': d2['height'].tolist(), 
        'weight': d2['weight'].tolist(),
        'N_tilde': len(weight_seq),
        'weight_tilde': weight_seq.tolist()}

m4_3d_model = utils.StanQuap('stan_models/m4_3d', m4_3d,
                            data, algorithm = 'Newton',
                            generated_var=['y_tilde'])

m4_3d_model.precis().round(2)
            Mean  StDev   5.5%  94.5%
Parameter
a          58.24   1.40  56.01  60.47
b           2.72   0.07   2.61   2.83
sigma       8.44   0.43   7.75   9.13

The estimates suggest that the MAP coefficient for weight is about 2.7. This implies that for each additional 1 kg of weight, the model predicts an average increase of 2.7 cm in height; so for every 10 units (10 kg) of increase in weight, the model predicts a child gets about 27 cm taller.

  1. Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Super-impose the MAP regression line and 89% interval for the mean. Also superimpose the 89% interval for predicted heights.

Now to plot the raw data and superimpose the model estimates, modify the code from earlier. We will sample from the quadratic approximate posterior, then compute 89% intervals for the mean and for predicted heights. This is what the complete code looks like, if you opt not to use the convenience methods link and sim:

post = m4_3d_model.extract_samples(n=10_000)
mu = post['a'].reshape(-1,1) + post['b'].reshape(-1,1) * weight_seq
mu_mean = mu.mean(axis=0)
mu_ci = az.hdi(mu, hdi_prob=0.89)
pred = stats.norm.rvs(loc=mu, scale=post['sigma'].reshape(-1,1))
pred_ci = az.hdi(pred, hdi_prob=0.89)

def plot_post():
    plt.plot(d2.weight, d2.height, 'o', fillstyle='none', label='Observed Data', color='k')
    plt.plot(weight_seq, mu_mean, 'k', label='Mean')
    plt.plot(weight_seq, mu_ci[:,0], 'k--')
    plt.plot(weight_seq, mu_ci[:,1], 'k--')
    plt.plot(weight_seq, pred_ci[:,0], 'k--')
    plt.plot(weight_seq, pred_ci[:,1], 'k--')
    plt.xlabel('Weight')
    plt.ylabel('Height')

plot_post()

  1. What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.

The major problem with this model appears to be that the relationship between weight and height, for non-adults, isn’t very linear. Instead it is curved. As a result, at low weight values, the predicted mean is above most of the actual heights. At middle weight values, the predicted mean is below most of the heights. Then again at high weight values, the mean is above the heights.

A parabolic model would likely fit these data much better. But that’s not the only option. What we’re after essentially is some way to model a reduction of the slope between height and weight, as weight increases.

1.16 4H3

Suppose a colleague of yours, who works on allometry, glances at the practice problems just above. Your colleague exclaims, “That’s silly. Everyone knows that it’s only the logarithm of body weight that scales with height!” Let’s take your colleague’s advice and see what happens.

  1. Model the relationship between height (cm) and the natural logarithm of weight (log-kg). Use the entire Howell1 data frame, all 544 rows, adults and non-adults. Can you interpret the resulting estimates?

  2. Begin with this plot: plt.plot(d.weight, d.height). Then use samples from the quadratic approximate posterior of the model in (a) to superimpose on the plot: (1) the predicted mean height as a function of weight, (2) the 97% interval for the mean, and (3) the 97% interval for predicted heights.

d = pd.read_csv("data/Howell1.csv", sep=';')
  1. Model the relationship between height (cm) and the natural logarithm of weight (log-kg). Use the entire Howell1 data frame, all 544 rows, adults and non-adults. Can you interpret the resulting estimates?

We can just use log inside the Stan model:

mlw = '''
data {
    int<lower=1> N;
    vector[N] height;
    vector[N] weight;
    
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;
}
parameters {
    real a;
    real b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * log(weight);
    
    // Likelihood Function
    height ~ normal(mu, sigma);
    
    // Priors
    a ~ normal(138, 100);
    b ~ normal(0, 100);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde; 
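    // weight_tilde is supplied on the log scale (log-kg), so no log() is applied here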
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * weight_tilde[i], sigma);
    }
}
'''

lw_seq = np.linspace(1.4,4.2,50)
data = {'N': len(d), 
        'height': d['height'].tolist(), 
        'weight': d['weight'].tolist(),
        'N_tilde': len(lw_seq),
        'weight_tilde': lw_seq.tolist()}

mlw_model = utils.StanQuap('stan_models/mlw', mlw,
                            data, algorithm = 'Newton',
                            generated_var=['y_tilde'])

mlw_model.precis().round(2)
            Mean  StDev    5.5%   94.5%
Parameter
a         -23.79   1.34  -25.92  -21.66
b          47.08   0.38   46.47   47.69
sigma       5.13   0.16    4.89    5.38

Pretty hard to know what to make of these estimates, aside from the fact that the intervals are quite narrow, owing to there being 544 rows. The estimate for b (\(β\)) is hard to understand, because it refers to log-kg, not raw kg: for every increase of 1 log-kg of weight, you expect an increase of about 47 cm of height. But what’s a log-kg? You want to know what the model predicts on the natural scale of measurement. So now to plotting…
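
One quick way to read the slope on the natural scale (a sketch using the MAP estimate from the table above): each doubling of body weight adds about \(b \cdot \log 2\) cm of predicted height.

47.08 * np.log(2)  # ~32.6 cm of predicted height per doubling of body weight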

  1. Begin with this plot: plt.plot(d.weight, d.height). Then use samples from the quadratic approximate posterior of the model in (a) to superimpose on the plot: (1) the predicted mean height as a function of weight, (2) the 97% interval for the mean, and (3) the 97% interval for predicted heights.

Begin by sampling from the naive posterior and computing the confidence intervals as per the examples in the book. Again, I’ll not use the convenience functions, but it’s fine if you did.

post = mlw_model.extract_samples(n=10_000)
mu = post['a'].reshape(-1,1) + post['b'].reshape(-1,1) * lw_seq
mu_mean = mu.mean(axis=0)
mu_ci = az.hdi(mu, hdi_prob=0.89)
pred = stats.norm.rvs(loc=mu, scale=post['sigma'].reshape(-1,1))
pred_ci = az.hdi(pred, hdi_prob=0.89)

Now you have lists of numbers that can be plotted to produce the confidence curves. But how to translate back to the non-log scale of measurement on the horizontal axis? Just exponentiate the x-axis coordinates in lw_seq, and the job is done:

def plot_post():
    weight_seq = np.exp(lw_seq)
    plt.plot(d.weight, d.height, 'o', fillstyle='none', label='Observed Data', color='k')
    plt.plot(weight_seq, mu_mean, 'k', label='Mean')
    plt.plot(weight_seq, mu_ci[:,0], 'k--')
    plt.plot(weight_seq, mu_ci[:,1], 'k--')
    plt.plot(weight_seq, pred_ci[:,0], 'k--')
    plt.plot(weight_seq, pred_ci[:,1], 'k--')
    plt.xlabel('Weight')
    plt.ylabel('Height')    

plot_post()

The model may have been linear, but plotted on the raw scale of measurement, it is clearly non-linear. Not only is the trend for the mean curved, but the variance around the mean is not constant, on this scale. Instead, the variance around the mean increases with weight. On the scale you fit the model on, the variance was assumed to be constant. But once you transform the measurement scale, it usually won’t be.

Notice also that the estimate for the mean is so precise that you can hardly even see the confidence interval for it. Don’t get too confident about such results, though. Remember, all inferences of the model are conditional on the model. Even estimated trends that do a terrible job of prediction can have tight confidence intervals, when the data set is large.

1.17 4H4

Plot the prior predictive distribution for the parabolic polynomial regression model in the notebook. You can modify the code that plots the linear regression prior predictive distribution. Can you modify the prior distributions of \(α\), \(β_1\), and \(β_2\) so that the prior predictions stay within the biologically reasonable outcome space? That is to say: Do not try to fit the data by hand. But do try to keep the curves consistent with what you know about height and weight, before seeing these exact data.

m4_5_priors = '''
data {
    int<lower=0> N;
    vector[N] weight_s;
}
generated quantities {
    real a = normal_rng(178, 20);
    real<lower=0> b1 = lognormal_rng(0, 1);
    real b2 = normal_rng(0, 1);
    vector[N] y_sim = a + b1 * weight_s + b2 * weight_s^2;
}
'''

weight_s_seq = np.linspace(-2.2, 2, 30)
data = {'N': len(weight_s_seq), 'weight_s': weight_s_seq}

prior_pred = utils.Stan('stan_models/m4_5_priors', m4_5_priors)
y_sim = prior_pred.laplace_sample(data=data, draws=1000).stan_variable('y_sim')

def plot_dens_ppc():
    _, axes = plt.subplots(1,2, figsize=(9,4))
    bw = utils.bw_nrd0(y_sim)
    az.plot_kde(y_sim, bw=bw*0.5, ax=axes[0])
    axes[0].set(xlabel='Sample Height', ylabel='Density')
    axes[1].plot(weight_s_seq, y_sim.T, color='k', alpha=0.2)
    axes[1].set(xlabel='Weight (Standardized)', ylabel='Height')

plot_dens_ppc()

n = int(1e4)
a_dens = np.random.normal(178, 20, size=n)
b1_dens = np.random.lognormal(0, 1, size=n)
b2_dens = np.random.normal(0, 1, size=n)
sigma_dens = np.random.uniform(0, 50, size=n)

def plot_dens():
  dens_list = [a_dens, b1_dens, b2_dens, sigma_dens]
  labels = [r'$\alpha$', r'$\beta_1$', r'$\beta_2$', r'$\sigma$']
  
  _ , axs = plt.subplots(2, 2, figsize=(6,6))
  for i, (dens, ax) in enumerate(zip(dens_list, axs.flatten())):
      bw = utils.bw_nrd0(dens)
      az.plot_kde(dens, bw=bw*0.5, ax=ax)
      ax.set(xlabel=labels[i], ylabel='Density')

plot_dens()

rand_index = np.random.choice(len(weight_s_seq)) # select a random weight
mu_dens = a_dens + b1_dens * weight_s_seq[rand_index] + b2_dens * weight_s_seq[rand_index]**2  # include the squared term
prior_h = np.random.normal(loc=mu_dens, scale=sigma_dens)

def plot_dens():
  bw = utils.bw_nrd0(prior_h)
  az.plot_kde(prior_h, bw=bw*0.5)

plot_dens()

Modified priors and their justifications are written alongside:

\(h_i \sim \text{Normal}(\mu_i,\sigma)\)

\(\mu_i = \alpha + \beta_1 (x_i - \overline{x}) - \beta_2(x_i - \overline{x})^2\) -> Note the \(-\beta_2\): combined with a positive prior on \(\beta_2\), the minus sign ensures the parabola curves in the correct (concave) direction.

\(\alpha \sim \text{Normal}(150,30)\) -> Reduced mean height at mean weight: we are now looking at a much larger age range, so we expect younger people to drag the mean height down at any given weight, relative to the adults-only data set.

\(\beta_1 \sim \text{Log-Normal}(0,1)\) -> We need a positive linear baseline to perturb with a quadratic term.

\(\beta_2 \sim \text{Exponential}(0.05)\) -> The quadratic term introduces a small deviation from linearity. Its coefficient should be positive to ensure concavity, and small so that the perturbation from linearity in weight stays modest.

\(\sigma \sim \text{Uniform}(0,50)\)

m4_5_priors2 = '''
data {
    int<lower=0> N;
    vector[N] weight_s;
}
generated quantities {
    real a = normal_rng(150, 30);
    real<lower=0> b1 = lognormal_rng(0, 1);
    real b2 = exponential_rng(0.05);
    vector[N] y_sim = a + b1 * weight_s - b2 * weight_s^2;
}
'''

weight_s_seq = np.linspace(-2.2, 2, 30)
data = {'N': len(weight_s_seq), 'weight_s': weight_s_seq}

prior_pred = utils.Stan('stan_models/m4_5_priors2', m4_5_priors2)
y_sim = prior_pred.laplace_sample(data=data, draws=1000).stan_variable('y_sim')

def plot_dens_ppc():
    _, axes = plt.subplots(1,2, figsize=(9,4))
    bw = utils.bw_nrd0(y_sim)
    az.plot_kde(y_sim, bw=bw*0.5, ax=axes[0])
    axes[0].set(xlabel='Sample Height', ylabel='Density')
    axes[1].plot(weight_s_seq, y_sim.T, color='k', alpha=0.2)
    axes[1].set(xlabel='Weight (Standardized)', ylabel='Height')

plot_dens_ppc()

n = int(1e4)
a_dens = np.random.normal(150, 30, size=n)
b1_dens = np.random.lognormal(0, 1, size=n)
b2_dens = np.random.exponential(1/0.05, size=n)
sigma_dens = np.random.uniform(0, 50, size=n)  # matches the Uniform(0, 50) prior above

def plot_dens():
  dens_list = [a_dens, b1_dens, b2_dens, sigma_dens]
  labels = [r'$\alpha$', r'$\beta_1$', r'$\beta_2$', r'$\sigma$']
  
  _ , axs = plt.subplots(2, 2, figsize=(6,6))
  
  for i, (dens, ax) in enumerate(zip(dens_list, axs.flatten())):
      bw = utils.bw_nrd0(dens)
      az.plot_kde(dens, bw=bw*0.5, ax=ax)
      ax.set(xlabel=labels[i], ylabel='Density')

plot_dens()

rand_index = np.random.choice(len(weight_s_seq))
mu_dens = a_dens + b1_dens * weight_s_seq[rand_index] - b2_dens * weight_s_seq[rand_index]**2  # include the squared term
prior_h = np.random.normal(loc=mu_dens, scale=sigma_dens)

def plot_dens():
  bw = utils.bw_nrd0(prior_h)
  az.plot_kde(prior_h, bw=bw*0.5)

plot_dens()

Using an exponential prior on \(\beta_2\) ensures the parabolic curve is concave, so ever-increasing growth in height with weight is impossible, which makes sense on physical grounds. However, extreme and absurd values still appear near the extremes of the data. One could tighten the parameters to prevent this, but as a first approximation it isn’t too bad.

Clearly there is room for improvement here. However, because it’s not intuitive how exactly each parameter affects the parabolic curve, finding a good prior distribution is really hard! After much trial and error and playing with parabola calculators online, here is what I ended up with:

\[ \begin{align} h_i &\sim \text{Normal}(\mu_i,\sigma) \\ \mu_i &= \alpha + \beta_1x_i + \beta_2x_i^2 \\ \alpha &\sim \text{Normal}(-190,5) \\ \beta_1 &\sim \text{Normal}(13,0.2) \\ \beta_2 &\sim \text{Uniform}(-0.13,-0.10) \\ \sigma &\sim \text{Uniform}(0,50) \end{align} \]

Which has the following prior predictive distribution.

m4_5_priors3 = '''
data {
    int<lower=0> N;
    vector[N] weight_s;
}
generated quantities {
    real a = normal_rng(-190, 5);
    real b1 = normal_rng(13, 0.2);
    real b2 = uniform_rng(-0.13, -0.1);
    vector[N] y_sim = a + b1 * weight_s + b2 * weight_s^2;
}
'''

weight_seq = np.linspace(27,70,100)
data = {'N': len(weight_seq), 'weight_s': weight_seq}

prior_pred = utils.Stan('stan_models/m4_5_priors3', m4_5_priors3)
y_sim = prior_pred.laplace_sample(data=data, draws=1000).stan_variable('y_sim')

def plot_dens_ppc():
    _, axes = plt.subplots(1,2, figsize=(9,4))
    bw = utils.bw_nrd0(y_sim)
    az.plot_kde(y_sim, bw=bw*0.5, ax=axes[0])
    axes[0].set(xlabel='Sample Height', ylabel='Density')
    axes[1].plot(weight_seq, y_sim.T, color='k', alpha=0.2)
    axes[1].set(xlabel='Weight (kg)', ylabel='Height')

plot_dens_ppc()

n = 1000
a = np.random.normal(-190, 5, size=n)
b1 = np.random.normal(13, 0.2, size=n)  # matches the Normal(13, 0.2) prior above
b2 = np.random.uniform(-0.13, -0.1, size=n)

weight = np.linspace(27,70,100).reshape(-1,1)

height_pred = a + (b1 * weight) + (b2 * weight**2)

def plot_ppc():
  plt.plot(weight, height_pred, color='k', alpha=0.2)
  plt.xlabel('Weight (kg)')
  plt.ylabel('Height')

plot_ppc()

1.18 4H5

Return to cherry_blossoms and model the association between blossom date (doy) and March temperature (temp). Note that there are many missing values in both variables. You may consider a linear model, a polynomial, or a spline on temperature. How well does temperature trend predict the blossom trend?

d = pd.read_csv("data/cherry_blossoms.csv", sep=';')
d2 = d.dropna(subset=["doy", "temp"]).reset_index(drop=True)
d2['temp_s'] = utils.standardize(d2.temp, ddof=1)

We’ll try each type of model: linear, polynomial, and spline. For each, we’ll fit the model, and then visualize the predictions with the observed data.

Linear Model:

\[ \begin{aligned} \text{doy}_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta \cdot \text{t}_{s,i} \\ \alpha &\sim \mathcal{N}(100, 10) \\ \beta &\sim \mathcal{N}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{aligned} \]

Quadratic Model

\[ \begin{aligned} \text{doy}_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_1 \cdot \text{t}_{s,i} + \beta_2 \cdot \text{t}_{s,i}^2 \\ \alpha &\sim \mathcal{N}(100, 10) \\ \beta_1 &\sim \mathcal{N}(0, 10) \\ \beta_2 &\sim \mathcal{N}(0, 1) \\ \sigma &\sim \text{Exponential}(1) \end{aligned} \]

Cubic Model

\[ \begin{aligned} \text{doy}_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_1 \cdot \text{t}_{s,i} + \beta_2 \cdot \text{t}_{s,i}^2 + \beta_3 \cdot \text{t}_{s,i}^3 \\ \alpha &\sim \mathcal{N}(100, 10) \\ \beta_1 &\sim \mathcal{N}(0, 10) \\ \beta_2 &\sim \mathcal{N}(0, 1) \\ \beta_3 &\sim \mathcal{N}(0, 1) \\ \sigma &\sim \text{Exponential}(1) \end{aligned} \]

Spline Model

\[ \begin{aligned} \text{doy}_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i &= \alpha + \sum_{j=1}^{k} \beta_j B_{j,i} \\ \alpha &\sim \mathcal{N}(100, 10) \\ \beta_j &\sim \mathcal{N}(0, 10), \quad \text{for } j = 1, \dots, k \\ \sigma &\sim \text{Exponential}(1) \end{aligned} \]

def plot_data():
    plt.figure(figsize=(8,3))
    plt.plot(d2.temp_s, d2.doy, 'o', fillstyle='none')
    plt.xlabel('March Temperature')
    plt.ylabel('Day in Year')

plot_data()

By leveraging a matrix formulation (design matrix), we can set up a single Stan model that handles all four of the models described above.

m4_8 = '''
data {
    int<lower=1> N;             // Number of Observations
    int<lower=1> K;             // Number of Coefficients (including intercept)
    vector[N] doy;              // DoY Outcome
    matrix[N, K] X;             // Regressors (March Temperature)
    
    int<lower=1> N_tilde;
    matrix[N_tilde, K] X_tilde;
}
parameters {
    vector[K] b;
    real<lower=0, upper=50> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = X * b;
    
    vector[N_tilde] mu_tilde;
    mu_tilde = X_tilde * b;
}
model {
    doy ~ normal(mu, sigma);
    
    b[1] ~ normal(100, 10);
    b[2] ~ normal(0, 10);
    // Up to Cubic regression
    if (K > 2 && K < 5) {
        for (i in 3:K) {
            b[i] ~ normal(0, 1);
        }
    }
    // B-Spline
    if (K > 5) {
        for (i in 3:K) {
            b[i] ~ normal(0, 10);
        }
    }
    sigma ~ exponential(1);
}
generated quantities {
    vector[N_tilde] y_tilde;
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(mu_tilde[i], sigma);
    }
}
'''

temp_seq = np.linspace(-2.2, 3.3, 100)

def dmat(data, degree):
    poly_terms = " + ".join([f"I(temp_s**{i})" for i in range(1, degree + 1)])   
    formula = Formula(f"1 + {poly_terms}")
    return formula.get_model_matrix(data).to_numpy()

def bspline_dmat(data, num_knots, degree=3):
    knot_list = np.quantile(data.temp_s, np.linspace(0, 1, num_knots))[1:-1].tolist()
    # Maintain first/intercept column
    bspline_formula = Formula(f'bs(temp_s, knots={knot_list}, degree={degree}, include_intercept=True)')
    return bspline_formula.get_model_matrix(data).to_numpy()

def stan_data_dict(degree=2, num_knots=None):
    X = dmat(d2, degree) if num_knots is None else bspline_dmat(d2, num_knots)
    temp_seq_df = pd.DataFrame({'temp_s': temp_seq})
    X_tilde = dmat(temp_seq_df, degree) if num_knots is None else bspline_dmat(temp_seq_df, num_knots)
    return {'N': X.shape[0],
            'K': X.shape[1],
            'doy': d2['doy'].tolist(),
            'X': X.tolist(),
            'N_tilde': len(temp_seq_df),
            'X_tilde': X_tilde.tolist()}

title = ['Linear', 'Quadratic', 'Cubic', 'Spline']
def plot_model():
    model = utils.Stan('stan_models/m4_8', m4_8)
    
    _, axes = plt.subplots(2, 2, figsize=(9,4), sharex=True, sharey=True)
    for idx, ax in enumerate(axes.flatten()):
        data = stan_data_dict(degree=idx+1) if title[idx] != 'Spline' else stan_data_dict(num_knots=30)
        post = model.laplace_sample(data=data, draws=10_000).stan_variables()
        mu_mean, mu_plo, mu_phi = utils.precis(post['mu_tilde'])
        _, sim_plo, sim_phi = utils.precis(post['y_tilde'])
        
        ax.plot(d2.temp_s, d2.doy, 'o', fillstyle='none', color='k', alpha=0.3)
        ax.plot(temp_seq, mu_mean, 'k')
        ax.fill_between(temp_seq, mu_plo, mu_phi, color='green', alpha=0.2)
        ax.fill_between(temp_seq, sim_plo, sim_phi, color='blue', alpha=0.2)
        ax.set(xlabel='March Temperature (Standardized)', ylabel='Day in Year')
        ax.set_title(title[idx])   

plot_model()

Overall the predictions from each model are remarkably similar. Therefore, I would go with a linear model, as that is the simplest of the models.

1.19 4H6

Simulate the prior predictive distribution for the cherry blossom spline in the chapter. Adjust the prior on the weights and observe what happens. What do you think the prior on the weights is doing?

As a reminder, here is the cherry blossom spline model from the chapter:

\[ \begin{align} D_i &\sim \text{Normal}(\mu_i,\sigma) \\ \mu_i &= \alpha + \sum_{k=1}^Kw_kB_{k,i} \\ \alpha &\sim \text{Normal}(100, 10) \\ w_k &\sim \text{Normal}(0,10) \\ \sigma &\sim \text{Exponential}(1) \end{align} \]

We’ll also need to recreate the basis functions that the model uses. Finally, we can generate parameters from the priors and combine them with the basis functions to get the prior predictive distributions.

d = pd.read_csv("data/cherry_blossoms.csv", sep=';')
d2 = d.dropna(subset=["doy", "year"]).reset_index(drop=True)
def bspline_prior_predictive(model_name, num_knots, degree, title):
    
    year_N = np.arange(812, 2016)
    
    knot_list = np.quantile(year_N, np.linspace(0, 1, num_knots))[1:-1].tolist()
    bspline_formula = Formula(f'bs(year, knots={knot_list}, degree={degree}, include_intercept=True)')
    B_mat = bspline_formula.get_model_matrix(pd.DataFrame({'year': year_N})).to_numpy()  

    model = utils.Stan(f'stan_models/{model_name}', globals().get(model_name))
    data = {'N': B_mat.shape[0],
            'K': B_mat.shape[1],
            'X': B_mat}
    post = model.laplace_sample(data=data, draws=100).stan_variables()

    plt.figure(figsize=(8,3))
    plt.plot(year_N, post['mu_sim'].T, 'k')
    plt.xlabel('Year')
    plt.ylabel('Day of Year')
    plt.title(title)
m4_7_priors = '''
data {
    int<lower=1> N;
    int<lower=1> K;
    matrix[N, K] X;
}
generated quantities {
    vector[K] b;
    b[1] = normal_rng(100, 10);
    for (i in 2:K) {
        b[i] = normal_rng(0, 10);
    }
    vector[N] mu_sim = X * b;
}
'''

bspline_prior_predictive('m4_7_priors', 15, 3, 
                         title='Prior Predictive Distribution - Weight mu = 0, sigma = 10')

Now let’s tighten the prior on w to \(\text{Normal}(0,2)\). The prior predictive lines become much less wiggly, which is consistent with what we saw in exercise 4M8: the width of the prior on the weights controls how much the spline can flex.

m4_7_priors1 = '''
data {
    int<lower=1> N;
    int<lower=1> K;
    matrix[N, K] X;
}
generated quantities {
    vector[K] b;
    b[1] = normal_rng(100, 10);
    for (i in 2:K) {
        b[i] = normal_rng(0, 2);
    }
    vector[N] mu_sim = X * b;
}
'''

bspline_prior_predictive('m4_7_priors1', 15, 3,
                         title='Prior Predictive Distribution - Weight mu = 0, sigma = 2')

1.20 4H8

The cherry blossom spline in the chapter used an intercept α, but technically it doesn’t require one. The first basis functions could substitute for the intercept. Try refitting the cherry blossom spline without the intercept. What else about the model do you need to change to make this work?

d = pd.read_csv("data/cherry_blossoms.csv", sep=';')
d2 = d.dropna(subset=["doy"]).reset_index(drop=True)

m4_7a_mod1 = '''
data {
    int N;
    int K;
    array[N] int doy;
    matrix[N, K] B;
}
parameters {
    vector[K] w;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = B * w;
}
model {
    doy ~ normal(mu, sigma);
    w ~ normal(0, 10);
    sigma ~ exponential(1);
}
'''

def bspline_regression_plots(data, num_knots, degree, model='m4_7a_mod1'):
    knot_list = np.quantile(data.year, np.linspace(0, 1, num_knots))[1:-1].tolist()
    bspline_formula = Formula(f'bs(year, knots={knot_list}, degree={degree}, include_intercept=True)-1')
    B_mat = bspline_formula.get_model_matrix(data).to_numpy()

    input_data = {'N': B_mat.shape[0], 
                  'K': B_mat.shape[1],
                  'doy': data['doy'].astype(int).tolist(), 
                  'B': B_mat.tolist()}

    model = utils.Stan(f'stan_models/{model}', globals().get(model))
    
    post = model.laplace_sample(data=input_data, draws=1_000).stan_variables()
    _, mu_plo, mu_phi = utils.precis(post['mu']) 
    
    plt.figure(figsize=(8,3))
    plt.plot(data.year, data.doy, 'o', color='pink', alpha=0.6, zorder=2)
    plt.fill_between(data.year, mu_plo, mu_phi, alpha=0.6, zorder=3)
    plt.xlabel('Year')
    plt.ylabel('Day in Year')

bspline_regression_plots(d2, 15, 3)

This looks a lot like our original model, except that the left-hand side of the spline is pulled down. This is likely due to the prior on w. That prior is centered on 0, which assumes an intercept is present (i.e., the basis functions of the spline describe deviations averaging 0 around the mean). Without the intercept, however, the prior drags the line toward actual zero wherever the first basis function is non-zero. So the other change needed is to the prior on w: center it near the mean day-of-year, for example \(\text{Normal}(100, 10)\), the prior previously used for the intercept \(\alpha\).
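
A minimal sketch of that change, assuming we simply re-center the prior on the weights near the mean day-of-year (the same \(\text{Normal}(100, 10)\) previously used for the intercept), reusing the plotting helper defined above:

m4_7a_mod2 = '''
data {
    int N;
    int K;
    array[N] int doy;
    matrix[N, K] B;
}
parameters {
    vector[K] w;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = B * w;
}
model {
    doy ~ normal(mu, sigma);
    w ~ normal(100, 10);    // weights now carry the intercept, so center them near the mean doy
    sigma ~ exponential(1);
}
'''

bspline_regression_plots(d2, 15, 3, model='m4_7a_mod2')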