04b - The Many Variables & The Spurious Waffles

Author

Abdullah Mahmood

Published

March 17, 2025

0.1 Imports

# ruff: noqa: F405
from init import *
from causalgraphicalmodels import CausalGraphicalModel
import daft

%config InlineBackend.figure_formats = ['svg']
Module aliases imported by init_notebook.py:
--------------------------------------------
from cmdstanpy import CmdStanModel
import bridgestan as bs
import numpy as np
import pandas as pd
import arviz as az
import utils as utils
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

Watermark:
----------
Last updated: 2025-03-17T16:32:32.532468+04:00

Python implementation: CPython
Python version       : 3.13.2
IPython version      : 9.0.2

Compiler    : Clang 16.0.0 (clang-1600.0.26.6)
OS          : Darwin
Release     : 24.4.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit

bridgestan: 2.6.1
cmdstanpy : 1.2.5
watermark : 2.5.0
scipy     : 1.15.2
numpy     : 2.2.4
seaborn   : 0.13.2
pandas    : 2.2.3
arviz     : 0.21.0
matplotlib: 3.10.1
re        : 2.2.1

1 Masked Relationship

The divorce rate example demonstrates that multiple predictor variables are useful for knocking out spurious association. A second reason to use more than one predictor variable is to measure the direct influences of multiple factors on an outcome, when none of those influences is apparent from bivariate relationships. This kind of problem tends to arise when there are two predictor variables that are correlated with one another, but one is positively correlated with the outcome while the other is negatively correlated with it.

You’ll consider this kind of problem in a new data context: information about the composition of milk across primate species, as well as some facts about those species, like body mass and brain size. Milk is a huge investment, being much more expensive than gestation. Such an expensive resource is likely adjusted in subtle ways, depending upon the physiological and developmental details of each mammal species. Let’s load the data:

d = pd.read_csv("data/milk.csv", sep=';')

You should see in the structure of the data frame that there are 29 rows and 8 variables. The variables we’ll consider for now are kcal.per.g (kilocalories of energy per gram of milk), mass (average female body mass, in kilograms), and neocortex.perc (percent of total brain mass that is neocortex mass).
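A quick structural check with plain pandas confirms the dimensions:

print(d.shape)  # (29, 8)
d[['kcal.per.g', 'mass', 'neocortex.perc']].describe()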

A popular hypothesis has it that primates with larger brains produce more energetic milk, so that brains can grow quickly. Answering questions of this sort consumes a lot of effort in evolutionary biology, because there are many subtle statistical issues that arise when comparing species. It doesn’t help that many biologists have no reference model other than a series of regressions, and so the output of the regressions is not really interpretable. The causal meaning of statistical estimates always depends upon information outside the data.

We won’t solve these problems here. But we will explore a useful example. The question here is to what extent the energy content of milk, measured here in kilocalories, is related to the percent of the brain mass that is neocortex. Neocortex is the gray, outer part of the brain that is especially elaborate in some primates. We’ll end up needing female body mass as well, to see the masking that hides the relationships among the variables. Let’s standardize these three variables. As in previous examples, standardizing helps us both get a reliable approximation of the posterior and build reasonable priors.
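We’ll use the standardize helper from utils. In case you don’t have it at hand, here is a minimal sketch of the transform it presumably applies:

def standardize(x):
    # Center at zero and scale to unit standard deviation
    return (x - np.mean(x)) / np.std(x)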

d['K'] = utils.standardize(d['kcal.per.g'])
d['N'] = utils.standardize(d['neocortex.perc'])
d['M'] = utils.standardize(np.log(d['mass']))

The first model to consider is the simple bivariate regression between kilocalories and neocortex percent. You already know how to set up this regression. In mathematical form:

\[ \begin{align*} K_i &\sim \text{Normal}(\mu_i,\sigma) \\ \mu_i &= \alpha + \beta_N N_{i} \\ \end{align*} \]

where \(K\) is standardized kilocalories and \(N\) is standardized neocortex percent. We still need to consider the priors. But first let’s just try to run this as a StanQuap model with some vague priors, because there is another key modeling issue to address first. Note that the column d.N contains NaN, that is, missing values. If you pass a vector like this to Stan’s likelihood function, it doesn’t know what to do. After all, what’s the probability of a missing value? Whatever the answer, it isn’t a number, so Stan returns an error, failing to evaluate the log probability at the initial value.

d.N
0    -2.080196
1          NaN
2          NaN
3          NaN
4          NaN
5    -0.508641
6    -0.508641
7     0.010742
8          NaN
9     0.213470
10   -1.461962
11   -0.986139
12   -1.215673
13         NaN
14         NaN
15    0.401118
16         NaN
17    0.474837
18         NaN
19    0.975791
20         NaN
21   -0.007687
22         NaN
23    0.617249
24    0.841756
25         NaN
26    0.446355
27    1.461666
28    1.325956
Name: N, dtype: float64
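A quick count of the gaps confirms that, among our three variables, only the neocortex column is incomplete:

print(d['neocortex.perc'].isna().sum())  # 12 of the 29 rows are missing this value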

This is easy to fix. What you need to do here is manually drop all the cases with missing values. This is known as a complete case analysis. Some automated model fitting commands will silently drop such cases for you. But this isn’t always a good thing. First, its validity depends upon the process that caused these particular values to go missing. Later, we’ll explore this in much more depth. Second, once you start comparing models, you must compare models fit to the same data. If some variables have missing values that others do not, automated tools will silently produce misleading comparisons. Let’s march forward for now, dropping any cases with missing values. It’s worth learning how to do this yourself. To make a new data frame with only complete cases, use:

dcc = d.dropna(subset=['K', 'N', 'M']).reset_index(drop=True)
dcc.shape
(17, 11)

This makes a new data frame, dcc, that consists of the 17 rows from d that have no missing values in any of the variables listed in the subset argument of dropna. Now let’s work with the new data frame:

m5_5_draft = '''
data {
    int<lower=0> n;
    vector[n] N_seq;
}
generated quantities {
    real a = normal_rng(0, 1);
    real bN = normal_rng(0, 1);
    vector[n] mu = a + bN * N_seq;
}
'''

data = {'n': 2, 'N_seq': [-2, 2]}

m5_5_draft_model = utils.Stan('stan_models/m5_5_draft', m5_5_draft)

Before considering the posterior predictions, let’s consider those priors. As in many simple linear regression problems, these priors are harmless. But are they reasonable? It is important to build reasonable priors, because as the model becomes less simple, the priors can be very helpful, but only if they are scientifically reasonable. To simulate and plot 50 prior regression lines:

mu = m5_5_draft_model.laplace_sample(data=data, draws=1000).stan_variable('mu')
def plot_prior():
    plt.plot([-2,2], mu[:50,].T, 'k', alpha=0.2)
    plt.xlim(-2.1,2.1)
    plt.ylim(-2.1,2.1)
    plt.xlabel('Neocortex Percent (std)')
    plt.ylabel('Kilocal per g (std)')
    plt.title(r'$\alpha$ ~ Normal(0, 1), $\beta_N$ ~ Normal(0, 1)')

plt.clf()
plot_prior()

We’ve shown a range of 2 standard deviations for both variables, so that covers most of the outcome space. These lines are crazy. As in previous examples, we can do better by tightening the \(α\) prior so that it sticks closer to zero: with two standardized variables, when the predictor is zero, the expected value of the outcome should also be zero. The slope \(β_N\) needs to be a bit tighter as well, so that it doesn’t regularly produce impossibly strong relationships.
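With standardized variables, a slope larger than 1 in absolute value means a change of one standard deviation in the predictor produces more than a standard deviation of change in the outcome, which is rarely plausible. A quick calculation shows how often each candidate prior proposes such extreme slopes:

# Prior probability that |bN| > 1 under each candidate prior
print(2 * stats.norm.sf(1, loc=0, scale=1))    # Normal(0, 1): about 0.32
print(2 * stats.norm.sf(1, loc=0, scale=0.5))  # Normal(0, 0.5): about 0.05

Here’s an attempt with the tightened priors: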

m5_5 = '''
data {
    int<lower=0> n;
    vector[n] N;
    vector[n] K;

    int<lower=0> n_tilde;
    vector[n_tilde] N_seq;
}
parameters {
    real a;
    real bN;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] mu;
    mu = a + bN * N;
}
model {
    K ~ normal(mu, sigma);
    a ~ normal(0, 0.2);
    bN ~ normal(0, 0.5);
    sigma ~ exponential(1);
}
generated quantities {
    real a_sim = normal_rng(0, 0.2);
    real bN_sim = normal_rng(0, 0.5);
    vector[n_tilde] mu_sim = a_sim + bN_sim * N_seq;
}
'''

data = {'n': len(dcc),
        'N': dcc.N.tolist(),
        'K': dcc.K.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2]}

m5_5_model = utils.StanQuap('stan_models/m5_5', m5_5, data=data,
                            generated_var=['mu_sim','mu', 'a_sim', 'bN_sim'])
mu = m5_5_model.extract_samples(n=1000, select=['mu_sim'])
def plot_prior():
    plt.plot([-2,2], mu['mu_sim'][:50,].T, 'k', alpha=0.2)
    plt.xlim(-2.1,2.1)
    plt.ylim(-2.1,2.1)
    plt.xlabel('Neocortex Percent (std)')
    plt.ylabel('Kilocal per g (std)')
    plt.title(r'$\alpha$ ~ Normal(0, 0.2), $\beta_N$ ~ Normal(0, 0.5)')

plt.clf()
plot_prior()

The modified priors produce the above plot. These are still very vague priors, but at least the lines stay within the high probability region of the observable data.

Now let’s look at the posterior:

m5_5_model.precis().round(2)
           Mean  StDev  5.5%  94.5%
Parameter
a          0.04   0.15 -0.21   0.29
bN         0.13   0.22 -0.22   0.49
sigma      1.00   0.16  0.74   1.26

From this summary, you can see that this is neither a strong nor a very precise association: the posterior standard deviation of bN is almost twice the posterior mean. A quick calculation below makes the imprecision concrete. But as always, it’s much easier to see this if we draw a picture, so we then plot the predicted mean and 89% compatibility interval for the mean. The plotting code contains no surprises, but we have extended the range of \(N\) values to consider, in xseq, so that the plot looks nicer.
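Under the quadratic approximation, the marginal posterior of bN is roughly Normal(0.13, 0.22), reading the mean and standard deviation off the precis table above. The posterior probability of a positive slope is then only about 0.72:

# P(bN > 0) under the quadratic approximation of the posterior
print(stats.norm.sf(0, loc=0.13, scale=0.22))  # ~0.72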

xseq = np.linspace(dcc.N.min()-0.15, dcc.N.max()+0.15, 30)

def lm(post, predictor):
    a, bN, sigma = post.values()
    mu = a + bN * predictor.reshape(-1,1)
    return mu.T   
mu = m5_5_model.link(lm_func=lm, predictor=xseq, select=['a', 'bN', 'sigma'])
mu_mean, mu_plo, mu_phi = utils.precis(mu)

def plot_post():
    plt.plot(dcc.N, dcc.K, 'o', fillstyle='none')
    plt.plot(xseq, mu_mean, c="black")
    plt.fill_between(xseq, mu_plo, mu_phi, alpha=0.2)
    plt.xlabel("Neocortex Percent (std)")
    plt.ylabel("Kilocal per g (std)")

plt.clf()
plot_post()

The posterior mean line is weakly positive, but it is highly imprecise. A lot of mildly positive and negative slopes are plausible, given this model and these data.

Now consider another predictor variable, adult female body mass, mass in the data frame. Let’s use the logarithm of mass, np.log(mass), as a predictor as well. Why the logarithm of mass instead of the raw mass in kilograms? It is often true that scaling measurements like body mass are related by magnitudes to other variables. Taking the log of a measure translates the measure into magnitudes. So by using the logarithm of body mass here, we’re saying that we suspect that the magnitude of a mother’s body mass is related to milk energy, in a linear fashion. Much later, we’ll see why these logarithmic relationships are almost inevitable results of the physics of organisms.
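A tiny numerical check of the magnitude point: under \(\mu_i = \alpha + \beta_M \log(M_i)\), each doubling of body mass shifts the mean by the same constant, \(\beta_M \log 2\), whether the animal is small or large:

# Doubling mass from 1 kg or from 50 kg gives the same shift on the log scale
for m in [1.0, 50.0]:
    print(np.log(2 * m) - np.log(m))  # both print log(2), about 0.693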

Now we construct a similar model, but consider the bivariate relationship between kilocalories and body mass. Since body mass is also standardized, we can use the same priors and stay within possible outcome values. But if you were a domain expert in growth, you could surely do better than this.

m5_6 = '''
data {
    int<lower=0> n;
    vector[n] M;
    vector[n] K;

    int<lower=0> n_tilde;
    vector[n_tilde] xseq;
}
parameters {
    real a;
    real bM;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] mu;
    mu = a + bM * M;
}
model {
    K ~ normal(mu, sigma);
    a ~ normal(0, 0.2);
    bM ~ normal(0, 0.5);
    sigma ~ exponential(1);
}
generated quantities {
    vector[n_tilde] y_tilde = a + bM * xseq;
}
'''
xseq = np.linspace(dcc.M.min()-0.15, dcc.M.max()+0.15, 30)

data = {'n': len(dcc),
        'M': dcc.M.tolist(),
        'K': dcc.K.tolist(),
        'n_tilde': len(xseq),
        'xseq': xseq.tolist()}

m5_6_model = utils.StanQuap('stan_models/m5_6', m5_6, data=data,
                            generated_var=['mu', 'y_tilde'])

m5_6_model.precis().round(2)
           Mean  StDev  5.5%  94.5%
Parameter
a          0.05   0.15 -0.20   0.29
bM        -0.28   0.19 -0.59   0.03
sigma      0.95   0.16  0.70   1.20

mu = m5_6_model.extract_samples(n=1000, select=['y_tilde'])
mu_mean, mu_plo, mu_phi = utils.precis(mu['y_tilde'])

def plot_post():
    plt.plot(dcc.M, dcc.K, 'o', fillstyle='none')
    plt.plot(xseq, mu_mean, c="black")
    plt.fill_between(xseq, mu_plo, mu_phi, alpha=0.2)
    plt.xlabel("Log Body Mass (std)")
    plt.ylabel("Kilocal per g (std)")

plt.clf()
plot_post()

Log-mass is negatively associated with kilocalories. This association does seem stronger than that of neocortex percent, although in the opposite direction. It is quite uncertain though, with a wide compatibility interval that is consistent with a wide range of both weak and stronger relationships.

Now let’s see what happens when we add both predictor variables at the same time to the regression. This is the multivariate model, in math form:

\[ \begin{align*} K_i &\sim \text{Normal}(\mu_i,\sigma) \\ \mu_i &= \alpha + \beta_N N_{i} + \beta_M M_i \\ \alpha &\sim \text{Normal}(0,0.2) \\ \beta_N &\sim \text{Normal}(0,0.5) \\ \beta_M &\sim \text{Normal}(0,0.5) \\ \sigma &\sim \text{Exponential}(1) \\ \end{align*} \]

m5_7 = '''
data {
    int<lower=0> n;
    vector[n] K;
    vector[n] N;
    vector[n] M;

    int<lower=0> n_tilde;    // Number of counterfactual simulations
    vector[n_tilde] N_seq;   // Counterfactual N values (N -> K path)
    vector[n_tilde] M_seq;   // Counterfactual M values (M -> K path)
}
parameters {
    real a;
    real bN;
    real bM;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] mu;
    mu = a + bN * N + bM * M;   // N -> K <- M
}
model {
    // Priors
    a ~ normal(0, 0.2);
    bN ~ normal(0, 0.5);
    bM ~ normal(0, 0.5);
    sigma ~ exponential(1);

    // Likelihood
    K ~ normal(mu, sigma);
}
generated quantities {
    vector[n_tilde] mu_tilde;
    vector[n_tilde] K_tilde;

    for (i in 1:n_tilde) {
        mu_tilde[i] = a + bN * N_seq[i] + bM * M_seq[i];
    }

    for (i in 1:n_tilde) {
        K_tilde[i] = normal_rng(mu_tilde[i], sigma);
    }
}
'''

xseq = np.linspace(dcc.M.min()-0.15, dcc.M.max()+0.15, 30)

data = {'n': len(dcc),
        'K': dcc.K.tolist(),
        'N': dcc.N.tolist(),
        'M': dcc.M.tolist(),
        'n_tilde': len(xseq),
        'N_seq': np.zeros_like(xseq).tolist(),
        'M_seq': xseq.tolist()}

m5_7_model = utils.StanQuap('stan_models/m5_7', m5_7, data=data, algorithm='LBFGS',
                            generated_var=['mu_tilde', 'K_tilde', 'mu'])
m5_7_model.precis().round(2)
           Mean  StDev  5.5%  94.5%
Parameter
a          0.07   0.13 -0.15   0.28
bN         0.68   0.25  0.28   1.07
bM        -0.70   0.22 -1.06  -0.35
sigma      0.74   0.13  0.53   0.95

By incorporating both predictor variables in the regression, the posterior association of both with the outcome has increased. Visually comparing this posterior to those of the previous two models helps to see the pattern of change:

def plot_forest():
    # Compute each posterior summary once, then draw one interval per row
    p5_5, p5_6, p5_7 = m5_5_model.precis(), m5_6_model.precis(), m5_7_model.precis()
    rows = [('bN - m5_5', p5_5.loc['bN']),
            ('bN - m5_7', p5_7.loc['bN']),
            ('bM - m5_6', p5_6.loc['bM']),
            ('bM - m5_7', p5_7.loc['bM'])]
    for y, (_, r) in enumerate(rows):
        plt.hlines(y, xmin=r['5.5%'], xmax=r['94.5%'], linewidth=2)
        plt.plot(r['Mean'], y, 'o', fillstyle='none')
    plt.axvline(0, linestyle='--', linewidth=0.3)
    plt.yticks(range(len(rows)), labels=[label for label, _ in rows])
    plt.xlabel('Estimate')

plt.clf()
plot_forest()

The posterior means for neocortex percent and log-mass have both moved away from zero. Adding both predictors to the model seems to have made their estimates move apart.

What happened here? Why did adding neocortex and body mass to the same model lead to stronger associations for both? This is a context in which there are two variables correlated with the outcome, but one is positively correlated with it and the other is negatively correlated with it. In addition, both of the explanatory variables are positively correlated with one another. Try a simple pairs plot to appreciate this pattern of correlation. The result of this pattern is that the variables tend to cancel one another out.
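The correlation matrix makes the same point numerically, using the complete cases:

dcc[['K', 'N', 'M']].corr().round(2)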

def plot_pairs():
    _, ax = plt.subplots(3,3)
    az.plot_pair({'K': dcc.K, 'M': dcc.M, 'N': dcc.N}, marginals=True, ax=ax);

plt.clf()
plot_pairs()

This is another case in which multiple regression automatically finds the most revealing cases and uses them to produce inferences. What the regression model does is ask if species that have high neocortex percent for their body mass have higher milk energy. Likewise, the model asks if species with high body mass for their neocortex percent have higher milk energy. Bigger species, like apes, have milk with less energy. But species with more neocortex tend to have richer milk. The fact that these two variables, body size and neocortex, are correlated across species makes it hard to see these relationships, unless we account for both.
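One way to build intuition for “high neocortex percent for their body mass,” outside the Bayesian machinery, is an ordinary least squares residual trick. This is only an intuition aid, not how m5_7 computes anything:

# Regress N on M (plain OLS via np.polyfit), keep the residuals, and ask how
# K relates to the part of N that body mass does not explain.
b1, b0 = np.polyfit(dcc.M, dcc.N, 1)      # slope and intercept of N ~ M
N_resid = dcc.N - (b0 + b1 * dcc.M)       # N relative to what M predicts
print(np.corrcoef(dcc.K, N_resid)[0, 1])  # compare with np.corrcoef(dcc.K, dcc.N)[0, 1]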

Some DAGs will help. There are at least three graphs consistent with these data.

dag5_1 = CausalGraphicalModel(nodes=["M", "N", "K"], edges=[("M", "N"), ("M", "K"), ("N", "K")])
pgm1 = daft.PGM()
coordinates = {"M": (0, 0), "K": (1, 1), "N": (2, 0)}
for node in dag5_1.dag.nodes:
    pgm1.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
for edge in dag5_1.dag.edges:
    pgm1.add_edge(*edge)
pgm1.render()
plt.gca().invert_yaxis()

dag5_2 = CausalGraphicalModel(nodes=["M", "N", "K"], edges=[("N", "M"), ("M", "K"), ("N", "K")])
pgm2 = daft.PGM()
coordinates = {"M": (0, 0), "K": (1, 1), "N": (2, 0)}
for node in dag5_2.dag.nodes:
    pgm2.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
for edge in dag5_2.dag.edges:
    pgm2.add_edge(*edge)
pgm2.render()
plt.gca().invert_yaxis()

dag5_3 = CausalGraphicalModel(nodes=["M", "N", "K", "U"], edges=[("U", "M"), ("U", "N"), ("M", "K"), ("N", "K")])
pgm3 = daft.PGM()
coordinates = {"M": (0, 0), "K": (1, 1), "N": (2, 0), "U": (1, 0)}
for node in dag5_3.dag.nodes:
    if node != "U":
        pgm3.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
    else:
        pgm3.add_node(node, node, *coordinates[node])
for edge in dag5_3.dag.edges:
    pgm3.add_edge(*edge)
pgm3.render()
plt.gca().invert_yaxis()

Beginning on the left, the first possibility is that body mass (\(M\)) influences neocortex percent (\(N\)). Both then influence kilocalories in milk (\(K\)). Second, in the middle, neocortex could instead influence body mass. The two variables still end up correlated in the sample. Finally, on the right, there could be an unobserved variable \(U\) that influences both \(M\) and \(N\), producing a correlation between them. In this book, I’ll circle variables that are unobserved. One of the threats to causal inference is that there are potentially many unobserved variables that influence an outcome or the predictors. We’ll consider this more in the next chapter.

Which of these graphs is right? We can’t tell from the data alone, because these graphs imply the same set of conditional independencies. In this case, there are no conditional independencies: each DAG above implies that all pairs of variables are associated, regardless of what we condition on. A set of DAGs with the same conditional independencies is known as a Markov equivalence set. Next, I’ll show you how to simulate observations consistent with each of these DAGs, how each can produce the masking phenomenon, and how to compute the complete set of Markov equivalent DAGs (using the dagitty package, in R, at the end of this section). Remember that while the data alone can never tell you which causal model is correct, your scientific knowledge of the variables will eliminate a large number of silly, but Markov equivalent, DAGs.
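We can check the conditional independency claim directly with the CausalGraphicalModel objects defined above, via the package’s get_all_independence_relationships method, which lists every conditional independency a DAG implies:

print(dag5_1.get_all_independence_relationships())  # [] - fully connected, none implied
print(dag5_2.get_all_independence_relationships())  # []
# dag5_3 does list some, but only ones involving the unobserved variable U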

The final thing we’d like to do with these models is to make counterfactual plots again. Suppose the third DAG above is the right one. Then imagine manipulating \(M\) and \(N\), breaking the influence of \(U\) on each. In the real world, such experiments are impossible. If we change an animal’s body size, natural selection would then change the other features to match it. But these counterfactual plots do help us see how the model views the association between each predictor and the outcome. Here is the code:

def generate_counterfactual_data(dcc, xseq, hold_variable):
    return {
        'n': len(dcc),
        'K': dcc.K.tolist(),
        'N': dcc.N.tolist(),
        'M': dcc.M.tolist(),
        'n_tilde': len(xseq),
        'N_seq': np.zeros_like(xseq).tolist() if hold_variable == 'N' else xseq.tolist(),
        'M_seq': xseq.tolist() if hold_variable == 'N' else np.zeros_like(xseq).tolist()
    }

def plot_counterfactual(ax, xseq, mu_mean, mu_plo, mu_phi, xlabel, title):
    ax.plot(xseq, mu_mean, c="black")
    ax.fill_between(xseq, mu_plo, mu_phi, alpha=0.2)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Kilocal per g (std)")
    ax.set_title(title)

def plot_counterfactuals():
    fig, axes = plt.subplots(1, 2, figsize=(9, 4))
    
    counterfactuals = [
        {"xseq": np.linspace(dcc.M.min() - 0.15, dcc.M.max() + 0.15, 30), 
         "hold_variable": "N", 
         "xlabel": "Log Body Mass (std)", 
         "title": "Counterfactual holding N = 0"},
        {"xseq": np.linspace(dcc.N.min() - 0.15, dcc.N.max() + 0.15, 30), 
         "hold_variable": "M", 
         "xlabel": "Neocortex Percent (std)", 
         "title": "Counterfactual holding M = 0"}
    ]
    
    for ax, cf in zip(axes, counterfactuals):
        data = generate_counterfactual_data(dcc, cf["xseq"], cf["hold_variable"])
        mu = m5_7_model.laplace_sample(data=data, draws=1000).stan_variable('mu_tilde')
        mu_mean, mu_plo, mu_phi = utils.precis(mu)
        plot_counterfactual(ax, cf["xseq"], mu_mean, mu_plo, mu_phi, cf["xlabel"], cf["title"])
    
    plt.tight_layout()
    plt.show()

plot_counterfactuals()


1.1 Simulating a Masking Relationship

Just as with understanding spurious association, it may help to simulate data in which two meaningful predictors act to mask one another. Earlier, I showed three DAGs consistent with this. To simulate data consistent with the first DAG:

# M -> K <- N
# M -> N
n = 100
M = stats.norm.rvs(size=n)
N = stats.norm.rvs(loc=M)
K = stats.norm.rvs(loc=N-M)
d_sim = pd.DataFrame({'K':K, 'N': N, 'M':M})

You can quickly see the masking pattern of inferences by replacing dcc with d_sim in models m5_5, m5_6, and m5_7. Look at the summaries and you’ll see the same masking pattern where the slopes become more extreme in m5_7.

data = {'n': len(d_sim),
        'N': d_sim.N.tolist(),
        'K': d_sim.K.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2]}

m5_5_dsim = utils.StanQuap('stan_models/m5_5', m5_5, data=data,
                            generated_var=['mu_sim','mu', 'a_sim', 'bN_sim'])

print(m5_5_dsim.precis().round(2))

data = {'n': len(d_sim),
        'M': d_sim.M.tolist(),
        'K': d_sim.K.tolist(),
        'n_tilde': 2,
        'xseq': [-2, 2]}

m5_6_dsim = utils.StanQuap('stan_models/m5_6', m5_6, data=data,
                            generated_var=['mu', 'y_tilde'])

print(m5_6_dsim.precis().round(2))

data = {'n': len(d_sim),
        'K': d_sim.K.tolist(),
        'N': d_sim.N.tolist(),
        'M': d_sim.M.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2],
        'M_seq': [-2, 2]}

m5_7_dsim = utils.StanQuap('stan_models/m5_7', m5_7, data=data, algorithm='LBFGS',
                            generated_var=['mu_tilde', 'K_tilde', 'mu'])

print(m5_7_dsim.precis().round(2))
           Mean  StDev  5.5%  94.5%
Parameter                          
a          0.05   0.10 -0.12   0.21
bN         0.54   0.09  0.39   0.68
sigma      1.20   0.08  1.07   1.34
           Mean  StDev  5.5%  94.5%
Parameter                          
a          0.16   0.12 -0.03   0.34
bM        -0.09   0.14 -0.30   0.13
sigma      1.40   0.10  1.24   1.56
           Mean  StDev  5.5%  94.5%
Parameter                          
a         -0.00   0.09 -0.15   0.14
bN         0.93   0.09  0.79   1.08
bM        -0.85   0.12 -1.04  -0.65
sigma      0.97   0.07  0.86   1.08
def plot_pairs():
    _, ax = plt.subplots(3,3)
    az.plot_pair({'K': K, 'M': M, 'N': N}, marginals=True, ax=ax);

plt.clf()
plot_pairs()

The other two DAGs can be simulated like this:

# M -> K <- N
# N -> M
n = 100
N = stats.norm.rvs(size = n)
M = stats.norm.rvs(loc=N)
K = stats.norm.rvs(loc=N-M)
d_sim2 = pd.DataFrame({'K':K, 'N': N, 'M':M})

# M -> K <- N
# M <- U -> N
n = 100
U = stats.norm.rvs(size=n)
M = stats.norm.rvs(loc=U)
N = stats.norm.rvs(loc=U)
K = stats.norm.rvs(loc=N-M)
d_sim3 = pd.DataFrame({'K':K, 'N': N, 'M':M})

In the primate milk example, it may be that the positive association between large body size and neocortex percent arises from a tradeoff between lifespan and learning. Large animals tend to live a long time. And in such animals, an investment in learning may be a better investment, because learning can be amortized over a longer lifespan. Both large body size and large neocortex then influence milk composition, but in different directions, for different reasons. This story implies that the DAG with an arrow from \(M\) to \(N\), the first one, is the right one. But with the evidence at hand, we cannot easily see which is right. We should compute the Markov equivalence set and eliminate those that we think are not valid based upon our scientific knowledge of the variables.

We can compute the Markov equivalence set in R using the dagitty package. The code below enumerates and draws the equivalent DAGs:

dag5.7 <- dagitty( "dag{
    M -> K <- N
    M -> N }" )
coordinates(dag5.7) <- list( x=c(M=0,K=1,N=2) , y=c(M=0.5,K=1,N=0.5) )
MElist <- equivalentDAGs(dag5.7)
drawdag(MElist)

The MElist variable contains six different DAGs. Which of these do you think you could eliminate, based upon scientific knowledge of the variables?
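If you’d rather stay in Python, a back-of-the-envelope check of that count needs no extra package: because a fully connected DAG implies no conditional independencies, every acyclic orientation of the complete graph on {M, N, K} is Markov equivalent, and those orientations correspond exactly to the 3! = 6 orderings of the nodes.

from itertools import permutations

# Each node ordering induces one complete DAG: orient every edge from the
# earlier node to the later one. All six share the same (empty) set of
# conditional independencies, so all six are Markov equivalent.
for order in permutations(['M', 'N', 'K']):
    edges = [(order[i], order[j]) for i in range(3) for j in range(i + 1, 3)]
    print(edges)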