Quick Start#

In this quick start we run through a simple example of how kulprit provides an effective model selection procedure for linear regression.

To begin with, we import the necessary dependencies for the initial model fitting (for which we use bambi) and data handling. The kulprit package provides all the machinery for model selection once we have fitted a bambi model.

import pandas as pd
import numpy as np

import bambi as bmb
import kulprit as kpt

import arviz as az
import matplotlib.pyplot as plt
az.style.use('arviz-white')

A synthetic example#

To illustrate the use of kulprit, we begin with a linear regression with synthetic data.

RANDOM_SEED = 453
np.random.seed(RANDOM_SEED)
x1 = np.random.normal(0, 0.5, 121)
x2 = np.random.normal(0, 1, 121)
# note that y depends on x1 only; x2 carries no signal
y = np.random.normal(2 * x1 + 3, 0.25)
data = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
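
Optionally, and only as a quick sanity check on the simulated data (this is not part of the selection workflow), we can plot the response against each covariate; by construction y trends linearly with x1 and is unrelated to x2.

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(x1, y)
axes[0].set_xlabel("x1")
axes[0].set_ylabel("y")
axes[1].scatter(x2, y)
axes[1].set_xlabel("x2");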

We now fit a linear regression to these data, namely the model

$$ y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \beta_2 x_2, \sigma) $$

with \(\sigma>0\) and \(\beta_i \in \mathbb{R}\), in which we treat \(y\) as our variate, and \(x_1\) and \(x_2\) as the covariates. This is cleanly communicated to bambi using the formula "y ~ x1 + x2". We will use the default bambi priors over \(\beta_i\) and \(\sigma\), on which more information can be found in the bambi docs.

model = bmb.Model("y ~ x1 + x2", data)
idata = model.fit(random_seed=RANDOM_SEED, idata_kwargs={"log_likelihood": True})
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [y_sigma, Intercept, x1, x2]
100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.
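
Before moving on to model selection, it can be useful to glance at the reference posterior. As a minimal check with ArviZ (using the parameter names reported by the sampler above), we could do:

az.summary(idata, var_names=["Intercept", "x1", "x2", "y_sigma"])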

Model selection#

We now perform the model selection procedure provided by kulprit by instantiating the ProjectionPredictive class.

ppi = kpt.ProjectionPredictive(model, idata)

And then we perform a search. Kulprit starts from the intercept-only model and keeps adding one variable at a time until it reaches a model with all the variables of the original model. Notice that at each step kulprit has to “decide” which variable to add. For this simple model we only have two possible search paths:

  • intercept –> intercept + x1 –> intercept + x1 + x2

  • intercept –> intercept + x2 –> intercept + x1 + x2

Notice that as we add variables the number of alternative paths grows very fast. From the following cell we can see that the first path is the one that is chosen.

ppi.search()
ppi
  0 y ~ 1
  1 y ~ x1
  2 y ~ x1 + x2
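
With only two covariates the search above is trivial, but in general the number of orderings in which variables could be added grows factorially with the number of covariates, which is why kulprit greedily adds one variable at a time rather than exploring every path. A quick illustration of that growth (not part of the kulprit workflow):

from math import factorial

for p in (2, 5, 10, 20):
    print(f"{p} covariates -> {factorial(p):,} possible orderings")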

Having performed the search, we now want to identify the submodel with the fewest parameters that most closely resembles the reference model. We do so by computing the expected log pointwise predictive density (ELPD) using leave-one-out cross-validation (LOO-CV).

cmp, ax = ppi.plot_compare(plot=True, figsize=(10, 4));
[Figure: ELPD (LOO-CV) comparison of the submodels and the reference model]

From these results we can see that the difference between the reference model and the submodel with a single covariate is very small. Thus, we may want to keep this smaller model. In the following cell we can see that this model is y ~ x1.

submodel = ppi.project(1)
submodel
y ~ x1

Having obtained this submodel, we can investigate the posterior distributions of its parameters more closely with ArviZ.

ax = az.plot_posterior(submodel.idata);
[Figure: posterior distributions of the submodel parameters]

Or we can compare the posterior distributions of the parameters of the reference model and the submodel of size 2, i.e. the model with as many parameters as the reference model. We can see that their posterior distributions are very similar.

ppi.plot_densities(submodels=[2], kind="forest", figsize=(10, 4));
[Figure: forest plot comparing the posterior densities of the reference model and the size-2 submodel]
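
For a numerical rather than visual comparison, one option (a sketch, assuming the projected posterior keeps the same parameter names as the bambi fit) is to place the ArviZ summaries of the projected size-2 submodel and the reference model side by side:

# project the size-2 submodel and summarize both posteriors
submodel_full = ppi.project(2)
print(az.summary(submodel_full.idata))
print(az.summary(idata))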

OK, but what is going on?#

At each step, kulprit has to compute the posterior distribution of each candidate submodel and then evaluate its predictive performance. The computation of the posterior distribution is not done with MCMC as in bambi; instead, kulprit “projects” the reference model (the model with all the variables) onto a given submodel.

By projection, we mean that for a reference model posterior \(p(\boldsymbol y \mid \boldsymbol \beta^\ast)\) with \(\boldsymbol \beta^\ast = (\beta_0, \beta_1, \beta_2)^T\), and a parameter subset \(\boldsymbol \beta^\perp \subset \boldsymbol \beta^\ast\), we find the values of \(\boldsymbol \beta^\perp\) as the solution to the optimization problem

$$ \boldsymbol\beta^\perp = \arg\min_{\boldsymbol\beta} \, d \left( p(\boldsymbol y \mid \boldsymbol\beta), \, p(\boldsymbol y \mid \boldsymbol\beta^\ast) \right) $$

In this formulation, \(d(\cdot, \cdot)\) is the Kullback-Leibler (KL) divergence. In other words, we look for the values of \(\boldsymbol \beta^\perp\) for which \(p(\boldsymbol y \mid \boldsymbol\beta^\perp)\) is as close as possible to the reference model's \(p(\boldsymbol y \mid \boldsymbol\beta^\ast)\).
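
To give a flavour of what this optimization amounts to, consider the Gaussian case. The following is only an illustrative sketch of the math above, not kulprit's actual implementation: for a single posterior draw of the reference model, the KL-optimal submodel coefficients are obtained by regressing the reference model's linear predictor onto the submodel's covariates, and the leftover discrepancy is absorbed into the submodel's \(\sigma\).

def project_draw(X_ref, beta_ref, sigma_ref, X_sub):
    """Project one reference-model draw onto the submodel spanned by X_sub."""
    mu_ref = X_ref @ beta_ref  # reference linear predictor for this draw
    # least-squares fit of the reference predictor onto the submodel covariates
    beta_sub, *_ = np.linalg.lstsq(X_sub, mu_ref, rcond=None)
    resid = mu_ref - X_sub @ beta_sub
    # the unexplained variation inflates the submodel's observation noise
    sigma_sub = np.sqrt(sigma_ref**2 + np.mean(resid**2))
    return beta_sub, sigma_sub

# e.g. project a (hypothetical) reference draw onto the y ~ x1 submodel
X_ref = np.column_stack([np.ones_like(x1), x1, x2])
beta_sub, sigma_sub = project_draw(X_ref, np.array([3.0, 2.0, 0.0]), 0.25, X_ref[:, :2])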