API Reference

This reference provides detailed documentation for user functions in the current release of kulprit.

kulprit

Kulprit.

Kullback-Leibler projections for Bayesian model selection.

class kulprit.ProjectionPredictive(model, idata, rng=456)

Projection Predictive class from which we perform the model selection procedure.

Parameters:

model : Bambi model

The reference model to project.

idata : InferenceData or DataTree

The result of fitting the reference model.

rng : RandomState

Random number generator used for sampling from the posterior predictive if the group is not present in idata.
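
A typical workflow chains the constructor with the project, compare and select methods documented in this reference. The sketch below is only illustrative: it assumes `model` is an already-built Bambi model and `idata` is the result of fitting it, and the wrapper `run_selection` is a hypothetical helper that is not part of kulprit. The import is deferred so the helper can be defined even where kulprit is not installed.

```python
def run_selection(model, idata, criterion="mean"):
    """Project a fitted reference model and pick a submodel (hypothetical helper)."""
    import kulprit  # deferred import: only needed when the helper is called

    ppi = kulprit.ProjectionPredictive(model, idata)
    # Forward search over submodels, leaving the other arguments at their defaults.
    ppi.project(method="forward")
    # Table of performance stats for the reference model and the submodels.
    cmp_df = ppi.compare(stats="elpd")
    # Smallest submodel that satisfies the chosen criterion ("mean" or "se").
    selected = ppi.select(criterion=criterion)
    return cmp_df, selected
```

Calling `run_selection(model, idata, criterion="se")` would return the comparison table together with the selected submodel.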

compare(stats='elpd', min_model_size=0, round_to=None)

Return a DataFrame with the performance stats of the reference and submodels.

Parameters:

stats : str

The statistics to compute. Defaults to “elpd”.

  • “elpd”: expected log (pointwise) predictive density (ELPD).

  • “mlpd”: mean log predictive density (MLPD), that is, the ELPD divided by the number of observations.

  • “gmpd”: geometric mean predictive density (GMPD), that is, exp(MLPD). For discrete response families the GMPD is bounded by zero and one.

min_model_size : int

The minimum size of the submodels to compare. Defaults to 0, which means the intercept-only model is included in the comparison.

round_to : int

Number of decimals used to round results. Defaults to None.

Returns:

DataFrame

A DataFrame with the ELPD and standard error of the submodels and the reference model. The index of the DataFrame is the term names of the submodels, and the first row is the reference model.
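
The three statistics accepted by `stats` are related by simple arithmetic, which the following sketch makes explicit. It is not kulprit's implementation: the function `predictive_stats` and the pointwise values are hypothetical, and the pointwise log predictive densities are assumed to be already available.

```python
import math


def predictive_stats(pointwise_log_density):
    """Return ELPD, MLPD and GMPD from pointwise log predictive densities."""
    n_obs = len(pointwise_log_density)
    elpd = math.fsum(pointwise_log_density)  # sum over observations
    mlpd = elpd / n_obs                      # ELPD divided by the number of observations
    gmpd = math.exp(mlpd)                    # exp(MLPD)
    return {"elpd": elpd, "mlpd": mlpd, "gmpd": gmpd}


# Hypothetical pointwise values for four observations.
stats = predictive_stats([-1.2, -0.8, -1.5, -0.5])
```

Because the GMPD is a monotone transformation of the MLPD, which is the ELPD rescaled by a constant, all three statistics rank a set of models evaluated on the same data identically.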

project(method='forward', user_terms=None, num_samples=400, num_clusters=20, early_stop=None, require_lower_terms=True, tolerance=1)

Perform model projection.

Parameters:

method : str

The search method to employ, either “forward” for a forward search, or “l1” for an L1-regularized search. Ignored if “user_terms” is provided.

user_terms : list of list of str

If a nested list of terms is provided, the models with those terms are projected directly.

num_samples : int

The number of samples to draw from the posterior predictive distribution for the projection procedure and ELPD computation. Defaults to 400.

num_clusters : int

The number of clusters to use during the forward search. Defaults to 20.

If None, the number of clusters is set to the number of samples. If num_clusters is larger than num_samples, it is set to num_samples.

early_stop : str or int, optional

Whether to stop the search early. If an integer is provided, the search stops when the submodel size is equal to the integer. If a string is provided, the search stops when the difference in ELPD between the reference model and the submodel is small. There are two criteria to define what small is, “mean” and “se”. The “mean” criterion stops the search when the difference in ELPD is smaller than 4. The “se” criterion stops the search when the ELPD of the submodel is within one standard error of the reference model. Defaults to None.

require_lower_terms : bool

Include higher-order interactions only if all lower-order interactions and main effects are already in the subset. Defaults to True. Ignored if user_terms is provided or if the method is not “forward”.

tolerance : float

The tolerance for the optimization procedure. Defaults to 1. Decreasing this value will increase the accuracy of the projection at the cost of speed.
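
The rules for num_clusters and early_stop described above can be written down as two small functions. This is a sketch of the documented behaviour only, not kulprit's code: both function names are hypothetical, the difference is assumed to be the reference ELPD minus the submodel ELPD, and `se` stands for whichever standard error the comparison uses, which this reference does not specify.

```python
def effective_num_clusters(num_samples, num_clusters=20):
    """Resolve the cluster count following the documented rules."""
    if num_clusters is None:
        return num_samples                   # None: one cluster per sample
    return min(num_clusters, num_samples)    # never more clusters than samples


def should_stop(early_stop, size, elpd_ref, elpd_sub, se):
    """Decide whether the search would stop at a submodel of the given size."""
    if early_stop is None:
        return False                         # no early stopping
    if isinstance(early_stop, int):
        return size == early_stop            # stop at a fixed submodel size
    diff = elpd_ref - elpd_sub               # assumed sign convention
    if early_stop == "mean":
        return diff < 4                      # difference smaller than 4
    if early_stop == "se":
        return diff <= se                    # within one standard error
    raise ValueError(f"unknown early_stop value: {early_stop!r}")
```

For example, `effective_num_clusters(10, 20)` yields 10, since the requested 20 clusters exceed the 10 available samples.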

select(criterion='mean')

Select the smallest submodel.

The selection is based on comparing the ELPDs of the reference and submodels.

Parameters:

criterion (str) – The criterion to use for selecting the best submodel. Either “mean” or “se”. The “mean” criterion selects the smallest submodel with an ELPD that is within 4 units of the reference model. The “se” criterion selects the smallest submodel with an ELPD that is within one standard error of the reference model.

Returns:

The selected submodel.

Return type:

SubModel
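
To make the two criteria concrete, the sketch below picks the smallest submodel from a list of hypothetical (size, ELPD, standard error) records. It is an illustration under stated assumptions rather than kulprit's implementation: the function name and the numbers are invented, and the standard error compared against is assumed to be the one stored with each submodel.

```python
def select_smallest(reference_elpd, submodels, criterion="mean"):
    """Return the size of the smallest submodel meeting the criterion, or None.

    submodels is a list of (size, elpd, se) tuples (hypothetical layout).
    """
    if criterion not in ("mean", "se"):
        raise ValueError(f"unknown criterion: {criterion!r}")
    # Visit candidates from smallest to largest so the first hit is the smallest.
    for size, elpd, se in sorted(submodels, key=lambda record: record[0]):
        diff = reference_elpd - elpd
        threshold = 4 if criterion == "mean" else se
        if diff <= threshold:
            return size
    return None


# Hypothetical search results: (size, ELPD, standard error).
candidates = [(0, -120.0, 6.0), (1, -105.0, 5.5), (2, -103.0, 5.0), (3, -101.0, 5.0)]
by_mean = select_smallest(-100.0, candidates, criterion="mean")
by_se = select_smallest(-100.0, candidates, criterion="se")
```

With these numbers the “se” criterion accepts a smaller submodel than the “mean” criterion, because the standard error of the size-1 submodel is larger than 4.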

kulprit.plot_compare(cmp_df, relative_scale=True, backend=None, visuals=None, **pc_kwargs)

Summary plot for model comparison.

Models are compared based on their expected log pointwise predictive density (ELPD). Higher ELPD values indicate better predictive performance.

The ELPD is estimated by Pareto smoothed importance sampling leave-one-out cross-validation (LOO). Details are presented in [1] and [2].

The ELPD can only be interpreted in relative terms, and differences in ELPD smaller than 4 are considered negligible [3].

Parameters:
  • cmp_df (pandas.DataFrame) – The result of Kulprit’s compare function

  • relative_scale (bool, optional) – If True, scale the ELPD values relative to the reference model. Defaults to True.

  • backend ({"bokeh", "matplotlib", "plotly"}) – Select plotting backend. Defaults to ArviZ’s rcParams[“plot.backend”].

  • visuals (mapping of {str : mapping or bool}, optional) –

    Valid keys are:

    • point_estimate -> passed to scatter()

    • error_bar -> passed to line()

    • ref_line -> passed to hline()

    • ref_band -> passed to hspan()

    • similar_line -> passed to hline(). Defaults to False.

    • labels -> passed to xticks() and yticks()

    • title -> passed to title(). Defaults to False.

    • ticklabels -> passed to yticks()

  • **pc_kwargs – Passed to arviz_plots.PlotCollection

Return type:

PlotCollection
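
The effect of relative_scale can be pictured with a few lines of plain Python. This sketch rests on an assumption that this reference does not confirm: that "relative to the reference model" means subtracting the reference ELPD from every value. The function names, the labels and the numbers are all hypothetical.

```python
def to_relative_scale(elpds, reference_label="reference"):
    """Express each ELPD as a difference from the reference model (assumed meaning)."""
    ref = elpds[reference_label]
    return {name: value - ref for name, value in elpds.items()}


def is_negligible(diff, threshold=4):
    """Differences in ELPD smaller than 4 are treated as negligible."""
    return abs(diff) < threshold


# Hypothetical ELPD values for a reference model and two submodels.
elpds = {"reference": -100.0, "x1": -103.0, "Intercept": -120.0}
relative = to_relative_scale(elpds)
```

On this scale the reference model sits at zero, so a submodel's distance from zero can be compared directly against the threshold of 4.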

References

kulprit.plot_dist(ppi, submodels=None, include_reference=False, var_names=None, filter_vars=None, coords=None, sample_dims=None, kind=None, point_estimate=None, ci_kind=None, ci_prob=None, plot_collection=None, backend=None, labeller=None, aes_by_visuals=None, visuals=None, stats=None, **pc_kwargs)

Plot 1D marginal densities.

This function is a thin wrapper around arviz_plots.plot_dist() that prepares the data from a kulprit.ProjectionPredictive object.

Parameters:
  • ppi (kulprit.ProjectionPredictive object)

  • submodels (list of {int, str}, optional) – List of submodel sizes or names to be plotted.

  • include_reference (bool, default False) – Whether to include the reference model in the plot.

  • var_names (str or list of str, optional) – One or more variables to be plotted. Prefix the variables by ~ when you want to exclude them from the plot.

  • filter_vars ({None, “like”, “regex”}, default None) – If None, interpret var_names as the real variable names. If “like”, interpret var_names as substrings of the real variable names. If “regex”, interpret var_names as regular expressions on the real variable names.

  • coords (dict, optional)

  • sample_dims (str or sequence of hashable, optional) – Dimensions to reduce unless mapped to an aesthetic. Defaults to rcParams["data.sample_dims"]

  • kind ({"kde", "hist", "dot", "ecdf"}, optional) – How to represent the marginal density. Defaults to rcParams["plot.density_kind"]

  • point_estimate ({"mean", "median", "mode"}, optional) – Which point estimate to plot. Defaults to rcParams["stats.point_estimate"]

  • ci_kind ({"eti", "hdi"}, optional) – Which credible interval to use. Defaults to rcParams["stats.ci_kind"]

  • ci_prob (float, optional) – Indicates the probability that should be contained within the plotted credible interval. Defaults to rcParams["stats.ci_prob"]

  • plot_collection (PlotCollection, optional)

  • backend ({"matplotlib", "bokeh"}, optional)

  • labeller (labeller, optional)

  • aes_by_visuals (mapping of {str : sequence of str}, optional) –

    Mapping of visuals to aesthetics that should use their mapping in plot_collection when plotted. Valid keys are the same as for visuals.

    With a single model, no aesthetic mappings are generated by default, each variable+coord combination gets a plot but they all look the same, unless there are user provided aesthetic mappings. With multiple models, plot_dist maps “color” and “y” to the “model” dimension.

    By default, all aesthetics but “y” are mapped to the density representation, and if multiple models are present, “color” and “y” are mapped to the credible interval and the point estimate.

    When “point_estimate” key is provided but “point_estimate_text” isn’t, the values assigned to the first are also used for the second.

  • visuals (mapping of {str : mapping or bool}, optional) –

    Valid keys are:

    • dist -> depending on the value of kind passed to:

      • “kde” -> passed to line_xy()

      • “ecdf” -> passed to ecdf_line()

      • “hist” -> passed to step_hist()

    • face -> visual that fills the area under the marginal distribution representation.

      Defaults to False. Depending on the value of kind it is passed to:

      • “kde” or “ecdf” -> passed to fill_between_y()

      • “hist” -> passed to hist()

    • credible_interval -> passed to line_x(). Defaults to False.

    • point_estimate -> passed to scatter_x(). Defaults to False.

    • point_estimate_text -> passed to point_estimate_text(). Defaults to False.

    • title -> passed to labelled_title()

    • rug -> passed to scatter_x(). Defaults to False.

    • remove_axis -> not passed anywhere, can only take False as value to skip calling remove_axis()

  • stats (mapping, optional) –

    Valid keys are:

    • dist -> passed to kde, ecdf, …

    • credible_interval -> passed to eti or hdi

    • point_estimate -> passed to mean, median or mode

  • **pc_kwargs – Passed to arviz_plots.PlotCollection.wrap

Return type:

PlotCollection
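
The interaction between var_names, the ~ prefix and filter_vars is easiest to see on a small list of names. The sketch below mirrors the semantics described above in plain Python; it is not the code used by kulprit or arviz_plots, the function `match_var_names` and the variable names are hypothetical, and "regex" matching is assumed to behave like `re.search`.

```python
import re


def match_var_names(all_names, var_names, filter_vars=None):
    """Select names according to the documented var_names / filter_vars rules."""
    if isinstance(var_names, str):
        var_names = [var_names]
    # Names prefixed by ~ are exclusions; the rest are inclusions.
    excluded = [v[1:] for v in var_names if v.startswith("~")]
    included = [v for v in var_names if not v.startswith("~")]

    def matches(name, pattern):
        if filter_vars is None:
            return name == pattern                       # exact variable name
        if filter_vars == "like":
            return pattern in name                       # substring match
        if filter_vars == "regex":
            return re.search(pattern, name) is not None  # regular expression
        raise ValueError(f"unknown filter_vars value: {filter_vars!r}")

    # With only exclusions given, start from every available name.
    keep = [n for n in all_names
            if not included or any(matches(n, p) for p in included)]
    return [n for n in keep if not any(matches(n, p) for p in excluded)]


names = ["Intercept", "x1", "x2", "sigma"]  # hypothetical variable names
```

For instance, `match_var_names(names, "x", filter_vars="like")` keeps `x1` and `x2`, while `match_var_names(names, "~sigma")` keeps everything except `sigma`.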

kulprit.plot_forest(ppi, submodels=None, include_reference=False, var_names=None, filter_vars=None, coords=None, sample_dims=None, point_estimate=None, ci_kind=None, ci_probs=None, labels=None, shade_label='__variable__', plot_collection=None, backend=None, labeller=None, aes_by_visuals=None, visuals=None, stats=None, **pc_kwargs)

Plot 1D marginal credible intervals in a single plot.

This function is a thin wrapper around arviz_plots.plot_forest() that prepares the data from a kulprit.ProjectionPredictive object.

Parameters:
  • ppi (kulprit.ProjectionPredictive object)

  • submodels (list of {int, str}, optional) – List of submodel sizes or names to be plotted.

  • include_reference (bool, default False) – Whether to include the reference model in the plot.

  • var_names (str or list of str, optional) – One or more variables to be plotted. Prefix the variables by ~ when you want to exclude them from the plot.

  • filter_vars ({None, “like”, “regex”}, default None) – If None, interpret var_names as the real variable names. If “like”, interpret var_names as substrings of the real variable names. If “regex”, interpret var_names as regular expressions on the real variable names.

  • group (str, default "posterior") – Group to be plotted.

  • coords (dict, optional)

  • sample_dims (str or sequence of hashable, optional) – Dimensions to reduce unless mapped to an aesthetic. Defaults to rcParams["data.sample_dims"]

  • combined (bool, default False) – Whether to plot intervals for each chain or not. Ignored when the “chain” dimension is not present.

  • point_estimate ({"mean", "median", "mode"}, optional) – Which point estimate to plot. Defaults to rcParams["stats.point_estimate"]

  • ci_kind ({"eti", "hdi"}, optional) – Which credible interval to use. Defaults to rcParams["stats.ci_kind"]

  • ci_probs ((float, float), optional) – Indicates the probabilities that should be contained within the plotted credible intervals. It should be sorted as the elements refer to the probabilities of the “trunk” and “twig” elements. Defaults to (0.5, rcParams["stats.ci_prob"])

  • labels (sequence of str, optional) – Sequence with the dimensions to be labelled in the plot. By default all dimensions except “chain” and “model” (if present). The order of labels is ignored, only elements being present in it matters. It can include the special “__variable__” indicator, and does so by default.

  • shade_label (str, default "__variable__") – Element of labels that should be used to add shading horizontal strips to the plot. Note that labels and credible intervals are plotted in different plots. The shading is applied to both plots, and the spacing between them is set to 0 if possible, which is not always the case (one notable example being matplotlib’s constrained layout).

  • plot_collection (PlotCollection, optional)

  • backend ({"matplotlib", "bokeh"}, optional)

  • labeller (labeller, optional)

  • aes_by_visuals (mapping of {str : sequence of str or False}, optional) –

    Mapping of visuals to aesthetics that should use their mapping in plot_collection when plotted. Valid keys are the same as for visuals except “ticklabels” and “remove_axis” which do not apply, and “twig” and “trunk” which take the same aesthetics through the “credible_interval” key.

    By default, aesthetic mappings are generated for: y, alpha, overlay and color (if multiple models are present). All aesthetic mappings but alpha are applied to both the credible intervals and the point estimate; overlay is applied to labels; and both overlay and alpha are applied to the shade.

    “overlay” is a dummy aesthetic to trigger looping over variables and/or dimensions using all aesthetics in every iteration. “alpha” gets two values (0, 0.3) in order to trigger the alternate shading effect.

  • visuals (mapping of {str : mapping or bool}, optional) –

    Valid keys are:

    • trunk, twig -> passed to line_x()

    • point_estimate -> passed to scatter_x()

    • labels -> passed to annotate_label()

    • shade -> passed to fill_between_y()

    • ticklabels -> passed to xticks()

    • remove_axis -> not passed anywhere, can only take False as value to skip calling remove_axis()

  • stats (mapping, optional) –

    Valid keys are:

    • trunk, twig -> passed to eti or hdi

    • point_estimate -> passed to mean, median or mode

  • **pc_kwargs – Passed to arviz_plots.PlotCollection.grid

Return type:

PlotCollection
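
The pair of probabilities in ci_probs produces two nested intervals per variable: the narrower “trunk” and the wider “twig”. The sketch below computes both as equal-tailed intervals ("eti") from a list of draws. It is a plain-Python illustration, not the routine used by arviz_plots: the function `eti`, the linear-interpolation quantile rule, the draws and the probabilities (0.5, 0.9) are all assumptions chosen for the example.

```python
def eti(samples, prob):
    """Equal-tailed interval covering `prob` of the samples."""
    ordered = sorted(samples)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        frac = pos - lower
        return ordered[lower] + frac * (ordered[upper] - ordered[lower])

    tail = (1 - prob) / 2  # probability mass left out on each side
    return quantile(tail), quantile(1 - tail)


# Hypothetical draws: an evenly spaced grid from 0 to 1.
draws = [i / 100 for i in range(101)]
trunk = eti(draws, 0.5)  # first element of ci_probs
twig = eti(draws, 0.9)   # second element of ci_probs
```

Keeping the probabilities sorted guarantees that the trunk lies inside the twig, which is what lets a forest plot draw them as a thick and a thin segment on the same line.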