API Reference
This reference provides detailed documentation for user functions in the current release of kulprit.
kulprit
Kulprit.
Kullback-Leibler projections for Bayesian model selection.
- class kulprit.ProjectionPredictive(model, idata, rng=456)
Projection Predictive class from which we perform the model selection procedure.
Parameters:
- model : Bambi model
The reference model to project.
- idata : InferenceData or DataTree
The result of fitting the reference model.
- rng : RandomState
Random number generator used for sampling from the posterior predictive if the group is not present in idata. Defaults to 456.
- compare(stats='elpd', min_model_size=0, round_to=None)
Return a DataFrame with the performance stats of the reference and submodels.
Parameters:
- stats : str
The statistic to compute. Defaults to “elpd”. Valid values are:
* “elpd”: expected log (pointwise) predictive density (ELPD).
* “mlpd”: mean log predictive density (MLPD), that is, the ELPD divided by the number of observations.
* “gmpd”: geometric mean predictive density (GMPD), that is, exp(MLPD). For discrete response families the GMPD is bounded by zero and one.
- min_model_size : int
The minimum size of the submodels to compare. Defaults to 0, which means the intercept-only model is included in the comparison.
- round_to : int
Number of decimals used to round results. Defaults to None.
Returns:
- DataFrame
A DataFrame with the ELPD and standard error of the submodels and the reference model. The index of the DataFrame is the term names of the submodels, and the first row is the reference model.
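As a rough illustration of how the three `stats` options relate to each other, the sketch below computes each one from a list of pointwise log predictive densities. The function name `summarize_lpd` and the input values are hypothetical, and how the pointwise values are estimated is outside the scope of this sketch.

```python
import math


def summarize_lpd(pointwise_lpd):
    """Return ELPD, MLPD and GMPD for pointwise log predictive densities."""
    n_obs = len(pointwise_lpd)
    elpd = math.fsum(pointwise_lpd)  # sum over observations
    mlpd = elpd / n_obs              # ELPD divided by the number of observations
    gmpd = math.exp(mlpd)            # exp(MLPD)
    return {"elpd": elpd, "mlpd": mlpd, "gmpd": gmpd}


# Hypothetical log-probabilities of four observations under a discrete model
lpd = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
stats = summarize_lpd(lpd)
print(stats)
```

Because every pointwise value here is the log of a probability, the resulting GMPD falls between zero and one, which matches the note on discrete response families.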
- project(method='forward', user_terms=None, num_samples=400, num_clusters=20, early_stop=None, require_lower_terms=True, tolerance=1)
Perform model projection.
Parameters:
- method : str
The search method to employ, either “forward” for a forward search, or “l1” for an L1-regularized search. Ignored if “user_terms” is provided.
- user_terms : list of list of str
If a nested list of terms is provided, the models with those terms are projected directly.
- num_samples : int
The number of samples to draw from the posterior predictive distribution for the projection procedure and ELPD computation. Defaults to 400.
- num_clusters : int
The number of clusters to use during the forward search. Defaults to 20. If None, the number of clusters is set to the number of samples. If num_clusters is larger than num_samples, it is set to num_samples.
- early_stop : str or int, optional
Whether to stop the search early. If an integer is provided, the search stops when the submodel size is equal to the integer. If a string is provided, the search stops when the difference in ELPD between the reference model and the submodel is small. There are two criteria to define what small is, “mean” and “se”. The “mean” criterion stops the search when the difference in ELPD between the reference model and the submodel is smaller than 4. The “se” criterion stops the search when the ELPD of the submodel is within one standard error of the reference model. Defaults to None.
- require_lower_terms : bool
Include higher-order interactions only if all lower-order interactions and main effects are already in the subset. Defaults to True. Ignored if user_terms is provided or if the method is not “forward”.
- tolerance : float
The tolerance for the optimization procedure. Defaults to 1. Decreasing this value will increase the accuracy of the projection at the cost of speed.
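Two of the rules above can be written out as a minimal sketch: how `num_clusters` is resolved against `num_samples`, and what `require_lower_terms` asks of a candidate term. The helper names are hypothetical and this is not kulprit's implementation. It also assumes that interaction terms are written with a ":" separator (as in formula notation) and that lower-order terms keep their components in the same order.

```python
from itertools import combinations


def resolve_num_clusters(num_samples, num_clusters):
    """Apply the documented rules: None means num_samples, and the value is capped at num_samples."""
    if num_clusters is None:
        return num_samples
    return min(num_clusters, num_samples)


def lower_order_terms(term, sep=":"):
    """List the main effects and lower-order interactions contained in `term`."""
    parts = term.split(sep)
    return [
        sep.join(combo)
        for size in range(1, len(parts))
        for combo in combinations(parts, size)
    ]


def is_eligible(term, selected, sep=":"):
    """A term may enter the subset only if all its lower-order terms are already selected."""
    return all(lower in selected for lower in lower_order_terms(term, sep))


print(resolve_num_clusters(400, 20))    # default settings
print(lower_order_terms("x:y:z"))       # main effects, then two-way interactions
print(is_eligible("x:z", {"x", "z"}))   # both main effects already selected
```

Main effects have no lower-order terms, so under this sketch they are always eligible.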
- select(criterion='mean')
Select the smallest submodel.
The selection is based on comparing the ELPDs of the reference and submodels.
- Parameters:
criterion (str) – The criterion to use for selecting the best submodel. Either “mean” or “se”. The “mean” criterion selects the smallest submodel with an ELPD that is within 4 units of the reference model. The “se” criterion selects the smallest submodel with an ELPD that is within one standard error of the reference model.
- Returns:
The selected submodel.
- Return type:
SubModel
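The selection rule can be illustrated with a small standalone sketch. The function `select_smallest`, its `(size, elpd)` input format, and the ELPD values are hypothetical. Which standard error the “se” criterion uses, and how the boundary case is treated, are assumptions here rather than statements about kulprit's internals.

```python
def select_smallest(reference_elpd, submodels, criterion="mean", se=None):
    """Return the size of the smallest submodel that satisfies `criterion`.

    `submodels` is a list of (size, elpd) pairs. `se` is the standard error
    used by the "se" criterion (assumed to be supplied by the caller).
    """
    if criterion == "mean":
        threshold = 4.0
    elif criterion == "se":
        if se is None:
            raise ValueError('criterion="se" needs a standard error')
        threshold = se
    else:
        raise ValueError(f"unknown criterion: {criterion!r}")

    # Visit submodels from smallest to largest and stop at the first match
    for size, elpd in sorted(submodels, key=lambda pair: pair[0]):
        if reference_elpd - elpd <= threshold:
            return size
    return None  # no submodel is close enough to the reference


# Hypothetical ELPD values for a reference model and four nested submodels
reference = -100.0
candidates = [(0, -130.0), (1, -108.0), (2, -103.0), (3, -100.5)]
print(select_smallest(reference, candidates, criterion="mean"))
print(select_smallest(reference, candidates, criterion="se", se=1.0))
```

With these numbers the “mean” criterion accepts a smaller submodel than the “se” criterion, because a standard error of 1.0 is a tighter threshold than 4 ELPD units.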
- kulprit.plot_compare(cmp_df, relative_scale=True, backend=None, visuals=None, **pc_kwargs)
Summary plot for model comparison.
Models are compared based on their expected log pointwise predictive density (ELPD). Higher ELPD values indicate better predictive performance.
The ELPD is estimated by Pareto smoothed importance sampling leave-one-out cross-validation (LOO). Details are presented in [1] and [2].
The ELPD can only be interpreted in relative terms, and differences in ELPD smaller than 4 are considered negligible [3].
- Parameters:
cmp_df (pandas.DataFrame) – The result of kulprit’s compare method.
relative_scale (bool, optional) – If True, scale the ELPD values relative to the reference model. Defaults to True.
backend ({"bokeh", "matplotlib", "plotly"}) – Select plotting backend. Defaults to ArviZ’s rcParams[“plot.backend”].
visuals (mapping of {str : mapping or bool}, optional) –
Valid keys are:
point_estimate -> passed to scatter()
error_bar -> passed to line()
ref_line -> passed to hline()
ref_band -> passed to hspan()
similar_line -> passed to hline(). Defaults to False.
labels -> passed to xticks() and yticks()
title -> passed to title(). Defaults to False.
ticklabels -> passed to yticks()
**pc_kwargs – Passed to arviz_plots.PlotCollection
- Return type:
PlotCollection
References
- kulprit.plot_dist(ppi, submodels=None, include_reference=False, var_names=None, filter_vars=None, coords=None, sample_dims=None, kind=None, point_estimate=None, ci_kind=None, ci_prob=None, plot_collection=None, backend=None, labeller=None, aes_by_visuals=None, visuals=None, stats=None, **pc_kwargs)
Plot 1D marginal densities.
This function is a thin wrapper around arviz_plots.plot_dist() that prepares the data from a kulprit.ProjectionPredictive object.
- Parameters:
ppi (kulprit.ProjectionPredictive object)
submodels (list of {int, str}, optional) – List of submodel sizes or names to be plotted.
include_reference (bool, default False) – Whether to include the reference model in the plot.
var_names (str or list of str, optional) – One or more variables to be plotted. Prefix the variables by ~ when you want to exclude them from the plot.
filter_vars ({None, “like”, “regex”}, default=None) – If None, interpret var_names as the real variables names. If “like”, interpret var_names as substrings of the real variables names. If “regex”, interpret var_names as regular expressions on the real variables names.
coords (dict, optional)
sample_dims (str or sequence of hashable, optional) – Dimensions to reduce unless mapped to an aesthetic. Defaults to rcParams["data.sample_dims"]
kind ({"kde", "hist", "dot", "ecdf"}, optional) – How to represent the marginal density. Defaults to rcParams["plot.density_kind"]
point_estimate ({"mean", "median", "mode"}, optional) – Which point estimate to plot. Defaults to rcParams["stats.point_estimate"]
ci_kind ({"eti", "hdi"}, optional) – Which credible interval to use. Defaults to rcParams["stats.ci_kind"]
ci_prob (float, optional) – Indicates the probability that should be contained within the plotted credible interval. Defaults to rcParams["stats.ci_prob"]
plot_collection (PlotCollection, optional)
backend ({"matplotlib", "bokeh"}, optional)
labeller (labeller, optional)
aes_by_visuals (mapping of {str : sequence of str}, optional) –
Mapping of visuals to aesthetics that should use their mapping in plot_collection when plotted. Valid keys are the same as for visuals.
With a single model, no aesthetic mappings are generated by default: each variable+coord combination gets a plot, but they all look the same unless there are user-provided aesthetic mappings. With multiple models, plot_dist maps “color” and “y” to the “model” dimension.
By default, all aesthetics but “y” are mapped to the density representation, and if multiple models are present, “color” and “y” are mapped to the credible interval and the point estimate.
When “point_estimate” key is provided but “point_estimate_text” isn’t, the values assigned to the first are also used for the second.
visuals (mapping of {str : mapping or bool}, optional) –
Valid keys are:
dist -> depending on the value of kind, passed to: line_xy() for “kde”, ecdf_line() for “ecdf”, step_hist() for “hist”
face -> visual that fills the area under the marginal distribution representation. Defaults to False. Depending on the value of kind, it is passed to: fill_between_y() for “kde” or “ecdf”, hist() for “hist”
credible_interval -> passed to line_x(). Defaults to False.
point_estimate -> passed to scatter_x(). Defaults to False.
point_estimate_text -> passed to point_estimate_text(). Defaults to False.
title -> passed to labelled_title()
rug -> passed to scatter_x(). Defaults to False.
remove_axis -> not passed anywhere, can only be False to skip calling this function
stats (mapping, optional) –
Valid keys are:
dist -> passed to kde, ecdf, …
credible_interval -> passed to eti or hdi
point_estimate -> passed to mean, median or mode
**pc_kwargs – Passed to arviz_plots.PlotCollection.wrap
- Return type:
PlotCollection
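The `var_names` and `filter_vars` semantics described above can be sketched in a few lines. This is a simplified illustration with a hypothetical helper, `filter_names`, and hypothetical variable names. It is not the code ArviZ or kulprit runs, and edge cases may be handled differently there.

```python
import re


def filter_names(all_names, var_names, filter_vars=None):
    """Pick variable names following the documented var_names/filter_vars rules."""
    if isinstance(var_names, str):
        var_names = [var_names]
    # A leading "~" marks a pattern for exclusion
    excluded = [name[1:] for name in var_names if name.startswith("~")]
    included = [name for name in var_names if not name.startswith("~")]

    def matches(name, pattern):
        if filter_vars is None:
            return name == pattern                        # exact variable name
        if filter_vars == "like":
            return pattern in name                        # substring match
        if filter_vars == "regex":
            return re.search(pattern, name) is not None   # regular expression
        raise ValueError(f"unknown filter_vars: {filter_vars!r}")

    # With no inclusion patterns, start from every variable
    keep = [n for n in all_names if not included or any(matches(n, p) for p in included)]
    return [n for n in keep if not any(matches(n, p) for p in excluded)]


names = ["Intercept", "x", "x_sq", "z", "sigma"]  # hypothetical variables
print(filter_names(names, ["~sigma"]))
print(filter_names(names, "x", filter_vars="like"))
print(filter_names(names, "_sq$", filter_vars="regex"))
```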
- kulprit.plot_forest(ppi, submodels=None, include_reference=False, var_names=None, filter_vars=None, coords=None, sample_dims=None, point_estimate=None, ci_kind=None, ci_probs=None, labels=None, shade_label='__variable__', plot_collection=None, backend=None, labeller=None, aes_by_visuals=None, visuals=None, stats=None, **pc_kwargs)
Plot 1D marginal credible intervals in a single plot.
This function is a thin wrapper around arviz_plots.plot_forest() that prepares the data from a kulprit.ProjectionPredictive object.
- Parameters:
ppi (kulprit.ProjectionPredictive object)
submodels (list of {int, str}, optional) – List of submodel sizes or names to be plotted.
include_reference (bool, default False) – Whether to include the reference model in the plot.
var_names (str or list of str, optional) – One or more variables to be plotted. Prefix the variables by ~ when you want to exclude them from the plot.
filter_vars ({None, “like”, “regex”}, default None) – If None, interpret var_names as the real variables names. If “like”, interpret var_names as substrings of the real variables names. If “regex”, interpret var_names as regular expressions on the real variables names.
group (str, default "posterior") – Group to be plotted.
coords (dict, optional)
sample_dims (str or sequence of hashable, optional) – Dimensions to reduce unless mapped to an aesthetic. Defaults to rcParams["data.sample_dims"]
combined (bool, default False) – Whether to plot intervals for each chain or not. Ignored when the “chain” dimension is not present.
point_estimate ({"mean", "median", "mode"}, optional) – Which point estimate to plot. Defaults to rcParams["stats.point_estimate"]
ci_kind ({"eti", "hdi"}, optional) – Which credible interval to use. Defaults to rcParams["stats.ci_kind"]
ci_probs ((float, float), optional) – Indicates the probabilities that should be contained within the plotted credible intervals. It should be sorted, as the elements refer to the probabilities of the “trunk” and “twig” elements. Defaults to (0.5, rcParams["stats.ci_prob"])
labels (sequence of str, optional) – Sequence with the dimensions to be labelled in the plot. By default, all dimensions except “chain” and “model” (if present) are labelled. The order of labels is ignored; only which elements are present matters. It can include the special “__variable__” indicator, and does so by default.
shade_label (str, default "__variable__") – Element of labels that should be used to add shading horizontal strips to the plot. Note that labels and credible intervals are plotted in different plots. The shading is applied to both plots, and the spacing between them is set to 0 if possible, which is not always the case (one notable example being matplotlib’s constrained layout).
plot_collection (PlotCollection, optional)
backend ({"matplotlib", "bokeh"}, optional)
labeller (labeller, optional)
aes_by_visuals (mapping of {str : sequence of str or False}, optional) –
Mapping of visuals to aesthetics that should use their mapping in plot_collection when plotted. Valid keys are the same as for visuals except “ticklabels” and “remove_axis” which do not apply, and “twig” and “trunk” which take the same aesthetics through the “credible_interval” key.
By default, aesthetic mappings are generated for: y, alpha, overlay and color (if multiple models are present). All aesthetic mappings but alpha are applied to both the credible intervals and the point estimate; overlay is applied to labels; and both overlay and alpha are applied to the shade.
”overlay” is a dummy aesthetic to trigger looping over variables and/or dimensions using all aesthetics in every iteration. “alpha” gets two values (0, 0.3) in order to trigger the alternate shading effect.
visuals (mapping of {str : mapping or bool}, optional) –
Valid keys are:
trunk, twig -> passed to line_x()
point_estimate -> passed to scatter_x()
labels -> passed to annotate_label()
shade -> passed to fill_between_y()
ticklabels -> passed to xticks()
remove_axis -> not passed anywhere, can only take False as value to skip calling remove_axis()
stats (mapping, optional) –
Valid keys are:
trunk, twig -> passed to eti or hdi
point_estimate -> passed to mean, median or mode
**pc_kwargs – Passed to arviz_plots.PlotCollection.grid
- Return type:
PlotCollection
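To make the `ci_probs` pair concrete, the sketch below computes a narrow “trunk” interval and a wider “twig” interval as equal-tailed intervals (the “eti” option) from a list of samples. The helper names, the linear-interpolation quantile, the synthetic samples, and the twig probability of 0.9 are all assumptions chosen for illustration. The plotting functions compute their intervals through their own statistics backend.

```python
def quantile(sorted_values, q):
    """Quantile of pre-sorted values using linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def eti(samples, prob):
    """Equal-tailed interval containing `prob` of the samples."""
    values = sorted(samples)
    tail = (1 - prob) / 2  # probability left out on each side
    return quantile(values, tail), quantile(values, 1 - tail)


def trunk_and_twig(samples, ci_probs):
    """Return both intervals for a (trunk_prob, twig_prob) pair."""
    trunk_prob, twig_prob = ci_probs
    return {"trunk": eti(samples, trunk_prob), "twig": eti(samples, twig_prob)}


samples = list(range(101))  # synthetic draws: 0, 1, ..., 100
intervals = trunk_and_twig(samples, (0.5, 0.9))
print(intervals)
```

The trunk is nested inside the twig because it covers a smaller probability, which is why the two elements of `ci_probs` are expected in sorted order.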