# Summarize pdqr-functions with summ_*()

The idea behind summary functions is to take one or more pdqr-functions and return a summary value (which need not be a number). Argument method selects a function-specific algorithm of computation.

Note that some summary functions can accumulate pdqr approximation error (summ_moment(), for example). For better precision, increase the number of intervals in the piecewise-linear density using either the n argument for density() in new_*() or the n_grid argument in as_*().
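
As a rough illustration of this trade-off, the sketch below approximates the same standard normal distribution on a coarse and on a fine grid and compares the computed variance with the exact value of 1. The two n_grid values are arbitrary choices made for this example.

```r
library(pdqr)

# The same distribution approximated on a coarse and on a fine grid
d_coarse <- as_d(dnorm, n_grid = 101)
d_fine <- as_d(dnorm, n_grid = 100001)

# Second central moment (variance); the exact value for dnorm is 1
var_coarse <- summ_moment(d_coarse, order = 2, central = TRUE)
var_fine <- summ_moment(d_fine, order = 2, central = TRUE)

# Absolute errors of the two approximations
abs(var_coarse - 1)
abs(var_fine - 1)
```

A finer grid costs more memory and computation time, so it is worth increasing only when the summary at hand is sensitive to approximation error.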

We will use the following distributions throughout this vignette:

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)
my_beta_mix <- form_mix(list(my_beta, my_beta + 1))

Although all three are continuous, discrete distributions are also fully supported.

## Basic numerical summary

### Moments

summ_moment() has extra arguments for controlling the nature of the moment (they can be combined).
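
A minimal sketch of these arguments, using the beta distribution defined at the start of the vignette. The chosen orders are only for illustration.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# Raw first moment (the mean)
m1 <- summ_moment(my_beta, order = 1)

# Central second moment (the variance)
m2_central <- summ_moment(my_beta, order = 2, central = TRUE)

# Central and standard third moment (the skewness)
m3_std <- summ_moment(my_beta, order = 3, central = TRUE, standard = TRUE)

# Absolute central third moment: the flags can be combined freely
m3_abs <- summ_moment(my_beta, order = 3, central = TRUE, absolute = TRUE)

c(m1 = m1, m2_central = m2_central, m3_std = m3_std, m3_abs = m3_abs)
```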

There are wrappers for the most common moments: summ_skewness() and summ_kurtosis().
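
For example, the wrappers can be applied to the normal distribution defined earlier. For a normal distribution the skewness and the excess kurtosis should both be close to zero, up to approximation error.

```r
library(pdqr)

my_norm <- as_d(dnorm, mean = 0.5)

# Skewness: central and standard moment of order 3
skew <- summ_skewness(my_norm)

# Kurtosis is reported as excess kurtosis by default
kurt_excess <- summ_kurtosis(my_norm)

# Set excess = FALSE to get the non-excess value (larger by 3)
kurt_raw <- summ_kurtosis(my_norm, excess = FALSE)

c(skew = skew, kurt_excess = kurt_excess, kurt_raw = kurt_raw)
```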

### Quantiles

summ_quantile(f, probs) is essentially a stricter version of as_q(f)(probs).
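
The sketch below compares the two calls. For valid probabilities they should agree. The behavior on out-of-range input is an assumption derived from the "stricter" description rather than something this section states, so the call is wrapped in tryCatch() and nothing is asserted about it.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
probs <- c(0.05, 0.5, 0.95)

# Both calls return quantiles for valid probabilities
q_summ <- summ_quantile(my_beta, probs = probs)
q_direct <- as_q(my_beta)(probs)
all.equal(q_summ, q_direct)

# Assumption: summ_quantile() rejects probabilities outside [0, 1].
# tryCatch() keeps the script running whichever way the call behaves.
strict_result <- tryCatch(
  summ_quantile(my_beta, probs = 1.5),
  error = function(e) conditionMessage(e)
)
strict_result
```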

### Entropy

summ_entropy() computes differential entropy (which can be negative) for “continuous” type pdqr-functions, and information entropy for “discrete” type.
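
A short sketch of both cases follows. The narrow uniform distribution and the fair die are hypothetical examples introduced here to show a negative differential entropy and a discrete entropy with a known closed form.

```r
library(pdqr)

my_norm <- as_d(dnorm, mean = 0.5)

# Differential entropy of a continuous distribution
ent_norm <- summ_entropy(my_norm)

# Closed form for a normal distribution with sd = 1, for comparison
ent_norm_exact <- 0.5 * log(2 * pi * exp(1))

# A uniform distribution on [0, 0.5] has density 2, so its differential
# entropy is log(0.5), which is negative
narrow_unif <- as_d(dunif, min = 0, max = 0.5)
ent_unif <- summ_entropy(narrow_unif)

# Information entropy of a discrete distribution: a fair six-sided die
fair_die <- new_d(data.frame(x = 1:6, prob = rep(1 / 6, 6)), type = "discrete")
ent_die <- summ_entropy(fair_die)

c(ent_norm = ent_norm, ent_norm_exact = ent_norm_exact,
  ent_unif = ent_unif, ent_die = ent_die)
```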

summ_entropy2() computes an entropy-based summary of the relation between a pair of distributions. There are two methods: the default “relative” (relative entropy, i.e. Kullback-Leibler divergence) and “cross” (cross-entropy). Differing supports are handled by using the clip value (default exp(-20)) instead of 0 inside log(). The order of inputs matters: summ_entropy2() uses the support of the first pdqr-function as the integration/summation reference.
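
The following sketch shows both methods and the effect of input order on the two distributions defined earlier. The alternative clip value is an arbitrary choice for illustration.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

# Relative entropy (Kullback-Leibler divergence) is the default method
kl_beta_norm <- summ_entropy2(my_beta, my_norm)

# Swapping the inputs changes the reference support and hence the result
kl_norm_beta <- summ_entropy2(my_norm, my_beta)

# Cross-entropy
cross_beta_norm <- summ_entropy2(my_beta, my_norm, method = "cross")

# A different clip value changes how regions of zero density are penalized
kl_norm_beta_clip <- summ_entropy2(my_norm, my_beta, clip = exp(-10))

c(kl_beta_norm = kl_beta_norm, kl_norm_beta = kl_norm_beta,
  cross_beta_norm = cross_beta_norm, kl_norm_beta_clip = kl_norm_beta_clip)
```

Here my_norm puts substantial probability outside the support of my_beta, so the second call is the one that depends most on clip.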

## Regions

Distributions can be summarized with regions: unions of closed intervals. A region is represented as a data frame in which each row is an interval and the two columns “left” and “right” hold the left and right interval edges respectively.
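
As a small illustration of this representation, the sketch below builds a region by hand and inspects it with pdqr's region_*() helpers, which are not otherwise covered in this section. The interval edges are arbitrary.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# A region made of two disjoint closed intervals: [0, 0.1] and [0.5, 0.6]
region <- data.frame(left = c(0, 0.5), right = c(0.1, 0.6))

# Total width of the region
width <- region_width(region)

# Which points lie inside the region
is_in <- region_is_in(region, c(0.05, 0.3, 0.55))

# Probability that my_beta assigns to the region
prob <- region_prob(region, my_beta)

list(width = width, is_in = is_in, prob = prob)
```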

### Single interval

summ_interval() summarizes the input pdqr-function with a single interval based on the desired coverage level supplied in argument level. It has three methods:

• Default “minwidth”: interval with total probability of level that has minimum width.
• “percentile”: 0.5*(1-level) and 1 - 0.5*(1-level) quantiles.
• “sigma”: interval centered at the mean of the distribution. Its left and right edges lie at a distance from the center equal to the standard deviation multiplied by the critical value of level (computed from the normal distribution). It corresponds to the classical confidence interval for a sample under the assumption of normality.
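
The three methods can be compared side by side. This is a sketch on the beta distribution defined earlier, with level = 0.9 chosen arbitrarily.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# Default "minwidth" method
int_minwidth <- summ_interval(my_beta, level = 0.9)

# "percentile": the 0.05 and 0.95 quantiles for level = 0.9
int_percentile <- summ_interval(my_beta, level = 0.9, method = "percentile")

# "sigma": mean plus/minus a multiple of the standard deviation
int_sigma <- summ_interval(my_beta, level = 0.9, method = "sigma")

# Stack the one-row data frames for comparison
rbind(minwidth = int_minwidth, percentile = int_percentile, sigma = int_sigma)
```

For a skewed distribution such as this one, the “minwidth” interval should be no wider than the “percentile” one.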

### Highest density region

summ_hdr() computes the highest density region (HDR) of a distribution: the set of intervals with the lowest total width among all sets with total probability not less than the input level. For a unimodal distribution it is essentially the same as summ_interval() with the “minwidth” method.
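
The difference from a single interval shows up on a bimodal distribution. The sketch below uses the mixture defined at the start of the vignette; level = 0.9 is an arbitrary choice.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_beta_mix <- form_mix(list(my_beta, my_beta + 1))

# Unimodal case: HDR is a single interval
hdr_unimodal <- summ_hdr(my_beta, level = 0.9)

# Bimodal case: HDR can consist of several intervals
hdr_bimodal <- summ_hdr(my_beta_mix, level = 0.9)

# A single "minwidth" interval for the same mixture, for comparison
int_bimodal <- summ_interval(my_beta_mix, level = 0.9)

hdr_unimodal
hdr_bimodal
int_bimodal
```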

## Separation and classification

### Separation

Function summ_separation() computes a threshold that optimally separates the distributions represented by a pair of input pdqr-functions. In other words, summ_separation() solves a binary classification problem with a one-dimensional linear classifier: values not greater than the threshold are assigned to one class, and values greater than the threshold to the other. The order of input functions doesn’t matter.
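
A minimal sketch on the two distributions defined earlier. The default method is “KS”; “F1” is shown as one alternative.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

# Default method "KS"
thr_ks <- summ_separation(my_beta, my_norm)

# The same pair with inputs swapped
thr_ks_swapped <- summ_separation(my_norm, my_beta)

# Threshold that optimizes the F1 score instead
thr_f1 <- summ_separation(my_beta, my_norm, method = "F1")

c(thr_ks = thr_ks, thr_ks_swapped = thr_ks_swapped, thr_f1 = thr_f1)
```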

### Classification metrics

Functions summ_classmetric() and summ_classmetric_df() compute metric(s) of a classification setup similar to the one used in summ_separation(). Here the classifier threshold must be supplied, and the order of inputs matters. Classification is assumed to work as follows: any x value not greater than the threshold is classified as “negative”, and any greater value as “positive”. Classification metrics are computed from two pdqr-functions: f, which represents the distribution of values that should be classified as “negative” (“true negative”), and g, which does the same for “positive” (“true positive”).
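
In the sketch below my_beta plays the role of the “negative” distribution f and my_norm the “positive” distribution g. The thresholds and the selected metrics are arbitrary choices for illustration.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

# True negative rate: probability that a "negative" value is <= threshold
tnr <- summ_classmetric(my_beta, my_norm, threshold = 0.5, method = "TNR")

# True positive rate: probability that a "positive" value is > threshold
tpr <- summ_classmetric(my_beta, my_norm, threshold = 0.5, method = "TPR")

# Several thresholds and several metrics at once, returned as a data frame
metrics_df <- summ_classmetric_df(
  my_beta, my_norm,
  threshold = c(0.2, 0.5, 0.8),
  method = c("TPR", "TNR", "Acc")
)

c(tnr = tnr, tpr = tpr)
metrics_df
```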

With summ_roc() and summ_rocauc() one can compute a data frame of ROC curve points and the ROC AUC value respectively. There is also a roc_plot() function for predefined plotting of the ROC curve.
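
A brief sketch of these functions with the same “negative” and “positive” distributions as above. Plotting is guarded by interactive() only so that the example also runs in batch mode.

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

# Data frame of ROC curve points, one row per threshold
roc <- summ_roc(my_beta, my_norm)
head(roc)

# Area under the ROC curve
auc <- summ_rocauc(my_beta, my_norm)
auc

# Predefined plot of the ROC curve
if (interactive()) {
  roc_plot(roc)
}
```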

## Ordering

‘pdqr’ has functions that can order a set of distributions. They are summ_order(), summ_sort(), and summ_rank(), which are analogues of order(), sort(), and rank() respectively. They take a list of pdqr-functions as input, establish their ordering based on the specified method, and return the desired output.

There are two sets of methods:

• Method “compare” uses the following ordering relation: pdqr-function f is greater than g if and only if P(f >= g) > 0.5, or in ‘pdqr’ code summ_prob_true(f >= g) > 0.5. This method orders input based on this relation and the order() function. Notes:
    • This relation doesn’t strictly define an ordering because it is not transitive. This is resolved by first preordering the input list with method “mean” and then calling order().
    • Because comparing two pdqr-functions can be time-consuming, this method becomes rather slow as the number of distributions grows. To increase computation speed (sacrificing a little approximation precision), use fewer intervals in the piecewise-linear approximation of density for “continuous” type pdqr-functions.
• Methods “mean”, “median”, and “mode” are based on summ_center(): the ordering of distributions is defined as the ordering of the corresponding measures of distribution center.

# Here the only clear "correct" ordering is that a <= b.
f_list <- list(a = my_beta, b = my_beta + 1, c = my_norm)

# Returns an integer vector representing a permutation which rearranges f_list
# in desired order
summ_order(f_list, method = "compare")
#> [1] 1 3 2

# In this particular case of f_list all orderings agree with each other, but
# generally this is not the case: for any pair of methods there is a case
# when they disagree with each other
summ_order(f_list, method = "mean")
#> [1] 1 3 2
summ_order(f_list, method = "median")
#> [1] 1 3 2
summ_order(f_list, method = "mode")
#> [1] 1 3 2

# Use decreasing = TRUE to sort decreasingly
summ_order(f_list, method = "compare", decreasing = TRUE)
#> [1] 2 3 1

# Sort list
summ_sort(f_list)
#> $a
#> Density function of continuous type
#> Support: ~[0, 0.95557] (10000 intervals)
#>
#> $c
#> Density function of continuous type
#> Support: ~[-4.25342, 5.25342] (10000 intervals)
#>
#> $b
#> Density function of continuous type
#> Support: ~[1, 1.95557] (10000 intervals)
summ_sort(f_list, decreasing = TRUE)
#> $b
#> Density function of continuous type
#> Support: ~[1, 1.95557] (10000 intervals)
#>
#> $c
#> Density function of continuous type
#> Support: ~[-4.25342, 5.25342] (10000 intervals)
#>
#> $a
#> Density function of continuous type
#> Support: ~[0, 0.95557] (10000 intervals)

# Rank elements: 1 indicates "the smallest", length(f_list) - "the biggest"
summ_rank(f_list)
#> a b c
#> 1 3 2

## Other

Functions summ_prob_true() and summ_prob_false() should be used to extract probabilities from boolean pdqr-functions, i.e. outputs of basic comparison operators (like >=, ==, etc.):

summ_prob_true(my_beta >= my_norm)
#> [1] 0.416062
summ_prob_false(my_beta >= 2*my_norm)
#> [1] 0.6391

summ_pval() computes p-value(s) of observed statistic(s) based on the distribution. You can compute left, right, or two-sided p-values with methods “left”, “right”, and “both” respectively. By default, p-values for multiple input values are adjusted for multiple comparisons (using stats::p.adjust()):

# By default two-sided p-value is computed
summ_pval(my_beta, obs = 0.7)
#> [1] 0.02186803
summ_pval(my_beta, obs = 0.7, method = "left")
#> [1] 0.989066
summ_pval(my_beta, obs = 0.7, method = "right")
#> [1] 0.01093401

# Multiple values are adjusted with p.adjust() with "holm" method by default
obs_vec <- seq(0, 1, by = 0.1)
summ_pval(my_beta, obs = obs_vec)
#>  [1] 0.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
#>  [6] 1.0000000000 0.4915085377 0.1530761780 0.0255840348 0.0009720023
#> [11] 0.0000000000

# Use adjust = "none" to not adjust
summ_pval(my_beta, obs = obs_vec, adjust = "none")
#>  [1] 0.0000000000 0.2285302047 0.6892806594 0.8403488674 0.4665584871
#>  [6] 0.2187482323 0.0819180896 0.0218680254 0.0031980044 0.0001080003
#> [11] 0.0000000000