Package ‘ggpmisc’ makes it easier to add to plots created using ‘ggplot2’ annotations based on fitted models and other statistics. It does this by wrapping existing model fit and other functions. The same annotations can be produced by calling the model fit functions, extracting the desired estimates and adding them to plots. There are two advantages in wrapping these functions in an extension to package ‘ggplot2’: 1) we ensure the coupling of graphical elements and the annotations by building all elements of the plot using the same data and a consistent grammar and 2) we make it easier to annotate plots to the casual user of R, within the grammar of graphics.
To avoid confusion it is good to make clear what may seem obvious to some: if no plot is needed, then there is no reason to use this package. The values shown as annotations are not computed by ‘ggpmisc’ but instead by the usual modelfit and statistical functions from R and other packages. The same is true for model predictions, residuals, etc. that some of the functions in ‘ggpmisc’ display as lines, segments, or other graphical elements.
It is also important to remember that in most cases data analysis proper should take place before annotated plots for publication are produced. Even though data analysis can benefit from combined numerical and graphical representation of the results, the use I envision for ‘ggpmisc’ is mainly for the production of plots for publication or communication.
Attaching package ‘ggpmisc’ also attaches package ‘ggpp’ as this package was created as a spinoff from ‘ggpmisc’. ‘ggpp’ provides several of the geometries used by default in the statistics described below. Package ‘ggpp’ can be loaded and attached on its own, and has its own documentation.
We load all the packages used in the examples.
library(ggpmisc)
library(tibble)
library(dplyr)
library(quantreg)
eval_nlme < requireNamespace("nlme", quietly = TRUE)
if (eval_nlme) library(nlme)
eval_broom < requireNamespace("broom", quietly = TRUE)
if (eval_broom) library(broom)
eval_broom_mixed < requireNamespace("broom.mixed", quietly = TRUE)
if (eval_broom_mixed) library(broom.mixed)
eval_gginnards < requireNamespace("gginnards", quietly = TRUE)
if (eval_gginnards) library(gginnards)
As we will use text and labels on the plotting area we change the default theme to an uncluttered one.
Common plot annotations based on model fits are model equations,
tests of significance and various indicators of goodness of fit. What
annotations are most useful depend on whether both x and
y are mapped to continuous variables, or y to a
continuous variable, and x to a factor (or vice versa). In some
cases it may be desirable to add an ANOVA table or a summary table as
plot insets. Other annotations computed from data are various summaries
and features such as location of local or global maxima and minima.
Several additional statistics for simple plot annotations computed from
data
but not dependent of fitting of statistical models are
documented in the vignettes and help pages of package ‘ggpp’ even if
these statistics are imported and reexported by ‘ggpmisc’.
Supporting computed and modelfit based annotations in ggplots can be implemented as new statistics that do computations on the plot data and pass the results to existing or new geometries. These new statistics are only a convenience in the sense that similar plots can be achieved by fitting models and operating on the fitted model objects before adding the estimates to the plot by passing them as data directly to geoms in a plot. This approach requires additional coding by users to assemble character strings to be parsed into R expressions. Fitting the models before constructing the plot object can weaken reproducibility as the values in the annotations are not computed at the time the plot is rendered and not guaranteed to use the data being plotted (When using our statistics the user remains responsible for the consistency of methods and model formulas used in different plot layers!).
The statistics defined in ‘ggpmisc’ even if usable with geometries from package ‘ggplot2’, in most cases have as default geometries defined in package ‘ggpp’. These new geometries are better suited for plot annotations than those from ‘ggplot2’, which are designed for data labels.
In user interfaces there is always a compromise between simplicity and flexibility in which the choice of default arguments plays a key role in finding a balance. I have aimed to make everyday use simple without constraining the flexibility inherent to the Grammar of Graphics. However, the assumption is that users desire/need the flexibility inherent in the layered approach of the Grammar of Graphics, and that the layer functions provided will be combined with those from ‘ggplot2’, ‘ggpp’ and other packages extending the Grammar of Graphics as implemented in ‘ggplot2’.
Currently, the core statistics made available by package ‘ggpmisc’
help with reporting parameter and other estimates from fitted models,
both as equations and numbers in annotations and by plotting. Similarly
to ggplot2::stat_smooth()
they call modelfitfunctions to
fit a modelformula to the plotlayer data and, as all ggplot
statistics, return new data
as a data frame that is in turn
passed to a geometry. The statistics are stat_poly_line()
,
stat_poly_eq()
, stat_ma_line()
,
stat_ma_eq()
, stat_quant_line()
,
stat_quant_band()
, stat_quant_eq()
,
stat_fit_residuals()
, stat_fit_deviations()
,
stat_fit_glance()
, stat_fit_augment()
,
stat_fit_tidy()
and stat_fit_tb()
. In
addition, stat_correlation()
helps with the reporting of
Pearson, Kendall and Spearman correlations. Most of these statistics
support orientation flipping, both using parameter
orientation
similarly to package ‘ggplot2’ in recent
versions or through a model formula
with x
as
response variable and y
as explanatory variable.
These statistics support grouping and faceting. In the case of those
that return character strings to be mapped to the label
aesthetic they also return suitable x
and y
,
or npcx
and npcy
values as well as
hjust
and vjust
values. These default values
ensure that the annotations do not spill out the borders of the plotting
area that also minimize overlaps among labels. The automatic computation
of these positions can be adjusted through arguments passed to these
statistics and also overridden by usersupplied manual positions.
Table 1 below summarizes the use of the modelfittings statistics and the type of plotlayer geometries they are most useful with them. The four statistics at the bottom of the table require in their use more effort as the mapping to aesthetics and conversion of numeric values into character strings must be coded in the layer function call, but on the other hand they support all model fit objects catered by package ‘broom’ itself or any of the packages extending ‘broom’ such as ‘broom.mixed’ (see next section).
Statistic  Returned values (default geometry) 
Methods 

stat_poly_eq() 
equation, R^{2}, P, etc.
(text_npc ) 
lm, rlm (1, 2, 7) 
stat_ma_eq() 
equation, R^{2}, P, etc.
(text_npc ) 
lmodel2 (6, 7) 
stat_quant_eq() 
equation, P, etc. (text_npc ) 
rq (1, 3, 4, 7) 
stat_correlation() 
correlation, Pvalue, CI
(text_npc ) 
Pearson (t), Kendall (z), Spearman (S) 
stat_poly_line() 
line + conf. (smooth ) 
lm, rlm (1, 2, 7) 
stat_ma_line() 
line + conf. (smooth ) 
lmodel2 (6, 7) 
stat_quant_line() 
line + conf. (smooth ) 
rq, rqss (1, 3, 4, 7) 
stat_quant_band() 
median + quartiles (smooth ) 
rq, rqss (1, 4, 5, 7) 
stat_fit_residuals() 
residuals (point ) 
lm, rlm, rq (1, 2, 4, 7, 8) 
stat_fit_deviations() 
deviations from observations
(segment ) 
lm, rlm, lqs, rq (1, 2, 4, 7, 9) 
stat_fit_glance() 
equation, R^{2}, P, etc.
(text_npc ) 
those supported by ‘broom’ 
stat_fit_augment() 
predicted and other values (smooth ) 
those supported by ‘broom’ 
stat_fit_tidy() 
fit results, e.g., for equation
(text_npc ) 
those supported by ‘broom’ 
stat_fit_tb() 
ANOVA and summary tables (table_npc ) 
those supported by ‘broom’ 
stat_multcomp() 
Multiple comparisons (label_pairwise ,
table_npc ) 
those supported by glht 
Statistics stat_poly_eq()
, stat_ma_eq()
,
stat_quant_eq()
and stat_correlation()
return
by default ready formatted character strings that can be mapped to the
label
aesthetic individually or concatenated. The
convenience function use_label()
makes this easier by
accepting abbreviated/common names for the component labels, and
handling both the concatenation and mapping to the label
aesthetics. For full flexibility in the assembly of concatenated labels
paste()
or sprintf()
functions can the used
within a call to ggplot2::aes()
. By default the character
strings are formatted ready to be parsed as R expressions. Optionally
they can be formatted as R Markdown, \(\LaTeX\) or plain text as shown in the
table below.
These statistics as of ‘ggpmisc’ (>= 0.5.0) allow additional
flexibility in the model fit method
functions: they can be
any function with a suitable interface, as long as they return a fitted
model object of the expected class, or that inherits from the expected
class. When possible, as in stat_poly_eq()
, the model
formula is retrieved from the model fit object, and this retrieved model
formula overrides, only in the returned value, the one supplied as
argument by the user. In practice, this means that method
can be a function that does model selection. This is specially useful
with grouped data and facets, making it possible for model formula to
vary from group to group or from panel to panel within a single ggplot
object.
The returned values for output.type
different from
numeric
are shown below in Table 2. The variable mapped by
default to the label
aesthetics is indicated with an
asterisk. The returned values for x
and y
are
for the default computed location of the labels. The returned data frame
has one row per group of observations in each panel of the plot.
Function use_label()
provides an easy way of selecting,
possibly combining and mapping the text labels for different parameters
from a model fit, supporting the use of both variable names and
the usual abbreviations or symbols used in statistics as
selection key.
Variable  Key  Mode  poly_eq 
ma_eq 
quant_eq 
correlation 
multcomp 

eq.label  eq  character  y  y  y*  n  n 
rr.label  R2  character  y*  y*  n  y(1)  n 
rr.confint.label  R2.CI  character  y  n  n  n  n 
adj.rr.label  adj.R2  character  y  n  n  n  n 
cor.label  cor  character  n  n  n  y(1)  n 
rho.label  rho  character  n  n  y  y(1)  n 
tau.label  tau  character  n  n  n  y(1)  n 
r.label  R  character  n  n  n  y*  n 
cor.confint.label  cor.CI  character  n  n  n  y(1)  n 
rho.confint.label  rho.CI  character  n  n  y  y(1)  n 
tau.confint.label  tau.CI  character  n  n  n  y(1)  n 
r.confint.label  R.CI  character  n  n  n  y  n 
f.value.label  F  character  y  n  n  n  n 
t.value.label  t  character  n  n  n  y(1)  y 
z.value.label  z  character  n  n  n  y(1)  n 
S.value.label  S  character  n  n  n  y(1)  n 
p.value.label  P  character  y  y  n  y  y 
delta.label  delta  character  n  n  n  n  y (3) 
letters.label  delta  character  n  n  n  n  y (3) 
AIC.label  AIC  character  y  n  y  n  n 
BIC.label  BIC  character  y  n  n  n  n 
n.label  n  character  y  y  y  y  y 
grp.label  grp  character  y  y  y  y  y 
method.label  method  character  y  y  y  y  n 
r.squared  numeric  y  y  n  n  n  
adj.r.squared  numeric  y  n  n  n  n  
rho  numeric  n  n  y  y(1)  n  
tau  numeric  n  n  n  y(1)  n  
cor  numeric  n  n  n  y(1)  n  
p.value  numeric  y  y  n  y  y  
n  numeric  y  y  y  y  y  
rq.method  character  n  n  y  n  n  
method  character  y  y  y  y  n  
test  character  n  n  n  y  n  
quantile  numeric  n  n  y  n  n  
quantile.f  factor  n  n  y  n  n  
fm.method  character  y  y  y  y  y  
fm.class  character  y  y  y  y  y  
fm.formula.chr  character  y  y  y  y  y  
mc.adjusted  character  n  n  n  n  y  
mc.contrast  character  n  n  n  n  y  
x.left.tip  numeric  n  n  n  n  y (3)  
x.right.tip  numeric  n  n  n  n  y (3)  
x(2)  numeric  y  y  y  y  y  
y(2)  numeric  y  y  y  y  y  
npcx(2)  numeric  y  y  y  y  n  
npcy(2)  numeric  y  y  y  y  n 
Even though the number of significant digits and some other
formatting can be adjusted by passing arguments when calling the
statistics function, in some cases it maybe necessary for the user to do
the conversion of numeric values into character strings. For these
special cases, the statistics can return all values as
numeric
. The values returned for output.type
equal to numeric
are shown below in Table 3. The returned
data frame has one row per group of observations in each panel of the
plot.
Variable  Mode  poly_eq 
ma_eq 
quant_eq 
correlation 

coef.ls  list  y  y  y  n 
f.value  numeric  y  n  n  n 
f.df1  numeric  y  n  n  n 
f.df2  numeric  y  n  n  n 
t.value  numeric  n  n  n  y(1) 
df  numeric  n  n  n  y(1) 
z.value  numeric  n  n  n  y(1) 
S.value  numeric  n  n  n  y(1) 
p.value  numeric  y  y  n  y 
AIC  numeric  y  y  y  n 
BIC  numeric  y  y  n  n 
b_0.constant  numeric  y  y  y  n 
b_0  numeric  y  y  y  n 
b_1  numeric  y  y  y  n 
b_2…b_n  numeric  y  y  y  n 
In the cases of the statistics based on the methods from package
‘broom’ and its extensions, the returned value depends on that of the
method specialization that is automatically dispatched based on the
class of the model fit object returned by the method
function (Table 4). The expectation is that these statistics will be
used only in cases when those described in Tables 2 and 3 cannot be
used. Implementation based on generic methods from package ‘broom’ and
its extensions provides support for very many different model fit
procedures, but at the same time requires that the user consults the
documentation of these broom methods and that the user builds the
character strings for labels in his own code. The simplest way of
discovering what variables are returned is to use
gginnards::geom_debug()
to explore the returned data frame.
Examples are included in the help to some of the statistics and in a
later section of this User Guide.
Variable  Mode  fit_tb 
fit_tidy 
fit_glance 
fit_augment 

broom method  tidy() 
tidy() 
glance() 
augment() 

x(1)  numeric  y  y  y  y 
y(1)  numeric  y  y  y  y 
npcx(1)  numeric  y  y  y  n 
npcy(1)  numeric  y  y  y  n 
fm.tb  data.frame  y  n  n  n 
fm.tb.type  character  y  n  n  n 
fm.method  character  y  y  y  y 
fm.class  character  y  y  y  y 
fm.formula.chr  character  y  y  y  y 
Statistics stat_poly_line()
,
stat_ma_line()
, stat_quant_line()
and
stat_quant_band()
return objects similar to those returned
by ggplot2:stat_smooth()
.
Statistics stat_poly_line()
,
stat_ma_line()
, stat_quant_line()
and
stat_quant_band()
return numeric data suitable for plotting
lines or lines plus bands. They are similar to
ggplot2::stat_smooth()
in their use. They return a basic
set of variables, x
, y
, ymin,
ymax
and weights
which can be expanded by
passing fm.values = TRUE
. The expanded returned
data
contains in addition n
,
r.squared
, adj.r.squared
and
p.value
when available. These numeric variables make it
possible to conditionally hide or highlight using aesthetics, fitted
lines that fulfil conditions tested on these variables. The number of
rows per group
and panel
in the plot, i.e.,
number of points used to plot the smooth line, is that given by
parameter n
with default n = 80
. The user
interface and default arguments are consistent with those of statistics
stat_poly_eq()
, stat_ma_eq()
and
stat_quant_eq()
described above.
Statistics stat_fit_residuals()
and
stat_fit_deviations()
can be used to plot the residuals of
model fits on their own and to highlight them in a plot of the fitted
model curve, respectively.
Statistic stat_fit_tb()
can be used to add ANOVA or
summary tables as plots insets. While the statistics described in the
two previous subsections are useful when fitting curves when both
x
and y
are mapped to continuous
numeric
variables, stat_fit_tb()
is also
suitable for cases when x
or y
is a
factor
.
Which columns from the objects returned by R’s anova()
and summary()
methods as well as which terms from the model
formula
are included in the inset table can be selected by
the user. Column headers and term names can also be replaced by R
expressions. This statistics uses ggpp::geom_table()
, a
geometry that supports table formatting styles.
Package ‘broom’ provides a translation layer between various
model fit functions and a consistent naming and organization for the
returned values. This made it possible to design statistics
stat_fit_glance()
and stat_fit_augment()
that
provide similar capabilities as those described above and that work with
any of the many model fit functions supported by package ‘broom’ and its
extensions. This means that the returned data are in columns with
generic names. Consequently, when using stat_fit_glance()
text labels need to be created in user code and then mapped to the
label
aesthetic and when using
stat_fit_augment()
with complex fitted models know the
details of the model fitting functions used.
Statistics stat_peaks()
and stat_valleys()
can be used to locate and annotate global and local maxima and minima in
the variable mapped to the y
(or x
)
aesthetic. Date time variables mapped to the x
or
y
aesthetics are supported.
When plotting omics data we usually want to highlight
observations based on the outcome of tests of significance for each one
of 100’s or 1000’s genes or metabolites. We define some scales, as
wrappers on continuous x
and y
scales from
package ggplot2
that are suitable for 1) differences in
abundance expressed as fold changes, 2) Pvalue and 3) FDR.
Scales scale_x_logFC()
and scale_y_logFC()
are suitable for plotting of log fold change data. Scales
scale_x_Pvalue()
, scale_y_Pvalue()
,
scale_x_FDR()
and scale_y_FDR()
are suitable
for plotting pvalues and adjusted pvalues or false
discovery rate (FDR). Default arguments are suitable for volcano and
quadrant plots as used for transcriptomics, metabolomics and similar
data.
Encoding of test outcomes into binary and ternary colour scales is also frequent when plotting omics data, so as a convenience we define wrappers on the colour, fill and shape scales from ‘ggplot2’ as well as defined functions suitable for converting binary and ternary outcomes and numeric Pvalues into factors.
Scales scale_colour_outcome()
,
scale_fill_outcome()
and scale_shape_outcome()
and functions outome2factor()
,
threshold2factor()
, xy_outcomes2factor()
and
xy_thresholds2factor()
used together make it easy to map
ternary numeric outputs and logical binary outcomes to colour, fill and
shape aesthetics. Default arguments are suitable for volcano, quadrant
and other plots as used for genomics, metabolomics and similar data.
Package ‘ggpubr’ provides several “canned” functions that produce
whole plots as well as several statistics. Package ‘ggpmisc’ sticks to
the grammar of graphics and extends it and does not provide “canned”
functions to produce whole plots at once. Function
ggpubr::stat_regline_equation()
is a copy of
ggpmisc::stat_poly_eq()
from 2018 or earlier, renamed and
used without acknowledging the source. Since the version in ‘ggpmisc’
has been significantly improved with bug fixes and enhancements, I
suggest you use ggpmisc::stat_poly_eq()
.
Several simple use examples are given in the help for each of the layer and other functions exported by package ‘ggpmisc’. Here we give additional examples, mostly more advanced, together with brief explanations.
The statistic stat_correlation()
computes one of
Pearson, Kendall or Spearman correlation coefficients and tests if the
differ from zero. This is done by calling
stats::cor.test()
, with all of its formal parameters
supported. Depending on the argument passed to parameter
output.type
the values returned very. The default behaviour
is to generate labels as character strings using R’s expression
syntax, suitable to be parsed. This ggplot statistic can also
output labels using other markup languages, including \(\LaTeX\) and Markdown, or just numeric
values. Nevertheless, all examples below use expression syntax, which is
the most commonly used.
We first generate a set of artificial data suitable for the plotting examples in this and subsequent sections.
set.seed(4321)
x < (1:100) / 10
y < x + rnorm(length(x))
my.data < data.frame(x = x,
y = y,
y.desc =  y,
group = c("A", "B"))
For the first example we use defaults to add an annotation with Pearson’s correlation coefficient.
Grouping is supported.
We can also compute Spearman’s rank correlation. (The symbol used for it is the letter rho to distinguish it from Pearson’s correlation for which R or r are used as symbols.)
## Warning in compute_group(...): Skipping bootstrap estimation as 'boot.R' < 50
## Warning in compute_group(...): Skipping bootstrap estimation as 'boot.R' < 50
Statistic stat_correlation()
generates multiple labels
as listed in the tables above. We can combine them freely within a call
to aes()
to customize the annotations.
ggplot(my.data, aes(x, y, color = group)) +
geom_point() +
stat_correlation(mapping = use_label(c("R", "t", "P", "n")))
Facets are also supported.
Using the numeric values returned it is possible to set other aesthetics onthefly.
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_correlation(mapping = aes(color = ifelse(after_stat(cor) > 0.955,
"red", "black"))) +
scale_color_identity() +
facet_wrap(~group)
Above we use the value of the correlation coefficient, but other
numeric values such as p.value
can be used similarly. In
addition to colour, any aesthetics recognized by the geometry used such
as alpha
and face
can be also mapped. The
statistics described below for adding equations of the fitted models
have user interfaces very similar to that of
stat_correaltion()
so examples can be easily adapted.
Please, see the next section for examples of the use of different
geometries instead of the default one, and other advanced examples.
A frequently used annotation in plots showing fitted lines is the
display of the parameters estimates from the model fit as an equation.
stat_poly_eq()
automates this for models of y on
x or x on y fitted with function
lm()
, and MASS:rlm()
or a usersupplied model
fitting function and arguments. By default this statistic
behaves similarly as ggplot2::stat_smooth()
with
method = "lm"
for y on x fits but adds
support for model formulas of x on y (in addition to
the use of orientation = "y"
). Its default behaviour is to
generate labels as character strings using R’s expression
syntax. This statistic can also output equations using different markup
languages, including \(\LaTeX\) and
Markdown, or just numeric values selected by the argument passed to
parameter output.type
. Nevertheless, all examples below use
R’s expression syntax.
The statistic stat_poly_line()
adds a layer nearly
identical to that from ggplot2::stat_smooth()
but has the
same user interface and default arguments as
stat_poly_eq()
.
We first generate a set of artificial data suitable for the plotting examples.
set.seed(4321)
# generate artificial data
x < 1:100
y < (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
y < y / max(y)
my.data < data.frame(x,
y,
group = c("A", "B"),
y2 = y * c(1, 2) + c(0, 0.2),
block = c("a", "a", "b", "b"),
wt = sqrt(x))
First, one example using defaults except for the model
formula
. The best practice, ensuring that the formula used
in both statistics is the same is to assign the formula to a
variable, as shown here. This is because the same model is fit twice to
the same data
, once in stat_poly_line()
and
once in stat_poly_eq()
.
We can write the model formula for a polynomial explicitly or using
function poly()
. When using function poly()
it
is in most cases necessary to set raw = TRUE
as the default
is to use orthogonal polynomials which although fitting the same line
will have different estimates for the coefficients than those in the
usual raw polynomials.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula)
A ready formatted equation is also returned as a string that needs to
be parsed into an expression for display. Function
use_label()
can be used to select or assemble the
expression and map it to the label
aesthetic. The default
separator is suitable for labels to be parsed into R expressions, but
this default can be overridden by passing an argument to parameter
sep
. In this example we select the fitted model equation
using key "eq"
as argument to use_label()
.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label("eq"), formula = formula)
The mapping can also be done with ggplot2::aes()
using
the name of the variable in the returned data
containing
the desired label and enclosing it in ggplot2::after_stat()
to signal that it is a variable returned by the statistic.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = aes(label = after_stat(eq.label)),
formula = formula)
stat_poly_eq()
makes available several character strings
in the returned data frame in separate columns: eq.label
,
rr.label
, adj.rr.label
,
AIC.label
, BIC.label
,
f.value.label
, p.value.label
and
n
. One of these, rr.label
, is used by default,
but use_label()
and aes()
together with
after_stat()
can be used to map a different one to the
label
aesthetic. Here we show the adjusted coefficient of
determination, \(R^2_\mathrm{adj}\),
Akaike’s information criterion, AIC. We call stat_poly_eq()
twice to be able to separately control the position and size of each
label. As parameter mapping
is the first one in the
statistics, we can pass the mapping by position (without writing
mapping =
).
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label("adj.R2"), formula = formula) +
stat_poly_eq(mapping = use_label("AIC"), label.x = "right", label.y = "bottom", size = 3,
formula = formula)
Within ggplot2::aes()
it is also possible to
compute new labels based on those returned plus “arbitrary”
text. The labels returned by default are meant to be parsed
into expressions, so any text added should be valid for a string that
will be parsed. Inserting a comma plus white space in an expression
requires some trickery in the argument passed to parameter
sep
of paste()
, or use_label()
.
Do note the need to escape the embedded quotation marks as
\"
. Combining the labels in this way ensures correct
alignment. To insert only white space sep = "~~~~~"
can be
used, with each tilde character, ~
, adding a rather small
amount of white space. We show here how to change the axis labels to
ensure consistency with the typesetting of the equation.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label(c("eq", "adj.R2")),
formula = formula) +
labs(x = expression(italic(x)), y = expression(italic(y)))
Above we combined two characterstring labels, but it is possible to
add additional ones and even character strings constants. In the next
example we use several labels instead of just two, and separate them
with various character strings. In this case we need to use
paste()
and ggplot2::aes()
as
use_label()
uses a single value for sep
. We
also change the size of the text in the label.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = paste(after_stat(eq.label), "*\" with \"*",
after_stat(rr.label), "*\", \"*",
after_stat(f.value.label), "*\", and \"*",
after_stat(p.value.label), "*\".\"",
sep = "")),
formula = formula, size = 3)
It is also possible to format the text using plain()
,
italic()
, bold()
or bolditalic()
an other commands as described in the help entry for
plotmath.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label(c("eq", "adj.R2"), sep = "~~italic(\"with\")~~"),
formula = formula)
As these are expressions, to include two lines of text, we need
either to add stat_poly_eq()
twice, passing an argument to
label.y
to reposition one of the labels (as shown above) or
use (as shown here) atop()
within the expression to create
a single label with two lines of text.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = paste("atop(", after_stat(AIC.label), ",", after_stat(BIC.label), ")", sep = "")),
formula = formula)
Next, one example of how to remove the left hand side (lhs).
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label("eq"),
eq.with.lhs = FALSE,
formula = formula)
Replacing the lhs.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label("eq"),
eq.with.lhs = "italic(hat(y))~`=`~",
formula = formula)
Replacing both the lhs and the variable symbol used on the rhs.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label(c("eq", "R2")),
eq.with.lhs = "italic(h)~`=`~",
eq.x.rhs = "~italic(z)",
formula = formula) +
labs(x = expression(italic(z)), y = expression(italic(h)))
As any valid R expression can be used, Greek letters are also supported, as well as the inclusion in the label of variable transformations used in the model formula or applied to the data. In addition, in the next example we insert white space in between the parameter estimates and the variable symbol in the equation.
formula < y ~ poly(x, 2, raw = TRUE)
ggplot(my.data, aes(x, log10(y + 1e6))) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label("eq"),
eq.with.lhs = "plain(log)[10](italic(delta)+10^6)~`=`~",
eq.x.rhs = "~Omega",
formula = formula) +
labs(y = expression(plain(log)[10](italic(delta)+10^6)), x = expression(Omega))
Examples of polynomials of different orders, and fits through the
origin. A degree 1 polynomial is linear regression with
formula = y ~ x
, which is the default. Linear regression
through the origin is supported by passing
formula = y ~ x  1
, or formula = y ~ x + 0
.
High order polynomials are most easily requested using
poly()
in the model formula, as shown next.
formula < y ~ poly(x, 5, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(mapping = use_label("eq"), formula = formula)
Intercept forced to zero for a third degree polynomial defined
explicitly, as this case is not supported by poly()
.
formula < y ~ x + I(x^2) + I(x^3)  1
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(eq.label)), formula = formula)
Even when labels are returned, numeric values are also returned for
r.squared
, adj.r.squared
, p.value
and n
. This can be used in ggplot2::aes()
for
example to show the equation only if a certain condition like a minimum
value for r.square
or p.value
has been
reached.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = ifelse(after_stat(adj.r.squared > 0.3),
paste(after_stat(eq.label), after_stat(adj.rr.label),
sep = "*\", \"*"),
after_stat(adj.rr.label))),
formula = formula) +
labs(x = expression(italic(x)), y = expression(italic(y)))
Fitting different models to different panels is supported.
Userdefined method functions are required to return an object that
inherits from class "lm"
. The function is applied per group
and panel, and the model formula
fitted can differ among
them, as it is retrieved from this object to construct the equation
label.
In the example below instead of using stats::lm()
as
method, we define a wrapper function that tests for the significance of
the slope in linear regression, and if not significant, fits the mean
instead.
poly_or_mean < function(formula, data, ...) {
fm < lm(formula = formula, data = data, ...)
if (anova(fm)[["Pr(>F)"]][1] > 0.1) {
lm(formula = y ~ 1, data = data, ...)
} else {
fm
}
}
We then use our function poly_or_mean()
as the
method.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = "poly_or_mean") +
stat_poly_eq(method = poly_or_mean,
aes(label = after_stat(eq.label)),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
We give below several examples to demonstrate how other components of
the ggplot
object affect the behaviour of this
statistic.
Facets work as expected both with fixed or free scales. Although below we had to adjust the size of the font used for the equation.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(eq.label)), size = 3,
formula = formula) +
facet_wrap(~group)
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(eq.label)), size = 3,
formula = formula) +
facet_wrap(~group, scales = "free_y")
Grouping works as expected. In this example we create groups
using the colour aesthetic, but factors or discrete variables mapped to
other aesthetics including the “invisible” group
aesthetic.
We can use justification and supply an absolute location for the
equation, but the default frequently works well as in the example
below.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(eq.label)), formula = formula, vstep = 0.06)
To add a label to the equation for each group, we map it to a
pseudoaesthetic named grp.label
.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group, grp.label = group)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(paste("bold(", grp.label, "*\":\")~~",
eq.label, sep = ""))),
formula = formula)
Being able to label equations allows us to dispense with the use of colour, which in some cases is needed for publication in journals and books.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, linetype = group, grp.label = group)) +
geom_point() +
stat_poly_line(formula = formula, color = "black") +
stat_poly_eq(aes(label = after_stat(paste("bold(", grp.label, "*':')~~~",
eq.label, sep = ""))),
formula = formula)
User supplied label positions relative to the ranges of the
x and y scales are also supported, both through string
constants and numeric values in the range 0 to 1, when using the default
ggpp::geom_text_npc()
.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(eq.label)),
formula = formula,
label.x = "centre")
The default locations are based on normalized coordinates, and consequently these defaults work even when the range of the x and y scales varies from panel to panel.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, fill = block)) +
geom_point(shape = 21, size = 3) +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(rr.label)), size = 3,
geom = "label_npc", alpha = 0.33,
formula = formula) +
facet_wrap(~group, scales = "free_y")
If grouping is not the same within each panel created by faceting,
i.e., not all groups have data in all panels, the automatic location of
labels results in “holes” for the factor levels not present in a given
panel. This behaviour ensures consistent positioning of labels for the
same groups across panels. If positioning that leaves no gaps is
desired, it is possible to explicitly pass as argument the positions for
each individual label. In the plot below the simultaneous mapping to
both colour and fill aesthetics creates four groups, two with data in
panel A and the other two in panel B, but the vector passed as argument
to label.y
determines the y
position (in
npc
units) four each of the four labels.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group, fill = block)) +
geom_point(shape = 21, size = 3) +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(rr.label)), size = 3, alpha = 0.2,
geom = "label_npc", label.y = c(0.95, 0.85, 0.95, 0.85),
formula = formula) +
facet_wrap(~group, scales = "free_y")
It is possible to use ggplot2::geom_text()
and
ggplot2::geom_label()
instead of the default
ggpp::geom_text_npc()
(or
ggpp::geom_label_npc()
) but in this case, if label
coordinates are given explicitly they should be expressed in native data
coordinates. When multiple labels need to be positioned a vector of
coordinates can be used as shown here for label.x
and
label.y
.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(geom = "text", aes(label = after_stat(eq.label)),
label.x = c(100, 20), label.y = c(0.1, 2.1), hjust = "inward",
formula = formula)
Note: Automatic positioning using
ggplot2::geom_text()
and ggplot2::geom_label()
is supported except when faceting uses free scales. In this case
ggpp::geom_text_npc()
and/or
ggpp::geom_label_npc()
or positions supplied by the user
must be used.
Occasionally, possibly as a demonstration in teaching, we may want to compare a fit of x on y and one of y on x in the same plot. Here, this example also serves as introduction to the next section, which describes the use of major axis regression or Model II regression.
As is the case for model fitting functions themselves, which variable
is dependent (response) and which is independent (explanatory) is simply
indicated by the model equation passed as argument to
formula
, or consistently with
ggplot2::stat_smooth()
through parameter
orientation
as show below.
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = y ~ x, color = "blue") +
stat_poly_eq(mapping = use_label(c("R2", "eq")), color = "blue") +
stat_poly_line(formula = y ~ x, color = "red", orientation = "y") +
stat_poly_eq(mapping = use_label(c("R2", "eq")), color = "red", orientation = "y",
label.y = 0.9)
Package ‘lmodel2’ implements Model II fits for major axis (MA), standard major axis (SMA) and ranged major axis (RMA) regression for linear relationships. These approaches of regression are used when both variables are subject to random variation and the intention is not to estimate y from x but instead to describe the slope of the relationship between them. A typical example in biology are allometric relationships. Rather frequently, OLS linear model fits are used in cases where Model II regression would be preferable.
In the figure above we see that the ordinary least squares (OLS) linear regression of y on x and x on y differ. Major axis regression provides an estimate of the slope that is independent of the direction of the relationship relation. \(R^2\) remains the same in all cases, and with major axis regression while the coefficient estimates in the equation change, they correspond to exactly the same line.
Package ‘quantreg’ implements quantile regression for multiple types
of models: linear models with function quantreg::rq()
,
nonlinear models with function quantreg::nlreg()
, and
additive models with function quantreg::rqss()
. Each of
these functions supports different methods of estimation which can be
passed through parameter method.args
or more easily using a
character string such as "rq:br"
argument to parameter
method
of the statistic, where quantreg::rq()
is the model fit function and "br"
the argument passed to
the method
parameter of quantreg::rq()
. As
with the statistics for polynomials described in the previous section,
all the statistics described in the current section support linear
models while stat_quant_line()
and
stat_quant_band()
also support additive models. (Support
for quantreg::nlreg()
is not yet implemented.)
In the case of quantile regression it is frequently the case that
multiple quantile regressions are fitted. This case is not supported by
stat_poly_eq()
and we need to use
stat_quant_eq()
. This statistics is very similar to
stat_poly_eq()
both in behaviour and parameters. It
supports quantile regression of y on x or x
on y fitted with function quantreg::rq
.
The statistic stat_quant_line()
adds a layer similar to
that created by ggplot2::stat_smooth()
including confidence
bands for quantile regression, by default for p = 0.95.
However, it has the same user interface and default arguments as
stat_quant_eq()
, and supports both x and
y as explanatory variable in formula
and also
orientation
. Confidence intervals for the fits to
individual quantiles fulfil the same purpose as in OLS fits: providing a
boundary for the estimated line computed for each individual
quantile. They are not shown by default, unless a single quantile is
passed as argument, but can be enabled with se = TRUE
and
disabled with se = FALSE
.
The statistic stat_quant_band()
creates a line plus band
plot that is, by default equivalent to the box of a box plot showing
regressions as a band for quartiles plus a line for the median
regression. In this case, the band informs about the spread of the
observations in the same way as the box in box plots does.
Which three quantiles are fitted can be set through parameter
quantiles
but in contrast to stat_quant_line()
and stat_quant_eq()
only a vector of two or three quantiles
is accepted (with two quantiles the line is omitted and only a band
displayed).
Similarly to ggplot2::stat_quantiles()
,
stat_quant_eq()
and stat_quant_line()
accept a
numeric
vector as argument to quantiles
and
output one equation or one line per group and quantile. Statistic
stat_quant_eq()
also returns column grp.label
containing by default a character string indicating the value of the
quantile, and if a variable is mapped to the pseudo aesthetic
grp.label
its value is prepended to the values of the
quantiles.
Here we use the default value for quantiles,
c(0.25, 0.50, 0.75)
, equivalent to those used for the box
of box plots, and we obtain a plot with a line for median regression and
a band highlighting the space between the quartiles.
Overriding the values for quantiles
we find a boundary
containing 90% of the observations.
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_quant_line(formula = y ~ poly(x, 2), quantiles = c(0.05, 0.95))
A line for the quantile regression plus a confidence band for it is
the default for a single quantile. (This is similar to the behaviour of
ggplot2::stat_smooth()
.)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_quant_line(formula = y ~ poly(x, 2), quantiles = 0.5)
With stat_quant_eq()
we add the fitted equations to the
first example in this section (we also override the use of colour).
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_quant_band(formula = formula, color = "black", fill = "grey60") +
stat_quant_eq(aes(label = paste(after_stat(grp.label), "*\": \"*",
after_stat(eq.label), sep = "")),
formula = formula) +
theme_classic()
Grouping and faceting are supported by stat_quant_eq()
,
stat_quant_line()
and stat_quant_band()
. Below
are some examples of grouping.
ggplot(my.data, aes(x, y, color = group)) +
geom_point() +
stat_quant_line(formula = formula) +
stat_quant_eq(aes(label = paste(after_stat(grp.label), "*\": \"*",
after_stat(eq.label), sep = "")),
formula = formula)
Grouping can be combined with a group label by mapping in
aes()
a character variable to grp.label
.
ggplot(my.data, aes(x, y, group = group, linetype = group,
shape = group, grp.label = group)) +
geom_point() +
stat_quant_line(formula = formula, quantiles = c(0.05, 0.95), color = "black") +
stat_quant_eq(aes(label = paste(after_stat(grp.label), "*\": \"*",
after_stat(eq.label), sep = "")),
formula = formula, quantiles = c(0.05, 0.95)) +
theme_classic()
In some cases double quantile regression is an informative
method to assess reciprocal constraints. For this a fit of x on
y and one of y on x in the same plot are
computed for a large quantile. As is the case for model fitting
functions themselves this is simply indicated by the model passed as
argument to formula
, or consistently with
ggplot2::stat_smooth()
through parameter
orientation
.
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_quant_line(formula = y ~ x, color = "blue", quantiles = 0.05) +
stat_quant_eq(mapping = use_label("eq"), formula = y ~ x, color = "blue",
quantiles = 0.05) +
stat_quant_line(formula = x ~ y, color = "red", quantiles = 0.95) +
stat_quant_eq(mapping = use_label("eq"), formula = x ~ y, color = "red",
quantiles = 0.95, label.y = 0.9)
Please, see the section on stat_poly_eq()
for additional
examples, as here we have highlighted mainly the differences between
these two statistics.
This section is still preliminary. Multiple comparisons are
frequently used when the effect of different levels of a factor are
individually of interest. Function stat_multcomp()
computes
adjusted Pvalues for pairwise comparison using Tukey or Dunnet
contrasts using function glht()
from package ‘multcomp’. To
complement the examples in the documentation of the function, we show
below examples that override several defaults.
Alternatives to the default “tag” values used to describe the
adjustment method can be obtained with arguments passed to parameter
adj.method.tag
. In the first example the abbreviation is
three characters long, instead of the default of four.
# position of contrasts' bars (manual)
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(p.adjust.method = "bonferroni",
adj.method.tag = 3,
size = 2.75) +
expand_limits(y = 0)
Here we use a negative value to use an abbreviation of the word “adjusted” to three characters.
# position of contrasts' bars (manual)
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(p.adjust.method = "bonferroni",
adj.method.tag = 3,
size = 2.75) +
expand_limits(y = 0)
A character string passed as argument is used as is, here to set the tag in Spanish.
# position of contrasts' bars (manual)
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(adj.method.tag = "ajustada",
size = 2.75) +
expand_limits(y = 0)
A numeric vector passed to label.y
can be used to
manually set the location of each label along the y axis.
# position of contrasts' bars (manual)
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(label.y = c(7, 4, 1),
contrasts = "Dunnet",
size = 2.75) +
expand_limits(y = 0)
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(label.y =
seq(from = 15,
by = 3,
length.out = 6),
size = 2.5) +
expand_limits(y = 0)
We can precompute the numeric vector to achieve special positioning, in this case, next to each observation.
means <
aggregate(mpg$hwy,
by = list(mpg$cyl),
FUN = mean,
na.rm = TRUE)[["x"]]
ggplot(mpg, aes(factor(cyl), hwy)) +
stat_summary(fun.data = mean_se) +
stat_multcomp(label.type = "letters",
label.y = c(18, means), # 18 is for critical P label
position = position_nudge(x = 0.1))
We can override the default use of geom_text()
and also
remove the P critical label.
# Using other geometries
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(label.type = "letters",
adj.method.tag = FALSE,
geom = "label")
With Dunnet contrasts, the use of bars can clutter a plot, and we can alternatively show the outcome for each level that has been compared to a control, the first level.
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(aes(x = stage(start = factor(cyl),
after_stat = x.right.tip)),
geom = "text",
label.y = "bottom",
vstep = 0,
contrasts = "Dunnet")
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(aes(x = stage(start = factor(cyl),
after_stat = x.right.tip),
label = after_stat(stars.label)),
geom = "text",
label.y = "bottom",
vstep = 0,
contrasts = "Dunnet")
The returned value includes numeric values in addition to character
strings mapped to the label
aesthetic. The numeric values
can be used to encode the outcomes using additional or different
aesthetics than the default.
# use colour to show significance
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(aes(colour = after_stat(p.value) < 0.01),
size = 2.75) +
scale_colour_manual(values = c("grey60", "black")) +
theme_bw()
# add arrow heads to segments and use fill to show significance
ggplot(mpg, aes(factor(cyl), hwy)) +
geom_boxplot(width = 0.33) +
stat_multcomp(aes(fill = after_stat(p.value) < 0.01),
size = 2.5,
arrow = grid::arrow(angle = 45,
length = unit(1, "mm"),
ends = "both")) +
scale_fill_manual(values = c("white", "lightblue"))
Sometimes we need to quickly plot residuals matching fits plotted
with stat_poly_line()
, stat_quant_line()
,
stat_ma_line()
or ggplot2::stat_smooth()
(e.g., using grouping and facets). Function
stat_fit_residuals()
makes this easy. In teaching it is
specially important to avoid distractions and plots residuals from
different types of models consistently.
stat_fit_residuals()
can be used to plot the residuals
from models of y on x or x on y
fitted with function lm
, MASS:rlm
,
quantreg::rq
(only median regression) or a usersupplied
model fitting function and arguments. In the data frame returned by the
statistic the response variable, y or x, is replaced
by the residuals from the fit and thus also by default mapped to the
aesthetic of the same name. The default geom is
"point"
.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y, colour = group)) +
geom_hline(yintercept = 0, linetype = "dashed") +
stat_fit_residuals(formula = formula)
We can of course also map the weights returned in the model fit
object to ggplot2 aesthetics, here we use size
.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y, colour = group)) +
geom_hline(yintercept = 0, linetype = "dashed") +
stat_fit_residuals(formula = formula,
method = "rlm",
mapping = aes(size = sqrt(after_stat(weights))),
alpha = 2/3)
Weighted residuals are available, but in this case they do not differ
as we have not mapped any variable to the weight
aesthetic
in the input to the statistic.
When teaching we may wish to highlight residuals in plots used in
slides and notes. Statistic stat_fit_deviations()
makes it
easy to achieve this for the residuals from models of y on
x or x on y fitted with function
lm
, MASS:rlm
, quantreg::rq
(only
median regression) or a usersupplied model fitting function and
arguments. The observed values are mapped to aesthetics x and
y, and the fitted values are mapped to aesthetics xend
and yend. As the default geom is "segment"
, each
deviation is displayed as a segment.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_smooth(method = "lm", formula = formula) +
stat_fit_deviations(formula = formula, colour = "red") +
geom_point()
The geometry used by default is ggplot2::geom_segment()
and additional aesthetics can be mapped or set to constant values. Here
we add arrowheads.
formula < y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
geom_smooth(method = "lm", formula = formula) +
geom_point() +
stat_fit_deviations(formula = formula, colour = "red",
arrow = arrow(length = unit(0.015, "npc"),
ends = "both"))
When weights are available, either supplied by the user, or computed
as part of the fit, they are returned in data
. Having
weights available allows encoding them using colour. We here use a
robust regression fit with MASS::rlm()
.
my.data.outlier < my.data
my.data.outlier[6, "y"] < my.data.outlier[6, "y"] * 5
ggplot(my.data.outlier, aes(x, y)) +
stat_smooth(method = MASS::rlm, formula = formula) +
stat_fit_deviations(formula = formula, method = "rlm",
mapping = aes(colour = after_stat(weights)),
show.legend = TRUE) +
scale_color_gradient(low = "red", high = "blue", limits = c(0, 1)) +
geom_point()
Package ‘broom’ provides consistently formatted output from different
model fitting functions. These made it possible to write a
modelannotation statistic that is very flexible but that requires
additional input from the user to generate the character strings to be
mapped to the label
aesthetic.
As we have above given some simple examples, we here exemplify this statistic in a plot with grouping, and assemble a label for the Pvalue using a string parsed into a expression. We also change the default position of the labels.
# formula < y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly() correctly!
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_glance(method = "lm",
method.args = list(formula = formula),
label.x = "right",
label.y = "bottom",
aes(label = sprintf("italic(P)*\"value = \"*%.3g",
after_stat(p.value))),
parse = TRUE)
It is also possible to fit a nonlinear model with
method = "nls"
, and any other model for which a
broom::glance()
method exists. Do consult the documentation
for package ‘broom’. Here we fit the MichaelisMenten equation to
reaction rate versus concentration data.
micmen.formula < y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
geom_point() +
geom_smooth(method = "nls",
formula = micmen.formula,
se = FALSE) +
stat_fit_glance(method = "nls",
method.args = list(formula = micmen.formula),
aes(label = paste("AIC = ", signif(after_stat(AIC), digits = 3),
", BIC = ", signif(after_stat(BIC), digits = 3),
sep = "")),
label.x = "centre", label.y = "bottom")
This ggplot statistic makes it possible to add summary or ANOVA
tables for any fitted model for which broom::tidy()
is
implemented. The output from broom::tidy()
is embedded as a
single list value within the returned data
, as an object of
class tibble::tibble
.
This statistic ignores grouping based on aesthetics,
i.e., uses a panel function to fit the model. Fitting the model
by panel allows fitting models when x
or y
is
a factor
(as in such cases ggplot
splits the
data into groups, one for each level of the factor, which is needed for
example for stat_summary()
to work as expected). By
default, the "table"
geometry is used. The use of
ggpp::geom_table()
is described in the User Guide of
package ‘ggpp’. Table themes and mapped aesthetics are supported.
The default output of stat_fit_tb()
is the default
output from tidy(fm)
where fm
is the fitted
model.
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_tb(method = "lm",
method.args = list(formula = formula),
tb.vars = c(Parameter = "term",
Estimate = "estimate",
"s.e." = "std.error",
"italic(t)" = "statistic",
"italic(P)" = "p.value"),
label.y = "top", label.x = "left",
parse = TRUE)
When tb.type = "fit.anova"
the output returned is that
from tidy(anova(fm))
where fm
is the fitted
model. Here we also show how to replace names of columns and terms, and
exclude one column, in this case, the mean squares.
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_tb(method = "lm",
method.args = list(formula = formula),
tb.type = "fit.anova",
tb.vars = c(Effect = "term",
df = "df",
"italic(F)" = "statistic",
"italic(P)" = "p.value"),
tb.params = c(x = 1, "x^2" = 2, "x^3" = 3, Resid = 4),
label.y = "top", label.x = "left",
parse = TRUE)
When tb.type = "fit.coefs"
the output returned is that
of tidy(fm)
after selecting the term
and
estimate
columns.
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_tb(method = "lm",
method.args = list(formula = formula),
tb.type = "fit.coefs", parse = TRUE,
label.y = "center", label.x = "left")
Faceting works as expected, but grouping is ignored as mentioned
above. In this case, the colour aesthetic is not applied to the text of
the tables. Furthermore, if label.x.npc
or
label.y.npc
are passed numeric vectors of length > 1,
the corresponding values are obeyed by the different panels.
micmen.formula < y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
facet_wrap(~state) +
geom_point() +
geom_smooth(method = "nls",
formula = micmen.formula,
se = FALSE) +
stat_fit_tb(method = "nls",
method.args = list(formula = micmen.formula),
tb.type = "fit.coefs",
label.x = 0.9,
label.y = c(0.75, 0.2)) +
theme(legend.position = "none") +
labs(x = "C", y = "V")
The data in the example below are split by ggplot into six groups
based on the levels of the feed
factor. However, as
stat_fit_tb()
ignores groupings, we can still fit a linear
model to all the data in the panel.
ggplot(chickwts, aes(factor(feed), weight)) +
stat_summary(fun.data = "mean_se") +
stat_fit_tb(tb.type = "fit.anova",
label.x = "center",
label.y = "bottom") +
expand_limits(y = 0)
We can flip the system of coordinates, if desired.
ggplot(chickwts, aes(factor(feed), weight)) +
stat_summary(fun.data = "mean_se") +
stat_fit_tb(tb.type = "fit.anova", label.x = "left", size = 3) +
scale_x_discrete(expand = expansion(mult = c(0.2, 0.5))) +
coord_flip()
It is also possible to rotate the table using angle
.
Here we also show how to replace the column headers with strings to be
parsed into R expressions.
ggplot(chickwts, aes(factor(feed), weight)) +
stat_summary(fun.data = "mean_se") +
stat_fit_tb(tb.type = "fit.anova",
angle = 90, size = 3,
label.x = "right", label.y = "center",
hjust = 0.5, vjust = 0,
tb.vars = c(Effect = "term",
"df",
"M.S." = "meansq",
"italic(F)" = "statistic",
"italic(P)" = "p.value"),
parse = TRUE) +
scale_x_discrete(expand = expansion(mult = c(0.1, 0.35))) +
expand_limits(y = 0)
This stat makes it possible to add the equation for any fitted model
for which broom::tidy()
is implemented. Alternatively,
individual values such as estimates for the fitted parameters, standard
errors, or Pvalues can be added to a plot. However, the user
has to explicitly construct the labels within
ggplot2::aes()
. This statistic respects grouping based on
aesthetics, and reshapes the output of broom::tidy()
so
that the values for a given group are in a single row in the returned
data
.
As first example we fit a nonlinear model, the MichaelisMenten
equation of reaction kinetics, using stats::nls()
. We use
the selfstarting function stats::SSmicmen()
available in
R.
micmen.formula < y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
geom_point() +
geom_smooth(method = "nls",
formula = micmen.formula,
se = FALSE) +
stat_fit_tidy(method = "nls",
method.args = list(formula = micmen.formula),
label.x = "right",
label.y = "bottom",
aes(label = paste("V[m]~`=`~", signif(after_stat(Vm_estimate), digits = 3),
"%+%", signif(after_stat(Vm_se), digits = 2),
"~~~~K~`=`~", signif(after_stat(K_estimate), digits = 3),
"%+%", signif(after_stat(K_se), digits = 2),
sep = "")),
parse = TRUE)
Using paste we can build a string that can be parsed into an R expression, in this case for a nonlinear equation.
micmen.formula < y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
geom_point() +
geom_smooth(method = "nls",
formula = micmen.formula,
se = FALSE) +
stat_fit_tidy(method = "nls",
method.args = list(formula = micmen.formula),
size = 3,
label.x = "center",
label.y = "bottom",
vstep = 0.12,
aes(label = paste("V~`=`~frac(", signif(after_stat(Vm_estimate), digits = 2), "~C,",
signif(after_stat(K_estimate), digits = 2), "+C)",
sep = "")),
parse = TRUE) +
labs(x = "C", y = "V")
What if we would need a more specific statistic, similar to
stat_poly_eq()
? We can use stat_fit_tidy()
as
the basis for its definition.
stat_micmen_eq < function(vstep = 0.12,
size = 3,
...) {
stat_fit_tidy(method = "nls",
method.args = list(formula = micmen.formula),
aes(label = paste("V~`=`~frac(", signif(after_stat(Vm_estimate), digits = 2), "~C,",
signif(after_stat(K_estimate), digits = 2), "+C)",
sep = "")),
parse = TRUE,
vstep = vstep,
size = size,
...)
}
The code for the figure is now simpler, and still produces the same figure (not shown).
micmen.formula < y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
geom_point() +
geom_smooth(method = "nls",
formula = micmen.formula,
se = FALSE) +
stat_micmen_eq(label.x = "center",
label.y = "bottom") +
labs(x = "C", y = "V")
quantreg::rq()
from package ‘quantreg’.my_formula < y ~ x
ggplot(mpg, aes(displ, 1 / hwy)) +
geom_point() +
geom_quantile(quantiles = 0.5, formula = my_formula) +
stat_fit_tidy(method = "rq",
method.args = list(formula = y ~ x, tau = 0.5),
tidy.args = list(se.type = "nid"),
mapping = aes(label = sprintf('y~"="~%.3g+%.3g~x*", with "*italic(P)~"="~%.3f',
after_stat(Intercept_estimate),
after_stat(x_estimate),
after_stat(x_p.value))),
parse = TRUE)
We can define a stat_rq_eq()
if we need to add similar
equations to several plots. In this example we retain the ability of the
user to override most of the default default arguments.
stat_rq_eqn <
function(formula = y ~ x,
tau = 0.5,
method = "br",
mapping = aes(label = sprintf('y~"="~%.3g+%.3g~x*", with "*italic(P)~"="~%.3f',
after_stat(Intercept_estimate),
after_stat(x_estimate),
after_stat(x_p.value))),
parse = TRUE,
...) {
method.args < list(formula = formula, tau = tau, method = method)
stat_fit_tidy(method = "rq",
method.args = method.args,
tidy.args = list(se.type = "nid"),
mapping = mapping,
parse = parse,
...)
}
And the code of the figure now as simple as. Figure not shown, as is identical to the one above.
Use ggplot2::stat_smooth
, stat_poly_line()
,
stat_ma_line()
, stat_quant_line()
,
stat_deviations()
or stat_residuals()
instead
of stat_fit_augment
if possible.
When statistics specific to the type of model fitted are available,
their use is in general simpler that the use of
stat_fit_augment()
. On the other hand,
stat_fit_augment()
can handle any fitted model that is
supported by package ‘broom’ or its extensions. All these statistics
support grouping and facets.
# formula < y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly correctly!
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_fit_augment(method = "lm",
method.args = list(formula = formula))
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
geom_point() +
stat_fit_augment(method = "lm",
method.args = list(formula = formula))
We can override the variable returned as y
to be any of
the variables in the data frame returned by
broom::augment()
while still preserving the original
y
values.
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
stat_fit_augment(method = "lm",
method.args = list(formula = formula),
geom = "point",
y.out = ".resid")
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
stat_fit_augment(method = "lm",
method.args = list(formula = formula),
geom = "point",
y.out = ".std.resid")
We can use any model fitting method for which
broom::augment()
is implemented.
args < list(formula = y ~ k * e ^ x,
start = list(k = 1, e = 2))
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
stat_fit_augment(method = "nls",
method.args = args)
args < list(formula = y ~ k * e ^ x,
start = list(k = 1, e = 2))
ggplot(mtcars, aes(wt, mpg)) +
stat_fit_augment(method = "nls",
method.args = args,
geom = "point",
y.out = ".resid")
Note: The tidiers for mixed models have moved to package ‘broom.mixed’!
args < list(model = y ~ SSlogis(x, Asym, xmid, scal),
fixed = Asym + xmid + scal ~1,
random = Asym ~1  group,
start = c(Asym = 200, xmid = 725, scal = 350))
ggplot(Orange, aes(age, circumference, colour = Tree)) +
geom_point() +
stat_fit_augment(method = "nlme",
method.args = args,
augment.args = list(data = quote(data)))
When using stat_fit_glance()
,
stat_fit_tidy()
, stat_fit_augment()
and
stat_fit_tb()
the documentation cannot describe the
variables present in the returned data
because they call
methods using their generic interface and the exact especialization used
depends on the argument passed to method, which can be even one that I
have not been aware of and never tested. Thanks to package ‘gginnards’
and its gginnards::geom_debug()
we can find this out with a
temporary change to the plot code.
gginnards::geom_debug()
uses by default
head()
to echo to the R console the data it receives as
input. It ignores the aesthetic mappings so it can be used with any
statistics or directly as a layer function. However, unrecognised
aesthetics can trigger warnings during plot rendering. These warnings
can be ignored in most cases.
# formula < y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly() correctly!
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, colour = group)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_glance(geom = "debug",
method = "lm",
method.args = list(formula = formula),
label.x = "right",
label.y = "bottom",
aes(label = sprintf("italic(P)*\"value = \"*%.3g",
after_stat(p.value))),
parse = TRUE)
## Warning in stat_fit_glance(geom = "debug", method = "lm", method.args =
## list(formula = formula), : Ignoring unknown parameters: `parse`
## [1] "PANEL 1; group(s) 1, 2; 'draw_function()' input 'data' (head):"
## colour npcx npcy label r.squared adj.r.squared
## 1 #F8766D NA NA italic(P)*"value = "*1.24e32 0.9619032 0.9594187
## 2 #00BFC4 NA NA italic(P)*"value = "*1.73e33 0.9650270 0.9627461
## sigma statistic p.value df logLik AIC BIC deviance
## 1 0.05391446 387.1505 1.237801e32 3 77.15544 144.3109 134.7508 0.1337114
## 2 0.05521918 423.0996 1.732868e33 3 75.95986 141.9197 132.3596 0.1402613
## df.residual nobs fm.class fm.method fm.formula
## 1 46 50 lm lm y ~ x + I(x^2) + I(x^3)
## 2 46 50 lm lm y ~ x + I(x^2) + I(x^3)
## fm.formula.chr x y PANEL group
## 1 y ~ x + I(x^2) + I(x^3) 95.05 0.03289940 1 1
## 2 y ~ x + I(x^2) + I(x^3) 95.05 0.04864529 1 2
In the case of stat_fit_tb()
the computed variables
returned include a list column containing embedded data frames. In this
case, function str()
is more informative than
head()
, so we can pass it as argument overriding the
default.
formula < y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_tb(geom = "debug",
summary.fun = str,
method = "lm",
method.args = list(formula = formula),
tb.vars = c(Parameter = "term",
Estimate = "estimate",
"s.e." = "std.error",
"italic(t)" = "statistic",
"italic(P)" = "p.value"),
label.y = "top", label.x = "left",
parse = TRUE)
## Warning in stat_fit_tb(geom = "debug", summary.fun = str, method = "lm", : Ignoring unknown parameters: `table.theme`, `table.rownames`, `table.colnames`,
## `table.hjust`, `parse`, and `summary.fun`
## [1] "PANEL 1; group(s) NULL; 'draw_function()' input 'data' (head):"
## label
## 1 (Intercept), x, I(x^2), I(x^3), 0.0045, 0.00109, 2.14e05, 1.06e06, 0.0227, 0.00194, 4.44e05, 2.89e07, 0.198, 0.563, 0.482, 3.65, 0.843, 0.575, 0.631, < 0.001
## PANEL x y
## 1 1 1 1
## fm.tb
## 1 (Intercept), x, I(x^2), I(x^3), 0.0045, 0.00109, 2.14e05, 1.06e06, 0.0227, 0.00194, 4.44e05, 2.89e07, 0.198, 0.563, 0.482, 3.65, 0.843, 0.575, 0.631, < 0.001
## fm.tb.type fm.class fm.method fm.formula
## 1 fit.summary lm lm y ~ x + I(x^2) + I(x^3)
## fm.formula.chr npcx npcy
## 1 y ~ x + I(x^2) + I(x^3) NA NA
gginnards::geom_debug()
can be used with ‘ggplot2’ and
its extensions. More examples are given in the documentation of
package ‘gginnards’.
Volcano and quadrant plots are scatter plots with some peculiarities. Creating these plots from scratch using ‘ggplot2’ can be time consuming, but allows flexibility in design. Rather than providing a ‘canned’ function to produce volcano plots, package ‘ggpmisc’ provides several building blocks that facilitate the production of volcano and quadrant plots. These are wrappers on scales from packages ‘scales’ and ‘ggplot2’ with suitable defaults plus helper functions to create factors from numeric outcomes.
Manual scales for colour and fill aesthetics with defaults suitable for the three way outcomes from statistical tests.
Scales for x or y aesthetics mapped to Pvalues, FDR (false discovery rate) or log FC (logarithm of foldchange) as used in volcano and quadrant plots of transcriptomics and metabolomics data.
A simple function to expand scale limits to be symmetric around zero. Can be passed as argument to parameter limits of continuous scales from packages ‘ggpmisc’, ‘ggplot2’ or ‘scales’ (and extensions).
Volcano plots are frequently used for transcriptomics and metabolomics. They are used to show Pvalues or FDR (false discovery rate) as a function of effect size, which can be either an increase or a decrease. Effect sizes are expressed as foldchanges compared to a control or reference condition. Colours are frequently used to highlight significant responses. Counts of significant increases and decreases are frequently also added.
Outcomes encoded as 1, 0 or 1, as seen in the tibble below need to
be converted into factors with suitable labels for levels. This can be
easily achieved with function outcome2factor()
.
## tag gene outcome logFC PValue genotype
## 1 AT1G01040 ASU1 0 0.15284466 0.35266997 Ler
## 2 AT1G01290 ASG4 0 0.30057068 0.05471732 Ler
## 3 AT1G01560 ATSBT1.1 0 0.57783350 0.06681310 Ler
## 4 AT1G01790 AtSAM1 0 0.04729662 0.74054263 Ler
## 5 AT1G02130 AtTRM82 0 0.14279891 0.29597519 Ler
## 6 AT1G02560 PRP39 0 0.23320752 0.07487043 Ler
Most frequently foldchange data is available logtransformed, using either 2 or 10 as base. In general it is more informative to use tick labels expressed as untransformed foldchange.
By default scale_x_logFC()
and
scale_y_logFC()
expect the logFC data log2transformed, but
this can be overridden. The default use of untransformed foldchange for
tick labels can also be overridden. Scale scale_x_logFC()
in addition by default expands the scale limits to make them symmetric
around zero. If %unit
is included in the name character
string, suitable units are appended as shown in the example below.
Scales scale_y_Pvalue()
and
scale_x_Pvalue()
have suitable defaults for name and
labels, while scale_colour_outcome()
provides suitable
defaults for the colours mapped to the outcomes. To change the labels
and title of the key
or guide
pass suitable
arguments to parameters name
and labels
of
these scales. These x and y scales by default squish offlimits
(outofbounds) observations towards the limits.
ggplot(volcano_example.df,
aes(logFC, PValue, colour = outcome2factor(outcome))) +
geom_point() +
scale_x_logFC(name = "Transcript abundance%unit") +
scale_y_Pvalue() +
scale_colour_outcome() +
stat_quadrant_counts(data = function(x) {subset(x, outcome != 0)})
By default outcome2factor()
creates a factor with three
levels as in the example above, but this default can be overridden as
shown below.
Quadrant plots are scatter plots with comparable variables on both axes and usually include the four quadrants. When used for transcriptomics and metabolomics, they are used to compare responses expressed as foldchange to two different conditions or treatments. They are useful when many responses are measured as in transcriptomics (many different genes) or metabolomics (many different metabolites).
A single panel quadrant plot is as easy to produce as a volcano plot
using scales scale_x_logFC()
and
scale_y_logFC()
. The data includes two different outcomes
and two different log foldchange variables. See the previous section on
volcano plots for details. In this example we use as shape a filled
circle and map one of the outcomes to colour and the other to fill,
using the two matched scales scale_colour_outcome()
and
scale_fill_outcome()
.
## tag gene outcome.x outcome.y logFC.x logFC.y genotype
## 1 AT5G11060 TIC56 0 0 0.17685517 0.32956762 Ler
## 2 AT3G01280 ATWRKY48 0 0 0.06471884 0.07771315 Ler
## 3 AT5G06320 ANY1 0 0 0.35200684 0.04331339 Ler
## 4 AT1G05850 AP1M1 0 0 0.40284744 0.04505435 Ler
## 5 AT2G22300 HSA32 0 0 0.34482142 0.11185095 Ler
## 6 AT2G16070 UBQ11 0 0 0.22328946 0.23210780 Ler
In this plot we do not include those genes whose change in transcript abundance is uncertain under both x and y conditions.
ggplot(subset(quadrant_example.df,
xy_outcomes2factor(outcome.x, outcome.y) != "none"),
aes(logFC.x, logFC.y,
colour = outcome2factor(outcome.x),
fill = outcome2factor(outcome.y))) +
geom_quadrant_lines(linetype = "dotted") +
stat_quadrant_counts(size = 3, colour = "white") +
geom_point(shape = "circle filled") +
scale_x_logFC(name = "Transcript abundance for x%unit") +
scale_y_logFC(name = "Transcript abundance for y%unit") +
scale_colour_outcome() +
scale_fill_outcome() +
theme_dark()
To plot in separate panels those observations that are significant along both x and y axes, x axis, y axis, or none, with quadrants merged takes more effort. We first define two helper functions to add counts and quadrant lines to each of the four panels.