Introduction

This vignette shows how to develop, internally and externally validate a (logistic) regression prediction model with mice and psfmi.

Steps

Step 1 - Install the psfmi and mice package

You can install the released version of psfmi with:

install.packages("psfmi")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("mwheymans/psfmi")

You can install the released version of mice with:

install.packages("mice")

Back to Steps

Step 2 - Impute the missing data with mice

I generated 5 imputed datasets to handle the missing values in the lbp_orig dataset that is included in the psfmi package. Depending on the amount of missing data, the number of imputed datasets may be increased (see on-line book: Applied Missing Data analysis).

  library(mice)
  library(psfmi)

  imp <- mice(lbp_orig, m=5, maxit=10) 
## 
##  iter imp variable
##   1   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   1   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   1   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   1   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   1   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   2   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   2   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   2   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   2   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   2   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   3   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   3   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   3   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   3   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   3   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   4   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   4   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   4   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   4   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   4   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   5   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   5   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   5   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   5   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   5   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   6   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   6   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   6   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   6   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   6   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   7   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   7   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   7   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   7   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   7   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   8   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   8   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   8   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   8   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   8   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   9   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   9   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   9   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   9   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   9   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   10   1  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   10   2  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   10   3  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   10   4  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport
##   10   5  Carrying  Pain  Tampascale  Function  Radiation  Age  Satisfaction  JobControl  JobDemands  SocialSupport

Use the complete function of the mice package to extract the completed datasets. By setting action = "long" and include = FALSE, the imputed datasets are stacked under each other to form one long dataset and the original dataset (that included missing values) is not included in that long dataset.

  library(mice)
  
  data_comp <- complete(imp, action = "long", include = FALSE)

Back to Steps

Step 3 - Develop the model

Use the psfmi_lr function in the psfmi package to perform backward selection over the 5 multiply imputed datasets with a p-value of 0.05 and as selection method D1.
Note that with this function it is possible to include cubic spline coefficients in case of non-linear relationships during backward selection and that predictors may be forced in the model during backward selection by including them at the setting keep.predictors. When the setting direction is changed to FW, forward selection is done.

  library(psfmi)

  pool_lr <- psfmi_lr(data=data_comp, nimp=5, impvar=".imp", Outcome="Chronic",
  predictors=c("Gender", "Age", "JobControl", "Tampascale", "Pain", "Radiation",
  "JobDemands", "SocialSupport", "Smoking"), cat.predictors = c("Satisfaction", "Carrying"), 
  spline.predictors = "Function", nknots = 3,  
  keep.predictors = "Radiation", p.crit = 0.157, method="D1", direction = "BW")
## Removed at Step 1 is - Gender
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Smoking
## Removed at Step 4 is - Age
## Removed at Step 5 is - rcs(Function,3)
## Removed at Step 6 is - JobControl
## Removed at Step 7 is - SocialSupport
## Removed at Step 8 is - Tampascale
## 
## Selection correctly terminated, 
## No more variables removed from the model
  pool_lr$RR_model_final
## $`Step 9`
##                    term   estimate std.error  statistic        df      p.value
## 1           (Intercept) -5.4555864 1.0533349 -5.1793466  80.89631 1.596456e-06
## 2                  Pain  0.9066812 0.1912409  4.7410420  36.76855 3.169734e-05
## 3             Radiation  0.4783856 0.4964348  0.9636423 130.77651 3.370038e-01
## 4 factor(Satisfaction)2 -0.5236336 0.6191642 -0.8457104  62.08012 4.009623e-01
## 5 factor(Satisfaction)3 -2.7018731 0.7981299 -3.3852547  95.26055 1.034237e-03
## 6     factor(Carrying)2  1.2435674 0.6128183  2.0292596  92.09083 4.531842e-02
## 7     factor(Carrying)3  1.7770774 0.7317224  2.4286223  40.12301 1.973018e-02
##            OR   lower.EXP  upper.EXP
## 1 0.004272371 0.000542066  0.0336733
## 2 2.476091328 1.702076603  3.6020872
## 3 1.613467475 0.609798549  4.2690775
## 4 0.592364189 0.176013335  1.9935724
## 5 0.067079751 0.014034907  0.3206073
## 6 3.467963196 1.043357183 11.5269909
## 7 5.912550841 1.409029900 24.8101601
  pool_lr$multiparm_final
## $`Step 9`
##                       p-values D1 F-statistic
## Pain                 1.532075e-05  22.4774796
## Radiation            3.353720e-01   0.9286065
## factor(Satisfaction) 2.483896e-03   6.1898977
## factor(Carrying)     3.699533e-02   3.4226311

The same result can be obtained by using the formula setting in the psfmi_lr function

  library(psfmi)

  pool_lr <- psfmi_lr(data=data_comp, nimp=5, impvar=".imp", 
  formula = Chronic ~ Gender + Age + JobControl + Tampascale + Pain + Radiation +
    JobDemands + SocialSupport + Smoking + factor(Satisfaction) + factor(Carrying) + 
  rcs(Function, 3), keep.predictors = "Radiation", 
  p.crit = 0.157, method="D1", direction = "BW")
## Removed at Step 1 is - Gender
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Smoking
## Removed at Step 4 is - Age
## Removed at Step 5 is - rcs(Function,3)
## Removed at Step 6 is - JobControl
## Removed at Step 7 is - SocialSupport
## Removed at Step 8 is - Tampascale
## 
## Selection correctly terminated, 
## No more variables removed from the model
  pool_lr$RR_model_final
## $`Step 9`
##                    term   estimate std.error  statistic        df      p.value
## 1           (Intercept) -5.4555864 1.0533349 -5.1793466  80.89631 1.596456e-06
## 2                  Pain  0.9066812 0.1912409  4.7410420  36.76855 3.169734e-05
## 3             Radiation  0.4783856 0.4964348  0.9636423 130.77651 3.370038e-01
## 4 factor(Satisfaction)2 -0.5236336 0.6191642 -0.8457104  62.08012 4.009623e-01
## 5 factor(Satisfaction)3 -2.7018731 0.7981299 -3.3852547  95.26055 1.034237e-03
## 6     factor(Carrying)2  1.2435674 0.6128183  2.0292596  92.09083 4.531842e-02
## 7     factor(Carrying)3  1.7770774 0.7317224  2.4286223  40.12301 1.973018e-02
##            OR   lower.EXP  upper.EXP
## 1 0.004272371 0.000542066  0.0336733
## 2 2.476091328 1.702076603  3.6020872
## 3 1.613467475 0.609798549  4.2690775
## 4 0.592364189 0.176013335  1.9935724
## 5 0.067079751 0.014034907  0.3206073
## 6 3.467963196 1.043357183 11.5269909
## 7 5.912550841 1.409029900 24.8101601
  pool_lr$multiparm_final
## $`Step 9`
##                       p-values D1 F-statistic
## Pain                 1.532075e-05  22.4774796
## Radiation            3.353720e-01   0.9286065
## factor(Satisfaction) 2.483896e-03   6.1898977
## factor(Carrying)     3.699533e-02   3.4226311

Back to Steps

Step 4 - Internally validate the model

To internally validate the model we use the psfmi_perform function. With this function five different methods can be used to internally validate models in MI data see these Vignettes, Three methods use cross-validation and two bootstrapping in combination with MI.

We will use cross-validation and more specific the method cv_MI_RR. With this method it is possible to integrate backward selection into the cross-validation procedure. The last model that is selected by the psfmi_lr function is internally validated. So, if we want to apply backward selection during cross-validation from the full model we first have to apply the psfmi_lr function without variable selection. That is the example that we want to use here, because we know that variable selection is one of the main reasons that prediction models are over-fitted.

library(psfmi)

pool_val <- psfmi_lr(data=data_comp, formula = Chronic ~ Gender + Age + JobControl + Tampascale + 
                      Pain + Radiation + JobDemands + SocialSupport + Smoking + factor(Satisfaction) + 
                      factor(Carrying) + rcs(Function, 3), p.crit = 1, direction="BW",
                      nimp=5, impvar=".imp", method="D1")

set.seed(200)
res_cv <- psfmi_perform(pool_val, val_method = "cv_MI_RR", data_orig = lbp_orig, folds = 3,
                     p.crit=0.05, BW=TRUE, nimp_mice = 10, miceImp = miceImp, printFlag = FALSE)
## 
## fold 1
## Removed at Step 1 is - rcs(Function,3)
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Smoking
## Removed at Step 4 is - SocialSupport
## Removed at Step 5 is - JobControl
## Removed at Step 6 is - Age
## Removed at Step 7 is - Tampascale
## Removed at Step 8 is - Gender
## Removed at Step 9 is - Radiation
## 
## Selection correctly terminated, 
## No more variables removed from the model
## 
## fold 2
## Removed at Step 1 is - JobDemands
## Removed at Step 2 is - Smoking
## Removed at Step 3 is - JobControl
## Removed at Step 4 is - Age
## Removed at Step 5 is - Radiation
## Removed at Step 6 is - Tampascale
## Removed at Step 7 is - rcs(Function,3)
## Removed at Step 8 is - Gender
## Removed at Step 9 is - factor(Carrying)
## Removed at Step 10 is - SocialSupport
## 
## Selection correctly terminated, 
## No more variables removed from the model
## 
## fold 3
## Removed at Step 1 is - Age
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Radiation
## Removed at Step 4 is - Tampascale
## Removed at Step 5 is - Gender
## Removed at Step 6 is - Smoking
## Removed at Step 7 is - rcs(Function,3)
## Removed at Step 8 is - SocialSupport
## Removed at Step 9 is - JobControl
## Removed at Step 10 is - factor(Satisfaction)
## Removed at Step 11 is - factor(Carrying)
## 
## Selection correctly terminated, 
## No more variables removed from the model
res_cv
## $stats
##                  Train      Test
## AUC          0.8866677 0.8377989
## Brier scaled 0.4771826 0.2751471
## Rsq          0.5484889 0.4137824
## 
## $slope
##    Intercept        Slope 
## -0.009177862  0.864415077

Back to Steps

Step 5 - Shrink pooled coefficients and determine new intercept

We can use the slope value of 0.864415077 that was estimated at the previous step as a shrinkage factor to prevent our model from being overfitted in new data. We do this by multiplying the pooled coefficients with the shrinkage factor and also to determine a new intercept value that is aligned with the shrunken coefficients.

We use the pool_intadj function for this that will provide the adjusted coefficients and new intercept value.

library(psfmi)

res <- pool_intadj(pool_lr, shrinkage_factor = 0.864415077)

res$int_adj
## [1] -4.74758
res$coef_shrink_pooled
##                  Pain             Radiation factor(Satisfaction)2 
##             0.7837489             0.4135237            -0.4526368 
## factor(Satisfaction)3     factor(Carrying)2     factor(Carrying)3 
##            -2.3355398             1.0749584             1.5361325

The last step is to externally validate this adjusted model.

Back to Steps

Step 6 - Externally validate the model

We validate the model in an external dataset that is imputed five times due to missing data.

library(psfmi)

 res_extval <- mivalext_lr(data.val = lbpmi_extval, nimp = 5, impvar = "Impnr",
  Outcome = "Chronic", predictors = pool_lr$predictors_final,
  lp.orig = c(res$int_adj, res$coef_shrink_pooled),
  cal.plot = TRUE)
## 
## Pooled performance measures over m = 5 imputed external validation datasets correctly estimated

 res_extval$ROC
## $`ROC (logit)`
##             95% Low    AUC 95% Up
## AUC (logit)  0.7365 0.8477 0.9173
## 
## $`ROC (median)`
## 1st Qu.  Median 3rd Qu. 
## 0.84678 0.84772 0.85366
 res_extval$R2_fixed
## $`Fisher Z (fixed)`
## [1] 0.45526
## 
## $`Median (fixed)`
## 1st Qu.  Median 3rd Qu. 
## 0.45742 0.45831 0.47240
 res_extval$HLtest
##         D         p        df       df2 
##   0.89517   0.52245   8.00000 138.64820
 res_extval$LP_pooled_ext
## intercept     slope 
##   0.00589   0.87747

In the external dataset the AUC is 0.84772 and the R-squared 0.45. The Hosmer and Lemeshow test has a p-value of 0.63168, which means that the fit is satisfactory. This is also confirmed on the calibration plot where the calibration curves in each imputed dataset are near the optimal (dashed) line. The slope (regression coefficients) is has a value of 0.86526 and slightly deviates from the value of 1, which means that the coefficients values differ between the development and external dataset.

Back to Steps