Let’s see an example of using the DALEX package to explain a classification model for the survival problem on the Titanic dataset. Here we use the titanic_imputed dataset available in the DALEX package. Note that this data was copied from the stablelearner package and changed for practicality.
library("DALEX")
head(titanic_imputed)
#>   gender age class    embarked  fare sibsp parch survived
#> 1   male  42   3rd Southampton  7.11     0     0        0
#> 2   male  13   3rd Southampton 20.05     0     2        0
#> 3   male  16   3rd Southampton 20.05     1     1        0
#> 4 female  39   3rd Southampton 20.05     1     1        1
#> 5 female  16   3rd Southampton  7.13     0     0        1
#> 6   male  25   3rd Southampton  7.13     0     0        1
Ok, now it’s time to create a model. Let’s use the Random Forest model.
# prepare model
library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked +
                             fare + sibsp + parch,
                           data = titanic_imputed, probability = TRUE)
model_titanic_rf
#> Ranger result
#>
#> Call:
#> ranger(survived ~ gender + age + class + embarked + fare + sibsp + parch, data = titanic_imputed, probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 2207
#> Number of independent variables: 7
#> Mtry: 2
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1422968
The third step (it’s optional but useful) is to create a DALEX explainer for the random forest model.
library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "Random Forest")
#> Preparation of a new explainer is initiated
#> -> model label : Random Forest
#> -> data : 2207 rows 7 cols
#> -> target variable : 2207 values
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
#> -> predicted values : numerical, min = 0.01164526 , mean = 0.3215481 , max = 0.9899436
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -0.7923093 , mean = 0.0006086512 , max = 0.8905081
#> A new explainer has been created!
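To make the role of the explainer more concrete, here is a minimal sketch of the uniform prediction interface it exposes. The block rebuilds the model and explainer from the steps above so that it runs on its own. The names first_passengers and p_survival are introduced here for illustration only, and verbose = FALSE is used just to silence the preparation log.

```r
library("DALEX")
library("ranger")

# Same model and explainer as above
# (survived ~ . selects the same seven predictors as the explicit formula)
model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                           probability = TRUE)
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "Random Forest",
                              verbose = FALSE)

# Illustrative input: the first three passengers, without the survived column
first_passengers <- titanic_imputed[1:3, -8]

# The explainer stores the model together with a predict function,
# so downstream tools can request predictions without knowing about ranger
p_survival <- explain_titanic_rf$predict_function(explain_titanic_rf$model,
                                                  first_passengers)
p_survival
```

For this binary problem the result should be one predicted survival probability per passenger, which is the quantity the functions in the following steps work with.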
Use feature_importance() on the explainer to present the importance of particular features. Note that type = "difference" normalizes the dropout losses so that they all start at 0. The call below does not set type, so it reports raw dropout losses.
library("ingredients")
fi_rf <- feature_importance(explain_titanic_rf)
head(fi_rf)
#>       variable mean_dropout_loss         label
#> 1 _full_model_         0.3408062 Random Forest
#> 2        parch         0.3520488 Random Forest
#> 3        sibsp         0.3520933 Random Forest
#> 4     embarked         0.3527842 Random Forest
#> 5          age         0.3760269 Random Forest
#> 6         fare         0.3848921 Random Forest
plot(fi_rf)
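The call above leaves type unset. As a sketch of the normalized variant mentioned in the text, the block below passes type = "difference", so that each loss is reported relative to the full model. It rebuilds the model and explainer so that it runs on its own; the name fi_rf_diff is introduced here for illustration, and the exact values will vary between runs because the importance is based on random permutations.

```r
library("DALEX")
library("ranger")
library("ingredients")

# Same model and explainer as above
model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                           probability = TRUE)
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "Random Forest",
                              verbose = FALSE)

# Dropout losses expressed as a difference from the full model's loss
fi_rf_diff <- feature_importance(explain_titanic_rf, type = "difference")
head(fi_rf_diff)
plot(fi_rf_diff)
```

This form is convenient when comparing several models in one plot, since every model then starts from the same baseline.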
As we can see, the most important feature is gender. The next three important features are class, age and fare. Let’s see the link between the model response and these features.
Such univariate relations can be calculated with partial_dependence().
Kids 5 years old and younger have much higher survival probability.
pp_age <- partial_dependence(explain_titanic_rf, variables = c("age", "fare"))
head(pp_age)
#> Top profiles :
#>   _vname_       _label_       _x_    _yhat_ _ids_
#> 1    fare Random Forest 0.0000000 0.3630884     0
#> 2     age Random Forest 0.1666667 0.5347603     0
#> 3     age Random Forest 2.0000000 0.5536098     0
#> 4     age Random Forest 4.0000000 0.5595259     0
#> 5    fare Random Forest 6.1793080 0.3100674     0
#> 6     age Random Forest 7.0000000 0.5159751     0
plot(pp_age)
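To show the idea behind a partial dependence profile, here is a hand-rolled sketch for age: fix age at a given value for every passenger, predict, and average. This illustrates the general technique and is not the package's internal implementation; in particular, the package function may work on a sample of observations and a finer grid, so its values need not match these exactly. The names age_grid and pd_manual are hypothetical, and the block rebuilds the model and explainer so that it runs on its own.

```r
library("DALEX")
library("ranger")

# Same model and explainer as above
model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                           probability = TRUE)
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "Random Forest",
                              verbose = FALSE)

# Hypothetical grid of ages at which to evaluate the profile
age_grid <- c(1, 5, 10, 20, 40, 60)

# For each grid value: set age to that value for every passenger,
# predict with the explainer's predict function, and average the predictions
pd_manual <- sapply(age_grid, function(a) {
  d <- explain_titanic_rf$data
  d$age <- a
  mean(explain_titanic_rf$predict_function(explain_titanic_rf$model, d))
})

data.frame(age = age_grid, mean_prediction = pd_manual)
```

Each row gives the average predicted survival probability if every passenger had the given age, with all other features left as observed.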
cp_age <- conditional_dependence(explain_titanic_rf, variables = c("age", "fare"))
plot(cp_age)
ap_age <- accumulated_dependence(explain_titanic_rf, variables = c("age", "fare"))
plot(ap_age)