# This document

The purpose of this vignette is to introduce readers to DMRnet package, bring them up to speed by providing a simple use case example and point interested readers towards more comprehensive informative material.

# DMRnet package

DMRnet is an R package for regression and classification with a family of model selection algorithms. DMRnet package supports both continuous and categorical predictors. The overall number of regressors may exceed the number of observations.

The selected model consists of a subset of numerical regressors and partitions of levels of factors.

The available model selection algorithms are the following:

• DMRnet is the default and the most comprehensive model selection algorithm in the package, it can be used both for $$p<n$$ and $$p \geq n$$ scenarios. It first executes a screening step in order to reduce the problem to $$p<n$$ and then uses DMR subsequently.
• GLAMER stands for Group LAsso MERger and it is a new (simplified in relation to DMRnet) model selection algorithm that can be used both for $$p<n$$ and $$p \geq n$$ scenarios. In comparison to DMRnet, the step of reordering variables based on their t-statistics is skipped. This is the first partition selection algorithm in literature for which a partition selection consistency has been proven for high dimensional scenario [Nowakowski et al., 2021].
• PDMR stands for Plain DMR and it is a threshold parameter agnostic variant of GLAMER algorithm. It can be used both for $$p<n$$ and $$p \geq n$$ scenarios. [Nowakowski et al., 2022].
• DMR [Maj-Kańska et al., 2015] is a model selection algorithm that can be used for $$p<n$$ scenario only.
• SOSnet [Pokarowski and Mielniczuk, 2015, Pokarowski et al., 2022] is a model selection algorithm that is used both for $$p<n$$ and $$p \geq n$$ scenarios for continuous-only regressors (with no categorical variables and, consequently, no partition selection).

Algorithm-wise, the package user has the following choices:

• The user can select DMRnet algorithm by calling DMRnet() or cv.DMRnet() functions.
• The user can select GLAMER or PDMR algorithms by calling DMRnet() or cv.DMRnet() functions and passing algorithm="glamer" or algorithm="PDMR" parameter, respectively.
• The user can select DMR algorithm by calling DMR() or cv.DMR() functions.
• The SOSnet algorithm is automatically chosen if DMRnet() or cv.DMRnet() functions were called with an input consisting of continuous-only regressors.

As this vignette is introductory only, and the choice of an algorithm is a somewhat more advanced topic, from now on it is assumed that the user works with DMRnet algorithm only.

# OK, so let’s get started

To load the package into memory execute the following line:

library(DMRnet)
#> Loaded DMRnet version 0.3.4

DMRnet package features two built-in data sets: Promoter and Miete [Fahrmeir et al., 2004].

We will work with Miete data set in this vignette. The Miete data consists of $$n = 2053$$ households interviewed for the Munich rent standard 2003. The response is continuous, it is monthly rent per square meter in Euros. There are 9 categorical and 2 continuous variables.

data("miete")
X <- miete[,-1]
y <- miete$rent head(X) #> bathextra tiles area kitchen rooms best good warm central year size #> 1 0 0 2 0 2 0 1 0 0 1918 68 #> 2 0 0 2 0 2 0 1 0 0 1995 65 #> 3 0 0 2 0 3 0 1 0 0 1918 63 #> 4 1 0 16 0 3 0 0 0 0 1983 65 #> 5 1 0 16 1 4 0 1 0 0 1995 100 #> 6 0 0 16 0 4 0 0 0 0 1980 81 Out of 9 categorical variables 7 have 2 levels each, and the two remaining (area and rooms) have 25 and 6 levels, respectively. This sums up to 45 levels. The way DMRnet handles levels results in 36 parameters for the categorical variables (the first level in each categorical variable is already considered included in the intercept). The 2 continuous variables give 2 additional parameters, the intercept is one more, so it totals in $$p = 39$$. # Estimating and inspecting a sequence of models In contrast to explicit group lasso methods which need a design matrix with explicit groups, DMRnet package internally transforms an input matrix into a design matrix (e.g. by expanding factors into a set of one-hot encoded dummy variables). Thus, X needs no additional changes and already is all set to be fed into DMRnet: models <- DMRnet(X, y, family="gaussian") models #> #> Arguments: #>$family
#> [1] "gaussian"
#>
#> $clust.method #> [1] "complete" #> #>$o
#> [1] 5
#>
#> $nlambda #> [1] 100 #> #>$maxp
#> [1] 1027
#>
#> $lambda #> NULL #> #> Family of models: #> Df RSS #> [1,] 39 44889691 #> [2,] 38 44889694 #> [3,] 37 44889702 #> [4,] 36 44889714 #> [5,] 35 44889736 #> [6,] 34 44889761 #> [7,] 33 44889800 #> [8,] 32 44889868 #> [9,] 31 44890015 #> [10,] 30 44890219 #> [11,] 29 44890407 #> [12,] 28 44890641 #> [13,] 27 44890887 #> [14,] 26 44891333 #> [15,] 25 44892150 #> [16,] 24 44892640 #> [17,] 23 44894059 #> [18,] 22 44898820 #> [19,] 21 44904071 #> [20,] 20 44910418 #> [21,] 19 44921810 #> [22,] 18 44930083 #> [23,] 17 44958900 #> [24,] 16 45063641 #> [25,] 15 45106343 #> [26,] 14 45232724 #> [27,] 13 45290696 #> [28,] 12 45367049 #> [29,] 11 45699667 #> [30,] 10 46235150 #> [31,] 9 46865054 #> [32,] 8 47378415 #> [33,] 7 48034125 #> [34,] 6 49694598 #> [35,] 5 50786340 #> [36,] 4 53188046 #> [37,] 3 56328078 #> [38,] 2 61742060 #> [39,] 1 123608575 Here, family="gaussian" argument was used to indicate, that we are interested in a linear regression for modeling a continuous response variable. The family="gaussian" is the default and could have been omitted as well. In a printout above you can notice a bunch of other default values set for some other parameters (e.g. nlambda=100 requesting the cardinality of a net of lambda values used) in model estimation. The last part of a printout above presents a sequence of models estimated from X. Notice the models have varying number of parameters (model dimension df ranges from 39 for a full model down to 1 for the 39-th model). Let us plot the path for the coefficients for the 10 smallest models as a function of their model dimension df: plot(models, xlim=c(1, 10), lwd=2) Notice how you can also pass other parameters (xlim=c(1, 10), lwd=2) to change the default behavior of the plot() function. To inspect the coefficients for a particular model in detail, one can use the coef() function. For the sake of the example, let’s examine the last model from the plot, with df=10: coef(models, df=10) #> (Intercept) tiles1 area7 area10 area11 area14 #> -2696.776884 -41.326719 -55.358136 -55.358136 -55.358136 -55.358136 #> area16 area17 area19 area20 area22 area23 #> -55.358136 -55.358136 -55.358136 -55.358136 -55.358136 -55.358136 #> area24 area25 kitchen1 best1 good1 warm1 #> -55.358136 -55.358136 115.522301 133.699574 39.478992 -173.480798 #> central1 year size #> -73.402473 1.428681 6.950459 Notice how area-related coefficients are all equal, effectively merging all 12 listed area levels with a coefficient -55.36 and merging all 13 remaining area levels with a coefficient 0.0. The 13 remaining area levels merged with a coefficient 0.0 were unlisted in a printout above. The reason behind it is that only non-zero coefficients get listed in the coef() function. # Selecting best models There are two basic methods for model selection supported in DMRnet package. One is based on a $$K$$-fold Cross Validation (CV), the other is based on Generalized Information Criterion (GIC) [Foster and George, 1994]. ## Model selection with GIC A GIC-based model selection is performed directly on a precomputed sequence of models gic.model <- gic.DMR(models) plot(gic.model) One can read a model dimension with gic.model$df.min
#> [1] 12

coef(gic.model)
#>  (Intercept)   bathextra1       tiles1        area7       area10       area11
#> -2736.453989    59.260643   -42.372063   -55.249855   -55.249855   -55.249855
#>       area14       area15       area16       area17       area19       area20
#>   -55.249855   -55.249855   -55.249855   -55.249855   -55.249855   -55.249855
#>       area21       area22       area23       area24       area25     kitchen1
#>   -55.249855   -55.249855   -55.249855   -55.249855   -55.249855   111.371688
#>       rooms4       rooms5       rooms6        best1        good1        warm1
#>   -44.187398   -44.187398   -44.187398   127.381589    39.499137  -168.441539
#>     central1         year         size
#>   -74.219946     1.443990     7.156026

## Model selection with $$K$$-fold Cross Validation

The default $$K$$ value is 10. Because the data needs to be partitioned into CV subsets which alternate as training and validating sets, the precomputed sequence of models in models variable cannot be used directly, as was the case with GIC-based model selection. Instead, to perform a 10-fold CV-based model selection execute

cv.model <- cv.DMRnet(X, y)
plot(cv.model)

As before, one can access a model dimension with

cv.model$df.min #> [1] 12 Well, the dimension is the same as when we used GIC directly. Let’s check if the CV-selected model is the same as the best model selected with GIC: coef(cv.model)==coef(gic.model) #> (Intercept) bathextra1 tiles1 area7 area10 area11 #> TRUE TRUE TRUE TRUE TRUE TRUE #> area14 area15 area16 area17 area19 area20 #> TRUE TRUE TRUE TRUE TRUE TRUE #> area21 area22 area23 area24 area25 kitchen1 #> TRUE TRUE TRUE TRUE TRUE TRUE #> rooms4 rooms5 rooms6 best1 good1 warm1 #> TRUE TRUE TRUE TRUE TRUE TRUE #> central1 year size #> TRUE TRUE TRUE # Predicting response for new observations Models selected with both model selection methods can be used to predict response variable values for new observations: predict(gic.model, newx=head(X)) #> [,1] #> 1 559.2277 #> 2 648.9469 #> 3 523.4476 #> 4 596.1306 #> 5 970.6029 #> 6 602.8470 predict(cv.model, newx=head(X)) #> [,1] #> 1 559.2277 #> 2 648.9469 #> 3 523.4476 #> 4 596.1306 #> 5 970.6029 #> 6 602.8470 No wonder the corresponding predictions are all equal, since the selected models were the same. One can predict the response variable values for the whole sequence of models as well: predict(models, newx=head(X)) #> df39 df38 df37 df36 df35 df34 df33 df32 #> 1 570.1916 570.1880 570.1944 570.2044 570.2462 570.2468 570.2519 570.2554 #> 2 663.6774 663.6772 663.7068 663.7338 663.7017 663.7233 663.7777 663.8257 #> 3 519.4320 519.4302 519.4352 519.4451 519.4860 519.4831 519.4827 519.4941 #> 4 568.8714 568.8747 568.8634 568.8770 568.8583 568.8622 568.8667 568.8647 #> 5 948.5308 948.5381 948.5760 948.6384 948.6232 948.6464 948.6823 948.7183 #> 6 583.6860 583.6893 583.6874 583.6696 583.6517 583.6554 583.6507 583.6619 #> df31 df30 df29 df28 df27 df26 df25 df24 #> 1 570.2455 570.2503 569.4073 569.4309 569.4629 569.4797 569.5341 569.1385 #> 2 663.8759 663.8688 662.9924 662.9661 662.9836 662.9517 662.8686 662.3913 #> 3 519.4867 519.5139 518.6397 518.6799 518.6922 518.7602 518.7982 518.4336 #> 4 568.8863 568.8725 568.8900 568.8979 568.2476 568.2456 568.1406 568.1838 #> 5 948.7620 948.6792 948.5951 948.6107 947.9537 947.8719 947.6397 947.5754 #> 6 583.6725 583.6609 583.6405 583.6242 583.0231 583.0253 583.0193 582.9143 #> df23 df22 df21 df20 df19 df18 df17 df16 #> 1 569.0645 569.1271 569.4311 570.5289 570.8667 570.3454 570.3404 562.3298 #> 2 662.3284 662.6799 662.7128 663.9165 663.8429 663.4175 663.1364 652.6998 #> 3 518.4404 518.3905 518.6901 519.9875 520.3221 520.7543 521.0591 512.5275 #> 4 568.1961 568.3222 568.1244 569.0227 568.8951 569.5745 581.6448 580.8725 #> 5 947.3725 948.2670 948.6530 948.6425 949.1207 947.5811 959.4972 960.5421 #> 6 582.8929 582.9015 582.6029 583.3883 583.1541 581.7241 594.4963 592.5491 #> df15 df14 df13 df12 df11 df10 df9 df8 #> 1 556.3796 557.3476 557.1075 559.2277 553.6168 555.5438 537.3356 528.6948 #> 2 647.3305 647.7989 647.2196 648.9469 645.3276 644.7009 626.9365 622.1707 #> 3 520.5629 521.4951 521.4796 523.4476 519.7666 520.7916 502.3516 493.5944 #> 4 587.4935 597.0187 597.1792 596.1306 587.3871 532.7196 538.8990 531.8596 #> 5 961.7100 972.8877 970.1441 970.6029 997.1003 948.1311 919.9631 915.9963 #> 6 595.7414 604.5666 602.7236 602.8470 634.4291 639.6409 646.5391 639.7183 #> df7 df6 df5 df4 df3 df2 df1 #> 1 528.5285 533.6460 526.5034 557.4039 568.6685 559.0850 570.093 #> 2 621.0150 623.7535 650.1912 536.6915 547.5318 538.3833 570.093 #> 3 493.3071 498.1606 490.2621 522.8832 533.4407 524.5822 570.093 #> 4 527.6910 551.1119 552.7880 536.6915 547.5318 538.3833 570.093 #> 5 917.8128 997.3743 829.1419 929.4490 794.1263 779.9030 570.093 #> 6 635.9730 660.3250 663.0939 647.1579 660.2607 648.7923 570.093 Notice how the df12 model predictions overlap with the values we obtained above. # Classification with 2 classes For a classification task with 2 classes, the non-default family="binomial" should be provided in a call to DMRnet and cv.DMRnet (but not to gic.DMR) and the response variable should be a factor with two classes, with the last level in alphabetical order interpreted as the target class. It is illustrated with the following code with a somewhat artificial example: binomial_y <- factor(y > mean(y)) #changing Miete response var y into a binomial factor with 2 classes binomial_models <- DMRnet(X, binomial_y, family="binomial") gic.binomial_model <- gic.DMR(binomial_models) gic.binomial_model$df.min
#> [1] 12

A corresponding predict call has a type parameter with the default value "link", which returns the linear predictors. To change that behavior one can substitute the default with type="response" to get the fitted probabilities or type="class" to get the class labels corresponding to the maximum probability. So to get actual values compatible with a binomial y, type="class" should be used:

predict(gic.binomial_model, newx=head(X), type="class")
#>   [,1]
#> 1    0
#> 2    1
#> 3    0
#> 4    1
#> 5    1
#> 6    1

Please note that 1 is the value of a target class in the predict output.

# References

1. Szymon Nowakowski, Piotr Pokarowski, Wojciech Rejchel and Agnieszka Sołtys. 2022. Improving Group Lasso for high-dimensional categorical data. arXiv [stat.ME]. https://doi.org/10.48550/arXiv.2210.14021
2. Szymon Nowakowski, Piotr Pokarowski and Wojciech Rejchel. 2021. Group Lasso Merger for Sparse Prediction with High-Dimensional Categorical Data. arXiv [stat.ME]. https://doi.org/10.48550/arXiv.2112.11114
3. Aleksandra Maj-Kańska, Piotr Pokarowski and Agnieszka Prochenka, 2015. Delete or merge regressors for linear model selection. Electronic Journal of Statistics 9(2): 1749-1778. doi:10.1214/15-EJS1050
4. Piotr Pokarowski and Jan Mielniczuk, 2015. Combined l1 and greedy l0 penalized least squares for linear model selection. Journal of Machine Learning Research 16(29): 961-992. https://www.jmlr.org/papers/volume16/pokarowski15a/pokarowski15a.pdf
5. Piotr Pokarowski, Wojciech Rejchel, Agnieszka Sołtys, Michał Frej and Jan Mielniczuk, 2022. Improving Lasso for model selection and prediction. Scandinavian Journal of Statistics, 49(2): 831–863. doi:10.1111/sjos.12546
6. Ludwig Fahrmeir, Rita Künstler, Iris Pigeot, Gerhard Tutz, 2004. Statistik: der Weg zur Datenanalyse. 5. Auflage, Berlin: Springer-Verlag.
7. Dean P. Foster and Edward I. George, 1994. The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics 22 (4): 1947–75.