```
library(butcher)
library(parsnip)
```

One of the beauties of working with `R` is the ease with which you can
implement intricate models and make challenging data analysis pipelines
seem almost trivial. Take, for example, the `parsnip` package; with the
installation of a few associated libraries and a few lines of code, you
can fit something as complex as a boosted tree:

```
library(rpart)
fitted_model <- boost_tree(trees = 15) %>%
  set_engine("C5.0") %>%
  fit(as.factor(am) ~ disp + hp, data = mtcars)
```

Or, if you’re working with petabytes of data distributed across many
nodes, just switch out the `parsnip` engine:

```
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbls <- sdf_copy_to(sc, mtcars[, c("am", "disp", "hp")])
fitted_model <- boost_tree(trees = 15) %>%
  set_engine("spark") %>%
  fit(am ~ disp + hp, data = mtcars_tbls)
```

Yet, while our code may appear compact, the underlying fitted result
may not be. Since `parsnip` works as a wrapper for many modeling
packages, its fitted model objects inherit the same properties as those
that arise from the original modeling package. A straightforward example
is the popular `lm` function from the base `stats` package. Whether you
leverage `parsnip` or not, you arrive at the same result:

```
parsnip_lm <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = mtcars)
parsnip_lm
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = mpg ~ ., data = data)
#>
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat           wt
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530
#>        qsec           vs           am         gear         carb
#>     0.82104      0.31776      2.52023      0.65541     -0.19942
```

Using just `lm`:

```
old_lm <- lm(mpg ~ ., data = mtcars)
old_lm
#>
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#>
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat           wt
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530
#>        qsec           vs           am         gear         carb
#>     0.82104      0.31776      2.52023      0.65541     -0.19942
```

Let’s say we take this familiar `old_lm` approach in building our
in-house modeling pipeline. Such a pipeline might entail wrapping `lm()`
in another function, but in doing so, we may end up carrying some junk.

```
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars)
}
```

The linear model fit that our pipeline produces takes up:

```
library(lobstr)
obj_size(in_house_model())
#> 8.02 MB
```

Yet it is fundamentally the same as our `old_lm`, which takes up only:

```
obj_size(old_lm)
#> 22.22 kB
```
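Because the concern is ultimately what gets written to disk, it can help to look at the serialized size as well. The following is a minimal sketch using only base R; `serialized_bytes()` is a hypothetical helper introduced here, not part of `butcher` or `lobstr`, and it assumes the code is run at the top level of an R session so that the formula in `old_lm` points at the global environment.

```r
# Refit both models so the sketch is self-contained.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6)
  lm(mpg ~ ., data = mtcars)
}
old_lm <- lm(mpg ~ ., data = mtcars)
big_lm <- in_house_model()

# Hypothetical helper: serialize(x, NULL) returns the raw vector that
# saveRDS() would write before compression, so its length approximates
# the on-disk footprint of the object.
serialized_bytes <- function(x) length(serialize(x, NULL))

serialized_bytes(old_lm)
serialized_bytes(big_lm)
```

The global environment is serialized as a reference, whereas a function's execution environment is written out in full, so the second number should include the roughly 8 MB of random values.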

Ideally, we want to avoid saving the output of `in_house_model()` on
disk, when we could have something like `old_lm` that takes up less
memory. So, what the heck is going on here? We can examine possible
issues with a fitted model object using the `butcher` package:

```
big_lm <- in_house_model()
butcher::weigh(big_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.01
#>  2 qr.qr         0.00666
#>  3 residuals     0.00286
#>  4 fitted.values 0.00286
#>  5 effects       0.0014
#>  6 coefficients  0.00109
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows
```

The problem here is in the `terms` component of `big_lm`. Because of how
`lm` is implemented in the base `stats` package (it relies on
intermediate forms of the data from the `model.frame` and `model.matrix`
output), the *environment* in which the linear fit was created *was
carried along* in the model output.

We can see this with the `env_print` function from the `rlang` package:

```
library(rlang)
env_print(big_lm$terms)
#> <environment: 0x7fef65261ee0>
#> Parent: <environment: global>
#> Bindings:
#> • some_junk_in_the_environment: <dbl>
```
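The same behavior can be reproduced with base R alone, which makes it clearer where the environment comes from: a formula records the environment in which it was created, and the `terms` component inherits it. The sketch below uses a hypothetical `make_fit()` wrapper and a hypothetical `hidden` variable purely for illustration.

```r
# Hypothetical wrapper, analogous to in_house_model() above.
make_fit <- function() {
  hidden <- runif(10) # local variable unrelated to the model
  lm(mpg ~ wt, data = mtcars)
}
fit <- make_fit()

# The terms object is a formula, and a formula remembers where it was
# created; here that is the execution environment of make_fit().
fit_env <- environment(fit$terms)

ls(fit_env)
#> [1] "hidden"
identical(fit_env, globalenv())
#> [1] FALSE
```

Anything else defined inside the wrapper before the call to `lm()` would show up in `ls(fit_env)` as well, and would be saved along with the model.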

To avoid carrying possible junk in our production pipeline, whether it
is associated with an `lm` model or something more complex, we can
leverage `axe_env()` within the `butcher` package. In other words,

`cleaned_lm <- butcher::axe_env(big_lm, verbose = TRUE)`

Comparing it against our `old_lm`, we find:

```
butcher::weigh(cleaned_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         0.00789
#>  2 qr.qr         0.00666
#>  3 residuals     0.00286
#>  4 fitted.values 0.00286
#>  5 effects       0.0014
#>  6 coefficients  0.00109
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows
```

…it now takes up nearly the same amount of memory:

```
butcher::weigh(old_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         0.00781
#>  2 qr.qr         0.00666
#>  3 residuals     0.00286
#>  4 fitted.values 0.00286
#>  5 effects       0.0014
#>  6 coefficients  0.00109
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows
```

Axing the environment, however, is not the only functionality of
`butcher`. This package provides five S3 generics:

- `axe_call()`: Remove the call object.
- `axe_ctrl()`: Remove the controls fixed for training.
- `axe_data()`: Remove the original data.
- `axe_env()`: Replace inherited environments with empty environments.
- `axe_fitted()`: Remove fitted values.
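To get a feel for what each generic contributes, one option is to apply them one at a time and measure the result. This is only a sketch: it assumes `butcher` and `lobstr` are installed, it covers only `axe_call()`, `axe_env()`, and `axe_fitted()`, and the names `fit`, `axes`, and `sizes_in_bytes` are introduced here for illustration.

```r
library(butcher)
library(lobstr)

fit <- lm(mpg ~ ., data = mtcars)

# Apply each axe function on its own and record the resulting object
# size in bytes, next to the size of the untouched fit.
axes <- list(
  axe_call   = axe_call,
  axe_env    = axe_env,
  axe_fitted = axe_fitted
)
sizes_in_bytes <- c(
  original = as.numeric(obj_size(fit)),
  sapply(axes, function(axe) as.numeric(obj_size(axe(fit))))
)
sizes_in_bytes
```

How much each step saves depends on the model and on where it was fit, so the numbers are best read as relative to the `original` entry.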

In our case here with `lm`, if we are only interested in prediction as
the end product of our modeling pipeline, we could free up a lot of
memory by executing all the possible axe functions at once. To do so, we
simply run `butcher()`:

```
butchered_lm <- butcher::butcher(big_lm)
predict(butchered_lm, mtcars[, 2:11])
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
#>            22.59951            22.11189            26.25064            21.23740
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D
#>            17.69343            20.38304            14.38626            22.49601
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE
#>            24.41909            18.69903            19.19165            14.17216
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
#>            15.59957            15.74222            12.03401            10.93644
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
#>            10.49363            27.77291            29.89674            29.51237
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
#>            23.64310            16.94305            17.73218            13.30602
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
#>            16.69168            28.29347            26.15295            27.63627
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
#>            18.87004            19.69383            13.94112            24.36827
```

Alternatively, we can pick and choose specific axe functions, removing only those parts of the model object that we are no longer interested in characterizing.

```
butchered_lm <- big_lm %>%
  butcher::axe_env() %>%
  butcher::axe_fitted()
predict(butchered_lm, mtcars[, 2:11])
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
#>            22.59951            22.11189            26.25064            21.23740
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D
#>            17.69343            20.38304            14.38626            22.49601
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE
#>            24.41909            18.69903            19.19165            14.17216
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
#>            15.59957            15.74222            12.03401            10.93644
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
#>            10.49363            27.77291            29.89674            29.51237
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
#>            23.64310            16.94305            17.73218            13.30602
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
#>            16.69168            28.29347            26.15295            27.63627
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
#>            18.87004            19.69383            13.94112            24.36827
```

`butcher` makes it easy to axe parts of the fitted output that are no
longer needed, without sacrificing much functionality from the original
model object.
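One way to check that claim for a prediction-only workflow is to compare predictions before and after butchering. The sketch below assumes `butcher` is installed; `fit`, `butchered_fit`, and `new_data` are illustrative names, and the check covers only `predict()`, not other methods such as `summary()`.

```r
library(butcher)

fit <- lm(mpg ~ ., data = mtcars)
butchered_fit <- butcher(fit)

# Predict on the same rows with both objects and compare the results.
new_data <- mtcars[1:5, 2:11]
predictions_match <- all.equal(
  predict(fit, new_data),
  predict(butchered_fit, new_data)
)
predictions_match
```

Running a check like this before saving a butchered model confirms that the pieces removed were not needed for the functionality the pipeline depends on.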