# Basic Walkthrough

## 1. Introduction

Welcome to the world of LightGBM, a highly efficient gradient boosting implementation (Ke et al. 2017).

```r
library(lightgbm)
```

This vignette guides you through its basic usage. It shows how to build a simple binary classification model on a subset of the `bank` dataset (Moro, Cortez, and Rita 2014). You will use the two input features `age` and `balance` to predict whether a client has subscribed to a term deposit.

## 2. The dataset

The first five rows of the response `y` and the two features look as follows.

```r
data(bank, package = "lightgbm")

bank[1L:5L, c("y", "age", "balance")]
#>         y   age balance
#>    <char> <int>   <int>
#> 1:     no    30    1787
#> 2:     no    33    4789
#> 3:     no    35    1350
#> 4:     no    30    1476
#> 5:     no    59       0
```

```r
# Distribution of the response
table(bank$y)
#>
#>   no  yes
#> 4000  521
```
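The counts above show that the response is clearly imbalanced. As a minimal sketch, the class shares can be computed in base R; the counts are simply copied from the `table()` output, and the variable names `counts` and `shares` are chosen here for illustration.

```r
# Class counts copied from the table() output above
counts <- c(no = 4000, yes = 521)

# Share of each class among all clients
shares <- prop.table(counts)
round(shares, 3)
```

Roughly 11.5% of the clients in this subset subscribed to a term deposit.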

## 3. Training the model

The R package of LightGBM offers two functions to train a model:

• `lgb.train()`: The main training function. It offers full flexibility but requires a `Dataset` object created by `lgb.Dataset()`.
• `lightgbm()`: A simpler but less flexible interface. Data can be passed directly, without first calling `lgb.Dataset()`.

### 3.1 Using the `lightgbm()` function

First, convert the data to numeric form: a 0/1 response vector and a numeric feature matrix. Then fit the model with the `lightgbm()` function.

```r
# Numeric response and feature matrix
y <- as.numeric(bank$y == "yes")
X <- data.matrix(bank[, c("age", "balance")])

# Train
fit <- lightgbm(
    data = X
    , label = y
    , params = list(
        num_leaves = 4L
        , learning_rate = 1.0
        , objective = "binary"
    )
    , nrounds = 10L
    , verbose = -1L
)

# Result
summary(predict(fit, X))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.01192 0.07370 0.09871 0.11593 0.14135 0.65796
```

It seems to have worked! The predictions are indeed probabilities between 0 and 1.
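If class labels are needed rather than probabilities, one option is to apply a cutoff to the predictions. The following is only a sketch: it refits the same model so that the block runs on its own, and the cutoff of 0.5 and the names `p` and `pred_class` are assumptions made for illustration, not recommendations from the package.

```r
library(lightgbm)

# Same data preparation and model as above
data(bank, package = "lightgbm")
y <- as.numeric(bank$y == "yes")
X <- data.matrix(bank[, c("age", "balance")])

fit <- lightgbm(
    data = X
    , label = y
    , params = list(
        num_leaves = 4L
        , learning_rate = 1.0
        , objective = "binary"
    )
    , nrounds = 10L
    , verbose = -1L
)

# Predicted probabilities of the positive class ("yes")
p <- predict(fit, X)

# Assumption: a cutoff of 0.5, chosen only for illustration
pred_class <- as.integer(p > 0.5)

# Cross-tabulate observed and predicted classes on the training data
table(observed = y, predicted = pred_class)
```

Because the table is computed on the training data itself, it gives an optimistic picture of the model's accuracy.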

### 3.2 Using the `lgb.train()` function

Alternatively, you can use the more flexible interface `lgb.train()`. It requires one additional step: wrapping `X` and `y` in a `Dataset` object with `lgb.Dataset()`. Parameters are passed to `lgb.train()` as a named list.

```r
# Data interface
dtrain <- lgb.Dataset(X, label = y)

# Parameters
params <- list(
    objective = "binary"
    , num_leaves = 4L
    , learning_rate = 1.0
)

# Train
fit <- lgb.train(
    params
    , data = dtrain
    , nrounds = 10L
    , verbose = -1L
)
```
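One example of the extra flexibility of `lgb.train()` is monitoring a validation set during training through its `valids` argument. The sketch below is an illustration under stated assumptions, not part of the original walkthrough: the 80/20 split, the seed, and the names `idx`, `dvalid`, `fit_valid`, and `logloss` are hypothetical choices, and the metric name `binary_logloss` is assumed to be the default metric for the `binary` objective.

```r
library(lightgbm)

data(bank, package = "lightgbm")
y <- as.numeric(bank$y == "yes")
X <- data.matrix(bank[, c("age", "balance")])

# Assumption: a random 80/20 split, chosen only for illustration
set.seed(1L)
idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))

# Training data, and validation data built with reference to it
dtrain <- lgb.Dataset(X[idx, , drop = FALSE], label = y[idx])
dvalid <- lgb.Dataset.create.valid(
    dtrain
    , X[-idx, , drop = FALSE]
    , label = y[-idx]
)

# Train while recording the metric on the validation set
fit_valid <- lgb.train(
    params = list(
        objective = "binary"
        , num_leaves = 4L
        , learning_rate = 1.0
    )
    , data = dtrain
    , nrounds = 10L
    , valids = list(valid = dvalid)
    , verbose = -1L
)

# One log loss value per boosting round
logloss <- lgb.get.eval.result(fit_valid, "valid", "binary_logloss")
logloss
```

Inspecting how `logloss` develops over the rounds gives a rough indication of whether additional boosting rounds still help on data the model has not seen.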

Try it out! If you get stuck, consult the LightGBM documentation for more details.

## 4. References

Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” In Advances in Neural Information Processing Systems 30 (NIPS 2017).

Moro, Sérgio, Paulo Cortez, and Paulo Rita. 2014. “A Data-Driven Approach to Predict the Success of Bank Telemarketing.” Decision Support Systems 62: 22–31.