In this post on classification algorithms, we will build a multiclass random forest model using the penguins dataset. Based on physical measurements of the penguins, we will train the model and predict the penguin species.
Random Forest is a robust and versatile ensemble learning technique widely used for both classification and regression tasks. It comprises an ensemble of decision trees, each trained on a bootstrapped sample of the original dataset and considering a random subset of features at each split, which mitigates overfitting. Predictions are made by majority vote in classification or by averaging in regression, improving accuracy and stability. Notably, Random Forest also provides feature-importance measures, helping identify the variables that contribute most to predictive performance. Its feature randomization and bootstrapping make it resilient to overfitting, and hyperparameter tuning can further enhance performance, making Random Forest a powerful tool in machine learning applications.
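The ideas above (bootstrapped trees, majority vote, feature importance) can be sketched in a few lines with tidymodels. This is a minimal illustration, not the post's exact code; the `palmerpenguins` package is assumed as the source of the penguins data, and the tree count here is arbitrary.

```r
library(tidymodels)
library(palmerpenguins)  # assumed source of the penguins data

set.seed(123)
penguins_clean <- penguins %>% drop_na()

# Each of the 500 trees is grown on a bootstrap sample; the ranger
# engine's impurity importance records each feature's contribution.
rf_fit <- rand_forest(trees = 500, mode = "classification") %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(species ~ ., data = penguins_clean)

# The predicted class is the majority vote across all trees
predict(rf_fit, new_data = head(penguins_clean))
```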
1. Required libraries
library(tidyverse)
library(tidymodels)      # parsnip, workflows, tune — used below
library(GGally)          # ggpairs() pairwise plots
library(palmerpenguins)  # assumed source of the penguins data
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
[GGally::ggpairs() pairwise plot of the penguin features. Each default histogram (`bins = 30`) prints a `stat_bin()` message suggesting a better `binwidth`; the repeated messages are omitted here.]
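Such a pairwise exploratory plot is produced with `GGally::ggpairs()`; a sketch of the kind of call involved is below. The column selection and colouring are assumptions, not the post's exact code.

```r
library(GGally)
library(tidyverse)
library(palmerpenguins)  # assumed source of the penguins data

# Pairwise plot of the numeric penguin features, coloured by species.
# Each diagonal histogram defaults to 30 bins, which is what triggers
# the repeated `stat_bin()` messages.
penguins %>%
  drop_na() %>%
  select(species, bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g) %>%
  ggpairs(aes(colour = species, alpha = 0.5))
```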
### Add formula and model together with workflow

tune_wf <- workflow() %>%
  add_formula(species ~ .) %>%
  add_model(rf_spec)

tune_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
species ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 1000
min_n = tune()
Computational engine: ranger
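The workflow above references `rf_spec`, which is not shown in this excerpt. A reconstruction consistent with the printed specification (`mtry = tune()`, `trees = 1000`, `min_n = tune()`, ranger engine) is sketched below, together with a typical `tune_grid()` call. The resampling setup and the `penguin_train` split are assumptions; the excerpt does not show how the data were split.

```r
library(tidymodels)

# Model spec matching the printout: tune mtry and min_n, fix 1000 trees
rf_spec <- rand_forest(
  mtry  = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Assumed tuning setup: 5-fold cross-validation on a training split
# (penguin_train is hypothetical here)
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 5)

tune_res <- tune_wf %>%
  tune_grid(resamples = penguin_folds, grid = 10)

# Inspect the best-performing mtry / min_n combinations
show_best(tune_res, metric = "accuracy")
```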