K-means clustering is a popular unsupervised machine learning algorithm designed for partitioning a dataset into distinct groups, or clusters, based on similarity patterns among data points. The algorithm iteratively assigns each data point to the cluster whose mean is closest, forming clusters with minimized intra-cluster distances. K-means requires a pre-specified number of clusters, denoted by ‘k.’ The process continues until convergence, where the assignment of data points and the cluster centroids stabilize. The algorithm is efficient and widely used for tasks such as customer segmentation, image compression, and pattern recognition. However, its performance can be sensitive to the initial placement of cluster centroids, and the choice of ‘k’ requires careful consideration. Despite its limitations, K-means clustering serves as a foundational technique in exploratory data analysis and unsupervised learning.
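As a quick illustration of these mechanics (separate from the analysis below), here is a minimal sketch of running K-means in R on simulated two-dimensional data; the seed, the group means, and the choice of k = 2 are purely illustrative assumptions:

# Illustrative only: two simulated groups, then kmeans() with k = 2
set.seed(123)                                   # assumed seed for reproducibility
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)       # 2 clusters, 25 random starts
km$centers                                      # estimated cluster means
table(km$cluster)                               # cluster sizes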
1. Data
For K-means clustering, we use the freely available USArrests dataset, which ships with base R. The aim is to cluster the US states based on the crime statistics recorded in the USArrests data.
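As a rough sketch, the data can be loaded and inspected as follows; the particular inspection commands are an assumption, not taken from the original analysis:

data("USArrests")     # built-in dataset: 50 states, 4 numeric variables
head(USArrests)       # Murder, Assault, UrbanPop, Rape
summary(USArrests)    # variables are on very different scales, so they are standardized later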
2. Identify and remove variables with high correlation
library(GGally)
Loading required package: ggplot2
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
ggpairs(data = USArrests, columns = 1:4)
The pairs plot shows a high correlation between Murder and Assault, so one of the two should be removed. Here, we remove Murder for the subsequent clustering analysis.
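A sketch of the preprocessing and the kmeans() call that could produce the summary printed below; the object names (data_scaled, data_selected, km_fit), the seed, and the nstart value are assumptions. Since the printed summary still contains the Murder column, it corresponds to the fully scaled data rather than the reduced data frame:

data_scaled   <- scale(USArrests)                             # standardize all four variables
data_selected <- scale(subset(USArrests, select = -Murder))   # drop Murder, then standardize

set.seed(123)                                                 # assumed seed; results depend on the random initialization
km_fit <- kmeans(data_scaled, centers = 4, nstart = 25)       # k = 4, 25 random starts
km_fit                                                        # prints a summary of the form shown below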
K-means clustering with 4 clusters of sizes 8, 13, 13, 16
Cluster means:
Murder Assault UrbanPop Rape
1 1.4118898 0.8743346 -0.8145211 0.01927104
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3 0.6950701 1.0394414 0.7226370 1.27693964
4 -0.4894375 -0.3826001 0.5758298 -0.26165379
Clustering vector:
Alabama Alaska Arizona Arkansas California
1 3 3 1 3
Colorado Connecticut Delaware Florida Georgia
3 4 4 3 1
Hawaii Idaho Illinois Indiana Iowa
4 2 3 4 2
Kansas Kentucky Louisiana Maine Maryland
4 2 1 2 3
Massachusetts Michigan Minnesota Mississippi Missouri
4 3 2 1 3
Montana Nebraska Nevada New Hampshire New Jersey
2 2 3 2 4
New Mexico New York North Carolina North Dakota Ohio
3 3 1 2 4
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
4 4 4 4 1
South Dakota Tennessee Texas Utah Vermont
2 1 3 4 2
Virginia Washington West Virginia Wisconsin Wyoming
4 4 2 2 4
Within cluster sum of squares by cluster:
[1] 8.316061 11.952463 19.922437 16.212213
(between_SS / total_SS = 71.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
7. Visualizing clusters with 4 and 5 clusters
We use two different data frames, data_selected and data_scaled, for clustering with 4 and 5 clusters. The data frame that keeps the Murder variable (data_scaled), despite its high correlation with Assault, yields a better cluster visualization than the reduced one. In such cases, PCA would be useful for creating new, less correlated variables without losing much of the information.
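One possible way to produce these visualizations is a sketch using fviz_cluster() from the factoextra package, which projects the clusters onto the first two principal components; the object names and the seed are assumptions carried over from the earlier sketches:

library(factoextra)

set.seed(123)                                          # assumed seed
km4 <- kmeans(data_scaled, centers = 4, nstart = 25)   # 4-cluster solution on the fully scaled data
km5 <- kmeans(data_scaled, centers = 5, nstart = 25)   # 5-cluster solution
# data_selected can be substituted for data_scaled to compare the reduced data frame

fviz_cluster(km4, data = data_scaled)                  # clusters plotted on the first two principal components
fviz_cluster(km5, data = data_scaled)

# PCA yields uncorrelated components that retain most of the variance
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)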