Self-Organizing Map and Selection of k

Produces a low-dimensional representation of the input feature space for subsequent estimation of the "optimal" number of clusters (k) in a multivariate dataset. The dimension reduction is based on the self-organizing map (SOM) technique of Kohonen (1982; 1990), as implemented in R by the function supersom of Wehrens and Kruisselbrink (2018). To estimate the optimal k, partitioning around medoids (PAM; Kaufman and Rousseeuw, 1990), coupled with the gap statistic of Tibshirani et al. (2001), is performed on the SOM's codebook vectors. This is achieved by internally calling pam and clusGap (Maechler et al., 2021). See Details for a brief theoretical background.

Usage

som_gap(
  var.rast,
  xdim = 12,
  ydim = 12,
  topo = "hexagonal",
  neighbourhood.fct = "gaussian",
  rlen = 600,
  dist.fcts = c("sumofsquares", "manhattan"),
  mode = "pbatch",
  K.max,
  stand = FALSE,
  B = 500,
  d.power = 2,
  spaceH0 = "original",
  method = "globalSEmax",
  SE.factor = 1,
  ...
)

Arguments

var.rast: SpatRaster, as in rast. This multi-layer SpatRaster must contain n continuous variables from which the SOM will be created.
xdim: Integer. Horizontal dimension of the SOM's grid. Default: 12
ydim: Integer. Vertical dimension of the SOM's grid. Default: 12
topo: Character. Topology of the SOM's grid. Options = "rectangular", "hexagonal". Default: "hexagonal"
neighbourhood.fct: Character. Neighbourhood function of the SOM's grid. Options = "bubble", "gaussian". Default: "gaussian"
rlen: Integer. Number of times the complete dataset will be presented to the SOM's network. Default: 600
dist.fcts: Character. Vector of length 2 containing the distance functions to use for SOM (first element, options = "sumofsquares", "euclidean", "manhattan") and for PAM (second element, options = "euclidean", "manhattan"). Default: c("sumofsquares", "manhattan")
mode: Character. Type of learning algorithm. Options = "online", "batch", "pbatch". Default: "pbatch"
K.max: Integer. Maximum number of clusters to consider, must be at least two (2).
stand: Boolean. For the PAM function, should the SOM's codebook vectors be standardized? Default: FALSE
B: Integer. Number of bootstrap samples for the gap statistic. Default: 500
d.power: Integer. Positive power applied to the Euclidean distances for the gap statistic. Default: 2
spaceH0: Character. Space of the reference distribution for the gap statistic. Options = "scaledPCA", "original" (see Details). Default: "original"
method: Character. Optimal k selection criterion for the gap statistic. Options = "globalmax", "firstmax", "Tibs2001SEmax", "firstSEmax", "globalSEmax". See clusGap for more details. Default: "globalSEmax"
SE.factor: Numeric. Factor to feed into the standard error rule for the gap statistic. Only applicable for methods based on standard error (SE). See clusGap for more details. Default: 1
...: Additional arguments passed to supersom.
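
The selection rules named by method can be illustrated with maxSE from the cluster package, which exposes the same five criteria documented for clusGap. The following is a minimal sketch with made-up gap values and standard errors (the vectors gap and se are hypothetical and are not output from som_gap); whether som_gap calls maxSE directly is an assumption.

```r
library(cluster)

# Hypothetical gap values and simulation standard errors for k = 1, ..., 6
gap <- c(0.10, 0.30, 0.28, 0.42, 0.45, 0.40)
se  <- c(0.02, 0.03, 0.03, 0.04, 0.04, 0.03)

# Apply each selection criterion to the same gap curve
methods <- c("globalmax", "firstmax", "Tibs2001SEmax",
             "firstSEmax", "globalSEmax")
k_by_method <- sapply(methods, function(m) {
  maxSE(gap, se, method = m, SE.factor = 1)
})
k_by_method

# A larger SE.factor widens the tolerance band around the maximum
k_wide <- maxSE(gap, se, method = "globalSEmax", SE.factor = 5)
k_wide
```

With these values, "globalmax" selects k = 5 (the largest gap), whereas "globalSEmax" selects k = 4, the smallest k whose gap lies within one standard error of the global maximum. The three remaining criteria stop at the first local maximum, k = 2. Raising SE.factor to 5 makes "globalSEmax" select k = 2 as well.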

Value

SOM: An object of class kohonen (see supersom). The components of class kohonen returned by this function are: (1) data = original input matrix, (2) unit.classif = winning units for all observations, (3) distances = distance between each observation and its corresponding winning unit, (4) grid = object of class somgrid (see somgrid), (5) codes = matrix of codebook vectors, (6) changes = matrix of mean average deviations from codebook vectors, (7) dist.fcts = selected distance function, and other arguments passed to supersom (e.g., radius, distance.weights, etc.). Note that components 1, 2, and 3 will only be returned if keep.data = TRUE, which is the default.

SOMdist: Object of class dist. Matrix of pairwise distances calculated from the SOM's codebook vectors.

SOMgap: Object of class clusGap. The main component of class clusGap returned by this function is Tab, which is a matrix of the gap statistic results (see clusGap). Additional components are the arguments passed to the function (i.e., spaceH0, B), the PAM function, n (number of observations) and call (the clusGap call-type object).

Kopt: Optimal k, as selected by arguments method and (possibly) SE.factor.

Details

The clustering of a SOM's codebook vectors has been proposed in several works, notably that of Vesanto and Alhoniemi (2000). These authors proposed a two-stage clustering routine as an efficient way to reduce computational load while obtaining satisfactory correspondence between the clustered codebook vectors and the clustered original feature space.
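
The two-stage idea can be sketched outside of som_gap with the kohonen and cluster packages. This is an illustrative sketch on synthetic data, not the function's actual internals: the objects x, grd, pam_fun, and k_opt are hypothetical names, and the grid size, rlen, and B are kept small only so that the example runs quickly.

```r
library(kohonen)
library(cluster)

set.seed(42)
# Hypothetical feature space: 600 observations, 2 variables, 3 groups
centers <- rbind(c(0, 0), c(4, 4), c(0, 5))
x <- centers[rep(1:3, each = 200), ] +
  matrix(rnorm(600 * 2, sd = 0.5), ncol = 2)
colnames(x) <- c("v1", "v2")

# Stage 1: compress the observations into 6 x 6 = 36 codebook vectors
grd <- somgrid(xdim = 6, ydim = 6, topo = "hexagonal",
               neighbourhood.fct = "gaussian")
som <- supersom(list(vars = x), grid = grd, rlen = 100, mode = "online")
codes <- som$codes[[1]]

# Stage 2: PAM and the gap statistic on the codebook vectors only
pam_fun <- function(x, k) {
  list(cluster = pam(x, k, metric = "manhattan", cluster.only = TRUE))
}
gap <- clusGap(codes, FUNcluster = pam_fun, K.max = 6, B = 50,
               d.power = 2, spaceH0 = "original", verbose = FALSE)

# Select k from the gap table with a standard-error rule
k_opt <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "globalSEmax", SE.factor = 1)
k_opt
```

Here PAM and the bootstrap of the gap statistic operate on 36 rows instead of 600, which is where the savings of the two-stage routine come from.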

The main purpose of this function is to enable the use of clustering and k-selection algorithms that may be prohibitively expensive for large datasets, such as matrices derived from the raster layers commonly used in geocomputational routines. The SOM's codebook vectors can be used to calculate distance matrices that, given the size of the original feature space, might otherwise be impossible to create because of insufficient memory. Similarly, robust clustering algorithms that require full pairwise distance matrices (e.g., hierarchical clustering, PAM) and/or eigenvalues (e.g., spectral clustering) can be run on the SOM's codebook vectors.
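
A back-of-the-envelope calculation shows the scale of the memory argument. The sketch below assumes a hypothetical raster with one million non-NA cells and counts only the n(n - 1)/2 double-precision values (8 bytes each) that a pairwise distance object must store; dist_bytes is an illustrative helper, not part of any package.

```r
# Bytes needed for the n * (n - 1) / 2 pairwise distances among n rows,
# stored as doubles (8 bytes each), ignoring object overhead
dist_bytes <- function(n) n * (n - 1) / 2 * 8

n_cells <- 1e6       # hypothetical raster with one million non-NA cells
n_codes <- 12 * 12   # codebook vectors of the default 12 x 12 SOM grid

dist_bytes(n_cells) / 1024^3   # size in GiB for the full feature space
dist_bytes(n_codes) / 1024     # size in KiB for the codebook vectors
```

Under these assumptions, the full feature space would need roughly 4 terabytes, while the distances among the 144 codebook vectors fit in about 80 KiB.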

Note that supersom internally equalizes the importance (i.e., weights) of variables so that differences in scale do not affect distance calculations. This behavior can be disabled by setting normalizeDataLayers = FALSE in the additional arguments passed to supersom. Custom weights can also be supplied through the additional argument user.weights, in which case they are applied on top of the internal weights.

When working with large matrices, the additional SOM argument keep.data may be set to FALSE. Note, however, that doing so prevents the use of som_pam, the suggested follow-up function for raster products, because it requires both the original data and the winning units.
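
The effect of keep.data can be checked directly on a kohonen object. The following is a small sketch that calls supersom on random data rather than som_gap itself; the objects x, grd, som_full, and som_lean are hypothetical and serve only to show which components are dropped.

```r
library(kohonen)

set.seed(7)
# Hypothetical input: 300 observations of 3 continuous variables
x <- matrix(rnorm(300 * 3), ncol = 3,
            dimnames = list(NULL, c("v1", "v2", "v3")))
grd <- somgrid(xdim = 4, ydim = 4, topo = "hexagonal")

som_full <- supersom(list(vars = x), grid = grd, rlen = 20, keep.data = TRUE)
som_lean <- supersom(list(vars = x), grid = grd, rlen = 20, keep.data = FALSE)

# Winning units are available only when the data are kept
length(som_full$unit.classif)
is.null(som_lean$unit.classif)

# Codebook vectors are returned either way
dim(som_lean$codes[[1]])
```

The lean object still supports clustering of the codebook vectors, but it cannot map clusters back to individual observations, which is why a follow-up step that needs the winning units cannot use it.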

For the gap statistic, spaceH0 = "scaledPCA" has resulted in errors in R sessions whose BLAS/LAPACK is provided by the Intel Math Kernel Library (MKL).

References

L. Kaufman and P. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 1990. doi:10.1002/9780470316801

T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, 1982. doi:10.1007/bf00337288

T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990. doi:10.1109/5.58325

M. Maechler, P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik. cluster: Cluster Analysis Basics and Extensions, 2021. https://CRAN.R-project.org/package=cluster

R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001. doi:10.1111/1467-9868.00293

J. Vesanto and E. Alhoniemi. Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3):586–600, 2000. doi:10.1109/72.846731

R. Wehrens and J. Kruisselbrink. Flexible self-organizing maps in kohonen 3.0. Journal of Statistical Software, 87(1):1–18, 2018. doi:10.18637/jss.v087.i07

Examples

require(terra)
# Multi-layer SpatRaster with topographic variables
p <- system.file("exdat", package = "rassta")
tf <- list.files(path = p, pattern = "^height|^slope|^wetness",
                 full.names = TRUE
                )
t <- rast(tf)
# Scale topographic variables (mean = 0, StDev = 1)
ts <- scale(t)
# Self-organizing map and gap statistic for optimum k
set.seed(963)
tsom <- som_gap(var.rast = ts, xdim = 8, ydim = 8, rlen = 150,
               mode = "online", K.max = 6, B = 300, spaceH0 = "original",
               method = "globalSEmax"
              )
#> Clustering k = 1,2,..., K.max (= 6): .. done
#> Bootstrapping, b = 1,2,..., B (= 300)  [one "." per sample]:
#> .................................................. 50 
#> .................................................. 100 
#> .................................................. 150 
#> .................................................. 200 
#> .................................................. 250 
#> .................................................. 300 
# Optimum k
tsom$Kopt
#> [1] 5