Produces a low-dimensional representation of the input feature space for
subsequent estimation of the "optimal" number of clusters (k) in a
multivariate dataset. The dimension reduction is based on the
self-organizing map technique (SOM) of Kohonen (1982; 1990), and implemented
in R by the function supersom
of Wehrens and
Kruisselbrink (2018). To estimate the optimal k, the partitioning
around medoids (PAM) of Kaufman and Rousseeuw (1990), coupled with the gap
statistic of Tibshirani et al. (2001), is performed on the SOM's codebook
vectors. This is achieved by internally calling pam
and clusGap
(Maechler et al., 2021). See
Details for a brief theoretical background.
Usage
som_gap(
var.rast,
xdim = 12,
ydim = 12,
topo = "hexagonal",
neighbourhood.fct = "gaussian",
rlen = 600,
dist.fcts = c("sumofsquares", "manhattan"),
mode = "pbatch",
K.max,
stand = FALSE,
B = 500,
d.power = 2,
spaceH0 = "original",
method = "globalSEmax",
SE.factor = 1,
...
)
Arguments
- var.rast
SpatRaster, as in
rast
. This Multi-layer SpatRaster must contain n continuous variables from which the SOM will be created.- xdim
Integer. Horizontal dimension of the SOM's grid. Default: 12
- ydim
Integer. Vertical dimension of the SOM's grid. Default: 12
- topo
Character. Topology of the SOM's grid. Options = "rectangular", "hexagonal". Default: "hexagonal"
- neighbourhood.fct
Character. Neighborhood of the SOM's grid. Options = "bubble", "gaussian". Default: "gaussian"
- rlen
Integer. Number of times the complete dataset will be presented to the SOM's network. Default: 600
- dist.fcts
Character. Vector of length 2 containing the distance functions to use for SOM (First element, options = "sumofsquares", "euclidean", "manhattan") and for PAM (second element, options = "euclidean", "manhattan"). Default: c("sumofsquares", "manhattan")
- mode
Character. Type of learning algorithm. Options are “online", "batch", and "pbatch". Default: "pbatch"
- K.max
Integer. Maximum number of clusters to consider, must be at least two (2).
- stand
Boolean. For PAM function, does SOM's codebook vectors need to be standardized? Default: FALSE
- B
Integer. Number of bootstrap samples for the gap statistic. Default: 500
- d.power
Integer. Positive Power applied to euclidean distances for the gap statistic. Default: 2
- spaceH0
Character. Space of the reference distribution for the gap statistic. Options = "scaledPCA", "original" (See Details). Default: "original"
- method
Character. Optimal k selection criterion for the gap statistic. Options = "globalmax", "firstmax", "Tibs2001SEmax", "firstSEmax", "globalSEmax". See
clusGap
for more details. Default: "globalSEmax"- SE.factor
Numeric. Factor to feed into the standard error rule for the gap statistic. Only applicable for methods based on standard error (SE). See
clusGap
for more details. Default: 1- ...
Additional arguments as for
supersom
.
Value
SOM: An object of class kohonen (see
supersom
). The components of class kohonen returned by
this function are: (1) data = original input matrix, (2)
unit.classif = winning units for all observations, (3)
distances = distance between each observation and its corresponding
winning unit, (4) grid = object of class somgrid (see
somgrid
), (5) codes = matrix of codebook
vectors, (6) changes = matrix of mean average deviations from codebook
vectors, (7) dist.fcts = selected distance function, and other
arguments passed to supersom
(e.g., radius,
distance.weights, etc.). Note that components 1, 2, and 3 will only be
returned if keep.data = TRUE, which is the default.
SOMdist: Object of class dist. Matrix of pairwise distances calculated from the SOM's codebook vectors.
SOMgap: Object of class clusGap. The main component of
class clusGap returned by this function is Tab, which is a matrix of
the gap statistic results (see clusGap
). Additional
components are the arguments passed to the function (i.e., spaceH0,
B), the PAM function, n (number of observations) and
call (the clusGap call-type object).
Kopt: Optimal k, as selected by arguments method and (possibly) SE.factor.
Details
The clustering of SOM's codebook vectors has been proposed in several works, notably in that from Vesanto and Alhoniemi (2000). These authors proposed a two-stage clustering routine as an efficient method to reduce computational load, while obtaining satisfactory correspondence between the clustered codebook vectors and the clustered original feature space.
The main purpose of this function is to allow the use of clustering and k-selection algorithms that may result prohibitive for large datasets, such as matrices derived from raster layers commonly used during geocomputational routines. Thus, the SOM's codebook vectors can be subsequently used for the calculation of distance matrices, which given the large size of their input feature space, may otherwise be impossible to create due to insufficient memory allocation capacity. Similarly, robust clustering algorithms that require full pairwise distance matrices (e.g., hierarchical clustering, PAM) and/or eigenvalues (e.g., spectral clustering) may also be performed on SOM's codebook vectors.
Note that supersom
will internally equalize the
importance (i.e., weights) of variables such that differences in scale will
not affect distance calculations. This behavior can be prevented by setting
normalizeDataLayers = FALSE in additional arguments passed to
supersom
. Moreover, custom weights can also be passed
through the additional argument user.weights. In such case, user
weights are applied on top of the internal weights.
When working with large matrices, the additional SOM argument
keep.data may be set to FALSE. However, note that by doing so, the
suggested follow-up function for raster products som_pam
will
not work since it requires both original data and winning units.
For the gap statistic, method = "scaledPCA" has resulted in errors for R sessions with BLAS/LAPACK supported by the Intel Math Kernel Library (MKL).
References
L. Kaufman and P. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 1990. doi:10.1002/9780470316801
T. Kohonen. Self-organized formation of topologically correct feature maps. Biological cybernetics, 43 (1):59–69, 1982. doi:10.1007/bf00337288
T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990. doi:10.1016/s0925-2312(98)00030-7
M. Maechler, P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik. cluster: Cluster Analysis Basics and Extensions, 2021. https://CRAN.R-project.org/package=cluster
R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001. doi:10.1111/1467-9868.00293
J. Vesanto and E. Alhoniemi. Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3):586–600, 2000. doi:10.1109/72.846731
R. Wehrens and J. Kruisselbrink. Flexible self-organizing maps in kohonen 3.0. Journal of Statistical Software, 87(1):1–18, 2018. doi:10.18637/jss.v087.i07
Examples
require(terra)
# Multi-layer SpatRaster with topographic variables
p <- system.file("exdat", package = "rassta")
tf <- list.files(path = p, pattern = "^height|^slope|^wetness",
full.names = TRUE
)
t <- rast(tf)
# Scale topographic variables (mean = 0, StDev = 1)
ts <- scale(t)
# Self-organizing map and gap statistic for optimum k
set.seed(963)
tsom <- som_gap(var.rast = ts, xdim = 8, ydim = 8, rlen = 150,
mode = "online", K.max = 6, B = 300, spaceH0 = "original",
method = "globalSEmax"
)
#> Clustering k = 1,2,..., K.max (= 6): .. done
#> Bootstrapping, b = 1,2,..., B (= 300) [one "." per sample]:
#> .................................................. 50
#> .................................................. 100
#> .................................................. 150
#> .................................................. 200
#> .................................................. 250
#> .................................................. 300
# Optimum k
tsom$Kopt
#> [1] 5