Select Constrained Univariate Distribution Functions

Assists in selecting distribution functions for the continuous raster layers that were used to create a raster layer of classification units. The distribution functions currently supported are the probability density function (PDF), the empirical cumulative distribution function (ECDF), and the inverse of the empirical cumulative distribution function (iECDF). Note that select_functions does not calculate these distribution functions. Its sole purpose is to assist in the knowledge-driven selection of the most appropriate distribution function for each continuous variable used to create a given classification unit (see Details).

Usage

select_functions(
  cu.rast,
  var.rast,
  fun = mean,
  varscale = "uniminmax",
  mode = "auto",
  verbose = FALSE,
  ...
)

Arguments

cu.rast: SpatRaster, as in rast. Single-layer SpatRaster representing the classification units occurring across geographic space. The cell values (i.e., numeric IDs) for classification units must be integer values.
var.rast: SpatRaster. Multi-layer SpatRaster containing the n continuous raster layers of the variables used to create the classification units.
fun: Function. Descriptive statistical measurement (e.g., mean, max). See zonal. Default: mean
varscale: Character. Variable scaling method. See scale argument in ggparcoord. Default: "uniminmax"
mode: Character. String specifying the selection mode for univariate distribution functions. Possible values are "inter" for interactive selection, and "auto" for automatic selection (see Details). Default: "auto"
verbose: Boolean. Show warning messages in the console? Default: FALSE
...: Additional arguments as for ggparcoord.

Value

If mode = "inter":

distfun: A DT table (DataTables library) with the following attributes: (1) Class.Unit = numeric ID for classification units, (2) Variable = each of the n continuous raster layers of a classification unit, and (3) Dist.Func = empty column whose cells can be filled with one of the following strings: "PDF", "ECDF", or "iECDF" (entered without quotes). This table can be saved to disk through the Shiny interface.

parcoord: A plotly-based parallel coordinate plot which can be saved on disk using the R package htmlwidgets.

If mode = "auto":

distfun: Same as distfun when mode = "inter", except for column "Dist.Func" whose cells were automatically filled.

parcoord: Same as parcoord when mode = "inter".

Details

The selection of distribution functions is univariate, that is, it is made separately for each variable, and it is constrained, meaning that it has to be made for each classification unit. Overall, the distribution functions are used to characterize typical values of a given continuous variable within a given classification unit. When the PDF is selected, values at or near the peak of the PDF are considered the most typical. Conversely, values at the tails of the PDF are considered the least typical. When the ECDF is selected, values toward positive infinity are considered the most typical; when the iECDF is selected, values toward negative infinity are considered the most typical.
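The following minimal sketch illustrates how the three functions rank values differently. It uses only base R and synthetic data, and it is not the package's implementation. The object names (x, pdf_fun, ecdf_fun, iecdf_fun) are hypothetical, and treating the iECDF as the complement of the ECDF (1 - ECDF) is an assumption made for illustration.

```r
# Synthetic values of one variable within one classification unit
# (hypothetical data, for illustration only)
set.seed(42)
x <- rnorm(500, mean = 100, sd = 15)

# PDF: kernel density estimate, rescaled so that its peak equals 1
d <- density(x)
pdf_fun <- approxfun(d$x, d$y / max(d$y), yleft = 0, yright = 0)

# ECDF: proportion of observations less than or equal to a value
ecdf_fun <- ecdf(x)

# iECDF (assumed here to be 1 - ECDF): high for low values
iecdf_fun <- function(v) 1 - ecdf_fun(v)

# Scores for a low, a central, and a high value
probe <- c(60, 100, 140)
scores <- data.frame(
  value = probe,
  PDF   = round(pdf_fun(probe), 2),
  ECDF  = round(ecdf_fun(probe), 2),
  iECDF = round(iecdf_fun(probe), 2)
)
print(scores)
```

In this sketch the central value scores highest under the PDF, the high value under the ECDF, and the low value under the iECDF.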

To assist the selection process, when mode = "inter", this function displays an interactive parallel coordinates plot (see ggplotly) and a writable table (built in Shiny). For each variable, the parallel coordinates plot shows the trend of a descriptive statistical measurement (argument fun) across all of the classification units. Using this trend, one can select the most appropriate distribution function for each variable based on the presence or absence of "peaks" and "pits" in the observed trend. For instance, a peak (highest point in the trend) indicates that the given classification unit contains, on average, the highest values of that variable. Conversely, a pit (lowest point in the trend) indicates that the given classification unit contains, on average, the lowest values of that variable. Thus, an ECDF and an iECDF can be selected for the peak and the pit, respectively. The PDF can be selected for classification units whose trend shows neither a peak nor a pit. Note that peaks and pits are only reference points; the selection of distribution functions should be validated with domain knowledge.
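As a rough illustration of the trend shown in the plot, the sketch below computes a per-unit mean for two variables and rescales each variable to the range 0 to 1, which is what the "uniminmax" scaling does. It uses base R on a synthetic data frame instead of rasters. The names cells, trend, scale01, and trend_scaled are hypothetical, and the function itself may compute the trend differently.

```r
# Hypothetical cell values: classification unit IDs and two variables
set.seed(1)
cells <- data.frame(
  unit   = rep(1:4, each = 50),
  height = rnorm(200, mean = rep(c(10, 40, 25, 30), each = 50), sd = 2),
  slope  = rnorm(200, mean = rep(c(5, 8, 20, 12), each = 50), sd = 1)
)

# Descriptive statistic (here, the mean) of each variable per unit
trend <- aggregate(cbind(height, slope) ~ unit, data = cells, FUN = mean)

# Univariate min-max scaling: minimum becomes 0, maximum becomes 1
scale01 <- function(v) (v - min(v)) / (max(v) - min(v))
trend_scaled <- data.frame(unit = trend$unit, lapply(trend[-1], scale01))

# A scaled value of 1 marks a peak and a value of 0 marks a pit
print(trend_scaled)
```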

When mode = "auto", the distribution functions are selected automatically, based on the peaks and pits in the parallel coordinates plot.
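One simple way to picture a peak-and-pit rule is sketched below. It assumes that only the single highest and single lowest value of each variable count as the peak and the pit; the actual criteria applied by the function may differ. The helper select_by_extremes and the values in trend are hypothetical.

```r
# Hypothetical per-unit means (rows = classification units)
trend <- data.frame(
  Class.Unit = 1:5,
  height  = c(120, 340, 210, 95, 180),
  slope   = c(4, 18, 9, 2, 7),
  wetness = c(11, 5, 8, 14, 9)
)

# Hypothetical helper: ECDF at the peak, iECDF at the pit, PDF elsewhere
select_by_extremes <- function(trend, id = "Class.Unit") {
  vars <- setdiff(names(trend), id)
  out <- lapply(vars, function(v) {
    x <- trend[[v]]
    fn <- rep("PDF", length(x))
    fn[which.max(x)] <- "ECDF"
    fn[which.min(x)] <- "iECDF"
    data.frame(Class.Unit = trend[[id]], Variable = v, Dist.Func = fn)
  })
  do.call(rbind, out)
}

# Long table with one row per unit and variable
distfun <- select_by_extremes(trend)
print(distfun)
```

The result mirrors the three columns described under Value (Class.Unit, Variable, Dist.Func), which makes it easy to compare against a manual selection.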

The output table (distfun) is intended to be used as input in the predict_functions function.

The selection of distribution functions is similar to the selection of membership functions in fuzzy logic. For example, if one wants to describe a phenomenon through distribution functions of continuous variables, then the functions can be considered to be membership curves. Accordingly, the PDF, ECDF, and iECDF will be equivalent to the Gaussian, S, and Z membership functions, respectively.
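For readers less familiar with fuzzy logic, the sketch below writes out common textbook forms of the three membership curves in base R. The function names (gauss_mf, s_mf, z_mf) and parameter values are hypothetical and chosen only to show the shapes; they are not part of this package.

```r
# Gaussian membership: 1 at the center, decreasing toward both tails
gauss_mf <- function(x, center, spread) {
  exp(-((x - center)^2) / (2 * spread^2))
}

# S-shaped membership: 0 below a, 1 above b, smooth rise in between
s_mf <- function(x, a, b) {
  mid <- (a + b) / 2
  ifelse(x <= a, 0,
    ifelse(x <= mid, 2 * ((x - a) / (b - a))^2,
      ifelse(x <= b, 1 - 2 * ((x - b) / (b - a))^2, 1)))
}

# Z-shaped membership: mirror image of the S-shaped curve
z_mf <- function(x, a, b) 1 - s_mf(x, a, b)

x <- c(0, 25, 50, 75, 100)
membership <- data.frame(
  x        = x,
  gaussian = round(gauss_mf(x, center = 50, spread = 15), 3),
  s_shape  = round(s_mf(x, a = 20, b = 80), 3),
  z_shape  = round(z_mf(x, a = 20, b = 80), 3)
)
print(membership)
```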

Examples

require(terra)
p <- system.file("exdat", package = "rassta")
# Multi-layer SpatRaster of topographic variables
## 3 topographic variables
tf <- list.files(path = p, pattern = "^height|^slope|^wetness",
                 full.names = TRUE
                )
tvars <- terra::rast(tf)
# Single-layer SpatRaster of topographic classification units
## 5 classification units
tcf <- list.files(path = p, pattern = "topography.tif", full.names = TRUE)
tcu <- terra::rast(tcf)
# Automatic selection of distribution functions
tdif <- select_functions(cu.rast = tcu, var.rast = tvars, fun = mean)
# Parallel coordinates plot
if(interactive()){tdif$parcoord}