Sample orthoimages for field-collected data

Sample images at points where we have field-collected data, creating a data table for modeling.

Usage

sample(
  site,
  pattern = "{*}",
  n = NULL,
  p = NULL,
  d = NULL,
  classes = NULL,
  balance = TRUE,
  balance_excl = c(7, 33),
  result = NULL,
  transects = NULL,
  drop_corr = NULL,
  reuse = FALSE,
  resources = NULL,
  local = FALSE,
  trap = TRUE,
  comment = NULL
)

Arguments

site: One or more site names, using the 3-letter abbreviation. Use all to process all sites. In batch mode, each named site is run in a separate job.
pattern: File names, portable names, regex matching either, or search names selecting files to sample. See Image naming in README for details. The default is {*}, which will include all variables.
n: Number of total samples to return.
p: Proportion of total samples to return. Use p = 1 to sample all.
d: Mean distance in cells between samples. No minimum spacing is guaranteed.
classes: Class or vector of classes in transects to sample. Default is all classes.
balance: If TRUE, balance number of samples for each class. Points will be randomly selected to match the sparsest class.
balance_excl: Vector of classes to exclude when determining sample size when balancing. List classes here that have few samples and that we don't care much about, so they don't drive the balanced sample size down.
result: Name of result file. If not specified, file will be constructed from site, number of X vars, and strategy.
transects: Name of transects file; default is transects.
drop_corr: Drop one of any pair of variables with correlation more than drop_corr.
reuse: Reuse the named file (ending in _all.txt) from previous run, rather than resampling. Saves a whole lot of time if you're changing n, p, d, balance, balance_excl, or drop_corr.
resources: Slurm launch resources. See launch. These take priority over the function's defaults.
local: If TRUE, run locally; otherwise, spawn a batch run on Unity.
trap: If TRUE, trap errors in local mode; if FALSE, use normal R error handling. Use this for debugging. If you get unrecovered errors, the job won't be added to the jobs database. Has no effect if local = FALSE.
comment: Optional slurmcollie comment
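
To make the interaction of balance and balance_excl concrete, here is a minimal base-R sketch. It is not the package's code: balance_classes is a hypothetical helper, and it assumes that excluded classes are skipped only when finding the target size and keep all of their points if they fall below it.

```r
# Hypothetical helper illustrating balance and balance_excl (not package code).
# Assumption: classes in balance_excl are ignored when finding the target size,
# and keep all of their points if they have fewer than the target.
balance_classes <- function(class, balance_excl = c(7, 33)) {
  counts <- table(class)
  counted <- counts[!names(counts) %in% as.character(balance_excl)]
  target <- min(counted)                      # size of the sparsest counted class
  keep <- lapply(split(seq_along(class), class), function(idx) {
    # sample.int is used because sample() is masked by the function documented here
    if (length(idx) > target) idx[sample.int(length(idx), target)] else idx
  })
  sort(unlist(keep, use.names = FALSE))       # row indices to retain
}

set.seed(1)
class <- rep(c(1, 2, 7), times = c(50, 20, 3))  # made-up class counts
kept <- balance_classes(class)
table(class[kept])                              # classes 1 and 2: 20 each; class 7: 3
```

Without the exclusion, class 7 would set the target to 3 and most of the data would be thrown away.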

Details

There are three mutually exclusive sampling strategies (n, p, and d). You must choose exactly one. n samples the total number of points provided. p samples the proportion of total points (after balancing, if balance is selected). d samples points with a mean spacing of d cells; no minimum distance between samples is guaranteed.
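
The "exactly one strategy" rule can be sketched as a small check. check_strategy is a hypothetical helper written for illustration; the package may validate its arguments differently.

```r
# Hypothetical helper: verify that exactly one of n, p, d was supplied.
check_strategy <- function(n = NULL, p = NULL, d = NULL) {
  supplied <- c(n = !is.null(n), p = !is.null(p), d = !is.null(d))
  if (sum(supplied) != 1)
    stop("Choose exactly one of n, p, or d")
  names(supplied)[supplied]   # name of the chosen strategy
}

check_strategy(n = 500)   # "n"
check_strategy(p = 1)     # "p"
```

Calling it with none or with two of the arguments, e.g. check_strategy(n = 500, d = 10), raises an error.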

Portable names are used for variable names in the resulting data files. Dashes from modifications are changed to underscore to avoid causing trouble.
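
As a rough illustration of the renaming, the sketch below replaces dashes with underscores in a few made-up portable names. The names are hypothetical, and the package's actual renaming code may differ. One likely reason for the change is that a dash in a column name reads as subtraction in an R formula.

```r
# Hypothetical portable names; real names come from the flights directory.
portable <- c("ortho_mica_fall-NDVI", "dem_lidar", "ortho_swir_summer-mean-w5")

# Replace every dash with an underscore (fixed = TRUE: literal match, no regex).
var_names <- gsub("-", "_", portable, fixed = TRUE)
var_names
```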

Results are saved in four files, plus a metadata file:

_all.txt - A text version of the full dataset (selected by pattern but not subsetted by n, p, d, balance, or drop_corr). Readable by any software.
_all.RDS - An RDS version of the full dataset; far faster to read than a text file in R (1.1 s vs. 14.4 s in one example).
.txt - A text version of the final selected and subsetted dataset.
.RDS - An RDS version of the final dataset.
_vars.txt - Lists the portable names used for variables in the sample alongside the file names on disk. This disambiguates when there are duplicate portable names in a flights directory.
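
The sketch below shows the text/RDS pairing on a small stand-in data frame, since reading real results requires a completed run. The column names are made up, and the tab-delimited layout with a header row is an assumption; check an actual _all.txt before relying on it.

```r
# Stand-in data; a real run would read the files written by sample().
dat <- data.frame(subclass = rep(1:3, each = 4),
                  x1 = round(seq(0, 1, length.out = 12), 3))

txt_file <- tempfile(fileext = ".txt")
rds_file <- tempfile(fileext = ".RDS")

# Assumed layout for the text version: tab-delimited with a header row.
write.table(dat, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)
saveRDS(dat, rds_file)

from_txt <- read.table(txt_file, header = TRUE, sep = "\t")  # readable by any software
from_rds <- readRDS(rds_file)                                # faster to read in R
all.equal(from_txt, from_rds)
```

Both reads return the same data frame, so in R the .RDS version is the one to prefer; wrapping each read in system.time() on a real dataset will show the difference.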

Memory requirements: I've measured up to 28.5 GB.