Clustering and Unconstrained Ordination with Dirichlet Process Mixture Models
Christian Stratton Ph.D. Defense in Statistics (Dept. of Mathematical Sciences, MSU)
03/25/2022
Abstract: Assessment of similarity in species composition or abundance across sampled
locations is a common goal in ecological monitoring programs. Existing ordination
techniques provide a framework for clustering sample locations based on species composition
by projecting high-dimensional community data into a low-dimensional, latent ecological
gradient representing species composition. However, these techniques require specification
of the number of distinct ecological communities present in the latent space, which
can be difficult to determine prior to analysis. Additionally, many existing techniques
rely on algorithmic projection and clustering methods that do not appropriately account
for uncertainty in the ordination. We develop a hierarchical ordination model capable
of simultaneous clustering and ordination that allows for estimation of the number
of clusters present in the latent ecological gradient. This model draws latent coordinates
for each sample location from a Dirichlet process mixture model, providing researchers
with probabilistic statements about the number of clusters present in the latent ecological
gradient. Additionally, the model is extended to accommodate hierarchical sampling
designs, providing ordination results that are aligned with primary sampling units.
This model is applied to an empirical data set describing presence-absence records
of plant species in Craters of the Moon National Monument and Preserve (CRMO) in Idaho,
USA. Application of the model to the CRMO data provided evidence of four ecological
regions in the latent space, corresponding to various features of the ecological gradient
in CRMO, including elevation and proximity to volcanic features. Development of the
Dirichlet process ordination model provides ecologists and wildlife managers with
data-driven inferences about the number of distinct ecological communities present
across monitored locations. This information can be leveraged to develop more cost-effective
monitoring strategies, supporting reliable decision-making for wildlife and conservation
management.
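To illustrate how a Dirichlet process prior lets the number of clusters be inferred rather than fixed in advance, the following minimal sketch draws cluster labels from a Chinese restaurant process and then places each sample location in a two-dimensional latent space around its cluster center. It simulates from a prior only and is not the hierarchical ordination model described above. The function names, the Gaussian cluster centers, and all parameter values (alpha, cluster_sd, within_sd) are assumptions made for illustration.

```python
import numpy as np


def sample_crp_assignments(n_sites, alpha, rng):
    """Draw cluster labels for n_sites from a Chinese restaurant process.

    alpha is the concentration parameter (hypothetical value chosen by the
    caller); larger alpha tends to produce more clusters.
    """
    labels = np.zeros(n_sites, dtype=int)
    counts = [1]  # the first site opens the first cluster
    for i in range(1, n_sites):
        # An existing cluster k is chosen with weight counts[k];
        # a new cluster is opened with weight alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[i] = k
    return labels


def sample_latent_coordinates(labels, dim, cluster_sd, within_sd, rng):
    """Place each site near its cluster center in a dim-dimensional space.

    Gaussian centers and Gaussian within-cluster scatter are illustrative
    assumptions, not taken from the abstract.
    """
    n_clusters = labels.max() + 1
    centers = rng.normal(0.0, cluster_sd, size=(n_clusters, dim))
    noise = rng.normal(0.0, within_sd, size=(len(labels), dim))
    return centers[labels] + noise


rng = np.random.default_rng(0)
labels = sample_crp_assignments(n_sites=50, alpha=1.0, rng=rng)
coords = sample_latent_coordinates(labels, dim=2, cluster_sd=3.0,
                                   within_sd=0.5, rng=rng)
print("clusters drawn:", labels.max() + 1)
print("latent coordinate array shape:", coords.shape)
```

Repeating the draw with different seeds gives different numbers of clusters, which is the sense in which the cluster count is treated as random and can be summarized probabilistically.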
In this project, we propose a robust estimator of a parameter or a summary quantity
of the model parameters when the outcome is subject to nonignorable missingness.
We completely avoid modeling the regression relation, while allowing the propensity
to be modeled by a semiparametric logistic relation where the dependence on covariates
is unspecified. We discover a surprising phenomenon in that the estimation of the
parameter in the propensity model as well as the functional estimation can be carried
out without assessing the missingness dependence on covariates. This allows us to
propose a general class of estimators for both model parameter estimation and estimation
of summary quantities of the model parameters, including the outcome mean. These estimators
are robust to misspecification of the dependence on covariates. The robustness of
the estimators is nonstandard; it is established rigorously through theoretical
derivations and supported by simulations and a data application.
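As a point of reference, one common way to write a semiparametric logistic propensity of the kind described is shown below. The notation is an assumption for illustration and is not taken from the abstract: R is the indicator that the outcome Y is observed, X is the covariate vector, h is an unspecified function of the covariates, and gamma is the finite-dimensional propensity parameter.

```latex
\Pr(R = 1 \mid Y = y, X = x)
  = \frac{\exp\{h(x) + \gamma y\}}{1 + \exp\{h(x) + \gamma y\}}
```

Under this form, the missingness is nonignorable whenever gamma is nonzero, because the probability of observing the outcome then depends on the outcome itself.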