Clustering and Unconstrained Ordination with Dirichlet Process Mixture Models
Christian Stratton Ph.D. Defense in Statistics (Dept. of Mathematical Sciences, MSU)
03/25/2022
Abstract: Assessment of similarity in species composition or abundance across sampled
locations is a common goal in ecological monitoring programs. Existing ordination
techniques provide a framework for clustering sample locations based on species composition
by projecting high-dimensional community data onto a low-dimensional, latent ecological
gradient representing species composition. However, these techniques require specification
of the number of distinct ecological communities present in the latent space, which
can be difficult to determine prior to analysis. Additionally, many existing techniques
rely on algorithmic projection and clustering methods that do not appropriately account
for uncertainty in the ordination. We develop a hierarchical ordination model capable
of simultaneous clustering and ordination that allows for estimation of the number
of clusters present in the latent ecological gradient. This model draws latent coordinates
for each sample location from a Dirichlet process mixture model, providing researchers
with probabilistic statements about the number of clusters present in the latent ecological
gradient. Additionally, the model is extended to accommodate hierarchical sampling
designs, providing ordination results that are aligned with primary sampling units.
This model is applied to an empirical data set describing presence-absence records
of plant species in Craters of the Moon National Monument and Preserve (CRMO) in Idaho,
USA. Application of the model to the CRMO data provided evidence of four ecological
regions in the latent space, corresponding to various features of the ecological gradient
in CRMO, including elevation and proximity to volcanic features. Development of the
Dirichlet process ordination model provides ecologists and wildlife managers with
data-driven inferences about the number of distinct ecological communities present
across monitored locations. This information can be leveraged to develop more cost-effective
monitoring strategies, supporting reliable decision-making for wildlife and conservation
management.
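As a rough illustration of how a Dirichlet process mixture can leave the number of clusters unspecified, the sketch below simulates latent coordinates from a Dirichlet process mixture prior using the Chinese restaurant process representation. It is not the model presented in the defense: the Gaussian cluster shape, the function name `sample_crp_latent_coordinates`, and the parameters `alpha`, `cluster_sd`, and `base_sd` are all assumptions made for illustration, and no presence-absence data or posterior inference is involved.

```python
import numpy as np


def sample_crp_latent_coordinates(n_sites, alpha, dim=2,
                                  cluster_sd=0.3, base_sd=3.0, seed=0):
    """Simulate latent coordinates from a DP mixture of Gaussians (prior only).

    Hypothetical helper: cluster labels follow a Chinese restaurant process
    with concentration `alpha`; cluster centers come from a N(0, base_sd^2)
    base measure; each site scatters around its center with sd `cluster_sd`.
    """
    rng = np.random.default_rng(seed)
    labels = np.empty(n_sites, dtype=int)
    centers = []  # one center per occupied cluster
    counts = []   # number of sites currently assigned to each cluster

    for i in range(n_sites):
        # Join existing cluster k with probability proportional to counts[k],
        # or open a new cluster with probability proportional to alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            centers.append(rng.normal(0.0, base_sd, size=dim))
            counts.append(0)
        counts[k] += 1
        labels[i] = k

    centers = np.array(centers)
    noise = rng.normal(0.0, cluster_sd, size=(n_sites, dim))
    coords = centers[labels] + noise
    return coords, labels


coords, labels = sample_crp_latent_coordinates(n_sites=50, alpha=1.0)
n_clusters = len(np.unique(labels))  # random, not fixed in advance
```

Under this prior the number of occupied clusters is itself random, and larger values of `alpha` tend to produce more clusters. In a fitted model, the posterior distribution of that count is what would support probabilistic statements about the number of clusters.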
In this project, we propose a robust estimator of a parameter, or of a summary quantity
of the model parameters, in settings where the outcome is subject to nonignorable missingness.
We completely avoid modeling the regression relation, while allowing the propensity
to be modeled by a semiparametric logistic relation in which the dependence on covariates
is unspecified. We discover a surprising phenomenon: estimation of the
parameter in the propensity model, as well as the functional estimation, can be carried
out without assessing how the missingness depends on covariates. This allows us to
propose a general class of estimators for both model parameter estimation and estimation
of summary quantities of the model parameters, including the outcome mean. These estimators
are robust to misspecification of the dependence on covariates. The robustness of
the estimators is nonstandard; it is established rigorously through theoretical
derivations and supported by simulations and a data application.
