Sifting Through The Noise: Frameworks for Identifying Important Features using Latent Variables or Sparsity
Talk by Paul Wilson-Harmon (PhD Proposal in Statistics)
5/15/2020 1:00pm Online
Abstract:
In statistical learning, distinguishing signal features from noise features is often of great importance. In multivariate settings, results can be greatly improved by removing noise features before a given algorithm is applied. This talk comprises three distinct research projects that share this common theme but span different statistical methodologies and settings.
In the first setting, I consider the Carnegie Classification of Institutions of Higher Education (CCIHE), which quantifies doctoral research productivity at universities by using Principal Component Analysis (PCA) to estimate latent indices of research performance and then assigns schools to three groups. However, PCA does not provide an intuitive framework for latent-variable modeling, nor does it allow for data-driven selection of the groups. Instead, I present an alternative method that uses Structural Equation Modeling (SEM) to reframe the question in an explicit latent-variable context. The SEM estimates a latent factor-of-factors score for each institution that can be input into a clustering algorithm. Then, using Model-Based Clustering, I obtain data-driven clusters that better support peer-to-peer comparisons between institutions.
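As a minimal sketch of the clustering-on-latent-scores idea (not the proposal's actual SEM pipeline), the snippet below estimates a single latent score per institution, with PCA standing in for the factor-of-factors model, and then clusters those scores with a Gaussian mixture, choosing the number of groups by BIC. The data and dimensions are simulated placeholders, not CCIHE data.

```python
# Sketch: latent index per institution, then model-based clustering on it.
# PCA is a stand-in for the SEM factor-of-factors score; data are simulated.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 8))            # placeholder for CCIHE-style metrics

# Latent index: first principal component of the standardized metrics
score = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(X))

# Model-based clustering on the latent score, with data-driven k via BIC
fits = [GaussianMixture(n_components=k, random_state=0).fit(score)
        for k in range(1, 7)]
best = min(fits, key=lambda m: m.bic(score))
labels = best.predict(score)
print("chosen number of groups:", best.n_components)
```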
In the second and third applications, I consider the effect of sparsity on unsupervised methods. Sparsity, in this case, results from a Lasso (L1) penalty on the features used in an algorithm, which shrinks the weights of individual features to exactly 0, effectively removing noise features while retaining signal-bearing ones. For clustering, I focus on a two-step sparse implementation of monothetic clustering, a divisive hierarchical clustering algorithm that makes grouping decisions based on one feature at a time. By using sparsity to clean up the feature space prior to clustering, monothetic clustering can be carried out on a constrained set of features. I will discuss potential methods for tuning two important parameters: the number of clusters, k, and the degree of sparsity, s. As a potential extension, I will also discuss incorporating additional external variables that inform the groupings but are not directly input into the clustering algorithm.
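The following is a rough sketch, under simplified assumptions, of the L1 mechanism that drives feature weights to exactly zero: per-feature signal scores are soft-thresholded under an L1 bound s (as in lasso-type sparse clustering), only features with nonzero weight are kept, and clustering is then run on the reduced feature set. The signal scores, the simulated data, and the use of agglomerative clustering as a stand-in for monothetic clustering are all illustrative assumptions, not the proposed method itself.

```python
# Sketch of the L1 (lasso-type) step: soft-threshold feature weights under an
# L1 bound s, keep the nonzero features, then cluster on the reduced set.
# AgglomerativeClustering is only a stand-in for monothetic clustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def sparse_feature_weights(a, s):
    """Weights w maximizing w @ a with ||w||_2 <= 1, ||w||_1 <= s, w >= 0."""
    a = np.maximum(a, 0.0)
    w = a / np.linalg.norm(a)
    if np.sum(w) <= s:                      # L1 bound already satisfied
        return w
    lo, hi = 0.0, np.max(a)
    for _ in range(60):                     # binary search on the threshold
        mid = (lo + hi) / 2
        w = soft_threshold(a, mid)
        w = w / np.linalg.norm(w)
        if np.sum(w) > s:
            lo = mid
        else:
            hi = mid
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
X[:50, :3] += 3.0                           # three signal features, rest noise

# Per-feature signal score; here an oracle proxy (difference in group means).
# In sparse clustering this would be the between-cluster sum of squares.
a = np.abs(X[:50].mean(axis=0) - X[50:].mean(axis=0))
w = sparse_feature_weights(a, s=1.5)
keep = w > 0
print("features retained:", np.where(keep)[0])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X[:, keep])
```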
Lastly, building on the framework for sparse monothetic clustering, I discuss extensions of sparsity to additional distance-based methods for dimension reduction. I will primarily focus on frameworks for sparse implementations of t-distributed Stochastic Neighbor Embedding (t-SNE) and Classical Multidimensional Scaling (with non-Euclidean distances). I will outline a comparison of sparse and non-sparse methods and discuss potential tuning issues that may arise. This talk, which will serve as my dissertation proposal, will additionally outline my research plan and timeline.
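To illustrate how sparse feature weights could feed into these distance-based methods, here is a minimal sketch that builds a weighted distance matrix (zero-weight features drop out), runs classical MDS via double centering, and passes the same precomputed distances to t-SNE. The weight vector is a placeholder, and weighted Euclidean distances stand in for the non-Euclidean distances considered in the proposal.

```python
# Sketch: weighted distances from a sparse weight vector, then classical MDS
# (double centering) and t-SNE on the same precomputed distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
w = np.zeros(20)
w[:3] = 1.0                                  # only three features carry weight

# Weighted Euclidean distances: zero-weight features contribute nothing
D = squareform(pdist(X * np.sqrt(w)))

# Classical MDS: double-center the squared distances and eigendecompose
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
idx = np.argsort(vals)[::-1][:2]             # two largest eigenvalues
coords_mds = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# t-SNE can consume the same weighted distances directly
coords_tsne = TSNE(metric="precomputed", init="random",
                   perplexity=30).fit_transform(D)
```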