Gradient Boosting for Group-Testing Data
Dr. Erica Porter (School of Mathematical and Statistical Sciences, Clemson University)
9/16/2024
Abstract: When screening a population for a disease, it is often necessary or preferable to pool individual specimens and test the pools. This approach is faster and less costly than testing each specimen individually. It is often of interest to model the true disease status as a function of one or more covariates available for the individuals (e.g., demographic information). We propose a gradient boosting method to build models for group-testing data using individual-level covariates. We first develop gradient boosting for the case of masterpool testing, where testing is performed only on the pools with no follow-up testing, and then we describe gradient boosting that can be used with any testing protocol. After deriving the gradient for group-testing data, we demonstrate that our approach can be used to build models from a number of weak learners, including regression trees, kernel smoothing, and splines. For each of these weak learners, we develop a cross-validation approach to select appropriate values for the applicable tuning parameters. Since the true individual disease statuses are not observed, typical cross-validation metrics such as mean squared error (MSE) cannot be applied. Instead, we calculate the log-likelihood based on the observed group-testing data and choose the tuning parameters that produce the largest log-likelihood after gradient boosting across several folds. Our gradient boosting approach can model both linear and nonlinear relationships between the disease status and covariates, and can easily be adapted to accommodate different types of weak learners. We demonstrate our method using a data set in which group testing was used to screen for chlamydia in Iowa.
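As a rough illustration of the kind of gradient the abstract refers to, the following is a minimal sketch for the masterpool case only. It is not the speaker's implementation. It assumes a logistic link between a score f_i and each individual's disease probability, independent individuals, and an assay with known sensitivity `se` and specificity `sp` that do not depend on pool size. All function and variable names are hypothetical.

```python
import numpy as np


def pool_positive_prob(f, pool_ids, n_pools, se=1.0, sp=1.0):
    """Return individual probabilities p, pool all-negative probabilities q,
    and pool test-positive probabilities theta (assumed model)."""
    p = 1.0 / (1.0 + np.exp(-f))            # assumed logistic link
    log_q = np.zeros(n_pools)
    np.add.at(log_q, pool_ids, np.log1p(-p))  # sum of log(1 - p_i) within each pool
    q = np.exp(log_q)                        # P(every member of the pool is negative)
    theta = se - (se + sp - 1.0) * q         # P(pool tests positive)
    return p, q, theta


def pool_loglik(f, pool_ids, z, se=1.0, sp=1.0):
    """Log-likelihood of the observed pool results z (1 = positive, 0 = negative)."""
    _, _, theta = pool_positive_prob(f, pool_ids, len(z), se, sp)
    return float(np.sum(z * np.log(theta) + (1.0 - z) * np.log1p(-theta)))


def pool_gradient(f, pool_ids, z, se=1.0, sp=1.0):
    """Derivative of the pool log-likelihood with respect to each score f_i."""
    p, q, theta = pool_positive_prob(f, pool_ids, len(z), se, sp)
    # Chain rule: d theta_j / d f_i = (se + sp - 1) * q_j * p_i for i in pool j.
    pool_term = (z - theta) / (theta * (1.0 - theta)) * (se + sp - 1.0) * q
    return pool_term[pool_ids] * p


# Hypothetical toy data: six individuals in two pools of three.
rng = np.random.default_rng(0)
f = rng.normal(size=6)
pool_ids = np.array([0, 0, 0, 1, 1, 1])
z = np.array([1.0, 0.0])
grad = pool_gradient(f, pool_ids, z, se=0.95, sp=0.98)
```

In a boosting iteration, a weak learner (a regression tree, kernel smoother, or spline) would be fit to this gradient as a function of the covariates, and the scores would be updated by a small step in the fitted direction. The same `pool_loglik` function, evaluated on held-out pools, is one way to express the cross-validation criterion described above. With pools of size one and a perfect assay, the gradient reduces to the familiar logistic-regression residual z - p.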