Introduction

When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. Often, the performance of a prediction algorithm must be estimated using the same data that are used to train the algorithm. To correct for overly optimistic assessments of performance, cross-validation is often used. However, standard \(K\)-fold cross-validated risk estimates may perform poorly in small samples when one considers nonlinear risks. This includes several popular risk criteria including the area under the ROC operating characteristics curve (AUC). The nlpred package implements several estimators that are tailored for improved cross-validated estimation of nonlinear risks. This vignette provides a brief overview of the motivation for these estimators, and demonstrations of how they are implemented in the nlpred package.

Motivation

Prediction is important in many areas of research. Modern technology has led to collection of vast amounts of data; for example in biomedical studies, we now routinely collect genetic sequences, gene expressions, proteomics, and metabolomics. Relative to the amount of information measured on each unit, the total number of units available may be quite modest. Many practical applications thus require methodology that enables researchers to simultaneously develop and evaluate performance of prediction algorithms in small samples. It is well known that estimating performance of an algorithm using the same data that were used to train the algorithm often leads to an overly optimistic estimate of performance. To correct for this, it is common to use \(K\)-fold cross-validation (CV). \(K\)-fold CV generalizes partitions the data into several distinct groups. The prediction algorithm is developed using all but one group, and the prediction metric is estimated in the remaining group. This is repeated until each partition has been used to estimate the risk once. The \(K\)-fold CV risk estimate is the average of these partition-specific estimates.

Theoretical frameworks have been developed for estimation of \(K\)-fold CV risks that apply to arbitrary learning algorithms (Hubbard, Kherad-Pajouh, and Laan 2016). Moreover, it is often possible to construct closed-form, computationally efficient confidence intervals and hypothesis tests based on \(K\)-fold CV estimates, e.g., (LeDell, Petersen, and Laan 2015). However, these estimators can suffer from poor behavior for certain risks. In particular, risks that are non-linear in the data generating distribution may suffer from poor performance. Whereas linear metrics can be estimated using estimators that themselves are linear (i.e., behave like means), non-linear metrics typically require asymptotically linear estimators. Such estimators behave as a mean plus a remainder term. While the remainder is generally negligible in large samples, it may contribute substantially to the behavior of the estimator in small samples.

In our recent work (references to be added when available), we have developed improved estimators of nonlinear risk functions. Our approach involves viewing the risk criteria as a statistical function of certain key parameters the data generating distribution. Standard CV estimators use the validation data to estimate these parameters, for example, using nonparametric maximum likelihood. However, as discussed above, this is problematic when validation samples are small. Thus, our proposal uses the training sample twice: once to train the prediction algorithm, and then again to estimate the relevant parameters of the data generating distribution that are needed to evaluate the risk criteria of interest. Because of the double-usage of the data, these estimates may exhibit bias. This motivates some form of bias correction. nlpred implements three asymptotically equivalent approaches: CV targeted minimum loss estimation (CVTMLE), one-step estimation, and estimating equations. We refer readers to our publication for further details on how the estimators are constructed.

Area under the receiver operating characteristics (ROC) curve

The area under the ROC curve (hence, AUC) is a popular risk criteria for evaluating prediction algorithms. Suppose we have a prediction algorithm \(\psi\) that maps a feature vector \(X\) into a predicted probability of a binary outcome \(Y\). The AUC of \(\psi\) can be defined as follows. Consider drawing \((X_1, Y_1)\) at random from the population of units with \(Y = 1\) and \((X_2, Y_2)\) independently from the population of units with \(Y = 0\). The AUC can be interpreted as \(P(\psi(X_1) > \psi(X_2) | Y_1 = 1, Y_2 = 0)\). That is, the probability that the predicted risk of \(X_1\) is higher than that of \(X_2\).

Estimates of CV AUC can be implemented in nlpred as follows.

## Loading required package: data.table
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
##                est         se       cil       ciu
## cvtmle   0.8152059 0.03117551 0.7541030 0.8763088
## onestep  0.8162802 0.02869939 0.7600304 0.8725300
## esteq    0.8358306 0.02869939 0.7795808 0.8920803
## standard 0.8170426 0.02958658 0.7590540 0.8750312

The main options to the cv_auc function are Y, the outcome, X, the features, and K, the number of cross-validation folds. The learner option specifies the learning algorithm of interest. See the Writing wrappers section below for more details. For now, it suffices to say that "glm_wrapper" corresponds to a main effects logistic regression. The nested_cv option is an important option in the estimation procedure. It specifies whether to use nested CV to estimate the nuisance parameters in each training sample. Obviously, requiring nested cross-validation adds considerable computational expense, so it is natural to inquire as to when this is necessary. In general, we recommend nested CV for more aggressive learning algorithms. Because logistic regression is a fairly stable algorithm, it may not be necessary in this case.

The printed output shows four estimates of CV AUC based on the three different bias corrections (cvtmle, onestep, esteq) as well as the standard CV AUC estimate (empirical) (LeDell, Petersen, and Laan 2015). Also shown is the influence function-based standard error estimate (se) and the limits of a, by default, 95% confidence interval. The level of the confidence interval is controlled by the ci_level of the print.cv_auc method.

##                est         se       cil       ciu
## cvtmle   0.8152059 0.03117551 0.7639268 0.8664851
## onestep  0.8162802 0.02869939 0.7690739 0.8634865
## esteq    0.8358306 0.02869939 0.7886243 0.8830369
## standard 0.8170426 0.02958658 0.7683770 0.8657082

We now consider a more aggressive algorithm: random forests (as implemented in the ranger package).

##                est         se       cil       ciu
## cvtmle   0.7882074 0.03178378 0.7259123 0.8505024
## onestep  0.7891323 0.03223887 0.7259453 0.8523193
## esteq    0.7832241 0.03223887 0.7200370 0.8464111
## standard 0.7909774 0.03257034 0.7271407 0.8548141

By default, cv_auc will use K-1 folds for the nested cross-validation. This choice is a sensible default since it allows for considerably less learner training relative to, e.g., two layers of \(K\)-fold CV, because certain learners can be re-used across different partitions of the data. However, if one wishes to control this, there is the nested_K option.

Available wrappers

The table below shows wrappers for learners that are available in nlpred.

Wrapper Description Reference
glm_wrapper logistic regression
stepglm_wrapper stepwise logistic regression
xgboost_wrapper eXtreme gradient boosting Chen and Guestrin (2016)
ranger_wrapper random forests Wright and Ziegler (2017)
randomforest_wrapper random forests Liaw and Wiener (2002)
superlearner_wrapper super learner Polley et al. (2019)
glmnet_wrapper elastic net regression Simon et al. (2011)

Writing wrappers

It is not difficult to write your own function that is compatible with cv_auc and other functions in the nlpred package. Let’s examine the glm_wrapper function.

## function (train, test) 
## {
##     if (!is.data.frame(train$X)) {
##         train$X <- data.frame(train$X)
##     }
##     if (!is.data.frame(test$X)) {
##         test$X <- data.frame(test$X)
##     }
##     glm_fit <- stats::glm(train$Y ~ ., data = train$X, family = stats::binomial())
##     train_pred <- stats::predict(glm_fit, newdata = train$X, 
##         type = "response")
##     test_pred <- stats::predict(glm_fit, newdata = test$X, type = "response")
##     return(list(test_pred = test_pred, train_pred = train_pred, 
##         model = NULL, train_y = train$Y, test_y = test$Y))
## }
## <bytecode: 0x7fe957308480>
## <environment: namespace:nlpred>

We see that the function expect input in the form of two lists: train and test. In these lists, we expect to find entries X and Y corresponding to the features and outcomes, respectively, in the a giving training sample and validation/testing sample. The salient points of the workflow of this function are: fit a main terms logistic regression and store it in the glm_fit object; obtain predictions on both the training data and testing data; structure the output to have a particular format. In particular, the output should be a list with named entries test_pred, predictions in the test sample, train_pred, predictions in the training sample, model, the fitted model (optional; only needed if you wish to examine it in the $prediction_list entry of the output of cv_auc), train_y, the training sample outcomes (copied from train$Y), and test_y, the testing sample outcomes (copied from test$Y).

Sensitivity constrained rate of negative prediction (SCRNP)

The sensitivity constrained rate of negative prediction (SCRNP) can be described as follows. Suppose again that we have a prediction function \(\psi\) that maps features \(X\) into a predicted probability of a binary outcome \(Y\). Suppose we choose a cutoff \(c_0\), such that \(P(\psi(X) > c_0 | Y = 1) \ge \rho\) for a user-defined \(\rho\). That is, we enforce that the sensitivity of a classifier based on \(\psi\) is at least \(\rho\). The SCRNP is then \(P(\psi(X) \le c_0)\); that is, the proportion of all data units that would be classified as a “control” (i.e., \(Y = 0\)).

To motivate SCRNP, consider developing a prediction algorithm for breast cancer incidence in women. We would like to identify a large proportion of women who will eventually develop breast cancer; that is, we would like to enforce that our procedure for classifying women at high-risk has high sensitivity. However, women with high predicted risk of cancer may be recommended to undergo more invasive screening procedures. So beyond our sensitivity constraint, we would like to maximize the proportion of women who are not required to undergo additional screening. The SCRNP describes this exactly this proportion. Zheng et al. (2018) discuss SCRNP in the context of HIV prevention.

Estimating SCRNP using traditional \(K\)-fold CV approaches is particularly challenging because we need to estimate a quantile of the distribution of \(\psi(X)\) amongst those with \(Y = 1\). If there are few observations with \(Y = 1\) in the validation fold, then this estimation will be highly unstable and will cause downstream instability in the estimate of CV SCRNP. This makes our approach particularly appealing for this problem.

The syntax to estimate CV SCRNP is through the cv_scrnp function as shown below.

##                est         se       cil       ciu
## cvtmle   0.2654355 0.04476104 0.1777055 0.3531656
## onestep  0.2489804 0.04500199 0.1607781 0.3371827
## esteq    0.2489804 0.04500199 0.1607781 0.3371827
## standard 0.2350000 0.04732137 0.1422518 0.3277482
##                est         se        cil       ciu
## cvtmle   0.1518164 0.04582227 0.06200643 0.2416264
## onestep  0.1625503 0.04570864 0.07296306 0.2521376
## esteq    0.1625503 0.04570864 0.07296306 0.2521376
## standard 0.1950000 0.03623288 0.12398486 0.2660151

Other methods implemented in nlpred

To compare the novel methods in nlpred to existing approaches, we have included several alternative approaches to estimating performance of these quantities.

Bootstrap corrections

The functions boot_auc and boot_scrnp can be used to estimate bootstrap based performance of prediction algorithms. There are several available approaches. In particular, each function implements a standard bootstrap correction (Harrell, Lee, and Mark 1996), as well as an 0.632 correction described in Chapter 7 of Friedman, Hastie, and Tibshirani (2001).

## $auc
## [1] 0.8214504
## 
## $n_valid_boot
## [1] 500
## $auc
## [1] 0.8189317
## 
## $n_valid_boot
## [1] 500
## $scrnp
## [1] 0.217375
## 
## $n_valid_boot
## [1] 200
## $scrnp
## [1] 0.07976385
## 
## $n_valid_boot
## [1] 200

Leave-pairs-out CV AUC estimator

Another proposal for estimating AUC is using leave-pairs-out CV (Airola et al. 2011). In this approach, a random observation with \(Y = 0\) and \(Y = 1\) are left out; the algorithm is trained on the remaining data and predictions are made on the two held out observations. The estimate is the proportion of these pairs for which the \(Y = 1\) observation had higher predicted risk than the \(Y = 0\) observation. Because it can be quite computationally expensive to retrain algorithms for every, we include an option max_pairs to specify the maximum number of pairs to leave out. If left equal to NULL, then every possible case/control pair is left out in turn.

## $auc
## [1] 0.828

Parallelization in nlpred

Unfortunately parallelization is not yet available in nlpred, but will be added as a feature soon.

Some rules of thumb based on simulation studies

From extensive simulation studies, here are a few relevant observations.

  • Any algorithm that is more complex than a standard logistic regression model would likely benefit from utilizing the nested cross-validation routine. Even for logistic regression, nested cross-validation does not tend to hurt performance too much.
  • Setting K = 10 or K = 20 tended to yield the best performance across a variety of settings.
  • Setting inner_K = 5 tended to yield the best performance across a variety of settings.
  • The CVTMLE of CV SCRNP vastly outperformed the others.
  • The estimating equations estimator of CV AUC tended to outperform the others, though the difference was not drastic.
  • Confidence intervals tended to be anti-conservative in small samples. Nominal coverage can be expected at around 500 observations, though this is obviously extremely context dependent.
  • Bootstrap methods should only be used for simple learning algorithms such as logistic regression.
  • The 0.632-correction improves performance of bootstrap estimators, but is generally less efficient than our newer approaches.

References

Airola, Antti, Tapio Pahikkala, Willem Waegeman, Bernard De Baets, and Tapio Salakoski. 2011. “An Experimental Comparison of Cross-Validation Techniques for Estimating the Area Under the ROC Curve.” Computational Statistics & Data Analysis 55 (4). Elsevier: 1828–44. https://doi.org/10.1016/j.csda.2010.11.018.

Chen, Tianqi, and Carlos Guestrin. 2016. “xgboost: A Scalable Tree Boosting System.” In Proceedings of the 22nd SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. ACM. https://doi.org/10.1145/2939672.2939785.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. 10. Springer Series in Statistics New York, NY, USA. https://doi.org/10.1007/978-0-387-84858-7.

Harrell, Frank E, Kerry L Lee, and Daniel B Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Statistics in Medicine 15 (4). Wiley Online Library: 361–87. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.

Hubbard, Alan E, Sara Kherad-Pajouh, and Mark J van der Laan. 2016. “Statistical Inference for Data Adaptive Target Parameters.” The International Journal of Biostatistics 12 (1). De Gruyter: 3–19. https://doi.org/10.1515/ijb-2015-0013.

LeDell, Erin, Maya Petersen, and Mark J van der Laan. 2015. “Computationally Efficient Confidence Intervals for Cross-Validated Area Under the ROC Curve Estimates.” Electronic Journal of Statistics 9 (1). NIH Public Access: 1583. https://doi.org/10.1214/15-EJS1035.

Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.

Polley, Eric, Erin LeDell, Chris Kennedy, and Mark van der Laan. 2019. SuperLearner: Super Learner Prediction. https://CRAN.R-project.org/package=SuperLearner.

Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5): 1–13. http://www.jstatsoft.org/v39/i05/.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.

Zheng, Wenjing, Laura Balzer, Mark J van der Laan, Maya Petersen, and SEARCH Collaboration. 2018. “Constrained Binary Classification Using Ensemble Learning: An Application to Cost-Efficient Targeted PrEP Strategies.” Statistics in Medicine 37 (2). Wiley Online Library: 261–79. https://doi.org/10.1002/sim.7296.