tcensReg: An Introduction

Justin Williams

2020-01-03

The goal of this package is to estimate parameters from a linear model when the data comes from a truncated normal distribution with censoring. Maximum likelihood values are returned. There are multiple method available for optimization with the default set as conjugate gradient. This package is also able to return maximum likelihood estimates for truncated only or censored only data similar to truncreg and censReg packages.

Installation

You can install tcensReg from CRAN for the stable release or install the GitHub verion for the active development version:

#stable CRAN version
install.packages("tcensReg")

# or ----

#active devel. GitHub version
install.packages("devtools")
devtools::install_github("williazo/tcensReg")

Example 1: Single Population

Some common examples where this type of problem may arise is when there is a natural truncation imposed by the structure of the data. For instance several applications have an implied zero truncation such as product lifetimes, age, or detection thresholds. To show how to implement the functions within the package, I will demonstrate a simple simulation example.

Assume that we have observations from an underlying truncated normal distribution. In our case we will assume a zero-truncated model by setting a=0. We generate this truncated normal data below and refer to it as y_star.

#loading the package
library(tcensReg)  
mu <- 0.5
sigma <- 0.5
a <- 0
#generate random values from the truncated normal distribution using tcensReg function
y_star <- rtnorm(n=1000, mu=mu, sd=sigma, a=a)
#note that the lowerbound will always be non-negative
round(range(y_star), 3)
## [1] 0.003 2.063

Next, we can imagine a scenario where we have an imprecise measurement of y_star leading to censoring. In our case we assume that values below a limit of detection, nu, are censored. This creates a random variable y.

In the example below we set our limit of detection as nu=0.25.

nu <- 0.25
y <- ifelse(y_star <= nu, nu, y_star)
#calculating the number of censored observations
sum(y == nu)/length(y) 
## [1] 0.182
#collecting the uncensored and censored data together
dt <- data.frame(y_star, y) 

We can observe the histogram and density plot for the uncensored data, which shows the zero-truncation.

We can then compare this to the censored observations below

We can then estimate the mean, mu, and standard deviation, sigma, using y with the tcensReg package as shown below.

tcensReg(y ~ 1, data=dt, a=0, v=0.25)
## $theta
##               Estimate
## (Intercept)  0.5067501
## log_sigma   -0.6774143
## 
## $convergence
## [1] 0
## 
## $initial_ll
## [1] -691.7561
## 
## $final_ll
## [1] -677.5373
## 
## $var_cov
##               (Intercept)     log_sigma
## (Intercept)  0.0008743094 -0.0008562579
## log_sigma   -0.0008562579  0.0015877418
## 
## $method
## [1] "CG"

Note that the this will return parameter estimates, variance-covariance matrix, the number of iterations until convergence, and the initial/final log-likelihood values.

Comparing the values to the truth we see that the estimates are unbiased.

#tcensReg model
output <- tcensReg(y ~ 1, data=dt, a=a, v=nu)
#extracting the point estimates
tcensReg_est <- output$theta 
#exponentiating the estimate of log_sigma to obtain sigma
tcensReg_est[2] <- exp(tcensReg_est[2]) 

#OLS model
lm_output <- lm(y ~ 1, data=dt) 
lm_est <- c(coef(lm_output), summary(lm_output)$sigma)
#censored only model, i.e., Tobit model
cens_output <- tcensReg(y ~ 1, data=dt, v=nu) 
## Warning: `a` is not specified indicating no truncation
cens_est <- cens_output$theta
cens_est[2] <- exp(cens_est[2])


results_df <- data.frame(rbind(c(mu, sigma),
                               t(tcensReg_est),
                               lm_est,
                               t(cens_est)))
names(results_df) <- c("mu", "sigma")
row.names(results_df) <- c("Truth", "tcensReg", "Normal MLE", "Tobit")
results_df$mu_bias <- abs(results_df$mu - mu)
results_df$sigma_bias <- abs(results_df$sigma - sigma)

knitr::kable(results_df, format="markdown", digits=4)
mu sigma mu_bias sigma_bias
Truth 0.5000 0.5000 0.0000 0.0000
tcensReg 0.5068 0.5079 0.0068 0.0079
Normal MLE 0.6746 0.3759 0.1746 0.1241
Tobit 0.6300 0.4424 0.1300 0.0576

Other methods result in significant bias for both mu and sigma.

Example 2: Two Population Model with Separate Variances

As an extension for the single population model above, we can imagine a two independent truncated normal random variables that have common censoring and truncation values but different standard deviations.

We can simulate the underlying truncated normal distributions Y1_star and Y2_star similar to \(Y\) above except now we allow them to have separate mean and variances.

For this example we let mu_1=0.5, mu_2=1, sigma_1=0.25, sigma_2=2, and a=0.

mu_1 <- 0.5
mu_2 <- 1
sigma_1 <- 0.25
sigma_2 <- 2
a <- 0

y_1_star <- rtnorm(1000, mu = mu_1, sd = sigma_1, a = a)
y_2_star <- rtnorm(1000, mu = mu_2, sd = sigma_2, a = a)
df <- data.frame(y_star = c(y_1_star, y_2_star), 
                 group = c(rep("Population 1", length(y_1_star)),
                           rep("Population 2", length(y_2_star))))

Plotting each of these uncensored population densities, we can see the difference in shape based on the underlying parameter selection.

Then censoring each observation at nu, we are left with Y1 and Y2. Again, we let nu=0.25.

nu <- 0.25
df$y <- ifelse(df$y_star<=nu, nu, df$y_star)

We then can fit our model with separate variances for each group using the command tcensReg_sepvar as shown below.

mod_result <- tcensReg_sepvar(y ~ group, a=a, v=nu, group_var="group", method="maxLik", data=df)
mod_result
## $theta
##       (Intercept) groupPopulation 2        log_sigma1        log_sigma2 
##         0.5002256         0.6074030        -1.3921394         0.6246513 
## 
## $convergence
## [1] 2
## 
## $initial_ll
## [1] -1891.114
## 
## $final_ll
## [1] -1816.653
## 
## $var_cov
##                     (Intercept) groupPopulation 2    log_sigma1
## (Intercept)        7.638326e-05     -7.638326e-05 -6.980312e-05
## groupPopulation 2 -7.638326e-05      2.371045e-02  6.980312e-05
## log_sigma1        -6.980312e-05      6.980312e-05  8.025460e-04
## log_sigma2        -3.797823e-22     -6.125757e-03  4.366464e-21
##                      log_sigma2
## (Intercept)       -3.797823e-22
## groupPopulation 2 -6.125757e-03
## log_sigma1         4.366464e-21
## log_sigma2         2.231303e-03
## 
## $method
## [1] "maxLik"
sepvar_est <- mod_result$theta
sepvar_est[3:4] <- exp(sepvar_est[3:4])

results_df <- data.frame(rbind(c(mu_1, mu_2, sigma_1, sigma_2),
                               t(sepvar_est)))
names(results_df) <- c("mu_1", "mu_2", "sigma_1", "sigma_2")
row.names(results_df) <- c("Truth", "tcensReg")
results_df$mu1_bias <- abs(results_df$mu_1 - mu_1)
results_df$mu2_bias <- abs(results_df$mu_2 - mu_2)
results_df$sigma1_bias <- abs(results_df$sigma_1 - sigma_1)
results_df$sigma2_bias <- abs(results_df$sigma_2 - sigma_2)

knitr::kable(results_df, format="markdown", digits=4)
mu_1 mu_2 sigma_1 sigma_2 mu1_bias mu2_bias sigma1_bias sigma2_bias
Truth 0.5000 1.0000 0.2500 2.0000 0e+00 0.0000 0.0000 0.0000
tcensReg 0.5002 0.6074 0.2485 1.8676 2e-04 0.3926 0.0015 0.1324

Performance Comparison: Censored-Only and Truncated-Only (Intercept Only Model)

Note also that the tcensReg can also estimate parameters in the censored-only or truncated-only cases. We show below that by using analytic values in the tcensReg implementation that our method is faster then the alternative estimation procedures while providing better variance estimates. With a small set of covariates and p<<n we can use the Newton Raphson method of optimization, which is computationally fast with few covariates.

library(microbenchmark)
#testing the censored-only regression
library(censReg)
cens <- microbenchmark(tcensReg_method = tcensReg(y ~ 1, data=dt, v=nu, method="Newton"),
               censReg_method = censReg(y ~ 1, left=nu, data=dt))
knitr::kable(summary(cens), format="markdown", digits=4)
expr min lq mean median uq max neval cld
tcensReg_method 4.9092 5.0706 5.7085 5.2077 5.9201 14.9288 100 a
censReg_method 13.7895 14.2633 19.3137 16.2156 21.7183 107.0266 100 b
#point estimates are equivalent
tcensReg_est <- as.numeric(tcensReg(y ~ 1, data=dt, v=nu, method="Newton")$theta)
censReg_est <- as.numeric(coef(censReg(y ~ 1, left=nu, data=dt)))
all.equal(tcensReg_est, censReg_est)
## [1] TRUE
#testing the truncated-only regression
library(truncreg)
trunc <- microbenchmark(
  tcensReg_method = tcensReg(y_star ~ 1, data=dt, a=a, method="Newton"),
  truncreg_method = truncreg(y_star ~ 1, point=a, data=dt))
knitr::kable(summary(trunc), format="markdown", digits=4)
expr min lq mean median uq max neval cld
tcensReg_method 8.5105 8.8282 9.5156 9.0185 9.3748 15.0493 100 a
truncreg_method 26.2514 27.4374 30.3957 29.9568 32.9080 42.2670 100 b
tcensReg_est <- as.numeric(tcensReg(y_star ~ 1, data=dt, a=a, method="Newton")$theta)
#note truncreg returns sigma not log_sigma so we need to exponentiate our value
tcensReg_est[2] <- exp(tcensReg_est[2])
truncreg_est <- as.numeric(coef(truncreg(y_star ~ 1, point=a, data=dt)))
all.equal(tcensReg_est, truncreg_est)
## [1] "Mean relative difference: 7.465878e-08"