library("devtools")
install_github("fasrc/CausalGPS", ref="master")
library("CausalGPS")
Input parameters:
Y
a vector of observed outcome.
w
a vector of observed continues exposure.
c
data frame or matrix of observed baseline covariates.
ci_appr
The causal inference approach. Options are “matching”, “weighting”, and “adjusting”.
pred_model
a prediction model (use “sl” for SuperLearner). gps_model
Model type which is used for estimating GPS value, including “parametric” (default) and “non-parametric”.
use_cov_transform
If TRUE, the function uses transformer to meet the covariate balance.
transformers
Is a list of transformers. Each transformer should be a unary function. Available transformers are “pow2” and “pow3”. bin_seq
Sequence of w (treatment) to generate pseudo population. If NULL
is passed the default value will be used, which is seq(min(w)+delta_n/2,max(w), by=delta_n)
. matching_fun
specified matching function. trim_quantiles
A numerical vector of two. Represents the trim quantile level. Both numbers should be in the range of [0,1] and in increasing order(default: c(0.01,0.99)). optimized_compile
If TRUE, uses counts to keep track of number of replicated pseudo population. params
Includes list of params that is used internally. Unrelated parameters will be ignored. scale
specified scale parameter to control the relative weight that is attributed to the distance measures of the exposure versus the GPS estimates.
delta_n
specified caliper parameter on the exposure
sl_lib
a set of machine learning methods used for estimating GPS
ci_appr
causal inference approach
covar_bl_method
specified covariate balance method
covar_bl_trs
specified covariate balance threshold
max_attempt
maximum number of attempt to satisfy covariate balance
set.seed(422)
<- 10000
n <- generate_syn_data(sample_size=n)
mydata <- sample(x=c("2001","2002","2003","2004","2005"),size = n, replace = TRUE)
year <- sample(x=c("North", "South", "East", "West"),size = n, replace = TRUE)
region $year <- as.factor(year)
mydata$region <- as.factor(region)
mydata$cf5 <- as.factor(mydata$cf5)
mydata
<- generate_pseudo_pop(mydata$Y,
pseudo_pop $treat,
mydatac("cf1","cf2","cf3","cf4","cf5","cf6","year","region")],
mydata[ci_appr = "matching",
pred_model = "sl",
gps_model = "non-parametric",
use_cov_transform = TRUE,
transformers = list("pow2", "pow3", "abs", "scale"),
trim_quantiles = c(0.01,0.99),
optimized_compile = TRUE,
sl_lib = c("m_xgboost"),
covar_bl_method = "absolute",
covar_bl_trs = 0.1,
max_attempt = 4,
matching_fun = "matching_l1",
delta_n = 1,
scale = 0.5,
nthread = 1)
plot(pseudo_pop)
matching_l1
is Manhattan distance matching approach. For prediciton model we use SuperLearner package. User need to pass sl
as pred_model
to use SuperLearner package. SuperLearner supports different machine learning methods and packages. params
is a list of hyperparameters that users can pass to the third party libraries in the SuperLearner package. All hyperparameters go into the params list. The prefixes are used to distinguished parameters for different libraries. The following table shows the external package names, their equivalent name that should be used in sl_lib
, the prefixes that should be used for their hyperparameters in the params
list, and available hyperparameters.
Package name | sl_lib name |
prefix | available hyperparameters |
---|---|---|---|
XGBoost | m_xgboost |
xgb_ |
nrounds, eta, max_depth, min_child_weight |
ranger | m_ranger |
rgr_ |
num.trees, write.forest, replace, verbose, family |
nthread
is the number of available threads (cores). XGBoost needs OpenMP installed on the system to parallize the processing.
<- estimate_gps(Y,
data_with_gps
w,
c,pred_model = "sl",
internal_use = FALSE,
params = list(xgb_max_depth = c(3,4,5),
xgb_rounds = c(10,20,30,40)),
nthread = 1,
sl_lib = c("m_xgboost")
)
If internal_use
is set to be TRUE, the program will return additional vectors to be used by the selected causal inference approach to generate a pseudo population. See ?estimate_gps
for more details.
<- estimate_npmetric_erf(Y,
erf
w,
bw_seq, w_vals)
<- generate_syn_data(sample_size=1000,
syn_data seed = 403,
outcome_sd = 10,
gps_spec = 1,
cova_spec = 1)
The CausalGPS package is logging internal activities into the CausalGPS.log
file. The file is located in the source file location and will be appended. Users can change the logging file name (and path) and logging threshold. The logging mechanism has different thresholds (see logger package). The two most important thresholds are INFO and DEBUG levels. The former, which is the default level, logs more general information about the process. The latter, if activated, logs more detailed information that can be used for debugging purposes.