# Change-in-estimate approach: Assessing confounding effects

Confounding; Logistic regression; Cox proportional hazards model; Linear regression

## Overview

The ‘chest’ package systematically calculates and compares effect estimates from models with different combinations of variables. It calculates the change in the effect estimate as variables are added to the model sequentially in a step-wise fashion. Depending on the modelling method, the effect estimates can be regression coefficients, odds ratios or hazard ratios. At each step, only the one variable that causes the largest change among the remaining variables is added to the model. The final results from the many models are summarized in one graph and one data frame. This approach can be used for assessing confounding effects in epidemiological studies and biomedical research, including clinical trials.

## Installation

You can install the released version of chest from CRAN with:

install.packages("chest")

## Getting Started

library(chest)
names(diab_df)
#>  [1] "Endpoint"  "mid"       "Diabetes"  "Age"       "Sex"       "BMI"
#>  [7] "Married"   "Smoke"     "CVD"       "Cancer"    "Education" "Income"
#> [13] "t0"        "t1"

### Data: diabetes and mortality

The data frame ‘diab_df’ is used to examine the association between Diabetes and the mortality Endpoint. The data set is used here to demonstrate the functions in this package rather than to answer any research question.

### ‘chest_speedglm’: report odds ratios at all steps with logistic regression models

chest_speedglm(
  crude = "Endpoint ~ Diabetes",
  xlist = c("Age", "Sex", "Married", "Smoke", "Cancer", "CVD", "Education", "Income"),
  zero = 1, data = diab_df)

#>       variables       OR       lb       ub      Change        p    n
#> 1         Crude 2.312758 1.838975 2.908604          NA 7.57e-13 2372
#> 2         + Age 3.348051 2.500654 4.482605  44.7644279 4.83e-16 2372
#> 3      + Income 2.940581 2.153766 4.014836 -12.1703550 1.13e-11 2061
#> 4         + CVD 2.831739 2.068823 3.875994  -3.7013899 8.09e-11 2061
#> 5       + Smoke 2.930687 2.135170 4.022595   3.4942324 2.84e-11 2061
#> 6         + Sex 2.901284 2.113594 3.982528  -1.0032733 4.38e-11 2061
#> 7   + Education 2.881540 2.096819 3.959939  -0.6805040 6.81e-11 2051
#> 8      + Cancer 2.863889 2.083367 3.936828  -0.6125761 9.11e-11 2051
#> 9     + Married 2.878757 2.093150 3.959219   0.5191530 7.88e-11 2048

All odds ratios are for the association between Diabetes and the mortality Endpoint after each of the other factors is added to the model sequentially.

• Step 1: Start with the model speedglm(Endpoint ~ Diabetes). The odds ratio for Diabetes is presented in the row marked Crude.

• Step 2: Each of the 8 variables is added separately to the above model, and chest_speedglm compares the odds ratios from those eight models to identify the variable that creates the largest change. Here, Age was selected and added to the model.

• Step 3: Repeat Step 2 with the remaining 7 variables. The variable Income was selected and added.

• Steps 4 to 9: Repeat the same procedure until all variables are added. After Age and Income were added, adding the other variables had little impact on the odds ratio estimates, and the estimates remained on the right-hand side of the non-effect line, indicating a positive association. In this case, ‘chest’ shows one table with the results from fitting 37 models in total: 1 crude model plus 36 (8 + 7 + 6 + 5 + 4 + 3 + 2 + 1) models.
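
The steps above can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation: the data, the variable names and the helper functions `exposure_or()` and `change_in_estimate()` are all hypothetical, and choosing the variable by the largest absolute percentage change is an assumption about how "largest change" is measured.

```r
# Simulated data (hypothetical; not the package's diab_df)
set.seed(42)
n        <- 2000
age      <- rnorm(n, 60, 10)
sex      <- rbinom(n, 1, 0.5)
smoke    <- rbinom(n, 1, 0.3)
# Exposure depends on age, so age acts as a confounder
exposure <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 60)))
outcome  <- rbinom(n, 1, plogis(-2 + 0.8 * exposure + 0.06 * (age - 60) + 0.3 * smoke))
dat <- data.frame(outcome, exposure, age, sex, smoke)

# Odds ratio for `exposure` from a logistic model with the given covariates
exposure_or <- function(covars, data) {
  f <- reformulate(c("exposure", covars), response = "outcome")
  unname(exp(coef(glm(f, family = binomial, data = data))["exposure"]))
}

# Greedy forward selection: at each step add the candidate that changes
# the odds ratio most (assumption: largest absolute percentage change)
change_in_estimate <- function(candidates, data) {
  selected <- character(0)
  current  <- exposure_or(selected, data)
  out <- data.frame(variables = "Crude", OR = current, Change = NA_real_,
                    stringsAsFactors = FALSE)
  n_models <- 1
  while (length(candidates) > 0) {
    # Fit one model per remaining candidate
    ors <- vapply(candidates,
                  function(v) exposure_or(c(selected, v), data),
                  numeric(1))
    n_models <- n_models + length(candidates)
    change   <- 100 * (ors - current) / current
    best     <- which.max(abs(change))
    selected <- c(selected, candidates[best])
    out <- rbind(out,
                 data.frame(variables = paste("+", candidates[best]),
                            OR = ors[[best]], Change = change[[best]],
                            stringsAsFactors = FALSE))
    current    <- ors[[best]]
    candidates <- candidates[-best]
  }
  list(table = out, n_models = n_models)
}

res <- change_in_estimate(c("age", "sex", "smoke"), dat)
res$table
res$n_models  # 1 crude model + 3 + 2 + 1 candidate models
```

With 3 candidate variables the sketch fits 1 + 3 + 2 + 1 = 7 models, which follows the same counting as the 1 + 36 = 37 models in the example above.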

We can alter some details of the graph. For example, we used zero = 1 to mark the non-effect line. Users can also save the result table to a data frame for further presentation and analysis.
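
The Change column in the first table is not defined in the text. Judging from the printed numbers, it appears to be the percentage change in the odds ratio relative to the previous row. That reading is an inference from the output, not a statement from the package documentation. A quick check in base R, with the odds ratios and reported changes copied from that table:

```r
# Odds ratios copied from the first chest_speedglm output table
or <- c(Crude = 2.312758, Age = 3.348051, Income = 2.940581, CVD = 2.831739,
        Smoke = 2.930687, Sex = 2.901284, Education = 2.881540,
        Cancer = 2.863889, Married = 2.878757)

# Change values as printed in the same table (rows 2 to 9)
reported <- c(44.7644279, -12.1703550, -3.7013899, 3.4942324,
              -1.0032733, -0.6805040, -0.6125761, 0.5191530)

# Percentage change relative to the previous row
change <- 100 * diff(or) / head(or, -1)

# Largest discrepancy between recomputed and printed values;
# any difference comes from the rounding of the printed odds ratios
max(abs(change - reported))
```

Note that in this table the sample size n also changes between rows, so part of each change may reflect a different set of observations rather than the added variable alone.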

### When the list of variables is long, or the same list is to be used repeatedly, create an object holding the variable list:

vlist <- c("Age", "Sex", "Married", "Smoke", "Cancer", "CVD","Education", "Income")
chest_speedglm(
  crude = "Endpoint ~ Diabetes",
  xlist = vlist, zero = 1, data = diab_df)

#>       variables       OR       lb       ub      Change        p    n
#> 1         Crude 2.312758 1.838975 2.908604          NA 7.57e-13 2372
#> 2         + Age 3.348051 2.500654 4.482605  44.7644279 4.83e-16 2372
#> 3      + Income 2.940581 2.153766 4.014836 -12.1703550 1.13e-11 2061
#> 4         + CVD 2.831739 2.068823 3.875994  -3.7013899 8.09e-11 2061
#> 5       + Smoke 2.930687 2.135170 4.022595   3.4942324 2.84e-11 2061
#> 6         + Sex 2.901284 2.113594 3.982528  -1.0032733 4.38e-11 2061
#> 7   + Education 2.881540 2.096819 3.959939  -0.6805040 6.81e-11 2051
#> 8      + Cancer 2.863889 2.083367 3.936828  -0.6125761 9.11e-11 2051
#> 9     + Married 2.878757 2.093150 3.959219   0.5191530 7.88e-11 2048

### Remove missing values and change the non-effect line

chest_speedglm(
  crude = "Endpoint ~ Diabetes", xlist = vlist,
  data = diab_df, zero = c(0.98, 1.02), na_omit = TRUE)

#>       variables       OR       lb       ub       Change        p    n
#> 1         Crude 2.305786 1.809862 2.937600           NA 1.37e-11 2048
#> 2         + Age 3.297099 2.421755 4.488837  42.99238627 3.50e-14 2048
#> 3      + Income 2.902046 2.125534 3.962238 -11.98183743 2.00e-11 2048
#> 4         + CVD 2.797882 2.044220 3.829405  -3.58931015 1.32e-10 2048
#> 5       + Smoke 2.900600 2.113218 3.981361   3.67127605 4.39e-11 2048
#> 6         + Sex 2.872333 2.092484 3.942824  -0.97453037 6.65e-11 2048
#> 7     + Married 2.894543 2.107594 3.975330   0.77324421 5.19e-11 2048
#> 8      + Cancer 2.881019 2.097209 3.957769  -0.46723505 6.52e-11 2048
#> 9   + Education 2.878757 2.093150 3.959219  -0.07851916 7.88e-11 2048
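
In the output above, n is 2048 in every row, which suggests that na_omit = TRUE restricts all models to the rows that are complete on every listed variable. The sketch below illustrates that idea in base R on a small hypothetical data frame. It is an assumption about the behaviour, not the package's code.

```r
# Toy data with missing values (hypothetical; not diab_df)
dat <- data.frame(
  outcome  = c(1, 0, 1, 0, 1, 0),
  exposure = c(1, 0, 0, 1, 1, 0),
  age      = c(50, NA, 61, 47, 70, 58),
  income   = c(3, 2, NA, 1, 2, NA)
)

# Without up-front filtering, each model silently uses a different number of rows
model_vars <- list(
  crude       = c("outcome", "exposure"),
  plus_age    = c("outcome", "exposure", "age"),
  plus_income = c("outcome", "exposure", "age", "income")
)
n_used <- vapply(model_vars,
                 function(v) sum(complete.cases(dat[, v])),
                 integer(1))
n_used

# Keeping only rows complete on all variables gives every model the same sample
complete <- dat[complete.cases(dat), ]
nrow(complete)
```

Fitting every model on `complete` means that differences between estimates are not driven by different sets of observations.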

### Add terms such as an interaction between Age and Sex, and Age squared

library(tidyverse)
#> -- Attaching packages -------------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.2.1     v purrr   0.3.3
#> v tibble  2.1.3     v dplyr   0.8.3
#> v tidyr   1.0.0     v stringr 1.4.0
#> v readr   1.3.1     v forcats 0.4.0
#> -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
diab_df <- diab_df %>%
  mutate(Age_Sex = Age * Sex, Age2 = Age^2)
vlist_1 <- c("Age", "Sex", "Age2", "Age_Sex", "Married", "Cancer", "CVD", "Education", "Income")
chest_speedglm(crude = "Endpoint ~ Diabetes", xlist = vlist_1, na_omit = TRUE, data = diab_df)

#>        variables       OR       lb       ub       Change        p    n
#> 1          Crude 2.305786 1.809862 2.937600           NA 1.37e-11 2048
#> 2          + Age 3.297099 2.421755 4.488837  42.99238627 3.50e-14 2048
#> 3       + Income 2.902046 2.125534 3.962238 -11.98183743 2.00e-11 2048
#> 4          + CVD 2.797882 2.044220 3.829405  -3.58931015 1.32e-10 2048
#> 5         + Age2 2.765179 2.028997 3.768472  -1.16884530 1.20e-10 2048
#> 6          + Sex 2.738303 2.008891 3.732558  -0.97195568 1.84e-10 2048
#> 7      + Age_Sex 2.770902 2.030833 3.780664   1.19049060 1.29e-10 2048
#> 8      + Married 2.794136 2.046620 3.814677   0.83848576 9.89e-11 2048
#> 9       + Cancer 2.780324 2.036062 3.796643  -0.49431423 1.25e-10 2048
#> 10   + Education 2.777871 2.032476 3.796634  -0.08822106 1.46e-10 2048
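
The derived terms above are ordinary columns, so they can also be created without the tidyverse. A minimal base R sketch on a hypothetical three-row data frame, assuming Sex is coded numerically (0/1) so that the product Age * Sex is defined:

```r
# Hypothetical values; Sex is assumed to be coded 0/1
toy <- data.frame(Age = c(45, 60, 72), Sex = c(0, 1, 1))

# Add the interaction term and the squared term as new columns
toy <- transform(toy, Age_Sex = Age * Sex, Age2 = Age^2)
toy
```

The new column names can then be listed in xlist like any other variable.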

### chest_glm: Logistic regression using generalized linear models (glm)

‘chest_glm’ is slower than ‘chest_speedglm’. We can use indicate = TRUE to monitor the progress. If it is too slow, you may want to try ‘chest_speedglm’.

vlist <- c("Age", "Sex", "Married", "Smoke", "Education")
chest_glm(crude = "Endpoint ~ Diabetes", xlist = vlist, data = diab_df, indicate = TRUE)

### chest_cox: Using Cox Proportional Hazards Models: ‘coxph’ of ‘survival’ package


chest_cox(crude = "Surv(t0, t1, Endpoint) ~ Diabetes", xlist = vlist,
  na_omit = TRUE, data = diab_df, zero = 1)

#>       variables       HR       lb       ub     Change            p    n
#> 1         Crude 1.588134 1.434544 1.758167         NA 4.950249e-19 2048
#> 2         + CVD 1.526276 1.377192 1.691499 -3.8949795 7.454317e-16 2048
#> 3      + Income 1.480726 1.335380 1.641891 -2.9844079 9.581156e-14 2048
#> 4       + Smoke 1.514956 1.366037 1.680108  2.3116907 3.596810e-15 2048
#> 5         + Sex 1.498963 1.351879 1.662049 -1.0556582 1.571022e-14 2048
#> 6     + Married 1.512616 1.363974 1.677456  0.9108110 4.451213e-15 2048
#> 7         + Age 1.526426 1.376076 1.693202  0.9129952 1.305521e-15 2048
#> 8      + Cancer 1.517896 1.368399 1.683726 -0.5587865 3.050629e-15 2048
#> 9   + Education 1.514437 1.365204 1.679982 -0.2279234 4.453189e-15 2048

### chest_clogit: Conditional logistic regression: ‘clogit’ of ‘survival’ package

chest_clogit(crude = "Endpoint ~ Diabetes + strata(mid)",
  xlist = vlist, data = diab_df, zero = 1)

#>       variables       OR       lb       ub    Change            p    n
#> 1         Crude 2.586950 1.719871 3.891170        NA 5.033866e-06 2372
#> 2      + Income 2.850010 1.752942 4.633671 10.168718 2.405822e-05 2061
#> 3     + Married 3.133480 1.875838 5.234301  9.946296 1.283423e-05 2058
#> 4   + Education 3.030468 1.810620 5.072149 -3.287484 2.452619e-05 2048
#> 5       + Smoke 3.128331 1.839469 5.320260  3.229314 2.559384e-05 2048
#> 6         + Age 3.212487 1.883223 5.480007  2.690121 1.844153e-05 2048
#> 7         + CVD 3.148114 1.824571 5.431754 -2.003848 3.776568e-05 2048
#> 8      + Cancer 3.100427 1.790709 5.368067 -1.514782 5.340664e-05 2048
#> 9         + Sex 3.100427 1.790709 5.368067  0.000000 5.340664e-05 2048

## Notes:

• Because ‘chest’ fits many models and compares effect estimates, some analyses may take a long time to complete. In that case, consider ‘chest_speedglm’ for logistic regression and ‘chest_clogit’ with the approximate method specified as an argument for conditional logistic regression.
• Possible alternative explanations: Although a large change suggests the presence of possible confounding effects, we also need to keep alternative explanations in mind.
• Different sample sizes: When different models are fitted using different sample sizes due to missing values, this may also contribute to the change in effect estimates. The change can partly reflect selection bias. Removing all the missing values can help distinguish the two.
• Intermediate measurements: Some variables in observational studies may be intermediate factors on the causal pathway. This is more a design issue than an analysis issue. The package can be used to identify changes in effect estimates but cannot be used to distinguish confounding factors from intermediate factors.