After assessing balance and deciding on a matching specification, it comes time to estimate the effect of the treatment in the matched sample. How the effect is estimated and interpreted depends on the desired estimand and the type of model used (if any). In addition to estimating effects, estimating the uncertainty of the effects is critical in communicating them and assessing whether the observed effect is compatible with there being no effect in the population. This guide explains how to estimate effects after various forms of matching and with various outcome types. There may be situations that are not covered here for which additional methodological research may be required, but some of the recommended methods here can be used to guide such applications.

This guide is structured as follows: first, information on the concepts related to effect and standard error estimation is presented below. Then, instructions for how to estimate effects and standard errors are described. This section is split up first by the type of matching method performed (matching without replacement, matching with replacement, full matching, and subclassification) and then by the type of outcome (continuous, binary, and survival). Finally recommendations for reporting results and tips to avoid making common mistakes are presented.

Before an effect is estimated, the estimand must be specified and clarified. Although some aspects of the estimand depend not only on how the effect is estimated after matching but also on the matching method itself, other aspects must be considered at the tine of effect estimation and interpretation. Here, we consider three aspects of the estimand: the population the effect is meant to generalize to (the target population), the effect measure, and whether the effect is marginal or conditional.

**The target population.** Different matching methods
allow you to estimate effects that can generalize to different target
populations. The most common estimand in matching the average treatment
effect in the treated (ATT), which is the average effect of treatment
for those who receive treatment. This estimand is estimable for matching
methods that do not change the treated units (i.e., by weighting or
discarding units) and is requested in `matchit()`

by setting
`estimand = "ATT"`

(which is the default). The average
treatment effect in the population (ATE) is the average effect of
treatment for the population from which the sample is a random sample.
This estimand is estimable only for methods that allow the ATE and
either do not discard units from the sample or explicit target full
sample balance, which in `MatchIt`

is limited to full
matching, subclassification, and template matching when setting
`estimand = "ATE"`

. When treated units are discarded (e.g.,
through the use of common support restrictions, calipers, cardinality
matching, or [coarsened] exact matching), the estimand corresponds to
neither the population ATT nor the population ATE, but rather to an
average treatment effect in the remaining matched sample (ATM), which
may not correspond to any specific target population.

**Marginal and conditional effects.** A marginal effect
is a comparison between the expected potential outcome under treatment
and the expected potential outcome under control. This is the same
quantity estimated in randomized trials without blocking or covariate
adjustment and is particularly useful for quantifying the overall effect
of a policy or population-wide intervention. A conditional effect is the
comparison between the expected potential outcomes in the treatment
groups within strata. This is useful for identifying the effect of a
treatment for an individual patient or a subset of the population.

**Effect measures.** The outcome types we consider here
are continuous, with the effect measured by the mean difference; binary,
with the effect measured by the risk difference (RD), risk ratio (RR),
or odds ratio (OR); and time-to-event (i.e., survival), with the effect
measured by the hazard ratio (HR). The RR, OR, and HR are
*noncollapsible* effect measures, which means the marginal effect
on that scale is not a (possibly) weighted average of the conditional
effects within strata, even if the stratum-specific effects are of the
same magnitude. For these effect measures, it is critical to distinguish
between marginal and conditional effects because different statistical
methods target different types of effects. The mean difference and RD
are *collapsible* effect measures, so the same methods can be
used to estimate marginal and conditional effects.

In this guide, we will provide examples for estimating the ATT, ATE, and ATM for each of the three outcomes types. Our primary focus will be on marginal effects, which are appropriate for all effect measures, easily interpretable, and require few modeling assumptions. We will include a short section on estimating the conditional OR. The “Common Mistakes” section includes examples of commonly used methods that estimate conditional rather than marginal effects and should not be used when marginal effects are desired.

The type and form of the model used to estimate the treatment effect depend on the effect measure desired, whether a marginal or conditional effect is desired, whether covariates are to be included in the model, and whether effect modification by the covariates is present. For continuous outcomes, one can use a linear model regressing the outcome on the treatment; for binary outcomes, one can use a generalized linear model with a link function appropriate for the desired effect measure (e.g., a logistic link for the OR); for time-to-event outcomes, one can use a Cox proportional hazards model.

An additional decision to make is whether (and how) to include covariates in the outcome model. One may ask, why use matching at all if you are going to model the outcome with covariates anyway? Matching reduces the dependence of the effect estimate on correct specification of the outcome model; this is the central thesis of Ho et al. (2007). Including covariates in the outcome model after matching has several functions: it can increase precision in the effect estimate, reduce the bias due to residual imbalance, and make the effect estimate “doubly robust”, which means it is consistent if either the matching reduces sufficient imbalance in the covariates or if the outcome model is correct. For these reasons, we recommend covariate adjustment after matching when possible. There is some evidence that covariate adjustment is most helpful for covariates with standardized mean differences greater than .1 (Nguyen et al. 2017), so these covariates and covariates thought to be highly predictive of the outcome should be prioritized in treatment effect models if not all can be included due to sample size constraints.

For continuous outcomes, we can simply include the covariates in the regression of the outcome on the treatment in the matched sample (i.e., using the matching weights). Because the mean difference is collapsible, the effect estimate conditioning on the covariates is still a marginal effect estimate. This is not the case with binary and time-to-event outcomes; including covariates in the outcome model makes the treatment effect a conditional effect. To recover a marginal effect while including covariates in an outcome model, one has to perform a marginal effects procedure, also known as g-computation (Snowden, Rose, and Mortimer 2011) or regression estimation (Schafer and Kang 2008), which involves simulating the average potential outcomes under each treatment level and computing the effect as a contrast between those average potential outcomes. G-computation can also be used to estimate marginal effects when modeling effect modification by the covariates. We demonstrate examples of this below.

Although there are many possible ways to include covariates (i.e.,
not just main effects but interactions, smoothing terms like splines, or
other nonlinear transformations), it is important not to engage in
specification search (i.e., trying many outcomes models in search of the
“best” one). Doing so can invalidate results and yield a conclusion that
fails to replicate. For this reason, we recommend only including the
same terms included in the propensity score model unless there is a
strong *a priori* and justifiable reason to model the outcome
differently. Second, it is important not to interpret the coefficients
and tests of the other covariates in the outcome model. These are not
causal effects and their estimates may be severely confounded. Only the
treatment effect estimate can be interpreted as causal assuming the
relevant assumptions about unconfoundedness are met. Inappropriately
interpreting the coefficients of covariates in the outcome model is
known as the Table 2 fallacy (Westreich and Greenland
2013). To avoid this, in all examples that incorporate
covariates in the outcome model, we restrict the output of outcome
regression models to just the treatment coefficient.

Uncertainty estimation (i.e., of standard errors, confidence intervals, and p-values) may consider the variety of sources of uncertainty present in the analysis, including (but not limited to!) estimation of the propensity score (if used), matching (i.e., because treated units might be matched to different control units if others had been sampled), and estimation of the treatment effect (i.e., because of sampling error). In general, there are no analytic solutions to all these issues, so much of the research done on uncertainty estimation after matching has relied on simulation studies. The two primary methods that have been shown to perform well in matched samples are using cluster-robust standard errors and the bootstrap.

**Robust standard errors.** Also known as sandwich
standard errors (due to the form of the formula for computing them),
heteroscedasticity-consistent standard errors, or Huber-White standard
errors, robust standard errors are an adjustment to the usual maximum
likelihood or ordinary least squares standard errors that are robust to
violations of some of the assumptions required for usual standard errors
to be valid (MacKinnon and White 1985). Although
there has been some debate about their utility (King and
Roberts 2015), robust standard errors rarely degrade
inferences and often improve them. Generally, robust standard errors
**must** be used when any non-uniform weights are included
in the estimation (e.g., with full matching or inverse probability
weighting).

**Cluster-robust standard errors.** A version of robust
standard errors known as cluster-robust standard errors (Liang
and Zeger 1986) can be used to account for dependence between
observations within clusters (e.g., matched pairs). Abadie and Spiess (2019) demonstrate analytically that
cluster-robust standard errors are generally valid after matching,
whereas regular robust standard errors can over- or under-estimate the
true sampling variability of the effect estimator depending on the
specification of the outcome model (if any) and degree of effect
modification. A plethora of simulation studies have further confirmed
the validity of cluster-robust standard errors after matching (e.g., Austin 2009, 2013a; Austin and Small 2014; Gayat et al. 2012; Wan 2019). Given this evidence favoring
the use of cluster-robust standard errors, we recommend them in most
cases and use them judiciously in the this guide^{1}.

Here, we will use linear and generalized linear models as implemented
by `lm()`

and `glm()`

and use
`lmtest::coeftest()`

and `lmtest::coefci()`

along
with `sandwich::vcovHC()`

for robust standard errors and
`sandwich::vcovCL()`

for cluster-robust standard errors.
There are several “types” (e.g., HC0, HC1, etc.) of robust standard
errors that adjust the original standard errors in different ways;
although more research is required on the benefits of using different
types for estimating standard errors after matching, we use the default
types here.

Bootstrapping is a technique used to simulate the sampling distribution of an estimator by repeatedly drawing samples with replacement and estimating the effect in each bootstrap sample (Efron and Tibshirani 1993). From the bootstrap distribution, standard errors and confidence interval can be computed in several ways, including using the standard deviation of the bootstrap estimates as the standard error estimate or using the 2.5 and 97.5 percentiles as 95% confidence interval bounds. Bootstrapping tends to be most useful when no analytic estimator of a standard error is possible or has been derived yet. Although Abadie and Imbens (2008) found analytically that the bootstrap is inappropriate for matched samples, simulation evidence has found it to be adequate in many cases (Hill and Reiter 2006; Austin and Small 2014; Austin and Stuart 2017). It is the most accessible method for computing standard errors after g-computation for nonlinear models.

Typically, bootstrapping involves performing the entire estimation
process in each bootstrap sample, including propensity score estimation,
matching, and effect estimation. This tends to be the most
straightforward route, though intervals from this method may be
conservative in some cases (i.e., they are wider than necessary to
achieve nominal coverage) (Austin and Small
2014). Less conservative and more accurate intervals have
been found when using different forms of the bootstrap, including the
wild bootstrap develop by Bodory et al. (2020) and the
matched bootstrap described by Austin and Small
(2014) and
Abadie and Spiess (2019). The block bootstrap involves
sampling matched pairs/strata of units from the matched sample and
performing the analysis within each sample composed of the sampled
pairs. Abadie and Spiess (2019) derived
analytically that the block bootstrap is valid for estimating SEs and
CIs in the same circumstances cluster robust SEs are; indeed, the block
bootstrap SE is known to approximate the cluster-robust SE (Cameron and Miller 2015). We use
`boot()`

from the `boot`

package to implement
bootstrapping.

With bootstrapping, more bootstrap replications are always better but
can take time and increase the chances that at least one error will
occur within the bootstrap analysis (e.g., a bootstrap sample with zero
treated units or zero units with an event). In general, numbers of
replications upwards of 999 are recommended, with values one less than a
multiple of 100 preferred to avoid interpolation when using the
percentiles as confidence interval limits (MacKinnon
2006). There are several methods of computing bootstrap
confidence intervals, but the bias-corrected accelerated (BCa) bootstrap
confidence interval often performs best (Austin and Small 2014;
Carpenter and Bithell
2000) and is easy to implement, simply by setting
`type = "bca"`

in the call to `boot.ci()`

after
running `boot()`

^{2}.

Below, we describe effect estimation after several methods of matching. We consider four broad types of matching that require their own specific methods for estimation effects: 1) pair matching without replacement, 2) pair matching with replacement, 3) full matching, and 4) stratification. In some cases, methods for estimating effects are similar across methods, and we err on the side of redundancy here in our instructions. We also consider three different outcome types: 1) continuous, 2) binary, and 3) survival.

We’ll be using a simulated toy dataset `d`

with several
outcome types. Code to generate the dataset is at the end of this
document. The focus here is not on evaluating the methods but simply on
demonstrating them. In all cases, the correct propensity score model is
used. Below we display the first six rows of `d`

:

`head(d)`

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S
## 1 0 0.1725 -1.4283 -0.4103 -2.36059 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46
## 2 0 -1.0959 0.8463 0.2456 -0.12333 1 -2.2687 -1.4491 -0.5514 -0.31439 0.15619 0 330.63
## 3 0 0.1768 0.7905 -0.8436 0.82366 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94
## 4 0 -0.4595 0.1726 1.9542 -0.62661 1 -0.4019 -0.8294 -0.5384 0.20729 -2.35184 0 91.06
## 5 1 0.3563 -1.8121 0.8135 -0.67189 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73
## 6 0 -2.4313 -1.7984 -1.2940 0.04609 1 -1.2419 -1.1252 -1.8659 -0.56513 -5.62260 0 2563.73
```

`A`

is the treatment variable, `X1`

through
`X9`

are covariates, `Y_C`

is a continuous
outcome, `Y_B`

is a binary outcome, and `Y_S`

is a
survival outcome.

We will need to the following packages to perform the desired analyses:

`lmtest`

provides the`coeftest()`

and`coefci()`

functions for estimating coefficients, standard errors, and confidence intervals incorporating robust standard errors`sandwich`

provides robust and cluster robust standard errors through the`vcovHC()`

and`vcovCL()`

functions, respectively`boot`

provides the`boot()`

and`boot.ci()`

functions for performing bootstrapping and estimating bootstrap effects, standard errors, and confidence intervals.`survival`

provides the functions`survdiff()`

and`coxph()`

to perform the log-rank test for differences in survival curves and estimate the coefficients in a Cox-proportional hazards model for the marginal hazard ratio, respectively, which we will use for survival outcomes.

Of course, we also need `MatchIt`

to perform the
matching.

```
library("MatchIt")
library("lmtest")
library("sandwich")
library("boot")
library("survival")
```

Other packages may be of use but are not used here. The
`margins`

package can be useful for computing marginal
effects and standard errors without bootstrapping. Some examples using
`margins`

are presented. The `survey`

package can
be used to estimate robust standard errors incorporating weights and
provides functions for survey-weighted generalized linear models and
Cox-proportional hazards models. It is often used with propensity score
weighting. The `Zelig`

package provides a broad interface to
estimating marginal and conditional effects using simulation and was
included in the original documentation for `MatchIt`

. The
`lme4`

(and `lmerTest`

) package performs mixed
effects modeling, which can be useful for accounting for pair membership
or other clustering features of the data.

Because different matching methods require different treatments, instructions for each method are organized in the following sections, as designated by the table below. It is important to ensure the right methods are used with the matching specification used in order for the estimated effects and standard errors to be valid.

Section | Matching Method |
---|---|

After Pair Matching Without Replacement | Nearest neighbor matching without replacement |

- | Optimal matching |

- | Genetic matching without replacement |

- | Coarsened exact matching with `k2k = TRUE` |

After Pair Matching With Replacement | Nearest neighbor matching with replacement |

- | Genetic matching with replacement |

After Full Matching | Full matching |

Cardinality and template matching | |

After Stratum Matching | Propensity score subclassification |

- | Exact matching |

- | Coarsened exact matching with `k2k = FALSE` (the
default) |

Pair matching without replacement yields the simplest way to estimate
treatment effects and standard errors. In general, whether a caliper,
common support restriction, exact matching specification, or \(k\):1 matching specification is used,
estimating the effect in the matched dataset (i.e., the output of
`match.data()`

) is straightforward and involves fitting a
model for the outcome that incorporates the matching weights^{3}.

First, we will perform nearest 1:1 neighbor propensity score matching without replacement.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNN + X7 + X8 + X9, data = d)
X6
mNN
```

```
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 882 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
<- match.data(mNN)
md
head(md)
```

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights subclass
## 1 0 0.17254 -1.4283 -0.4103 -2.3606 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 0.08461 1 167
## 3 0 0.17677 0.7905 -0.8436 0.8237 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 0.22210 1 210
## 5 1 0.35631 -1.8121 0.8135 -0.6719 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 0.43291 1 322
## 9 0 0.78080 1.3137 0.6580 0.8540 1 0.9495 -0.5731 -0.2362 -0.14580 15.89771 1 67.53 0.15751 1 3
## 10 1 -0.56513 -0.1053 -0.1369 1.6233 1 -0.5304 -0.3342 0.4184 0.46308 1.07888 1 113.70 0.16697 1 1
## 11 0 0.07053 -0.1320 -1.8882 0.3077 0 -1.6558 0.8405 -0.5723 -1.01382 -6.55597 0 1632.69 0.45623 1 298
```

Typically one would assess balance and ensure that this matching
specification works, but we will skip that step here to focus on effect
estimation. See `vignette("MatchIt")`

and
`vignette("assessing-balance")`

for more information on this
necessary step. Because we did not use a caliper, the target estimand is
the ATT.

For continuous outcomes, estimating effects is fairly straightforward
using linear regression. We perform all analyses using the matched
dataset, `md`

, which contains only units retained in the
sample.

**Without covariate adjustment.** To estimate the effect
without covariate adjustment, we can run the following:

```
#Linear model without covariates
<- lm(Y_C ~ A, data = md, weights = weights) fit1
```

First, we will estimate the standard errors while accounting for pair membership using cluster-robust standard errors.

```
#Cluster-robust standard errors
coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.779 0.303 5.88 5.8e-09 ***
## A 2.114 0.409 5.18 2.8e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Using the bootstrap.** We demonstrate bootstrapping
here using the block bootstrap as recommended by Abadie and Spiess (2019). The process is generalizable to
other matching methods for which cluster-robust standard errors are
valid and to other outcome types. We first need to write a function,
`est_fun`

, which takes in a vector of pair identifiers and an
ordering and outputs an effect estimate. This function should include
the effect estimation, though no standard error is required.

```
#Block bootstrap confidence interval
# library(boot)
<- levels(md$subclass)
pair_ids
<- function(pairs, i) {
est_fun
#Compute number of times each pair is present
<- table(pairs[i])
numreps
#For each pair p, copy corresponding md row indices numreps[p] times
<- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
ids function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
<- md[ids,]
md_boot
#Effect estimation
<- lm(Y_C ~ A, data = md_boot, weights = weights)
fit_boot
#Return the coefficient on treatment
return(coef(fit_boot)["A"])
}
<- boot(pair_ids, est_fun, R = 499)
boot_est boot_est
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = pair_ids, statistic = est_fun, R = 499)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.114 0.003777 0.3862
```

`boot.ci(boot_est, type = "bca")`

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 499 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_est, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 1.317, 2.758 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
```

The value `t1*`

in the `Bootstrap Statistics`

table contains the effect estimate and its bootstrap standard error, and
the interval in the confidence interval output is the bootstrap
confidence interval computed using the specified type (here,
`"bca"`

).

**With covariate adjustment.** Including covariates in
the outcome model is straightforward with a continuous outcome. We can
include main effects for each variable and use the coefficient on the
treatment as the treatment effect estimate. It can also be helpful to
include interactions between the covariates and treatment if effect
modification is suspected, though it is important to center the
covariates at their means in the focal (i.e., treated) group if the
coefficient on the treatment is to be interpreted as an average
treatment effect estimate (which we demonstrate with full matching). A
marginal effects procedure can also be used, which we demonstrate with
binary outcomes. Below we simply include main effects of each covariate,
which is the most typical way to include covariates.

```
#Linear model with covariates
<- lm(Y_C ~ A + X1 + X2 + X3 + X4 + X5 +
fit3 + X7 + X8 + X9, data = md,
X6 weights = weights)
coeftest(fit3, vcov. = vcovCL, cluster = ~subclass)["A",,drop=FALSE]
```

```
## Estimate Std. Error t value Pr(>|t|)
## A 2.029 0.3202 6.338 3.735e-10
```

As previously mentioned, it is important to avoid interpreting coefficients in the outcome model on predictors other than the treatment, as the coefficients do no correspond to causal effects and are likely confounded. In the code above, we restricted the output to just the treatment effect. Note that sometimes the intercept of the outcome model can be useful as an estimate of the average potential outcome under control, but the covariates in the model (if any) must be centered at their means in the target population to allow for this interpretation. Without centering the covariates, the intercept is uninterpretable.

For binary outcomes, effect estimation can be a bit more challenging, especially when including covariates. There are several measures of the effect one can consider, which include the odds ratio (OR), risk ratio/relative risk (RR), and risk difference (RD).

**Without covariate adjustment.** When omitting
covariates from the outcome model, effect and standard error estimation
is fairly straightforward. Below we demonstrate how to estimate the
marginal log OR after nearest neighbor matching without replacement.

```
#Generalized linear model without covariates
<- glm(Y_B ~ A, data = md, weights = weights,
fit4 family = binomial(link = "logit"))
```

By specifying `link = "logit"`

, we fit a logistic
regression model to estimate the marginal OR for treatment. To estimate
the marginal RR we can specify `link = "log"`

, and to
estimate the RD we can specify `link = "identity"`

or use
`lm()`

instead of `glm()`

. Below we estimate
cluster-robust standard errors, though it should be noted that these
standard errors can be biased for the OR and should be used with caution
(Austin 2007). Because we want the OR and
the effects are estimated on the log OR scale, we have to exponentiate
the coefficient on treatment to arrive at the OR (one would do this when
estimating the OR or RR, but not the RD).

```
#Cluster-robust standard errors
coeftest(fit4, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.939 0.106 -8.85 < 2e-16 ***
## A 0.834 0.143 5.82 5.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit4)) #OR`

```
## (Intercept) A
## 0.3912 2.3030
```

**With covariate adjustment and bootstrapping.** If we
want to include covariates in the model, we have to do some additional
work to estimate the effect and its standard error. This is because the
coefficient on treatment in a nonlinear model with covariates
corresponds to the conditional rather than marginal effect. To estimate
a marginal effect, we have to perform a marginal effects procedure,
which is equivalent to g-computation in the matched set. First we
estimate an outcome model including the covariates, and then we use the
predictions from the outcome model to compute the contrast of the
average potential outcomes under treatment and control. Bootstrapping is
the most straightforward method to estimate standard errors. We
demonstrate this below using the block bootstrap as recommended by Abadie and Spiess (2019).

```
#Block bootstrap confidence interval
# library(boot)
<- levels(md$subclass)
pair_ids
<- function(pairs, i) {
est_fun
#Compute number of times each pair is present
<- table(pairs[i])
numreps
#For each pair p, copy corresponding md row indices numreps[p] times
<- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
ids function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
<- md[ids,]
md_boot
#Fitting outcome the model
<- glm(Y_B ~ A + X1 + X2 + X3 + X4 + X5 +
fit_boot + X7 + X8 + X9, data = md_boot,
X6 family = binomial(link = "logit"),
weights = weights)
#Estimate potential outcomes for each unit
#Under control
$A <- 0
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P0 w = md_boot$weights)
<- P0 / (1 - P0)
Odds0
#Under treatment
$A <- 1
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P1 w = md_boot$weights)
<- P1 / (1 - P1)
Odds1
#Return marginal odds ratio
return(Odds1 / Odds0)
}
<- boot(pair_ids, est_fun, R = 499)
boot_est boot_est
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = pair_ids, statistic = est_fun, R = 499)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.225 0.0291 0.2727
```

`boot.ci(boot_est, type = "bca")`

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 499 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_est, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 1.785, 2.827 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
```

To use bootstrapping for the RR or RD, the part of the code that
computes the marginal OR can be replaced with code to compute the
marginal RR (`P1 / P0`

) or marginal RD
(`P1 - P0`

). Interactions between the treatment and
covariates can be included in the outcome model; no other part of the
procedure needs to be changed. Doing so can add robustness when effect
modification is possible.

There are several measures of effect size for survival outcomes. When using the Cox proportional hazards model, the quantity of interest is the HR between the treated and control groups. As with the OR, the HR is non-collapsible, which means the estimated HR will only be a valid estimate of the marginal HR when no other covariates are included in the model. Other effect measures, such as the difference in mean survival times or probability of survival after a given time, can be treated just like continuous and binary outcomes as previously described. Here we describe estimating the marginal HR.

**Without covariate adjustment.** Below we demonstrate
estimation of the marginal hazard ratio using `coxph()`

from
the `survival`

package. To request cluster-robust standard
errors as recommended by Austin (2013b), we need
to supply pair membership (stored in the `subclass`

column of
`md`

) to the `cluster`

argument and set
`robust = TRUE`

.

```
#Cox Regression for marginal HR
coxph(Surv(Y_S) ~ A, data = md, robust = TRUE,
weights = weights, cluster = subclass)
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = md, weights = weights,
## robust = TRUE, cluster = subclass)
##
## coef exp(coef) se(coef) robust se z p
## A 0.43 1.53 0.07 0.07 6 1e-09
##
## Likelihood ratio test=39 on 1 df, p=5e-10
## n= 882, number of events= 882
```

The `coef`

column contains the log HR, and
`exp(coef)`

contains the HR. Remember to always use the
`robust se`

for the standard error of the log HR. The
displayed z-test p-value results from using the robust standard
error.

**Using the bootstrap.** Austin
and Small (2014) found bootstrap confidence
intervals to work well for marginal hazard ratios. Below we demonstrate
this using similar code as before to implement the block bootstrap:

```
#Block bootstrap confidence interval
# library(boot)
<- levels(md$subclass)
pair_ids
<- function(pairs, i) {
est_fun
#Compute number of times each pair is present
<- table(pairs[i])
numreps
#For each pair p, copy corresponding md row indices numreps[p] times
<- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
ids function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
<- md[ids,]
md_boot
#Effect estimation
<- coxph(Surv(Y_S) ~ A, data = md_boot, weights = weights)
cox_fit_boot
#Compute the marginal HR by exponentiating the coefficient
#on treatment
<- exp(coef(cox_fit_boot)["A"])
HR
#Return the HR
return(HR)
}
<- boot(pair_ids, est_fun, R = 499)
boot_est
boot_estboot.ci(boot_est, type = "bca")
```

**With covariate adjustment.** As with binary outcomes,
if covariates are included in the model, the resulting hazard ratios
will be conditional rather than marginal. Austin,
Thomas, and Rubin (2020) describe several ways of including
covariates in a model to estimate the marginal HR, but because they do
not develop standard errors and little research has been done on this
method, we will not present it here.

Pair matching with replacement makes estimating effects and standard errors a bit less straightforward. Control units paired with multiple treated units belong to multiple pairs at the same time and appear multiple times in the matched dataset. Effect and standard error estimation need to account for control unit multiplicity (i.e., repeated use) and within-pair correlations (Hill and Reiter 2006; Austin and Cafri 2020).

`MatchIt`

provides two interfaces for extracting the
matched dataset after matching with replacement. The first uses
`match.data()`

and provides the same output as when used
after matching without replacement, except that the
`subclass`

column is absent because units are not assigned to
a single subclass but rather to several. Control units will have weights
that differ from 1 to reflect their use as matches for multiple treated
units. When using the `match.data()`

interface, including the
weights in the estimation of effects is crucial. Because pair membership
is omitted, accounting for it (if desired) must be done by conditioning
on covariates used to match the pairs or by bootstrapping the entire
process.

The second interface uses `get_matches()`

, which functions
similarly to `match.data()`

except that the dataset contains
one row per unit per pair, so control units matched to multiple treated
units will have multiple rows in the dataset, and a
`subclass`

column is included denoting pair membership. In
the `get_matches()`

output, each reused control unit will
have multiple rows, identical to each other except with different
`subclass`

membership. When performing k:1 matching, weights
must also be included in effect estimation when using
`get_matches()`

.

Here we will demonstrate the estimation of effects using both
`match.data()`

and `get_matches()`

output. The
`match.data()`

output is preferred when pair membership is
not directly included in the analysis, and the
`get_matches()`

output is preferred when pair membership is
to be included. Here will will use 3:1 matching with replacement and
with a caliper; note that because a caliper was used, the estimand
corresponds to neither the ATT nor the ATE but rather to an ATM.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNNr + X7 + X8 + X9, data = d,
X6 link = "linear.logit", caliper = .1,
ratio = 3, replace = TRUE)
mNNr
```

```
## A matchit object
## - method: 3:1 nearest neighbor matching with replacement
## - distance: Propensity score [caliper]
## - estimated with logistic regression and linearized
## - caliper: <distance> (0.13)
## - number of obs.: 2000 (original), 1040 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
#match.data output
<- match.data(mNNr)
md nrow(md)
```

`## [1] 1040`

`head(md)`

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights
## 1 0 0.1725 -1.4283 -0.4103 -2.3606 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 -2.381 0.9571
## 3 0 0.1768 0.7905 -0.8436 0.8237 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 -1.253 0.4785
## 5 1 0.3563 -1.8121 0.8135 -0.6719 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 -0.270 1.0000
## 7 0 1.8402 1.7601 -1.0746 -1.6428 1 1.4482 0.7131 0.6972 -0.94673 4.28651 1 97.49 -2.281 0.4785
## 9 0 0.7808 1.3137 0.6580 0.8540 1 0.9495 -0.5731 -0.2362 -0.14580 15.89771 1 67.53 -1.677 0.4785
## 10 1 -0.5651 -0.1053 -0.1369 1.6233 1 -0.5304 -0.3342 0.4184 0.46308 1.07888 1 113.70 -1.607 1.0000
```

```
#get_matches output
<- get_matches(mNNr)
gm nrow(gm)
```

`## [1] 1681`

`head(gm)`

```
## id subclass weights A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance
## 1 5 1 1.0000 1 0.35631 -1.8121 0.81349 -0.6719 1 -0.8297 1.7297 -0.6439 -0.02648 0.6806 0 182.73 -0.2700
## 2 740 1 0.3333 0 0.02725 -0.4570 -0.32506 1.1203 1 0.3795 0.6468 -0.9992 0.01917 -0.9042 1 215.25 -0.2736
## 3 939 1 0.3333 0 0.92888 -0.2562 0.39077 0.3305 1 0.2654 1.2038 0.1720 1.21006 7.8923 0 83.85 -0.2736
## 4 860 1 0.3333 0 -0.88047 0.9117 -0.38871 1.0696 0 1.7953 0.2470 -1.6115 1.49297 5.3808 0 236.01 -0.2752
## 5 10 2 1.0000 1 -0.56513 -0.1053 -0.13685 1.6233 1 -0.5304 -0.3342 0.4184 0.46308 1.0789 1 113.70 -1.6073
## 6 399 2 0.3333 0 0.80396 -0.8760 0.03938 -0.5895 1 0.2819 -0.6540 -0.5423 1.16852 4.4816 1 117.28 -1.6065
```

The `get_matches()`

output provides some additional
information about the match. We can count how many times control units
are reused and how many units are in each match strata (not all will
have 3 control units due to the caliper).

```
#Number of time control units are rematched
table(table(gm$id[gm$A == 0]))
```

```
##
## 1 2 3 4 5 6 7 8 9 10 13 14
## 332 138 62 29 19 9 14 4 2 2 1 1
```

Here we can see that 332 control units were only used in one pair each, and one control unit was paired with 14 treated units (i.e., heavily reused).

```
#Number of control units in each match stratum
table(table(gm$subclass[gm$A == 0]))
```

```
##
## 1 2 3
## 9 9 409
```

Here we can see that 409 treated units have three matches, nine have two matches, and nine only have one match. The caliper did not end up restricting too many matches.

**Without covariate adjustment.** For continuous
outcomes, we can regress the outcome on the treatment and include the
weights in the estimation. We do this regardless of whether we are using
the `match.data()`

output or the `get_matches()`

output (if we were doing 1:1 matching, the weights would not be
necessary when using the `get_matches()`

output, but they
don’t change the results if included).

If we don’t mind ignoring pair membership, we can use the
`match.data()`

output to estimate the effect and standard
errors. Here we use `vcovHC()`

to estimate regular robust
standard errors that ignoring pairing. Hill and
Reiter (2006)
found these standard errors to be conservative for continuous
outcomes.

```
#match.data() output
<- lm(Y_C ~ A, data = md, weights = weights)
fit1md
coeftest(fit1md, vcov. = vcovHC)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.798 0.420 4.28 2.0e-05 ***
## A 2.065 0.514 4.02 6.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

If we want to incorporate pair membership into the standard error
estimation, we have to use the `get_matches()`

output. In
addition to supplying pair membership (`subclass`

) to the
standard error estimator, we also supply unit ID (`id`

) to
account for the fact that several rows may refer to the same control
unit.

```
#get_matches() output
<- lm(Y_C ~ A, data = gm, weights = weights)
fit1gm
coeftest(fit1gm, vcov. = vcovCL, cluster = ~subclass + id)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.798 0.416 4.32 1.7e-05 ***
## A 2.065 0.510 4.05 5.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Note that the effect estimates are identical; only the standard
errors and p-values differ between the approaches^{4}.

**Using the bootstrap.** We can also use bootstrapping
to estimate standard errors and confidence intervals. Although Abadie and Imbens (2008) demonstrated analytically that
bootstrap standard errors may be invalid for matching with replacement,
simulation work by Hill and Reiter (2006) and Bodory et al. (2020) has found that bootstrap standard
errors are adequate and generally slightly conservative. Given this
disagreement, it may be worth proceeding with caution when using
bootstrapping to estimate standard errors after matching with
replacement. Simulation experiments with matching with replacement have
only considered the full bootstrap and found it to yield confidence
intervals with approximately nominal coverage, so we recommend this
approach over the block bootstrap for now and demonstrate it below.

```
#Full bootstrap confidence interval
# library(boot)
<- function(data, i) {
est_fun #Matching function
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNNr_boot + X7 + X8 + X9, data = data[i,],
X6 link = "linear.logit", caliper = .1,
ratio = 3, replace = TRUE)
<- match.data(mNNr_boot)
md_boot
#Effect estimation
<- lm(Y_C ~ A, data = md_boot, weights = weights)
fit_boot
#Return the coefficient on treatment
return(coef(fit_boot)["A"])
}
<- boot(d, est_fun, R = 499)
boot_est boot_est
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = d, statistic = est_fun, R = 499)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.065 -0.1287 0.4398
```

`boot.ci(boot_est, type = "perc")`

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 499 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_est, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 1.049, 2.796 )
## Calculations and Intervals on Original Scale
```

**With covariate adjustment.** We can include covariates
in the outcome model just as we could when matching without replacement.
Hill and Reiter (2006) found that standard errors that
fail to account for pair membership can have lower the nominal
confidence interval coverage, so we recommend using the
`get_matches()`

output to address both multiplicity and
clustering.

```
<- lm(Y_C ~ A + X1 + X2 + X3 + X4 + X5 +
fit2md + X7 + X8 + X9, data = gm,
X6 weights = weights)
coeftest(fit1gm, vcov. = vcovCL, cluster = ~subclass + id)["A",,drop = FALSE]
```

```
## Estimate Std. Error t value Pr(>|t|)
## A 2.065 0.51 4.048 5.392e-05
```

Remember that the coefficients and tests on the predictors other than the treatment should not be interpreted because they may be subject to confounding even if the treatment is not, so we omit them here.

The primary difference between dealing with binary and continuous
outcomes is the noncollapsibility of the effect measures for binary
outcomes, meaning that including covariates in the outcome model is less
straightforward because the coefficient on treatment does not correspond
to the marginal treatment effect. Similar to continuous outcomes, when
estimating the treatment effect, we can use either the output of
`match.data()`

or the output of `get_matches()`

,
only the latter of which allows us to account both for multiplicity in
the control units and for pair membership. Below we’ll demonstrate
estimating the marginal OR accounting for pair membership using the
`get_matches()`

output. Then we will demonstrate using the
bootstrap to estimate standard errors that include covariates in the
model. Note that there has been little research on the use of matching
with replacement with binary outcomes, so one should proceed with
caution and consider using a better-studied matching method.

**Without covariate adjustment.** We include the weights
in the call to `glm()`

and we include both
`subclass`

and `id`

as clustering variables in
computing the cluster-robust standard errors, just as we would with a
continuous outcome. We can use `family = quasibinomial`

instead of `family = binomial`

to avoid a warning due to the
use of weights; the estimates will be the same either way. To estimate
the marginal RR or RD, `"logit"`

would be replaced with
`"log"`

or `"identity"`

, respectively.

```
<- glm(Y_B ~ A, data = gm, weights = weights,
fit3gm family = quasibinomial(link = "logit"))
coeftest(fit3gm, vcov. = vcovCL, cluster = ~ subclass + id)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.766 0.139 -5.50 3.9e-08 ***
## A 0.639 0.169 3.78 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit3gm)) #OR`

```
## (Intercept) A
## 0.4648 1.8954
```

**With covariate adjustment and bootstrapping.** To
include covariates, we can use the bootstrap as we did before when
matching without replacement. It doesn’t matter whether the
`match.data()`

or `get_matches()`

output is used
because, as we saw before, both yield the same effect estimate in each
bootstrap replication. Here we use the full bootstrap rather than the
block bootstrap because the performance of the block bootstrap after
matching with replacement has not been evaluated.

```
#Bootstrap confidence intervals
# library(boot)
<- function(data, i) {
est_fun #Matching function
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNNr_boot + X7 + X8 + X9, data = data[i,],
X6 link = "linear.logit", caliper = .1,
ratio = 3, replace = TRUE)
<- match.data(mNNr_boot)
md_boot
#Fitting the model
<- glm(Y_B ~ A + X1 + X2 + X3 + X4 + X5 +
fit_boot + X7 + X8 + X9, data = md_boot,
X6 family = quasibinomial(link = "logit"),
weights = weights)
#Estimate potential outcomes for each unit
$A <- 0
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P0 w = md_boot$weights)
<- P0 / (1 - P0)
Odds0
$A <- 1
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P1 w = md_boot$weights)
<- P1 / (1 - P1)
Odds1
#Return marginal odds ratio
return(Odds1 / Odds0)
}
<- boot(d, est_fun, R = 4999)
boot_est
boot_estboot.ci(boot_est, type = "bca")
```

As before, to use bootstrapping for the RR or RD, the part of the
code that computes the marginal OR can be replaced with code to compute
the marginal RR (`P1 / P0`

) or marginal RD
(`P1 - P0`

). The outcome model can additionally include
treatment-covariate interactions if desired.

Standard error estimation for the marginal HR after matching with replacement is not a well-studied area, with Austin and Cafri (2020) providing the sole examination into appropriate methods for doing so. With survival outcomes, other matching methods may be more appropriate until matching with replacement is better understood. Here we provide an example that implements the recommendations by Austin and Cafri (2020). Any other methods (e.g., bootstrap) should be used with caution until they have been formally evaluated.

**Without covariate adjustment.** According to the
results of Austin and Cafri’s (2020) simulation
studies, when prevalence of the treatment is low (<30%), a standard
error that does not involve pair membership is sufficient. When
treatment prevalence is higher, the standard error that ignores pair
membership may be too low, and the authors recommend a custom standard
error estimator that uses information about both multiplicity and
pairing.

For the continuous and binary outcomes, accounting for both
multiplicity and pair membership is fairly straightforward thanks to the
ability of the `sandwich`

package functions to include
multiple sources of clustering. Unfortunately, this must be done
manually for survival models. We perform this analysis below, adapting
code from the appendix of Austin and Cafri (2020) to the
`get_matches()`

output.

```
#Austin & Cafri's (2020) SE estimator
<- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE,
fs weights = weights, cluster = subclass)
<- fs$var
Vs <- nlevels(gm$subclass)
ks
<- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE,
fi weights = weights, cluster = id)
<- fi$var
Vi <- length(unique(gm$id))
ki
<- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE,
fc weights = weights)
<- fc$var
Vc <- nrow(gm)
kc
#Compute the variance
<- (ks/(ks-1))*Vs + (ki/(ki-1))*Vi - (kc/(kc-1))*Vc
V
#Sneak it back into the fit object
$var <- V
fc
fc
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = gm, weights = weights,
## robust = TRUE)
##
## coef exp(coef) se(coef) robust se z p
## A 0.44 1.55 0.07 0.07 6 1e-09
##
## Likelihood ratio test=40 on 1 df, p=2e-10
## n= 1681, number of events= 1681
```

The `robust se`

column contains the computed standard
error, and the reported Z-test uses this standard error. The
`se(coef)`

column should be ignored.

Full matching presents fairly straightforward methods of effect and standard error estimation. The most common and recommended way to estimate effects after full matching is to use the computed matching weights to estimate weighted effects. These matching weights function essentially like inverse probability weights and can be treated as such in all analyses, including with respect to standard error estimation. For empirical comparisons between full matching and propensity score weighting, see Austin and Stuart (2015, 2017, 2015).

Standard error estimation involves using a cluster-robust standard error as recommended by Austin and Stuart (2017). Including covariates can improve precision and add robustness when valid. Note that the methods described here also work for other estimators that rely on matching weights, including cardinality and template matching and matching without replacement. The main difference is that full matching can be used to estimate the ATE, which we demonstrate below.

Below, we perform optimal full propensity score matching. Here we use full matching to estimate the ATE, but the procedure is nearly identical when estimating the ATT, and we point out any differences.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mF + X7 + X8 + X9, data = d,
X6 method = "full", estimand = "ATE")
mF
```

```
## A matchit object
## - method: Optimal full matching
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 2000 (matched)
## - target estimand: ATE
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
<- match.data(mF)
md
head(md)
```

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights subclass
## 1 0 0.1725 -1.4283 -0.4103 -2.36059 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 0.08461 0.8166 56
## 2 0 -1.0959 0.8463 0.2456 -0.12333 1 -2.2687 -1.4491 -0.5514 -0.31439 0.15619 0 330.63 0.01855 0.7899 140
## 3 0 0.1768 0.7905 -0.8436 0.82366 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 0.22210 1.0393 96
## 4 0 -0.4595 0.1726 1.9542 -0.62661 1 -0.4019 -0.8294 -0.5384 0.20729 -2.35184 0 91.06 0.04180 0.8445 135
## 5 1 0.3563 -1.8121 0.8135 -0.67189 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 0.43291 0.3307 202
## 6 0 -2.4313 -1.7984 -1.2940 0.04609 1 -1.2419 -1.1252 -1.8659 -0.56513 -5.62260 0 2563.73 0.04998 0.7990 167
```

A benefit of full matching is that no units are discarded, which has the potential to improve precision and prevent bias due to incomplete matching. However, the “effective” sample size implied by the matching weights is lower than the actual remaining sample size, so one should not always expect full matching to yield more precise estimates than other forms of matching.

**Without covariate adjustment.** Estimating effects and
standard errors for continuous outcomes after full matching involves
including the matching weights in the outcome model and using a
cluster-robust standard error. For cardinality or template matching, a
regular robust standard error can be requested using `vcovHC`

and omitting the `cluster`

argument in the code below.

```
<- lm(Y_C ~ A, data = md, weights = weights)
fit1
coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.436 0.250 5.75 1e-08 ***
## A 1.892 0.519 3.65 0.00027 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Using the bootstrap.** Computing bootstrap standard
errors and confidence intervals after full matching is identical to
doing so after pair matching without replacement, so we do not
demonstrate it again here. Covariates can be included in the outcome
model used in the bootstrap replications. See below for a note on
centering the covariates when including treatment-covariate
interactions, as this must be done when using the bootstrap as well.

**With covariate adjustment.** As previously mentioned,
it is generally acceptable to include just the main effects, but
including interactions between the treatment and covariates can be
beneficial when effect modification by the covariates may be present.
Because we have not demonstrated this strategy so far, we demonstrate it
below.

In order to interpret the coefficient on treatment as a marginal
effect estimate, we need to center the covariates at their means in the
target population (i.e., the original sample for the ATE, the treated
units for the ATT, or the retained units for an ATM); we could also use
a marginal effects procedure as has been demonstrated with binary
outcomes for other matching methods. Below we use the strategy of
centering the covariates at their means. Note that when factor
predictors are present, they need to be split into dummy (0/1) variables
prior to centering. The `splitfactor()`

function in
`cobalt`

can make this straightforward. Although this
procedure is more involved compared to simply including main effects, it
can provide extra precision and robustness.

```
#Estimating a covariate-adjusted marginal effect
#with treatment-covariate interactions
#Create a new dataset for centered variables
<- md
md_cen
<- c("X1", "X2", "X3", "X4", "X5",
covs_to_center "X6", "X7", "X8", "X9")
<- scale(md_cen[covs_to_center],
md_cen[covs_to_center] scale = FALSE)
#Fit the model with every covariate interacting with treatment
<- lm(Y_C ~ A * (X1 + X2 + X3 + X4 + X5 +
fit2 + X7 + X8 + X9),
X6 data = md_cen, weights = weights)
#Only output the intercept and coefficient on treatment
coeftest(fit2, vcov. = vcovCL, cluster = ~subclass)[1:2,]
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.400 0.1645 8.511 3.349e-17
## A 1.979 0.4169 4.748 2.207e-06
```

Remember not to interpret the coefficients on the covariates or the treatment-covariate interactions, as they are likely confounded. To keep the output clean, above we restricted the output to just the intercept and coefficient on treatment. Another benefit of centering the covariates is that we can interpret the intercept as an estimate of the average potential outcome under control.

Note that the above strategy can be applied to all matching methods when analyzing continuous outcomes (but not binary or survival outcomes, which require bootstrapping to validly estimate standard errors with covariate adjustment). It is critical to center the covariates at their means in the target group, which may require some additional programming for estimands other than the ATE.

Using full matching with binary outcomes was described by Austin and Stuart (2017). In general, the procedures look similar to how they do with other matching methods.

**Without covariate adjustment.** We can use a weighted
generalized linear model regressing the outcome on the treatment with a
link function appropriate to the effect measure of interest. Below we
demonstrate estimating the marginal OR after full matching in a model
without covariates:

```
<- glm(Y_B ~ A, data = md, weights = weights,
fit3 family = quasibinomial(link = "logit"))
coeftest(fit3, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9146 0.0826 -11.07 <2e-16 ***
## A 0.5545 0.1722 3.22 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit3)) #OR`

```
## (Intercept) A
## 0.4007 1.7410
```

As with matching with replacement, we include weights in the call to
`glm()`

and set `family = quasibinomial()`

to
prevent a warning that occurs when using weights with binomial
regression models (though the results do not differ). Setting
`link = "logit"`

provides the marginal OR; for the marginal
RR, we would replace `"logit"`

with `"log"`

, and
for the marginal RD, we would replace `"logit"`

with
`"identity"`

and not exponentiate the coefficient.

**With covariate adjustment and bootstrapping.** To
include covariates in the model, a marginal effects procedure must be
used with bootstrapping to recover the marginal effect because the
coefficient on treatment in such a model corresponds to a conditional
effect. A bootstrap must be used to estimate the standard error and
confidence interval of the marginal effect. Either the full bootstrap or
block bootstrap can be used, though the performance of the latter has
not been formally evaluated (but because it is an approximation to
cluster-robust standard errors, which are valid, it is likely valid).
The code for the full bootstrap with full matching is identical to the
code for bootstrapping with binary outcomes for pair matching with
replacement (except that the call to `matchit()`

in the
`est_fun`

function must be adjusted to perform full
matching), and the code for the block bootstrap with full matching is
identical to the code for bootstrapping with binary outcomes for pair
matching without replacement, so we do not repeat it here.

Austin and Stuart (2015) describe the use of the full matching with survival outcomes.

**Without covariate adjustment.** To estimate the
marginal HR, we can regress the outcome on the treatment in a Cox
regression model weighted by the matching weights and including
subclasses as a cluster. Below we demonstrate how to estimate the
marginal HR and its standard error after full matching.

```
coxph(Surv(Y_S) ~ A, data = md, robust = TRUE,
weights = weights, cluster = subclass)
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = md, weights = weights,
## robust = TRUE, cluster = subclass)
##
## coef exp(coef) se(coef) robust se z p
## A 0.42 1.52 0.05 0.09 5 2e-06
##
## Likelihood ratio test=54 on 1 df, p=2e-13
## n= 2000, number of events= 2000
```

To perform the log-rank test or compute survival curves after full
matching, functions designed for performing these tasks with inverse
probability weights can be used with the matching weights; the
`RISCA`

package offers functionality for this purpose.

Including covariates in the model is less straightforward because the resulting HR estimate is conditional rather than marginal.

Stratum matching includes exact matching, coarsened exact matching,
and propensity score subclassification. There are two natural ways to
estimate marginal effects after stratum matching: the first is to
estimate stratum-specific treatment effects and pool them, and the
second is to use the stratum weights to estimate a single marginal
effect. This latter approach is also known as marginal mean weighting
through stratification (MMWS), and is described in detail by Hong (2010)^{5}. When done properly, both methods should
yield similar or identical estimates of the treatment effect. MMWS is
generally preferable because it is far simpler to implement and avoids
issues of noncollapsibility with non-continuous outcomes. All of the
methods described above for use with full matching also work with MMWS
because the formation of the weights is the same; the only difference is
that it is not appropriate to use cluster-robust standard errors with
MMWS because of how few clusters are present.

Unless exact matching is used, estimating stratum-specific treatment effects can be fraught because balance may not be achieved within strata even if balance is achieved across strata. Stratum-specific effects should be interpreted with caution. Stratum-specific effects are conditional effects, but conditional on stratum membership, which may not always be a useful conditioning variable.

For each outcome type, we focus on estimating marginal effects using MMWS and using the strata directly. Below, we perform propensity score subclassification for the ATT using 8 subclasses.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mS + X7 + X8 + X9, data = d,
X6 method = "subclass", estimand = "ATT",
subclass = 8)
mS
```

```
## A matchit object
## - method: Subclassification (8 subclasses)
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 2000 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
<- match.data(mS)
md
head(md)
```

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights subclass
## 1 0 0.1725 -1.4283 -0.4103 -2.36059 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 0.08461 0.2458 1
## 2 0 -1.0959 0.8463 0.2456 -0.12333 1 -2.2687 -1.4491 -0.5514 -0.31439 0.15619 0 330.63 0.01855 0.2458 1
## 3 0 0.1768 0.7905 -0.8436 0.82366 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 0.22210 1.0742 3
## 4 0 -0.4595 0.1726 1.9542 -0.62661 1 -0.4019 -0.8294 -0.5384 0.20729 -2.35184 0 91.06 0.04180 0.2458 1
## 5 1 0.3563 -1.8121 0.8135 -0.67189 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 0.43291 1.0000 5
## 6 0 -2.4313 -1.7984 -1.2940 0.04609 1 -1.2419 -1.1252 -1.8659 -0.56513 -5.62260 0 2563.73 0.04998 0.2458 1
```

For continuous outcomes, we can use either MMWS or compute the weighted average of within-subclass effects. First we illustrate weighting.

**Without covariate adjustment.** With weighting, we can
supply the weights that are in the `match.data()`

output to a
call to `lm()`

to perform weighted least squares regression,
as we did with full matching. We need a robust standard error estimator
to account for the weights. Note that the subclasses don’t even need to
enter this analysis; they are fully incorporated through the MMWS
weights^{6}.
We use `vcovHC()`

to estimate the regular robust standard
error instead of the cluster-robust standard error used with other
methods.

```
<- lm(Y_C ~ A, data = md, weights = weights)
fit1
coeftest(fit1, vcov. = vcovHC)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.964 0.287 6.85 9.8e-12 ***
## A 1.930 0.407 4.74 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can also fit a model within each subclass to estimate the within-stratum treatment effects and then compute a weighted average of them to be used as the marginal effect. The stratum weights in the weighted average must be equal to the proportion of treated units in each subclass if the ATT is targeted. If the ATE is targeted, the weights must be equal to the proportion of all units in each subclass. There are other ways to construct weight to minimize variance at the expense of losing the original target population (Rudolph et al. 2016).

Instead of fitting separate models for each subclass, we can fit a
single model that fully interacts the treatment with subclass membership
and then perform a linear hypothesis test. To do so, we use the form
`Y ~ S + S:A - 1`

in the call to `lm()`

. This
includes main effects for subclass and treatment interaction terms for
each subclass and omits an intercept. The fit of this model is
equivalent to that of a traditional full factorial model, so no
information is lost using this parameterization and using it makes it
easier to construct the stratum weights. This estimates the
subclass-specific intercept and subclass-specific treatment effect in
each subclass. We would use a robust standard error to account for
different residual variance across subclasses.

```
<- lm(Y_C ~ subclass + subclass:A - 1, data = md)
fit2
#Within-subclass effects
# coeftest(fit2, vcov. = vcovHC)
```

The within-subclass effects should only be trusted if balance is achieved in each subclass. In this example, balance has not been achieved within some subclasses, so we would not interpret these effects. Next we construct the weights to form the weighted average of the subclass effects. The weights take the form of a linear contrast matrix with zeroes for the subclass-specific intercepts and the subclass proportions for the corresponding subclass-specific treatment effect coefficients.

```
#Subclass weights for ATT
<- with(md, c(rep(0, nlevels(subclass)),
sub_w table(subclass[A==1])/sum(A==1)))
#Subclass weights for ATE (requires estimand = "ATE" in matchit())
# sub_w <- with(md, c(rep(0, nlevels(subclass)),
# table(subclass)/nrow(md)))
#Marginal effect
<- weighted.mean(coef(fit2), sub_w)) (est
```

`## [1] 1.93`

```
#SE of marginal effect
<- sqrt(drop(sub_w %*% vcovHC(fit2) %*% sub_w))) (se
```

`## [1] 0.4036`

```
#CI
c(ci_low = est - 1.96*se, ci_hi = est + 1.96*se)
```

```
## ci_low ci_hi
## 1.139 2.721
```

The following lines would have produced the same output but require
the `margins`

package:

```
#Using margins() from margins
summary(margins::margins(fit2, variables = "A",
data = md[md$A == 1,],
vcov = vcovHC(fit2)))
#For ATE, omit the second line.
```

**With covariate adjustment.** To include covariates in
the model when using MMWS, we can modify the code used for weighting
after full matching, the only difference being that regular robust
standard errors should be used with MMWS. As before, treatment-covariate
interactions are optional but can reduce bias and improve precision when
there effect modification by the covariates. When including these
interactions in the outcome model, it is important to center the
covariates at their means in the target population (i.e., the full
sample for the ATE and the treated units for the ATT) in order to
interpret the coefficient on treatment as a marginal effect.

To include covariates in the model when combining subclass-specific
effect estimates, it can be challenging to correctly parameterize the
model so that the linear contrast matrix method works as expected. The
simplest way would be to include covariates as desired, which include as
main effects, interactions with treatment, interactions with subclass,
or all three, and use the `margins()`

code above, which
should automatically provide the correct output.

Using stratification with binary outcomes is slightly more complicated than it is with continuous outcomes. This is because the OR and RR are not collapsible, so the marginal OR and RR cannot be computed as the weighted average of the stratum-specific effects. Instead, one must compute the average of the predicted stratum-specific risks under each treatment and then compute the marginal effect estimate from these marginal risks. Although stratum-specific conditional ORs are valid effect measures, they generally do not correspond to meaningful subpopulation effects unless the strata themselves are meaningful subpopulations. After exact matching or coarsened exact matching, strata may be meaningful because they correspond to specific combinations of covariates that may come close to designating specific patient attributes, but after propensity score subclassification, the strata correspond to propensity score bins, which are generally not meaningful. Although some researchers have interpreted stratum-specific effects after propensity score subclassification as representing effects at various risks or propensities for treatment, because the primary purpose of the propensity score is as a balancing score and not as an accurate estimate of propensity for treatment, such an interpretation should be regarded with caution.

Austin (2007) compared several methods of propensity score adjustment for estimating marginal ORs, including two methods based on propensity score stratification, one of which involved the stratified Mantel-Haenszel estimator, and the other of which involved averaging stratum-specific effects. Both of these estimate a common conditional OR, not a marginal OR (Forbes and Shortreed 2008; Stampf et al. 2010), and both yielded positively biased effect estimates for non-null treatment effects, a common pattern when using conditional effect estimates as estimates of marginal effects. Given the difficulties in estimating marginal ORs after stratification, the most straightforward way to do so is to use MMWS. We do, however, also demonstrate estimating the marginal odds ratio and its standard error using bootstrapping in a way that can incorporate covariate adjustment and allow for differential effect modification across strata.

**Without covariate adjustment.** As before, we can
supply the stratification weights to a weighted generalized linear model
with just the treatment as the sole predictor, and the coefficient on
treatment will correspond to a marginal treatment effect. We use
`vcovHC()`

to estimate the regular robust standard error.

```
<- glm(Y_B ~ A, data = md, weights = weights,
fit3 family = quasibinomial(link = "logit"))
coeftest(fit3, vcov. = vcovHC)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.831 0.108 -7.71 1.3e-14 ***
## A 0.726 0.144 5.04 4.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit3))`

```
## (Intercept) A
## 0.4357 2.0675
```

As with other matching methods, we include weights in the call to
`glm()`

and set `family = quasibinomial()`

to
prevent a warning that occurs when using weights with binomial
regression models (though the results do not differ). Setting
`link = "logit"`

provides the marginal OR; for the marginal
RR, we would replace `"logit"`

with `"log"`

, and
for the marginal RD, we would replace `"logit"`

with
`"identity"`

and not exponentiate the coefficient.

**With covariate adjustment and bootstrapping.** We can
use bootstrapping to estimate the marginal OR and its standard error by
estimating the average of the stratum-specific risks under each
treatment level, computing the marginal risks under each treatment, and
computing marginal effects from the marginal risks. This also makes it
fairly straightforward to include covariates in the model. Below we
illustrate bootstrapping for the marginal OR with covariates included in
the outcome model.

```
#Bootstrap confidence intervals
library(boot)
<- function(data, i) {
est_fun #Subclassification function
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mS_boot + X7 + X8 + X9, data = data[i,],
X6 method = "subclass", estimand = "ATT",
subclass = 8)
<- match.data(mS_boot)
md_boot
#Fitting the model
<- glm(Y_B ~ A * (subclass + X1 + X2 + X3 + X4 + X5 +
fit_boot + X7 + X8 + X9), data = md_boot,
X6 family = quasibinomial(link = "logit"))
#Estimate potential outcomes for each unit
## Subset to just the treated for the ATT; remove this for the ATE
<- md_boot[md_boot$A == 1,]
md_boot ##
$A <- 0
md_boot<- mean(predict(fit_boot, md_boot, type = "response"))
P0 <- P0 / (1 - P0)
Odds0
$A <- 1
md_boot<- mean(predict(fit_boot, md_boot, type = "response"))
P1 <- P1 / (1 - P1)
Odds1
#Return marginal odds ratio
return(Odds1 / Odds0)
}
<- boot(d, est_fun, R = 4999)
boot_est
boot_estboot.ci(boot_est, type = "bca")
```

In this example, we included interactions between treatment and
subclass and between treatment and each covariate. Note that because we
are interested in the ATT, we restricted the sample used to compute the
predicted marginal risks (`P0`

) and (`P1`

) to just
those with `A = 1`

. If we were instead estimating the ATE, we
would supply `"ATE"`

that to the `estimand`

argument in the call to `matchit()`

and skip the step of
restricting the data used for prediction of the marginal risks.

As with other methods, to estimate the marginal RR or RD using the
above code, the returned object can instead be specified as
`P1 / P0`

or `P1 - P0`

, respectively.

Like ORs, HRs are not collapsible, so it is not straightforward to estimate marginal HRs using within-stratum HRs. Austin (2013a) examined the performance of several propensity score subclassification-based estimators of the marginal HR and found all to be positively biased for non-null effects, consistent with the use of conditional effect estimates as estimates of marginal effects; indeed, the subclassification methods examined all relied on pooling stratum-specific effects. Given these difficulties, the most straightforward method to estimate marginal HRs is to use MMWS weights. We demonstrate this below using essentially the same syntax as used with full matching, only omitting subclass membership as a clustering variable.

```
coxph(Surv(Y_S) ~ A, data = md, robust = TRUE,
weights = weights)
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = md, weights = weights,
## robust = TRUE)
##
## coef exp(coef) se(coef) robust se z p
## A 0.46 1.58 0.05 0.07 7 2e-11
##
## Likelihood ratio test=64 on 1 df, p=1e-15
## n= 2000, number of events= 2000
```

The robust standard error must be used because of the MMWS weights.

As discussed previously, with binary and time-to-event outcomes, marginal and conditional effects generally differ. Though we have primarily focused on marginal effects, we offer some guidance here on estimating conditional effect as well. Estimating conditional effects typically involves modeling the outcome, and inferences are dependent on this model being correct, even in the presence is perfect covariate balance. Matching provides robustness to some forms of misspecification, however, by limiting the range of the covariate space. So, even though outcome modeling is required, matching prior to modeling the outcome can be beneficial in terms of reducing bias due to model misspecification.

On some effect measures, conditional effects typically vary depending on the covariate values at which the conditional effect is to be estimated; for example, the decrease in the risk of death (i.e., on the RD scale) corresponding to receipt of a treatment for a patient with a high baseline risk of death will by higher than that for a patient with a low baseline risk of death. In this sense, there is necessarily effect modification of the RD by baseline risk (which depends on a patient’s covariate values). Similarly, for an exposure that increases the risk by a factor (e.g., doubles the risk, for a RR of 2), a patient with a high baseline risk cannot experience the same change in risk as a patient with a low baseline risk could; for example, a RR of 2 is impossible for a patient with a baseline risk of .6 but is certainly plausible for a patient with a baseline risk of .02. In this sense, there is necessarily effect modification of the RR by baseline risk. This is not true for the OR; it is plausible that the effect on the OR scale could be consistent across levels of levels of the predictors because any OR is compatible with any possible baselines risk. For this reason, we only consider estimating the conditional OR and not other effect measures.

Although several methods exist for estimating conditional ORs after matching or subclassification, the one that generally works well is to run a logistic regression of the outcome on the treatment and covariates and use the coefficient on treatment as an estimate of the effect on the log OR scale. We demonstrate this below using 1:1 nearest neighbor propensity score matching for the ATT:

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mF + X7 + X8 + X9, data = d,
X6 method = "nearest")
<- match.data(mF)
md
<- glm(Y_B ~ A + X1 + X2 + X3 + X4 + X5 +
fitcond + X7 + X8 + X9, data = md,
X6 weights = weights, family = binomial)
coeftest(fitcond, vcov. = vcovCL, cluster = ~subclass)["A",,drop = FALSE]
```

```
## Estimate Std. Error z value Pr(>|z|)
## A 1.072 0.1653 6.482 9.055e-11
```

`exp(coef(fitcond)["A"])`

```
## A
## 2.92
```

As with other covariate-adjusted models, it is important not to interpret the coefficients on covariates other than treatment to avoid committing the table 2 fallacy. Other causes of the outcome can be included in the outcome model even if they were not the focus of matching as long as they are not (even possibly) caused by the treatment.

Alternative methods of estimating conditional effects include conditional logistic regression after matching and estimating stratum-specific effects after subclassification. We recommend using covariate-adjusted logistic regression models instead because the conditional effect is better defined (i.e., conditional on the specific covariates included in the model) and depends less on the estimand targeted (Forbes and Shortreed 2008).

Moderation analysis involves determining whether a treatment effect
differs across levels of another variable. The use of matching with
moderation analysis is described in Green and
Stuart (2014). The goal is to achieve balance
within each subgroup of the potential moderating variable, and there are
several ways of doing so. Broadly, one can either perform matching in
the full dataset, perhaps requiring exact matching on the moderator, or
one can perform completely separate analyses in each subgroup. We’ll
demonstrate both approaches below. The chosen approach should be that
which achieves the best balance, though we don’t demonstrate assessing
balance here to maintain focus on effect estimation. We’ll consider the
binary variable `X5`

to be the potential moderator of the
effect of `A`

on `Y_C`

.

The first approach involves pooling information across subgroups.
This could involve estimating propensity scores using a single model for
both groups but exact matching on the potential moderator. The
propensity score model could include moderator-by-covariate interactions
to allow the propensity score model to vary across subgroups on some
covariates. Below, we’ll estimate a propensity score using a single
propensity score model with a few moderator-by-covariate interactions
and exact matching on the moderator, `X5`

. We’ll perform
nearest neighbor matching on the propensity score.

```
<- matchit(A ~ X1 + X2 + X5*X3 + X4 +
mP *X6 + X7 + X5*X8 + X9, data = d,
X5exact = ~X5, method = "nearest")
mP
```

```
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 882 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X5, X3, X4, X6, X7, X8, X9
```

Although it is straightforward to assess balance overall using
`summary()`

, it is more challenging to assess balance within
subgroups. The easiest way to check subgroup balance would be to use
`cobalt::bal.tab()`

, which has a `cluster`

argument that can be used to assess balance within subgroups, e.g., by
`cobalt::bal.tab(mP, cluster = "X5")`

. See the vignette
“Appendix 2: Using cobalt with Clustered, Multiply Imputed, and Other
Segmented Data” on the `cobalt`

website for
details.

If we are satisfied with balance, we can then estimate the subgroup effects using an outcome model with an interaction between the treatment and the moderator.

```
<- match.data(mP)
mdP
<- lm(Y_C ~ A * X5, data = mdP, weights = weights)
fitP
coeftest(fitP, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7500 0.4515 1.66 0.0970 .
## A 2.2073 0.6703 3.29 0.0010 **
## X5 1.7413 0.6070 2.87 0.0042 **
## A:X5 -0.0275 0.8791 -0.03 0.9751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

A second approach is to perform the matching analyses separately and
then combine the matched samples. We demonstrate this below. First, we
split the data by levels of `X5`

.

```
<- subset(d, X5 == 0)
d_X5_0 <- subset(d, X5 == 1) d_X5_1
```

Next we perform separate matching analyses in each new dataset,

```
<- matchit(A ~ X1 + X2 + X3 + X4 +
mS0 + X7 + X8 + X9, data = d_X5_0,
X6 method = "nearest")
<- matchit(A ~ X1 + X2 + X3 + X4 +
mS1 + X7 + X8 + X9, data = d_X5_1,
X6 method = "nearest")
```

It is straightforward to assess balance within each subgroup using
`summary()`

, and `cobalt`

functions can be used as
well. We omit this step here, but you should not!

To estimate the subgroup effects, we need to combine the matched
datasets, which we can do using `rbind.matchdata()`

(a
special `rbind()`

method for `matchdata`

objects).
Then we estimate the moderation effect just as we did previously.

```
<- match.data(mS0)
mdS0 <- match.data(mS1)
mdS1
<- rbind(mdS0, mdS1)
mdS
<- lm(Y_C ~ A * X5, data = mdS, weights = weights)
fitS
coeftest(fitS, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.720 0.421 1.71 0.08766 .
## A 2.237 0.641 3.49 0.00051 ***
## X5 2.160 0.576 3.75 0.00019 ***
## A:X5 -0.446 0.842 -0.53 0.59670
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

There are benefits to using either approach, and Green and Stuart (2014) find that either can be successful at balancing the subgroups. The first approach may be most effective with small samples, where separate propensity score models would be fit with greater uncertainty and an increased possibility of perfect prediction or failure to converge (Wang et al. 2018). The second approach may be more effective with larger samples or with matching methods that target balance in the matched sample, such as genetic matching (Kreif et al. 2012). With genetic matching, separate subgroup analyses ensure balance is optimized within each subgroup rather than just overall.

It is important to be as thorough and complete as possible when describing the methods of estimating the treatment effect and the results of the analysis. This improves transparency and replicability of the analysis. Results should at least include the following:

- a description of the outcome model used (e.g., logistic regression, a linear model with treatment-covariate interactions and mean-centered covariates, a Cox proportional hazards model with the matching weights applied)
- the way the effect was estimated (e.g., as the coefficient on treatment in the outcome model, as the result of a marginal effects procedure)
- the way standard errors and confidence intervals were estimated (e.g., using robust standard errors, using cluster-robust standard errors with pair membership as the cluster, using the BCa bootstrap with 4999 bootstrap replications and the entire process of matching and effect estimation included in each replication)
- R packages and functions used in estimating the effect and its
standard error (e.g.,
`glm()`

in base R,`vcovCL()`

in`sandwich`

,`boot()`

and`boot.ci()`

in`boot`

) - The effect and its standard error and confidence interval

All this is in addition to information about the matching method, propensity score estimation procedure (if used), balance assessment, etc. mentioned in the other vignettes.

There are a few common mistakes that should be avoided. It is important not only to avoid these mistakes in one’s own research but also to be able to spot these mistakes in others’ analyses.

Several methods involve weights that are to be used in estimating the
treatment effect. With full matching and stratification matching (when
analyzed using MMWS), the weights do the entire work of balancing the
covariates across the treatment groups. Omitting weights essentially
ignores the entire purpose of matching. Some cases are less obvious.
When performing matching with replacement and estimating the treatment
effect using the `match.data()`

output, weights must be
included to ensure control units matched to multiple treated units are
weighted accordingly. Similarly, when performing k:1 matching where not
all treated units receive k matches, weights are required to account for
the differential weight of the matched control units. The only time
weights can be omitted after pair matching is when performing 1:1
matching without replacement. Including weights even in this scenario
will not affect the analysis and it can be good practice to always
include weights to prevent this error from occurring. There are some
scenarios where weights are not useful because the conditioning occurs
through some other means, such as when using the pooling strategy rather
than MMWS for estimating marginal effects after stratification.

Robust standard errors are required when using weights to estimate
the treatment effect. The model-based standard errors resulting from
weighted least squares or maximum likelihood are inaccurate when using
matching weights because they assume weights are frequency weights
rather than probability weights. Cluster-robust standard errors account
for both the matching weights and pair membership and should be used
when appropriate (i.e., with all matching methods other than
stratification matching). Sometimes, researchers use functions in the
`survey`

package to estimate robust standard errors,
especially with inverse probability weighting; this is a valid way to
compute robust standard errors and will give similar results to
`sandwich::vcovHC()`

.

The distinction between marginal and conditional effects is not always clear both in methodological and applied papers. Some statistical methods are valid only for estimating conditional effects and they should not be used to estimate marginal effects (without further modification). Sometimes conditional effects are desirable, and such methods may be useful for them, but when marginal effects are the target of inference, it is critical not to inappropriately interpret estimates resulting from statistical methods aimed at estimating conditional effects as marginal effects. Although this issue is particularly salient with binary and survival outcomes due to the general noncollapsibility of the OR, RR, and HR, this can also occur with linear models for continuous outcomes or the RD.

The following methods estimate **conditional effects**
for binary or survival outcomes (with noncollapsible effect measures)
and should **not** be used to estimate marginal
effects:

- Logistic regression or Cox proportional hazards model with covariates and/or the propensity score included, using the coefficient on treatment as the effect estimate
- Conditional logistic regression after matching
- Stratified Cox regression after matching
- Averaging stratum-specific effect estimates after stratification, including using Mantel-Haenszel OR pooling
- Including pair or stratum fixed or random effects in a logistic regression model, using the coefficient on treatment as the effect estimate

In addition, with continuous outcomes, conditional effects can be
mistakenly interpreted as marginal effect estimates when
treatment-covariate interactions are present in the outcome model. If
the covariates are not centered at their mean in the target population
(e.g., the treated group for the ATT, the full sample for the ATE, or
the remaining matched sample for an ATM), the coefficient on treatment
will not correspond to the marginal effect in the target population; it
will correspond to the effect of treatment when the covariate values are
equal to zero, which may not be meaningful or plausible. Marginal
effects procedures (e.g., using the `margins`

package or
manually with bootstrapping as demonstrated above) are always the safest
way to include covariates in the outcome model, especially in the
presence of treatment-covariate interactions. Appropriately centering
the covariates is a shortcut that is required when using the coefficient
on treatment as a marginal effect estimate for continuous outcomes
(demonstrated previously for full matching).

Abadie, Alberto, and Guido W. Imbens. 2008. “On the Failure of the
Bootstrap for Matching Estimators.” *Econometrica* 76 (6):
1537–57. https://www.jstor.org/stable/40056514.

Abadie, Alberto, and Jann Spiess. 2019. “Robust Post-Matching
Inference,” January, 34.

Austin, Peter C. 2007. “The Performance of Different Propensity
Score Methods for Estimating Marginal Odds Ratios.”
*Statistics in Medicine* 26 (16): 3078–94. https://doi.org/10.1002/sim.2781.

———. 2009. “Type i Error Rates, Coverage of Confidence Intervals,
and Variance Estimation in Propensity-Score Matched Analyses.”
*The International Journal of Biostatistics* 5 (1). https://doi.org/10.2202/1557-4679.1146.

———. 2013a. “The Performance of Different Propensity Score Methods
for Estimating Marginal Hazard Ratios.” *Statistics in
Medicine* 32 (16): 2837–49. https://doi.org/10.1002/sim.5705.

———. 2013b. “The Use of Propensity Score Methods with Survival or
Time-to-Event Outcomes: Reporting Measures of Effect Similar to Those
Used in Randomized Experiments.” *Statistics in Medicine*
33 (7): 1242–58. https://doi.org/10.1002/sim.5984.

Austin, Peter C., and Guy Cafri. 2020. “Variance Estimation When
Using Propensity-Score Matching with Replacement with
Survival or Time-to-Event Outcomes.”
*Statistics in Medicine* 39 (11): 1623–40. https://doi.org/10.1002/sim.8502.

Austin, Peter C., and Dylan S. Small. 2014. “The Use of
Bootstrapping When Using Propensity-Score Matching Without Replacement:
A Simulation Study.” *Statistics in Medicine* 33 (24):
4306–19. https://doi.org/10.1002/sim.6276.

Austin, Peter C., and Elizabeth A. Stuart. 2015. “The Performance
of Inverse Probability of Treatment Weighting and Full Matching on the
Propensity Score in the Presence of Model Misspecification When
Estimating the Effect of Treatment on Survival Outcomes.”
*Statistical Methods in Medical Research* 26 (4): 1654–70. https://doi.org/10.1177/0962280215584401.

———. 2015. “Optimal Full Matching for Survival Outcomes: A Method
That Merits More Widespread Use.” *Statistics in Medicine*
34 (30): 3949–67. https://doi.org/10.1002/sim.6602.

———. 2017. “Estimating the Effect of Treatment on Binary Outcomes
Using Full Matching on the Propensity Score.” *Statistical
Methods in Medical Research* 26 (6): 2505–25. https://doi.org/10.1177/0962280215601134.

Austin, Peter C., Neal Thomas, and Donald B. Rubin. 2020.
“Covariate-Adjusted Survival Analyses in Propensity-Score Matched
Samples: Imputing Potential Time-to-Event Outcomes.”
*Statistical Methods in Medical Research* 29 (3): 728–51. https://doi.org/10.1177/0962280218817926.

Bodory, Hugo, Lorenzo Camponovo, Martin Huber, and Michael Lechner.
2020. “The Finite Sample Performance of Inference Methods for
Propensity Score Matching and Weighting Estimators.” *Journal
of Business & Economic Statistics* 38 (1): 183–200. https://doi.org/10.1080/07350015.2018.1476247.

Cameron, A. Colin, and Douglas L. Miller. 2015. “A
Practitioner’s Guide to Cluster-Robust Inference.”
*Journal of Human Resources* 50 (2): 317–72. https://doi.org/10.3368/jhr.50.2.317.

Carpenter, James, and John Bithell. 2000. “Bootstrap Confidence
Intervals: When, Which, What? A Practical Guide for Medical
Statisticians.” *Statistics in Medicine* 19 (9): 1141–64.
https://doi.org/10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F.

Efron, Bradley, and Robert J. Tibshirani. 1993. *An Introduction to
the Bootstrap*. Springer US.

Forbes, Andrew, and Susan Shortreed. 2008. “Inverse Probability
Weighted Estimation of the Marginal Odds Ratio: Correspondence Regarding
‘The Performance of Different Propensity Score Methods for
Estimating Marginal Odds Ratios’ by P. Austin, Statictics
in Medicine, 2007; 26:30783094.” *Statistics in
Medicine* 27 (26): 5556–59. https://doi.org/10.1002/sim.3362.

Gayat, Etienne, Matthieu Resche-Rigon, Jean-Yves Mary, and Raphaël
Porcher. 2012. “Propensity Score Applied to Survival Data Analysis
Through Proportional Hazards Models: A Monte Carlo Study.”
*Pharmaceutical Statistics* 11 (3): 222–29. https://doi.org/10.1002/pst.537.

Green, Kerry M., and Elizabeth A. Stuart. 2014. “Examining
Moderation Analyses in Propensity Score Methods:
Application to Depression and Substance Use.”
*Journal of Consulting and Clinical Psychology*, Advances in
Data Analytic Methods, 82 (5): 773–83. https://doi.org/10.1037/a0036515.

Hill, Jennifer, and Jerome P. Reiter. 2006. “Interval Estimation
for Treatment Effects Using Propensity Score Matching.”
*Statistics in Medicine* 25 (13): 2230–56. https://doi.org/10.1002/sim.2277.

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2007.
“Matching as Nonparametric Preprocessing for Reducing Model
Dependence in Parametric Causal Inference.” *Political
Analysis* 15 (3): 199–236. https://doi.org/10.1093/pan/mpl013.

Hong, Guanglei. 2010. “Marginal Mean Weighting Through
Stratification: Adjustment for Selection Bias in Multilevel
Data.” *Journal of Educational and Behavioral Statistics*
35 (5): 499–531. https://doi.org/10.3102/1076998609359785.

King, Gary, and Margaret E. Roberts. 2015. “How Robust Standard
Errors Expose Methodological Problems They Do Not Fix, and What to Do
About It.” *Political Analysis* 23 (2): 159–79. https://doi.org/10.1093/pan/mpu015.

Kreif, Noemi, Richard Grieve, Rosalba Radice, Zia Sadique, Roland
Ramsahai, and Jasjeet S. Sekhon. 2012. “Methods for Estimating
Subgroup Effects in Cost-Effectiveness Analyses That Use Observational
Data.” *Medical Decision Making* 32 (6): 750–63. https://doi.org/10.1177/0272989X12448929.

Liang, Kung-Yee, and Scott L. Zeger. 1986. “Longitudinal Data
Analysis Using Generalized Linear Models.” *Biometrika* 73
(1): 13–22. https://doi.org/10.1093/biomet/73.1.13.

MacKinnon, James G. 2006. “Bootstrap Methods in
Econometrics*.” *Economic Record* 82 (s1): S2–18. https://doi.org/10.1111/j.1475-4932.2006.00328.x.

MacKinnon, James G., and Halbert White. 1985. “Some
Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved
Finite Sample Properties.” *Journal of Econometrics* 29
(3): 305–25. https://doi.org/10.1016/0304-4076(85)90158-7.

Nguyen, Tri-Long, Gary S. Collins, Jessica Spence, Jean-Pierre Daurès,
P. J. Devereaux, Paul Landais, and Yannick Le Manach. 2017.
“Double-Adjustment in Propensity Score Matching Analysis: Choosing
a Threshold for Considering Residual Imbalance.” *BMC Medical
Research Methodology* 17: 78. https://doi.org/10.1186/s12874-017-0338-0.

Rudolph, Kara E., K. Ellicott Colson, Elizabeth A. Stuart, and Jennifer
Ahern. 2016. “Optimally Combining Propensity Score
Subclasses.” *Statistics in Medicine* 35 (27): 4937–47. https://doi.org/10.1002/sim.7046.

Schafer, Joseph L., and Joseph Kang. 2008. “Average Causal Effects
from Nonrandomized Studies: A Practical Guide and Simulated
Example.” *Psychological Methods* 13 (4): 279–313. https://doi.org/10.1037/a0014268.

Snowden, Jonathan M., Sherri Rose, and Kathleen M. Mortimer. 2011.
“Implementation of G-Computation on a Simulated Data Set:
Demonstration of a Causal Inference Technique.” *American
Journal of Epidemiology* 173 (7): 731–38. https://doi.org/10.1093/aje/kwq472.

Stampf, Susanne, Erika Graf, Claudia Schmoor, and Martin Schumacher.
2010. “Estimators and Confidence Intervals for the Marginal Odds
Ratio Using Logistic Regression and Propensity Score
Stratification.” *Statistics in Medicine* 29 (7-8):
760–69. https://doi.org/10.1002/sim.3811.

Wan, Fei. 2019. “Matched or Unmatched Analyses with
Propensity-Scorematched Data?”
*Statistics in Medicine* 38 (2): 289–300. https://doi.org/10.1002/sim.7976.

Wang, Shirley V., Yinzhu Jin, Bruce Fireman, Susan Gruber, Mengdong He,
Richard Wyss, HoJin Shin, et al. 2018. “Relative
Performance of Propensity Score Matching
Strategies for Subgroup Analyses.”
*American Journal of Epidemiology* 187 (8): 1799–1807. https://doi.org/10.1093/aje/kwy049.

Westreich, D., and S. Greenland. 2013. “The Table 2 Fallacy:
Presenting and Interpreting Confounder and Modifier
Coefficients.” *American Journal of Epidemiology* 177 (4):
292–98. https://doi.org/10.1093/aje/kws412.

```
#Generating data similar to Austin (2009) for demonstrating treatment effect estimation
<- function(n) {
gen_X <- matrix(rnorm(9 * n), nrow = n, ncol = 9)
X 5] <- as.numeric(X[,5] < .5)
X[,
X
}
#~20% treated
<- function(X) {
gen_A <- - 1.2 + log(2)*X[,1] - log(1.5)*X[,2] + log(2)*X[,4] - log(2.4)*X[,5] + log(2)*X[,7] - log(1.5)*X[,8]
LP_A <- plogis(LP_A)
P_A rbinom(nrow(X), 1, P_A)
}
# Continuous outcome
<- function(A, X) {
gen_Y_C 2*A + 2*X[,1] + 2*X[,2] + 2*X[,3] + 1*X[,4] + 2*X[,5] + 1*X[,6] + rnorm(length(A), 0, 5)
}#Conditional:
# MD: 2
#Marginal:
# MD: 2
# Binary outcome
<- function(A, X) {
gen_Y_B <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6]
LP_B <- plogis(LP_B)
P_B rbinom(length(A), 1, P_B)
}#Conditional:
# OR: 2.4
# logOR: .875
#Marginal:
# RD: .144
# RR: 1.54
# logRR: .433
# OR: 1.92
# logOR .655
# Survival outcome
<- function(A, X) {
gen_Y_S <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6]
LP_S sqrt(-log(runif(length(A)))*2e4*exp(-LP_S))
}#Conditional:
# HR: 2.4
# logHR: .875
#Marginal:
# HR: 1.57
# logHR: .452
set.seed(19599)
<- 2000
n <- gen_X(n)
X <- gen_A(X)
A
<- gen_Y_C(A, X)
Y_C <- gen_Y_B(A, X)
Y_B <- gen_Y_S(A, X)
Y_S
<- data.frame(A, X, Y_C, Y_B, Y_S) d
```

Because they are only appropriate with a large number of clusters, cluster-robust standard errors are generally not used with subclassification methods. Regular robust standard errors are valid with these methods when using the subclassification weights to estimate marginal effects.↩︎

Sometimes, an error will occur with this method, which usually means more bootstrap replications are required. The number of replicates must be greater than the original sample size when using the full bootstrap and greater than the number of pairs/strata when using the block bootstrap.↩︎

The matching weights are not necessary when performing 1:1 matching, but we include them here for generality. When weights are not necessary, including them does not affect the estimates. Because it may not always be clear when weights are required, we recommend always including them.↩︎

It is possible to exactly reproduce the

`match.data()`

standard error using the`get_matches()`

data, but doing so may require some fiddling due to the defaults in`sandwich`

.↩︎It is also known as fine stratification weighting, described by Desai et al. [-@desai2017].↩︎

Including subclass as a main effect in the MMWS-weighted regression will not change the effect estimate but may slightly decrease its estimated standard error.↩︎