This vignette provides a short review of effect sizes for common hypothesis tests (in R
these are usually achieved with various *.test()
functions).
library(effectsize)
library(BayesFactor)
In most cases, the effect sizes can be automagically extracted from the htest
object via the effectsize()
function.
For t-tests, it is common to report an effect size representing a standardized difference between the two compared samples’ means. These measures range from \(-\infty\) to \(+\infty\), with negative values indicating the second group’s mean is larger (and vice versa).
For two independent samples, the difference between the means is standardized based on the pooled standard deviation of both samples (assumed to be equal in the population):
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
>
> Two Sample t-test
>
> data: mpg by am
> t = -4, df = 30, p-value = 3e-04
> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
> 95 percent confidence interval:
> -10.85 -3.64
> sample estimates:
> mean in group 0 mean in group 1
> 17.1 24.4
cohens_d(mpg ~ am, data = mtcars)
> Cohen's d | 95% CI
> --------------------------
> -1.48 | [-2.27, -0.67]
>
> - Estimated using pooled SD.
Hedges’ g provides a bias correction for small sample sizes (\(N < 20\)).
hedges_g(mpg ~ am, data = mtcars)
> Hedges' g | 95% CI
> --------------------------
> -1.44 | [-2.21, -0.65]
>
> - Estimated using pooled SD.
If variances cannot be assumed to be equal, it is possible to get estimates that are not based on the pooled standard deviation:
t.test(mpg ~ am, data = mtcars, var.equal = FALSE)
>
> Welch Two Sample t-test
>
> data: mpg by am
> t = -4, df = 18, p-value = 0.001
> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
> 95 percent confidence interval:
> -11.28 -3.21
> sample estimates:
> mean in group 0 mean in group 1
> 17.1 24.4
cohens_d(mpg ~ am, data = mtcars, pooled_sd = FALSE)
> Cohen's d | 95% CI
> --------------------------
> -1.41 | [-2.26, -0.53]
>
> - Estimated using un-pooled SD.
hedges_g(mpg ~ am, data = mtcars, pooled_sd = FALSE)
> Hedges' g | 95% CI
> --------------------------
> -1.35 | [-2.17, -0.51]
>
> - Estimated using un-pooled SD.
In cases where the differences between the variances are substantial, it is also common to standardize the difference based only on the standard deviation of one of the groups (usually the “control” group); this effect size is known as Glass’ \(\Delta\) (delta) (Note that the standard deviation is taken from the second sample).
glass_delta(mpg ~ am, data = mtcars)
> Glass' delta | 95% CI
> -----------------------------
> -1.17 | [-1.93, -0.39]
For a one-sided hypothesis, it is also possible to construct one-sided confidence intervals:
t.test(mpg ~ am, data = mtcars, var.equal = TRUE, alternative = "less")
>
> Two Sample t-test
>
> data: mpg by am
> t = -4, df = 30, p-value = 1e-04
> alternative hypothesis: true difference in means between group 0 and group 1 is less than 0
> 95 percent confidence interval:
> -Inf -4.25
> sample estimates:
> mean in group 0 mean in group 1
> 17.1 24.4
cohens_d(mpg ~ am, data = mtcars, pooled_sd = TRUE, alternative = "less")
> Cohen's d | 95% CI
> -------------------------
> -1.48 | [-Inf, -0.80]
>
> - Estimated using pooled SD.
> - One-sided CIs: lower bound fixed at (-Inf).
In the case of a one-sample test, the effect size represents the standardized distance of the mean of the sample from the null value. For paired-samples, the difference between the paired samples is used:
t.test(extra ~ group, data = sleep, paired = TRUE)
>
> Paired t-test
>
> data: extra by group
> t = -4, df = 9, p-value = 0.003
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> -2.46 -0.70
> sample estimates:
> mean of the differences
> -1.58
cohens_d(extra ~ group, data = sleep, paired = TRUE)
> Cohen's d | 95% CI
> --------------------------
> -1.28 | [-2.23, -0.44]
hedges_g(extra ~ group, data = sleep, paired = TRUE)
> Hedges' g | 95% CI
> --------------------------
> -1.17 | [-2.04, -0.40]
<- ttestBF(mtcars$mpg[mtcars$am == 0], mtcars$mpg[mtcars$am == 1])) (BFt
> Bayes factor analysis
> --------------
> [1] Alt., r=0.707 : 86.6 ±0%
>
> Against denominator:
> Null, mu1-mu2 = 0
> ---
> Bayes factor type: BFindepSample, JZS
effectsize(BFt, test = NULL)
> Summary of Posterior Distribution
>
> Parameter | Median | 95% CI
> -----------------------------------
> Cohens_d | -1.29 | [-2.09, -0.51]
For more details, see ANOVA vignette.
<- oneway.test(mpg ~ gear, data = mtcars, var.equal = TRUE)
onew
eta_squared(onew)
> # Effect Size for ANOVA
>
> Eta2 | 95% CI
> -------------------
> 0.43 | [0.18, 1.00]
>
> - One-sided CIs: upper bound fixed at (1).
For contingency tables Cramér’s V, \(\phi\) (Phi, also known as Cohen’s w) and Pearson’s contingency coefficient indicate the strength of association with 0 indicating no association between the variables. While Cramér’s V and Pearson’s C are capped at 1 (perfect association), \(\phi\) can be larger than 1.
<- matrix(
(Music c(
150, 130, 35, 55,
100, 50, 10, 40,
165, 65, 2, 25
),byrow = TRUE, nrow = 3,
dimnames = list(
Study = c("Psych", "Econ", "Law"),
Music = c("Pop", "Rock", "Jazz", "Classic")
) ))
> Music
> Study Pop Rock Jazz Classic
> Psych 150 130 35 55
> Econ 100 50 10 40
> Law 165 65 2 25
chisq.test(Music)
>
> Pearson's Chi-squared test
>
> data: Music
> X-squared = 52, df = 6, p-value = 2e-09
cramers_v(Music)
> Cramer's V | 95% CI
> -------------------------
> 0.18 | [0.13, 1.00]
>
> - One-sided CIs: upper bound fixed at (1).
phi(Music)
> Phi | 95% CI
> -------------------
> 0.25 | [0.18, 1.41]
>
> - One-sided CIs: upper bound fixed at (1.4142135623731).
pearsons_c(Music)
> Pearson's C | 95% CI
> --------------------------
> 0.24 | [0.18, 1.00]
>
> - One-sided CIs: upper bound fixed at (1).
Pearson’s C and \(\phi\) are also applicable to tests of goodness-of-fit, where small values indicate no deviation from the hypothetical probabilities and large values indicate… large deviation from the hypothetical probabilities.
<- c(89, 37, 130, 28, 2) # observed group sizes
O <- c(.40, .20, .20, .15, .05) # expected group freq
E
chisq.test(O, p = E, rescale.p = TRUE)
>
> Chi-squared test for given probabilities
>
> data: O
> X-squared = 121, df = 4, p-value <2e-16
pearsons_c(O, p = E, rescale.p = TRUE)
> Pearson's C | 95% CI
> --------------------------
> 0.55 | [0.48, 1.00]
>
> - One-sided CIs: upper bound fixed at (1).
phi(O, p = E, rescale.p = TRUE)
> Phi | 95% CI
> -----------------------
> 0.65 | [0.54, Inf]
>
> - One-sided CIs: upper bound fixed at (Inf).
These can also be extracted from the equivalent Bayesian test:
<- contingencyTableBF(Music, sampleType = "jointMulti")) (BFX
> Bayes factor analysis
> --------------
> [1] Non-indep. (a=1) : 10053377 ±0%
>
> Against denominator:
> Null, independence, a = 1
> ---
> Bayes factor type: BFcontingencyTable, joint multinomial
effectsize(BFX, type = "cramers_v", test = NULL)
> Summary of Posterior Distribution
>
> Parameter | Median | 95% CI
> ---------------------------------
> Cramers_v | 0.18 | [0.14, 0.22]
effectsize(BFX, type = "phi", test = NULL)
> Summary of Posterior Distribution
>
> Parameter | Median | 95% CI
> ---------------------------------
> phi | 0.26 | [0.19, 0.31]
effectsize(BFX, type = "pearsons_c", test = NULL)
> Summary of Posterior Distribution
>
> Parameter | Median | 95% CI
> ----------------------------------
> pearsons_c | 0.25 | [0.19, 0.30]
For \(2\times 2\) tables, in addition to Cramér’s V, \(\phi\) and Pearson’s C, we can also compute the Odds-ratio (OR), where each column represents a different group. Values larger than 1 indicate that the odds are higher in the first group (and vice versa).
<- matrix(
(RCT c(
71, 30,
50, 100
),nrow = 2, byrow = TRUE,
dimnames = list(
Diagnosis = c("Sick", "Recovered"),
Group = c("Treatment", "Control")
) ))
> Group
> Diagnosis Treatment Control
> Sick 71 30
> Recovered 50 100
chisq.test(RCT) # or fisher.test(RCT)
>
> Pearson's Chi-squared test with Yates' continuity correction
>
> data: RCT
> X-squared = 32, df = 1, p-value = 2e-08
oddsratio(RCT)
> Odds ratio | 95% CI
> -------------------------
> 4.73 | [2.74, 8.17]
We can also compute the Risk-ratio (RR), which is the ratio between the proportions of the two groups - a measure which some claim is more intuitive.
riskratio(RCT)
> Risk ratio | 95% CI
> -------------------------
> 2.54 | [1.87, 3.45]
Additionally, Cohen’s h can also be computed, which uses the arcsin transformation. Negative values indicate smaller proportion in the first group (and vice versa).
cohens_h(RCT)
> Cohen's h | 95% CI
> ------------------------
> 0.74 | [0.50, 0.99]
For dependent (paired) contingency tables, Cohen’s g represents the symmetry of the table, ranging between 0 (perfect symmetry) and 0.5 (perfect asymmetry).
<- matrix(
(Performance c(
794, 86,
150, 570
),nrow = 2, byrow = TRUE,
dimnames = list(
"1st Survey" = c("Approve", "Disapprove"),
"2nd Survey" = c("Approve", "Disapprove")
) ))
> 2nd Survey
> 1st Survey Approve Disapprove
> Approve 794 86
> Disapprove 150 570
mcnemar.test(Performance)
>
> McNemar's Chi-squared test with continuity correction
>
> data: Performance
> McNemar's chi-squared = 17, df = 1, p-value = 4e-05
cohens_g(Performance)
> Cohen's g | 95% CI
> ------------------------
> 0.14 | [0.07, 0.19]
Rank based tests get rank based effect sizes!
For two independent samples, the rank-biserial correlation (\(r_{rb}\)) is a measure of relative superiority - i.e., larger values indicate a higher probability of a randomly selected observation from X being larger than randomly selected observation from Y. A value of \((-1)\) indicates that all observations in the second group are larger than the first, and a value of \((+1)\) indicates that all observations in the first group are larger than the second.
<- c(48, 48, 77, 86, 85, 85)
A <- c(14, 34, 34, 77)
B
wilcox.test(A, B) # aka Mann–Whitney U test
>
> Wilcoxon rank sum test with continuity correction
>
> data: A and B
> W = 22, p-value = 0.05
> alternative hypothesis: true location shift is not equal to 0
rank_biserial(A, B)
> r (rank biserial) | 95% CI
> --------------------------------
> 0.79 | [0.30, 0.95]
For one sample, \(r_{rb}\) measures the symmetry around \(\mu\) (mu; the null value), with 0 indicating perfect symmetry, \((-1)\) indicates that all observations fall below \(\mu\), and \((+1)\) indicates that all observations fall above \(\mu\). For paired samples the difference between the paired samples is used:
<- c(1.15, 0.88, 0.90, 0.74, 1.21, 1.36, 0.89)
x
wilcox.test(x, mu = 1) # aka Signed-Rank test
>
> Wilcoxon signed rank exact test
>
> data: x
> V = 16, p-value = 0.8
> alternative hypothesis: true location is not equal to 1
rank_biserial(x, mu = 1)
> r (rank biserial) | 95% CI
> ---------------------------------
> 0.14 | [-0.59, 0.75]
>
> - Deviation from a difference of 1.
<- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
x <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
y
wilcox.test(x, y, paired = TRUE) # aka Signed-Rank test
>
> Wilcoxon signed rank exact test
>
> data: x and y
> V = 40, p-value = 0.04
> alternative hypothesis: true location shift is not equal to 0
rank_biserial(x, y, paired = TRUE)
> r (rank biserial) | 95% CI
> --------------------------------
> 0.78 | [0.30, 0.94]
The Rank-Epsilon-Squared (\(\varepsilon^2\)) is a measure of association for the rank based one-way ANOVA. Values range between 0 (no relative superiority between any of the groups) to 1 (complete separation - with no overlap in ranks between the groups).
<- list(
group_data g1 = c(2.9, 3.0, 2.5, 2.6, 3.2), # normal subjects
g2 = c(3.8, 2.7, 4.0, 2.4), # with obstructive airway disease
g3 = c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis
)
kruskal.test(group_data)
>
> Kruskal-Wallis rank sum test
>
> data: group_data
> Kruskal-Wallis chi-squared = 0.8, df = 2, p-value = 0.7
rank_epsilon_squared(group_data)
> Epsilon2 (rank) | 95% CI
> ------------------------------
> 0.06 | [0.01, 1.00]
>
> - One-sided CIs: upper bound fixed at (1).
For a rank based repeated measures one-way ANOVA, Kendall’s W is a measure of agreement on the effect of condition between various “blocks” (the subjects), or more often conceptualized as a measure of reliability of the rating / scores of observations (or “groups”) between “raters” (“blocks”).
# Subjects are COLUMNS
<- matrix(
(ReactionTimes c(398, 338, 520,
325, 388, 555,
393, 363, 561,
367, 433, 470,
286, 492, 536,
362, 475, 496,
253, 334, 610),
nrow = 7, byrow = TRUE,
dimnames = list(
paste0("Subject", 1:7),
c("Congruent", "Neutral", "Incongruent")
) ))
> Congruent Neutral Incongruent
> Subject1 398 338 520
> Subject2 325 388 555
> Subject3 393 363 561
> Subject4 367 433 470
> Subject5 286 492 536
> Subject6 362 475 496
> Subject7 253 334 610
friedman.test(ReactionTimes)
>
> Friedman rank sum test
>
> data: ReactionTimes
> Friedman chi-squared = 11, df = 2, p-value = 0.004
kendalls_w(ReactionTimes)
> Kendall's W | 95% CI
> --------------------------
> 0.80 | [0.76, 1.00]
>
> - One-sided CIs: upper bound fixed at (1).