This vignette provides a set of examples that nearly exhaustively demonstrate the functionality of infer. Commentary on these examples is limited; for more discussion of the intuition behind the package, see the “Getting to Know infer” vignette, accessible by calling vignette("infer").
Throughout this vignette, we’ll make use of the gss dataset supplied by infer, which contains a sample of data from the General Social Survey. See ?gss for more information on the variables included and their source. Note that these data (and our examples using them) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
# load in the dataset
data(gss)
# take a look at its structure
dplyr::glimpse(gss)
## Rows: 500
## Columns: 11
## $ year <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2000, 1998, 20…
## $ age <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 38, 49, 25, 56…
## $ sex <fct> male, female, male, male, male, female, female, female, female…
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no de…
## $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem, dem, ind, de…
## $ hompop <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5,…
## $ hours <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 38, 72, 48, 40…
## $ income <ord> $25000 or more, $20000 - 24999, $25000 or more, $25000 or more…
## $ class <fct> middle class, working class, working class, working class, mid…
## $ finrela <fct> below average, below average, below average, above average, ab…
## $ weight <dbl> 0.8960, 1.0825, 0.5501, 1.0864, 1.0825, 1.0864, 1.0627, 0.4785…
Calculating the observed statistic,
x_bar <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")
Alternatively, using the observe() wrapper to calculate the observed statistic,
x_bar <- gss %>%
  observe(response = hours, stat = "mean")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "mean")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = x_bar, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = x_bar, direction = "two-sided")
p_value |
---|
0.026 |
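To make the mechanics of a point-null test for a mean more concrete, here is a rough base-R sketch of one common way to simulate such a null distribution: shift the sample so that its mean equals the hypothesized value, resample with replacement, and compare the resampled means with the observed one. The vector hours_sim and the other names are made up for illustration (this is simulated data, not gss), and the sketch is not meant to reproduce infer's internals exactly.

```r
set.seed(1)

# Hypothetical stand-in for a numeric sample (not the real gss data)
hours_sim <- rnorm(200, mean = 41, sd = 14)
mu_null <- 40

# Observed statistic
x_bar_sim <- mean(hours_sim)

# Shift the sample so that its mean equals the hypothesized value
shifted <- hours_sim - x_bar_sim + mu_null

# Resample the shifted data with replacement and record each mean
null_means <- replicate(1000, mean(sample(shifted, replace = TRUE)))

# Two-sided p-value: double the smaller tail proportion, capped at 1
p_lower <- mean(null_means <= x_bar_sim)
p_upper <- mean(null_means >= x_bar_sim)
p_two_sided <- min(1, 2 * min(p_lower, p_upper))
p_two_sided
```

Because the null distribution is simulated, the p-value changes slightly from run to run unless a seed is set.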
Calculating the observed statistic,
t_bar <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
Alternatively, using the observe() wrapper to calculate the observed statistic,
t_bar <- gss %>%
  observe(response = hours, null = "point", mu = 40, stat = "t")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "t")
Alternatively, finding the null distribution using theoretical methods with the assume() verb,
null_dist_theory <- gss %>%
  specify(response = hours) %>%
  assume("t")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = t_bar, direction = "two-sided")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = t_bar, direction = "two-sided")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = t_bar, direction = "two-sided")
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = t_bar, direction = "two-sided")
p_value |
---|
0.044 |
Alternatively, using the t_test wrapper:
gss %>%
  t_test(response = hours, mu = 40)
statistic | t_df | p_value | alternative | estimate | lower_ci | upper_ci |
---|---|---|---|---|---|---|
2.085 | 499 | 0.0376 | two.sided | 41.38 | 40.08 | 42.68 |
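For readers who want to see where a one-sample t statistic and its theory-based p-value come from, the following base-R sketch computes both by hand and checks them against stats::t.test(). The data are simulated and the object names are hypothetical, so the numbers will not match the gss output above.

```r
set.seed(2)

# Hypothetical numeric sample (not the real gss data)
hours_sim <- rnorm(100, mean = 41, sd = 14)
mu_null <- 40
n <- length(hours_sim)

# t = (sample mean - hypothesized mean) / standard error
t_by_hand <- (mean(hours_sim) - mu_null) / (sd(hours_sim) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = n - 1, lower.tail = FALSE)

# Compare with the built-in test
res <- t.test(hours_sim, mu = mu_null)
all.equal(t_by_hand, unname(res$statistic))
all.equal(p_by_hand, res$p.value)
```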
infer does not support testing on one numerical variable via the z distribution.
Calculating the observed statistic,
x_tilde <- gss %>%
  specify(response = age) %>%
  calculate(stat = "median")
Alternatively, using the observe() wrapper to calculate the observed statistic,
x_tilde <- gss %>%
  observe(response = age, stat = "median")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(response = age) %>%
  hypothesize(null = "point", med = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "median")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = x_tilde, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = x_tilde, direction = "two-sided")
p_value |
---|
0.008 |
Calculating the observed statistic,
p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  calculate(stat = "prop")
Alternatively, using the observe() wrapper to calculate the observed statistic,
p_hat <- gss %>%
  observe(response = sex, success = "female", stat = "prop")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = p_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = p_hat, direction = "two-sided")
p_value |
---|
0.248 |
Note that logical variables will be coerced to factors:
null_dist <- gss %>%
  dplyr::mutate(is_female = (sex == "female")) %>%
  specify(response = is_female, success = "TRUE") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")
Calculating the observed statistic,
p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  calculate(stat = "z")
Alternatively, using the observe() wrapper to calculate the observed statistic,
p_hat <- gss %>%
  observe(response = sex, success = "female", null = "point", p = .5, stat = "z")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "z")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = p_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = p_hat, direction = "two-sided")
p_value |
---|
0.244 |
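As an illustration of what drawing from a point null for a proportion can look like, the base-R sketch below simulates success counts under the hypothesized proportion and standardizes them into z statistics. The observed count x_obs is invented for the example, and the approach is only an approximation of what generate(type = "draw") does, not a copy of it.

```r
set.seed(3)

n <- 500        # sample size
p_null <- 0.5   # hypothesized proportion
x_obs <- 263    # hypothetical observed number of successes

# Standard error of a sample proportion under the null
se_null <- sqrt(p_null * (1 - p_null) / n)

# Observed proportion and its z statistic
p_hat_sim <- x_obs / n
z_obs <- (p_hat_sim - p_null) / se_null

# Draw 1000 samples of size n under the null and standardize each proportion
p_draws <- rbinom(1000, size = n, prob = p_null) / n
z_draws <- (p_draws - p_null) / se_null

# Two-sided simulation-based p-value
p_value_sim <- min(1, 2 * min(mean(z_draws <= z_obs), mean(z_draws >= z_obs)))
p_value_sim
```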
The package also supplies a wrapper around prop.test for tests of a single proportion on tidy data.
prop_test(gss,
  college ~ NULL,
  p = .2)
statistic | chisq_df | p_value | alternative |
---|---|---|---|
635.6 | 1 | 0 | two.sided |
infer does not support testing two means via the z distribution.
The infer package provides several statistics for working with two categorical variables that each have two levels. One of them is the difference in proportions.
Calculating the observed statistic,
d_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "diff in props", order = c("female", "male"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
d_hat <- gss %>%
  observe(college ~ sex, success = "no degree",
          stat = "diff in props", order = c("female", "male"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "diff in props", order = c("female", "male"))
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = d_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
p_value |
---|
1 |
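The independence null used above can be pictured as shuffling the response labels across groups. The following base-R sketch does this for a difference in proportions on simulated data; the group sizes, success rate, and helper function are all hypothetical, and the code is an illustration of the permutation idea rather than infer's implementation.

```r
set.seed(4)

# Hypothetical two-group data: TRUE marks a "success"
group <- rep(c("female", "male"), times = c(120, 110))
outcome <- rbinom(length(group), size = 1, prob = 0.6) == 1

# Difference in success proportions, female minus male
diff_in_props <- function(outcome, group) {
  mean(outcome[group == "female"]) - mean(outcome[group == "male"])
}

d_obs <- diff_in_props(outcome, group)

# Permute the outcome to break any association with group
null_diffs <- replicate(1000, diff_in_props(sample(outcome), group))

# Two-sided p-value
p_perm <- min(1, 2 * min(mean(null_diffs <= d_obs), mean(null_diffs >= d_obs)))
p_perm
```

The subtraction order in the helper plays the role of the order argument: swapping the two groups flips the sign of the statistic.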
infer also provides functionality to calculate ratios of proportions. The workflow looks similar to that for diff in props.
Calculating the observed statistic,
r_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "ratio of props", order = c("female", "male"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
r_hat <- gss %>%
  observe(college ~ sex, success = "no degree",
          stat = "ratio of props", order = c("female", "male"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "ratio of props", order = c("female", "male"))
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = r_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = r_hat, direction = "two-sided")
p_value |
---|
1 |
In addition, the package provides functionality to calculate odds ratios. The workflow also looks similar to that for diff in props.
Calculating the observed statistic,
or_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "odds ratio", order = c("female", "male"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "odds ratio", order = c("female", "male"))
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = or_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = or_hat, direction = "two-sided")
p_value |
---|
1 |
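To show how the three statistics for two-level categorical data relate to one another, here is a small base-R sketch that computes the difference in proportions, the ratio of proportions, and the odds ratio from a single 2x2 table. The counts are invented for illustration and are not taken from gss.

```r
# Hypothetical 2x2 table of counts (rows: outcome, columns: group)
tab <- matrix(
  c(90, 30,   # female: no degree, degree
    80, 40),  # male:   no degree, degree
  nrow = 2,
  dimnames = list(outcome = c("no degree", "degree"),
                  sex = c("female", "male"))
)

# Proportion of "no degree" (the success) within each group
p_female <- tab["no degree", "female"] / sum(tab[, "female"])
p_male <- tab["no degree", "male"] / sum(tab[, "male"])

# The three statistics, each in the order female then male
diff_sim <- p_female - p_male
ratio_sim <- p_female / p_male
odds_ratio_sim <- (p_female / (1 - p_female)) / (p_male / (1 - p_male))

c(diff = diff_sim, ratio = ratio_sim, odds_ratio = odds_ratio_sim)
```

With these counts the proportions are 0.75 and 2/3, so the ratio of proportions is 1.125 and the odds ratio is 3 / 2 = 1.5. Under independence, the difference is centered at 0 while both ratios are centered at 1.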
Finding the standardized observed statistic,
z_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "z", order = c("female", "male"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
z_hat <- gss %>%
  observe(college ~ sex, success = "no degree",
          stat = "z", order = c("female", "male"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "z", order = c("female", "male"))
Alternatively, finding the null distribution using theoretical methods with the assume() verb,
null_dist_theory <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  assume("z")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = z_hat, direction = "two-sided")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = z_hat, direction = "two-sided")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = z_hat, direction = "two-sided")
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = z_hat, direction = "two-sided")
p_value |
---|
1 |
Note the similarities in this plot and the previous one.
The package also supplies a wrapper around prop.test to allow for tests of equality of proportions on tidy data.
prop_test(gss,
  college ~ sex,
  order = c("female", "male"))
statistic | chisq_df | p_value | alternative | lower_ci | upper_ci |
---|---|---|---|---|---|
0 | 1 | 0.9964 | two.sided | -0.1009 | 0.0917 |
Calculating the observed statistic,
Note the need to add in the hypothesized values here to compute the observed statistic.
Chisq_hat <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  calculate(stat = "Chisq")
Alternatively, using the observe() wrapper to calculate the observed statistic,
Chisq_hat <- gss %>%
  observe(response = finrela,
          null = "point",
          p = c("far below average" = 1/6,
                "below average" = 1/6,
                "average" = 1/6,
                "above average" = 1/6,
                "far above average" = 1/6,
                "DK" = 1/6),
          stat = "Chisq")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "Chisq")
Alternatively, finding the null distribution using theoretical methods with the assume() verb,
null_dist_theory <- gss %>%
  specify(response = finrela) %>%
  assume("Chisq")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = Chisq_hat, direction = "greater")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = Chisq_hat, direction = "greater")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = Chisq_hat, direction = "greater")
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
p_value |
---|
0 |
Alternatively, using the chisq_test wrapper:
chisq_test(gss,
response = finrela,
p = c("far below average" = 1/6,
"below average" = 1/6,
"average" = 1/6,
"above average" = 1/6,
"far above average" = 1/6,
"DK" = 1/6))
statistic | chisq_df | p_value |
---|---|---|
488 | 5 | 0 |
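The goodness-of-fit statistic itself is simple to compute by hand, which also shows why the p-value uses direction = "greater": larger values always mean a worse fit to the hypothesized proportions. The base-R sketch below uses made-up counts for six categories (not the gss counts) and checks the result against stats::chisq.test().

```r
# Hypothetical observed counts for six categories (not the real gss counts)
observed <- c(5, 100, 200, 150, 40, 5)

# Null hypothesis: all six categories are equally likely
p_null <- rep(1 / 6, 6)
expected <- sum(observed) * p_null

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq_by_hand <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
df_gof <- length(observed) - 1
p_by_hand <- pchisq(chisq_by_hand, df = df_gof, lower.tail = FALSE)

# Compare with the built-in test
res <- chisq.test(observed, p = p_null)
all.equal(chisq_by_hand, unname(res$statistic))
all.equal(p_by_hand, res$p.value)
```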
Calculating the observed statistic,
Chisq_hat <- gss %>%
  specify(formula = finrela ~ sex) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")
Alternatively, using the observe() wrapper to calculate the observed statistic,
Chisq_hat <- gss %>%
  observe(formula = finrela ~ sex, stat = "Chisq")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")
Alternatively, finding the null distribution using theoretical methods with the assume() verb,
null_dist_theory <- gss %>%
  specify(finrela ~ sex) %>%
  assume(distribution = "Chisq")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = Chisq_hat, direction = "greater")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = Chisq_hat, direction = "greater")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = Chisq_hat, direction = "greater")
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
p_value |
---|
0.108 |
Alternatively, using the wrapper to carry out the test,
gss %>%
  chisq_test(formula = finrela ~ sex)
statistic | chisq_df | p_value |
---|---|---|
9.105 | 5 | 0.1049 |
Calculating the observed statistic,
d_hat <- gss %>%
  specify(age ~ college) %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
d_hat <- gss %>%
  observe(age ~ college,
          stat = "diff in means", order = c("degree", "no degree"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = d_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
p_value |
---|
0.438 |
Finding the standardized observed statistic,
t_hat <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "t", order = c("degree", "no degree"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
t_hat <- gss %>%
  observe(age ~ college,
          stat = "t", order = c("degree", "no degree"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("degree", "no degree"))
Alternatively, finding the null distribution using theoretical methods with the assume() verb,
null_dist_theory <- gss %>%
  specify(age ~ college) %>%
  assume("t")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = t_hat, direction = "two-sided")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = t_hat, direction = "two-sided")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = t_hat, direction = "two-sided")
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = t_hat, direction = "two-sided")
p_value |
---|
0.428 |
Note the similarities in this plot and the previous one.
Calculating the observed statistic,
d_hat <- gss %>%
  specify(age ~ college) %>%
  calculate(stat = "diff in medians", order = c("degree", "no degree"))
Alternatively, using the observe() wrapper to calculate the observed statistic,
d_hat <- gss %>%
  observe(age ~ college,
          stat = "diff in medians", order = c("degree", "no degree"))
Then, generating the null distribution,
null_dist <- gss %>%
  specify(age ~ college) %>% # alt: response = age, explanatory = college
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in medians", order = c("degree", "no degree"))
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = d_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
p_value |
---|
0.194 |
Calculating the observed statistic,
F_hat <- gss %>%
  specify(age ~ partyid) %>%
  calculate(stat = "F")
Alternatively, using the observe() wrapper to calculate the observed statistic,
F_hat <- gss %>%
  observe(age ~ partyid, stat = "F")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")
Alternatively, finding the null distribution using theoretical methods with the assume() verb,
null_dist_theory <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  assume(distribution = "F")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = F_hat, direction = "greater")
Alternatively, visualizing the observed statistic using the theory-based null distribution,
visualize(null_dist_theory) +
shade_p_value(obs_stat = F_hat, direction = "greater")
Alternatively, visualizing the observed statistic using both of the null distributions,
visualize(null_dist, method = "both") +
shade_p_value(obs_stat = F_hat, direction = "greater")
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = F_hat, direction = "greater")
p_value |
---|
0.052 |
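For reference, the F statistic compares variation between group means with variation within groups. The base-R sketch below computes it by hand on simulated data with three hypothetical groups and checks the value against anova(); the variable names and data are assumptions made for the example.

```r
set.seed(5)

# Hypothetical data: a numeric response and a three-level grouping factor
g <- factor(rep(c("dem", "ind", "rep"), each = 30))
y <- rnorm(90, mean = 40, sd = 10)

grand_mean <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within <- sum((y - group_means[as.character(g)])^2)

# F = mean square between / mean square within
k <- nlevels(g)
n <- length(y)
F_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Compare with the one-way ANOVA table from lm()
F_anova <- anova(lm(y ~ g))[["F value"]][1]
all.equal(F_by_hand, F_anova)
```

Since F is a ratio of non-negative quantities, evidence against the null sits only in the upper tail, which matches the direction = "greater" used above.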
Calculating the observed statistic,
slope_hat <- gss %>%
  specify(hours ~ age) %>%
  calculate(stat = "slope")
Alternatively, using the observe() wrapper to calculate the observed statistic,
slope_hat <- gss %>%
  observe(hours ~ age, stat = "slope")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(hours ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = slope_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = slope_hat, direction = "two-sided")
p_value |
---|
0.88 |
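A permutation null for a regression slope can be sketched in base R by shuffling the response and refitting the line each time. The data below are simulated and every name is hypothetical; the sketch also checks the identity linking the slope to the correlation, which suggests why permutation tests on the two statistics tend to give very similar p-values.

```r
set.seed(6)

# Hypothetical numeric explanatory and response variables
age_sim <- runif(150, min = 18, max = 80)
hours_sim <- rnorm(150, mean = 40, sd = 12)

# Least-squares slope of y on x
slope <- function(y, x) unname(coef(lm(y ~ x))[2])

slope_obs <- slope(hours_sim, age_sim)

# Permute the response to simulate the slope under independence
null_slopes <- replicate(1000, slope(sample(hours_sim), age_sim))

# Two-sided p-value
p_slope <- min(1, 2 * min(mean(null_slopes <= slope_obs),
                          mean(null_slopes >= slope_obs)))

# The correlation is the slope rescaled by sd(x) / sd(y)
cor_obs <- cor(age_sim, hours_sim)
all.equal(cor_obs, slope_obs * sd(age_sim) / sd(hours_sim))
```

Permuting the response leaves both standard deviations unchanged, so each permuted slope is a fixed multiple of the corresponding permuted correlation.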
Calculating the observed statistic,
correlation_hat <- gss %>%
  specify(hours ~ age) %>%
  calculate(stat = "correlation")
Alternatively, using the observe() wrapper to calculate the observed statistic,
correlation_hat <- gss %>%
  observe(hours ~ age, stat = "correlation")
Then, generating the null distribution,
null_dist <- gss %>%
  specify(hours ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "correlation")
Visualizing the observed statistic alongside the null distribution,
visualize(null_dist) +
shade_p_value(obs_stat = correlation_hat, direction = "two-sided")
Calculating the p-value from the null distribution and observed statistic,
null_dist %>%
  get_p_value(obs_stat = correlation_hat, direction = "two-sided")
p_value |
---|
0.876 |
A standardized statistic for two numerical variables is not currently implemented, since \(t\) could refer to a standardized slope or a standardized correlation.
Calculating the observed fit,
obs_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()
Generating a distribution of fits with the response variable permuted,
null_dist <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()
Generating a distribution of fits where each explanatory variable is permuted independently,
null_dist2 <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute", variables = c(age, college)) %>%
  fit()
Visualizing the observed fit alongside the null fits,
visualize(null_dist) +
shade_p_value(obs_stat = obs_fit, direction = "two-sided")
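The idea of a null distribution of fits can be illustrated in base R by permuting the response, refitting the linear model, and collecting the coefficients from each refit. The data frame dat is simulated and all names are hypothetical; this is a sketch of the response-permutation idea under those assumptions, not a reproduction of what fit() does internally.

```r
set.seed(7)

# Hypothetical data with one numeric and one categorical explanatory variable
n <- 200
dat <- data.frame(
  age = runif(n, min = 18, max = 80),
  college = factor(sample(c("degree", "no degree"), n, replace = TRUE))
)
dat$hours <- rnorm(n, mean = 40, sd = 12)

# Observed coefficients
obs_coefs <- coef(lm(hours ~ age + college, data = dat))

# Permute the response, refit, and keep the coefficients from each fit
null_coefs <- t(replicate(500, {
  permuted <- dat
  permuted$hours <- sample(permuted$hours)
  coef(lm(hours ~ age + college, data = permuted))
}))

# One two-sided p-value per model term
p_vals <- sapply(names(obs_coefs), function(term) {
  lower <- mean(null_coefs[, term] <= obs_coefs[term])
  upper <- mean(null_coefs[, term] >= obs_coefs[term])
  min(1, 2 * min(lower, upper))
})
p_vals
```

Each row of null_coefs is one permuted fit and each column is one model term, so every term gets its own null distribution and p-value.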