Introduction to fastpos

January, 2020

The R package fastpos provides a fast algorithm to calculate the required sample size for a Pearson correlation to stabilize within a sequential framework (Schönbrodt & Perugini, 2013, 2018). The goal is to find the sample size at which a specified proportion (1 - α) of many studies will have fallen into a specified corridor of stability around an assumed population correlation and will stay inside that corridor when more participants are added. For instance, how many participants per study are required so that, out of 100k studies, 90% fall into the region between .4 and .6 (Pearson correlation) and do not leave this region again when more participants are added (assuming a population correlation of .5)? This sample size is also referred to as the critical point of stability for the specific parameters.
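To make the idea concrete, here is a minimal base R sketch of the point of stability for a single simulated study. It does not use fastpos. All variable names are made up for illustration, and reporting the first sample size after the last breach of the corridor is an assumption about the convention, not necessarily what the package reports.

```r
# Sketch: point of stability for ONE simulated study (base R only).
set.seed(1)
rho   <- 0.5           # assumed population correlation
lower <- 0.4           # lower limit of the corridor of stability
upper <- 0.6           # upper limit of the corridor of stability
n_min <- 20
n_max <- 1000

# Draw n_max pairs from a bivariate normal distribution with correlation rho.
x <- rnorm(n_max)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n_max)

# Correlation after each added participant, from n_min up to n_max.
ns    <- n_min:n_max
r_seq <- vapply(ns, function(n) cor(x[1:n], y[1:n]), numeric(1))

# Positions in the trajectory where the correlation is outside the corridor.
outside <- which(r_seq < lower | r_seq > upper)

point_of_stability <- if (length(outside) == 0) {
  n_min                   # never left the corridor
} else if (max(outside) == length(ns)) {
  NA_integer_             # still outside at n_max: corridor not reached
} else {
  ns[max(outside)] + 1L   # first sample size after the last breach
}
point_of_stability
```

Repeating this for many studies gives a distribution of points of stability. The critical point of stability is then a quantile of that distribution (e.g. the 90% quantile).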

This approach is related to accuracy in parameter estimation (AIPE, e.g. Maxwell, Kelley, & Rausch, 2008) and as such can be seen as an alternative to power analysis. Unlike AIPE, the concept of stability incorporates the idea of sequentially adding participants to a study. Although the approach is young, it has already attracted a lot of interest in the psychological research community, which is evident in over 600 citations of the original publication (Schönbrodt & Perugini, 2013). Until now, there has been no easy way to use sequential stability for individual sample size planning because there is no analytical solution to the problem and a simulation approach is computationally expensive. The package fastpos overcomes this limitation by speeding up the calculation of correlations. For typical parameters, the theoretical speedup should be a factor of about 250 or more. An empirical benchmark for a typical scenario even shows a speedup of about 500, paving the way for a wider usage of the stability approach.

If you have found this page, I assume you either want to (1) calculate the critical point of stability for your own study or (2) explore the method in general. If this is the case, read on and you should find what you are looking for. Let us first load the package and set a seed for reproducibility:

library(fastpos)
set.seed(19950521)

In most cases you will just need the function find_critical_pos which will give you the critical point of stability for your specific parameters.

Let us reproduce one example from Schönbrodt and Perugini’s work (this should take only a couple of seconds on a modern CPU):

find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 1000,
                  n_studies = 10000)
#>   rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit upper_limit
#> 1     0.7  65  96 127              20            1000         0.6         0.8
#>   n_studies n_not_breached precision precision_rel
#> 1     10000              0       0.1         FALSE

The result is very close to Schönbrodt and Perugini’s table (see https://github.com/nicebread/corEvol).

Note that find_critical_pos will issue a warning if at least one study did not reach the corridor of stability with the maximum sample size. This happened in Schönbrodt and Perugini’s work, but quite seldom. Still, it should be avoided for a proper estimate of the point of stability.

find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 400,
                  n_studies = 10000)
#> Warning in find_critical_pos(rho = 0.7, sample_size_min = 20, sample_size_max = 400, : 3 simulation[s] did not reach the corridor of
#>             stability.
#> Increase sample_size_max and rerun the simulation.
#>   rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit upper_limit
#> 1     0.7  65  97 133              20             400         0.6         0.8
#>   n_studies n_not_breached precision precision_rel
#> 1     10000              3       0.1         FALSE

In this case, do what the warning suggests and increase the maximum sample size. Note that larger maximum sample sizes are more resource intensive because the correlations are calculated in reverse (from the maximum sample size downwards). Thus, you usually do not want to increase the maximum sample size unless there are studies that did not reach the corridor of stability.
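The following base R sketch illustrates why a larger maximum sample size costs more when the correlations are calculated downwards. It is only an illustration of the idea under my own assumptions (running sums, a hypothetical helper cor_first_n), not the C++ code used in the package.

```r
# Sketch: walk from n_max downwards and stop at the first breach.
set.seed(2)
rho   <- 0.7
lower <- 0.6
upper <- 0.8
n_min <- 20
n_max <- 400

x <- rnorm(n_max)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n_max)

# Running sums give the correlation of the first n pairs in constant time.
sx  <- cumsum(x)
sy  <- cumsum(y)
sxx <- cumsum(x^2)
syy <- cumsum(y^2)
sxy <- cumsum(x * y)

# Hypothetical helper: Pearson correlation of the first n pairs.
cor_first_n <- function(n) {
  (n * sxy[n] - sx[n] * sy[n]) /
    sqrt((n * sxx[n] - sx[n]^2) * (n * syy[n] - sy[n]^2))
}

breach <- NA_integer_  # stays NA if the corridor is never left
steps  <- 0L           # number of correlations that had to be calculated
for (n in n_max:n_min) {
  steps <- steps + 1L
  r <- cor_first_n(n)
  if (r < lower || r > upper) {
    breach <- n        # first breach from above = last breach from below
    break
  }
}
c(breach = breach, steps = steps)
```

Every correlation between the maximum sample size and the breach has to be calculated, so the number of steps grows with the maximum sample size even though the breach itself stays where it is.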

If you need different confidence levels, simply specify them:

find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 1000,
                  n_studies = 10000, confidence_levels = c(.6, .85))
#>   rho_pop 60% 85% sample_size_min sample_size_max lower_limit upper_limit
#> 1     0.7  38  78              20            1000         0.6         0.8
#>   n_studies n_not_breached precision precision_rel
#> 1     10000              0       0.1         FALSE

This has no effect on resource consumption because the time-consuming part is simulating the distribution, not calculating its quantiles.

If you need a different precision level or even relative precision, specify it:

find_critical_pos(rho = c(.5, .7), sample_size_min = 20, sample_size_max = 2500,
                  n_studies = 10000, precision = .10, precision_rel = TRUE)
#> Warning in find_critical_pos(rho = c(0.5, 0.7), sample_size_min = 20, sample_size_max = 2500, : 10 simulation[s] did not reach the corridor of
#>             stability.
#> Increase sample_size_max and rerun the simulation.
#>   rho_pop   80%   90%  95% sample_size_min sample_size_max lower_limit
#> 1     0.5 590.0 845.2 1122              20            2500        0.45
#> 2     0.7 135.2 199.0  264              20            2500        0.63
#>   upper_limit n_studies n_not_breached precision precision_rel
#> 1        0.55     10000             10       0.1          TRUE
#> 2        0.77     10000              0       0.1          TRUE

As you can see in the output, the limits were set relative to the population correlation (±10% of the population correlation).
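The difference between absolute and relative precision can be checked by hand. This small sketch assumes that relative precision simply scales the corridor with the population correlation, which is what the limits in the output above suggest. The variable names are made up for illustration.

```r
rho       <- c(0.5, 0.7)
precision <- 0.10

# Absolute precision: the corridor has the same half-width for every rho.
abs_limits <- cbind(lower = rho - precision,
                    upper = rho + precision)

# Relative precision: the half-width is a fraction of rho.
rel_limits <- cbind(lower = rho * (1 - precision),
                    upper = rho * (1 + precision))

abs_limits
rel_limits
```

With relative precision, the corridor for a population correlation of .5 is narrower (.45 to .55) than with an absolute precision of .1 (.4 to .6), which is why the required sample sizes are much larger.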

If you want to dig deeper, you can have a look at the functions that find_critical_pos builds upon. simulate_pos is the workhorse of the package. It calls a C++ function to calculate correlations sequentially and it does this pretty quickly (but you know that already, right?). A rawish approach would be to create a population with create_pop and pass it to simulate_pos:

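# Create a bivariate normal population with a correlation of 0.5
pop <- create_pop(0.5, 1000000)
# Simulate 1000 studies and record the point of stability of each study
pos <- simulate_pos(x_pop = pop[, 1],
                    y_pop = pop[, 2],
                    n_studies = 1000,
                    sample_size_min = 20,
                    sample_size_max = 1000,
                    replace = TRUE,
                    lower_limit = 0.4,
                    upper_limit = 0.6)
hist(pos, xlim = c(0, 1000), xlab = "Point of stability",
     main = "Histogram of points of stability for rho = .5+-.1")

# Critical points of stability for confidence levels of 80%, 90% and 95%
quantile(pos, c(.8, .9, .95), na.rm = TRUE)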
#>    80%    90%    95%
#> 142.20 199.10 290.05

Note that no warning message appears if the corridor is not reached, but instead an NA value is returned. Pay careful attention if you work with this function, and adjust the maximum sample size as needed.
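Since there is no warning, it is worth counting the NA values yourself before calculating quantiles. The sketch below uses a small made-up vector (pos_example) in place of a real simulate_pos result. The values are purely illustrative.

```r
# Hypothetical result vector; NA marks a study that never reached the corridor.
pos_example <- c(35, 120, 64, NA, 210, 88, NA, 47, 153, 99)

n_failed     <- sum(is.na(pos_example))   # number of studies without a result
share_failed <- mean(is.na(pos_example))  # proportion of such studies

if (n_failed > 0) {
  message(n_failed, " of ", length(pos_example),
          " studies did not reach the corridor; ",
          "consider a larger sample_size_max.")
}

# na.rm = TRUE silently drops the failed studies from the quantiles.
quantile(pos_example, c(.8, .9, .95), na.rm = TRUE)
```

Dropping the NA values removes exactly those studies whose point of stability lies beyond the maximum sample size, so the quantiles will tend to be too small if many studies are affected.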

create_pop creates the population matrix by using mvrnorm. This is a much simpler way than Schönbrodt and Perugini’s approach, but the results do not seem to differ. If you are interested in how population parameters (e.g. skewness) affect the point of stability, you should instead refer to the population generating functions in Schönbrodt and Perugini’s work.
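For illustration, a population with a given correlation can be generated with mvrnorm from the MASS package roughly as follows. This is a sketch, not the source of create_pop: the means, the unit variances, the population size and the use of empirical = TRUE are my assumptions here.

```r
library(MASS)

rho <- 0.5
# Covariance matrix of two standardized variables with correlation rho.
sigma <- matrix(c(1,   rho,
                  rho, 1), nrow = 2, byrow = TRUE)

# empirical = TRUE makes the generated data reproduce sigma exactly,
# so the population correlation equals rho apart from rounding error.
pop_sketch <- mvrnorm(n = 10000, mu = c(0, 0), Sigma = sigma,
                      empirical = TRUE)

dim(pop_sketch)
cor(pop_sketch[, 1], pop_sketch[, 2])
```

The two columns of such a matrix could then be passed to simulate_pos as x_pop and y_pop, in the same way as the columns of the matrix returned by create_pop.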

As you can see, there is not really much to the sequential definition of stability, except for calculating billions of correlations. This is done quite fast with the help of Rcpp.

Let us reproduce Schönbrodt and Perugini’s quite famous and oft-cited table of the critical points of stability for a precision of 0.1. We set the maximum sample size to 1000 to keep the number of studies that never reach the corridor small; as the warning in the output shows, a few studies with small population correlations still fail to reach it. We reduce the number of studies to 10k so that it runs fairly quickly.

find_critical_pos(rho = seq(.1, .7, .1), sample_size_max = 1000,
                  n_studies = 10000)
#> Warning in find_critical_pos(rho = seq(0.1, 0.7, 0.1), sample_size_max = 1000, : 22 simulation[s] did not reach the corridor of
#>             stability.
#> Increase sample_size_max and rerun the simulation.
#>   rho_pop   80%   90%    95% sample_size_min sample_size_max lower_limit
#> 1     0.1 254.0 366.1 489.00              20            1000         0.0
#> 2     0.2 235.2 340.0 443.05              20            1000         0.1
#> 3     0.3 211.0 303.1 396.00              20            1000         0.2
#> 4     0.4 179.0 260.0 350.00              20            1000         0.3
#> 5     0.5 144.2 211.0 276.05              20            1000         0.4
#> 6     0.6 102.0 149.0 198.05              20            1000         0.5
#> 7     0.7  66.0  97.0 128.05              20            1000         0.6
#>   upper_limit n_studies n_not_breached precision precision_rel
#> 1         0.2     10000              9       0.1         FALSE
#> 2         0.3     10000              7       0.1         FALSE
#> 3         0.4     10000              1       0.1         FALSE
#> 4         0.5     10000              5       0.1         FALSE
#> 5         0.6     10000              0       0.1         FALSE
#> 6         0.7     10000              0       0.1         FALSE
#> 7         0.8     10000              0       0.1         FALSE

You can also parallelize the process, which is especially useful if you want to run many simulations. For instance, if you increase the number of studies to 100k (as in the original article), it will take less than a minute on a modern CPU with several cores. On my i7-2640 with 4 cores, it takes about 30 s. Overall, this speedup is substantial compared to the original implementation. A rough benchmark, available at https://github.com/johannes-titz/fastpos, shows a speedup of about 500 for a typical scenario.

If you are interested in this package, there is still some work to do and I would be happy if you would like to contribute. Specifically, I would like to use RcppParallel to speed up the simulation directly in C++. This is mostly of academic interest, as the functions are already fast enough to find the critical point of stability for an individual study in a few seconds for most use cases. Indeed, I hope the package will be used this way, quite similar to a power analysis for significance testing.

References

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. https://doi.org/10.1146/annurev.psych.59.103006.093735

Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47, 609–612. https://doi.org/10.1016/j.jrp.2013.05.009

Schönbrodt, F. D., & Perugini, M. (2018). Corrigendum to “At What Sample Size Do Correlations Stabilize?” [J. Res. Pers. 47 (2013) 609–612]. Journal of Research in Personality, 74, 194. https://doi.org/10.1016/j.jrp.2018.02.010