Using the methodology developed by Gandy et al. (2019), we can check if tests reject with the desired frequency (i.e. we can check the level of the tests) and we can check if confidence intervals have the desired coverage probabilities.

Suppose we want to test if the two-sided t-test (for the mean of iid normally distributed random variables to be 0) implemented in R has a specific nominal level (e.g. 5%). The following does this.

This example uses three buckets \([0,0.45]\), \([0.04,0.06]\),\([0.55,1]\). The method returns, with at least the probability `1-error`

, a bucket that contains the correct rejection probability.

The code below implements a test that succeeds if the bucket \([0.04, 0.06]\) is returned. In other words, if the true rejection probability is in \((0.045, 0.055)\), then the test is guaranteed to succeed (with probability of at least `1-error`

). If it is in \([0.04, 0.045]\cup[0.55,0.06]\), the test may or may not succeed (as different buckets may be returned. If it is in \([0,0.04)\cup(0.06,1]\) then it is guaranteed to fail (with probability of at least `1-error`

).

The following is a function that simulates data and then returns the test decision.

Now, we are setting up the buckets:

```
## low ok high
## [1,] 0.000 0.04 0.055
## [2,] 0.045 0.06 1.000
```

Next, the test is run.

As expected, this does not return an error.

However, if one (wrongly) assumes that this also works if the data is sampled under a Cauchy distribution, then, as expected, the test returns an error. We use the same test and buckets as before, but now the data is simulated from a Cauchy distribution.

`## Error: Test returned bucket [0,0.045], called 'low', not a bucket called 'ok'.`

It is well know that the asymptotic distribution of Pearson’s chi-squared goodness of fit test is not reliable for small sample sizes, and the implementation in R correctly warns about it for small sample sizes.

```
## Warning in chisq.test(c(rmultinom(1, size = 4, prob = c(1/3, 1/3, 1/3)))): Chi-
## squared approximation may be incorrect
```

`## [1] FALSE`

A unit test would also detect this.

`## Error: Test returned bucket [0,0.045], called 'low', not a bucket called 'ok'.`

However, with a larger number of samples and a wide interval \([0.035,0.065]\), it can be confirmed that the level is around the desired level.

```
## low ok high
## [1,] 0.00 0.035 0.06
## [2,] 0.04 0.065 1.00
```

```
gen <- function()as.numeric(chisq.test(c(rmultinom(1,size=15,prob=c(1/3,1/3,1/3))))$p.value<0.05)
expect_bernoulli(gen,J=J,ok="ok")
```

If one needs a more precise statement, one can use a finer grid of buckets - and this reveals that the rejection probability in this case is less than the nominal level.

```
J <- matrix(nrow=2,c(0,0.04,0.035,0.0475, 0.045,0.055, 0.0525,1))
colnames(J) <- c("very low", "low","ok","high")
J
```

```
## very low low ok high
## [1,] 0.00 0.0350 0.045 0.0525
## [2,] 0.04 0.0475 0.055 1.0000
```

`## Error: Test returned bucket [0.035,0.0475], called 'low', not a bucket called 'ok'.`

Similarly, the function below tests if the 95% confidence interval returned by `t.test`

has the correct coverage probability.

First, we set up a function that simulates data, computes the confidence interval and then returns whether or not it contains the true mean.

```
gen <- function(){
x <- rnorm(10,mean=3.7)
CI <- t.test(x)$conf.int
as.numeric(CI[1]<=3.7&CI[2]>=3.7)
}
```

Then we set up the buckets.

```
## low ok high
## [1,] 0.000 0.94 0.955
## [2,] 0.945 0.96 1.000
```

Running the test in this way does not lead to an error.

However, if one chooses different buckets, then, as expected, an error is reported.

```
## low ok high
## [1,] 0.000 0.89 0.905
## [2,] 0.895 0.91 1.000
```

`## Error: Test returned bucket [0.905,1], called 'high', not a bucket called 'ok'.`

Gandy, A., Hahn, G., and Ding, D. (2019), “Implementing Monte Carlo Tests with P-value Buckets,” *Scandinavian Journal of Statistics*. https://doi.org/10.1111/sjos.12434.