The probability of success is a very useful concept to assess the chances of success of a trial while taking uncertainties into account. In this document we briefly introduce the needed formal details, describe an example data set and then demonstrate various ways of increasing complexity how the probability of success concept can be applied to increasingly complex data situations.

The co-data concept has been introduced in [1]. It differs from the use of historical data in that the approach makes use of contemporary data. A meta-analytic-predictive (MAP) analysis assumes that historical data is known at the time-point of specyfing the analysis and is as such a retrospective summary of available data. The MAP prior is then combined with the current trial data. A co-data approach extends this sequential procedure to a meta-analytic-combined (MAC) analysis. In the MAC approach all available data is analyzed in a single step - that is, historical and concurrent data is combined in a single inference step. Both approaches MAP and MAC yield exactly the same results as is demonstrated in the appendix at the bottom. An example for a co-data scenario in drug development is the simultaneous execution of twin phase III trails for registriation. In such a setting, a futility analysis at an interim analysis may take historical an all contemporary data into account in a co-data approach. This example has been discussed in [1] using the probability of success (PoS) as metric to assess futility at an interim analysis and is discussed here in detail.

The key property of the probability of success metric is the consideration of uncertainty in parameters conditional on available data. In contrast, the conditional power (CP) calculates the frequency a given experemintal design will be successful for a known value of the parameters. For example, in a 1-sample experiment with a one-sided success criterion the trial is successful if the collected data \(y_N\) of sample size \(N\) exceeds some critical value \(y_c\); recall that the critical value \(y_c\) is determined by the success criterion, prior and sample size when evaluated. Assuming that the sampling model of the data is \(p(y|\theta)\), then

\[ CP_N(\theta) = \int I(y_N > y_c) \, p(y_N|\theta) \, dy_N.\]

The integration over the data \(y_N\) comprises all possible outcomes of the trial. Note that before the start of the trial, \(CP_N(\theta_0)\) is the type I error rate under the conventional null hypothesis (\(\theta=\theta_0\)) and \(CP_N(\theta_a)\) the power of the trial under the alternative (\(\theta=\theta_a\)). At an interim analysis at sample size \(n_I\), the conditional power is then evaluated conditional on the observed data so far (the \(n_I\) measurements) while the remaining sample size (\(N-n_I\)) is random and distributed according to \(p(y|\theta)\) with \(\theta\) set to some known value,

\[ CP_{N-n_I}(\theta|y_{n_I}) = \int I(y_{n_I} + y_{N-n_I} = y_N > y_c|y_{n_I}) \, p(y_{N-n_I}|\theta) \, dy_{N-n_I}.\]

The known value can be set equal to the observed point estimate at the interim, \(CP_{N-n_I}(\hat{\theta}_{I}|y_{n_I})\), or to the assumed true alternative, \(CP_{N-n_I}(\theta_a|y_{n_I})\), used to plan the trial.

The probability of success in contrast assigns \(\theta\) a distribution and marginalizes the conditional power over this distribution. In absence of additional trial external information this distribution is the posterior for \(p(\theta|y_{n_I})\) obtained from the prior for \(p(\theta)\) and the data collected up to the interim,

\[ PoS_I = \int CP_{N-n_I}(\theta|y_{n_I}) \, p(\theta|y_{n_I}) \, d\theta.\]

However, our knowledge about \(\theta\) can be refined if other data-sources like completed (historical) or concurrent trials are available,

\[ PoS_{I,H,...} = \int CP_{N-n_I}(\theta|y_{n_I}) \, p(\theta|y_{I},y_{H},...) \, d\theta.\]

It is important to note that the conditional power is *always* evaluated with respect to the trial data (and prior) only. Thus, additional data-sources are not part of the analysis of the trial. In practice this means that the probability of success is usually calculated for a trial which uses non-informative priors, but at interim we may use additional data-sources to refine our knowledge on \(\theta\) which will not be part of the trial analysis.

In the following the hypothetical example as in [1] is discussed. The assumed endpoint is time-to-event, which is analyzed using the normal approximation of the log-rank statistic for comparing two groups. Under a 1:1 randomization the sampling distribution of the log-hazard ratio is a normal distribution with sampling standard deviation \(2\) and a standard error of \(\sqrt{4/N_{events}}\) for a sample size of \(N_{events}\). The historical data considered is a proof of concept and a phase II trial. The twin phase III studies are event driven. Each trial stops whenever a total of \(379\) events is reached and an interim is planned whenever at least \(150\) events have occured. The assumed true hazard ratio used for the design of the trial is \(0.8\).

Example data:

```
trials <- data.frame(study = c("PoC", "PhII", "PhIII_A", "PhIII_B"),
deaths = c( 8, 85, 162, 150),
HR = c( 0.7, 0.75, 0.83, 0.78),
stringsAsFactors = FALSE
)
## under the normal approximation of the log-HR, the sampling sd is 2
## such that the standard errors are sqrt(4/events)
trials <- trials %>%
mutate(logHR=log(HR), sem=sqrt(4/deaths))
kable(trials, digits=2)
```

study | deaths | HR | logHR | sem |
---|---|---|---|---|

PoC | 8 | 0.70 | -0.36 | 0.71 |

PhII | 85 | 0.75 | -0.29 | 0.22 |

PhIII_A | 162 | 0.83 | -0.19 | 0.16 |

PhIII_B | 150 | 0.78 | -0.25 | 0.16 |

The remaining outline of the vignette is to first evaluate the design properties of the trials, then calculate the probability of success at interim for the A trial only (using only trial A data). Next, the probability of success is calculated using in addition the historical data. Subsequently, the probability of success for trial A is calculated also using the historical data *and* the concurrent phase III data of trial B. Finally, the overall probability of success is calculated, which is defined by the joint success of both trials. In the appendix the equivalence of the MAP and MAC approach is demonstrated.

Key design choices:

- time-to-event endpoint
- phase III trials stop at target # of events 379
- null hypothesis of no difference in HR, \(\theta_0 = 1.0\)
- one-sided \(\alpha = 0.025\)
- alternative hypothesis assumes true HR of \(\theta_a=0.75\)
- interim when at least 150 events reached

Historical data:

- promising internal PoC
- promising phase II

Co-data:

- two phase III trials run in parallel \(\Rightarrow\) each phase III trial is
*concurrent*with the other

Define design choices

Here we use the unit information prior as non-informative prior and define it using the mean & effective sample size (ESS) specification:

```
## Univariate normal mixture
## Reference scale: 2
## Mixture Components:
## comp1
## w 1
## m 0
## s 2
```

Define conditional power for the overall trial:

```
## 1 sample decision function
## Conditions for acceptance:
## P(x <= 0) > 0.975
```

Under the alternative these design choices result in 80% power

`## [1] 0.7986379`

The impact of the unit-information prior is minimal which can be seen by comparing to the frequentist calculation:

```
##
## One-sample t test power calculation
##
## n = 379
## delta = 0.2876821
## sd = 2
## sig.level = 0.025
## power = 0.7976343
## alternative = one.sided
```

With RBesT we can explore the conditional power for a range of alternatives:

```
ggplot(data.frame(HR=c(0.5, 1.2)), aes(HR)) +
stat_function(fun=compose(design, log)) +
vline_at(c(alt_HR, 1.0), linetype=I(2)) +
scale_y_continuous(breaks=seq(0,1,by=0.1)) +
scale_x_continuous(breaks=c(alt_HR, seq(0,1.2,by=0.2))) +
ylab(NULL) + xlab(expression(theta[a])) +
ggtitle(paste("Power for N =", Nev, "events"))
```

The critical value determines at which observed logHR we *just* conclude that the success criterion is fulfilled.

`## [1] -0.2017185`

`## [1] 0.8173249`

We can check this:

`## Using default prior reference scale 2`

`## [1] 1`

Ok, when observing the critical value, we get a success.

Now, what if we observe a 1% worse result?

`## Using default prior reference scale 2`

`## [1] 0`

No success then \(\Rightarrow\) this is the critical boundary value.

Posterior of treatment effect at interim. The trial uses a non-informative prior for the treatment effect:

```
## Univariate normal mixture
## Reference scale: 2
## Mixture Components:
## comp1
## w 1.0000000
## m -0.1851865
## s 0.1566521
```

Now we are interested in the PoS at trial completion. The prior to use for the analysis of the second half is given by the data collected so far.

The returned function can now calculate the PoS assuming any distribution on the treatment effect. In case we do not use any historical information, then this is just the interim posterior:

`## [1] 0.4465623`

The above command integrates the *conditional power* over the uncertainty which we have about the treatment effect as defined above for \(PoS_I\).

The conditional power and the operating characteristics of a trial coincide whenever we do not condition on any observed data. The key difference of the conditional power as compared to the probability of success is that it assumes a known value for the parameter of interest. This can be seen as follows: First define the conditional power which is conditional on the observed data, \(CP_{N-n_I}(\theta|y_{n_I})\):

The conditional power assuming the alternative is true (a HR of 0.75):

`## [1] 0.708769`

In case there is no uncertainty of the treatment effect (here \(se=10^-4\)), then this result agrees with the probability of success calculation:

`## [1] 0.7087689`

For trial B the calculation is:

```
interim_B <- postmix(unit_inf, m=trials$logHR[4], se=trials$sem[4])
interim_pos_B <- pos1S(interim_B, Nev-trials$deaths[4], success_crit, sigma=2)
interim_pos_B(interim_B)
```

`## [1] 0.6411569`

However, we have historical information of which we can take advantage at the interim for a better informed decision.

Our data before the phase III trials includes the PoC and the phase II trial. We now derive from these a MAP prior; recall that the MAP prior is the prediction of the log-hazard ratio of a future trial:

```
base <- trials[1:2,]
set.seed(342345)
base_map_mc <- gMAP(cbind(logHR, sem) ~ 1 | study,
family=gaussian,
data=base,
weights=deaths,
tau.dist="HalfNormal", tau.prior=0.5,
beta.prior=cbind(0, 2))
forest_plot(base_map_mc, est="MAP")
```

```
## EM for Normal Mixture Model
## Log-Likelihood = -2977.119
##
## Univariate normal mixture
## Reference scale: 2
## Mixture Components:
## comp1 comp2 comp3
## w 0.54416926 0.39763059 0.05820016
## m -0.31213815 -0.22701276 -0.20302793
## s 0.28280435 0.63754706 1.37431804
```

At the interim we have even more knowledge available on the treatment effect through the interim data itself which we can include into the MAP prior:

The PoS for this posterior at interim (representing historical *and* interim data collected) is:

`## [1] 0.4903494`

Note that we have not redefined `interim_pos_A`

, such that this calculates the PoS for the phase III A trial taking into account that the final analysis will use a non-informative prior.

For trial B the calculation is:

```
interim_B_combined <- postmix(base_map, m=trials$logHR[4], se=trials$sem[4])
interim_pos_B(interim_B_combined)
```

`## [1] 0.6735573`

However, there is even more information which can be used here, since the phase III result of trial B is also available:

```
## Warning: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning in gMAP(formula = cbind(logHR, sem) ~ 1 | study, family = gaussian, : In total 1 divergent transitions occured during the sampling phase.
## Please consider increasing adapt_delta closer to 1 with the following command prior to gMAP:
## options(RBesT.MC.control=list(adapt_delta=0.999))
```

Now the trial B specific posterior at interim is

mean | sd | 2.5% | 50% | 97.5% | |
---|---|---|---|---|---|

PoC | -0.249 | 0.237 | -0.777 | -0.245 | 0.231 |

PhII | -0.254 | 0.154 | -0.568 | -0.250 | 0.036 |

PhIII_A | -0.219 | 0.125 | -0.460 | -0.220 | 0.034 |

PhIII_B | -0.241 | 0.127 | -0.487 | -0.239 | 0.014 |

which we can extract as:

- obtain posterior (which we restrict to the first 4 columns)

```
interim_map_post <- as.matrix(interim_map_mc)[,1:4]
dim(interim_map_post) # posterior is given as matrix: iteration x parameter
```

`## [1] 4000 4`

```
## parameters
## iterations theta[1] theta[2] theta[3] theta[4]
## [1,] -1.23506686 -0.4938551 -0.65124983 -0.24640394
## [2,] -0.23050709 -0.2296346 -0.22231936 -0.21808907
## [3,] 0.05526545 -0.1658904 -0.04603719 -0.02871535
```

- turn MCMC posterior sample into parametric mixture

- and finally evaluate the PoS

`## [1] 0.5121816`

which aligns with the published result under the assumption of full exchangeability.

For trial B computations are:

```
interim_B_allcombined <- automixfit(interim_map_post[,"theta[4]"])
interim_pos_B(interim_B_allcombined)
```

`## [1] 0.6492509`

Differential discounting allows to weight different data-sources differently. For example, we may assume greater heterogeneity for the historical data in comparison to the twin phase III trials.

Assign data to historical (2) and concurrent data strata (1):

study | deaths | HR | logHR | sem | stratum |
---|---|---|---|---|---|

PoC | 8 | 0.70 | -0.36 | 0.71 | 2 |

PhII | 85 | 0.75 | -0.29 | 0.22 | 2 |

PhIII_A | 162 | 0.83 | -0.19 | 0.16 | 1 |

PhIII_B | 150 | 0.78 | -0.25 | 0.16 | 1 |

```
set.seed(435345)
interim_diff_map_mc <- gMAP(cbind(logHR, sem) ~ 1 | study,
tau.strata=stratum,
family=gaussian,
data=trials,
weights=deaths,
tau.dist="HalfNormal", tau.prior=c(0.5, 1),
beta.prior=cbind(0, 2))
interim_diff_map_post <- as.matrix(interim_diff_map_mc)[,1:4]
interim_A_diff_allcombined <- automixfit(interim_diff_map_post[,"theta[3]"])
interim_B_diff_allcombined <- automixfit(interim_diff_map_post[,"theta[4]"])
interim_pos_A(interim_A_diff_allcombined)
```

`## [1] 0.4846824`

`## [1] 0.6408315`

So far we have only calculated the individual PoS per trial, but more interesting is the *overall* PoS for both trials being successful.

Recall, the PoS is the conditional power integrated over an assumed true effect distribution. Hence, we had for trial A:

`## [1] 0.4465623`

As explained, the conditional power is the operating characerstic of a design when conditioning on the already observed data:

The PoS is then the integral of the conditional power over the parameter space \(\theta\) representing our knowledge. This integral can be evaluated in a Monte-Carlo (MC) approach as

\[ PoS_I = \int CP_{N-n_I}(\theta|y_{n_I}) \, p(\theta|y_{n_I}) \, d\theta \approx \frac{1}{S} \sum_{i=1}^S CP(\theta_i),\]

whenever we have a sample of \(p(\theta|y_{n_I})\) of size \(S\)… which we have:

`## [1] 0.4444263`

This is an MC approach to calculating the PoS.

When now considering the probability for both trials being successful we have to perform an MC integration over the joint density \(p(\theta_A,\theta_B|y_{n_{I_A}},y_{n_{I_B}})\)

\[ \begin{aligned} PoS &= \iint CP_{N-n_{I_A}}(\theta_A|y_{n_{I_A}}) \, CP_{N-n_{I_B}}(\theta_B|y_{n_{I_B}})\, p(\theta_B|y_{n_{I_A}},y_{n_{I_B}}) \, d\theta_A d\theta_B \\ & \approx \frac{1}{S} \sum_{i=1}^S CP_{N-n_{I_A}}(\theta_{A,i}|y_{n_{I_A}}) \, CP_{N-n_{I_B}}(\theta_{B,i}|y_{n_{I_B}}). \end{aligned} \]

Thus we need to also get the conditional power for trial B at interim…

…and integrate over the posterior samples (differential discounting case)

`## [1] 0.3359457`

which is slightly larger than assuming independence:

`## [1] 0.2863165`

This is due to dependence of the posteriors

```
## theta[3] theta[4]
## theta[3] 1.0000000 0.2903979
## theta[4] 0.2903979 1.0000000
```

For the full exchangeability case we have

`## [1] 0.3613849`

We have now calculated with increasing complexity the probability of success for various data constellations. As new trials are only conducted whenever previous trial results were positive, it is important to take note of the potential selection bias. Moreover, adding more historical data sources in this situation will likely increase the probability of success as illustrated by this summary of our preceding calculations.

Phase III trial A:

`## [1] 0.4465623`

`## [1] 0.4903494`

`## [1] 0.5121816`

Phase III trial B:

`## [1] 0.6411569`

`## [1] 0.6735573`

`## [1] 0.6492509`

In the preceeding sections we have used MAP and MAC equivalence already. The proof for the equivalence is presented reference in [2]. The formal deriavtion is shown at the end of this section

While MAP and MAC provide the exact same results, the difference is a sequential vs a joint analysis as (see also [2]):

- MAP: Summarize historical information as MAP and then update the MAP with the trial result (MCMC, then
`postmix`

). - MAC: Directly summarize historical information and trial result in a single step (only MCMC on all data).

The two results above using MAP and MAC did not line up. The reason here is that the MAP approach used the historical data and the phase III trial A interim data only. In contrast, the MAC approach used the historical data and interim phase III data of both trials. To show the equivalence we need to align this mismatch of used data.

Run `gMAP`

with base data and produce a large MCMC sample (10 chains) to get a very high precision.

```
base_map_mc_2 <- gMAP(cbind(logHR, sem) ~ 1 | study,
family=gaussian,
data=base,
weights=deaths,
tau.dist="HalfNormal", tau.prior=0.5,
beta.prior=cbind(0, 2),
chains=ifelse(is_CRAN, 2, 20))
```

Force an accurate fit with 5 components:

```
## EM for Normal Mixture Model
## Log-Likelihood = -15023.65
##
## Univariate normal mixture
## Reference scale: 2
## Mixture Components:
## comp1 comp2 comp3 comp4 comp5
## w 0.44291199 0.18579496 0.17155140 0.14114193 0.05859972
## m -0.27985892 0.07382954 -0.60624841 -0.51462844 0.01380718
## s 0.24534769 0.42229796 0.33952059 0.83218491 1.20191788
```

Now, combine the MAP prior (representing historical knowledge) with the interim data of trial A:

- Run the respective MAC analysis (thus we need historical data + phase III A trial, but excluding the phase III B data):

```
## Warning: There were 2 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning in gMAP(formula = cbind(logHR, sem) ~ 1 | study, family = gaussian, : In total 2 divergent transitions occured during the sampling phase.
## Please consider increasing adapt_delta closer to 1 with the following command prior to gMAP:
## options(RBesT.MC.control=list(adapt_delta=0.999))
```

- turn MCMC posterior sample into parametric mixture

```
## EM for Normal Mixture Model
## Log-Likelihood = 10928.27
##
## Univariate normal mixture
## Mixture Components:
## comp1 comp2 comp3 comp4 comp5
## w 0.2436740678 0.2219559985 0.2207907281 0.1979686955 0.1156105101
## m -0.3025028899 -0.1053612768 -0.2140820629 -0.3147708017 -0.0004070604
## s 0.1302665184 0.0591911126 0.0571517703 0.0803173179 0.0903307594
```

Now let’s overlay the two posterior’s

```
ggplot(data.frame(logHR=c(-0.8,0.25)), aes(logHR)) +
stat_function(fun=dmix, args=list(mix=interim_A_combined_2), aes(linetype="MAP")) +
stat_function(fun=dmix, args=list(mix=interim_A_allcombined_2), aes(linetype="MAC")) +
scale_linetype_discrete("Analysis") +
ggtitle("Posterior log hazard of phase III A trial at interim")
```

The PoS is essentially the same

`## [1] 0.4898895`

`## [1] 0.4869594`

The stated equivalence requires that the posterior of a trial specific parameter

\[p(\theta_\star|y_\star,y_H),\]

which is conditional on the trial specific data \(y_\star\) **and** the historical data \(y_H\) (MAC approach, joint use of \(y_H,y_\star\)), is equivalent to obtaining the MAP prior \(p(\theta_\star|y_H)\) based on the historical data and then analyzing the new trial with this prior.

\[ \begin{aligned} p(\theta_\star|y_\star,y_H) &\propto p(\theta_\star,\theta_H|y_\star,y_H) \\ &\propto p(y_\star,y_H|\theta_\star,\theta_H) \, p(\theta_\star,\theta_H) \\ &= p(y_\star|\theta_\star) \, p(y_H|\theta_H) \, p(\theta_\star,\theta_H) \\ &\propto p(y_\star|\theta_\star) \, p(\theta_\star,\theta_H|y_H) \\ &\propto p(y_\star|\theta_\star) \, p(\theta_\star|y_H) \end{aligned} \]

The equivalence holds under the use of the meta-analytic model.

[1] Neuenschwander, B., Roychoudhury, S., & Schmidli, H. (2016). On the Use of Co-Data in Clinical Trials. Statistics in Biopharmaceutical Research, 8(3), 345-354.

[2] 1. Schmidli H, Gsteiger S, Roychoudhury S, O’Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics. 2014;70(4):1023-1032.

```
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.2.1 purrr_0.3.3 dplyr_0.8.3 bayesplot_1.7.0
## [5] knitr_1.25 RBesT_1.5-4 Rcpp_1.0.2
##
## loaded via a namespace (and not attached):
## [1] highr_0.8 plyr_1.8.4 pillar_1.4.2
## [4] compiler_3.5.2 prettyunits_1.0.2 tools_3.5.2
## [7] digest_0.6.21 pkgbuild_1.0.6 evaluate_0.14
## [10] tibble_2.1.3 gtable_0.3.0 checkmate_1.9.4
## [13] pkgconfig_2.0.3 rlang_0.4.0 cli_1.1.0
## [16] parallel_3.5.2 yaml_2.2.0 mvtnorm_1.0-11
## [19] xfun_0.10 loo_2.1.0 gridExtra_2.3
## [22] withr_2.1.2 stringr_1.4.0 stats4_3.5.2
## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.1
## [28] inline_0.3.15 R6_2.4.0 processx_3.4.1
## [31] rmarkdown_1.16 rstan_2.19.2 Formula_1.2-3
## [34] reshape2_1.4.3 callr_3.3.2 magrittr_1.5
## [37] ggridges_0.5.1 codetools_0.2-15 matrixStats_0.55.0
## [40] StanHeaders_2.19.0 ps_1.3.0 backports_1.1.5
## [43] scales_1.0.0 htmltools_0.4.0 assertthat_0.2.1
## [46] colorspace_1.4-1 labeling_0.3 stringi_1.4.3
## [49] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4
```