Most real-world ecological studies are affected by imperfect detectability, i.e. the failure to detect a species or taxon at a location where it is present. Imperfect detectability is a potential source of bias that should be avoided or at least estimated, particularly because it influences estimates of colonization and extinction. Unfortunately, this is not always possible, so estimates derived from methods that assume perfect detectability should be interpreted with caution. However, when we have a replicated sampling design, we can account for detectability while estimating colonization and extinction rates (MacKenzie *et al.* 2003).

MacKenzie *et al.* (2003) present a likelihood function to estimate site occupancy, colonization, and local extinction when a species is detected imperfectly. The method relies on replicate observations per sampling time. Implementing this likelihood is not trivial because several underlying colonization-extinction trajectories may be compatible with the same observed detection history.

For example, the detection history \(\lbrace 101 \ 100 \rbrace\) means that at the first sampling time we took three replicates, \(101\), and detected our hypothetical species twice, while at the second sampling time we observed \(100\), that is, we detected the species only once. Since we detected it at least once at both time 0 and time 1, only one underlying colonization-extinction trajectory is compatible with this history, which by convention we collapse into \(( 1 \ 1 )\). However, imagine we fail to detect the species at time 1, so that our detection history is \(\lbrace 101 \ 000 \rbrace\). In this case, two underlying trajectories are compatible with the observation, \(( 1 \ 1 )\) and \(( 1 \ 0 )\), since the species may or may not have gone extinct by time 1. Therefore, the probability of the observed detection history \(\lbrace 101 \ 000 \rbrace\) should sum over the two ways in which it could have arisen.

For simplicity, let us first analyse the probability of the observed detection history \(\lbrace 101 \ 100 \rbrace\). The first sampling time always involves the probability of the species being present at the site, \(P_0\), which is the fourth model parameter, and, given presence, the probability of making two out of three possible detections in the observed order, \(d^2·(1-d)\). The probability of also being present at time 1 given that the species was present at time 0 is \(T_{11}\), and, given that, the probability of making only one out of three possible detections is \(d · (1-d)^2\), where \(d\) is the detectability *per* replicate, i.e. the probability of detecting a species in a single observation when it is present. Taken together, this leads to the following probability for the full detection history:

\[Pr(\lbrace 101 \ 100 \rbrace) = P_0 · d^2·(1-d)·T_{11} · d · (1-d)^2\]

Now, let us examine the detection history \(\lbrace 101 \ 000 \rbrace\). As mentioned, there are two possibilities for the second sampling time: the species could have been present but undetected, or it could have been truly absent. The probability of the full detection history should therefore sum over the two underlying colonization-extinction trajectories, \(( 1 \ 1 )\) and \(( 1 \ 0 )\):

\[Pr(\lbrace 101 \ 000 \rbrace) = P_0 · d^2·(1-d) · T_{11} · (1-d)^3 + P_0 · d^2·(1-d) · T_{01} \]

where \(T_{01}\) is the probability of local extinction, i.e. of a transition from presence to absence between the two sampling times. In this notation the first subscript is the state at the later sampling time and the second subscript is the state at the earlier one, so \(T_{10}\) is the probability of colonization.
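
The two probabilities above can be checked numerically with a few lines of base R. This is only a sketch: the helper `p_obs_present` and the values given to `P0`, `d`, `T11` and `T01` are illustrative assumptions, not part of the package or estimates from any data set. `T01` stands for the presence-to-absence (extinction) transition.

```r
# Probability of one sampling time's replicate string, given true presence.
# `obs` is a string such as "101"; `d` is the per-replicate detectability.
p_obs_present <- function(obs, d) {
  x <- as.integer(strsplit(obs, "")[[1]])
  prod(d^x * (1 - d)^(1 - x))
}

# Illustrative values (assumptions chosen only for this sketch).
P0  <- 0.6   # probability of initial presence
d   <- 0.5   # detectability per replicate
T11 <- 0.8   # present -> present
T01 <- 0.2   # present -> absent (local extinction)

# {101 100}: a single compatible trajectory, (1 1).
pr_101_100 <- P0 * p_obs_present("101", d) * T11 * p_obs_present("100", d)

# {101 000}: two compatible trajectories, (1 1) and (1 0).
pr_101_000 <- P0 * p_obs_present("101", d) * T11 * p_obs_present("000", d) +
  P0 * p_obs_present("101", d) * T01
```

The second history is more probable than the first under these values because it collects the contribution of an additional trajectory.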

As a final example, consider the detection history \(\lbrace 001 \ 000 \ 101 \ 000 \ 111 \rbrace\). The species went undetected at the second and fourth sampling times and may have been present or absent at each of them, so this detection history can be produced by four underlying colonization-extinction trajectories: \((1 \ 1 \ 1\ 1\ 1)\), \((1 \ 0 \ 1\ 1\ 1)\), \((1 \ 1 \ 1\ 0\ 1)\), \((1 \ 0 \ 1\ 0\ 1)\). The probability of this detection history should sum over these four trajectories because all of them are compatible with it. Below we detail the four conditional probabilities:

\[Pr( \lbrace 001 \ 000 \ 101 \ 000 \ 111 \rbrace | ( 1 \ 1 \ 1\ 1\ 1 ) ) = P_0·d·(1-d)^2 · T_{11}·(1-d)^3 · T_{11}·d^2·(1-d) · T_{11}·(1-d)^3 · T_{11}·d^3 \]

\[Pr( \lbrace 001 \ 000 \ 101 \ 000 \ 111 \rbrace | ( 1 \ 0 \ 1\ 1\ 1 ) ) = P_0·d·(1-d)^2 · T_{01} · T_{10}·d^2·(1-d) · T_{11}·(1-d)^3 · T_{11}·d^3 \]

\[Pr( \lbrace 001 \ 000 \ 101 \ 000 \ 111 \rbrace | ( 1 \ 1 \ 1\ 0\ 1 ) ) = P_0·d·(1-d)^2 · T_{11}·(1-d)^3 · T_{11}·d^2·(1-d) · T_{01} · T_{10}·d^3 \]

\[Pr( \lbrace 001 \ 000 \ 101 \ 000 \ 111 \rbrace | ( 1 \ 0 \ 1\ 0\ 1 ) ) = P_0·d·(1-d)^2 · T_{01} · T_{10}·d^2·(1-d) · T_{01} · T_{10}·d^3 \]
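
The four terms above can be reproduced by brute-force enumeration. The following base R sketch is not the package's implementation: the function names, the 2 x 2 matrix layout `Tm[to + 1, from + 1]` and the numerical values are assumptions made for illustration. It also assumes no false positives, so an absent species can only yield an all-zero replicate string.

```r
# Probability of a detection history along one trajectory.
# `history`: character vector with one replicate string per sampling time.
# `traj`: true states (1 = present, 0 = absent), one per sampling time.
# `Tm`: transition matrix with Tm[to + 1, from + 1] = T_{to,from}.
trajectory_prob <- function(history, traj, P0, d, Tm) {
  p <- if (traj[1] == 1) P0 else 1 - P0
  for (t in seq_along(traj)) {
    x <- as.integer(strsplit(history[t], "")[[1]])
    if (t > 1) p <- p * Tm[traj[t] + 1, traj[t - 1] + 1]
    obs_p <- if (traj[t] == 1) {
      prod(d^x * (1 - d)^(1 - x))      # detections given presence
    } else {
      as.numeric(all(x == 0))          # absence allows no detections
    }
    p <- p * obs_p
  }
  p
}

# Sum over all 2^n trajectories; incompatible ones contribute zero.
history_prob <- function(history, P0, d, Tm) {
  n <- length(history)
  trajs <- as.matrix(expand.grid(rep(list(0:1), n)))
  sum(apply(trajs, 1, function(tr) trajectory_prob(history, tr, P0, d, Tm)))
}

# Illustrative values: columns are (from absent, from present).
Tm <- matrix(c(0.7, 0.3,    # T00, T10
               0.2, 0.8),   # T01, T11
             nrow = 2)
h <- c("001", "000", "101", "000", "111")
total <- history_prob(h, P0 = 0.6, d = 0.4, Tm = Tm)
```

Enumerating all trajectories costs \(2^n\) evaluations for \(n\) sampling times, which is fine for a five-sample example but illustrates why long runs of non-detections become expensive.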

The algorithm implemented in island sums over these four conditional probabilities to calculate the total probability of the detection history, \(Pr( \lbrace 001 \ 000 \ 101 \ 000 \ 111 \rbrace )\). Please note that, in real-life examples, when a species goes fully undetected for many sampling times, the full sum becomes unfeasible: the number of compatible trajectories doubles with every sampling time without detections, which rapidly leads to a combinatorial explosion. This may happen in practice if detectability per replicate is very low. In this case, only approximate likelihoods can be given. Alternatively, one could get around this problem by redesigning the survey and taking more replicates per sampling time.

As discussed in the main vignette of the package, the transition probabilities \(T_{00}, \ T_{10}, \ T_{01}, \ T_{11}\) are functions of the rates \(c\) and \(e\) for a given time interval \(dt\) between observations. Therefore, we have all the elements required to compute the likelihood of any detection history, even if time intervals between observations vary. This allows us to find maximum likelihood estimates for the four model parameters: the colonization and extinction rates, \(c\) and \(e\), the detectability, \(d\), and the probability of initial presence, \(P_0\).
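
As a rough illustration of how transition probabilities depend on the rates and on the time interval, the sketch below uses the standard closed-form solution of a two-state (absent/present) continuous-time Markov chain. The function name `transition_probs` is hypothetical, the rate values are arbitrary, and the expressions are assumed to correspond to those of the main vignette, which are not reproduced here.

```r
# Transition probabilities over an interval `dt` for colonization rate
# `c_rate` and extinction rate `e_rate` (two-state Markov chain).
transition_probs <- function(c_rate, e_rate, dt) {
  s <- c_rate + e_rate
  decay <- exp(-s * dt)
  T10 <- c_rate / s * (1 - decay)   # absent  -> present (colonization)
  T01 <- e_rate / s * (1 - decay)   # present -> absent  (extinction)
  c(T00 = 1 - T10, T10 = T10, T01 = T01, T11 = 1 - T01)
}

# Unequal intervals between sampling years only change `dt`.
years <- c(2000, 2001, 2002, 2003, 2010)
intervals <- diff(years)
probs <- t(sapply(intervals, function(dt) transition_probs(0.3, 0.2, dt)))
rownames(probs) <- paste(head(years, -1), tail(years, -1), sep = "-")
```

Each row of `probs` holds the four transition probabilities for one interval, so the long 2003-2010 gap is handled by the same function as the one-year gaps.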

In order to estimate detectability, we need to provide presence-absence data with replicated samples for the same sampling time, as in the example below, extracted from data set `lakshadweepPLUS`, where columns X2000 and X2000.1 correspond to two replicate transects sampled in the same year. In addition, the data can have groups that can be treated as levels of a factor, as in column “Guild”.

Species | Atoll | Guild | X2000 | X2000.1 | X2001 | X2001.1 | X2001.2 | X2001.3 |
---|---|---|---|---|---|---|---|---|
Acanthurus_auranticavus | AGATHI | Algal_Feeder | 1 | 1 | 1 | 0 | 1 | 0.1 |
Acanthurus_leucosternon | AGATHI | Algal_Feeder | 1 | 1 | 1 | 0 | 0 | 0.1 |
Acanthurus_lineatus | AGATHI | Algal_Feeder | 1 | 1 | 1 | 0 | 0 | 0.1 |
Acanthurus_nigrofuscus | AGATHI | Algal_Feeder | 0 | 0 | 0 | 0 | 0 | 0.1 |
Acanthurus_thompsoni | AGATHI | Zooplanktivore | 0 | 0 | 0 | 0 | 0 | 0.1 |
Acanthurus_triostegus | AGATHI | Algal_Feeder | 1 | 1 | 0 | 1 | 1 | 0.1 |
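
When columns follow the naming pattern of this table, the sampling times and the number of replicates per sampling time can be recovered from the column names. The snippet below is a small base R sketch under that assumption; the variable names are illustrative and the package does not necessarily require this step.

```r
# Column names as in the table above: X<year> with an optional .<k> suffix.
cols <- c("X2000", "X2000.1", "X2001", "X2001.1", "X2001.2", "X2001.3")

# Strip the leading "X" and any replicate suffix to recover the year.
col_times <- as.numeric(sub("^X([0-9]+).*$", "\\1", cols))

time_vector <- sort(unique(col_times))      # unique sampling times
transects   <- as.vector(table(col_times))  # replicates per sampling time
```

Here `col_times` has one entry per data column, while `time_vector` and `transects` summarise the same information per sampling time.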

Functions `sss_cedp` and `mss_cedp` allow the estimation of colonization and extinction rates with imperfect detectability for single and multiple sampling schemes, respectively. The function `sss_cedp` handles a single sampling scheme with repeated measures, which has to be specified with argument `Time`, containing the unique sampling times, and argument `Transects`, giving the number of transects per sampling time. By contrast, `mss_cedp` allows the estimation of rates with perfect or imperfect detectability for multiple sampling schemes, for the whole data set or for groups defined by a factor, using flags for missing values specified by argument `MV_FLAG`. The full sampling scheme should be specified with argument `Time`; the particular sampling scheme of each row is then derived internally from the missing value flags in the columns that were not sampled.

In the next example, we use data sets `lakshadweep` and `lakshadweepPLUS` to demonstrate the use of these functions. These data sets are extensions of data set `alonso15` and include raw data (information on up to 4 transects per atoll at each sampling time, and additional samples for 2012 and 2013). Transects are considered as replicates. `lakshadweepPLUS` differs in marking missing data with a flag and in combining the data for the three atolls in a single `data.frame`.

```
### Using sss_cedp
library(island)
Data1 <- lakshadweep[[1]]
# The leading factor columns (Species, Atoll, Guild) are not observations
Factors <- Filter(is.factor, Data1)
n <- ncol(Factors) + 1                         # first observation column
D1 <- as.matrix(Data1[, n:ncol(Data1)])
Time <- as.double(D1[1, ])                     # first row: sampling time of each column
P1 <- as.matrix(D1[2:nrow(D1), ])              # presence-absence matrix
Time_Vector <- as.numeric(names(table(Time)))  # unique sampling times
Transects <- as.numeric(table(Time))           # replicates per sampling time
R1 <- sss_cedp(P1, Time_Vector, Transects,
               Colonization = 0.5, Extinction = 0.5, Detectability = 0.5,
               Phi_Time_0 = 0.5,
               Tol = 1.0e-8, Verbose = FALSE)
knitr::kable(unlist(R1))
```

Element | Value |
---|---|
C | 0.3212542 |
E | 0.2007582 |
D | 0.4980380 |
P | 0.5082734 |
NLL | 2200.7687093 |

```
### Using mss_cedp
Data <- lakshadweepPLUS[[1]]
Guild_Tag <- c("Alg", "Cor", "Mac", "Mic", "Omn", "Pis", "Zoo") # In alphabetical order.
# One entry per data column: the sampling year of each replicate transect
Time <- c(2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002,
          2002, 2003, 2003, 2003, 2003, 2010, 2010, 2011, 2011, 2011, 2011, 2012,
          2012, 2012, 2012, 2013, 2013, 2013, 2013)
R2 <- mss_cedp(Data, Time, Factor = 3, Tags = Guild_Tag,
               PerfectDetectability = FALSE, z = 4)
#> Group 0 (Alg): NLL (Col = 0.544756, Ext = 0.167592, Dtc = 0.598818, P_0 = 0.662212) = 1399.03
#> Group 1 (Cor): NLL (Col = 0.366281, Ext = 0.216484, Dtc = 0.49887, P_0 = 0.529949) = 817.113
#> Group 2 (Mac): NLL (Col = 0.314384, Ext = 0.244689, Dtc = 0.4636, P_0 = 0.596814) = 1591.12
#> Group 3 (Mic): NLL (Col = 0.332609, Ext = 0.202035, Dtc = 0.501711, P_0 = 0.588815) = 849.614
#> Group 4 (Omn): NLL (Col = 0.184111, Ext = 0.167086, Dtc = 0.525903, P_0 = 0.426777) = 323.018
#> Group 5 (Pis): NLL (Col = 0.364651, Ext = 0.31265, Dtc = 0.371796, P_0 = 0.450578) = 712.661
#> Group 6 (Zoo): NLL (Col = 0.404582, Ext = 0.184582, Dtc = 0.558295, P_0 = 0.64373) = 619.948
```

Model selection aims to select the best model for a given phenomenon, one that describes it with a reasonable number of parameters and avoids over-fitting. Our procedure is intended to distinguish, for example, guilds or islands with different colonization and extinction dynamics. The function `upgma_model_selection` implements a model selection procedure based on the UPGMA algorithm, intended to find an optimal partition that minimizes AIC values. The algorithm needs a vector of tags in order to estimate the partition. This function allows the estimation of colonization and extinction rates with perfect or imperfect detectability.
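
To give an idea of how UPGMA produces a nested sequence of candidate partitions, the sketch below applies average-linkage clustering from base R (`hclust` with `method = "average"`) to the per-guild estimates printed by `mss_cedp` earlier in this document. Using Euclidean distance on the four estimates is an assumption made for illustration; how the package measures distances between groups is not stated here.

```r
# Per-guild estimates copied from the mss_cedp output above.
est <- rbind(
  Alg = c(0.544756, 0.167592, 0.598818, 0.662212),
  Cor = c(0.366281, 0.216484, 0.498870, 0.529949),
  Mac = c(0.314384, 0.244689, 0.463600, 0.596814),
  Mic = c(0.332609, 0.202035, 0.501711, 0.588815),
  Omn = c(0.184111, 0.167086, 0.525903, 0.426777),
  Pis = c(0.364651, 0.312650, 0.371796, 0.450578),
  Zoo = c(0.404582, 0.184582, 0.558295, 0.643730)
)
colnames(est) <- c("Col", "Ext", "Dtc", "P_0")

# Average linkage on Euclidean distances is the UPGMA algorithm.
tree <- hclust(dist(est), method = "average")

# One column per number of groups, from 1 (all guilds together) to 7.
partitions <- cutree(tree, k = 1:7)

# Guilds in each group of the five-group partition.
groups_k5 <- split(rownames(est), partitions[, 5])
```

Each column of `partitions` is one candidate partition; a model selection procedure would then fit one parameter set per group and compare the partitions by AIC. With these numbers the five-group partition puts Cor, Mic and Mac together, which agrees with the partitions reported by `upgma_model_selection`, although that agreement does not confirm how the package computes distances internally.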

The following example (using `lakshadweepPLUS`) shows the best model describing the dynamics of coral reef fishes in the Lakshadweep Archipelago, based on their guilds.

```
Data <- lakshadweepPLUS[[1]]
Guild_Tag <- c("Alg", "Cor", "Mac", "Mic", "Omn", "Pis", "Zoo")
# One entry per data column: the sampling year of each replicate transect
Time <- c(2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002,
          2002, 2003, 2003, 2003, 2003, 2010, 2010, 2011, 2011, 2011, 2011, 2012,
          2012, 2012, 2012, 2013, 2013, 2013, 2013)
R3 <- upgma_model_selection(Data, Time, Factor = 3, Tags = Guild_Tag,
                            PerfectDetectability = FALSE, z = 4)
#> Number of Columns: 28
#> Group 0 (Alg): NLL (Col = 0.544756, Ext = 0.167592, Dtc = 0.598818, P_0 = 0.662212) = 1399.03
#> Group 1 (Cor): NLL (Col = 0.366281, Ext = 0.216484, Dtc = 0.49887, P_0 = 0.529949) = 817.113
#> Group 2 (Mac): NLL (Col = 0.314384, Ext = 0.244689, Dtc = 0.4636, P_0 = 0.596814) = 1591.12
#> Group 3 (Mic): NLL (Col = 0.332609, Ext = 0.202035, Dtc = 0.501711, P_0 = 0.588815) = 849.614
#> Group 4 (Omn): NLL (Col = 0.184111, Ext = 0.167086, Dtc = 0.525903, P_0 = 0.426777) = 323.018
#> Group 5 (Pis): NLL (Col = 0.364651, Ext = 0.31265, Dtc = 0.371796, P_0 = 0.450578) = 712.661
#> Group 6 (Zoo): NLL (Col = 0.404582, Ext = 0.184582, Dtc = 0.558295, P_0 = 0.64373) = 619.948
#> Partition 0-th: Number of estimated parameters: 2
#> { Alg Cor Mac Mic Omn Pis Zoo }
#> NLL = 6395.98 AIC = 12800 AIC (corrected) = 12800 AIC_d = 127.573 AIC_w = 1.80675e-28
#> Partition 1-th: Number of estimated parameters: 4
#> { Omn Pis Zoo Cor Mic Mac } { Alg }
#> NLL = 6348.51 AIC = 12713 AIC (corrected) = 12713 AIC_d = 40.641 AIC_w = 1.36121e-09
#> Partition 2-th: Number of estimated parameters: 6
#> { Pis Zoo Cor Mic Mac } { Omn } { Alg }
#> NLL = 6345.48 AIC = 12715 AIC (corrected) = 12715 AIC_d = 42.6136 AIC_w = 5.07662e-10
#> Partition 3-th: Number of estimated parameters: 8
#> { Zoo Cor Mic Mac } { Pis } { Omn } { Alg }
#> NLL = 6324.1 AIC = 12680.2 AIC (corrected) = 12680.3 AIC_d = 7.87612 AIC_w = 0.0177308
#> Partition 4-th: Number of estimated parameters: 10
#> { Cor Mic Mac } { Zoo } { Pis } { Omn } { Alg }
#> NLL = 6316.15 AIC = 12672.3 AIC (corrected) = 12672.4 AIC_d = 0 AIC_w = 0.909928
#> Partition 5-th: Number of estimated parameters: 12
#> { Mic Mac } { Cor } { Zoo } { Pis } { Omn } { Alg }
#> NLL = 6314.83 AIC = 12677.7 AIC (corrected) = 12677.8 AIC_d = 5.39963 AIC_w = 0.0611635
#> Partition 6-th: Number of estimated parameters: 14
#> { Mac } { Mic } { Cor } { Zoo } { Pis } { Omn } { Alg }
#> NLL = 6312.5 AIC = 12681 AIC (corrected) = 12681.2 AIC_d = 8.79881 AIC_w = 0.0111781
```

The function `upgma_model_selection` also generates two output files in LaTeX format (.tex) with: a) the parameters of the best model found by the model selection procedure and b) a summary of the procedure. R Markdown equivalents of both tables are included below.

Species Group | Extinction Rate | Colonization Rate |
---|---|---|
Cor Mic Mac | 0.22788 | 0.334553 |
Zoo | 0.184582 | 0.404582 |
Pis | 0.31265 | 0.364651 |
Omn | 0.167086 | 0.184111 |
Alg | 0.167592 | 0.544756 |

Table 3 (a) shows that corallivores, microinvertivores and macroinvertivores are grouped together, while each of the other guilds has its own estimates.

Model | NLL | AIC | AIC corrected | AIC difference | AIC weights |
---|---|---|---|---|---|
2-parameter model | 6395.98 | 12800 | 12800 | 127.573 | 1.80675e-28 |
4-parameter model | 6348.51 | 12713 | 12713 | 40.641 | 1.36121e-09 |
6-parameter model | 6345.48 | 12715 | 12715 | 42.6136 | 5.07662e-10 |
8-parameter model | 6324.1 | 12680.2 | 12680.3 | 7.87612 | 0.0177308 |
10-parameter model | 6316.15 | 12672.3 | 12672.4 | 0 | 0.909928 |
12-parameter model | 6314.83 | 12677.7 | 12677.8 | 5.39963 | 0.0611635 |
14-parameter model | 6312.5 | 12681 | 12681.2 | 8.79881 | 0.0111781 |

Table 4 (b) shows the negative log-likelihood (NLL), the Akaike Information Criterion (AIC) and associated measures for the models considered in the UPGMA-based model selection procedure.
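
The AIC columns of this table can be checked by hand. The base R sketch below recomputes AIC as \(2 \cdot NLL + 2k\) and the Akaike weights from the reported AIC differences. One assumption needs flagging: the reported AIC values are reproduced when \(k\) counts four parameters per group (\(c\), \(e\), \(d\) and \(P_0\)), whereas the model labels in the table appear to count only the two rates per group. This is inferred from the numbers and not from the package documentation.

```r
# Values copied from the model selection summary table.
nll      <- c(6395.98, 6348.51, 6345.48, 6324.10, 6316.15, 6314.83, 6312.50)
aic_diff <- c(127.573, 40.641, 42.6136, 7.87612, 0, 5.39963, 8.79881)

# Assumption: four estimated parameters (c, e, d, P_0) per group.
n_groups <- 1:7
k <- 4 * n_groups
aic <- 2 * nll + 2 * k

# Akaike weights: relative likelihood of each model, normalised to sum to 1.
rel_lik <- exp(-aic_diff / 2)
aic_w <- rel_lik / sum(rel_lik)

summary_check <- data.frame(groups = n_groups, k = k, AIC = aic, weight = aic_w)
```

The five-group model carries most of the weight, which is why it is reported as the best model in Table 3 (a).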