ordinalClust is an R package for classification, clustering and co-clustering of ordinal data. It also handles variables with different numbers of levels as well as missing values. The ordinal data are assumed to follow a BOS distribution [@biernacki16], which is designed specifically for this kind of data. Co-clustering relies on the Latent Block Model [@jacques17].

```
set.seed(0)
```

```
library(ordinalClust)
```

The package contains real datasets built from [@Anota17]. They concern quality-of-life questionnaires for patients affected by breast cancer.

**dataqol** is a data.frame with 121 rows, where each row represents a patient and the columns hold information about that patient:

- Id: patient Id
- q1-q28: responses to 28 questions, each with 4 categories
- q29-q30: responses to 2 questions, each with 7 categories

**dataqol.classif** is a data.frame with 40 rows, where each row represents a patient and the columns hold information about that patient:

- Id: patient Id
- q1-q28: responses to 28 questions, each with 4 categories
- q29-q30: responses to 2 questions, each with 7 categories
- death: whether the patient died (2) or not (1)

To simulate a sample of ordinal data following the BOS distribution, the function **pejSim** is used: it gives the probability of each category, and these probabilities are then passed to **sample**.

This snippet creates a sample of ordinal data with 7 categories, that follows a BOS distribution parametrized by mu=5 and pi=0.5:

```
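# number of categories and sample size
m=7
nr=10000
# BOS parameters (note: this overwrites R's built-in constant pi)
mu=5
pi=0.5
# probability of each category under the BOS distribution
probaBOS=rep(0,m)
for (im in 1:m) probaBOS[im]=pejSim(im,m,mu,pi)
# drawing the sample from these probabilities
M <- sample(1:m,nr,prob = probaBOS, replace=TRUE)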
```

To plot the resulting distribution, the **ggplot2** library can be used.
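
As a minimal sketch of such a plot, the snippet below regenerates the sample and draws a bar chart of the category counts. The names `p`, `df`, `category` and `bos.plot`, as well as the axis labels, are illustrative choices and not part of the package.

```r
library(ordinalClust)
library(ggplot2)
set.seed(0)

# same simulation settings as in the snippet above
# (p is used instead of pi to avoid masking R's built-in constant)
m <- 7
nr <- 10000
mu <- 5
p <- 0.5
probaBOS <- sapply(1:m, function(im) pejSim(im, m, mu, p))
M <- sample(1:m, nr, prob = probaBOS, replace = TRUE)

# one row per simulated value, stored as a factor with all m levels
df <- data.frame(category = factor(M, levels = 1:m))

# bar chart of the counts; drop = FALSE keeps empty categories on the axis
bos.plot <- ggplot(df, aes(x = category)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(x = "category", y = "count")
print(bos.plot)
```

With mu=5 and pi=0.5, the tallest bar is expected at category 5, since mu plays the role of a position parameter.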

In this section, clustering is performed on the **dataqol** dataset. The purpose of clustering is to reveal a structure among the rows of the matrix.

```
set.seed(0)
library(ordinalClust)
data("dataqol")
# loading the ordinal data
M <- as.matrix(dataqol[,2:29])
# number of categories
m = 4
# number of row clusters
krow = 3
# configuration for the inference
nbSEM=100
nbSEMburn=90
nbindmini=2
init = "randomBurnin"
percentRandomB = c(30)
# clustering execution
object <- bosclust(x=M,kr=krow, m=m, nbSEM=nbSEM,
    nbSEMburn=nbSEMburn, nbindmini=nbindmini,
    percentRandomB=percentRandomB, init=init)
```

```
plot(object)
```
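
Beyond the plot, the fitted object can be inspected directly. The sketch below refits the model with the settings of the clustering snippet above and lists the slots of the result with `slotNames`. It then tabulates the cluster sizes, assuming that the row partition is stored in a slot named `zr`; this slot name is an assumption to check against the package documentation.

```r
library(ordinalClust)
set.seed(0)
data("dataqol")
M <- as.matrix(dataqol[, 2:29])

# same settings as in the clustering snippet above
object <- bosclust(x = M, kr = 3, m = 4, nbSEM = 100,
                   nbSEMburn = 90, nbindmini = 2,
                   percentRandomB = c(30), init = "randomBurnin")

# names of the slots available in the result
slots <- slotNames(object)
print(slots)

# assumption: the row partition lives in a slot named "zr"
if ("zr" %in% slots) {
  # number of patients assigned to each row cluster
  print(table(object@zr))
}
```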

In this example, co-clustering is performed on the **dataqol** dataset. Here, the purpose of co-clustering is to detect an internal structure across both the rows and the columns of the data.

```
set.seed(0)
library(ordinalClust)
# loading the real dataset
data("dataqol")
# loading the ordinal data
M <- as.matrix(dataqol[,2:29])
# defining the number of categories
m=4
# defining number of row and column clusters
krow = 3
kcol = 3
# configuration for the inference
nbSEM=100
nbSEMburn=90
nbindmini=2
init = "randomBurnin"
percentRandomB = c(30, 30)
# Co-clustering execution
object <- boscoclust(x = M,kr = krow, kc = kcol, m = m,
    nbSEM = nbSEM, nbSEMburn = nbSEMburn,
    nbindmini = nbindmini, init = init,
    percentRandomB = percentRandomB)
```

This snippet shows how to visualize the resulting co-clustering, with the **plot** function:

```
plot(object)
```

In this section, the dataset **dataqol.classif** is used. It contains the responses to a questionnaire for 40 patients affected by breast cancer. Furthermore, a column called **death** indicates whether the patient died from the disease (2) or not (1). The aim of this section is to predict the classes of a validation dataset from a training dataset.

The classification function **bosclassif** offers two classification models. The first one (selected with the option kc=0) is a multivariate BOS model which assumes that, conditionally on the class of the observations, the features are independent.
The second model is a parsimonious version of the first. Parsimony is introduced by grouping the features into clusters (as in co-clustering) and assuming that the features of a cluster share a common distribution. The number L of feature clusters is set with the option kc=L. In practice, L can be chosen by cross-validation; the following example compares several values of kc on a single training/validation split:

```
set.seed(1)
library(ordinalClust)
# loading the real dataset
data("dataqol.classif")
# loading the ordinal data
M <- as.matrix(dataqol.classif[,2:29])
# creating the classes values
y <- as.vector(dataqol.classif$death)
# sampling datasets for training and to predict
nb.sample <- ceiling(nrow(M)*7/10)
sample.train <- sample(1:nrow(M), nb.sample, replace=FALSE)
M.train <- M[sample.train,]
M.validation <- M[-sample.train,]
nb.missing.validation <- length(which(M.validation==0))
y.train <- y[sample.train]
y.validation <- y[-sample.train]
# number of classes to predict
kr <- 2
# configuration for SEM algorithm
nbSEM=200
nbSEMburn=175
nbindmini=2
init="randomBurnin"
percentRandomB = c(50, 50)
# different kc to test with cross-validation
kcol <- c(0,1,2,3)
m <- 4
# matrix which contains the predictions for all different kc
preds <- matrix(0,nrow=length(kcol),ncol=nrow(M.validation))
for(i in seq_along(kcol)){
  res <- bosclassif(x=M.train, y=y.train,
                    kr=kr, kc=kcol[i], m=m,
                    nbSEM=nbSEM, nbSEMburn=nbSEMburn,
                    nbindmini=nbindmini, init=init, percentRandomB=percentRandomB)
  new.prediction <- predict(res, M.validation)
  preds[i,] <- new.prediction@zr_topredict
}
preds = as.data.frame(preds)
# naming each row after the tested kc value
rownames(preds) <- paste0("kc=", kcol)
```

```
library(caret)
actual <- y.validation -1
specificities <- rep(0,length(kcol))
sensitivities <- rep(0,length(kcol))
for(i in seq_along(kcol)){
  prediction <- unlist(as.vector(preds[i,])) -1
  u <- union(prediction, actual)
  # confusion matrix: predictions in rows, actual values in columns
  conf_matrix<-table(factor(prediction, u),factor(actual, u))
  sensitivities[i] <- recall(conf_matrix)
  specificities[i] <- specificity(conf_matrix)
}
sensitivities
```

```
## [1] 1.0 0.5 1.0 1.0
```

```
specificities
```

```
## [1] 0.125 0.625 0.375 0.125
```
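
To make these two measures concrete, the sketch below recomputes them by hand in base R on a small made-up example. The vectors `toy.actual` and `toy.prediction` are hypothetical and unrelated to the results above, and the class coded 1 is arbitrarily taken as the positive class.

```r
# hypothetical true classes and predictions (1 = positive, 0 = negative)
toy.actual     <- c(1, 1, 0, 0, 1, 0, 0, 1)
toy.prediction <- c(1, 0, 0, 0, 1, 1, 0, 1)

# the four cells of the confusion matrix
tp <- sum(toy.prediction == 1 & toy.actual == 1)  # true positives
fn <- sum(toy.prediction == 0 & toy.actual == 1)  # false negatives
tn <- sum(toy.prediction == 0 & toy.actual == 0)  # true negatives
fp <- sum(toy.prediction == 1 & toy.actual == 0)  # false positives

# sensitivity (recall): share of actual positives that are predicted positive
toy.sensitivity <- tp / (tp + fn)
# specificity: share of actual negatives that are predicted negative
toy.specificity <- tn / (tn + fp)

print(toy.sensitivity)  # 0.75
print(toy.specificity)  # 0.75
```

Both measures depend on which class is treated as positive, so it is worth checking the order of the factor levels before comparing values across models.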

The package allows the user to deal with ordinal data that have different numbers of categories. In this section, we show how to handle this kind of dataset in the co-clustering context.

In this example, co-clustering is performed on the **dataqol** dataset, including both the questions with 4 categories and the questions with 7 categories. The function **boscoclust** is called with one value of m and kc per group of questions, together with the idx_list argument, and **it might take a few minutes**.

```
set.seed(0)
library(ordinalClust)
# loading the real dataset
data("dataqol")
# loading the ordinal data
M <- as.matrix(dataqol[,2:31])
# defining different number of categories:
m=c(4,7)
# defining number of row and column clusters
krow = 3
kcol = c(3,1)
# configuration for the inference
nbSEM=50
nbSEMburn=40
nbindmini=2
init='random'
# first column index of each group of questions (4 categories, then 7)
d.list <- c(1,29)
# Co-clustering execution
object <- boscoclust(x=M,kr=krow,kc=kcol,m=m, idx_list=d.list,
    nbSEM=nbSEM,nbSEMburn=nbSEMburn,
    nbindmini=nbindmini, init=init)
```
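
The values passed as m and idx_list can also be derived rather than typed by hand. The sketch below assumes, as in the snippet above, that idx_list holds the first column index of each group of consecutive questions sharing the same number of categories. The vector `m.per.col` is a hypothetical helper written for this illustration, not a package object.

```r
# hypothetical helper: number of categories of each column of M
# (28 questions with 4 categories, then 2 questions with 7 categories)
m.per.col <- c(rep(4, 28), rep(7, 2))

# consecutive runs of identical numbers of categories
runs <- rle(m.per.col)

# number of categories of each group
m <- runs$values
# first column index of each group
d.list <- cumsum(c(1, head(runs$lengths, -1)))

print(m)       # 4 7
print(d.list)  # 1 29
```

This only works when the columns are already ordered so that questions with the same number of categories are contiguous, which is the case for columns 2 to 31 of **dataqol**.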