In this vignette we demonstrate different ways in which longitudinal cluster models can be internally validated.
We use the latrendData dataset. This is a synthetic dataset for which the reference group of each trajectory is known, indicated by the Class column. In this vignette, however, we assume that the true group assignment and the number of groups are unknown. Instead, we have a candidate model which we wish to validate internally.
data(latrendData)
head(latrendData)
#> Id Time Y Class
#> 1 1 0.0000000 -1.08049205 Class 1
#> 2 1 0.2222222 -0.68024151 Class 1
#> 3 1 0.4444444 -0.65148373 Class 1
#> 4 1 0.6666667 -0.39115398 Class 1
#> 5 1 0.8888889 -0.19407876 Class 1
#> 6 1 1.1111111 -0.02991783 Class 1
For convenience, we set the package options to the respective id and time column names of the loaded dataset.
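A minimal sketch of this step, using the latrend.id and latrend.time options that the method summary below refers to:
# Use the Id and Time columns of latrendData as the default id and time variables
options(latrend.id = "Id", latrend.time = "Time")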
We consider a KML model with 3 clusters as the candidate model to validate. We specify the method with a reduced number of repeated random starts (the nbRedrawing argument) to limit the computation time of the repeated evaluations below; this is done purely for demonstration purposes.
kml <- lcMethodKML("Y", nClusters = 3, nbRedrawing = 5)
kml
#> lcMethodKML as "longitudinal k-means (KML)"
#> nbRedrawing: 5
#> maxIt: 200
#> imputationMethod: "copyMean"
#> distanceName: "euclidean"
#> power: 2
#> distance: function() {}
#> centerMethod: meanNA
#> startingCond: "nearlyAll"
#> nbCriterion: 1000
#> scale: TRUE
#> response: "Y"
#> time: getOption("latrend.time")
#> id: getOption("latrend.id")
#> nClusters: 3
The purpose of model validation is essentially to establish that the model is robust and generalizes well to unseen data. A necessary condition for this is that the model is reproducible on the original training data. Evaluating reproducibility helps to ensure that the model estimation procedure is robust, i.e., that it does not yield spurious solutions.
We can fit the method repeatedly to the data using the latrendRep function.
repModels <- latrendRep(kml, data = latrendData, .rep = 10)
print(repModels, excludeShared = FALSE)
#> List of 10 lcModels with
#> .name .method data seed nbRedrawing maxIt imputationMethod
#> 1 1 kml latrendData 1140350788 5 200 copyMean
#> 2 2 kml latrendData 312928385 5 200 copyMean
#> 3 3 kml latrendData 866248189 5 200 copyMean
#> 4 4 kml latrendData 1909893419 5 200 copyMean
#> 5 5 kml latrendData 554504146 5 200 copyMean
#> 6 6 kml latrendData 884616499 5 200 copyMean
#> 7 7 kml latrendData 803234389 5 200 copyMean
#> 8 8 kml latrendData 1158971242 5 200 copyMean
#> 9 9 kml latrendData 934673902 5 200 copyMean
#> 10 10 kml latrendData 1632225031 5 200 copyMean
#> distanceName power distance centerMethod
#> 1 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 2 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 3 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 4 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 5 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 6 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 7 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 8 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 9 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> 10 euclidean 2 function () {} function (x) { mean(x, na.rm = TRUE)}
#> startingCond nbCriterion scale response time id nClusters
#> 1 nearlyAll 1000 TRUE Y Time Id 3
#> 2 nearlyAll 1000 TRUE Y Time Id 3
#> 3 nearlyAll 1000 TRUE Y Time Id 3
#> 4 nearlyAll 1000 TRUE Y Time Id 3
#> 5 nearlyAll 1000 TRUE Y Time Id 3
#> 6 nearlyAll 1000 TRUE Y Time Id 3
#> 7 nearlyAll 1000 TRUE Y Time Id 3
#> 8 nearlyAll 1000 TRUE Y Time Id 3
#> 9 nearlyAll 1000 TRUE Y Time Id 3
#> 10 nearlyAll 1000 TRUE Y Time Id 3
A convenient way to assess the stability across repeated runs is to compare the models on one or more internal model metrics. Similar solutions should yield similar metric scores.
repSelfMetrics <- metric(repModels, name = c("BIC", "WMAE", "APPA"))
head(repSelfMetrics)
#> BIC WMAE APPA
#> [1,] 655.5420 0.06418772 0.9865047
#> [2,] 655.5420 0.06418772 0.9865047
#> [3,] 655.5420 0.06418772 0.9865047
#> [4,] 655.0419 0.06416027 0.9865672
#> [5,] 655.5420 0.06418772 0.9865047
#> [6,] 655.0419 0.06416027 0.9865672
summary(repSelfMetrics[, "WMAE"])
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.06416 0.06416 0.06416 0.06417 0.06419 0.06419
As can be seen from the numbers, the models are (practically) identical in terms of model fit, measurement error, and cluster separation.
Alternatively, we can select the model with the best fit, and compare it against the other fitted models.
bestRepModel <- min(repModels, "BIC")
externalMetric(repModels, bestRepModel, name = "adjustedRand")
#> [1] 1 1 1 1 1 1 1 1 1 1
As indicated by the adjusted Rand index, the models are in full agreement on the assignment of trajectories to clusters (a score of 1 indicates perfect agreement). Had the repeated runs differed on one or two trajectories, the scores would be slightly below 1.
Similarly, we can compute the pairwise adjusted Rand indices between all fitted models, resulting in a similarity matrix.
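A minimal sketch of this computation, assuming externalMetric() also supports pairwise comparison when given only a list of models:
simMat <- externalMetric(repModels, name = "adjustedRand")
# Convert the pairwise similarities to a symmetric matrix for inspection
round(as.matrix(simMat), 2)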
Next, we assess the replicability of the method by fitting it on bootstrap samples of the original dataset, using latrendBoot.
bootModels <- latrendBoot(kml, data = latrendData, samples = 10)
bootModels
#> List of 10 lcModels with
#> .name .method data
#> 1 1 kml bootSample(latrendData, "Id", 64231719L)
#> 2 2 kml bootSample(latrendData, "Id", 893996438L)
#> 3 3 kml bootSample(latrendData, "Id", 1434113967L)
#> 4 4 kml bootSample(latrendData, "Id", 958577579L)
#> 5 5 kml bootSample(latrendData, "Id", 2079738042L)
#> 6 6 kml bootSample(latrendData, "Id", 2012583691L)
#> 7 7 kml bootSample(latrendData, "Id", 520205446L)
#> 8 8 kml bootSample(latrendData, "Id", 648143680L)
#> 9 9 kml bootSample(latrendData, "Id", 2127214623L)
#> 10 10 kml bootSample(latrendData, "Id", 882537923L)
bootMetrics <- metric(bootModels, name = c("BIC", "WMAE", "APPA"))
bootMetrics
#> BIC WMAE APPA
#> [1,] 386.5510 0.06249879 0.9918266
#> [2,] 430.5063 0.06337630 0.9904221
#> [3,] 389.1990 0.06258588 0.9847272
#> [4,] 544.3614 0.06614531 0.9891894
#> [5,] 480.6107 0.06621823 0.9952108
#> [6,] 461.5246 0.06367826 0.9888881
#> [7,] 410.5406 0.06151990 0.9874232
#> [8,] 372.2148 0.06216186 0.9888215
#> [9,] 466.8833 0.06500594 0.9884566
#> [10,] 410.7789 0.06232585 0.9897330
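As an illustrative next step, the stability of the method could be gauged from the spread of these metrics across the bootstrap fits:
# Standard deviation of each internal metric across the bootstrap samples
apply(bootMetrics, 2, sd)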
Lastly, we can fit models using k-fold cross-validation, validating each model on the previously unseen data of its respective test fold.
trainModels <- latrendCV(kml, data = latrendData, folds = 10, seed = 1)
trainModels
#> List of 10 lcModels with
#> .name .method data
#> 1 1 kml trainFold(latrendData, fold = 1, "Id", 10, 1)
#> 2 2 kml trainFold(latrendData, fold = 2, "Id", 10, 1)
#> 3 3 kml trainFold(latrendData, fold = 3, "Id", 10, 1)
#> 4 4 kml trainFold(latrendData, fold = 4, "Id", 10, 1)
#> 5 5 kml trainFold(latrendData, fold = 5, "Id", 10, 1)
#> 6 6 kml trainFold(latrendData, fold = 6, "Id", 10, 1)
#> 7 7 kml trainFold(latrendData, fold = 7, "Id", 10, 1)
#> 8 8 kml trainFold(latrendData, fold = 8, "Id", 10, 1)
#> 9 9 kml trainFold(latrendData, fold = 9, "Id", 10, 1)
#> 10 10 kml trainFold(latrendData, fold = 10, "Id", 10, 1)
Alternatively, we can generate the training data folds ourselves and fit models on them using latrendBatch.
dataFolds <- createTrainDataFolds(latrendData, folds = 10)
foldModels <- latrendBatch(kml, data = dataFolds)
foldModels
#> List of 10 lcModels with
#> .name .method data
#> 1 1 kml dataFolds[[1]]
#> 2 2 kml dataFolds[[2]]
#> 3 3 kml dataFolds[[3]]
#> 4 4 kml dataFolds[[4]]
#> 5 5 kml dataFolds[[5]]
#> 6 6 kml dataFolds[[6]]
#> 7 7 kml dataFolds[[7]]
#> 8 8 kml dataFolds[[8]]
#> 9 9 kml dataFolds[[9]]
#> 10 10 kml dataFolds[[10]]
The corresponding list of test data folds is obtained using the createTestDataFolds function.
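A minimal sketch, assuming createTestDataFolds() accepts the original dataset together with the list of training folds:
# Complementary test data for each of the training folds generated above
testDataFolds <- createTestDataFolds(latrendData, dataFolds)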