The **philentropy** package has several mechanisms to calculate distances between probability density functions. The main one is to use the the `distance()`

function, which enables to compute 46 different distances/similarities between probability density functions (see `?philentropy::distance`

and a companion vignette for details). Alternatively, it is possible to call each distance/dissimilarity function directly. For example, the `euclidean()`

function will compute the euclidean distance, while `jaccard`

- the Jaccard distance. The complete list of available distance measures are available with the `philentropy::getDistMethods()`

function.

Both of the above approaches have their pros and cons. The `distance()`

function is more flexible as it allows users to use any distance measure and can return either a `matrix`

or a `dist`

object. It also has several defensive programming checks implemented, and thus, it is more appropriate for regular users. Single distance functions, such as `euclidean()`

or `jaccard()`

, can be, on the other hand, slightly faster as they directly call the underlining C++ code.

Now, we introduce three new low-level functions that are intermediaries between `distance()`

and single distance functions. They are fairly flexible, allowing to use of any implemented distance measure, but also usually faster than calling the `distance()`

functions (especially, if it is needed to use many times). These functions are:

`dist_one_one()`

- expects two vectors (probability density functions), returns a single value`dist_one_many()`

- expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values`dist_many_many()`

- expects two matrices (two sets of probability density functions), returns a matrix of values

Let’s start testing them by attaching the **philentropy** package.

`library(philentropy)`

`dist_one_one()`

`dist_one_one()`

is a lower level equivalent to `distance()`

. However, instead of accepting a numeric `data.frame`

or `matrix`

, it expects two vectors representing probability density functions. In this example, we create two vectors, `P`

and `Q`

.

```
<- 1:10 / sum(1:10)
P <- 20:29 / sum(20:29) Q
```

To calculate the euclidean distance between them we can use several approaches - (a) build-in R `dist()`

function, (b) `philentropy::distance()`

, (c) `philentropy::euclidean()`

, or the new `dist_one_one()`

.

```
# install.packages("microbenchmark")
::microbenchmark(
microbenchmarkdist(rbind(P, Q), method = "euclidean"),
distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
euclidean(P, Q, FALSE),
dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr
## dist(rbind(P, Q), method = "euclidean")
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE)
## euclidean(P, Q, FALSE)
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 20.581 21.7560 25.74881 23.1155 23.8055 284.644 100
## 28.083 28.7445 51.33078 29.8440 31.6055 2132.509 100
## 2.509 2.7240 3.39208 3.0130 3.5095 8.403 100
## 3.708 4.0495 6.02215 4.7150 5.0955 117.806 100
```

All of them return the same, single value. However, as you can see in the benchmark above, some are more flexible, and others are faster.

`dist_one_many()`

The role of `dist_one_many()`

is to calculate distances between one probability density function (in a form of a `vector`

) and a set of probability density functions (as rows in a `matrix`

).

Firstly, let’s create our example data.

```
set.seed(2020-08-20)
<- 1:10 / sum(1:10)
P <- t(replicate(100, sample(1:10, size = 10) / 55)) M
```

`P`

is our input vector and `M`

is our input matrix.

Distances between the `P`

vector and probability density functions in `M`

can be calculated using several approaches. For example, we could write a `for`

loop (adding a new code) or just use the existing `distance()`

function and extract only one row (or column) from the results. The `dist_one_many()`

allows for this calculation directly as it goes through each row in `M`

and calculates a given distance measure between `P`

and values in this row.

```
# install.packages("microbenchmark")
::microbenchmark(
microbenchmarkas.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1]
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 284.485 370.5515 456.4489 448.7075 533.3575 776.544 100
## 23620.943 25740.1705 29194.4739 27281.6785 32796.1225 45305.363 100
## 22.739 28.7505 34.3501 34.4265 37.6905 124.308 100
```

The `dist_one_many()`

returns a vector of values. It is, in this case, much faster than `distance()`

, and visibly faster than `dist()`

while allowing for more possible distance measures to be used.

`dist_many_many()`

`dist_many_many()`

calculates distances between two sets of probability density functions (as rows in two `matrix`

objects).

Let’s create two new `matrix`

example data.

```
set.seed(2020-08-20)
<- t(replicate(10, sample(1:10, size = 10) / 55))
M1 <- t(replicate(10, sample(1:10, size = 10) / 55)) M2
```

`M1`

is our first input matrix and `M2`

is our second input matrix. I am not aware of any function build-in R that allows calculating distances between rows of two matrices, and thus, to solve this problem, we can create our own - `many_dists()`

…

```
= function(m1, m2){
many_dists = matrix(nrow = nrow(m1), ncol = nrow(m2))
r for (i in seq_len(nrow(m1))){
for (j in seq_len(nrow(m2))){
= rbind(m1[i, ], m2[j, ])
x = distance(x, method = "euclidean", mute.message = TRUE)
r[i, j]
}
}
r }
```

… and compare it to `dist_many_many()`

.

```
# install.packages("microbenchmark")
::microbenchmark(
microbenchmarkmany_dists(M1, M2),
dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr min
## many_dists(M1, M2) 2310.314
## dist_many_many(M1, M2, method = "euclidean", testNA = FALSE) 37.081
## lq mean median uq max neval
## 2569.2030 2878.19842 2668.8310 2813.3190 12068.858 100
## 42.0015 97.64644 45.9455 49.4415 5060.589 100
```

Both `many_dists()`

and `dist_many_many()`

return a matrix. The above benchmark concludes that `dist_many_many()`

is about 30 times faster than our custom `many_dists()`

approach.