**NOTE**: The `pmml` package referenced in this vignette is assumed to be version `1.5.7`. Starting with `pmml 2.0.0`, functions from `pmmlTransformations` have been merged into `pmml`. The examples have (commented-out) calls to functions from `pmml`; if using `pmmlTransformations`, use `pmml 1.5.7` or older.

For an updated version of this vignette, see the latest `pmml` package.

This vignette provides examples of how to use the `FunctionXform` transformation to create new data features for PMML models.

Given a `WrapData` object and a transformation expression, `FunctionXform` calculates data for a new feature and creates a new `WrapData` object. When PMML is produced with `pmml::pmml()`, the transformation is inserted into the `LocalTransformations` node as a `DerivedField`.

`FunctionXform` makes it possible to use multiple data fields and functions to produce a new feature.

While `FunctionXform` is part of the `pmmlTransformations` package, the code to produce PMML from R is in the `pmml` package. The following examples assume that both of these packages are installed and loaded. The `kable` function is part of `knitr` and is used to make tables more readable.

Using the `iris` dataset as an example, let’s construct a new feature by transforming one variable. Load the dataset and show the first few lines:
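A minimal sketch of this step using only base R. The variable name `first_rows` is illustrative, and the original presumably renders the rows with `kable`:

```r
# iris ships with base R (datasets package), so nothing extra needs installing
data(iris)

# Keep the first three rows; with knitr loaded, kable(first_rows) would
# render the same rows as a formatted table
first_rows <- head(iris, 3)
print(first_rows)
```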

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa |

4.9 | 3.0 | 1.4 | 0.2 | setosa |

4.7 | 3.2 | 1.3 | 0.2 | setosa |

Create the `irisBox` object with `WrapData`:

`irisBox` contains the data and transform information that will be used to produce PMML later. The original data is in `irisBox$data`. Any new features created with a transformation are added as columns to this data frame.

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa |

4.9 | 3.0 | 1.4 | 0.2 | setosa |

4.7 | 3.2 | 1.3 | 0.2 | setosa |

Transform and field information is in `irisBox$fieldData`. The `fieldData` data frame contains information on every field in the dataset, as well as every transform used. The `functionXform` column contains expressions used in the `FunctionXform` transform.

| | type | dataType | origFieldName | sampleMin | sampleMax | xformedMin | xformedMax | centers | scales | fieldsMap | transform | default | missingValue | functionXform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Sepal.Length | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |

Sepal.Width | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |

Petal.Length | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |

Petal.Width | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |

Species | original | factor | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |

Now add a new feature, `Sepal.Length.Sqrt`, using `FunctionXform`:

```
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length",
newFieldName="Sepal.Length.Sqrt",
formulaText="sqrt(Sepal.Length)")
```

The new feature is calculated and added as a column to the `irisBox$data` data frame:

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.Sqrt |
---|---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.258318 |

4.9 | 3.0 | 1.4 | 0.2 | setosa | 2.213594 |

4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.167948 |

`irisBox$fieldData` now contains a new row with the transformation expression:

| | type | dataType | origFieldName | functionXform |
---|---|---|---|---|

Sepal.Length.Sqrt | derived | numeric | Sepal.Length | sqrt(Sepal.Length) |

Construct a linear model for `Petal.Width` using this new feature:

```
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=irisBox$data)
# Convert to PMML:
# fit_pmml <- pmml(fit, transform=irisBox)
```

Since the model predicts `Petal.Width` using a variable based on `Sepal.Length`, the PMML will contain these two fields in the `DataDictionary` and `MiningSchema`:

The `LocalTransformations` node contains `Sepal.Length.Sqrt` as a derived field:

`FunctionXform` can also operate on categorical data. In this example, let’s create a boolean feature that equals 1 only when `Species` is `setosa`:

```
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Species",
newFieldName="Species.Setosa",
formulaText="if (Species == 'setosa') {1} else {0}")
kable(head(irisBox$data,3))
```

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Species.Setosa |
---|---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |

4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |

4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |

Create a linear model and check the `LocalTransformations` node:
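The model-fitting step can be sketched in base R as follows. The data frame `df` and the `ifelse` call are stand-ins for `irisBox$data` and are assumptions made for illustration; the commented-out lines follow the pattern used elsewhere in this vignette.

```r
# Stand-in for irisBox$data: recreate the derived column with base R
df <- iris
df$Species.Setosa <- ifelse(df$Species == "setosa", 1, 0)

# Fit a linear model on the derived feature
fit <- lm(Petal.Width ~ Species.Setosa, data = df)
print(coef(fit))

# With pmml loaded and irisBox built as shown above, the conversion would be:
# fit_pmml <- pmml(fit, transform=irisBox)
# fit_pmml[[3]][[3]]  # indexing used later in this vignette to view the transformation
```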

It is possible to create new features by combining several fields. Let’s create a new field from the ratio of sepal and petal lengths:

```
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length",
newFieldName="Length.Ratio",
formulaText="Sepal.Length / Petal.Length")
```

As before, the new field is added as a column to the `irisBox$data` data frame:

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length.Ratio |
---|---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa | 3.642857 |

4.9 | 3.0 | 1.4 | 0.2 | setosa | 3.500000 |

4.7 | 3.2 | 1.3 | 0.2 | setosa | 3.615385 |

Fit a linear model using this new feature:

```
fit <- lm(Petal.Width ~ Length.Ratio, data=irisBox$data)
# Convert to pmml:
# fit_pmml <- pmml(fit, transform=irisBox)
```

The PMML will contain `Sepal.Length` and `Petal.Length` in the `DataDictionary` and `MiningSchema`, since these were used in `FunctionXform`:

The `LocalTransformations` node contains `Length.Ratio` as a derived field:

It is possible to pass a feature derived with `FunctionXform` to another `FunctionXform` call. To do this, the second call to `FunctionXform` must use the original data field names (instead of the derived field) in the `origFieldName` argument.

```
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length",
newFieldName="Length.Ratio",
formulaText="Sepal.Length / Petal.Length")
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width",
newFieldName="Length.R.Times.S.Width",
formulaText="Length.Ratio * Sepal.Width")
kable(irisBox$fieldData[6:7,c(1:3,14)])
```

| | type | dataType | origFieldName | functionXform |
---|---|---|---|---|

Length.Ratio | derived | numeric | Sepal.Length,Petal.Length | Sepal.Length / Petal.Length |

Length.R.Times.S.Width | derived | numeric | Sepal.Length,Petal.Length,Sepal.Width | Length.Ratio * Sepal.Width |

```
fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data=irisBox$data)
# Convert to pmml:
# fit_pmml <- pmml(fit, transform=irisBox)
```

The PMML will contain `Sepal.Length`, `Petal.Length`, and `Sepal.Width` in the `DataDictionary` and `MiningSchema`, since these were used in `FunctionXform`:

The `LocalTransformations` node contains `Length.Ratio` and `Length.R.Times.S.Width` as derived fields:

The following R functions and operators are directly supported by `FunctionXform`. Their PMML equivalents are listed in the second column:

R | PMML |
---|---|
`+` | `+` |
`-` | `-` |
`/` | `/` |
`*` | `*` |
`^` | `pow` |
`<` | `lessThan` |
`<=` | `lessOrEqual` |
`>` | `greaterThan` |
`>=` | `greaterOrEqual` |
`&&` | `and` |
`&` | `and` |
`\|` | `or` |
`\|\|` | `or` |
`==` | `equal` |
`!=` | `notEqual` |
`!` | `not` |
`ceiling` | `ceil` |
`prod` | `product` |
`log` | `ln` |

For these functions, no extra code is required for translation.

The R function `prod` can be used as long as only numeric arguments are specified. That is, `prod` can take an `na.rm` argument, but specifying this in `FunctionXform` directly will not produce PMML equivalent to the R expression.

Similarly, the R function `log` can be used directly as long as the second argument (the base) is not specified.
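The two caveats are easier to see with concrete values. The base R sketch below only shows how the extra arguments change R's own results; the variable names are illustrative, and it does not produce or inspect any PMML.

```r
# log() with a single argument is the natural logarithm (PMML: ln)
natural <- log(exp(2))

# Supplying a base computes a different logarithm, which a plain ln does not capture
base_two <- log(8, base = 2)

# prod() with only numeric arguments is a plain product (PMML: product)
plain <- prod(2, 3, 4)

# na.rm changes how R treats missing values; per the text above,
# this option does not carry over to the generated PMML
with_na_rm <- prod(2, NA, na.rm = TRUE)
without_na_rm <- prod(2, NA)

print(c(natural, base_two, plain, with_na_rm, without_na_rm))
```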

Some built-in PMML functions cannot be produced directly with `FunctionXform` as described above. In this case, an error is thrown when R tries to calculate the new feature using the function passed to `FunctionXform` but does not find that function in the environment.

It is still possible to make `FunctionXform` work, but the PMML function must be defined in the R environment first.

Let’s use `isIn`, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. A detailed specification for this function is available from the DMG (Data Mining Group).

One way to implement this in R is by using `%in%`, with the list of values being represented by `...`:

```
isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
isIn(1,2,1,4)
#> [1] TRUE
```
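Because the `if` condition expects a single logical value, this definition suits evaluating one row at a time. As a rough check outside `FunctionXform`, the sketch below applies it element by element with base R's `vapply`; the variable name `in_set` is only illustrative.

```r
# Same definition as above: TRUE when x is among the remaining arguments
isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

# Apply the scalar function to each Species value in turn
in_set <- vapply(as.character(iris$Species), isIn, logical(1),
                 "setosa", "versicolor", USE.NAMES = FALSE)

# Count how many rows fall inside and outside the set
print(table(in_set))
```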

This function can now be passed to `FunctionXform`. The following code creates a feature that indicates whether `Species` is either `setosa` or `versicolor`:

```
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Species",
newFieldName="Species.Setosa.or.Versicolor",
formulaText="isIn(Species,'setosa','versicolor')")
```

The `data` data frame now contains the new feature:

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Species.Setosa.or.Versicolor |
---|---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa | TRUE |

4.9 | 3.0 | 1.4 | 0.2 | setosa | TRUE |

4.7 | 3.2 | 1.3 | 0.2 | setosa | TRUE |

Create a linear model and view the corresponding PMML for the function:

As another example, let’s use R’s `mean` function to create a new feature. PMML has a built-in `avg`, so we will define an R function with this name.
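A minimal sketch of such a function, assuming it only needs to average the numeric arguments it receives:

```r
# R stand-in for PMML's built-in avg: the mean of all arguments
avg <- function(...) {
  dots <- c(...)
  mean(dots)
}

# Check against the first iris row:
# Sepal.Length = 5.1, Petal.Length = 1.4, Sepal.Width = 3.5
first_row_value <- avg(5.1, 1.4) / 3.5
print(first_row_value)
```

The result for the first row should agree with the `Length.Average.Ratio` value shown in the table further down.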

Now use this function to take an average of several other features and combine with another field:

```
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width",
newFieldName="Length.Average.Ratio",
formulaText="avg(Sepal.Length,Petal.Length)/Sepal.Width")
```

The `data` data frame now contains the new feature:

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length.Average.Ratio |
---|---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.9285714 |

4.9 | 3.0 | 1.4 | 0.2 | setosa | 1.0500000 |

4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.9375000 |

Create a simple linear model and view the corresponding PMML for the function:

```
fit <- lm(Petal.Width ~ Length.Average.Ratio, data=irisBox$data)
# fit_pmml <- pmml(fit, transform=irisBox)
# fit_pmml[[3]][[3]]
```

In the PMML, `avg` will be recognized as a valid function.

The function `functionToPMML` (part of the `pmml` package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values.

As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Variables in the expression passed to `FunctionXform` are always assumed to be field names, and are not substituted. That is, even if `x` has a value in the R environment, the resulting expression will still use `x`.
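This mirrors how R separates parsing from evaluation. The base R sketch below is not the package's implementation, only an illustration under that assumption: parsing keeps `x` as a symbol, and only an explicit `eval` substitutes its value.

```r
x <- 10                              # x has a value in the environment
expr <- parse(text = "x + 1")[[1]]   # parse only; nothing is evaluated

vars <- all.vars(expr)     # names referenced by the expression
as_text <- deparse(expr)   # text form still contains the symbol x
value <- eval(expr)        # evaluation is the step that substitutes the value

print(vars)
print(as_text)
print(value)
```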

There are several limitations to parsing expressions in `FunctionXform`.

Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in `FunctionXform`.
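The distinction can be illustrated in base R. In this sketch (variable names are illustrative), the first computation needs only values from the same row, while the second needs every row of the column at once:

```r
# Row-wise: each result depends only on values in the same row
row_avg <- (iris$Sepal.Length + iris$Petal.Length) / 2
print(head(row_avg, 3))

# Column-wise: a single value computed from all rows of the column,
# which a row-at-a-time transformation cannot express
col_mean <- mean(iris$Sepal.Length)
print(col_mean)
```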

An expression such as `foo(x)` is treated as a function `foo` with argument `x`. Consequently, passing in an R vector `c(1,2,3)` will produce PMML where `c` is a function and `1,2,3` are the arguments:
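R's own parser views the expression the same way, which a short base R sketch can show. This illustrates R's parse tree only, not the PMML output or the package's internals.

```r
expr <- parse(text = "c(1,2,3)")[[1]]

is_call <- is.call(expr)             # the expression is a function call
fun_name <- as.character(expr[[1]])  # name of the function being called
fun_args <- as.list(expr)[-1]        # its arguments, one list element each

print(is_call)
print(fun_name)
print(fun_args)
```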

We can also see what happens when passing an `na.rm` argument to `prod`, as mentioned above:

```
# functionToPMML("prod(1,2,na.rm=FALSE)") #produces incorrect PMML
# functionToPMML("prod(1,2)") #produces correct PMML
```

Additionally, passing in a vector to `prod` produces incorrect PMML:

The following are additional examples of PMML produced from R expressions.

Extra parentheses:

If-else expressions: