Cox regression is not well suited to the analysis of very large data sets with many events (e.g., deaths). For instance, consider analyzing the mortality of the Swedish population aged 60–110 during the years 1968–2019, a period with more than four million deaths.

The obvious way to handle that situation is to tabulate the data and apply a
*piecewise constant hazard* function, because it is a well-known fact that any
continuous function can be approximated arbitrarily well by a step function,
simply by taking small enough steps.
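The approximation claim can be illustrated numerically. The following is a minimal sketch in base R, assuming a Gompertz-shaped hazard with made-up parameters (the function `gompertz` and the helper `max_step_error` are hypothetical and not part of `eha`). It measures the largest gap between the smooth curve and a left-continuous step function for shrinking step widths.

```r
# Hypothetical smooth hazard: a Gompertz curve (parameters for illustration only).
gompertz <- function(t) 2e-5 * exp(0.1 * t)

# Largest absolute gap on [from, to] between the curve and a step function
# that takes the curve's value at the left end of each step.
max_step_error <- function(width, from = 60, to = 100) {
  grid <- seq(from, to, by = 0.001)
  breaks <- seq(from, to, by = width)
  left <- breaks[findInterval(grid, breaks)]  # start of the step containing each grid point
  max(abs(gompertz(grid) - gompertz(left)))
}

widths <- c(10, 5, 1, 0.1)
errors <- sapply(widths, max_step_error)
print(data.frame(width = widths, max_error = errors))
```

The maximal error shrinks as the steps get narrower, which is the property the tabular approach relies on.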

The data sets `swepop` and `swedeaths` in `eha` contain age- and sex-specific
yearly information on population size and number of deaths,
respectively. They both cover the full *Swedish population* for the years 1968–2019.

The first few rows of each:

`head(swepop)`

```
## age sex year pop id
## 1.1968 0 men 1968 59280 1
## 2.1968 0 women 1968 56134 2
## 3.1968 1 men 1968 62298 3
## 4.1968 1 women 1968 58722 4
## 5.1968 2 men 1968 62602 5
## 6.1968 2 women 1968 59126 6
```

`head(swedeaths)`

```
## age sex year deaths id
## 1.1968 0 men 1968 783 1
## 2.1968 0 women 1968 566 2
## 3.1968 1 men 1968 103 3
## 4.1968 1 women 1968 66 4
## 5.1968 2 men 1968 48 5
## 6.1968 2 women 1968 31 6
```

The odd-looking row names and the column `id` are created by the function `reshape`,
which was used to transform the original tables, given in *wide format*, to
*long format*. In the original data, downloaded from
Statistics Sweden, the population size refers to the last
day, December 31, of the given year, but here it refers to the average of that
value and the corresponding one for the previous year. In that way we get an estimate
of the number of *person years*, which allows us to consider the number of
*occurrences* and the *exposure time* in each cell of the data. This information will
allow us to fit *proportional hazards* survival models. So we start by joining
the two data sets and removing irrelevant columns:

```
dat <- swepop[, c("age", "sex", "year", "pop")]
dat$deaths <- swedeaths$deaths
rownames(dat) <- 1:NROW(dat) # Simplify rownames.
head(dat)
```

```
## age sex year pop deaths
## 1 0 men 1968 59280 783
## 2 0 women 1968 56134 566
## 3 1 men 1968 62298 103
## 4 1 women 1968 58722 66
## 5 2 men 1968 62602 48
## 6 2 women 1968 59126 31
```

`tail(dat)`

```
## age sex year pop deaths
## 10499 98 men 2019 596 314
## 10500 98 women 2019 2121 846
## 10501 99 men 2019 320 230
## 10502 99 women 2019 1308 638
## 10503 100 men 2019 368 248
## 10504 100 women 2019 1768 1005
```

We note that the age column ends with `age == 100`, which in fact means
`age >= 100`. There are in total 4729403 observed deaths during the years
1968–2019, or 90950 deaths per
year on average. There are 101 age groups, two sexes, and 52 years, in all 10504
cells (rows in the data frame).

Assuming a piecewise constant hazards model on the 101 age groups, we can fit a
proportional hazards model by *Poisson regression*, utilizing the fact that the two
likelihood functions are identical up to a factor that does not depend on the
parameters. In **R**, we use `glm`.

```
fit.glm <- glm(deaths ~ offset(log(pop)) + I(year - 2000) + sex +
factor(age), data = dat, family = poisson)
summary(fit.glm)$coefficients[2:3, ]
```

```
## Estimate Std. Error z value Pr(>|z|)
## I(year - 2000) -0.01565 0.0000315 -496.9 0
## sexmen 0.45523 0.0009324 488.2 0
```
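The equivalence used above can be checked directly. As a sketch, in notation that is not part of the original text: let $d_j$ be the number of deaths and $T_j$ the person years in cell $j$, and let $\lambda_j$ be the hazard in that cell, constant within the cell and determined by the regression parameters. The piecewise constant hazards likelihood and the Poisson likelihood with mean $\lambda_j T_j$ are then

$$
L_{\mathrm{PCH}} = \prod_j \lambda_j^{d_j} e^{-\lambda_j T_j}, \qquad
L_{\mathrm{Pois}} = \prod_j \frac{(\lambda_j T_j)^{d_j} e^{-\lambda_j T_j}}{d_j!}
= L_{\mathrm{PCH}} \prod_j \frac{T_j^{d_j}}{d_j!}.
$$

The last factor is free of parameters, so both likelihoods are maximized by the same estimates. Since $\log(\lambda_j T_j) = \log T_j + \log \lambda_j$, the exposure enters the Poisson model as the term `offset(log(pop))`.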

The 101 coefficients corresponding to the intercept and the age factor can be
used to estimate the *hazard function*: the intercept, -5.7268,
is the log of the hazard in the age interval 0–1 (for women in the year 2000, the
reference values of the covariates), and the remaining coefficients are differences
from that value, so we can reconstruct the baseline hazard by

```
lhaz <- coefficients(fit.glm)[-(2:3)]
n <- length(lhaz)
lhaz[-1] <- lhaz[-1] + lhaz[1]
haz <- exp(lhaz)
```

and plot the result, see Figure 1.1.

```
oldpar <- par(las = 1, lwd = 1.5, mfrow = c(1, 2))
plot(0:(n-1), haz, type = "s", main = "log(hazards)",
xlab = "Age", ylab = "", log = "y")
plot(0:(n-1), haz, type = "s", main = "hazards",
xlab = "Age", ylab = "Deaths / Year")
```
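As a sanity check on the reconstruction of the hazards from the `glm` coefficients, here is a small self-contained sketch on a made-up table (the data frame `toy` and its numbers are invented for illustration). With only an age factor in the model, the reconstructed hazards should agree with the raw occurrence/exposure rates, deaths divided by person years.

```r
# Made-up occurrence/exposure table: three age bands, no other covariates.
toy <- data.frame(age = factor(c("60-70", "70-80", "80-90")),
                  deaths = c(12, 30, 45),
                  pop = c(1000, 800, 300))

fit.toy <- glm(deaths ~ offset(log(pop)) + age, data = toy, family = poisson)

# Same reconstruction as above: add the intercept to the age contrasts,
# then exponentiate to get hazards.
lhaz.toy <- coefficients(fit.toy)
lhaz.toy[-1] <- lhaz.toy[-1] + lhaz.toy[1]
haz.toy <- unname(exp(lhaz.toy))

# Compare with deaths per person year in each age band.
print(cbind(glm = haz.toy, raw = toy$deaths / toy$pop))
```

In the full model the covariates `year` and `sex` shift these rates proportionally, so the agreement with raw rates is exact only in this covariate-free case.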

While it is straightforward to use `glm` and Poisson regression to fit the model, it
takes some effort to get it right. That is the reason for the creation of
the function `tpchreg` (“Tabular Piecewise Constant Hazards REGression”),
with which the “Poisson analysis” is performed by

```
fit <- tpchreg(oe(deaths, pop) ~ I(year - 2000) + sex,
time = age, last = 101, data = dat)
```

Note:

- The function `oe` (“occurrence/exposure”) takes two arguments: the first is the number of events (deaths in our example), and the second is the exposure time, or person years.
- The argument `time` is the variable defining the time intervals. It can be either character, like `c("0-1", "1-2", …, "100-101")`, or numeric (as here). If numeric, the value refers to the start of the corresponding interval, and the next start is the end of the previous interval. This leaves the last interval’s endpoint undefined, and if not given by the `last` argument (see below), it is chosen so that the length of the last interval is one.
- The argument `last` closes the last interval, if it is not already closed, see above. The exact value of `last` is only important for plotting and for the calculation of the *restricted mean survival time* (RMST), see the summary result below.
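The interval convention for a numeric `time` variable can be spelled out in a few lines of base R. This is only a sketch of the convention as described in the notes, not the internal code of `tpchreg`; the variable names are chosen for illustration.

```r
# Interval starts, as in the age column of dat.
starts <- 0:100
last <- 101                    # closes the final, otherwise open, interval

# Each interval ends where the next one starts; the last ends at 'last'.
ends <- c(starts[-1], last)
intervals <- data.frame(start = starts, end = ends, width = ends - starts)
print(tail(intervals, 3))
```

With `last = 101` every interval, including the final one for ages 100 and above, has length one.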

`summary(fit)`

```
## Covariate Mean Coef Rel.Risk S.E. LR p
## I(year - 2000) -5.511 -0.016 0.984 0.000 0.000
## sex 0.000
## women 0.503 0 1 (reference)
## men 0.497 0.455 1.577 0.001
##
## Events 4729403
## Total time at risk 457210264
## Max. log. likelihood -18984719
## LR test statistic 477840.10
## Degrees of freedom 2
## Overall p-value 0
##
## Restricted mean survival: 81.84 in (0, 101]
```

The restricted mean survival time is defined as the integral of the survivor
function over the given time interval. Note that if the lower limit of the interval
is larger than zero, it gives the *conditional* restricted mean survival time, given
survival to the lower endpoint.
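For a piecewise constant hazard the integral in this definition has a closed form within each interval. The following base R sketch shows this under the assumption of strictly positive hazards; the helper `rmst_pch` is hypothetical and not a function in `eha`. It checks the result against numerical integration for a constant hazard.

```r
# Restricted mean survival time for a piecewise constant hazard.
# haz[j] applies on (cuts[j], cuts[j + 1]]; all hazards are assumed > 0.
# The result is conditional on survival to cuts[1].
rmst_pch <- function(haz, cuts) {
  w <- diff(cuts)                                   # interval lengths
  # Survival probability at the start of each interval.
  S_start <- exp(-c(0, cumsum(haz * w)))[seq_along(haz)]
  # Integral of the survivor function over one interval:
  # S_start * (1 - exp(-h * w)) / h
  sum(S_start * (1 - exp(-haz * w)) / haz)
}

# Constant hazard 0.1 on (0, 10], split into ten unit intervals.
closed_form <- rmst_pch(haz = rep(0.1, 10), cuts = 0:10)
numerical <- integrate(function(t) exp(-0.1 * t), 0, 10)$value
print(c(closed_form = closed_form, numerical = numerical))
```

Because the survivor function is started at one at `cuts[1]`, a lower limit above zero automatically gives the conditional version of the measure.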

Graphs of the hazards and the log(hazards) functions are shown in Figure 1.2.

```
oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit, fn = "haz", log = "y", main = "log(hazards)",
xlab = "Age")
plot(fit, fn = "haz", log = "", main = "hazards",
xlab = "Age", ylab = "Deaths / Year")
```

The results are the same as with `glm` and Poisson regression, but the procedure is a lot simpler.

Sometimes you have a large data file in classical, individual form, suitable for
*Cox regression* with `coxreg`, but its sheer size makes that impractical, or even
impossible. Then help is at hand: tabulate the data and assume a piecewise constant
hazard function, returning to the method in the previous section, that is, using
`tpchreg`.

The helper function is `toTpch`, and we illustrate its use on the `oldmort` data
frame:

`head(oldmort[, c("enter", "exit", "event", "sex", "civ", "birthdate")])`

```
## enter exit event sex civ birthdate
## 1 94.51 95.81 TRUE female widow 1765
## 2 94.27 95.76 TRUE female unmarried 1766
## 3 91.09 91.95 TRUE female widow 1769
## 4 89.01 89.59 TRUE female widow 1771
## 5 90.00 90.21 TRUE female widow 1770
## 6 88.43 89.76 TRUE female widow 1772
```

```
oldmort$birthyear <- floor(oldmort$birthdate) - 1800
om <- toTpch(Surv(enter, exit, event) ~ sex + civ + birthyear,
cuts = seq(60, 100, by = 2), data = oldmort)
head(om)
```

```
## sex civ birthyear age event exposure
## 1 male unmarried -2 60-62 0 0.578
## 2 female unmarried -2 60-62 0 4.109
## 3 male married -2 60-62 0 17.366
## 4 female married -2 60-62 0 14.129
## 5 male widow -2 60-62 0 0.148
## 6 female widow -2 60-62 0 5.805
```

Note two things:

- The creation of a new variable, `birthyear`. The original `birthdate` is given with a precision of days and contains 3570 unique values, which would contribute to creating a very large table, so the transformation gives *birth year* with 52 unique values. Further, the new variable is given a *reference value* of 1800 by subtraction, which is necessary so that the baseline does not coincide with the birth of Christ. This mainly affects the plotting of the estimated survivor function; the regression parameter estimates are unaffected, as long as *no interaction effect* involving `birthyear` is present.
- The length of the time pieces is set to two years, in order to avoid empty intervals in the upper age range. This choice has only a marginal effect on the final results of the analyses. You can try it out yourself. Note that it is not necessary to use equidistant cut points; they are chosen here only for convenience.
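To make the tabulation step concrete, here is a minimal base R sketch of how a single observation spell can be split at the cut points into exposure time and events per age band. The helper `split_spell` is hypothetical and written for illustration only; the actual implementation of `toTpch` may differ.

```r
# Hypothetical helper (not part of eha): split one spell (enter, exit] at the
# cut points and return the event count and exposure time per age band.
split_spell <- function(enter, exit, event, cuts) {
  lower <- head(cuts, -1)
  upper <- cuts[-1]
  # Overlap of (enter, exit] with each band; zero where they do not intersect.
  exposure <- pmax(0, pmin(exit, upper) - pmax(enter, lower))
  # The event, if any, is counted in the band that contains the exit time.
  died <- as.integer(event & exit > lower & exit <= upper)
  out <- data.frame(age = paste(lower, upper, sep = "-"),
                    event = died, exposure = exposure)
  out[out$exposure > 0, ]
}

cuts <- seq(60, 100, by = 2)
# A made-up spell: under observation from age 61.2 until death at age 65.5.
pieces <- split_spell(enter = 61.2, exit = 65.5, event = TRUE, cuts = cuts)
print(pieces)
```

Summing such pieces over all individuals within each combination of covariate values and age band gives an occurrence/exposure table of the same shape as `om`.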

Now we can run `tpchreg` as before:

```
fit3 <- tpchreg(oe(event, exposure) ~ sex + civ +
birthyear, time = age, data = om)
summary(fit3)
```

```
## Covariate Mean Coef Rel.Risk S.E. LR p
## sex 0.000
## male 0.406 0 1 (reference)
## female 0.594 -0.245 0.783 0.047
## civ 0.000
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.397 0.672 0.081
## widow 0.390 -0.258 0.773 0.079
## birthyear 2.114 -0.006 0.994 0.004 0.150
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -7265.5
## LR test statistic 43.45
## Degrees of freedom 4
## Overall p-value 8.34423e-09
##
## Restricted mean survival: 12.65 in (60, 100]
```

And the hazards graphs are shown in Figure 2.1.

```
oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit3, fn = "haz", log = "y", main = "log(hazards)",
xlab = "Age", ylab = "log(Deaths / Year)", col = "blue")
plot(fit3, fn = "haz", log = "", main = "hazards",
xlab = "Age", ylab = "Deaths / Year", col = "blue")
```

The plots of the *survivor* and *cumulative hazards* functions are “smoother”,
see Figure 2.2.

```
oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit3, fn = "cum", log = "y", main = "Cum. hazards",
xlab = "Age", col = "blue")
plot(fit3, fn = "sur", log = "", main = "Survivor function",
xlab = "Age", col = "blue")
```

`par(oldpar)`

For comparison, we run Cox regression on the original data.

```
fit4 <- coxreg(Surv(enter, exit, event) ~ sex + civ + I(birthdate - 1800),
data = oldmort)
summary(fit4)
```

```
## Covariate Mean Coef Rel.Risk S.E. LR p
## sex 0.000
## male 0.406 0 1 (reference)
## female 0.594 -0.244 0.783 0.047
## civ 0.000
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.397 0.673 0.081
## widow 0.390 -0.259 0.772 0.079
## I(birthdate - 18 2.602 -0.005 0.995 0.004 0.212
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -13557
## LR test statistic 42.79
## Degrees of freedom 4
## Overall p-value 1.14378e-08
```

And the graphs, see Figure 2.3.

```
oldpar <- par(mfrow = c(1, 2), lwd = 1.5, las = 1)
plot(fit4, main = "Cumulative hazards", xlab = "Age",
col = "blue")
plot(fit4, main = "Survivor function", xlab = "Age",
fn = "surv", col = "blue")
```