#load libraries
#load strm
library(strm)
library(rmarkdown)
#load package dependencies for the vignette
library(tidyr)
library(dplyr)
library(rgdal)
library(spdep)
library(rgeos)
library(sf)
library(splm)
library(knitr)
library(Ecdat)
strm
is an R
package that fits spatio-temporal regression model based on Chi & Zhu Spatial Regression Models for the Social Sciences (2019). The approach here fits a spatial error model while incorporating a temporally lagged response variable and temporally lagged explanatory variables.
This package builds on the errorsarlm()
function from the spatialreg
package.
This package is still under development. Please report bugs or constructive tips to issues here.
Fit a spatial error model but include a temporally lagged response variable, temporally lagged explanatory variable, and temporally and spatially lagged explanatory variables.
Spatial error model at time \(t\):
\[Y_t = X_t \beta + u_t, u_t=\rho Wu_t + \varepsilon\]
Adding in a temporally lagged response variable and temporally lagged explanatory variable:
\[Y_t = X_t\beta_t + \beta_2Y_{t-1} + X_{t-1}\beta_3 + u_t, u_t=\rho Wu_t + \varepsilon\]
This becomes:
\[Y_t = Y_{t-1} \beta_2 + \rho W Y_t - \rho\beta_2 WY_{t-1} + X_t \beta_1 + X_{t-1}\beta_3 - W X_t \rho \beta_1 -WX_{t-1} \rho \beta_3 + \varepsilon\]
(This example has been adapted from the splm
package vignette.)
The first data set we will use is the Produc data set from the Ecdat
package. This data set is a panel of United States production data from 1970-1986. There are 816 observations by county in the United States. The variables we will use are:
year
: yearstate
: state in the United Statesgsp
:pcap
: private capital stock.pc
: public capital.emp
: labor input measured by employment in non-agricultural payrolls.unemp
: state unemployment rate.We also load usaww from the splm
package, a spatial weights matrix of the 48 continental United States based on second-order neighbors.
#load data example
data("Produc")
data("usaww")
#explore the data structures
str(Produc)
## 'data.frame': 816 obs. of 10 variables:
## $ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
## $ pcap : num 15033 15502 15972 16406 16763 ...
## $ hwy : num 7326 7526 7765 7908 8026 ...
## $ water: num 1656 1721 1765 1742 1735 ...
## $ util : num 6051 6255 6442 6756 7002 ...
## $ pc : num 35794 37300 38670 40084 42057 ...
## $ gsp : int 28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
## $ emp : num 1010 1022 1072 1136 1170 ...
## $ unemp: num 4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
str(usaww)
## num [1:48, 1:48] 0 0 0 0 0 0 0 0.5 0.2 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:48] "ALABAMA" "ARIZONA" "ARKANSAS" "CALIFORNIA" ...
## ..$ : chr [1:48] "ALABAMA" "ARIZONA" "ARKANSAS" "CALIFORNIA" ...
We next convert the spatial weights matrix to a list of spatial weights using the mat2listw()
function from the spdep
package and check the class()
of the object:
#create list of spatial weights
usalw <- mat2listw(usaww)
class(usalw)
## [1] "listw" "nb"
Next, we perform the spatio-temporal regression model. We first create a formula using as.formula()
. For this model, our response is log(gsp)
and our explanatory variables are log(pcap), log(pc), log(emp), unemp
.
For this analysis, we will limit the data to only 1970 and 1971 - two time periods.
We use the strm()
function from the strm
package. The first argument is the model formula. Given that the data is in long format, strm
will create the lagged values for you as well as the lagged response in the right-hand side of the model. The next argument is id
, which we set to “state” as each observation is taken at the state level. We then specify the name of the data set (data=Produc
), the list of spatial weights (listw = usalw
), time=2
, we tell strm
that the data is in long format by setting wide=FALSE
, and lastly we pass an argument to filter observations where year
is equal to either 1970 or 1971 (2 time periods).
formula1 <-as.formula(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp)
out <- strm(formula1, id="state", data=Produc, listw = usalw, time=2,wide=FALSE,
filter_options="year==1970 | year==1971")
## The spatio-temporal regression model fitted:
## loggsp ~ logpcap + logpc + logemp + unemp + logpcap.Tlag1 + logpc.Tlag1 + logemp.Tlag1 + unemp.Tlag1 + loggsp.Tlag1
summary(out)
##
## Call:spatialreg::errorsarlm(formula = modframe, listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03855189 -0.00739710 -0.00061512 0.00585547 0.05119821
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9946e-01 7.6150e-02 2.6193 0.0088117
## logpcap -1.8587e-01 2.3251e-01 -0.7994 0.4240508
## logpc 1.4288e-05 2.5371e-01 0.0001 0.9999551
## logemp 8.2984e-01 1.9872e-01 4.1758 2.969e-05
## unemp 3.3262e-03 6.0468e-03 0.5501 0.5822638
## logpcap.Tlag1 1.9192e-01 2.2597e-01 0.8493 0.3956987
## logpc.Tlag1 1.7700e-02 2.5045e-01 0.0707 0.9436582
## logemp.Tlag1 -7.5794e-01 1.9659e-01 -3.8555 0.0001155
## unemp.Tlag1 -4.5510e-03 6.8513e-03 -0.6643 0.5065302
## loggsp.Tlag1 9.1257e-01 2.5405e-02 35.9204 < 2.2e-16
##
## Lambda: -0.0093887, LR test value: 0.00085418, p-value: 0.97668
## Asymptotic standard error: 0.20491
## z-value: -0.045819, p-value: 0.96345
## Wald statistic: 0.0020994, p-value: 0.96345
##
## Log likelihood: 128.249 for error model
## ML residual variance (sigma squared): 0.00027975, (sigma: 0.016726)
## Number of observations: 48
## Number of parameters estimated: 12
## AIC: -232.5, (AIC for lm: -234.5)
From the model summary output, we see that strm
has included lagged variables (*.Tlag1) for each of the explanatory variables initially specified (log(pcap), log(pc), log(emp), unemp
), as well as a lagged variable for the response (log(gsp)
).
As the number of time periods in the data increase, so do the number of lags. If we now use the same model, but instead include an additional year (1972) and set time=3
.
formula2 <-as.formula(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp)
out <- strm(formula2, id="state", data=Produc, listw= usalw, time=3,
wide=FALSE,filter_options="year==1970 | year==1971 | year ==1972")
## The spatio-temporal regression model fitted:
## loggsp ~ logpcap + logpc + logemp + unemp + logpcap.Tlag1 + logpc.Tlag1 + logemp.Tlag1 + unemp.Tlag1 + logpcap.Tlag2 + logpc.Tlag2 + logemp.Tlag2 + unemp.Tlag2 + loggsp.Tlag1 + loggsp.Tlag2
After executing the strm()
model, we apply the summary()
function to the results object, out
.
summary(out)
##
## Call:spatialreg::errorsarlm(formula = modframe, listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02179096 -0.00525540 0.00041786 0.00461342 0.01653898
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2516030 0.0462216 5.4434 5.227e-08
## logpcap 0.0980899 0.1760534 0.5572 0.57742
## logpc 0.2284076 0.3163419 0.7220 0.47028
## logemp 0.2484005 0.1105107 2.2478 0.02459
## unemp -0.0056502 0.0035709 -1.5823 0.11359
## logpcap.Tlag1 -0.2784667 0.3731087 -0.7463 0.45546
## logpc.Tlag1 -0.4567652 0.3651608 -1.2509 0.21099
## logemp.Tlag1 0.5172162 0.2297628 2.2511 0.02438
## unemp.Tlag1 0.0099326 0.0042575 2.3330 0.01965
## logpcap.Tlag2 0.1929118 0.2093529 0.9215 0.35681
## logpc.Tlag2 0.2527881 0.1388406 1.8207 0.06865
## logemp.Tlag2 -0.6862555 0.1397530 -4.9105 9.085e-07
## unemp.Tlag2 -0.0037672 0.0033893 -1.1115 0.26636
## loggsp.Tlag1 0.7593464 0.0720738 10.5357 < 2.2e-16
## loggsp.Tlag2 0.1317338 0.0685188 1.9226 0.05453
##
## Lambda: -0.77659, LR test value: 4.9975, p-value: 0.025385
## Asymptotic standard error: 0.18694
## z-value: -4.1542, p-value: 3.2636e-05
## Wald statistic: 17.258, p-value: 3.2636e-05
##
## Log likelihood: 162.5098 for error model
## ML residual variance (sigma squared): 5.8561e-05, (sigma: 0.0076525)
## Number of observations: 48
## Number of parameters estimated: 17
## AIC: -291.02, (AIC for lm: -288.02)
From the model summary output, we see that strm
has now included two lagged variables (*.Tlag1 and *.Tlag2) for each of the explanatory variables initially specified, as well as two lags for the response variable.
We use the example from Spatial Regression Models for the Social Sciences Chi & Zhu (2019). The example uses population growth data from 2000 to 2010. Data are at the minor civil division (MCD) level in Wisconsin. There are two years of data: 2000 and 2010. The variables we will use are:
LNP1000
: population growth from 2000 to 2010.LNP0090
: population growth from 1990 to 2000.POLD00
: percentage of the old population (age sixty-five and older) in 2000.POLD90
: percentage of the old population (age sixty-five and older) in 1990.We load the sptdmg3
.RData file that comes with the strm
package using data(sptdmg3)
and explore its class()
and names()
:
data(sptdmg3)
class(sptdmg3)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
names(sptdmg3)
## [1] "STFID" "X_COORD" "Y_COORD" "LNP1000" "LNP0090" "POLD00" "POLD90"
First, fit a standard linear regression model with population growth from 2000 to 2010 as the response variable and population growth from 1990 to 2000, the old population in 2000, and the old population in 1990 as the explanatory variables.
formula2 <- as.formula(LNP1000 ~ LNP0090 + POLD00 + POLD90)
m2 <- lm(formula2, data = sptdmg3)
summary(m2)
##
## Call:
## lm(formula = formula2, data = sptdmg3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69307 -0.06942 -0.01170 0.05636 1.08243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.109237 0.009174 11.907 < 2e-16 ***
## LNP0090 0.061267 0.020858 2.937 0.003352 **
## POLD00 -0.335543 0.089100 -3.766 0.000171 ***
## POLD90 -0.214722 0.082246 -2.611 0.009109 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1348 on 1833 degrees of freedom
## Multiple R-squared: 0.05626, Adjusted R-squared: 0.05471
## F-statistic: 36.42 on 3 and 1833 DF, p-value: < 2.2e-16
We first convert the SpatialPolygonsDataFrame
to a simple features or sf
object. We use the st_as_sf()
function from the sf
package.
#convert to simple features
sptdmg3_sf <- sf::st_as_sf(sptdmg3)
class(sptdmg3_sf)
## [1] "sf" "data.frame"
Next, we create a spatial weights matrix based on 4-nearest neighbors. We extract the coordinates from the MCD centroids using st_centroid()
to extract the centroid of each MCD and st_coordinates()
to extract the coordinates. We then use knearneigh()
from the spdep
package and specify k=4
to create a matrix of the 4 nearest neighbors; knn2nb()
creates a neighborhood structure object and nb2listw()
creates a spatial weights list, where we set the style to be row-standardized weights and the zero policy to be TRUE
. Finally, we plot the MCD and their neighbors.
#knn4
coords <-st_coordinates(st_centroid(sptdmg3_sf)) #extract centroid of each minor civil division
col.knn <- knearneigh(coords, k=4)
nbknn <- knn2nb(col.knn, row.names = sptdmg3$STFID)
listwk <- nb2listw(nbknn, style="W")
plot(sptdmg3, main="Wisconsin: 4 Nearest-Neighbors"); plot(nbknn, coords,col="forestgreen", add=TRUE)
We are now ready to perform the spatio-temporal regression model. In this example, the data is already in WIDE format, meaning that every variable-year is a column in the dataset. Because of this, we directly include the lagged variables for all of the covariates as well as the lagged variable for the response, just as we had done in the simple linear model above.
#since this data is in wide format, include temporal lag for both explanatory and response variable manually
res <- strm(formula2, id="STFID", data=sptdmg3_sf, listw = listwk, time=2,wide=TRUE)
## The spatio-temporal regression model fitted:
## LNP1000 ~ LNP0090 + POLD00 + POLD90
summary(res)
##
## Call:spatialreg::errorsarlm(formula = modframe, listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.667816 -0.068463 -0.010631 0.055872 1.053517
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0856035 0.0099058 8.6418 < 2e-16
## LNP0090 0.0312120 0.0212166 1.4711 0.14126
## POLD00 -0.1992946 0.0863089 -2.3091 0.02094
## POLD90 -0.1665145 0.0789090 -2.1102 0.03484
##
## Lambda: 0.28547, LR test value: 78.083, p-value: < 2.22e-16
## Asymptotic standard error: 0.030315
## z-value: 9.4169, p-value: < 2.22e-16
## Wald statistic: 88.678, p-value: < 2.22e-16
##
## Log likelihood: 1115.203 for error model
## ML residual variance (sigma squared): 0.017069, (sigma: 0.13065)
## Number of observations: 1837
## Number of parameters estimated: 6
## AIC: -2218.4, (AIC for lm: -2142.3)