The `WeMix` package aims to fit a linear mixed model where units are nested inside groups, which may themselves be nested in groups, a model specified by Rabe-Hesketh and Skrondal (2006) and Rabe-Hesketh, Skrondal, and Pickles (2002) for the GLLAMM software. The advantage of fitting the model in nested levels is that survey sampling weights can be incorporated; the pseudo-likelihood,\footnote{For inverse probability of selection survey data, the likelihood is an estimated population likelihood and not an observed likelihood; the typical phrase for this is \textit{pseudo-likelihood}, which applies to every likelihood in this document.} at the top level, is an integral that sums across the top-level groups, weighting each group or unit (in a two-level model) according to sampling weights.

At the highest level, the likelihood (\(\mathcal{L}\)) is a function of the parameters for the random-effect covariance matrix (\(\bm{\theta}\)), the fixed-effect estimates (\(\bm{\beta}\)), and the outcomes \(\bm{y}\). This equation relates the overall likelihood [\(\mathcal{L}(\bm{\theta}, \bm{\beta}; \bm{y})\)] to the integral of the product of the likelihoods of the top-level units [\(\mathcal{L}^{(L)}_{j}(\bm{\theta}, \bm{\beta} ; \bm{y}| \bm{u}^{(L)})\), indexed by \(j\)]; the top-level unit likelihoods have a superscript \((L)\) to indicate they are for the $L$th (top) level. Here we omit reference to the fixed-effect design matrix \(\bm{X}\) and the random-effect design matrix \(\bm{Z}\), but the likelihood is conditional on those terms as well.
\begin{align}
\mathcal{L}(\bm{\theta}, \bm{\beta}; \bm{y}) &= \int g^{(L)}(\bm{u}^{(L)};\bm{\theta}^{(L)})\prod_{j} \bigg[\mathcal{L}^{(L)}_{j}(\bm{\theta}, \bm{\beta} ; \bm{y}| \bm{u}^{(L)})\bigg]^{w^{(L)}_{j}} d\bm{u}^{(L)}
\end{align}
where the \(\bm{u}\) terms are the random effects, which are marginalized by integrating over them;\footnote{The \(\bm{u}\) terms are integrated out and so they are not, strictly speaking, estimated. Because of this, we do not call them empirical Bayes estimates. However, the `WeMix` package does provide a value for them and, because the variance of the distribution that acts as a prior is estimated, these values could reasonably be regarded as an empirical Bayes estimate of the random effects.} the \(g\) function plays a role similar to a prior, but has its variance (or covariance matrix) fit using covariance parameter vector \(\bm{\theta}\), while \(j\) represents the indexes of all the top-level units, which have their likelihood raised to the power of their weight \(w^{(L)}_{j}\). Because \(\bm{u}\) may be a vector—for example, if there is a random slope and intercept—the covariance matrix between the \(\bm{u}\) terms (\(\bm{\Sigma}^{(L)}\)) may be diagonal (no covariance) or dense. In any case, the covariance matrix is parameterized by a vector of values \(\bm{\theta}\); at the $L$th level, the relevant elements are denoted by \(\bm{\theta}^{(L)}\).
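The role of the weights in this pseudo-likelihood can be checked numerically. Below is a minimal sketch (in Python rather than `WeMix`'s R; the data, unit weights, and parameter values are invented for illustration) that evaluates the weighted likelihood of a single group in a two-level random-intercept model with Gauss-Hermite quadrature and confirms it against a brute-force grid integration:

```python
import numpy as np

def group_loglik(y, w, mu, tau, sigma, n_quad=41):
    """ln of the weighted group likelihood
    integral of N(u; 0, tau^2) * prod_i N(y_i; mu + u, sigma^2)^(w_i) du,
    approximated with Gauss-Hermite quadrature."""
    x, wq = np.polynomial.hermite.hermgauss(n_quad)
    u = np.sqrt(2.0) * tau * x                # quadrature nodes for N(0, tau^2)
    resid = y[:, None] - mu - u[None, :]      # (n_i, n_quad) residuals
    # weighted sum of log unit densities at each node
    ll = w @ (-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))
    m = ll.max()                              # log-sum-exp for stability
    return m + np.log(np.sum(wq / np.sqrt(np.pi) * np.exp(ll - m)))

rng = np.random.default_rng(7)
y = rng.normal(1.0, 1.0, size=5)
w = np.array([1.0, 2.0, 1.5, 1.0, 0.5])       # hypothetical unit weights
lq = group_loglik(y, w, mu=1.0, tau=0.8, sigma=1.0)

# brute-force check of the same integral on a fine grid
u = np.linspace(-8.0, 8.0, 20001)
prior = np.exp(-u**2 / (2 * 0.8**2)) / np.sqrt(2 * np.pi * 0.8**2)
ll = w @ (-0.5 * np.log(2 * np.pi) - (y[:, None] - 1.0 - u[None, :])**2 / 2)
f = prior * np.exp(ll)
lb = np.log(np.sum((f[1:] + f[:-1]) / 2 * np.diff(u)))  # trapezoid rule
```

In the full pseudo-likelihood, each such group contribution would itself be raised to the group weight \(w^{(L)}_j\) before the top-level product is taken.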

The conditional likelihood at each level from \(L\) to two—those above the first, or lowest level—is then recursively defined, for the $j$th unit, at level \(l\) (\(\mathcal{L}^{(l)}_j\); Rabe-Hesketh et al., 2002, eq. 3) as:
\begin{align}
\mathcal{L}^{(l)}_j(\bm{\theta}, \bm{\beta}; \bm{y}| \bm{U}^{(l+1)}) &= \int g^{(l)}(\bm{u}^{(l)};\bm{\theta}^{(l)})\prod_{j'} \bigg[\mathcal{L}^{(l-1)}_{j'}(\bm{\theta}, \bm{\beta} ; \bm{y}| \bm{U}^{(l)})\bigg]^{w^{(l-1)}_{j'}} d\bm{u}^{(l)} \quad l=2, \ldots, L-1
\end{align}
where the subscript \(j'\) that the product is indexed over indicates that the likelihood \(\mathcal{L}^{(l-1)}_{j'}(\cdot)\) is for the units of level \(l-1\) nested in unit \(j\), and \(\bm{U}^{(l)}\) is all of the random effects at or above level \(l\), so that \(\bm{U}^{(l)}\) includes \(\left\{\bm{u}^{(l)}, \bm{u}^{(l+1)}, \dots, \bm{u}^{(L)}\right\}\). This equation reflects the nested nature of the data; a group (e.g., school), annotated as \(j\), has an individual likelihood that is the product of the likelihoods of the units or groups (e.g., students or classrooms, annotated as \(j'\)) nested inside of it.

In the case of a Gaussian residual, the likelihood (\(\mathcal{L}^{(1)}\)) at the individual unit level is given by the equation
\begin{align}
\mathcal{L}_i^{(1)}(\bm{\theta}, \bm{\beta};\bm{y},\bm{U}^{(2)}) &= \frac{1}{\sigma} \phi \left( \frac{\hat{e}_i^{(1)} }{\sigma} \right)
\end{align}
where the subscript \(i\) is used to indicate that this is the likelihood of the \(i^{th}\) individual, the superscript \((1)\) on \(\hat{\bm{e}}_i^{(1)}\) is used to indicate that it is an unpenalized residual—the penalized residual will be introduced in the next section—\(\phi(\cdot)\) is the standard normal density function, \(\sigma\) is the residual standard deviation (a scalar), and the residuals vector \(\hat{\bm{e}}^{(1)}\) collects the residuals \(\hat{\bm{e}}_i^{(1)}\) for each individual and is given, in vector form, by
\begin{align}
\hat{\bm{e}}^{(1)} &= \bm{y} - \bm{X}\hat{\bm{\beta}} - \sum_{l=2}^{L} \bm{Z}^{(l)}\hat{\bm{u}}^{(l)}
\end{align}

This document describes how `WeMix` uses symbolic integration that relies on a mix of both Bates and Pinheiro (1998) and the `lme4` package (Bates, Maechler, Bolker, and Walker, 2015), obviating the need for numerical integration in a weighted, nested model.

The basic model in `lme4` is of the form (Bates et al., 2015, eqs. 2 and 3):
\begin{align}
\left( \bm{y}|\bm{U}=\bm{u}\right) &\sim N(\bm{X\beta} + \bm{Zu}, \sigma^{2} \bm{W}^{-1} ) \label{eq:lme4A}\\
\bm{U} &\sim N(0, \bm{\Sigma}(\bm{\theta})) \label{eq:lme4B}
\end{align}
where \(N(\cdot, \cdot)\) is the multivariate normal distribution, and \(\bm{\Sigma}(\bm{\theta})\) is a positive semidefinite matrix—allowing, for example, a variance parameter to be exactly zero—parameterized by a vector of parameters \(\bm{\theta}\).

The likelihood is maximized by integrating (symbolically) over \(\bm{U}\) (Bates et al., 2015, eqs. 20, 21, 22, and 24):
\begin{align}
f_{\bm{y}}(\bm{y};\bm{\beta},\bm{\theta}) &= \int f_{\bm{y}|\bm{U}=\bm{u}}\left(\bm{y}; \bm{\beta}, \bm{\theta}, \bm{u} \right) \cdot f_{\bm{U}}(\bm{u})\, d\bm{u} \label{eq:lme4C}
\end{align}

where the \(f_{\bm{U}}\) term is analogous to \(g(\bm{u}^{(l)})\) in the Rabe-Hesketh and Skrondal (2006) formulation but is intentionally represented in a non-nested structure to allow for crossed terms.

In this document we show how `WeMix` uses a derivation similar to that in Bates et al. (2015) and Bates and Pinheiro (1998) to fit a sample-weighted mixed model, avoiding the integration necessary in GLLAMM. Comparing `WeMix` to `lmer`, the latter is more general in the sense of allowing crossed terms, while `WeMix` allows for sampling weights; unweighted, the models should agree.

Generally, the Rabe-Hesketh and Skrondal (2006) model can be rewritten in a form very similar to `lme4` as
\begin{align}
\left( \bm{y}|\bm{U}=\bm{u}\right) &\sim T_1^{\omega^{(1)}} \label{eq:WeMixA}\\
\bm{U} &\sim T_2^{\omega^{(2)}} * T_3^{\omega^{(3)}} * \cdots * T_L^{\omega^{(L)}} \label{eq:WeMixB}
\end{align}
where \(T^k\) represents the convolution of \(k\) instances of \(T\) and \(*\) represents the convolution of two distributions, with a likelihood that is their product; \(\omega^{(l)}\) are the weights assigned to units (\(l=1\)) or groups (\(l>1\)) that are ideally\footnote{This weight is only ideally this because of how weights are adjusted for total nonresponse (see Gelman 2007).} the inverse (unconditional) probability of selecting the unit or group; and the $T$s have distribution
\begin{align}
T_1 &\sim N(\bm{X\beta} + \bm{Zu}, \sigma^{2} \bm{I} ) \label{eq:WeMixA2} \\
T_l &\sim N(0, \bm{\Sigma}_{ll}) \quad l=2, \cdots, L \label{eq:WeMixB2}
\end{align}
where \(\bm{\Sigma}\) is block diagonal, disallowing nonzero prior correlations across levels, with the $l$th block being written \(\bm{\Sigma}_{ll}\).

The next section follows Bates et al. (2015) and shows the `lme4` and `WeMix` model without weights—the only case where they are identical. The subsequent section then shows the adaptations to the likelihood for the sample weighted case. Notably, `lme4` has unit-level weights, but they are precision weights and not level-1 sample weights, so even when only the level-1 weights are nontrivial, the models disagree. The final section describes the application of the profile likelihood (again, following `lme4`) used to estimate linear models in `WeMix`.

\subsection{Unweighted Case}

Following the logic of Bates et al. (2015), the unweighted (unit weight) case simplifies eqs. \ref{eq:WeMixA} and \ref{eq:WeMixB} to
\begin{align}
\left( \bm{y}|\bm{U}=\bm{u}\right) &\sim T_1 \label{eq:WeMixAnoW}\\
\bm{U} &\sim T_2 * T_3 * \cdots * T_L \label{eq:WeMixBnoW}
\end{align}
where eqs. \ref{eq:WeMixA2} and \ref{eq:WeMixB2} are simplified to
\begin{align}
T_1 &\sim N(\bm{X\beta} + \bm{Zu}, \sigma ^{2} \bm{I} ) \label{eq:WeMixA2noW} \\
T_l &\sim N(0, \bm{\Sigma}_{ll}) \quad l=2, \cdots, L \label{eq:WeMixB2noW}
\end{align}
The random-effect vector \(\bm{U}\) is rewritten as the product of a square-root covariance matrix \(\bm{\Lambda}\) and an \(iid\) normal vector \(\bm{\upsilon}\):
\begin{align}
\bm{U}&=\bm{\Lambda} \bm{\upsilon} \label{eq:Uf} \\
\bm{\upsilon} &\sim N(0, \sigma^{2} \bm{I})
\end{align}
so that \(\bm{\Lambda}\) is a square root of the random-effect covariance matrix in the sense that
\begin{align}
\bm{\Sigma}(\bm{\theta}) &= \sigma^{2} \bm{\Lambda}(\bm{\theta}) \bm{\Lambda}(\bm{\theta})^{T} \label{eq:root}
\end{align}

Then, the residual (penalized) sum of squares is (Bates et al., 2015, eqs. 14 and 15)
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \lvert\lvert \bm{y}-\bm{X\beta} - \bm{Z\Lambda}(\bm{\theta})\bm{\upsilon}\rvert\rvert^{2} + ||\bm{\upsilon}||^{2}
\end{align}
where \(||\bm{v}||^2\) is the sum of squares for the vector \(\bm{v}\); as an equation, \(||\bm{v}||^2=\sum v_i^2\).

Notice that the penalty function (\(||\bm{\upsilon}||^2\)) and the residual are both sums of squares of independent, identically distributed normal variables with variance \(\sigma^2\); this allows both the regression coefficients and the random effects to be estimated in one least squares problem in which the pseudo-data outcome for the random effects is a vector of zeros (Bates et al., 2015, eq. 16). This rewrites \(r^2\) as a single sum of squares, appending a vector of zeros below \(\bm{y}\)—the *pseudo-data*:
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] -\left[ \begin{matrix} \bm{Z\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{\upsilon} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} \label{eq:WeMixR0}
\end{align}
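The equivalence of this stacked system and the explicit penalized problem is easy to confirm numerically. A minimal sketch (with invented \(\bm{X}\), \(\bm{Z}\), and \(\bm{\Lambda}\); not `WeMix` code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 12, 2, 3                        # observations, fixed effects, random effects
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
Lam = np.diag([0.9, 0.7, 1.1])            # an example Lambda(theta)
y = rng.normal(size=n)

# stacked pseudo-data system: [y; 0] ~ [Z Lam, X; I, 0] [upsilon; beta]
A = np.block([[Z @ Lam, X], [np.eye(q), np.zeros((q, p))]])
b = np.concatenate([y, np.zeros(q)])
sol, *_ = np.linalg.lstsq(A, b, rcond=None)

# the same minimizer from the normal equations of the penalized problem:
# minimize ||y - Z Lam ups - X beta||^2 + ||ups||^2
C = np.hstack([Z @ Lam, X])
penalty = np.diag(np.r_[np.ones(q), np.zeros(p)])
sol2 = np.linalg.solve(C.T @ C + penalty, C.T @ y)
```

Appending the zero pseudo-data rows is exactly what adds the \(||\bm{\upsilon}||^2\) penalty to the residual sum of squares.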
Unlike Bates et al. (2015, eq. 16), we proceed by taking a QR decomposition (Trefethen and Bau, 1997) of
\begin{align}
\bm{A} &\equiv \left[ \begin{matrix} \bm{Z\Lambda}(\bm{\theta}) & \bm{X} \\ \bm{I} & \bm{0} \end{matrix} \right] \label{eq:uwA}
\end{align}
Plugging eq. \ref{eq:uwA} into eq. \ref{eq:WeMixR0} and finding the least squares solution (denoted with a hat: \(\hat{\bm{\upsilon}}\) and \(\hat{\bm{\beta}}\))
\begin{align}
\bm{A} \left[ \begin{matrix} \hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right]&=\left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right]
\end{align}
using the QR decomposition of \(\bm{A}\), which rewrites \(\bm{A}=\bm{QR}\) for an orthogonal matrix \(\bm{Q}\) (so \(\bm{Q}^T\bm{Q}=\bm{I}\)) and an upper triangular matrix \(\bm{R}\),
\begin{align}
\bm{QR} \left[ \begin{matrix} \hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right]&=\left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right]
\end{align}
where \(\bm{R}\) can be written in block form as
\begin{align}
\bm{R} &= \left[ \begin{matrix} \bm{R}_{11} & \bm{R}_{12} \\ \bm{0} & \bm{R}_{22} \end{matrix} \right] \label{eq:Rblock}
\end{align}
in which \(\bm{R}_{11}\) is also upper triangular, square, and conformable with \(\bm{\upsilon}\), and \(\bm{R}_{22}\) is similarly upper triangular, square, and conformable with \(\bm{\beta}\), while \(\bm{R}_{12}\) is rectangular and (potentially) dense.
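This block structure can be verified directly with random matrices; a short sketch (illustrative only, \(\bm{\Lambda}=\bm{I}\) for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 10, 2, 3
X, Z = rng.normal(size=(n, p)), rng.normal(size=(n, q))
# the stacked matrix A = [Z Lam, X; I, 0] with Lambda = I
A = np.block([[Z, X], [np.eye(q), np.zeros((q, p))]])

Q, R = np.linalg.qr(A)                    # Q: (n+q) x (q+p), R: (q+p) x (q+p)
R11, R12, R22 = R[:q, :q], R[:q, q:], R[q:, q:]
```

Because \(\bm{R}\) is upper triangular overall, its lower-left block is zero and both diagonal blocks are themselves upper triangular, as claimed.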

Rewriting \(r^2\) as a deviation from the least squares solution, eq. \ref{eq:WeMixR0} becomes
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} \\
&= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} + \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} + \hat{\bm{\beta}} \end{matrix} \right] \right|\right|^{2} \\
&= \left| \left| \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right|\right|^{2}
\end{align}
Using the identity that, for any vector \(\bm{v}\), the sum of squares is the inner product \(||\bm{v}||^2 = \bm{v}^T\bm{v}\),

\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon})&= \left[ \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]^{T} \left[ \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right] - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] \label{eq:finalT}
\end{align}
Defining \(\hat{\bm{e}}\) to be the penalized least squares residual,
\begin{align}
\hat{\bm{e}} \equiv \left[ \begin{matrix} \bm{y} \\ \bm{0} \end{matrix} \right] - \bm{A} \left[ \begin{matrix}\hat{\bm{\upsilon}} \\ \hat{\bm{\beta}} \end{matrix} \right]
\end{align}
then eq. \ref{eq:finalT} can be rewritten
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left[ \hat{\bm{e}} - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]^{T} \left[ \hat{\bm{e}} - \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] \\
&= \hat{\bm{e}}^{T} \hat{\bm{e}} - 2 \hat{\bm{e}}^{T} \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] + \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]^{T} \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right] \label{eq:quad}
\end{align}
Since \(\hat{\bm{e}}\), the unique residual of the least squares problem, is orthogonal to the span of \(\bm{A}\), while \(\bm{Ax}\) is in the span of \(\bm{A}\) for any vector \(\bm{x}\), it follows that \(\hat{\bm{e}}^T \bm{Ax}=0\) for any \(\bm{x}\). Thus, \(\hat{\bm{e}}^T \left[ \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right]=0\) and eq. \ref{eq:quad} becomes
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \hat{\bm{e}}^{T} \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^{T} \bm{A}^{T} \bm{A} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \\
&= \hat{\bm{e}}^{T} \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^{T} \left[ \bm{QR} \right]^{T} \left[\bm{QR}\right] \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \\
&= \hat{\bm{e}}^{T} \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^{T} \bm{R}^{T} \bm{Q}^{T} \bm{Q} \bm{R} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]
\end{align}
Then, because \(\bm{Q}\) is orthonormal, \(\bm{Q}^T=\bm{Q}^{-1}\) and
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \hat{\bm{e}}^{T} \hat{\bm{e}} + \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right]^{T} \bm{R}^{T} \bm{R} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \\
&= \hat{\bm{e}}^{T} \hat{\bm{e}} + \left|\left| \bm{R} \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right| \right|^{2} \label{eq:r2F}
\end{align}
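The decomposition in eq. \ref{eq:r2F} holds for any least squares problem and is easy to confirm numerically (an illustrative sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(9, 4))               # stands in for the stacked matrix A
b = rng.normal(size=9)                    # stands in for [y; 0]
xhat, *_ = np.linalg.lstsq(A, b, rcond=None)
ehat = b - A @ xhat                       # least squares residual
_, R = np.linalg.qr(A)

x = xhat + rng.normal(size=4)             # any other parameter value
r2 = np.sum((b - A @ x) ** 2)
r2_decomp = ehat @ ehat + np.sum((R @ (x - xhat)) ** 2)
```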
Notice that \(\hat{\bm{e}}^T \hat{\bm{e}}\) is the value of \(r^2\) evaluated at the least squares solution (denoted by adding hats to \(\bm{\beta}\) and \(\bm{\upsilon}\)), so that
\begin{align}
r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) &= \hat{\bm{e}}^{T} \hat{\bm{e}} \label{eq:r2tau}
\end{align}
Plugging eqs. \ref{eq:r2tau} and \ref{eq:Rblock} into eq. \ref{eq:r2F} (Bates et al., 2015, eq. 19),
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left|\left| \left[ \begin{matrix} \bm{R}_{11} & \bm{R}_{12} \\ \bm{0} & \bm{R}_{22} \end{matrix}\right] \left[ \begin{matrix} \bm{\upsilon} - \hat{\bm{\upsilon}} \\ \bm{\beta} - \hat{\bm{\beta}} \end{matrix} \right] \right| \right| ^{2} \\
&= r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left|\left| \bm{R}_{11} (\bm{\upsilon} - \hat{\bm{\upsilon}}) + \bm{R}_{12} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^{2} + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^{2}
\end{align}

From the joint distribution of \(\bm{y}\) and \(\bm{\upsilon}\) (Bates et al., 2015, eqs. 20, 21, and 22),
\begin{align}
f_{\bm{y},\bm{\upsilon}}(\bm{y},\bm{\upsilon}) &= \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{n_x}{2}}} \exp\left( -\frac{||\bm{y} - \bm{X\beta} - \bm{Z\Lambda}(\bm{\theta})\bm{\upsilon}||^{2}}{2\sigma^{2}} \right) \cdot \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{n_u}{2}}} \exp\left( -\frac{||\bm{\upsilon}||^{2}}{2\sigma^{2}} \right)
\end{align}
where \(n_x\) is the number of observations and \(n_u\) is the total number of random effects. The likelihood of \(\bm{y}\) is obtained by integrating this joint density over \(\bm{\upsilon}\); using eq. \ref{eq:r2F}, the integral over the term quadratic in \(\bm{\upsilon}\) is Gaussian, and the change of variables\footnote{See ``Substitution for Multiple Variables'' in ``Integration by Substitution'' (Wikipedia, n.d.).} introduces the inverse determinant of \(\bm{R}_{11}\). This makes a deviance function, defined relative to the log-likelihood (\(\ell\)), so that \(D(\cdot) = -2\ell(\cdot)\),
\begin{align}
D \left( \bm{\beta}, \bm{\theta}, \sigma^{2}; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x\, {\rm ln} \left(2\pi\sigma^{2}\right) + \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right| \right|^{2}}{\sigma^{2}}
\end{align}

Using the profile likelihood, the deviance is minimized when \(\bm{\beta}=\hat{\bm{\beta}}\) because \(\bm{\beta}\) only appears inside of a sum of squares that can be minimized (set to zero) using \(\bm{\beta}=\hat{\bm{\beta}}\) (Bates et al., 2015, eqs. 30 and 31). The profile deviance then becomes
\begin{align}
D \left( \bm{\theta}, \sigma^{2}; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x\, {\rm ln} \left(2\pi\sigma^{2}\right) + \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{\sigma^{2}}
\end{align}
Similarly, the value of \(\sigma^2\) can be found by taking the derivative of the profile deviance with respect to \(\sigma^2\) and setting it equal to zero. This yields
\begin{align}
\widehat{\sigma^{2}} &= \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{n_x}
\end{align}
giving a profile deviance that is a function only of the parameter \(\bm{\theta}\):
\begin{align}
D \left( \bm{\theta}; \bm{y} \right)&= 2\,{\rm ln} {|{\rm det}(\bm{R}_{11})|} + n_x \left[ {\rm ln} \left(2\pi \widehat{\sigma^{2}}\right) + 1 \right]
\end{align}
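The profiling of \(\sigma^2\) can be checked numerically. The sketch below (invented data and a simple one-parameter \(\bm{\Lambda}\); not `WeMix` code) computes the unweighted deviance from the QR decomposition and confirms that \(\widehat{\sigma^2} = r^2/n_x\) minimizes it:

```python
import numpy as np

def deviance(theta, sigma2, y, X, Z):
    """Unweighted deviance via QR: 2 ln|det R11| + n ln(2 pi sigma^2) + r^2 / sigma^2."""
    n, p = X.shape
    q = Z.shape[1]
    Lam = theta * np.eye(q)                          # a simple one-parameter Lambda(theta)
    A = np.block([[Z @ Lam, X], [np.eye(q), np.zeros((q, p))]])
    b = np.concatenate([y, np.zeros(q)])
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    r2 = np.sum((b - A @ sol) ** 2)                  # r^2 at (beta-hat, upsilon-hat)
    R11 = np.linalg.qr(A, mode='r')[:q, :q]
    logdet = np.log(np.abs(np.diag(R11))).sum()      # ln |det(R11)|
    return 2 * logdet + n * np.log(2 * np.pi * sigma2) + r2 / sigma2, r2

rng = np.random.default_rng(3)
n, p, q = 30, 2, 4
X, Z = rng.normal(size=(n, p)), rng.normal(size=(n, q))
y = rng.normal(size=n)

_, r2 = deviance(0.5, 1.0, y, X, Z)
s2_hat = r2 / n                                      # the profiled-out sigma^2
d_hat, _ = deviance(0.5, s2_hat, y, X, Z)
d_half, _ = deviance(0.5, 0.5 * s2_hat, y, X, Z)
d_twice, _ = deviance(0.5, 2.0 * s2_hat, y, X, Z)
```

Holding \(\bm{\theta}\) fixed, the deviance is larger at both half and twice \(\widehat{\sigma^2}\), consistent with the profile solution.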

\section{Weighted Case}
The purpose of `WeMix` is to estimate the likelihood of the weighted model (eqs. \ref{eq:WeMixA} and \ref{eq:WeMixB}). In this section we derive the weighted estimator that is analogous to the estimator used by `lme4` and is similar to Henderson (1982), but we include a deviance (and likelihood), which Henderson omits.

The first difference is in the penalized sum of squares, which now weights the residuals by the unit-level weights (\(\bm{\Omega}\)) and the random-effect penalties by the group-level weights (\(\bm{\Psi}\)):
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \bm{\Omega}^{\frac{1}{2}}\left( \bm{y}-\bm{X\beta} - \sum_{l=2}^{L} \bm{Z}_l\bm{\Lambda}_{ll}(\bm{\theta})\bm{\upsilon}_l \right) \right|\right|^{2} + \sum_{l=2}^{L} \left| \left| \bm{\Psi}_{ll}^{\frac{1}{2}} \bm{\upsilon}_l \right|\right|^{2} \label{eq:r2sep}
\end{align}

Then, the weighted pseudo-data notation combines the two terms in eq. \ref{eq:r2sep}, adding a vector of *pseudo-data* to the end of the \(\bm{y}\) vector:
\begin{align}
r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon}) &= \left| \left| \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}} \bm{y} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}} \bm{Z\Lambda}(\bm{\theta}) & \bm{\Omega}^{\frac{1}{2}}\bm{X} \\ \bm{\Psi}^{\frac{1}{2}} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{\upsilon} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} \label{eq:WeMixWr2} \\
&= \left|\left| \bm{\Omega}^{\frac{1}{2}} \left( \bm{y} - \bm{Z\Lambda}(\bm{\theta}) \bm{\upsilon} - \bm{X\beta} \right) \right| \right|^{2} + \left| \left| \bm{\Psi}^{\frac{1}{2}} \bm{\upsilon} \right| \right|^{2}
\end{align}
where \(\bm{Z}\) is now a block matrix that incorporates all of the \(\bm{Z}\) matrices for the various levels:
\begin{align}
\bm{Z} = \left[ \begin{matrix} \bm{Z}_2 & \bm{Z}_3 & \cdots & \bm{Z}_L \end{matrix} \right]
\end{align}
\(\bm{\Lambda}(\bm{\theta})\) is a block diagonal matrix with elements \(\bm{\Lambda}_{ll}(\bm{\theta})\), \(\bm{\Psi}\) is a block diagonal matrix with elements \(\bm{\Psi}_{ll}\), and
\begin{align}
\bm{\upsilon} = \left[ \begin{matrix} \bm{\upsilon}_2 \\ \bm{\upsilon}_3 \\ \vdots \\ \bm{\upsilon}_L \end{matrix} \right]
\end{align}
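As in the unweighted case, the weighted stacked system and the weighted penalized normal equations give the same solution; a minimal numerical sketch (invented data and weights, \(\bm{\Lambda}=\bm{I}\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 15, 2, 3
X, Z = rng.normal(size=(n, p)), rng.normal(size=(n, q))
y = rng.normal(size=n)
Lam = np.eye(q)
Omega = np.diag(rng.uniform(0.5, 3.0, n))    # unit-level weights
Psi = np.diag(rng.uniform(0.5, 3.0, q))      # group-level weights on the penalty
Os, Ps = np.sqrt(Omega), np.sqrt(Psi)

# weighted stacked system: [Omega^(1/2) y; 0] ~ [Omega^(1/2) Z Lam, Omega^(1/2) X; Psi^(1/2), 0]
A = np.block([[Os @ Z @ Lam, Os @ X], [Ps, np.zeros((q, p))]])
b = np.concatenate([Os @ y, np.zeros(q)])
sol, *_ = np.linalg.lstsq(A, b, rcond=None)

# same solution from the weighted penalized normal equations:
# (C' Omega C + blockdiag(Psi, 0)) x = C' Omega y
C = np.hstack([Z @ Lam, X])
lhs = C.T @ Omega @ C
lhs[:q, :q] += Psi
sol2 = np.linalg.solve(lhs, C.T @ Omega @ y)
```

Scaling the rows by \(\bm{\Omega}^{\frac{1}{2}}\) and the pseudo-data rows by \(\bm{\Psi}^{\frac{1}{2}}\) is exactly what turns the unweighted penalized least squares problem into the weighted one.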

The likelihood of \(\bm{y}\), conditional on \(\bm{\upsilon}\), is then
\begin{align}
f_{\bm{y}|\bm{\upsilon}}(\bm{y}| \bm{\upsilon}) &= \prod_{i=1}^{n_x} \left[ \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{1}{2}}} \exp\left(-\frac{\left(y_i-\bm{X}_i\bm{\beta}-\bm{Z}_i\bm{\Lambda}(\bm{\theta})\bm{\upsilon} \right)^{2}}{2\sigma^{2}}\right) \right]^{\bm{\Omega}_{ii}} \label{eq:WeMixWfyu} \\
&= \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{1}{2}\sum_i \bm{\Omega}_{ii}}} \exp\left(-\frac{\left|\left| \bm{\Omega}^{\frac{1}{2}}\left(\bm{y}-\bm{X}\bm{\beta}-\bm{Z}\bm{\Lambda}(\bm{\theta})\bm{\upsilon}\right) \right|\right|^{2}}{2\sigma^{2}}\right)
\end{align}
while the weighted density of the random effects is
\begin{align}
f_{\bm{\upsilon}}(\bm{\upsilon}) &= \prod_{k} \left[ \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{1}{2}}} \exp\left(-\frac{\upsilon_k^{2}}{2\sigma^{2}}\right) \right]^{\bm{\Psi}_{kk}} = \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{1}{2}\sum_k \bm{\Psi}_{kk}}} \exp\left(-\frac{\left|\left| \bm{\Psi}^{\frac{1}{2}} \bm{\upsilon} \right|\right|^{2}}{2\sigma^{2}}\right) \label{eq:WeMixWfu}
\end{align}

The joint distribution of \(\bm{\upsilon}\) and \(\bm{y}\) is then the product of eqs. \ref{eq:WeMixWfyu} and \ref{eq:WeMixWfu}:
\begin{align}
f_{\bm{y}, \bm{\upsilon}} \left(\bm{y}, \bm{\upsilon}; \bm{\beta}, \bm{\theta}, \sigma ^{2} \right)&= f_{\bm{y}|\bm{\upsilon}}(\bm{y}| \bm{\upsilon})\cdot f_{\bm{\upsilon}}(\bm{\upsilon}) \\
&= \frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{1}{2}\left(\sum_i \bm{\Omega}_{ii} + \sum_k \bm{\Psi}_{kk}\right)}} \exp\left(-\frac{r^{2}(\bm{\theta}, \bm{\beta}, \bm{\upsilon})}{2\sigma^{2}}\right)
\end{align}
Integrating out \(\bm{\upsilon}\), the factor that the change of variables contributed in the unweighted case (through \(|{\rm det}(\bm{R}_{11})|\)) is replaced by a weighted analog involving \(\alpha(\bm{\theta};\bm{\Psi},\bm{\Omega})\), so the deviance becomes
\begin{align}
D \left( \bm{\beta}, \bm{\theta}, \sigma^{2}; \bm{y} \right)&= 2\,{\rm ln}\left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i \bm{\Omega}_{ii} \right) {\rm ln} \left(2\pi\sigma^{2}\right) + \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}}) + \left|\left| \bm{R}_{22} (\bm{\beta} - \hat{\bm{\beta}}) \right|\right|^{2}}{\sigma^{2}} \label{eq:tmpL2}
\end{align}

While these formulas allow for a likelihood function and estimation of \(\hat{\bm{\beta}}\) and \(\widehat{\sigma^2}\) via profiling, neither estimate depends on \(\alpha\). However, because \(\bm{\theta}\) appears as a parameter of \(\alpha\), optimizing the log-likelihood with respect to \(\bm{\theta}\) requires all of the terms in the log-likelihood to be calculated, including \(\alpha\).

\subsection{Calculation of \(\alpha\)}
Bates and Pinheiro (1998) offer an unweighted method of calculating \(\alpha\) that we extend here to admit weighting. The essential insight of Bates and Pinheiro is that the variables must be remapped, per group, using an orthogonal transform that separates out the \(q_l\) random effects associated with group \(g\) at level \(l\). In what follows we describe a three-level case, but the methods readily generalize to the \(L\)-level case.

The integral is slightly re-expressed using \(\bm{u}\) instead of \(\bm{\upsilon}\), but instead of using \(\bm{\Lambda}\) it uses the \(\bm{\Delta}\) matrix, which is an individual block of \(\bm{\Lambda}\), so \(\bm{\Delta}\) is defined, implicitly, by
\begin{align}
\bm{\Lambda}(\bm{\theta}) = \bm{\Delta}(\bm{\theta}) \otimes \bm{I}
\end{align}
where \(\otimes\) is the Kronecker product.

Using this notation, the likelihood is then given by
\begin{align}
\mathcal{L}&=\int \prod_g \left[ \int f_{y|U} (\bm{y}, \bm{\theta}, \bm{u}_{2g}, \cdots, \bm{u}_{Lg} )\, f_U (\bm{\theta}, \bm{u}_{2g})\, d\bm{u}_{2g} \right] f_{U} (\bm{\theta}, \bm{u}_{3g})\, d\bm{u}_{3g} \label{eq:intalpha} \\
&\propto \int \prod_g \left[ \int \exp \left( -\frac{ \left | \left |\bm{\Omega}_{gg}^{\frac{1}{2}} (\bm{y}_g - \bm{X}_g \bm{\beta} - \bm{Z}_g \bm{u}) \right | \right| ^{2} + \left| \left| \bm{\Delta}_2 \bm{u}_{2g} \right|\right|^{2} + \left| \left| \bm{\Delta}_3 \bm{u}_{3g} \right|\right|^{2} }{2\sigma^{2}} \right) d\bm{u}_{2g} \right] d\bm{u}_{3g}
\end{align}

While the residual sum of squares (\(r^2\)) has been, up until now, treated for the entire data set, the notion is to think of the residual sum of squares for just group \(g\) (at level \(l\)). Bates and Pinheiro (1998) also use the notion of \(r^2_g\), or the \(r^2\) contribution for group \(g\), which is defined as
\begin{align}
r_g^{2}(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{Z}_{g} & \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{X}_g \\ \bm{\Delta} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{g} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2}
\end{align}

Expanding \(\bm{Z}_{g}\) gives
\begin{align}
r_g^{2}(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{Z}_{2g} & \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{Z}_{3g} & \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{X}_g \\ \bm{\Delta}_2 & \bm{0} & \bm{0} \\ \bm{0} & \bm{\Delta}_3 & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{2g} \\ \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} \label{eq:hugerg}
\end{align}

Starting at level 2, a change of variables is chosen via the (full) QR decomposition,
\begin{align}
\left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{2 g} \\ \bm{\Delta}_2 \end{matrix} \right] &= \bm{Q}_{2g} \left[ \begin{matrix} \bm{R}_{12g} \\ \bm{0} \end{matrix} \right] \label{eq:QR0}
\end{align}
where the subscript on \(\bm{R}_{12g}\) indicates that it is the top submatrix (\(1\)), at level 2, and for group \(g\). Notice that the blocks are different shapes on the left- and right-hand sides of eq. \ref{eq:QR0}; on the left-hand side, the top block (\(\bm{\Omega}^{\frac{1}{2}}_{gg} \bm{Z}_{2 g}\)) has as many rows as there are observations in group \(g\) and the bottom block (\(\bm{\Delta}_2\)) has as many rows as there are random effects at level 2, while the right-hand side is flipped. The top block (\(\bm{R}_{12g}\)) has as many rows as there are random effects at level 2, while the bottom block (\(\bm{0}\)) has as many rows as there are observations in group \(g\). The reason for this is that the change makes the top block on the right-hand side a square, upper triangular matrix, which will allow the change of variables to proceed.
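The shapes involved in this full QR decomposition can be seen with a small example (random matrices standing in for the weighted design blocks):

```python
import numpy as np

rng = np.random.default_rng(5)
n_g, q2 = 8, 2                            # group-g rows and level-2 random effects
top = rng.normal(size=(n_g, q2))          # stands in for Omega_gg^(1/2) Z_2g
Delta2 = np.eye(q2)                       # stands in for Delta_2
M = np.vstack([top, Delta2])              # (n_g + q2) x q2 stacked matrix

Q, R = np.linalg.qr(M, mode='complete')   # full QR: Q is square
R_top = R[:q2, :]                         # the square, upper triangular top block
```

The top block of \(\bm{R}\) is square and upper triangular while the remaining rows are zero, matching the description above.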

Because \(\bm{Q}_{2g}\) is orthogonal by construction, \(\bm{Q}_{2g}^T \bm{Q}_{2g}= I\), one can freely premultiply the inside of the sum of squares in eq. \ref{eq:hugerg} by \(\bm{Q}_{2g}^T\) without changing the sum of squares:\footnote{Recalling \(||\bm{x}||^2 = \bm{x}^T \bm{x}\), so that, for orthogonal matrix \(\bm{Q}\), it is easy to see \(||\bm{Qx}||^2 = \bm{x}^T\bm{Q}^T \bm{Qx} = \bm{x}^T\bm{x}\).}
\begin{align}
r_g^{2}(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \bm{Q}_{2g}^{T} \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] - \bm{Q}_{2g}^{T} \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{Z}_{2g} & \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{Z}_{3g} & \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{X}_g \\ \bm{\Delta}_2 & \bm{0} & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{2g} \\ \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} + \left| \left| \bm{\Delta}_3 \bm{u}_{3g} \right|\right|^{2} \label{eq:Qz}
\end{align}
The premultiplied blocks define the remapped variables
\begin{align}
\bm{Q}_{2g}^{T} \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{y}_g \\ \bm{0} \end{matrix} \right] = \left[ \begin{matrix} \bm{R}_{1yg} \\ \bm{R}_{2yg} \end{matrix} \right], \quad \bm{Q}_{2g}^{T} \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{Z}_{3g} \\ \bm{0} \end{matrix} \right] = \left[ \begin{matrix} \bm{R}_{13g} \\ \bm{R}_{23g} \end{matrix} \right], \quad \bm{Q}_{2g}^{T} \left[ \begin{matrix} \bm{\Omega}^{\frac{1}{2}}_{gg}\bm{X}_g \\ \bm{0} \end{matrix} \right] = \left[ \begin{matrix} \bm{R}_{1Xg} \\ \bm{R}_{2Xg} \end{matrix} \right] \label{eq:Qtdef}
\end{align}

Multiplying the \(\bm{Q}_{2g}^T\) through and plugging the eq. \ref{eq:Qtdef} equations into eq. \ref{eq:Qz},
\begin{align}
r_g^{2}(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \left[ \begin{matrix} \bm{R}_{1yg} \\ \bm{R}_{2yg} \end{matrix} \right] - \left[ \begin{matrix} \bm{R}_{12g} & \bm{R}_{13g} & \bm{R}_{1Xg} \\ \bm{0} & \bm{R}_{23g} & \bm{R}_{2Xg} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{2g} \\ \bm{u}_{3g} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} + \left| \left| \bm{\Delta}_3 \bm{u}_{3g} \right|\right|^{2}
\end{align}

We continue with these remapped variables, starting with the unweighted case (only at level 2), and now using \(g'\) to indicate that this is a different group (at level 3), with \(n_{g'}\) subgroups in it. Each subgroup (labeled \(i\)) contributes an outcomes matrix \(\bm{R}_{2yi}\), a matrix per level 2 group \(\bm{R}_{23i}\), and a matrix per fixed-effects regressor \(\bm{R}_{2Xi}\), for \(i=1, \cdots, n_{g'}\). Combining these, the residual sum of squares at level 3 for the group \(g'\) is
\begin{align}
r_{g'}^{2}(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \left[ \begin{matrix} \bm{R}_{2y1} \\ \vdots \\ \bm{R}_{2yn_{g'}} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{R}_{231} & \bm{R}_{2X1} \\ \vdots & \vdots \\ \bm{R}_{23n_{g'}} & \bm{R}_{2Xn_{g'}} \\ \bm{\Delta}_3 & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{3g'} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2} \label{eq:RunW}
\end{align}

When there are level 2 conditional weights—non-unit probabilities of selection for the groups at level 2, conditional on the selection of the level 3 unit—each matrix in eq. \ref{eq:RunW} could be replicated \(\bm{\Psi}_{ii}\) times. Equivalently, each matrix can be weighted by the conditional probability of selection, so that eq. \ref{eq:RunW} becomes
\begin{align}
r_{g'}^{2}(\bm{\theta}, \bm{\beta}, \bm{u}) &= \left| \left| \left[ \begin{matrix} \bm{\Psi}_{11}^{\frac{1}{2}}\bm{R}_{2y1} \\ \vdots \\ \bm{\Psi}_{n_{g'}n_{g'}}^{\frac{1}{2}}\bm{R}_{2yn_{g'}} \\ \bm{0} \end{matrix} \right] - \left[ \begin{matrix} \bm{\Psi}_{11}^{\frac{1}{2}}\bm{R}_{231} & \bm{\Psi}_{11}^{\frac{1}{2}}\bm{R}_{2X1} \\ \vdots & \vdots \\ \bm{\Psi}_{n_{g'}n_{g'}}^{\frac{1}{2}}\bm{R}_{23n_{g'}} & \bm{\Psi}_{n_{g'}n_{g'}}^{\frac{1}{2}}\bm{R}_{2Xn_{g'}} \\ \bm{\Delta}_3 & \bm{0} \end{matrix} \right] \left[ \begin{matrix} \bm{u}_{3g'} \\ \bm{\beta} \end{matrix} \right] \right|\right|^{2}
\end{align}

This change leads to the same QR decomposition as the replicated case:
\begin{align}
\left[ \begin{matrix} \bm{\Psi}_{11}^{\frac{1}{2}} \bm{R}_{231} \\ \vdots \\ \bm{\Psi}_{n_{g'}n_{g'}}^{\frac{1}{2}} \bm{R}_{23n_{g'}} \\ \bm{\Delta}_3 \end{matrix} \right] &= \bm{Q}_{3g'} \left[ \begin{matrix} \bm{R}_{13g'} \\ \bm{0} \end{matrix} \right] \label{eq:QRg3}
\end{align}

The weighted value of \(\alpha_3\) for the third level is then
\begin{align}
\alpha_{3} = \sum_{g'=1}^{n_3} \bm{\Psi}_{g'g'} \left| \bm{R}_{13g'} \right|^{-1}
\end{align}

\subsection{Implementation Details}
The actual implementation of the calculation is slightly different from what is above. First, when the QR decomposition is taken (eqs. \ref{eq:QR0} and \ref{eq:QRg3}), it is possible to stop the decomposition as is described in Bates and Pinheiro (1998). It is also possible to continue the QR decomposition for the other levels of \(\bm{Z}\). Using the \(\bm{QR}\) on the entire matrix still results in an orthogonal component in the submatrices, and so meets the goals of the decomposition while obviating the need to form the \(\bm{Q}\) matrix explicitly.

Also note that, in the above, the value of \(r^2\) was never used, so the components relating to \(\bm{y}\) and \(\bm{X}\) need not be formed.
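For example, with LAPACK-backed routines the \(\bm{R}\) factor can be obtained without ever forming \(\bm{Q}\); a small sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(20, 5))

R_only = np.linalg.qr(M, mode='r')        # computes R without forming Q
Q, R = np.linalg.qr(M)                    # reduced QR, for comparison
```

The two \(\bm{R}\) factors agree up to the signs of their rows, so quantities such as \(|{\rm det}(\bm{R})|\) are unaffected by skipping \(\bm{Q}\).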

\section{Estimation}
Continuing to follow `lme4`, the estimation uses the profile likelihood. Since \(\bm{\beta}\) appears only in the final term in quadratic form, it is immediately evident that the maximum likelihood estimator (MLE) for \(\bm{\beta}\) is \(\hat{\bm{\beta}}\), making eq. \ref{eq:tmpL2} profile to
\begin{align}
D \left( \bm{\theta}, \sigma^{2}; \bm{y} \right)&= 2\,{\rm ln} \left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i \bm{\Omega}_{ii} \right) {\rm ln} \left(2\pi\sigma^{2}\right) + \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{\sigma^{2}} \label{eq:WeMixPB}
\end{align}

Then, the value of \(\sigma^2\) can also be profiled out by taking the derivative of the deviance with respect to \(\sigma^2\) and setting it equal to zero (Bates et al., 2015, eq. 32):
\begin{align}
0 &= \frac{\sum_i \bm{\Omega}_{ii}}{\widehat{\sigma^{2}}} - \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{\widehat{\sigma^{4}}}
\end{align}
rearranging
\begin{align}
\widehat{\sigma^{2}} &= \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{\sum_i \bm{\Omega}_{ii}} \label{eq:WeMixMaxs2}
\end{align}
Eq. \ref{eq:WeMixMaxs2} can then be plugged into eq. \ref{eq:WeMixPB} to give
\begin{align}
D \left( \bm{\theta}; \bm{y} \right)&= 2\,{\rm ln} \left(\alpha (\bm{\theta};\bm{\Psi},\bm{\Omega}) \right) + \left(\sum_i \bm{\Omega}_{ii}\right) \left[ {\rm ln} \left(2\pi\widehat{\sigma^{2}}\right) + 1 \right] \label{eq:WeMixP}
\end{align}
This function is then minimized numerically with respect to \(\bm{\theta}\), using the profile estimates for \(\bm{\beta}\) and \(\bm{\upsilon}\) (eq. \ref{eq:WeMixWr2}) and \(\widehat{\sigma^2}\) (eq. \ref{eq:WeMixMaxs2}).

The estimated values are then the \(\bm{\theta}\) that minimizes eq. \ref{eq:WeMixP}, the \(\sigma^2\) value from eq. \ref{eq:WeMixMaxs2}, and the \(\bm{\beta}\) values from solving the system of equations in eq. \ref{eq:WeMixWr2}.

The variance of \(\bm{\beta}\) is estimated with the inverse Hessian (Bates et al., 2015, eq. 54):
\begin{align}
\widehat{{\rm Var}}(\bm{\beta}) &= \widehat{\sigma^{2}} \bm{R}_{22}^{-1} \left(\bm{R}_{22}^{T}\right)^{-1}
\end{align}

A robust (sandwich) variance estimator (Binder, 1983) is appropriate:
\begin{align}
\left[\widehat{\sigma^{2}} \left(\bm{R}_{22}^{T}\bm{R}_{22}\right)^{-1}\right] \bm{J} \left[\widehat{\sigma^{2}} \left(\bm{R}_{22}^{T}\bm{R}_{22}\right)^{-1}\right] \label{eq:sandwich}
\end{align}
where \(\bm{J}\) is the sum of outer products of the Jacobian matrix
\begin{align}
\bm{J} = \frac{n_L}{n_L-1} \sum_{g=1}^{n_L} \frac{\partial \ell_g}{\partial \bm{\beta}} \left[ \frac{\partial \ell_g}{\partial \bm{\beta}} \right]^{T}
\end{align}
where \(n_L\) is the number of level-\(L\) (top-level) groups, \(g\) indexes level-\(L\) groups, and \(\ell_g\) is the log-likelihood for group \(g\) and all groups and units nested inside of \(g\). The log-likelihood of the full model is
\begin{align}
\ell(\bm{\beta}, \bm{\theta}, \sigma ^{2}; \bm{y}) &= {\rm ln} \left[ \alpha(\bm{\theta};\bm{\Omega},\bm{\Psi}) \right] - \frac{\sum_{i} \bm{\Omega}_{ii}}{2}\, {\rm ln} \left(2\pi\sigma^{2}\right) - \frac{r^{2}(\bm{\theta}, \hat{\bm{\beta}}, \hat{\bm{\upsilon}})}{2\sigma^{2}} \label{eq:Vlnl}
\end{align}

While it would be convenient if eq. \ref{eq:Vlnl} could be directly broken up into a portion attributable to each group, and the leading terms can be, the final term has dependencies across multiple groups. A distinct likelihood is needed that depends only on the data in that group. This is achieved by noting that the data for a particular group are also valid data for a mixed model of the same type as the global mixed model, and so eq. \ref{eq:Vlnl} can be used on a single group's data to get the group log-likelihood; thus a group log-likelihood can be written using the fitted value of \(\bm{\beta}\) in the group (\(\hat{\bm{\beta}}_g\))
\begin{align}
\ell_{g}(\bm{\beta}, \bm{\theta}, \sigma^{2}; \bm{y}_{g}) &= \ln\left[ \alpha_{g}(\bm{\theta};\bm{\Omega},\bm{\Psi}) \right] - \frac{\sum_{i\in g} \bm{\Omega}_{ii}}{2} \ln\left(2\pi\sigma^{2}\right) - \frac{\hat{\bm{u}}_{g}^{T} \bm{\Psi}_{g} \hat{\bm{u}}_{g}}{2\sigma^{2}} \nonumber \\
&\quad - \frac{\left(\bm{y}_{g}-\bm{X}_{g}\hat{\bm{\beta}}_{g}-\bm{Z}_{g}\hat{\bm{u}}_{g}\right)^{T} \bm{\Omega}_{g} \left(\bm{y}_{g}-\bm{X}_{g}\hat{\bm{\beta}}_{g}-\bm{Z}_{g}\hat{\bm{u}}_{g}\right)}{2\sigma^{2}}
\end{align}

Using these formulas, the Jacobian matrix can be calculated numerically and the robust variance estimator formed with eq. \ref{eq:sandwich}.
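To make the pieces concrete, the following is a minimal numerical sketch (illustrative only, not WeMix's implementation) using an unweighted, single-level Gaussian model: each group's gradient is obtained by central finite differences of that group's log-likelihood, the meat \(\bm{J}\) is the corrected sum of outer products of those gradients, and the bread is the model-based variance \(\widehat{\sigma^{2}}(\bm{X}^{T}\bm{X})^{-1}\). All names and data here are hypothetical.

```python
import numpy as np

def group_loglik(beta, Xg, yg, sigma2):
    """Gaussian log-likelihood of one group's data, with sigma^2 held fixed."""
    r = yg - Xg @ beta
    return -0.5 * len(yg) * np.log(2 * np.pi * sigma2) - r @ r / (2 * sigma2)

def numeric_grad(f, beta, h=1e-6):
    """Central-difference gradient, standing in for one row of the Jacobian."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        e = np.zeros_like(beta)
        e[j] = h
        grad[j] = (f(beta + e) - f(beta - e)) / (2 * h)
    return grad

# Toy data: 30 groups of 10 observations, two fixed effects.
rng = np.random.default_rng(0)
n_groups, per_group, p = 30, 10, 2
g = np.repeat(np.arange(n_groups), per_group)
X = np.column_stack([np.ones(n_groups * per_group),
                     rng.normal(size=n_groups * per_group)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=X.shape[0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (len(y) - p)
bread = sigma2 * np.linalg.inv(X.T @ X)      # model-based Var(beta)

# Meat: n_L/(n_L - 1) times the sum of outer products of group gradients.
J = np.zeros((p, p))
for k in range(n_groups):
    rows = g == k
    grad = numeric_grad(lambda b: group_loglik(b, X[rows], y[rows], sigma2),
                        beta_hat)
    J += np.outer(grad, grad)
J *= n_groups / (n_groups - 1)

var_robust = bread @ J @ bread               # sandwich estimator
print(np.sqrt(np.diag(var_robust)))          # robust standard errors
```

The same recipe carries over to the mixed-model case by replacing `group_loglik` with the group log-likelihood above and the bread with the inverse Hessian of the full likelihood.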

So far this section has regarded only \(\bm{\beta}\), but similar methods apply to estimating the variance of the random-effect variance estimates (\(\bm{\theta}\) and \(\sigma\)). These terms have their variance estimated under the assumption that they are uncorrelated with the \(\bm{\beta}\) terms. At each level, the variance (including a term for \(\sigma\)) is calculated as
\begin{align}
{\rm Var} \left(\bm{\theta},\sigma \right) = \left( -\bm{H} \right)^{-1} \bm{J}_{\bm{\theta},\sigma} \left( -\bm{H} \right)^{-1} \label{vartheta}
\end{align}
where \(\bm{H}\) is the Hessian of the likelihood (eq. \ref{eq:Vlnl}) with respect to \(\bm{\theta}\) and \(\sigma\), while \(\bm{J}_{\bm{\theta},\sigma}\) is the portion of the Jacobian that pertains to \(\bm{\theta}\) and \(\sigma\). The estimated variance of \(\sigma\) from the lowest level (level 2) is used to form the standard error of the residual variance.

However, the reported variance estimates are not simply the values of \(\bm{\theta}\) and \(\sigma\) but transformations of them (eq. \ref{eq:root}). To estimate the variances of the variance estimates, the delta method is used, so that
\begin{align}
{\rm Var}\left(\bm{\Sigma}, \sigma^{2} \right) = \left[ \nabla ( \bm{\Lambda}^{T} \bm{\Lambda} ) \right]^{T} {\rm Var}\left(\bm{\theta}, \sigma^{2} \right) \left[ \nabla ( \bm{\Lambda}^{T} \bm{\Lambda}) \right]
\end{align}
where the gradient (\(\nabla (\cdot)\)) is taken with respect to the elements of \(\bm{\Sigma}\) and \(\sigma^2\), and \(\rm{Var}\left(\bm{\theta}, \sigma^2 \right)\) is from eq. \ref{vartheta}.
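For intuition, a scalar sketch of the delta method (illustrative only, not WeMix code): if the optimizer works with \(\sigma\) but the reported quantity is \(\sigma^2\), the gradient of the transform \(g(s) = s^2\) rescales the variance, which is the scalar case of the sandwich form above. The numbers here are hypothetical.

```python
# Delta method, scalar case: Var(sigma^2) ~= [dg/ds]^2 * Var(sigma)
# with g(s) = s^2, so dg/ds = 2 * sigma.
sigma_hat = 1.5    # hypothetical estimate of sigma
var_sigma = 0.04   # hypothetical Var(sigma) from the Hessian-based estimator

grad = 2 * sigma_hat                  # gradient of the transform at sigma_hat
var_sigma2 = grad * var_sigma * grad  # [grad]' Var(sigma) [grad]
print(var_sigma2)
```

In the multivariate case, `grad` becomes the matrix of partial derivatives of the elements of \(\bm{\Lambda}^{T}\bm{\Lambda}\), and the same pre- and post-multiplication applies.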

We can use a Wald test to test both the fixed-effects parameters (\(\bm{\beta}\)) and the variances of the random effects (\(\bm{\Lambda}\)).

The Wald test compares the estimated parameters with null hypothesis values. In the default case, the null hypothesis is that the value of each parameter is 0.

In this default case, if the test fails to reject the null hypothesis, removing the corresponding variables from the model will not substantially harm its fit.

One advantage of the Wald test is that it can be used to test multiple hypotheses about multiple parameters simultaneously.

To test \(q\) hypotheses on \(p\) estimated parameters, let \(\hat{\bm{\beta}}\) be the vector of estimated coefficients, \(R\) be a \(q \times p\) hypothesis matrix (one row per hypothesis, with a 1 in the column corresponding to each coefficient being tested), \(\hat{V}\) be the estimated covariance matrix for \(\hat{\bm{\beta}}\), and \(r\) be the vector of hypothesized values.

Then the Wald test statistic for multiple parameters is equal to:

\[ W = (R\hat{\bm{\beta}} - r)^{T}\left(R\hat{V}R^{T}\right)^{-1}(R\hat{\bm{\beta}} - r) \]

The resulting test statistic can be compared against a chi-square distribution whose degrees of freedom equal the number of hypotheses tested:

\[ W \sim \chi^2(q) \]
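As an illustrative sketch (the estimates and covariance matrix below are hypothetical stand-ins, not WeMix output), the statistic can be computed directly; with \(q = 2\) hypotheses the \(\chi^2\) survival function simplifies to \(e^{-W/2}\), so no distribution library is needed:

```python
import numpy as np

# Wald test of H0: beta_1 = 0 and beta_2 = 0 in a 3-coefficient model.
beta_hat = np.array([0.8, 1.9, -0.3])  # hypothetical estimates
V = np.diag([0.04, 0.09, 0.05])        # hypothetical covariance of beta_hat

R = np.array([[0.0, 1.0, 0.0],         # one row per hypothesis, with a 1 in
              [0.0, 0.0, 1.0]])        # the column of the tested coefficient
r = np.zeros(2)                        # hypothesized values (default: 0)

diff = R @ beta_hat - r
W = diff @ np.linalg.solve(R @ V @ R.T, diff)
q = R.shape[0]                         # degrees of freedom = number of rows of R
p_value = np.exp(-W / 2)               # chi-square survival function for q = 2
print(W, p_value)
```

For general \(q\), the p-value would come from the \(\chi^2(q)\) survival function (e.g., `scipy.stats.chi2.sf`); the closed form above holds only for two degrees of freedom.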

Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2015). Fitting linear mixed-effects models using lme4. *Journal of Statistical Software*, *67*(*1*), 1–48.

Bates, D., & Pinheiro, J. C. (1998). *Computational methods for multilevel modelling.* Bell Labs Report.

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. *International Statistical Review*, *51*(*3*), 279–292.

Gelman, A. (2007). Struggles with survey weighting and regression modeling. *Statistical Science*, *22*(*2*), 153–164.

Henderson, C. R. (1982). Analysis of covariance in the mixed model: higher-level, nonhomogeneous, and random regressions. *Biometrics*, *38*(*3*), 623–640.

Integration by Substitution. (n.d.). In Wikipedia. Retrieved February 13, 2019, from \url{https://en.wikipedia.org/wiki/Integration_by_substitution}

Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. *Journal of the Royal Statistical Society*. Series A (Statistics in Society), *169*(*4*), 805–827.

Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. *Stata Journal*, *2*, 1–21.

Trefethen, L. N., & Bau, D. (1997). *Numerical linear algebra*. Philadelphia, PA: SIAM.