ggmix
is a package
that fits penalized multivariable linear mixed models with a single
random effect to account for population structure in genetic association
studies. The penalty allows one to fit high-dimensional models, i.e.,
where the number of variables is much larger than the number of
observations.
This package currently supports the lasso penalty. The group lasso
penalty will soon be implemented.
We consider the following linear mixed model with a single random effect:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{b} + \mathbf{e}$$

where the random effect b and the error term e are assigned the distributions

$$\mathbf{b} \sim \mathcal{N}(0, \eta\sigma^2\boldsymbol{\Phi}) \qquad \mathbf{e} \sim \mathcal{N}(0, (1-\eta)\sigma^2\mathbf{I})$$

Here, $\boldsymbol{\Phi}_{N_T \times N_T}$ is a known positive semi-definite and symmetric kinship matrix, $\mathbf{I}_{N_T \times N_T}$ is the identity matrix, and the parameters σ² and η ∈ [0, 1] determine how the variance is divided between b and e. The joint density of Y is multivariate normal:

$$\mathbf{Y} \mid (\boldsymbol{\beta}, \eta, \sigma^2) \sim \mathcal{N}\left(\mathbf{X}\boldsymbol{\beta},\ \eta\sigma^2\boldsymbol{\Phi} + (1-\eta)\sigma^2\mathbf{I}\right)$$
Define the complete parameter vector Θ = (β, η, σ²). The negative log-likelihood for this model is given by

$$-\ell(\boldsymbol{\Theta}) \propto \frac{N_T}{2}\log(\sigma^2) + \frac{1}{2}\log\det(\mathbf{V}) + \frac{1}{2\sigma^2}\left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)^\top \mathbf{V}^{-1} \left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)$$

where V = ηΦ + (1 − η)I and det(V) is the determinant of V. Let $\boldsymbol{\Phi} = \mathbf{U}\mathbf{D}\mathbf{U}^\top$ be the eigen (spectral) decomposition of the kinship matrix Φ, where $\mathbf{U}_{N_T \times N_T}$ is an orthonormal matrix of eigenvectors (i.e. $\mathbf{U}\mathbf{U}^\top = \mathbf{I}$) and $\mathbf{D}_{N_T \times N_T}$ is a diagonal matrix of eigenvalues $\Lambda_i$. Using the fact that $\mathbf{V} = \mathbf{U}\left(\eta\mathbf{D} + (1-\eta)\mathbf{I}\right)\mathbf{U}^\top$, the negative log-likelihood can be simplified to

$$-\ell(\boldsymbol{\Theta}) \propto \frac{N_T}{2}\log(\sigma^2) + \frac{1}{2}\sum_{i=1}^{N_T}\log\left(1 + \eta(\Lambda_i - 1)\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{N_T}\frac{\left(\tilde{Y}_i - \sum_{j=0}^{p}\tilde{X}_{ij}\beta_j\right)^2}{1 + \eta(\Lambda_i - 1)}$$

where $\tilde{\mathbf{Y}} = \mathbf{U}^\top\mathbf{Y}$, $\tilde{\mathbf{X}} = \mathbf{U}^\top[\mathbf{1}, \mathbf{X}]$, $\tilde{Y}_i$ denotes the $i$th element of $\tilde{\mathbf{Y}}$, $\tilde{X}_{ij}$ is the $(i,j)$th entry of $\tilde{\mathbf{X}}$, and $\mathbf{1}$ is a column vector of $N_T$ ones.
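To make this concrete, here is a minimal R sketch (not part of the ggmix API; the function name negloglik is ours) that evaluates the simplified negative log-likelihood directly from the spectral decomposition:

# illustrative sketch, not part of ggmix: evaluate the simplified
# negative log-likelihood at a given (beta, eta, sigma2)
negloglik <- function(beta, eta, sigma2, Y, X, Phi) {
  ed <- eigen(Phi, symmetric = TRUE)
  Yt <- crossprod(ed$vectors, Y)            # U^T Y
  Xt <- crossprod(ed$vectors, cbind(1, X))  # U^T [1 X]; beta[1] is the intercept
  w  <- 1 + eta * (ed$values - 1)           # 1 + eta * (Lambda_i - 1)
  r  <- Yt - Xt %*% beta                    # rotated residuals
  length(Y) / 2 * log(sigma2) + sum(log(w)) / 2 + sum(r^2 / w) / (2 * sigma2)
}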
We define the (p+3)-dimensional parameter vector $\boldsymbol{\Theta} := (\Theta_0, \Theta_1, \ldots, \Theta_p, \Theta_{p+1}, \Theta_{p+2}) = (\boldsymbol{\beta}, \eta, \sigma^2)$ where β ∈ ℝ^(p+1), η ∈ [0, 1], σ² > 0. In what follows, p+1 and p+2 are the indices in Θ for η and σ², respectively. Define the objective function:

$$Q_{\lambda}(\boldsymbol{\Theta}) = f(\boldsymbol{\Theta}) + \lambda \sum_{j=1}^{p} v_j P_j(\beta_j)$$

where f(Θ) := −ℓ(Θ) is the negative log-likelihood defined above, $P_j(\cdot)$ is a penalty term on the fixed regression coefficients $\beta_1, \ldots, \beta_p$ (we do not penalize the intercept $\beta_0$), controlled by the nonnegative regularization parameter λ, and $v_j$ is the penalty factor for the jth covariate. These penalty factors allow parameters to be penalized differently. Note that we do not penalize η or σ². The penalty term is a necessary constraint because in our applications the sample size is much smaller than the number of predictors. An estimate of the parameters $\widehat{\boldsymbol{\Theta}}_{\lambda}$ is obtained by

$$\widehat{\boldsymbol{\Theta}}_{\lambda} = \arg\min_{\boldsymbol{\Theta}} \; Q_{\lambda}(\boldsymbol{\Theta})$$
The package can be installed from GitHub via
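# assuming the development version lives at github.com/sahirbhatnagar/ggmix
# install.packages("devtools")
devtools::install_github("sahirbhatnagar/ggmix")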
We give a quick overview of the main functions and go into details in other vignettes. We will use the simulated data which ships with the package and can be loaded via:
library(ggmix)
data("admixed")
names(admixed)
#> [1] "ytrain" "ytune" "ytest" "xtrain"
#> [5] "xtune" "xtest" "xtrain_lasso" "xtune_lasso"
#> [9] "xtest_lasso" "Xkinship" "kin_train" "kin_tune_train"
#> [13] "kin_test_train" "mu_train" "causal" "beta"
#> [17] "not_causal" "kinship" "coancestry" "PC"
#> [21] "subpops"
For details on how this data was simulated, see help(admixed).
There are three basic inputs that ggmix needs:

1) Y: a continuous response variable
2) X: a matrix of covariates of dimension N × p, where N is the sample size and p is the number of covariates
3) Φ: a kinship matrix
We can visualize the kinship matrix in the admixed
data
using the popkin
package:
# need to install the package if you don't have it
# pacman::p_load_gh('StoreyLab/popkin')
popkin::plot_popkin(admixed$kin_train)
We will use the most basic call to the main function of this package, ggmix. By default, this function fits an L1-penalized linear mixed model (LMM) for 100 distinct values of the tuning parameter λ, choosing the sequence itself:
fit <- ggmix(x = admixed$xtrain,
y = admixed$ytrain,
kinship = admixed$kin_train)
names(fit)
#> [1] "result" "ggmix_object" "n_design" "p_design" "lambda"
#> [6] "coef" "b0" "beta" "df" "eta"
#> [11] "sigma2" "nlambda" "cov_names" "call"
class(fit)
#> [1] "lassofullrank" "ggmix_fit"
We can see the solution path for each variable by calling the plot method for objects of class ggmix_fit:
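plot(fit)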
We can also get the coefficients for given value(s) of lambda using the coef method for objects of class ggmix_fit:
coef(fit, s = c(0.1,0.02))
#> 51 x 2 Matrix of class "dgeMatrix"
#> 1 2
#> (Intercept) -0.03652365 0.246752560
#> X23 0.00000000 0.097969940
#> X36 0.00000000 -0.012883477
#> X38 0.00000000 0.005414466
#> X40 0.00000000 0.004049076
#> X53 0.00000000 0.000000000
#> X74 0.00000000 -0.020625732
#> X83 0.00000000 0.000000000
#> X114 0.00000000 -0.296767166
#> X139 0.00000000 0.099222114
#> X143 0.00000000 -0.014927711
#> X168 0.00000000 -0.331411922
#> X176 0.00000000 0.000000000
#> X244 0.00000000 0.000000000
#> X246 0.00000000 -0.005748825
#> X249 0.00000000 0.000000000
#> X266 0.00000000 0.046705809
#> X271 0.00000000 -0.185900625
#> X273 0.00000000 0.000000000
#> X282 0.00000000 -0.007852365
#> X286 0.00000000 0.000000000
#> X300 0.00000000 -0.096840340
#> X302 -0.17341415 -0.116940734
#> X312 0.00000000 0.000000000
#> X315 0.00000000 -0.182654846
#> X330 0.00000000 0.000000000
#> X336 0.00000000 0.138752496
#> X344 0.00000000 -0.009399151
#> X348 0.00000000 0.000000000
#> X352 0.00000000 -0.069070682
#> X375 0.00000000 0.000000000
#> X383 0.00000000 0.000000000
#> X403 0.00000000 0.021615875
#> X404 0.00000000 0.000000000
#> X420 0.00000000 0.000000000
#> X422 0.00000000 0.000000000
#> X431 0.00000000 0.000000000
#> X435 0.00000000 0.000000000
#> X441 0.00000000 0.000000000
#> X447 0.00000000 0.000000000
#> X468 0.00000000 0.000000000
#> X485 0.00000000 0.000000000
#> X503 0.00000000 -0.575003035
#> X512 0.00000000 -0.042270178
#> X515 0.00000000 0.000000000
#> X516 0.00000000 0.248060154
#> X407 0.00000000 -0.158297657
#> X507 0.00000000 0.000000000
#> X524 1.34259118 1.863637356
#> X538 -0.71457966 -1.081362608
#> X243 0.00000000 -0.229774448
We can also get predictions ($X\widehat{\boldsymbol{\beta}}$) using the predict method for objects of class ggmix_fit:
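A minimal example, assuming the predict method follows glmnet-style arguments (newx for a new covariate matrix, s for the value(s) of λ at which to predict):

# sketch: predict the response for new observations at lambda = 0.02
predict(fit, newx = admixed$xtest, s = 0.02)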
We use the Generalized Information Criterion (GIC) to select the optimal value for λ. The GIC takes the form
$$GIC_{\lambda} = -2 \ell(\widehat{\boldsymbol{\beta}}, \widehat{\sigma}^2, \widehat{\eta}) + a_n \cdot \widehat{df}_{\lambda}$$
where ℓ(⋅) is the log-likelihood evaluated at the converged values of the parameters, $\widehat{df}_{\lambda}$ is the number of non-zero elements in $\widehat{\boldsymbol{\beta}}_{\lambda}$ plus two (representing the variance parameters η and σ²), and $a_n$ is a non-negative penalty parameter. The BIC uses $a_n = \log(n)$ and the AIC uses $a_n = 2$. The user can specify the value of $a_n$ they want; the default is $a_n = \log(\log(n)) \cdot \log(p)$, which corresponds to a high-dimensional BIC (HDBIC):
# pass the fitted object from ggmix to the gic function:
hdbic <- gic(fit)
class(hdbic)
#> [1] "ggmix_gic" "lassofullrank" "ggmix_fit"
# we can also fit the BIC by specifying the an argument
bicfit <- gic(fit, an = log(length(admixed$ytrain)))
We can plot the HDBIC values against log(λ) using the plot method for objects of class ggmix_gic:
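plot(hdbic)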
The optimal value of λ according to the HDBIC (i.e. the λ that leads to the minimum HDBIC) is stored in the fitted object:
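# assuming the selected value is stored as lambda.min,
# following glmnet-style naming conventions
hdbic[["lambda.min"]]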
We can also plot the BIC:
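plot(bicfit)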
We can use the object returned by the gic function to extract the coefficients using the coef method for objects of class ggmix_gic:
coef(hdbic)
#> 51 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -0.03598164
#> X23 .
#> X36 .
#> X38 .
#> X40 .
#> X53 .
#> X74 .
#> X83 .
#> X114 .
#> X139 .
#> X143 .
#> X168 .
#> X176 .
#> X244 .
#> X246 .
#> X249 .
#> X266 .
#> X271 .
#> X273 .
#> X282 .
#> X286 .
#> X300 .
#> X302 -0.17617815
#> X312 .
#> X315 .
#> X330 .
#> X336 .
#> X344 .
#> X348 .
#> X352 .
#> X375 .
#> X383 .
#> X403 .
#> X404 .
#> X420 .
#> X422 .
#> X431 .
#> X435 .
#> X441 .
#> X447 .
#> X468 .
#> X485 .
#> X503 .
#> X512 .
#> X515 .
#> X516 .
#> X407 .
#> X507 .
#> X524 1.34917874
#> X538 -0.72073279
#> X243 .
We can also extract just the nonzero coefficients:
coef(hdbic, type = "nonzero")
#> 1
#> (Intercept) -0.03598164
#> X302 -0.17617815
#> X524 1.34917874
#> X538 -0.72073279
#> eta 0.99000000
#> sigma2 1.60477653
Finally, we can also make predictions from the hdbic object, which by default will use the model corresponding to the optimal tuning parameter:
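# by default, predict uses the lambda selected by the GIC;
# newx is assumed to follow glmnet-style conventions
head(predict(hdbic, newx = admixed$xtest))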
We can also plot some standard diagnostic plots, such as the observed vs. predicted response, QQ-plots of the residuals and random effects, and the Tukey-Anscombe plot. These can be plotted using the plot method on a ggmix_gic object, as shown in the sketch below.
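These calls are a sketch: the type values and the newx/newy arguments are assumptions about the plot method's interface, not guaranteed names.

# sketch of the diagnostic plots on the training data
plot(hdbic, type = "predicted", newy = admixed$ytrain, newx = admixed$xtrain)
plot(hdbic, type = "QQranef", newy = admixed$ytrain, newx = admixed$xtrain)
plot(hdbic, type = "QQresid", newy = admixed$ytrain, newx = admixed$xtrain)
plot(hdbic, type = "Tukey", newy = admixed$ytrain, newx = admixed$xtrain)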