Title: | Environment Based Clustering for Interpretable Predictive Models in High Dimensional Data |
---|---|
Description: | Companion package to the paper: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures. Bhatnagar, Yang, Khundrakpam, Evans, Blanchette, Bouchard, Greenwood (2017) <DOI:10.1101/102475>. This package includes an algorithm for clustering high dimensional data that can be affected by an environmental factor. |
Authors: | Sahir Rai Bhatnagar [aut, cre] (http://sahirbhatnagar.com/) |
Maintainer: | Sahir Rai Bhatnagar <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-01-27 04:43:58 UTC |
Source: | https://github.com/sahirbhatnagar/eclust |
Plots cluster summaries such as the 1st principal component or
average by exposure status. This is a plot method for object of class
eclust returned by the r_cluster_data
function. Two heatmaps,
side-by-side are returned, where the first heatmap corresponds to the
unexposed subjects and the second heatmap corresponds to the exposed
subjects.
## S3 method for class 'eclust' plot(x, type = c("ECLUST", "CLUST"), summary = c("pc", "avg"), sample = c("training", "test"), unexposed_title = "E=0", exposed_title = "E=1", ...)
## S3 method for class 'eclust' plot(x, type = c("ECLUST", "CLUST"), summary = c("pc", "avg"), sample = c("training", "test"), unexposed_title = "E=0", exposed_title = "E=1", ...)
x |
object of class |
type |
show results from the "ECLUST" (which considers the environment)
or "CLUST" (which ignores the environment) methods. Default is "ECLUST".
See |
summary |
show the 1st principal component or the average of each cluster. Default is "pc". |
sample |
which sample to show, the "training" or the "test" set. Default
is "training". This is determined by the |
unexposed_title |
The title for the unexposed subjects heatmap. Default is "E=0". |
exposed_title |
The title for the exposed subjects heatmap. Default is "E=1". |
... |
other arguments passed to the
|
Rows are the cluster summaries and columns are the subjects. This
function determines the minimum and maximum value for the whole dataset and
then creates a color scale using those values with the
colorRamp2
. This is so that both heatmaps are on
the same color scale, i.e., each color represents the same value in both
heatmaps. This is done for being able to visually compare the results.
a plot of two Heatmaps, side-by-side, of the cluster summaries by exposure status
## Not run: data("tcgaov") tcgaov[1:5,1:6, with = FALSE] Y <- log(tcgaov[["OS"]]) E <- tcgaov[["E"]] genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE]) trainIndex <- drop(caret::createDataPartition(Y, p = 1, list = FALSE, times = 1)) testIndex <- setdiff(seq_len(length(Y)),trainIndex) cluster_res <- r_cluster_data(data = genes, response = Y, exposure = E, train_index = trainIndex, test_index = testIndex, cluster_distance = "tom", eclust_distance = "difftom", measure_distance = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 1, minimum_cluster_size = 60) class(cluster_res) plot(cluster_res, show_column_names = FALSE) ## End(Not run)
## Not run: data("tcgaov") tcgaov[1:5,1:6, with = FALSE] Y <- log(tcgaov[["OS"]]) E <- tcgaov[["E"]] genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE]) trainIndex <- drop(caret::createDataPartition(Y, p = 1, list = FALSE, times = 1)) testIndex <- setdiff(seq_len(length(Y)),trainIndex) cluster_res <- r_cluster_data(data = genes, response = Y, exposure = E, train_index = trainIndex, test_index = testIndex, cluster_distance = "tom", eclust_distance = "difftom", measure_distance = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 1, minimum_cluster_size = 60) class(cluster_res) plot(cluster_res, show_column_names = FALSE) ## End(Not run)
Plots a heatmap of a similarity matrix such as a correlation
matrix or a TOM matrix. This function is a plotting method for an object of
class similarity. These objects are returned by the
s_generate_data
and s_generate_data_mars
functions
## S3 method for class 'similarity' plot(x, color = viridis::viridis(100), truemodule, active, ...)
## S3 method for class 'similarity' plot(x, color = viridis::viridis(100), truemodule, active, ...)
x |
an object of class similarity. This is a p x p symmetric matix such as a correlation matrix or a TOM matrix, where p is the number of genes |
color |
colors for the heatmap. By default it uses the |
truemodule |
a numeric vector of length p where p is the number of genes, giving the module membership. By default, 0 = Grey, 1 = Turquoise, 2 = Blue, 3 = Red, 4 = Green, and 5 = Yellow. This information is used for annotating the heatmap |
active |
a binary vector of length p (where p is the number of genes) where 0 means that gene is not related to the response, and 1 means that the gene is associated to the response. |
... |
other arguments passed to the pheatmap function |
a heatmap of a similarity matrix
this function is only meant to be used with output from the
s_generate_data
and s_generate_data_mars
functions, since it assumes a fixed number of modules.
## Not run: corrX <- cor(simdata[,c(-1,-2)]) class(corrX) <- append(class(corrX), "similarity") plot(corrX, truemodule = c(rep(1:5, each=150), rep(0, 250))) ## End(Not run)
## Not run: corrX <- cor(simdata[,c(-1,-2)]) class(corrX) <- append(class(corrX), "similarity") plot(corrX, truemodule = c(rep(1:5, each=150), rep(0, 250))) ## End(Not run)
This is one of the functions for real data analysis, which will cluster the data based on the environment, as well as ignoring the environment
r_cluster_data(data, response, exposure, train_index, test_index, cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), minimum_cluster_size = 50, ...)
r_cluster_data(data, response, exposure, train_index, test_index, cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), minimum_cluster_size = 50, ...)
data |
n x p matrix of data. rows are samples, columns are genes or cpg sites. Should not contain the environment variable |
response |
numeric vector of length n |
exposure |
binary (0,1) numeric vector of length n for the exposure status of the n samples |
train_index |
numeric vector indcating the indices of |
test_index |
numeric vector indcating the indices of |
cluster_distance |
character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following
|
eclust_distance |
character representing which matrix from the training
set that you want to use to cluster the genes based on the environment. See
|
measure_distance |
one of "euclidean","maximum","manhattan",
"canberra", "binary","minkowski" to be passed to |
minimum_cluster_size |
The minimum cluster size. Only applicable if
|
... |
arguments passed to the |
This function clusters the data. The results of this function should
then be passed to the r_prepare_data
function which output
the appropriate X and Y matrices in the right format for regression
packages such as mgcv
, caret
and glmnet
a list of length 8:
clustering results
based on the environment and not the environment. see
u_cluster_similarity
for
details
clustering results ignoring the environment. See
u_cluster_similarity
for details
vector of the exposure variable for the training set
the similarity matrix based on the
argument specified in
cluster_distance
the similarity
matrix based on the argument specified in
eclust_distance
a data.frame and data.table of the clustering membership for clustering results based on the environment and not the environment. As a result, each gene will show up twice in this table
a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table
a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
data("tcgaov") tcgaov[1:5,1:6, with = FALSE] Y <- log(tcgaov[["OS"]]) E <- tcgaov[["E"]] genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE]) trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1)) testIndex <- setdiff(seq_len(length(Y)),trainIndex) ## Not run: cluster_res <- r_cluster_data(data = genes, response = Y, exposure = E, train_index = trainIndex, test_index = testIndex, cluster_distance = "tom", eclust_distance = "difftom", measure_distance = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 1, minimum_cluster_size = 60) # the number of clusters determined by the similarity matrices specified # in the cluster_distance and eclust_distance arguments. This will always be larger # than cluster_res$clustersAll$nclusters which is based on the similarity matrix # specified in the cluster_distance argument cluster_res$clustersAddon$nclusters # the number of clusters determined by the similarity matrices specified # in the cluster_distance argument only cluster_res$clustersAll$nclusters ## End(Not run)
data("tcgaov") tcgaov[1:5,1:6, with = FALSE] Y <- log(tcgaov[["OS"]]) E <- tcgaov[["E"]] genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE]) trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1)) testIndex <- setdiff(seq_len(length(Y)),trainIndex) ## Not run: cluster_res <- r_cluster_data(data = genes, response = Y, exposure = E, train_index = trainIndex, test_index = testIndex, cluster_distance = "tom", eclust_distance = "difftom", measure_distance = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 1, minimum_cluster_size = 60) # the number of clusters determined by the similarity matrices specified # in the cluster_distance and eclust_distance arguments. This will always be larger # than cluster_res$clustersAll$nclusters which is based on the similarity matrix # specified in the cluster_distance argument cluster_res$clustersAddon$nclusters # the number of clusters determined by the similarity matrices specified # in the cluster_distance argument only cluster_res$clustersAll$nclusters ## End(Not run)
This function will output the appropriate X and Y matrices in
the right format for regression packages such as mgcv
, caret
and glmnet
r_prepare_data(data, response = "Y", exposure = "E", probe_names)
r_prepare_data(data, response = "Y", exposure = "E", probe_names)
data |
the data frame which contains the response, exposure, and genes or cpgs or covariates. the columns should be labelled. |
response |
the column name of the response in the |
exposure |
the column name of the exposure in the |
probe_names |
the column names of the genes, or cpg sites or covariates |
a list of length 5:
the X matrix
the response vector
the exposure vector
the names of the main effects including the exposure
the names of the interaction effects
data("tcgaov") tcgaov[1:5,1:6, with = FALSE] Y <- log(tcgaov[["OS"]]) E <- tcgaov[["E"]] genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE]) trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1)) testIndex <- setdiff(seq_len(length(Y)),trainIndex) ## Not run: cluster_res <- r_cluster_data(data = genes, response = Y, exposure = E, train_index = trainIndex, test_index = testIndex, cluster_distance = "tom", eclust_distance = "difftom", measure_distance = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 1, minimum_cluster_size = 50) pc_eclust_interaction <- r_prepare_data(data = cbind(cluster_res$clustersAddon$PC, survival = Y[trainIndex], subtype = E[trainIndex]), response = "survival", exposure = "subtype") names(pc_eclust_interaction) dim(pc_eclust_interaction$X) pc_eclust_interaction$main_effect_names pc_eclust_interaction$interaction_names ## End(Not run)
data("tcgaov") tcgaov[1:5,1:6, with = FALSE] Y <- log(tcgaov[["OS"]]) E <- tcgaov[["E"]] genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE]) trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1)) testIndex <- setdiff(seq_len(length(Y)),trainIndex) ## Not run: cluster_res <- r_cluster_data(data = genes, response = Y, exposure = E, train_index = trainIndex, test_index = testIndex, cluster_distance = "tom", eclust_distance = "difftom", measure_distance = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 1, minimum_cluster_size = 50) pc_eclust_interaction <- r_prepare_data(data = cbind(cluster_res$clustersAddon$PC, survival = Y[trainIndex], subtype = E[trainIndex]), response = "survival", exposure = "subtype") names(pc_eclust_interaction) dim(pc_eclust_interaction$X) pc_eclust_interaction$main_effect_names pc_eclust_interaction$interaction_names ## End(Not run)
create a function that takes as input, the number of genes, the true beta vector, the gene expression matrix created from the generate_blocks function and returns a list of data matrix, as well as correlation matrices, TOM matrices, cluster information, training and test data
s_generate_data(p, X, beta, binary_outcome = FALSE, cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0, include_interaction = F, signal_to_noise_ratio = 1, eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"), cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic", "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), n_clusters, agglomeration_method = c("complete", "average", "ward.D2", "single", "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10, B = 10)
s_generate_data(p, X, beta, binary_outcome = FALSE, cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0, include_interaction = F, signal_to_noise_ratio = 1, eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"), cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic", "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), n_clusters, agglomeration_method = c("complete", "average", "ward.D2", "single", "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10, B = 10)
p |
number of genes in design matrix |
X |
gene expression matrix of size n x p using the
|
beta |
true beta coefficient vector |
binary_outcome |
Logical. Should a binary outcome be generated. Default
is |
cluster_distance |
character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following
|
n |
total number of subjects |
n0 |
total number of subjects with E=0 |
include_interaction |
Should an interaction with the environment be generated as part of the response. Default is FALSE. |
signal_to_noise_ratio |
signal to noise ratio, default is 1 |
eclust_distance |
character representing which matrix from the training
set that you want to use to cluster the genes based on the environment. See
|
cluster_method |
Cluster the data using hierarchical clustering or
prototype clustering. Defaults |
cut_method |
what method to use to cut the dendrogram. |
distance_method |
one of "euclidean","maximum","manhattan", "canberra",
"binary","minkowski" to be passed to |
n_clusters |
Number of clusters specified by the user. Only applicable
when |
agglomeration_method |
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
nPC |
number of principal components to extract from each cluster. Default is 1. Only 1 or 2 is allowed. |
K.max |
the maximum number of clusters to consider, must be at least
two. Only used if |
B |
integer, number of Monte Carlo (“bootstrap”) samples. Only used if
|
To generate a binary outcome we first generate a continuous outcome
Y which is , defined
and used
this to generate a two-class outcome z with
and
.
list of (in the following order)
a 1 column matrix containing the true beta coefficient vector
an object of class similarity which is the similarity
matrix specified by the cluster_distance
argument
an object of class similarity which is the
similarity matrix specified by the eclust_distance
argument
data.table of simulated data from the s_response
function
The simulated response
the n0 x p design matrix for the unexposed subjects
the n1 x p design matrix for the exposed subjects
the training design matrix for all subjects
the test set design matrix for all subjects
the training set response
the test set response
the training response and training design matrix in a single data.frame object
the test response and training design matrix in a single data.frame object
a character vector of the active genes i.e. the ones that are associated with the response
the number of clusters identified by using
the similarity matrix specified by the cluster_distance
argument
the number of clusters identified by using the
similarity matrix specified by the eclust_distance
argument
the sum of n_clusters_All
and
n_clusters_Eclust
the cluster membership of each
gene based on the cluster_distance
matrix
the
cluster membership of each gene based on both the cluster_distance
matrix and the eclust_distance
matrix. Note that each gene will
appear twice here
the cluster membership of each gene
based on the eclust_distance
matrix
cluster membership of each gene with a penalty factor used for the group lasso
cluster membership of each gene with a penalty factor used for the group lasso
the TOM matrix based on all training subjects
the absolute difference of the exposed and unexposed
TOM matrices:
the TOM matrix based on training exposed subjects only
the TOM matrix based on training unexposed subjects only
the Pearson correlation matrix based on all training subjects
the absolute difference of the exposed and unexposed
Pearson correlation matrices:
the Pearson correlation matrix based on training exposed subjects only
the Pearson correlation matrix based on training unexposed subjects only
The fisher
scoring matrix. see u_fisherZ
for details
The
correlation scoring matrix:
The MSE for the null model
The 10 training folds used for the stability measures
The 10 X training folds (the same as in DT_train_folds)
The 10 Y training folds (the same as in DT_train_folds)
this function calls the s_response
to generate phenotype as a
function of the gene expression data. This function also returns other
information derived from the simulated data including the test and training
sets, the correlation and TOM matrices and the clusters.
the PCs and averages need to be calculated in the fitting functions, because these will change based on the CV fold
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) betaMainInteractions <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1) beta <- c(betaMainEffect, betaE, betaMainInteractions) ## Not run: result <- s_generate_data(p = p, X = X, beta = beta, include_interaction = TRUE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) names(result) ## End(Not run)
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) betaMainInteractions <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1) beta <- c(betaMainEffect, betaE, betaMainInteractions) ## Not run: result <- s_generate_data(p = p, X = X, beta = beta, include_interaction = TRUE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) names(result) ## End(Not run)
create a function that takes as input, the number of genes, the true beta vector, the gene expression matrix created from the generate_blocks function and returns a list of data matrix, as well as correlation matrices, TOM matrices, cluster information, training and test data
s_generate_data_mars(p, X, beta, binary_outcome = FALSE, truemodule, nActive, cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0, include_interaction = F, signal_to_noise_ratio = 1, eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"), cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic", "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), n_clusters, agglomeration_method = c("complete", "average", "ward.D2", "single", "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10, B = 10)
s_generate_data_mars(p, X, beta, binary_outcome = FALSE, truemodule, nActive, cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0, include_interaction = F, signal_to_noise_ratio = 1, eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"), cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic", "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), n_clusters, agglomeration_method = c("complete", "average", "ward.D2", "single", "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10, B = 10)
p |
number of genes in design matrix |
X |
gene expression matrix of size n x p using the
|
beta |
true beta coefficient vector |
binary_outcome |
Logical. Should a binary outcome be generated. Default
is |
truemodule |
numeric vector of the true module membership used in the
|
nActive |
number of active genes in the response used in the
|
cluster_distance |
character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following
|
n |
total number of subjects |
n0 |
total number of subjects with E=0 |
include_interaction |
Should an interaction with the environment be generated as part of the response. Default is FALSE. |
signal_to_noise_ratio |
signal to noise ratio, default is 1 |
eclust_distance |
character representing which matrix from the training
set that you want to use to cluster the genes based on the environment. See
|
cluster_method |
Cluster the data using hierarchical clustering or
prototype clustering. Defaults |
cut_method |
what method to use to cut the dendrogram. |
distance_method |
one of "euclidean","maximum","manhattan", "canberra",
"binary","minkowski" to be passed to |
n_clusters |
Number of clusters specified by the user. Only applicable
when |
agglomeration_method |
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
nPC |
number of principal components. Can be 1 or 2. |
K.max |
the maximum number of clusters to consider, must be at least
two. Only used if |
B |
integer, number of Monte Carlo (“bootstrap”) samples. Only used if
|
list of (in the following order)
a 1 column matrix containing the true beta coefficient vector
an object of class similarity which is the similarity
matrix specified by the cluster_distance
argument
an object of class similarity which is the
similarity matrix specified by the eclust_distance
argument
data.table of simulated data from the s_response
function
The simulated response
the n0 x p design matrix for the unexposed subjects
the n1 x p design matrix for the exposed subjects
the training design matrix for all subjects
the test set design matrix for all subjects
the training set response
the test set response
the training response and training design matrix in a single data.frame object
the test response and training design matrix in a single data.frame object
a character vector of the active genes i.e. the ones that are associated with the response
the number of clusters identified by using
the similarity matrix specified by the cluster_distance
argument
the number of clusters identified by using the
similarity matrix specified by the eclust_distance
argument
the sum of n_clusters_All
and
n_clusters_Eclust
the cluster membership of each
gene based on the cluster_distance
matrix
the
cluster membership of each gene based on both the cluster_distance
matrix and the eclust_distance
matrix. Note that each gene will
appear twice here
the cluster membership of each gene
based on the eclust_distance
matrix
cluster membership of each gene with a penalty factor used for the group lasso
cluster membership of each gene with a penalty factor used for the group lasso
the TOM matrix based on all training subjects
the absolute difference of the exposed and unexposed
TOM matrices:
the TOM matrix based on training exposed subjects only
the TOM matrix based on training unexposed subjects only
the Pearson correlation matrix based on all training subjects
the absolute difference of the exposed and unexposed
Pearson correlation matrices:
the Pearson correlation matrix based on training exposed subjects only
the Pearson correlation matrix based on training unexposed subjects only
The fisher
scoring matrix. see u_fisherZ
for details
The
correlation scoring matrix:
The MSE for the null model
The 10 training folds used for the stability measures
The 10 X training folds (the same as in DT_train_folds)
The 10 Y training folds (the same as in DT_train_folds)
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_generate_data_mars(p = p, X = X, beta = beta, binary_outcome = FALSE, truemodule = truemodule1, nActive = nActive, include_interaction = FALSE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) names(result)
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_generate_data_mars(p = p, X = X, beta = beta, binary_outcome = FALSE, truemodule = truemodule1, nActive = nActive, include_interaction = FALSE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) names(result)
This function creates summaries of the given clusters (e.g. 1st PC or average), and then runs Friedman's MARS on those summaries. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016.
s_mars_clust(x_train, x_test, y_train, y_test, s0, summary = c("pc", "avg"), model = c("MARS"), exp_family = c("gaussian", "binomial"), gene_groups, true_beta = NULL, topgenes = NULL, stability = F, filter = F, include_E = T, include_interaction = F, clust_type = c("CLUST", "ECLUST"), nPC = 1)
s_mars_clust(x_train, x_test, y_train, y_test, s0, summary = c("pc", "avg"), model = c("MARS"), exp_family = c("gaussian", "binomial"), gene_groups, true_beta = NULL, topgenes = NULL, stability = F, filter = F, include_E = T, include_interaction = F, clust_type = c("CLUST", "ECLUST"), nPC = 1)
x_train |
|
x_test |
|
y_train |
numeric vector of length |
y_test |
numeric vector of length |
s0 |
chracter vector of the active feature names, i.e., the features in
|
summary |
the summary of each cluster. Can be the principal component or
average. Default is |
model |
Type of non-linear model to be fit. Currently only Friedman's MARS is supported. |
exp_family |
Response type. See details for |
gene_groups |
data.frame that contains the group membership for each
feature. The first column is called 'gene' and the second column should be
called 'cluster'. The 'gene' column identifies the features and must be the
same identifiers in the |
true_beta |
numeric vector of true beta coefficients |
topgenes |
List of features to keep if |
stability |
Should stability measures be calculated. Default is
|
filter |
Should analysis be run on a subset of features. Default is
|
include_E |
Should the environment variable be included in the
regression analysis. Default is |
include_interaction |
Should interaction effects between the features in
|
clust_type |
Method used to cluster the features. This is used for
naming the output only and has no consequence for the results.
|
nPC |
Number of principal components if |
This function first does 10 fold cross-validation to tune the degree
(either 1 or 2) using the train
function with
method="earth"
and nprune is fixed at 1000. Then the
earth
function is used, with nk = 1000
and
pmethod = "backward"
to fit the MARS model using the optimal degree
from the 10-fold CV.
This function has two different outputs depending on whether
stability = TRUE
or stability = FALSE
If stability = TRUE
then this function returns a p x 2
data.frame or data.table of regression coefficients without the intercept.
The output of this is used for subsequent calculations of stability.
If stability = FALSE
then returns a vector with the following
elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+)
for definitions of each measure of performance):
mse or AUC |
Test set
mean squared error if |
RMSE |
Square root of the mse. Only
applicable if |
Shat |
Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model |
TPR |
true positive rate |
FPR |
false positive rate |
Correct Sparsity |
Correct true positives + correct true negative coefficients divided by the total number of features |
CorrectZeroMain |
Proportion of correct true negative main effects |
CorrectZeroInter |
Proportion of correct true negative interactions |
IncorrectZeroMain |
Proportion of incorrect true negative main effects |
IncorrectZeroInter |
Proportion of incorrect true negative interaction effects |
## Not run: library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_generate_data_mars(p = p, X = X, beta = beta, binary_outcome = FALSE, truemodule = truemodule1, nActive = nActive, include_interaction = FALSE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) mars_res <- s_mars_clust(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], summary = "pc", exp_family = "gaussian", gene_groups = result[["clustersAddon"]], clust_type = "ECLUST") unlist(mars_res) ## End(Not run)
## Not run: library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_generate_data_mars(p = p, X = X, beta = beta, binary_outcome = FALSE, truemodule = truemodule1, nActive = nActive, include_interaction = FALSE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) mars_res <- s_mars_clust(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], summary = "pc", exp_family = "gaussian", gene_groups = result[["clustersAddon"]], clust_type = "ECLUST") unlist(mars_res) ## End(Not run)
This function can run Friedman's MARS models on the untransformed design matrix. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Uses caret functions to tune the degree and the nprune parameters
s_mars_separate(x_train, x_test, y_train, y_test, s0, model = c("MARS"), exp_family = c("gaussian", "binomial"), topgenes = NULL, stability = F, filter = F, include_E = T, include_interaction = F, ...)
s_mars_separate(x_train, x_test, y_train, y_test, s0, model = c("MARS"), exp_family = c("gaussian", "binomial"), topgenes = NULL, stability = F, filter = F, include_E = T, include_interaction = F, ...)
x_train |
|
x_test |
|
y_train |
numeric vector of length |
y_test |
numeric vector of length |
s0 |
chracter vector of the active feature names, i.e., the features in
|
model |
Type of non-linear model to be fit. Currently only Friedman's MARS is supported. |
exp_family |
Response type. See details for |
topgenes |
List of features to keep if |
stability |
Should stability measures be calculated. Default is
|
filter |
Should analysis be run on a subset of features. Default is
|
include_E |
Should the environment variable be included in the
regression analysis. Default is |
include_interaction |
Should interaction effects between the features in
|
... |
other parameters passed to |
This function first does 10 fold cross-validation to tune the degree
(either 1 or 2) using the train
function with
method="earth"
and nprune is fixed at 1000. Then the
earth
function is used, with nk = 1000
and
pmethod = "backward"
to fit the MARS model using the optimal degree
from the 10-fold CV.
This function has two different outputs depending on whether
stability = TRUE
or stability = FALSE
If stability = TRUE
then this function returns a p x 2
data.frame or data.table of regression coefficients without the intercept.
The output of this is used for subsequent calculations of stability.
If stability = FALSE
then returns a vector with the following
elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+)
for definitions of each measure of performance):
mse or AUC |
Test set
mean squared error if |
RMSE |
Square root of the mse. Only
applicable if |
Shat |
Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model |
TPR |
true positive rate |
FPR |
false positive rate |
Correct Sparsity |
Correct true positives + correct true negative coefficients divided by the total number of features |
CorrectZeroMain |
Proportion of correct true negative main effects |
CorrectZeroInter |
Proportion of correct true negative interactions |
IncorrectZeroMain |
Proportion of incorrect true negative main effects |
IncorrectZeroInter |
Proportion of incorrect true negative interaction effects |
## Not run: library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_generate_data_mars(p = p, X = X, beta = beta, binary_outcome = FALSE, truemodule = truemodule1, nActive = nActive, include_interaction = FALSE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) mars_res <- s_mars_separate(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], exp_family = "gaussian") unlist(mars_res) ## End(Not run)
## Not run: library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_generate_data_mars(p = p, X = X, beta = beta, binary_outcome = FALSE, truemodule = truemodule1, nActive = nActive, include_interaction = FALSE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) mars_res <- s_mars_separate(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], exp_family = "gaussian") unlist(mars_res) ## End(Not run)
This is a wrapper of the simulateDatExpr
function which simulates data in a modular structure (i.e. in blocks). This
function simulates data in 5 blocks referred to as Turquoise, Blue, Red,
Green and Yellow, separately for exposed (E=1) and unexposed (E=0)
observations.
s_modules(n, p, rho, exposed, ...)
s_modules(n, p, rho, exposed, ...)
n |
number of observations |
p |
total number of predictors to simulate |
rho |
numeric value representing the expected correlation between green module and red module |
exposed |
binary numeric vector of length |
... |
arguments passed to the |
n x p
matrix of simulated data
library(magrittr) p <- 1000 n <- 200 d0 <- s_modules(n = 100, p = 1000, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = 100, p = 1000, rho = 0.90, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) dim(X)
library(magrittr) p <- 1000 n <- 200 d0 <- s_modules(n = 100, p = 1000, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = 100, p = 1000, rho = 0.90, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) dim(X)
This function creates summaries of the given clusters (e.g. 1st PC or average), and then fits a penalized regression model on those summaries. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Can run lasso, elasticnet, SCAD or MCP models
s_pen_clust(x_train, x_test, y_train, y_test, s0, gene_groups, summary = c("pc", "avg"), model = c("lasso", "elasticnet", "scad", "mcp"), exp_family = c("gaussian", "binomial"), filter = F, topgenes = NULL, stability = F, include_E = T, include_interaction = F, clust_type = c("CLUST", "ECLUST"), number_pc = 1)
s_pen_clust(x_train, x_test, y_train, y_test, s0, gene_groups, summary = c("pc", "avg"), model = c("lasso", "elasticnet", "scad", "mcp"), exp_family = c("gaussian", "binomial"), filter = F, topgenes = NULL, stability = F, include_E = T, include_interaction = F, clust_type = c("CLUST", "ECLUST"), number_pc = 1)
x_train |
|
x_test |
|
y_train |
numeric vector of length |
y_test |
numeric vector of length |
s0 |
chracter vector of the active feature names, i.e., the features in
|
gene_groups |
data.frame that contains the group membership for each
feature. The first column is called 'gene' and the second column should be
called 'cluster'. The 'gene' column identifies the features and must be the
same identifiers in the |
summary |
the summary of each cluster. Can be the principal component or
average. Default is |
model |
Regression model to be fit on cluster summaries. Default is
|
exp_family |
Response type. See details for |
filter |
Should analysis be run on a subset of features. Default is
|
topgenes |
List of features to keep if |
stability |
Should stability measures be calculated. Default is
|
include_E |
Should the environment variable be included in the
regression analysis. Default is |
include_interaction |
Should interaction effects between the features in
|
clust_type |
Method used to cluster the features. This is used for
naming the output only and has no consequence for the results.
|
number_pc |
Number of principal components if |
The stability of feature importance is defined as the variability of feature weights under perturbations of the training set, i.e., small modifications in the training set should not lead to considerable changes in the set of important covariates (Toloşi, L., & Lengauer, T. (2011)). A feature selection algorithm produces a weight, a ranking, and a subset of features. In the CLUST and ECLUST methods, we defined a predictor to be non-zero if its corresponding cluster representative weight was non-zero. Using 10-fold cross validation (CV), we evaluated the similarity between two features and their rankings using Pearson and Spearman correlation, respectively. For each CV fold we re-ran the models and took the average Pearson/Spearman correlation of the 10 choose 2 combinations of estimated coefficients vectors. To measure the similarity between two subsets of features we took the average of the Jaccard distance in each fold. A Jaccard distance of 1 indicates perfect agreement between two sets while no agreement will result in a distance of 0.
This function has two different outputs depending on whether
stability = TRUE
or stability = FALSE
If stability = TRUE
then this function returns a p x 2
data.frame or data.table of regression coefficients without the intercept.
The output of this is used for subsequent calculations of stability.
If stability = FALSE
then returns a vector with the following
elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+)
for definitions of each measure of performance):
mse or AUC |
Test set
mean squared error if |
RMSE |
Square root of the mse. Only
applicable if |
Shat |
Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model |
TPR |
true positive rate |
FPR |
false positive rate |
Correct Sparsity |
Correct true positives + correct true negative coefficients divided by the total number of features |
CorrectZeroMain |
Proportion of correct true negative main effects |
CorrectZeroInter |
Proportion of correct true negative interactions |
IncorrectZeroMain |
Proportion of incorrect true negative main effects |
IncorrectZeroInter |
Proportion of incorrect true negative interaction effects |
nclusters |
number of estimated clusters by the
|
number_pc=2
will not work if there is only one feature in an
estimated cluster
Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994.
Bhatnagar, SR., Yang, Y., Blanchette, M., Bouchard, L., Khundrakpam, B., Evans, A., Greenwood, CMT. (2016+). An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Preprint
Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719-720.
Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, http://www.stanford.edu/~hastie/Papers/glmnet.pdf
Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) betaMainInteractions <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1) beta <- c(betaMainEffect, betaE, betaMainInteractions) result <- s_generate_data(p = p, X = X, beta = beta, include_interaction = TRUE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) pen_res <- s_pen_clust(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], gene_groups = result[["clustersAddon"]], summary = "pc", model = "lasso", exp_family = "gaussian", clust_type = "ECLUST", include_interaction = TRUE) unlist(pen_res)
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) betaMainInteractions <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1) beta <- c(betaMainEffect, betaE, betaMainInteractions) result <- s_generate_data(p = p, X = X, beta = beta, include_interaction = TRUE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) pen_res <- s_pen_clust(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], gene_groups = result[["clustersAddon"]], summary = "pc", model = "lasso", exp_family = "gaussian", clust_type = "ECLUST", include_interaction = TRUE) unlist(pen_res)
This function can run penalized regression models on the untransformed design matrix. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Can run lasso, elasticnet, SCAD or MCP models
s_pen_separate(x_train, x_test, y_train, y_test, s0, exp_family = c("gaussian", "binomial"), model = c("lasso", "elasticnet", "scad", "mcp"), topgenes = NULL, stability = F, filter = F, include_E = T, include_interaction = F)
s_pen_separate(x_train, x_test, y_train, y_test, s0, exp_family = c("gaussian", "binomial"), model = c("lasso", "elasticnet", "scad", "mcp"), topgenes = NULL, stability = F, filter = F, include_E = T, include_interaction = F)
x_train |
|
x_test |
|
y_train |
numeric vector of length |
y_test |
numeric vector of length |
s0 |
chracter vector of the active feature names, i.e., the features in
|
exp_family |
Response type. See details for |
model |
Regression model to be fit on cluster summaries. Default is
|
topgenes |
List of features to keep if |
stability |
Should stability measures be calculated. Default is
|
filter |
Should analysis be run on a subset of features. Default is
|
include_E |
Should the environment variable be included in the
regression analysis. Default is |
include_interaction |
Should interaction effects between the features in
|
The stability of feature importance is defined as the variability of feature weights under perturbations of the training set, i.e., small modifications in the training set should not lead to considerable changes in the set of important covariates (Toloşi, L., & Lengauer, T. (2011)). A feature selection algorithm produces a weight, a ranking, and a subset of features. In the CLUST and ECLUST methods, we defined a predictor to be non-zero if its corresponding cluster representative weight was non-zero. Using 10-fold cross validation (CV), we evaluated the similarity between two features and their rankings using Pearson and Spearman correlation, respectively. For each CV fold we re-ran the models and took the average Pearson/Spearman correlation of the 10 choose 2 combinations of estimated coefficients vectors. To measure the similarity between two subsets of features we took the average of the Jaccard distance in each fold. A Jaccard distance of 1 indicates perfect agreement between two sets while no agreement will result in a distance of 0.
This function has two different outputs depending on whether
stability = TRUE
or stability = FALSE
If stability = TRUE
then this function returns a p x 2
data.frame or data.table of regression coefficients without the intercept.
The output of this is used for subsequent calculations of stability.
If stability = FALSE
then returns a vector with the following
elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+)
for definitions of each measure of performance):
mse or AUC |
Test set
mean squared error if |
RMSE |
Square root of the mse. Only
applicable if |
Shat |
Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model |
TPR |
true positive rate |
FPR |
false positive rate |
Correct Sparsity |
Correct true positives + correct true negative coefficients divided by the total number of features |
CorrectZeroMain |
Proportion of correct true negative main effects |
CorrectZeroInter |
Proportion of correct true negative interactions |
IncorrectZeroMain |
Proportion of incorrect true negative main effects |
IncorrectZeroInter |
Proportion of incorrect true negative interaction effects |
Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994.
Bhatnagar, SR., Yang, Y., Blanchette, M., Bouchard, L., Khundrakpam, B., Evans, A., Greenwood, CMT. (2016+). An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Preprint
Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719-720.
Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, http://www.stanford.edu/~hastie/Papers/glmnet.pdf
Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.
## Not run: library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) betaMainInteractions <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1) beta <- c(betaMainEffect, betaE, betaMainInteractions) result <- s_generate_data(p = p, X = X, beta = beta, include_interaction = TRUE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) pen_res <- s_pen_separate(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], model = "lasso", exp_family = "gaussian", include_interaction = TRUE) unlist(pen_res) ## End(Not run)
## Not run: library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) betaMainInteractions <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1) beta <- c(betaMainEffect, betaE, betaMainInteractions) result <- s_generate_data(p = p, X = X, beta = beta, include_interaction = TRUE, cluster_distance = cluster_distance, n = n, n0 = n0, eclust_distance = Ecluster_distance, signal_to_noise_ratio = SNR, distance_method = distanceMethod, cluster_method = clustMethod, cut_method = cutMethod, agglomeration_method = agglomerationMethod, nPC = 1) pen_res <- s_pen_separate(x_train = result[["X_train"]], x_test = result[["X_test"]], y_train = result[["Y_train"]], y_test = result[["Y_test"]], s0 = result[["S0"]], model = "lasso", exp_family = "gaussian", include_interaction = TRUE) unlist(pen_res) ## End(Not run)
Given the true beta vector, covariates and environment variable this function generates the linear response with specified signal to noise ratio.
s_response(n, n0, p, genes, binary_outcome = FALSE, E, signal_to_noise_ratio = 1, include_interaction = FALSE, beta = NULL)
s_response(n, n0, p, genes, binary_outcome = FALSE, E, signal_to_noise_ratio = 1, include_interaction = FALSE, beta = NULL)
n |
Total number of subjects |
n0 |
Total number of unexposed subjects |
p |
Total number of genes (or covariates) |
genes |
nxp matrix of the genes or covariates |
binary_outcome |
Logical. Should a binary outcome be generated. Default
is |
E |
binary 0,1, vector of the exposure/environment variable |
signal_to_noise_ratio |
a numeric variable for the signal to noise ratio |
include_interaction |
Logical. Should the response include the
interaction between E and the genes (for the non-zero |
beta |
true beta coefficient vector. Assumes this vector is in the same
order as the |
a data.frame/data.table containing the response and the design
matrix. Also an object of class expression
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_response(n = n, n0 = n0, p = p, genes = X, binary_outcome = FALSE, E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1, include_interaction = FALSE, beta = beta) result[1:5,1:5]
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_response(n = n, n0 = n0, p = p, genes = X, binary_outcome = FALSE, E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1, include_interaction = FALSE, beta = beta) result[1:5,1:5]
Given the covariates and environment variable this function generates the nonlinear response with specified signal to noise ratio.
s_response_mars(n, n0, p, genes, beta, binary_outcome = FALSE, E, signal_to_noise_ratio = 1, truemodule, nActive)
s_response_mars(n, n0, p, genes, beta, binary_outcome = FALSE, E, signal_to_noise_ratio = 1, truemodule, nActive)
n |
total number of subjects |
n0 |
total number of subjects with E=0 |
p |
number of genes in design matrix |
genes |
nxp matrix of the genes or covariates |
beta |
true beta coefficient vector |
binary_outcome |
Logical. Should a binary outcome be generated. Default
is |
E |
binary 0,1, vector of the exposure/environment variable |
signal_to_noise_ratio |
signal to noise ratio, default is 1 |
truemodule |
numeric vector of the true module membership used in the
|
nActive |
number of active genes in the response used in the
|
a data.frame/data.table containing the response and the design
matrix. Also an object of class expression
See Bhatnagar et al (2017+) for details on how the response is simulated.
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_response_mars(n = n, n0 = n0, p = p, genes = X, binary_outcome = TRUE, E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1, truemodule = truemodule1, nActive = nActive, beta = beta) result[1:5,1:5]
library(magrittr) # simulation parameters rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average" #in this simulation its blocks 3 and 4 that are important #leaveOut: optional specification of modules that should be left out #of the simulation, that is their genes will be simulated as unrelated #("grey"). This can be useful when simulating several sets, in some which a module #is present while in others it is absent. d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.01, maxCor = 1, corPower = 1, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE, leaveOut = 1:4) d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE, modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25), minCor = 0.4, maxCor = 1, corPower = 0.3, propNegativeCor = 0.3, backgroundNoise = 0.5, signed = FALSE) truemodule1 <- d1$setLabels X <- rbind(d0$datExpr, d1$datExpr) %>% magrittr::set_colnames(paste0("Gene", 1:p)) %>% magrittr::set_rownames(paste0("Subject",1:n)) betaMainEffect <- vector("double", length = p) # the first nActive/2 in the 3rd block are active betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif( nActive/2, betaMean - 0.1, betaMean + 0.1) # the first nActive/2 in the 4th block are active betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif( nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1) beta <- c(betaMainEffect, betaE) result <- s_response_mars(n = n, n0 = n0, p = p, genes = X, binary_outcome = TRUE, E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1, truemodule = truemodule1, nActive = nActive, beta = beta) result[1:5,1:5]
A dataset containing simulated data for example use of the eclust
package functions. This data was generated using the s_modules
and s_generate_data
simdata
simdata
A matrix with 100 rows and 502 variables:
continuous response vector
binary environment variable for ECLUST method. E = 0 for unexposed (n=50) and E = 1 for exposed (n=50)
gene expression data for 1000 genes. column names are the gene names
Code used to generate this data can be found on the GitHub page for this package. See URL below.
https://raw.githubusercontent.com/sahirbhatnagar/eclust/master/data-raw/simulated-data-processing.R
Bhatnagar, SR., Yang, Y., Blanchette, M., Bouchard, L., Khundrakpam, B., Evans, A., Greenwood, CMT. (2016+). An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Preprint
simdata[1:5, 1:10] table(simdata[,"E"])
simdata[1:5, 1:10] table(simdata[,"E"])
A dataset containing a subset of the TCGA mRNA Ovarian serous cystadenocarcinoma data generated using Affymetrix HTHGU133a arrays. Differences in gene expression profiles have led to the identification of robust molecular subtypes of ovarian cancer; these are of biological and clinical importance because they have been shown to correlate with overall survival (Tothill et al., 2008). Improving prediction of survival time based on gene expression signatures can lead to targeted therapeutic interventions (Helland et al., 2011). The proposed ECLUST algorithm was applied to gene expression data from 511 ovarian cancer patients profiled by the Affymetrix Human Genome U133A 2.0 Array. The data were obtained from the TCGA Research Network: http://cancergenome.nih.gov/ and downloaded via the TCGA2STAT R library (Wanet al., 2015). Using the 881 signature genes from Helland et al. (2011) we grouped subjects into two groups based on the results in this paper, to create a “positive control” environmental variable expected to have a strong effect. Specifically, we defined an environment variable in our framework as: E = 0 for subtypes C1 and C2 (n = 253), and E = 1 for subtypes C4 and C5 (n = 258).
tcgaov
tcgaov
A data.table and data.frame with 511 rows and 886 variables:
unique patient identifier (character
)
cancer subtype (1,2,3 or 4) as per Helland et al. 2011
(integer
)
binary environment variable for ECLUST method. E
= 0 for subtypes 1 and 2 (n = 253), and E = 1 for subtypes 4 and 5 (n =
258) (numeric
)
vital status, 0 = alive, 1 = dead
(numeric
)
overall survival time (numeric
)
gene expression data for 881 genes. column names are
the gene names (numeric
)
http://www.liuzlab.org/TCGA2STAT/#import-gene-expression
http://gdac.broadinstitute.org/
http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0018064.s015
Richard W Tothill, Anna V Tinker, Joshy George, Robert Brown, Stephen B Fox, Stephen Lade, Daryl S Johnson, Melanie K Trivett, Dariush Etemadmoghadam, Bianca Locandro, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research, 14(16):5198–5208, 2008.
Aslaug Helland, Michael S Anglesio, Joshy George, Prue A Cowin, Cameron N Johnstone, Colin M House, Karen E Sheppard, Dariush Etemadmoghadam, Nataliya Melnyk, Anil K Rustgi, et al. Deregulation of mycn, lin28b and let7 in a molecular subtype of aggressive high-grade serous ovarian cancers. PloS one, 6(4):e18064, 2011.
# using data.table syntax from the data.table package tcgaov[1:5, 1:10, with = FALSE] tcgaov[,table(subtype, E, useNA = "always")]
# using data.table syntax from the data.table package tcgaov[1:5, 1:10, with = FALSE] tcgaov[,table(subtype, E, useNA = "always")]
Return cluster membership of each predictor. This function is
called internally by the s_generate_data
and
s_generate_data_mars
functions. Is also used by the
r_clust
function for real data analysis.
u_cluster_similarity(x, expr, exprTest, distanceMethod, clustMethod = c("hclust", "protoclust"), cutMethod = c("dynamic", "gap", "fixed"), nClusters, method = c("complete", "average", "ward.D2", "single", "ward.D", "mcquitty", "median", "centroid"), K.max = 10, B = 50, nPC, minimum_cluster_size = 50)
u_cluster_similarity(x, expr, exprTest, distanceMethod, clustMethod = c("hclust", "protoclust"), cutMethod = c("dynamic", "gap", "fixed"), nClusters, method = c("complete", "average", "ward.D2", "single", "ward.D", "mcquitty", "median", "centroid"), K.max = 10, B = 50, nPC, minimum_cluster_size = 50)
x |
similarity matrix. must have non-NULL dimnames i.e., the rows and columns should be labelled, e.g. "Gene1, Gene2, ..." |
expr |
gene expression data (training set). rows are people, columns are genes |
exprTest |
gene expression test set. If using real data, and you dont
have enough samples for a test set then just supply the same data supplied
to the |
distanceMethod |
one of "euclidean","maximum","manhattan", "canberra",
"binary","minkowski" to be passed to |
clustMethod |
Cluster the data using hierarchical clustering or
prototype clustering. Defaults |
cutMethod |
what method to use to cut the dendrogram. |
nClusters |
number of clusters. Only used if |
method |
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
K.max |
the maximum number of clusters to consider, must be at least
two. Only used if |
B |
integer, number of Monte Carlo (“bootstrap”) samples. Only used if
|
nPC |
number of principal components. Can be 1 or 2. |
minimum_cluster_size |
The minimum cluster size. Only applicable if
|
a list of length 2:
a p x 3 data.frame or data.table which give the cluster membership of each gene, where p is the number of genes. The first column is the gene name, the second column is the cluster number (numeric) and the third column is the cluster membership as a character vector of color names (these will match up exactly with the cluster number)
a list of length 9:
a list of the eigengenes i.e. the 1st (and 2nd if nPC=2) principal component of each module
a data.frame of the average expression for each module for the training set
a data.frame of the average expression for each module for the test set
percentage of variance explained by each 1st (and 2nd if nPC=2) principal component of each module
cluster membership of each gene
a data.frame of the 1st (and 2nd if nPC=2) PC for each module for the training set
a data.frame of the 1st (and 2nd if nPC=2) PC for each module for the test set
the prcomp
object
a numeric value for the total number of clusters
data("simdata") X = simdata[,c(-1,-2)] train_index <- sample(1:nrow(simdata),100) cluster_results <- u_cluster_similarity(x = cor(X), expr = X[train_index,], exprTest = X[-train_index,], distanceMethod = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 2, minimum_cluster_size = 75) cluster_results$clusters[, table(module)] names(cluster_results$pcInfo) cluster_results$pcInfo$nclusters
data("simdata") X = simdata[,c(-1,-2)] train_index <- sample(1:nrow(simdata),100) cluster_results <- u_cluster_similarity(x = cor(X), expr = X[train_index,], exprTest = X[-train_index,], distanceMethod = "euclidean", clustMethod = "hclust", cutMethod = "dynamic", method = "average", nPC = 2, minimum_cluster_size = 75) cluster_results$clusters[, table(module)] names(cluster_results$pcInfo) cluster_results$pcInfo$nclusters
function to extract the selected terms from an earth object
u_extract_selected_earth(obj)
u_extract_selected_earth(obj)
obj |
object of class |
called internally by the s_mars_separate
and
s_mars_clust
functions
character vector of selected terms from the MARS model
This is a modified version of
moduleEigengenes
. It can extract (1st and 2nd
principal component) of modules in a given single dataset. It can also
return the average, the variance explained This function is more flexible
and the nPC argument is used. currently only nPC = 1 and nPC = 2 are
supported
u_extract_summary(x_train, colors, x_test, y_train, y_test, impute = TRUE, nPC, excludeGrey = FALSE, grey = if (is.numeric(colors)) 0 else "grey", subHubs = TRUE, trapErrors = FALSE, returnValidOnly = trapErrors, softPower = 6, scale = TRUE, verbose = 0, indent = 0)
u_extract_summary(x_train, colors, x_test, y_train, y_test, impute = TRUE, nPC, excludeGrey = FALSE, grey = if (is.numeric(colors)) 0 else "grey", subHubs = TRUE, trapErrors = FALSE, returnValidOnly = trapErrors, softPower = 6, scale = TRUE, verbose = 0, indent = 0)
x_train |
Training data for a single set in the form of a data frame where rows are samples and columns are genes (probes, cpgs, covariates). |
colors |
A vector of the same length as the number of probes in expr, giving module color for all probes (genes). Color "grey" is reserved for unassigned genes. |
x_test |
Test set in the form of a data frame where rows are samples and columns are genes (probes, cpgs, covariates). |
y_train |
Training response numeric vector |
y_test |
Test response numeric vector |
impute |
If TRUE, expression data will be checked for the presence of NA entries and if the latter are present, numerical data will be imputed, using function impute.knn and probes from the same module as the missing datum. The function impute.knn uses a fixed random seed giving repeatable results. |
nPC |
Number of principal components and variance explained entries to be calculated. Note that only 1 or 2 is possible. |
excludeGrey |
Should the improper module consisting of 'grey' genes be excluded from the eigengenes? |
grey |
Value of colors designating the improper module. Note that if colors is a factor of numbers, the default value will be incorrect. |
subHubs |
Controls whether hub genes should be substituted for missing eigengenes. If TRUE, each missing eigengene (i.e., eigengene whose calculation failed and the error was trapped) will be replaced by a weighted average of the most connected hub genes in the corresponding module. If this calculation fails, or if subHubs==FALSE, the value of trapErrors will determine whether the offending module will be removed or whether the function will issue an error and stop. |
trapErrors |
Controls handling of errors from that may arise when there are too many NA entries in expression data. If TRUE, errors from calling these functions will be trapped without abnormal exit. If FALSE, errors will cause the function to stop. Note, however, that subHubs takes precedence in the sense that if subHubs==TRUE and trapErrors==FALSE, an error will be issued only if both the principal component and the hubgene calculations have failed. |
returnValidOnly |
logical; controls whether the returned data frame of module eigengenes contains columns corresponding only to modules whose eigengenes or hub genes could be calculated correctly (TRUE), or whether the data frame should have columns for each of the input color labels (FALSE). |
softPower |
The power used in soft-thresholding the adjacency matrix. Only used when the hubgene approximation is necessary because the principal component calculation failed. It must be non-negative. The default value should only be changed if there is a clear indication that it leads to incorrect results. |
scale |
logical; can be used to turn off scaling of the expression data before calculating the singular value decomposition. The scaling should only be turned off if the data has been scaled previously, in which case the function can run a bit faster. Note however that the function first imputes, then scales the expression data in each module. If the expression contain missing data, scaling outside of the function and letting the function impute missing data may lead to slightly different results than if the data is scaled within the function. |
verbose |
Controls verbosity of printed progress messages. 0 means silent, up to (about) 5 the verbosity gradually increases. |
indent |
A single non-negative integer controlling indentation of printed messages. 0 means no indentation, each unit above that adds two spaces. |
This function is called internally by the
u_cluster_similarity
function
A list with the following components:
Module eigengenes in a dataframe, with each column corresponding to one eigengene
the average expression per module in the training set
the average expression per module in the training set
The variance explained by the first PC in each module
A copy of the input colors with entries corresponding to invalid modules set to grey if given, otherwise 0 if colors is numeric and "grey" otherwise.
The 1st or 1st and 2nd PC from each module in the training set
The 1st or 1st and 2nd PC from each module in the test set
The prcomp
object returned by
prcomp
the number of modules (clusters)
Zhang, B. and Horvath, S. (2005), "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17
## Not run: #see u_cluster_similarity for examples ## End(Not run)
## Not run: #see u_cluster_similarity for examples ## End(Not run)
Calculate Fisher's Z transformation for correlations. This can
be used as an alternative measure of similarity. Used in the
s_generate_data
function
u_fisherZ(n0, cor0, n1, cor1) fisherTransform(n_1, r1, n_2, r2)
u_fisherZ(n0, cor0, n1, cor1) fisherTransform(n_1, r1, n_2, r2)
n0 |
number of unexposed subjects |
cor0 |
correlation matrix of unexposed covariate values. Should be dimension pxp |
n1 |
number of exposed subjects |
cor1 |
correlation matrix of exposed covariate values. Should be dimension pxp |
n_1 |
number of unexposed subjects |
r1 |
correlation for unexposed |
n_2 |
number of exposed subjects |
r2 |
correlation for exposed |
a pxp matrix of Fisher's Z transformation of correlations
fisherTransform
is called internally by u_fisherZ
function
https://en.wikipedia.org/wiki/Fisher_transformation
data("simdata") X = simdata[,c(-1,-2)] fisherScore <- u_fisherZ(n0 = 100, cor0 = cor(X[1:50,]), n1 = 100, cor1 = cor(X[51:100,])) dim(fisherScore) fisherScore[1:5,1:5]
data("simdata") X = simdata[,c(-1,-2)] fisherScore <- u_fisherZ(n0 = 100, cor0 = cor(X[1:50,]), n1 = 100, cor1 = cor(X[51:100,])) dim(fisherScore) fisherScore[1:5,1:5]