Package 'eclust' reference manual

Title:	Environment Based Clustering for Interpretable Predictive Models in High Dimensional Data
Description:	Companion package to the paper: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures. Bhatnagar, Yang, Khundrakpam, Evans, Blanchette, Bouchard, Greenwood (2017) <DOI:10.1101/102475>. This package includes an algorithm for clustering high dimensional data that can be affected by an environmental factor.
Authors:	Sahir Rai Bhatnagar [aut, cre] (http://sahirbhatnagar.com/)
Maintainer:	Sahir Rai Bhatnagar <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.0
Built:	2025-03-28 04:54:08 UTC
Source:	https://github.com/sahirbhatnagar/eclust

Plot Heatmap of Cluster Summaries by Exposure Status

Description

Plots cluster summaries such as the 1st principal component or average by exposure status. This is a plot method for object of class eclust returned by the r_cluster_data function. Two heatmaps, side-by-side are returned, where the first heatmap corresponds to the unexposed subjects and the second heatmap corresponds to the exposed subjects.

Usage

## S3 method for class 'eclust'
plot(x, type = c("ECLUST", "CLUST"), summary = c("pc",
  "avg"), sample = c("training", "test"), unexposed_title = "E=0",
  exposed_title = "E=1", ...)
## S3 method for class 'eclust'
plot(x, type = c("ECLUST", "CLUST"), summary = c("pc",
  "avg"), sample = c("training", "test"), unexposed_title = "E=0",
  exposed_title = "E=1", ...)

Arguments

`x`	object of class `eclust`, which is returned by the `r_cluster_data` function
`type`	show results from the "ECLUST" (which considers the environment) or "CLUST" (which ignores the environment) methods. Default is "ECLUST". See `r_cluster_data` for details. This function uses the `clustersAddon` object for "ECLUST" and the `clustersAll` for "CLUST"
`summary`	show the 1st principal component or the average of each cluster. Default is "pc".
`sample`	which sample to show, the "training" or the "test" set. Default is "training". This is determined by the `train_index` and `test_index` arguments in the `r_cluster_data` function. If you want to show all subjects, then provide the numeric vector 1:n to either argument, where n is the entire sample size.
`unexposed_title`	The title for the unexposed subjects heatmap. Default is "E=0".
`exposed_title`	The title for the exposed subjects heatmap. Default is "E=1".
`...`	other arguments passed to the `Heatmap` function

Details

Rows are the cluster summaries and columns are the subjects. This function determines the minimum and maximum value for the whole dataset and then creates a color scale using those values with the colorRamp2. This is so that both heatmaps are on the same color scale, i.e., each color represents the same value in both heatmaps. This is done for being able to visually compare the results.

Value

a plot of two Heatmaps, side-by-side, of the cluster summaries by exposure status

Examples

## Not run: 
data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 1, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

class(cluster_res)

plot(cluster_res, show_column_names = FALSE)

## End(Not run)
## Not run: 
data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 1, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

class(cluster_res)

plot(cluster_res, show_column_names = FALSE)

## End(Not run)

Function to generate heatmap

Description

Plots a heatmap of a similarity matrix such as a correlation matrix or a TOM matrix. This function is a plotting method for an object of class similarity. These objects are returned by the s_generate_data and s_generate_data_mars functions

Usage

## S3 method for class 'similarity'
plot(x, color = viridis::viridis(100), truemodule,
  active, ...)
## S3 method for class 'similarity'
plot(x, color = viridis::viridis(100), truemodule,
  active, ...)

Arguments

`x`	an object of class similarity. This is a p x p symmetric matix such as a correlation matrix or a TOM matrix, where p is the number of genes
`color`	colors for the heatmap. By default it uses the `viridis` color scheme. The `viridis` package needs to be installed.
`truemodule`	a numeric vector of length p where p is the number of genes, giving the module membership. By default, 0 = Grey, 1 = Turquoise, 2 = Blue, 3 = Red, 4 = Green, and 5 = Yellow. This information is used for annotating the heatmap
`active`	a binary vector of length p (where p is the number of genes) where 0 means that gene is not related to the response, and 1 means that the gene is associated to the response.
`...`	other arguments passed to the pheatmap function

Value

a heatmap of a similarity matrix

Note

this function is only meant to be used with output from the s_generate_data and s_generate_data_mars functions, since it assumes a fixed number of modules.

Examples

## Not run: 
corrX <- cor(simdata[,c(-1,-2)])
class(corrX) <- append(class(corrX), "similarity")
plot(corrX, truemodule = c(rep(1:5, each=150), rep(0, 250)))

## End(Not run)
## Not run: 
corrX <- cor(simdata[,c(-1,-2)])
class(corrX) <- append(class(corrX), "similarity")
plot(corrX, truemodule = c(rep(1:5, each=150), rep(0, 250)))

## End(Not run)

Cluster data using environmental exposure

Description

This is one of the functions for real data analysis, which will cluster the data based on the environment, as well as ignoring the environment

Usage

r_cluster_data(data, response, exposure, train_index, test_index,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
  "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
  "maximum", "manhattan", "canberra", "binary", "minkowski"),
  minimum_cluster_size = 50, ...)
r_cluster_data(data, response, exposure, train_index, test_index,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
  "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
  "maximum", "manhattan", "canberra", "binary", "minkowski"),
  minimum_cluster_size = 50, ...)

Arguments

`data`	n x p matrix of data. rows are samples, columns are genes or cpg sites. Should not contain the environment variable
`response`	numeric vector of length n
`exposure`	binary (0,1) numeric vector of length n for the exposure status of the n samples
`train_index`	numeric vector indcating the indices of `response` and the rows of `data` that are in the training set
`test_index`	numeric vector indcating the indices of `response` and the rows of `data` that are in the test set
`cluster_distance`	character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore
`eclust_distance`	character representing which matrix from the training set that you want to use to cluster the genes based on the environment. See `cluster_distance` for avaialble options. Should be different from `cluster_distance`. For example, if `cluster_distance=corr` and `EclustDistance=fisherScore`. That is, one should be based on correlations ignoring the environment, and the other should be based on correlations accounting for the environment. This function will always return this add on
`measure_distance`	one of "euclidean","maximum","manhattan", "canberra", "binary","minkowski" to be passed to `dist` function for calculating the distance for the clusters based on the corr,corr1,corr0, tom, tom0, tom1 matrices
`minimum_cluster_size`	The minimum cluster size. Only applicable if `cutMethod='dynamic'`. This argument is passed to the `cutreeDynamic` function through the `u_cluster_similarity` function. Default is 50.
`...`	arguments passed to the `u_cluster_similarity` function

Details

This function clusters the data. The results of this function should then be passed to the r_prepare_data function which output the appropriate X and Y matrices in the right format for regression packages such as mgcv, caret and glmnet

Value

a list of length 8:

clustersAddon: clustering results based on the environment and not the environment. see u_cluster_similarity for details
clustersAll: clustering results ignoring the environment. See u_cluster_similarity for details
etrain: vector of the exposure variable for the training set
cluster_distance_similarity: the similarity matrix based on the argument specified in cluster_distance
eclust_distance_similarity: the similarity matrix based on the argument specified in eclust_distance
clustersAddonMembership: a data.frame and data.table of the clustering membership for clustering results based on the environment and not the environment. As a result, each gene will show up twice in this table
clustersAllMembership: a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table
clustersEclustMembership: a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table

Examples

data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)
data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)

Prepare data for regression routines

Description

This function will output the appropriate X and Y matrices in the right format for regression packages such as mgcv, caret and glmnet

Usage

r_prepare_data(data, response = "Y", exposure = "E", probe_names)
r_prepare_data(data, response = "Y", exposure = "E", probe_names)

Arguments

`data`	the data frame which contains the response, exposure, and genes or cpgs or covariates. the columns should be labelled.
`response`	the column name of the response in the `data` argument
`exposure`	the column name of the exposure in the `data` argument
`probe_names`	the column names of the genes, or cpg sites or covariates

Value

a list of length 5:

X: the X matrix
Y: the response vector
E: the exposure vector
main_effect_names: the names of the main effects including the exposure
interaction_names: the names of the interaction effects

Examples

data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 50)

pc_eclust_interaction <- r_prepare_data(data = cbind(cluster_res$clustersAddon$PC,
                                                     survival = Y[trainIndex],
                                                     subtype = E[trainIndex]),
                                        response = "survival", exposure = "subtype")
names(pc_eclust_interaction)
dim(pc_eclust_interaction$X)
pc_eclust_interaction$main_effect_names
pc_eclust_interaction$interaction_names

## End(Not run)
data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 50)

pc_eclust_interaction <- r_prepare_data(data = cbind(cluster_res$clustersAddon$PC,
                                                     survival = Y[trainIndex],
                                                     subtype = E[trainIndex]),
                                        response = "survival", exposure = "subtype")
names(pc_eclust_interaction)
dim(pc_eclust_interaction$X)
pc_eclust_interaction$main_effect_names
pc_eclust_interaction$interaction_names

## End(Not run)

Generate linear response data and test and training sets for simulation study

Description

create a function that takes as input, the number of genes, the true beta vector, the gene expression matrix created from the generate_blocks function and returns a list of data matrix, as well as correlation matrices, TOM matrices, cluster information, training and test data

Usage

s_generate_data(p, X, beta, binary_outcome = FALSE,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
  include_interaction = F, signal_to_noise_ratio = 1,
  eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
  cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
  "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
  "canberra", "binary", "minkowski"), n_clusters,
  agglomeration_method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
  B = 10)
s_generate_data(p, X, beta, binary_outcome = FALSE,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
  include_interaction = F, signal_to_noise_ratio = 1,
  eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
  cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
  "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
  "canberra", "binary", "minkowski"), n_clusters,
  agglomeration_method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
  B = 10)

Arguments

`p`	number of genes in design matrix
`X`	gene expression matrix of size n x p using the `generate_blocks` function
`beta`	true beta coefficient vector
`binary_outcome`	Logical. Should a binary outcome be generated. Default is `FALSE`. See details on how a binary outcome is generated
`cluster_distance`	character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore
`n`	total number of subjects
`n0`	total number of subjects with E=0
`include_interaction`	Should an interaction with the environment be generated as part of the response. Default is FALSE.
`signal_to_noise_ratio`	signal to noise ratio, default is 1
`eclust_distance`	character representing which matrix from the training set that you want to use to cluster the genes based on the environment. See `cluster_distance` for avaialble options. Should be different from `cluster_distance`. For example, if `cluster_distance=corr` and `EclustDistance=fisherScore`. That is, one should be based on correlations ignoring the environment, and the other should be based on correlations accounting for the environment. This function will always return this add on
`cluster_method`	Cluster the data using hierarchical clustering or prototype clustering. Defaults `cluster_method="hclust"`. Other option is `protoclust`, however this package must be installed before proceeding with this option
`cut_method`	what method to use to cut the dendrogram. `'dynamic'` refers to `dynamicTreeCut` library. `'gap'` is Tibshirani's gap statistic `clusGap` using the `'Tibs2001SEmax'` rule. `'fixed'` is a fixed number specified by the `n_clusters` argument
`distance_method`	one of "euclidean","maximum","manhattan", "canberra", "binary","minkowski" to be passed to `dist` function.
`n_clusters`	Number of clusters specified by the user. Only applicable when `cut_method="fixed"`
`agglomeration_method`	the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
`nPC`	number of principal components to extract from each cluster. Default is 1. Only 1 or 2 is allowed.
`K.max`	the maximum number of clusters to consider, must be at least two. Only used if `cutMethod='gap'`
`B`	integer, number of Monte Carlo (“bootstrap”) samples. Only used if `cutMethod='gap'`

Details

To generate a binary outcome we first generate a continuous outcome Y which is $X^T \beta$ , defined $p = 1/(1 + exp(-Y ))$ and used this to generate a two-class outcome z with $Pr(z = 1) = p$ and $Pr(z = 0) = 1 - p$ .

Value

list of (in the following order)

beta_truth: a 1 column matrix containing the true beta coefficient vector
similarity: an object of class similarity which is the similarity matrix specified by the cluster_distance argument
similarityEclust: an object of class similarity which is the similarity matrix specified by the eclust_distance argument
DT: data.table of simulated data from the s_response function
Y: The simulated response
X0: the n0 x p design matrix for the unexposed subjects
X1: the n1 x p design matrix for the exposed subjects
X_train: the training design matrix for all subjects
X_test: the test set design matrix for all subjects
Y_train: the training set response
Y_test: the test set response
DT_train: the training response and training design matrix in a single data.frame object
DT_test: the test response and training design matrix in a single data.frame object
S0: a character vector of the active genes i.e. the ones that are associated with the response
n_clusters_All: the number of clusters identified by using the similarity matrix specified by the cluster_distance argument
n_clusters_Eclust: the number of clusters identified by using the similarity matrix specified by the eclust_distance argument
n_clusters_Addon: the sum of n_clusters_All and n_clusters_Eclust
clustersAll: the cluster membership of each gene based on the cluster_distance matrix
clustersAddon: the cluster membership of each gene based on both the cluster_distance matrix and the eclust_distance matrix. Note that each gene will appear twice here
clustersEclust: the cluster membership of each gene based on the eclust_distance matrix
gene_groups_inter: cluster membership of each gene with a penalty factor used for the group lasso
gene_groups_inter_Addon: cluster membership of each gene with a penalty factor used for the group lasso
tom_train_all: the TOM matrix based on all training subjects
tom_train_diff: the absolute difference of the exposed and unexposed TOM matrices: $|TOM_{E=1} - TOM_{E=0}|$
tom_train_e1: the TOM matrix based on training exposed subjects only
tom_train_e0: the TOM matrix based on training unexposed subjects only
corr_train_all: the Pearson correlation matrix based on all training subjects
corr_train_diff: the absolute difference of the exposed and unexposed Pearson correlation matrices: $|Cor_{E=1} - Cor_{E=0}|$
corr_train_e1: the Pearson correlation matrix based on training exposed subjects only
corr_train_e0: the Pearson correlation matrix based on training unexposed subjects only
fisherScore: The fisher scoring matrix. see u_fisherZ for details
corScor: The correlation scoring matrix: $|Cor_{E=1} + Cor_{E=0} - 2|$
mse_null: The MSE for the null model
DT_train_folds: The 10 training folds used for the stability measures
X_train_folds: The 10 X training folds (the same as in DT_train_folds)
Y_train_folds: The 10 Y training folds (the same as in DT_train_folds)

Note

this function calls the s_response to generate phenotype as a function of the gene expression data. This function also returns other information derived from the simulated data including the test and training sets, the correlation and TOM matrices and the clusters.

the PCs and averages need to be calculated in the fitting functions, because these will change based on the CV fold

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)
## Not run: 
result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)
names(result)

## End(Not run)
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)
## Not run: 
result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)
names(result)

## End(Not run)

Generate non linear response and test and training sets for non-linear simulation study

Description

Usage

s_generate_data_mars(p, X, beta, binary_outcome = FALSE, truemodule, nActive,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
  include_interaction = F, signal_to_noise_ratio = 1,
  eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
  cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
  "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
  "canberra", "binary", "minkowski"), n_clusters,
  agglomeration_method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
  B = 10)
s_generate_data_mars(p, X, beta, binary_outcome = FALSE, truemodule, nActive,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
  include_interaction = F, signal_to_noise_ratio = 1,
  eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
  cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
  "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
  "canberra", "binary", "minkowski"), n_clusters,
  agglomeration_method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
  B = 10)

Arguments

`p`	number of genes in design matrix
`X`	gene expression matrix of size n x p using the `generate_blocks` function
`beta`	true beta coefficient vector
`binary_outcome`	Logical. Should a binary outcome be generated. Default is `FALSE`. See details on how a binary outcome is generated
`truemodule`	numeric vector of the true module membership used in the `s_response_mars` function. Modules 3 and 4 are active in the response. See `s_response_mars` function for details.
`nActive`	number of active genes in the response used in the `s_response_mars`
`cluster_distance`	character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore
`n`	total number of subjects
`n0`	total number of subjects with E=0
`include_interaction`	Should an interaction with the environment be generated as part of the response. Default is FALSE.
`signal_to_noise_ratio`	signal to noise ratio, default is 1
`eclust_distance`	character representing which matrix from the training set that you want to use to cluster the genes based on the environment. See `cluster_distance` for avaialble options. Should be different from `cluster_distance`. For example, if `cluster_distance=corr` and `EclustDistance=fisherScore`. That is, one should be based on correlations ignoring the environment, and the other should be based on correlations accounting for the environment. This function will always return this add on
`cluster_method`	Cluster the data using hierarchical clustering or prototype clustering. Defaults `cluster_method="hclust"`. Other option is `protoclust`, however this package must be installed before proceeding with this option
`cut_method`	what method to use to cut the dendrogram. `'dynamic'` refers to `dynamicTreeCut` library. `'gap'` is Tibshirani's gap statistic `clusGap` using the `'Tibs2001SEmax'` rule. `'fixed'` is a fixed number specified by the `n_clusters` argument
`distance_method`	one of "euclidean","maximum","manhattan", "canberra", "binary","minkowski" to be passed to `dist` function.
`n_clusters`	Number of clusters specified by the user. Only applicable when `cut_method="fixed"`
`agglomeration_method`	the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
`nPC`	number of principal components. Can be 1 or 2.
`K.max`	the maximum number of clusters to consider, must be at least two. Only used if `cutMethod='gap'`
`B`	integer, number of Monte Carlo (“bootstrap”) samples. Only used if `cutMethod='gap'`

Value

list of (in the following order)

beta_truth: a 1 column matrix containing the true beta coefficient vector
similarity: an object of class similarity which is the similarity matrix specified by the cluster_distance argument
similarityEclust: an object of class similarity which is the similarity matrix specified by the eclust_distance argument
DT: data.table of simulated data from the s_response function
Y: The simulated response
X0: the n0 x p design matrix for the unexposed subjects
X1: the n1 x p design matrix for the exposed subjects
X_train: the training design matrix for all subjects
X_test: the test set design matrix for all subjects
Y_train: the training set response
Y_test: the test set response
DT_train: the training response and training design matrix in a single data.frame object
DT_test: the test response and training design matrix in a single data.frame object
S0: a character vector of the active genes i.e. the ones that are associated with the response
n_clusters_All: the number of clusters identified by using the similarity matrix specified by the cluster_distance argument
n_clusters_Eclust: the number of clusters identified by using the similarity matrix specified by the eclust_distance argument
n_clusters_Addon: the sum of n_clusters_All and n_clusters_Eclust
clustersAll: the cluster membership of each gene based on the cluster_distance matrix
clustersAddon: the cluster membership of each gene based on both the cluster_distance matrix and the eclust_distance matrix. Note that each gene will appear twice here
clustersEclust: the cluster membership of each gene based on the eclust_distance matrix
gene_groups_inter: cluster membership of each gene with a penalty factor used for the group lasso
gene_groups_inter_Addon: cluster membership of each gene with a penalty factor used for the group lasso
tom_train_all: the TOM matrix based on all training subjects
tom_train_diff: the absolute difference of the exposed and unexposed TOM matrices: $|TOM_{E=1} - TOM_{E=0}|$
tom_train_e1: the TOM matrix based on training exposed subjects only
tom_train_e0: the TOM matrix based on training unexposed subjects only
corr_train_all: the Pearson correlation matrix based on all training subjects
corr_train_diff: the absolute difference of the exposed and unexposed Pearson correlation matrices: $|Cor_{E=1} - Cor_{E=0}|$
corr_train_e1: the Pearson correlation matrix based on training exposed subjects only
corr_train_e0: the Pearson correlation matrix based on training unexposed subjects only
fisherScore: The fisher scoring matrix. see u_fisherZ for details
corScor: The correlation scoring matrix: $|Cor_{E=1} + Cor_{E=0} - 2|$
mse_null: The MSE for the null model
DT_train_folds: The 10 training folds used for the stability measures
X_train_folds: The 10 X training folds (the same as in DT_train_folds)
Y_train_folds: The 10 Y training folds (the same as in DT_train_folds)

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)
names(result)
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)
names(result)

Fit MARS Models on Simulated Cluster Summaries

Description

This function creates summaries of the given clusters (e.g. 1st PC or average), and then runs Friedman's MARS on those summaries. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016.

Usage

s_mars_clust(x_train, x_test, y_train, y_test, s0, summary = c("pc", "avg"),
  model = c("MARS"), exp_family = c("gaussian", "binomial"), gene_groups,
  true_beta = NULL, topgenes = NULL, stability = F, filter = F,
  include_E = T, include_interaction = F, clust_type = c("CLUST",
  "ECLUST"), nPC = 1)
s_mars_clust(x_train, x_test, y_train, y_test, s0, summary = c("pc", "avg"),
  model = c("MARS"), exp_family = c("gaussian", "binomial"), gene_groups,
  true_beta = NULL, topgenes = NULL, stability = F, filter = F,
  include_E = T, include_interaction = F, clust_type = c("CLUST",
  "ECLUST"), nPC = 1)

Arguments

`x_train`	`ntrain x p` matrix of simulated training set where `ntrain` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`x_test`	`ntest x p` matrix of simulated training set where `ntest` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`y_train`	numeric vector of length `ntrain` representing the responses for the training subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class)
`y_test`	numeric vector of length `ntest` representing the responses for the test subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`s0`	chracter vector of the active feature names, i.e., the features in `x_train` that are truly associated with the response.
`summary`	the summary of each cluster. Can be the principal component or average. Default is `summary = "pc"` which takes the first `number_pc` principal components. Currently a maximum of 2 principal components can be chosen.
`model`	Type of non-linear model to be fit. Currently only Friedman's MARS is supported.
`exp_family`	Response type. See details for `y_train` argument above.
`gene_groups`	data.frame that contains the group membership for each feature. The first column is called 'gene' and the second column should be called 'cluster'. The 'gene' column identifies the features and must be the same identifiers in the `x_train,x_test` matrices. The 'cluster' column is a numeric integer indicating the cluster group membership. A cluster group membership of 0 implies the feature did not cluster into any group.
`true_beta`	numeric vector of true beta coefficients
`topgenes`	List of features to keep if `filter=TRUE`. Default is `topgenes = NULL` which means all features are kept for the analysis
`stability`	Should stability measures be calculated. Default is `stability=FALSE`. See details
`filter`	Should analysis be run on a subset of features. Default is `filter = FALSE`
`include_E`	Should the environment variable be included in the regression analysis. Default is `include_E = TRUE`
`include_interaction`	Should interaction effects between the features in `x_train` and the environment variable be fit. Default is `include_interaction=TRUE`
`clust_type`	Method used to cluster the features. This is used for naming the output only and has no consequence for the results. `clust_type = "CLUST"` is the default which means that the environment varible was not used in the clustering step. `clust_type = "ECLUST"` means that the environment variable was used in the clustering aspect.
`nPC`	Number of principal components if `summary = "pc"`. Default is `nPC = 1`. Can be either 1 or 2.

Details

This function first does 10 fold cross-validation to tune the degree (either 1 or 2) using the train function with method="earth" and nprune is fixed at 1000. Then the earth function is used, with nk = 1000 and pmethod = "backward" to fit the MARS model using the optimal degree from the 10-fold CV.

Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE

If stability = TRUE then this function returns a p x 2 data.frame or data.table of regression coefficients without the intercept. The output of this is used for subsequent calculations of stability.

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

`mse or AUC`	Test set mean squared error if `exp_family = "gaussion"` or test set Area under the curve if `exp_family = "binomial"` calculated using the `roc` function
`RMSE`	Square root of the mse. Only applicable if `exp_family = "gaussion"`
`Shat`	Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model
`TPR`	true positive rate
`FPR`	false positive rate
`Correct Sparsity`	Correct true positives + correct true negative coefficients divided by the total number of features
`CorrectZeroMain`	Proportion of correct true negative main effects
`CorrectZeroInter`	Proportion of correct true negative interactions
`IncorrectZeroMain`	Proportion of incorrect true negative main effects
`IncorrectZeroInter`	Proportion of incorrect true negative interaction effects

Examples

## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)


mars_res <- s_mars_clust(x_train = result[["X_train"]],
                         x_test = result[["X_test"]],
                         y_train = result[["Y_train"]],
                         y_test = result[["Y_test"]],
                         s0 = result[["S0"]],
                         summary = "pc",
                         exp_family = "gaussian",
                         gene_groups = result[["clustersAddon"]],
                         clust_type = "ECLUST")
unlist(mars_res)

## End(Not run)
## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)


mars_res <- s_mars_clust(x_train = result[["X_train"]],
                         x_test = result[["X_test"]],
                         y_train = result[["Y_train"]],
                         y_test = result[["Y_test"]],
                         s0 = result[["S0"]],
                         summary = "pc",
                         exp_family = "gaussian",
                         gene_groups = result[["clustersAddon"]],
                         clust_type = "ECLUST")
unlist(mars_res)

## End(Not run)

Fit Multivariate Adaptive Regression Splines on Simulated Data

Description

This function can run Friedman's MARS models on the untransformed design matrix. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Uses caret functions to tune the degree and the nprune parameters

Usage

s_mars_separate(x_train, x_test, y_train, y_test, s0, model = c("MARS"),
  exp_family = c("gaussian", "binomial"), topgenes = NULL, stability = F,
  filter = F, include_E = T, include_interaction = F, ...)
s_mars_separate(x_train, x_test, y_train, y_test, s0, model = c("MARS"),
  exp_family = c("gaussian", "binomial"), topgenes = NULL, stability = F,
  filter = F, include_E = T, include_interaction = F, ...)

Arguments

`x_train`	`ntrain x p` matrix of simulated training set where `ntrain` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`x_test`	`ntest x p` matrix of simulated training set where `ntest` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`y_train`	numeric vector of length `ntrain` representing the responses for the training subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class)
`y_test`	numeric vector of length `ntest` representing the responses for the test subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`s0`	chracter vector of the active feature names, i.e., the features in `x_train` that are truly associated with the response.
`model`	Type of non-linear model to be fit. Currently only Friedman's MARS is supported.
`exp_family`	Response type. See details for `y_train` argument above.
`topgenes`	List of features to keep if `filter=TRUE`. Default is `topgenes = NULL` which means all features are kept for the analysis
`stability`	Should stability measures be calculated. Default is `stability=FALSE`. See details
`filter`	Should analysis be run on a subset of features. Default is `filter = FALSE`
`include_E`	Should the environment variable be included in the regression analysis. Default is `include_E = TRUE`
`include_interaction`	Should interaction effects between the features in `x_train` and the environment variable be fit. Default is `include_interaction=TRUE`
`...`	other parameters passed to `trainControl` function

Details

Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

`mse or AUC`	Test set mean squared error if `exp_family = "gaussion"` or test set Area under the curve if `exp_family = "binomial"` calculated using the `roc` function
`RMSE`	Square root of the mse. Only applicable if `exp_family = "gaussion"`
`Shat`	Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model
`TPR`	true positive rate
`FPR`	false positive rate
`Correct Sparsity`	Correct true positives + correct true negative coefficients divided by the total number of features
`CorrectZeroMain`	Proportion of correct true negative main effects
`CorrectZeroInter`	Proportion of correct true negative interactions
`IncorrectZeroMain`	Proportion of incorrect true negative main effects
`IncorrectZeroInter`	Proportion of incorrect true negative interaction effects

Examples

## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)


mars_res <- s_mars_separate(x_train = result[["X_train"]],
                            x_test = result[["X_test"]],
                            y_train = result[["Y_train"]],
                            y_test = result[["Y_test"]],
                            s0 = result[["S0"]],
                            exp_family = "gaussian")
unlist(mars_res)

## End(Not run)
## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)


mars_res <- s_mars_separate(x_train = result[["X_train"]],
                            x_test = result[["X_test"]],
                            y_train = result[["Y_train"]],
                            y_test = result[["Y_test"]],
                            s0 = result[["S0"]],
                            exp_family = "gaussian")
unlist(mars_res)

## End(Not run)

Simulate Covariates With Exposure Dependent Correlations

Description

This is a wrapper of the simulateDatExpr function which simulates data in a modular structure (i.e. in blocks). This function simulates data in 5 blocks referred to as Turquoise, Blue, Red, Green and Yellow, separately for exposed (E=1) and unexposed (E=0) observations.

Usage

s_modules(n, p, rho, exposed, ...)
s_modules(n, p, rho, exposed, ...)

Arguments

`n`	number of observations
`p`	total number of predictors to simulate
`rho`	numeric value representing the expected correlation between green module and red module
`exposed`	binary numeric vector of length `n` with 0 for unexposed and 1 for exposed
`...`	arguments passed to the `simulateDatExpr` function

Value

n x p matrix of simulated data

Examples

library(magrittr)
p <- 1000
n <- 200
d0 <- s_modules(n = 100, p = 1000, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = 100, p = 1000, rho = 0.90, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

X <- rbind(d0$datExpr, d1$datExpr) %>%
 magrittr::set_colnames(paste0("Gene", 1:p)) %>%
 magrittr::set_rownames(paste0("Subject",1:n))
dim(X)
library(magrittr)
p <- 1000
n <- 200
d0 <- s_modules(n = 100, p = 1000, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = 100, p = 1000, rho = 0.90, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

X <- rbind(d0$datExpr, d1$datExpr) %>%
 magrittr::set_colnames(paste0("Gene", 1:p)) %>%
 magrittr::set_rownames(paste0("Subject",1:n))
dim(X)

Fit Penalized Regression Models on Simulated Cluster Summaries

Description

This function creates summaries of the given clusters (e.g. 1st PC or average), and then fits a penalized regression model on those summaries. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Can run lasso, elasticnet, SCAD or MCP models

Usage

s_pen_clust(x_train, x_test, y_train, y_test, s0, gene_groups,
  summary = c("pc", "avg"), model = c("lasso", "elasticnet", "scad", "mcp"),
  exp_family = c("gaussian", "binomial"), filter = F, topgenes = NULL,
  stability = F, include_E = T, include_interaction = F,
  clust_type = c("CLUST", "ECLUST"), number_pc = 1)
s_pen_clust(x_train, x_test, y_train, y_test, s0, gene_groups,
  summary = c("pc", "avg"), model = c("lasso", "elasticnet", "scad", "mcp"),
  exp_family = c("gaussian", "binomial"), filter = F, topgenes = NULL,
  stability = F, include_E = T, include_interaction = F,
  clust_type = c("CLUST", "ECLUST"), number_pc = 1)

Arguments

`x_train`	`ntrain x p` matrix of simulated training set where `ntrain` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`x_test`	`ntest x p` matrix of simulated training set where `ntest` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`y_train`	numeric vector of length `ntrain` representing the responses for the training subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class)
`y_test`	numeric vector of length `ntest` representing the responses for the test subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`s0`	chracter vector of the active feature names, i.e., the features in `x_train` that are truly associated with the response.
`gene_groups`	data.frame that contains the group membership for each feature. The first column is called 'gene' and the second column should be called 'cluster'. The 'gene' column identifies the features and must be the same identifiers in the `x_train,x_test` matrices. The 'cluster' column is a numeric integer indicating the cluster group membership. A cluster group membership of 0 implies the feature did not cluster into any group.
`summary`	the summary of each cluster. Can be the principal component or average. Default is `summary = "pc"` which takes the first `number_pc` principal components. Currently a maximum of 2 principal components can be chosen.
`model`	Regression model to be fit on cluster summaries. Default is `model="lasso"` which corresponds to glmnet mixing parameter `alpha=1`. `model="elasticnet"` corresponds to glmnet mixing parameter `alpha=0.5`, `model="mcp"` and `model="scad"` are the non-convex models from the `ncvreg` package
`exp_family`	Response type. See details for `y_train` argument above.
`filter`	Should analysis be run on a subset of features. Default is `filter = FALSE`
`topgenes`	List of features to keep if `filter=TRUE`. Default is `topgenes = NULL` which means all features are kept for the analysis
`stability`	Should stability measures be calculated. Default is `stability=FALSE`. See details
`include_E`	Should the environment variable be included in the regression analysis. Default is `include_E = TRUE`
`include_interaction`	Should interaction effects between the features in `x_train` and the environment variable be fit. Default is `include_interaction=TRUE`
`clust_type`	Method used to cluster the features. This is used for naming the output only and has no consequence for the results. `clust_type = "CLUST"` is the default which means that the environment varible was not used in the clustering step. `clust_type = "ECLUST"` means that the environment variable was used in the clustering aspect.
`number_pc`	Number of principal components if `summary = "pc"`. Default is `number_pc = 1`. Can be either 1 or 2.

Details

The stability of feature importance is defined as the variability of feature weights under perturbations of the training set, i.e., small modifications in the training set should not lead to considerable changes in the set of important covariates (Toloşi, L., & Lengauer, T. (2011)). A feature selection algorithm produces a weight, a ranking, and a subset of features. In the CLUST and ECLUST methods, we defined a predictor to be non-zero if its corresponding cluster representative weight was non-zero. Using 10-fold cross validation (CV), we evaluated the similarity between two features and their rankings using Pearson and Spearman correlation, respectively. For each CV fold we re-ran the models and took the average Pearson/Spearman correlation of the 10 choose 2 combinations of estimated coefficients vectors. To measure the similarity between two subsets of features we took the average of the Jaccard distance in each fold. A Jaccard distance of 1 indicates perfect agreement between two sets while no agreement will result in a distance of 0.

Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

`mse or AUC`	Test set mean squared error if `exp_family = "gaussion"` or test set Area under the curve if `exp_family = "binomial"` calculated using the `roc` function
`RMSE`	Square root of the mse. Only applicable if `exp_family = "gaussion"`
`Shat`	Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model
`TPR`	true positive rate
`FPR`	false positive rate
`Correct Sparsity`	Correct true positives + correct true negative coefficients divided by the total number of features
`CorrectZeroMain`	Proportion of correct true negative main effects
`CorrectZeroInter`	Proportion of correct true negative interactions
`IncorrectZeroMain`	Proportion of incorrect true negative main effects
`IncorrectZeroInter`	Proportion of incorrect true negative interaction effects
`nclusters`	number of estimated clusters by the `cutreeDynamic` function

Note

number_pc=2 will not work if there is only one feature in an estimated cluster

References

Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994.

Bhatnagar, SR., Yang, Y., Blanchette, M., Bouchard, L., Khundrakpam, B., Evans, A., Greenwood, CMT. (2016+). An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Preprint

Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719-720.

Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, http://www.stanford.edu/~hastie/Papers/glmnet.pdf

Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)

result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)

pen_res <- s_pen_clust(x_train = result[["X_train"]],
                       x_test = result[["X_test"]],
                       y_train = result[["Y_train"]],
                       y_test = result[["Y_test"]],
                       s0 = result[["S0"]],
                       gene_groups = result[["clustersAddon"]],
                       summary = "pc",
                       model = "lasso",
                       exp_family = "gaussian",
                       clust_type = "ECLUST",
                       include_interaction = TRUE)
unlist(pen_res)


library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)

result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)

pen_res <- s_pen_clust(x_train = result[["X_train"]],
                       x_test = result[["X_test"]],
                       y_train = result[["Y_train"]],
                       y_test = result[["Y_test"]],
                       s0 = result[["S0"]],
                       gene_groups = result[["clustersAddon"]],
                       summary = "pc",
                       model = "lasso",
                       exp_family = "gaussian",
                       clust_type = "ECLUST",
                       include_interaction = TRUE)
unlist(pen_res)

Fit Penalized Regression Models on Simulated Data

Description

This function can run penalized regression models on the untransformed design matrix. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Can run lasso, elasticnet, SCAD or MCP models

Usage

s_pen_separate(x_train, x_test, y_train, y_test, s0,
  exp_family = c("gaussian", "binomial"), model = c("lasso", "elasticnet",
  "scad", "mcp"), topgenes = NULL, stability = F, filter = F,
  include_E = T, include_interaction = F)
s_pen_separate(x_train, x_test, y_train, y_test, s0,
  exp_family = c("gaussian", "binomial"), model = c("lasso", "elasticnet",
  "scad", "mcp"), topgenes = NULL, stability = F, filter = F,
  include_E = T, include_interaction = F)

Arguments

`x_train`	`ntrain x p` matrix of simulated training set where `ntrain` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`x_test`	`ntest x p` matrix of simulated training set where `ntest` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`y_train`	numeric vector of length `ntrain` representing the responses for the training subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class)
`y_test`	numeric vector of length `ntest` representing the responses for the test subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`s0`	chracter vector of the active feature names, i.e., the features in `x_train` that are truly associated with the response.
`exp_family`	Response type. See details for `y_train` argument above.
`model`	Regression model to be fit on cluster summaries. Default is `model="lasso"` which corresponds to glmnet mixing parameter `alpha=1`. `model="elasticnet"` corresponds to glmnet mixing parameter `alpha=0.5`, `model="mcp"` and `model="scad"` are the non-convex models from the `ncvreg` package
`topgenes`	List of features to keep if `filter=TRUE`. Default is `topgenes = NULL` which means all features are kept for the analysis
`stability`	Should stability measures be calculated. Default is `stability=FALSE`. See details
`filter`	Should analysis be run on a subset of features. Default is `filter = FALSE`
`include_E`	Should the environment variable be included in the regression analysis. Default is `include_E = TRUE`
`include_interaction`	Should interaction effects between the features in `x_train` and the environment variable be fit. Default is `include_interaction=TRUE`

Details

Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

`mse or AUC`	Test set mean squared error if `exp_family = "gaussion"` or test set Area under the curve if `exp_family = "binomial"` calculated using the `roc` function
`RMSE`	Square root of the mse. Only applicable if `exp_family = "gaussion"`
`Shat`	Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model
`TPR`	true positive rate
`FPR`	false positive rate
`Correct Sparsity`	Correct true positives + correct true negative coefficients divided by the total number of features
`CorrectZeroMain`	Proportion of correct true negative main effects
`CorrectZeroInter`	Proportion of correct true negative interactions
`IncorrectZeroMain`	Proportion of incorrect true negative main effects
`IncorrectZeroInter`	Proportion of incorrect true negative interaction effects

References

Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994.

Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719-720.

Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, http://www.stanford.edu/~hastie/Papers/glmnet.pdf

Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.

Examples

## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)

result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)

pen_res <- s_pen_separate(x_train = result[["X_train"]],
                          x_test = result[["X_test"]],
                          y_train = result[["Y_train"]],
                          y_test = result[["Y_test"]],
                          s0 = result[["S0"]],
                          model = "lasso",
                          exp_family = "gaussian",
                          include_interaction = TRUE)
unlist(pen_res)

## End(Not run)
## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)

result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)

pen_res <- s_pen_separate(x_train = result[["X_train"]],
                          x_test = result[["X_test"]],
                          y_train = result[["Y_train"]],
                          y_test = result[["Y_test"]],
                          s0 = result[["S0"]],
                          model = "lasso",
                          exp_family = "gaussian",
                          include_interaction = TRUE)
unlist(pen_res)

## End(Not run)

Generate True Response vector for Linear Simulation

Description

Given the true beta vector, covariates and environment variable this function generates the linear response with specified signal to noise ratio.

Usage

s_response(n, n0, p, genes, binary_outcome = FALSE, E,
  signal_to_noise_ratio = 1, include_interaction = FALSE, beta = NULL)
s_response(n, n0, p, genes, binary_outcome = FALSE, E,
  signal_to_noise_ratio = 1, include_interaction = FALSE, beta = NULL)

Arguments

`n`	Total number of subjects
`n0`	Total number of unexposed subjects
`p`	Total number of genes (or covariates)
`genes`	nxp matrix of the genes or covariates
`binary_outcome`	Logical. Should a binary outcome be generated. Default is `FALSE`. See details on how a binary outcome is generated
`E`	binary 0,1, vector of the exposure/environment variable
`signal_to_noise_ratio`	a numeric variable for the signal to noise ratio
`include_interaction`	Logical. Should the response include the interaction between E and the genes (for the non-zero `beta` coefficient vector)
`beta`	true beta coefficient vector. Assumes this vector is in the same order as the `genes`.

Value

a data.frame/data.table containing the response and the design matrix. Also an object of class expression

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_response(n = n, n0 = n0,
                     p = p, genes = X, binary_outcome = FALSE,
                     E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1,
                     include_interaction = FALSE,
                     beta = beta)
result[1:5,1:5]

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_response(n = n, n0 = n0,
                     p = p, genes = X, binary_outcome = FALSE,
                     E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1,
                     include_interaction = FALSE,
                     beta = beta)
result[1:5,1:5]

Generate True Response vector for Non-Linear Simulation

Description

Given the covariates and environment variable this function generates the nonlinear response with specified signal to noise ratio.

Usage

s_response_mars(n, n0, p, genes, beta, binary_outcome = FALSE, E,
  signal_to_noise_ratio = 1, truemodule, nActive)
s_response_mars(n, n0, p, genes, beta, binary_outcome = FALSE, E,
  signal_to_noise_ratio = 1, truemodule, nActive)

Arguments

`n`	total number of subjects
`n0`	total number of subjects with E=0
`p`	number of genes in design matrix
`genes`	nxp matrix of the genes or covariates
`beta`	true beta coefficient vector
`binary_outcome`	Logical. Should a binary outcome be generated. Default is `FALSE`. See details on how a binary outcome is generated
`E`	binary 0,1, vector of the exposure/environment variable
`signal_to_noise_ratio`	signal to noise ratio, default is 1
`truemodule`	numeric vector of the true module membership used in the `s_response_mars` function. Modules 3 and 4 are active in the response. See `s_response_mars` function for details.
`nActive`	number of active genes in the response used in the `s_response_mars`

Value

a data.frame/data.table containing the response and the design matrix. Also an object of class expression

Note

See Bhatnagar et al (2017+) for details on how the response is simulated.

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_response_mars(n = n, n0 = n0,
                          p = p, genes = X, binary_outcome = TRUE,
                          E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1,
                          truemodule = truemodule1, nActive = nActive,
                          beta = beta)
result[1:5,1:5]
library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_response_mars(n = n, n0 = n0,
                          p = p, genes = X, binary_outcome = TRUE,
                          E = c(rep(0,n0), rep(1, n1)), signal_to_noise_ratio = 1,
                          truemodule = truemodule1, nActive = nActive,
                          beta = beta)
result[1:5,1:5]

Simulated Data with Environment Dependent Correlations

Description

A dataset containing simulated data for example use of the eclust package functions. This data was generated using the s_modules and s_generate_data

Usage

simdata
simdata

Format

A matrix with 100 rows and 502 variables:

Y: continuous response vector
E: binary environment variable for ECLUST method. E = 0 for unexposed (n=50) and E = 1 for exposed (n=50)
columns 3:502: gene expression data for 1000 genes. column names are the gene names

Note

Code used to generate this data can be found on the GitHub page for this package. See URL below.

Source

https://raw.githubusercontent.com/sahirbhatnagar/eclust/master/data-raw/simulated-data-processing.R

References

Examples

simdata[1:5, 1:10]
table(simdata[,"E"])
simdata[1:5, 1:10]
table(simdata[,"E"])

Subset of TCGA mRNA Ovarian serous cystadenocarcinoma data

Description

A dataset containing a subset of the TCGA mRNA Ovarian serous cystadenocarcinoma data generated using Affymetrix HTHGU133a arrays. Differences in gene expression profiles have led to the identification of robust molecular subtypes of ovarian cancer; these are of biological and clinical importance because they have been shown to correlate with overall survival (Tothill et al., 2008). Improving prediction of survival time based on gene expression signatures can lead to targeted therapeutic interventions (Helland et al., 2011). The proposed ECLUST algorithm was applied to gene expression data from 511 ovarian cancer patients profiled by the Affymetrix Human Genome U133A 2.0 Array. The data were obtained from the TCGA Research Network: http://cancergenome.nih.gov/ and downloaded via the TCGA2STAT R library (Wanet al., 2015). Using the 881 signature genes from Helland et al. (2011) we grouped subjects into two groups based on the results in this paper, to create a “positive control” environmental variable expected to have a strong effect. Specifically, we defined an environment variable in our framework as: E = 0 for subtypes C1 and C2 (n = 253), and E = 1 for subtypes C4 and C5 (n = 258).

Usage

tcgaov
tcgaov

Format

A data.table and data.frame with 511 rows and 886 variables:

rn: unique patient identifier (character)
subtype: cancer subtype (1,2,3 or 4) as per Helland et al. 2011 (integer)
E: binary environment variable for ECLUST method. E = 0 for subtypes 1 and 2 (n = 253), and E = 1 for subtypes 4 and 5 (n = 258) (numeric)
status: vital status, 0 = alive, 1 = dead (numeric)
OS: overall survival time (numeric)
columns 6:886: gene expression data for 881 genes. column names are the gene names (numeric)

Source

http://www.liuzlab.org/TCGA2STAT/#import-gene-expression

http://gdac.broadinstitute.org/

http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0018064.s015

References

Richard W Tothill, Anna V Tinker, Joshy George, Robert Brown, Stephen B Fox, Stephen Lade, Daryl S Johnson, Melanie K Trivett, Dariush Etemadmoghadam, Bianca Locandro, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research, 14(16):5198–5208, 2008.

Aslaug Helland, Michael S Anglesio, Joshy George, Prue A Cowin, Cameron N Johnstone, Colin M House, Karen E Sheppard, Dariush Etemadmoghadam, Nataliya Melnyk, Anil K Rustgi, et al. Deregulation of mycn, lin28b and let7 in a molecular subtype of aggressive high-grade serous ovarian cancers. PloS one, 6(4):e18064, 2011.

Examples

# using data.table syntax from the data.table package
tcgaov[1:5, 1:10, with = FALSE]
tcgaov[,table(subtype, E, useNA = "always")]
# using data.table syntax from the data.table package
tcgaov[1:5, 1:10, with = FALSE]
tcgaov[,table(subtype, E, useNA = "always")]

Cluster similarity matrix

Description

Return cluster membership of each predictor. This function is called internally by the s_generate_data and s_generate_data_mars functions. Is also used by the r_clust function for real data analysis.

Usage

u_cluster_similarity(x, expr, exprTest, distanceMethod,
  clustMethod = c("hclust", "protoclust"), cutMethod = c("dynamic", "gap",
  "fixed"), nClusters, method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), K.max = 10, B = 50, nPC,
  minimum_cluster_size = 50)
u_cluster_similarity(x, expr, exprTest, distanceMethod,
  clustMethod = c("hclust", "protoclust"), cutMethod = c("dynamic", "gap",
  "fixed"), nClusters, method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), K.max = 10, B = 50, nPC,
  minimum_cluster_size = 50)

Arguments

`x`	similarity matrix. must have non-NULL dimnames i.e., the rows and columns should be labelled, e.g. "Gene1, Gene2, ..."
`expr`	gene expression data (training set). rows are people, columns are genes
`exprTest`	gene expression test set. If using real data, and you dont have enough samples for a test set then just supply the same data supplied to the `expr` argument
`distanceMethod`	one of "euclidean","maximum","manhattan", "canberra", "binary","minkowski" to be passed to `dist` function. If missing, then this function will take 1-x as the dissimilarity measure. This functionality is for diffCorr,diffTOM, fisherScore matrices which need to be converted to a distance type matrix.
`clustMethod`	Cluster the data using hierarchical clustering or prototype clustering. Defaults `clustMethod="hclust"`. Other option is `protoclust`, however this package must be installed before proceeding with this option
`cutMethod`	what method to use to cut the dendrogram. `'dynamic'` refers to `cutreeDynamicTree` library. `'gap'` is Tibshirani's gap statistic `clusGap` using the `'Tibs2001SEmax'` rule. `'fixed'` is a fixed number specified by the `nClusters` argument
`nClusters`	number of clusters. Only used if `cutMethod = fixed`
`method`	the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
`K.max`	the maximum number of clusters to consider, must be at least two. Only used if `cutMethod='gap'`
`B`	integer, number of Monte Carlo (“bootstrap”) samples. Only used if `cutMethod='gap'`
`nPC`	number of principal components. Can be 1 or 2.
`minimum_cluster_size`	The minimum cluster size. Only applicable if `cutMethod='dynamic'`. This argument is passed to the `cutreeDynamic` function. Default is 50.

Value

a list of length 2:

clusters

a p x 3 data.frame or data.table which give the cluster membership of each gene, where p is the number of genes. The first column is the gene name, the second column is the cluster number (numeric) and the third column is the cluster membership as a character vector of color names (these will match up exactly with the cluster number)

pcInfo

a list of length 9:

eigengenes: a list of the eigengenes i.e. the 1st (and 2nd if nPC=2) principal component of each module
averageExpr: a data.frame of the average expression for each module for the training set
averageExprTest: a data.frame of the average expression for each module for the test set
varExplained: percentage of variance explained by each 1st (and 2nd if nPC=2) principal component of each module
validColors: cluster membership of each gene
PC: a data.frame of the 1st (and 2nd if nPC=2) PC for each module for the training set
PCTest: a data.frame of the 1st (and 2nd if nPC=2) PC for each module for the test set
prcompObj: the prcomp object
nclusters: a numeric value for the total number of clusters

Examples

data("simdata")
X = simdata[,c(-1,-2)]
train_index <- sample(1:nrow(simdata),100)

cluster_results <- u_cluster_similarity(x = cor(X),
                                        expr = X[train_index,],
                                        exprTest = X[-train_index,],
                                        distanceMethod = "euclidean",
                                        clustMethod = "hclust",
                                        cutMethod = "dynamic",
                                        method = "average", nPC = 2,
                                        minimum_cluster_size = 75)

cluster_results$clusters[, table(module)]
names(cluster_results$pcInfo)
cluster_results$pcInfo$nclusters
data("simdata")
X = simdata[,c(-1,-2)]
train_index <- sample(1:nrow(simdata),100)

cluster_results <- u_cluster_similarity(x = cor(X),
                                        expr = X[train_index,],
                                        exprTest = X[-train_index,],
                                        distanceMethod = "euclidean",
                                        clustMethod = "hclust",
                                        cutMethod = "dynamic",
                                        method = "average", nPC = 2,
                                        minimum_cluster_size = 75)

cluster_results$clusters[, table(module)]
names(cluster_results$pcInfo)
cluster_results$pcInfo$nclusters

Get selected terms from an earth object

Description

function to extract the selected terms from an earth object

Usage

u_extract_selected_earth(obj)
u_extract_selected_earth(obj)

Arguments

obj

object of class earth returned by the earth function

Details

called internally by the s_mars_separate and s_mars_clust functions

Value

character vector of selected terms from the MARS model

Calculates cluster summaries

Description

This is a modified version of moduleEigengenes. It can extract (1st and 2nd principal component) of modules in a given single dataset. It can also return the average, the variance explained This function is more flexible and the nPC argument is used. currently only nPC = 1 and nPC = 2 are supported

Usage

u_extract_summary(x_train, colors, x_test, y_train, y_test, impute = TRUE,
  nPC, excludeGrey = FALSE, grey = if (is.numeric(colors)) 0 else "grey",
  subHubs = TRUE, trapErrors = FALSE, returnValidOnly = trapErrors,
  softPower = 6, scale = TRUE, verbose = 0, indent = 0)
u_extract_summary(x_train, colors, x_test, y_train, y_test, impute = TRUE,
  nPC, excludeGrey = FALSE, grey = if (is.numeric(colors)) 0 else "grey",
  subHubs = TRUE, trapErrors = FALSE, returnValidOnly = trapErrors,
  softPower = 6, scale = TRUE, verbose = 0, indent = 0)

Arguments

`x_train`	Training data for a single set in the form of a data frame where rows are samples and columns are genes (probes, cpgs, covariates).
`colors`	A vector of the same length as the number of probes in expr, giving module color for all probes (genes). Color "grey" is reserved for unassigned genes.
`x_test`	Test set in the form of a data frame where rows are samples and columns are genes (probes, cpgs, covariates).
`y_train`	Training response numeric vector
`y_test`	Test response numeric vector
`impute`	If TRUE, expression data will be checked for the presence of NA entries and if the latter are present, numerical data will be imputed, using function impute.knn and probes from the same module as the missing datum. The function impute.knn uses a fixed random seed giving repeatable results.
`nPC`	Number of principal components and variance explained entries to be calculated. Note that only 1 or 2 is possible.
`excludeGrey`	Should the improper module consisting of 'grey' genes be excluded from the eigengenes?
`grey`	Value of colors designating the improper module. Note that if colors is a factor of numbers, the default value will be incorrect.
`subHubs`	Controls whether hub genes should be substituted for missing eigengenes. If TRUE, each missing eigengene (i.e., eigengene whose calculation failed and the error was trapped) will be replaced by a weighted average of the most connected hub genes in the corresponding module. If this calculation fails, or if subHubs==FALSE, the value of trapErrors will determine whether the offending module will be removed or whether the function will issue an error and stop.
`trapErrors`	Controls handling of errors from that may arise when there are too many NA entries in expression data. If TRUE, errors from calling these functions will be trapped without abnormal exit. If FALSE, errors will cause the function to stop. Note, however, that subHubs takes precedence in the sense that if subHubs==TRUE and trapErrors==FALSE, an error will be issued only if both the principal component and the hubgene calculations have failed.
`returnValidOnly`	logical; controls whether the returned data frame of module eigengenes contains columns corresponding only to modules whose eigengenes or hub genes could be calculated correctly (TRUE), or whether the data frame should have columns for each of the input color labels (FALSE).
`softPower`	The power used in soft-thresholding the adjacency matrix. Only used when the hubgene approximation is necessary because the principal component calculation failed. It must be non-negative. The default value should only be changed if there is a clear indication that it leads to incorrect results.
`scale`	logical; can be used to turn off scaling of the expression data before calculating the singular value decomposition. The scaling should only be turned off if the data has been scaled previously, in which case the function can run a bit faster. Note however that the function first imputes, then scales the expression data in each module. If the expression contain missing data, scaling outside of the function and letting the function impute missing data may lead to slightly different results than if the data is scaled within the function.
`verbose`	Controls verbosity of printed progress messages. 0 means silent, up to (about) 5 the verbosity gradually increases.
`indent`	A single non-negative integer controlling indentation of printed messages. 0 means no indentation, each unit above that adds two spaces.

Details

This function is called internally by the u_cluster_similarity function

Value

A list with the following components:

eigengenes: Module eigengenes in a dataframe, with each column corresponding to one eigengene
averageExpr: the average expression per module in the training set
averageExprTest: the average expression per module in the training set
varExplained: The variance explained by the first PC in each module
validColors: A copy of the input colors with entries corresponding to invalid modules set to grey if given, otherwise 0 if colors is numeric and "grey" otherwise.
PC: The 1st or 1st and 2nd PC from each module in the training set
PCTest: The 1st or 1st and 2nd PC from each module in the test set
prcompObj: The prcomp object returned by prcomp
nclusters: the number of modules (clusters)

References

Zhang, B. and Horvath, S. (2005), "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17

Examples

## Not run: 
#see u_cluster_similarity for examples

## End(Not run)
## Not run: 
#see u_cluster_similarity for examples

## End(Not run)

Calculate Fisher's Z Transformation for Correlations

Description

Calculate Fisher's Z transformation for correlations. This can be used as an alternative measure of similarity. Used in the s_generate_data function

Usage

u_fisherZ(n0, cor0, n1, cor1)

fisherTransform(n_1, r1, n_2, r2)
u_fisherZ(n0, cor0, n1, cor1)

fisherTransform(n_1, r1, n_2, r2)

Arguments

`n0`	number of unexposed subjects
`cor0`	correlation matrix of unexposed covariate values. Should be dimension pxp
`n1`	number of exposed subjects
`cor1`	correlation matrix of exposed covariate values. Should be dimension pxp
`n_1`	number of unexposed subjects
`r1`	correlation for unexposed
`n_2`	number of exposed subjects
`r2`	correlation for exposed

Value

a pxp matrix of Fisher's Z transformation of correlations

Note

fisherTransform is called internally by u_fisherZ function

References

https://en.wikipedia.org/wiki/Fisher_transformation

Examples

data("simdata")

X = simdata[,c(-1,-2)]
fisherScore <- u_fisherZ(n0 = 100, cor0 = cor(X[1:50,]),
                         n1 = 100, cor1 = cor(X[51:100,]))

dim(fisherScore)

fisherScore[1:5,1:5]
data("simdata")

X = simdata[,c(-1,-2)]
fisherScore <- u_fisherZ(n0 = 100, cor0 = cor(X[1:50,]),
                         n1 = 100, cor1 = cor(X[51:100,]))

dim(fisherScore)

fisherScore[1:5,1:5]

Package 'eclust'

Help Index

Plot Heatmap of Cluster Summaries by Exposure Status

Description

Usage

Arguments

Details

Value

Examples

Function to generate heatmap

Description

Usage

Arguments

Value

Note

Examples

Cluster data using environmental exposure

Description

Usage

Arguments

Details

Value

See Also

Examples

Prepare data for regression routines

Description

Usage

Arguments

Value

Examples

Generate linear response data and test and training sets for simulation study

Description

Usage

Arguments

Details

Value

Note

Examples

Generate non linear response and test and training sets for non-linear simulation study

Description

Usage

Arguments

Value

Examples

Fit MARS Models on Simulated Cluster Summaries

Description

Usage

Arguments

Details

Value

Examples

Fit Multivariate Adaptive Regression Splines on Simulated Data

Description

Usage

Arguments

Details

Value

Examples

Simulate Covariates With Exposure Dependent Correlations

Description

Usage

Arguments

Value

Examples

Fit Penalized Regression Models on Simulated Cluster Summaries

Description

Usage

Arguments

Details

Value

Note

References

Examples

Fit Penalized Regression Models on Simulated Data

Description

Usage

Arguments

Details

Value

References