Weighted Correlation Network Analysis

Analyzing broad patterns in transcriptomic variation.

Last updated on Jun 2, 2022 5 min read Methods

Weighted correlation network analysis is a family of approaches to study the pattern of correlation between features using an inferred network. In the context of gene expression, these techniques allow for the identification of co-expressed sets of genes.

Note

I write these articles to help me keep track of the various methods I have learned. If you have any suggestions or find any errors, feel free to contact me!

WGCNA

Weighted gene co-expression network analysis (WGCNA) is a type of weighted network correlation analysis. Broadly, these analyses attempt to build a network structure over transcriptomic data and use this network to make inferences about the relationships between genes. In WGCNA, a network is built using the genes as vertices. Connections (or edges) between genes are weighted based on how similar the genes are.

Figure: An example of a network. The vertices (dots) represent genes and the edges between them represent inferred similarity.

WGCNA was a method introduced by Zhang and Horvath (2005). WGCNA is built on the assumption that the biological network follows a scale-free topology. That is to say, the probability that a vertex $v$ has a degree $k$ is

$$\mathbb{P}(\deg(v)=k) \propto k^{-\gamma}$$

for some positive $\gamma$. A natural consequence of this relationship is that the network will tend to have a lot of vertices with a small degree (few connections) and a few vertices with a large degree (many connections).

Building the Network

Suppose we have a dataset of gene expression with $n$ samples and $p$ genes. We want to build a network over the $p$ genes using the information across the $n$ samples to infer relationships between genes. We take the following steps to do so:

Build a similarity matrix $\mathbf{S}=[s_{ij}]$, where some measure of similarity $s_{ij}\in [0,1]$ can be used to quantify the strength of relationship between the $i$-th and $j$-th genes.
Build an adjacency matrix $\mathbf{A}=[a_{ij}]$ using some adjacency function $a_{ij}=f(s_{ij})$. This function has to be monotonically increasing over $[0,1]$.

There are many options for both the similarity measure and the adjacency function.

Similarity Measure

Given that $\mathbf{x}_i,\mathbf{x}_j\in\mathbb{R}^n$ are the gene expression profiles of the $i$-th and $j$-th gene respectively:

Correlation: $s_{ij}=|\text{cor}(\mathbf{x}_i, \mathbf{x}_j)|$, where $\text{cor}(\mathbf{x}_i, \mathbf{x}_j)$ is Pearson’s correlation coefficient.
Spearman’s Rho: Absolute value of Spearman’s Rho to restrict the range.
Biweigth Midcorrelation: Absolute value of Biweight Midcorrelation to restrict the range.

I’m sure that more similarity measures such as cosine similarity can also be used. The signed variants of correlation can also be used, as long as they are scaled to fit within $[0,1]$.

Adjacency Function

Hard Threshold: $a_{ij}$ is $1$ if $s_{ij}\geq \tau$ and $0$ if $s_{ij}<\tau$. This forces some genes to be adjacent and others to not be adjacent.
Sigmoid: $a_{ij}=\frac{1}{1+e^{-\alpha(s_{ij} - \tau_0)}}$.
Power: $a_{ij}=|s_{ij}|^\beta$.

Generally, WGCNA is run using the power adjacency function.

Hyperparameter Selection

Take the power adjacency function as an example. How do we determine the optimal value of $\beta$? This is where the scale-free topology assumption is used. The $\beta$ value is chosen to maximize the fit of the network to the expected scale-free topology. Essentially, the idea is to use different values of $\beta$ to build multiple adjacency matrices, build degree distributions for each of the networks, and check to see which network is the best fit to the expected decay in degree according to the power law. A similar grid search can be employed for the hyperparameters of other adjacency functions.

If $p(k)=\mathbb{P}(\deg(v)=k) \propto k^{-\gamma}$ with $\gamma > 0$, we would expect $\log (p(k)) = \alpha \log (k)$ with $\alpha < 0$. This can be tested using a standard linear regression. The $R^2$ coefficient of determination can be used to quantify how well the network abides to a scale-free topology. The exact optimization used is $\text{argmax}_{\beta} -\text{sign} (\alpha) R^2$, since a value of $\alpha > 0$ would imply that there are more “hub” genes than peripheral genes.

Graph Clustering

The goal of clustering over the network is to use the topology of the network to define clusters of genes. These clusters are called modules. WGCNA utilizes the topological overlap as a measure of similarity between two vertices. Let $l_{ij}=\sum_u a_{iu}a_{uj}$ and $k_i=\sum_u a_{iu}$. Intuitively, $l_{ij}$ quantifies how often the $i$-th and $j$-th vertices tend to be connected to the same vertices and $k_i$ is the degree of the $i$-th vertex. Then, the topological overlap matrix $\mathbf{\Omega}=[\omega_{ij}]$ is defined as

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min{ k_i, k_j } + 1 - a_{ij}}$$

Figure: Examples of $\omega_{ij}$ for various pairs of vertices.

Intuition here is easier to build with an unweighted, hard-thresholded network. In this case, $\omega_{ij}=1$ if the vertex with the smaller degree shares all neighbors with the other vertex and is also connected to the other vertex. Similarly, $\omega_{ij}=0$ if the verticies are not adjacent and do not share any neighbors. In both the unweighted and weighted networks, the value of $\omega_{ij}$ is constrained within $0\leq \omega_{ij}\leq 1$.

The matrix $\mathbf{\Omega}$ is a similarity matrix. A distance matrix is required to use clustering approaches (most often hierarchical clustering using the Dynamic Tree Cut approach). This distance matrix $\mathbf{\Delta}=[\delta_{ij}]$ is obtained easily as

$$\delta_{ij}=1-\omega_{ij}$$

Figure: An example of the modules that might be generated from this network based on the topological overlap matrix.

Implementation

WGCNA R package - More here.
dynamicTreeCut R package - More here.

Deriving Meaningful Modules

Work by Parsana et al. demonstrated that building co-expression networks directly from gene expression data may be susceptible to biases from technical variation. Technical variation may induce spurious correlation and result in modules that do not reflect biological co-expression. In simulation, these sources of technical variation can be regressed out using the top principal components of the gene expression data to reconstruct the true underlying co-expression network. This method seems to improve module identification in real-world data as well.

Wang et al. have addressed the issue of mean-correlation bias in co-expression data. This bias occurs when highly-expressed genes are also more likely to be correlated with each other. They developed a method called Spatial Quantile Normalization that acts directly on the gene correlation matrix. The method divides the correlation matrix into a grid with an associated empirical distribution of correlations $F_{ij}$ for each grid point $(i,j)$. A reference grid point is chosen, and all the other grid points are quantile normalized to the target grid point’s empirical distribution.

Implementation

sva R package - More here.
spqn R package - More here.