Methods to filter data for constructing a gene network
Hello guys!
I have a matrix with gene expression counts (rows -> genes, columns -> samples). I have applied the Pearson correlation to these data because I want to generate an adjacency matrix. My goal is to apply graph-based methods to the resulting network.
My main problem is that the adjacency matrix is huge (dimensions: 33028*22) and the network cannot be constructed on my laptop.
So I was thinking of filtering the counts first and then generating the adjacency matrix. Although I have read a lot of papers about it, I am confused about which method to follow. The data that I found online do not have two conditions; instead there are several replicates for each cell (3 for some, 5 or 2 for others), so I am struggling to apply a t-test (as I was taught during my Master's) and find the most significant genes.
How should I approach this? Sorry if I am asking something obvious, but it is my first time applying all of this to raw data...
Thank you very much in advance!!! :-)
Hi, do you want to generate an adjacency matrix for cells or for genes? I am going to guess it's for genes.
Filtering your count matrix is easy: you can get rid of all the genes that are 0 in all samples like so:
mat[rowSums(mat > 0) > 0, ]
or you can apply something a little more sophisticated, such as getting rid of genes that don't have at least 10 counts in at least 3 samples (as an example). Then you can pick the most variable genes by first applying a variance-stabilizing transformation (normalization + log transformation, or vst, both from DESeq2) and then picking the N most variable genes. This will give you a reduced set of genes to build a network on, where you are maximizing for variability while protecting your choice from the mean-variance relationship. In other words, if you didn't apply a variance-stabilizing transformation, you would take only the most expressed genes, as they are also the most variable.
What are you using to generate the graph?
igraph should be able to handle large matrices in a memory-efficient way, and yours doesn't look to be too difficult.
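To make the steps above concrete, here is a rough sketch in base R, with igraph used only at the end if it is installed. It is a sketch under assumptions, not a definitive pipeline: the count matrix is simulated toy data, the thresholds (10 counts, 3 samples, top 100 genes, |r| >= 0.6) are arbitrary examples, and a plain log2 counts-per-million transform stands in for the "normalization + log transformation" option. DESeq2's vst could be used in place of that step.

```r
# Toy count matrix standing in for the real data (genes in rows, samples in columns).
set.seed(1)
n_genes <- 500
n_samples <- 22
mat <- matrix(rnbinom(n_genes * n_samples, mu = 50, size = 1),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("sample", seq_len(n_samples))))
mat[1:20, ] <- 0  # a few genes that are never detected

# 1. Keep genes with at least 10 counts in at least 3 samples (example thresholds).
keep <- rowSums(mat >= 10) >= 3
filtered <- mat[keep, , drop = FALSE]

# 2. Library-size normalization followed by a log transform (log2 of CPM + 1).
#    This is a simple stand-in for a variance-stabilizing transformation.
cpm <- t(t(filtered) / colSums(filtered)) * 1e6
logexpr <- log2(cpm + 1)

# 3. Rank genes by variance on the transformed scale and keep the top N.
top_n <- 100
gene_var <- apply(logexpr, 1, var)
n_keep <- min(top_n, length(gene_var))
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_keep)]
selected <- logexpr[top_genes, , drop = FALSE]

# 4. Gene-by-gene Pearson correlation (cor() correlates columns, hence t()).
cor_mat <- cor(t(selected), method = "pearson")

# 5. Threshold into a 0/1 adjacency matrix; the 0.6 cutoff is arbitrary.
adj <- (abs(cor_mat) >= 0.6) * 1
diag(adj) <- 0  # no self-loops

# 6. Build an undirected graph, only if igraph is available.
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
  print(igraph::vcount(g))
}
```

With only the top N genes kept, the correlation matrix is N x N rather than 33028 x 33028, which is what makes it manageable on a laptop.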