Myrsini Gounari
posted 7 days ago

Methods to filter data for constructing a gene network

Bioinformatics Homo sapiens

Hello guys!

I have a matrix with gene expression counts (rows -> genes, columns -> samples). I have applied Pearson correlation to these data to generate an adjacency matrix. My goal is to apply graph-based methods to the network that will be constructed from it.
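In NumPy, that correlation-to-adjacency step looks roughly like this (toy data; the 0.8 cutoff is purely illustrative, and in practice you would want log-transformed rather than raw counts):

```python
import numpy as np

# Toy expression matrix: 4 genes (rows) x 5 samples (columns).
counts = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with gene 0
    [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with gene 0
    [1.0, 3.0, 2.0, 5.0, 4.0],
])

# np.corrcoef treats each row as a variable, so this yields
# the gene-by-gene Pearson correlation matrix.
corr = np.corrcoef(counts)

# A simple adjacency matrix: keep absolute correlations above a
# threshold (0.8 here is arbitrary) and drop self-edges.
adj = (np.abs(corr) > 0.8).astype(int)
np.fill_diagonal(adj, 0)

print(adj)
```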

My main problem is that the data are huge (the expression matrix is 33028 genes x 22 samples, so the gene-by-gene adjacency matrix would be 33028 x 33028) and the network cannot be constructed on my laptop.

So I was thinking of filtering the counts first and then generating the adjacency matrix. Although I have read a lot of papers about it, I got confused about which method to follow. The data I found online don't have two conditions; instead there are several replicates per cell type (3 for some, 5 or 2 for others), so I struggle to apply a t-test (as I was taught during my Master's) to find the most significant genes.

How should I approach this? Sorry if I am asking something obvious, but it is my first time applying all of this to raw data...

Thank you very much in advance!!! :-)


Hi, do you want to generate an adjacency matrix for cells or for genes? I am going to guess it's for genes.

Filtering your count matrix is easy: you can get rid of all the genes that are 0 in all samples like so: mat[rowSums(mat) > 0, ], or you can apply something a little more sophisticated, such as getting rid of genes that don't have at least 10 counts in at least 3 samples (as an example).

Then you can pick the most variable genes by first applying a variance stabilizing transformation (normalization + log transformation, or vst, both from DESeq2) and then picking the N most variable genes. This gives you a reduced set of genes to build a network on, maximizing their variability while protecting your choice from the mean-variance relationship; in other words, without a variance stabilizing transformation you would simply end up picking the most highly expressed genes, as they are also the most variable.
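In Python, the same idea (a log-type transform followed by ranking genes by variance) might look like the sketch below; a simple log2(x + 1) stands in for DESeq2's vst here, and the choice of N is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix: 100 genes x 22 samples.
counts = rng.poisson(lam=20, size=(100, 22)).astype(float)

# Crude stand-in for a variance stabilizing transformation:
# log-transform after adding a pseudocount.
logged = np.log2(counts + 1)

# Rank genes by variance across samples and keep the top N.
n_top = 10
variances = logged.var(axis=1)
top_idx = np.argsort(variances)[::-1][:n_top]
reduced = logged[top_idx, :]

print(reduced.shape)  # (10, 22)
```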

What are you using to generate the graph? igraph should be able to handle large matrices in a memory-efficient way, and yours doesn't look too demanding.

Myrsini Gounari · 7 days ago

Hello, thanks for your reply! Yes, I want an adjacency matrix for genes. First of all, I should mention that I am trying to code it in Python, not in R (I know that R has functions for everything). What I did: 1. I checked which genes were never expressed and removed those (that way I got rid of almost 25000 genes). I also tried the more sophisticated approach you mentioned, removing "genes that don't have at least 10 counts in at least 3 samples (as an example)."
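For reference, both filters described here (dropping never-expressed genes, and the stricter "at least 10 counts in at least 3 samples" rule) can be written with NumPy along these lines; the thresholds are examples only:

```python
import numpy as np

# Toy count matrix: 5 genes x 4 samples.
counts = np.array([
    [0,  0,  0,  0],   # never expressed -> dropped by both filters
    [12, 15, 11, 0],   # >= 10 counts in 3 samples -> kept by both
    [1,  2,  0,  1],   # expressed, but below the stricter threshold
    [50, 60, 70, 80],  # kept by both
    [9,  9,  9,  9],   # expressed, but never reaches 10 counts
])

# Filter 1: drop genes that are zero in every sample.
expressed = counts[counts.sum(axis=1) > 0, :]

# Filter 2: keep genes with at least 10 counts in at least 3 samples.
keep = (counts >= 10).sum(axis=1) >= 3
strict = counts[keep, :]

print(expressed.shape, strict.shape)  # (4, 4) (2, 4)
```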

Then I thought of normalizing the data so I could keep the most expressed genes... I read a lot of papers on how to normalize in Python, because I wanted to check whether there was an existing module I could use before implementing my own function. What normalization would you suggest?

I also thought of computing the CV (coefficient of variation) for each gene and keeping the ones with the highest values.

For generating the graph I am using NetworkX.
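Since NetworkX is in play, building the graph from a thresholded correlation matrix could look like this sketch (random toy data; the 0.9 cutoff is just an example):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
# Toy log-expression matrix: 50 genes x 22 samples.
expr = rng.normal(size=(50, 22))

# Gene-by-gene Pearson correlations.
corr = np.corrcoef(expr)

# Threshold into an unweighted adjacency matrix with no self-loops.
adj = (np.abs(corr) > 0.9).astype(int)
np.fill_diagonal(adj, 0)

# from_numpy_array interprets the matrix as an adjacency matrix;
# node i corresponds to row/gene i.
G = nx.from_numpy_array(adj)

print(G.number_of_nodes())  # 50
```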

Again thanks a lot for your reply.

Apologies for assuming you use R! It's my go-to language for gene expression and I went on full autopilot. I am not confident enough in my Python skills to help you troubleshoot the actual functions and libraries, but I will try to give you some pointers.

I think a simple normalization is what is referred to as CPM, or Counts Per Million. You simply divide each count by the total sum of counts in its sample and multiply the result by one million. Then you can log-transform it (adding a pseudo-CPM of 1 to avoid log-transforming values of 0). Ideally you could also do an FPKM or TPM normalization if you have the necessary information (gene length and effective length; read here for an in-depth explanation).
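The CPM + log2 transform described above is only a few lines of NumPy; this sketch assumes genes in rows and samples in columns:

```python
import numpy as np

# Toy count matrix: 3 genes x 2 samples.
counts = np.array([
    [100.0, 200.0],
    [300.0, 200.0],
    [600.0, 600.0],
])

# Counts Per Million: divide each count by its sample's library size
# (the column sum), then scale to one million.
library_sizes = counts.sum(axis=0)  # per-sample totals
cpm = counts / library_sizes * 1e6

# Log transform with a pseudo-CPM of 1 to avoid log(0).
log_cpm = np.log2(cpm + 1)

print(cpm.sum(axis=0))  # each column of CPM sums to one million
```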

Regarding the CV, that is already quite a good measure, since it accounts for the mean/variance relationship. There are more sophisticated methods where you explicitly model that relationship across all genes, i.e. you fit a non-linear trend to the mean and variance of the counts and subtract the trended variance from each gene's observed variance. This allows you to share information between genes to identify the expected, average relationship between mean and variance (also sometimes termed the "technical variability") in your dataset, and to choose the genes whose variance deviates the most from the trend ("biological variability").
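The CV is a one-liner; the trend-based version can be roughly approximated by fitting, say, a low-degree polynomial to the per-gene mean/variance pairs and ranking genes by their residual variance. This is a rough sketch of that idea, not any specific package's method:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy log-expression matrix: 200 genes x 22 samples.
expr = rng.normal(loc=5, scale=1, size=(200, 22))

means = expr.mean(axis=1)
variances = expr.var(axis=1)

# Coefficient of variation per gene.
cv = expr.std(axis=1) / means

# Fit a simple polynomial trend of variance as a function of mean,
# then rank genes by how far their variance sits above the trend.
coeffs = np.polyfit(means, variances, deg=2)
trend = np.polyval(coeffs, means)
residual = variances - trend

# The 20 genes whose variance most exceeds the trend
# ("biological variability" in the description above).
top_idx = np.argsort(residual)[::-1][:20]
print(top_idx.shape)
```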

Concerning graphs, unfortunately I am not familiar with NetworkX. I have not used its Python API, but igraph really has a lot of memory-efficient implementations for analyses, clustering, and also visualization. It does have a bit of a steep learning curve, but the documentation is very good.

Miguel Ramos · 5 days ago

Hi there, I think that I have 2 cents to contribute to this topic:

  1. A Pearson correlation for this kind of data may be questionable, depending on your objective. Usually, these methods assume that you have many observations (samples) compared to features (genes). If you have more features than observations, it may introduce bias, and you need to carefully supervise your results.

  2. Python is indeed a great programming language with a strong community. Often, for popular programs or libraries, there are Python alternatives available. For example:
     - edgeR (R) -> edgeRpy (Python)
     - DESeq2 (R) -> DESeq (Python)
     - limma (available in both R and Python)
