How many genes to include for a GSEA analysis


This seems like a simple question.

You have a list of genes from a DEG analysis, with p-values, FDRs, logFCs, etc. Previously, for GSEA I kept only the genes with FDR < 0.25 or 0.05, ranked them by logFC (in other words, pre-ranked the genes by logFC), and then ran GSEA. Now I am wondering whether this is a good approach:

  • There might be too many genes (typically ~50%). Assuming that 4-5
    pathways are usually involved and each pathway has about 500 genes,
    the top 2,000 genes might be enough to include in GSEA.
  • I am not sure logFC is the best way to rank genes. Maybe
    use -log(PValue) as the magnitude of the rank score and the sign of
    logFC as the sign of the score, i.e., use sign(logFC) * (-log(PValue))
    as the rank score?
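To make the second idea concrete, here is a minimal sketch of the sign(logFC) * (-log(PValue)) score. The gene names and values are made up, `rank_score` and `min_p` are hypothetical names, and base-10 log is an assumption (the base only rescales the scores; it does not change the ordering).

```python
import math

def rank_score(log_fc, p_value, min_p=1e-300):
    """Signed significance score: sign(logFC) * -log10(p-value)."""
    # Guard against log(0) when a tool reports a p-value of exactly 0.
    p = max(p_value, min_p)
    # sign is +1, 0, or -1 depending on the direction of change.
    sign = (log_fc > 0) - (log_fc < 0)
    return sign * -math.log10(p)

# Hypothetical DEG results: (gene, logFC, p-value).
deg = [
    ("GENE_A", 2.0, 1e-6),
    ("GENE_B", -1.5, 1e-4),
    ("GENE_C", 0.1, 0.5),
    ("GENE_D", -3.0, 1e-10),
]

# Score every gene (no filtering) and sort from most up- to most down-regulated.
ranked = sorted(
    ((gene, rank_score(lfc, p)) for gene, lfc, p in deg),
    key=lambda item: item[1],
    reverse=True,
)

for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

With this score, strongly significant up-regulated genes land at the top, strongly significant down-regulated genes at the bottom, and weakly significant genes in the middle regardless of their fold change.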

Googled briefly but didn't find a convention.









According to the GSEA documentation:

The GSEA algorithm does not filter the expression dataset and does not
benefit from your filtering of the expression dataset. During the
analysis, genes that are poorly expressed or that have low variance
across the dataset populate the middle of the ranked gene list and the
use of a weighted statistic ensures that they do not contribute to a
positive enrichment score. By removing such genes from your dataset,
you may actually reduce the power of the statistic.
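The point about the weighted statistic can be illustrated with a simplified running-sum enrichment score. This is only a sketch under stated assumptions, not the GSEA implementation: `enrichment_score` is a hypothetical helper, the gene names and scores are made up, and permutation testing and normalization are left out.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum enrichment score (simplified sketch).

    ranked:   list of (gene, score) pairs sorted by score, descending.
    gene_set: set of gene names to test.
    """
    # Each hit steps up in proportion to |score| ** weight.
    hits = [abs(s) ** weight if g in gene_set else 0.0 for g, s in ranked]
    total_hit = sum(hits)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")

    running, best = 0.0, 0.0
    for (gene, _), h in zip(ranked, hits):
        if gene in gene_set:
            running += h / total_hit   # weighted step up on a hit
        else:
            running -= 1.0 / n_miss    # uniform step down on a miss
        if abs(running) > abs(best):   # keep the maximum deviation from zero
            best = running
    return best

# Made-up ranked list; G3 and G4 sit in the middle with near-zero scores.
ranked = [("G1", 5.0), ("G2", 4.0), ("G3", 0.1),
          ("G4", -0.1), ("G5", -3.0), ("G6", -6.0)]

print(enrichment_score(ranked, {"G1", "G2"}))
print(enrichment_score(ranked, {"G5", "G6"}))

# Share of the total hit weight carried by a mid-list member of a gene set:
print(0.1 / (5.0 + 4.0 + 0.1))
```

In this toy example a gene-set member in the middle of the list carries only about 1% of the hit weight, so leaving such genes in the ranked list barely moves the score, while removing them changes the number of misses the statistic is measured against.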

And additionally in the wiki:

We hopefully will be able to devote some time to investigating this,
but in the mean time, we are recommending use of the GSEAPreranked
tool for conducting gene set enrichment analysis of data derived from
RNA-seq experiments. In particular: Prior to conducting gene set
enrichment analysis, conduct your differential expression analysis
using any of the tools developed by the bioinformatics community
(e.g., cuffdiff, edgeR, DESeq, etc). Based on your differential
expression analysis, rank your features and capture your ranking in an
RNK-formatted file. The ranking metric can be whatever measure of
differential expression you choose from the output of your selected DE
tool. For example, cuffdiff provides the (base 2) log of the fold
