Gene Set Enrichment Analysis Workflow Explanation


GSEA was recommended to me for my project and I am considering using it, but I want to make sure I understand its purpose and how it works. As I understand it, GSEA is used to associate a disease phenotype with sets of genes or proteins that are significantly enriched or depleted. For example, if one is studying breast cancer, would you use GSEA to look at genes that are highly or lowly expressed relative to the wild-type sample and that potentially play a role in the formation of breast cancer?

Another thing I’m slightly confused about is gene sets. I’ve seen that gene sets are groups of genes that share some commonality: they have the same function, they’re found in the same pathway (e.g., the RAS pathway), or they’re associated with the same disease (e.g., cancer) or some aspect of a disease (e.g., metastasis). Is that correct? And compared with something like pathway analysis, would pathways be considered gene sets as well?
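To make the idea of a gene set concrete, here is a minimal sketch that treats each gene set as a named collection of gene identifiers, under which a pathway is just one kind of gene set. The set names and gene names are hypothetical placeholders, not entries from any real database.

```python
# Hypothetical toy gene sets: names and members are illustrative only,
# not taken from any real gene set database.
gene_sets = {
    "TOY_RAS_PATHWAY": {"GENE_A", "GENE_B", "GENE_C"},  # grouped by pathway
    "TOY_METASTASIS": {"GENE_C", "GENE_D"},             # grouped by disease process
}


def sets_containing(gene, gene_sets):
    """Return the sorted names of all gene sets that include the given gene."""
    return sorted(name for name, members in gene_sets.items() if gene in members)


# A single gene can belong to several gene sets at once.
print(sets_containing("GENE_C", gene_sets))
```

In this picture a pathway-based set and a disease-based set are the same kind of object; they differ only in the criterion used to group the genes.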

Lastly, I have a question about the workflow image provided in the PNAS paper. I am a little confused about how to read it. I know that heat maps are used to visualize gene expression, but does the “ranked gene list” give the genes (rows) ordered from most to least highly expressed in “A” (the experimental group) relative to “B” (the control group)? I also know that colors on a heat map run from darker (more highly expressed) to lighter (less highly expressed). So how would the lighter colors at the top (e.g., the light red ones) be more highly expressed than, say, the dark red ones at the very bottom of the ranked gene list?
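One way to reason about the ranked list is to build a tiny one by hand. The sketch below assumes the genes are ranked by a signal-to-noise-style score comparing group A with group B (a difference between groups, not raw expression level), and it uses an unweighted running sum as a simplified stand-in for the enrichment score. The expression values, gene names, and function names are all hypothetical, and this is not the GSEA software's implementation.

```python
import statistics

# Hypothetical toy expression data: gene -> (values in group A, values in group B).
expression = {
    "GENE_A": ([9.0, 10.0, 11.0], [2.0, 3.0, 4.0]),
    "GENE_B": ([5.0, 6.0, 7.0], [5.0, 6.0, 7.0]),
    "GENE_C": ([8.0, 9.0, 10.0], [4.0, 5.0, 6.0]),
    "GENE_D": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),
}


def signal_to_noise(a, b):
    """Difference of group means divided by the sum of group standard deviations."""
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b)
    )


def rank_genes(expression):
    """Order genes from most A-associated (top) to most B-associated (bottom)."""
    scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


def enrichment_score(ranked, gene_set):
    """Unweighted running sum: step up at set members, step down otherwise.

    Assumes the set has at least one member in the list and at least one
    gene in the list is outside the set. Returns the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = rank_genes(expression)
print(ranked)
print(enrichment_score(ranked, {"GENE_A", "GENE_C"}))
```

Under this assumption, a gene's position in the list reflects how strongly it differs between A and B, so a gene with modest absolute expression can still rank near the top if it is consistently higher in A than in B.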

