I'm trying to make a box plot of gene expression from bulk RNA-seq data. The pipeline I followed was STAR -> StringTie. The raw counts were normalized with DESeq2 (disease vs. controls), and the normalized counts are what I'm plotting.

I'm plotting the expression of a single gene, gene A, in disease and control samples. There are 150 disease samples and 30 control samples. The normalized counts range from 0 to 3000 in disease and from 0 to 30 in controls.

The distribution is not normal: most samples have values between 0 and 10, and very few have values above 10. How can I make a better box plot? And what should be considered an outlier?






You can plot VST or rlog values instead of normalized counts. Both can be obtained from DESeq2 with vst() or rlog() (example in vignette). These transformations put the values on a log2-like scale, which compresses the large values, so you will most likely see far fewer points flagged as outliers. To show the distribution of the data, try violin plots (for example ggplot2 with geom_violin()), violin plots with boxplots on top of them (examples here), or boxplots with the individual data points overlaid.
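On the "what counts as an outlier" part: a standard box plot conventionally draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below illustrates how a log scale changes which points get flagged by that rule. It is only an illustration under stated assumptions: the counts are made up, the helper names are hypothetical, and log2(count + 1) is used as a simple stand-in for VST/rlog, which it is not equivalent to. Quartile conventions also differ slightly between tools, so the exact fences may not match what your plotting library draws.

```python
import math
import statistics


def log2_pseudocount(counts, pseudocount=1.0):
    """log2(count + pseudocount): a simple stand-in for VST/rlog, not the same transform."""
    return [math.log2(c + pseudocount) for c in counts]


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Values falling outside the Tukey fences, in their original order."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical normalized counts for one gene (made up for illustration).
disease = [0, 2, 5, 8, 10, 20, 40, 80, 150, 600, 3000]

raw_outliers = flag_outliers(disease)
log_outliers = flag_outliers(log2_pseudocount(disease))

print("flagged on raw scale:", raw_outliers)
print("flagged on log2 scale:", log_outliers)
```

With these made-up numbers the raw scale flags the two largest values, while the log2 scale flags none. Real data with many zeros may behave differently, so a point being drawn as an outlier is a property of the plotting rule and the scale, not by itself a reason to drop the sample.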

