Hello everybody. I am currently analyzing TCGA RNA-seq data with the R/Bioconductor package "TCGAbiolinks", and I have a simple question about the types of data that I can download.
Basically, there are three types of RNA-seq data that you can download for the Illumina RNA-seq strategy:
1) HTSeq - Counts
2) FPKM
3) FPKM-UQ

Now, HTSeq - Counts should be the raw counts, which I can normalize myself with functions from this package or from other packages, while FPKM and FPKM-UQ should be counts that have already been normalized with those two methods.
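
As a rough illustration of the difference between the two normalized types, here is a minimal base-R sketch that applies the textbook FPKM and FPKM-UQ formulas to a toy count matrix. The matrix, the gene lengths, and the names `counts`, `gene_length`, `fpkm`, and `fpkm_uq` are all made up for the example, and the GDC pipeline may differ in details (for instance, in which genes enter the upper-quartile calculation).

```r
# Toy raw counts: 4 hypothetical genes x 2 hypothetical samples.
counts <- matrix(c(100, 200, 0, 700,
                   50, 400, 10, 540),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Hypothetical gene lengths in base pairs, one per row of `counts`.
gene_length <- c(gene1 = 1000, gene2 = 2000, gene3 = 500, gene4 = 3500)

# FPKM: divide by the total count of each sample and by gene length,
# then scale by 1e9.
lib_size <- colSums(counts)
fpkm <- sweep(counts, 2, lib_size, "/") / gene_length * 1e9

# FPKM-UQ: same idea, but divide by the 75th percentile of the counts
# in each sample instead of the total count.
uq <- apply(counts, 2, quantile, probs = 0.75)
fpkm_uq <- sweep(counts, 2, uq, "/") / gene_length * 1e9

# Both results keep one row per input gene.
print(dim(fpkm))
print(dim(fpkm_uq))
```

Note that these formulas only rescale the values: by themselves they do not remove any gene from the matrix.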

My question comes from the following observation.
When I start from HTSeq - Counts and run the normalization and filtering steps myself, I end up with only about one third of the genes I started with (roughly 56,000 at the start, 17,000 at the end). Conversely, if I download the FPKM or FPKM-UQ data, I get already normalized values for the same 56,000 genes, which I only need to filter for low values, leaving roughly 40,000 genes at the end.
So, my question is: is it correct to download the already normalized data and apply only the filtering step, so that I keep more genes throughout the analysis? Or is it better in any case to start from the raw counts and normalize them myself, even though I lose a lot of genes? Or am I doing something wrong?
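
To see at which step genes are dropped, it can help to record the number of rows after each operation. The sketch below does this in base R on simulated data with a simple quantile filter; the matrix `expr`, the 25% cutoff, and the object names are assumptions chosen only for illustration, not the procedure used by any particular package.

```r
set.seed(1)

# Simulated expression matrix: 1000 hypothetical genes x 6 hypothetical samples.
expr <- matrix(rexp(6000, rate = 0.1), nrow = 1000,
               dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

# Quantile filter: keep genes whose mean expression is above the
# 25th percentile of all gene means (the cutoff is arbitrary here).
gene_means <- rowMeans(expr)
cutoff <- quantile(gene_means, probs = 0.25)
filtered <- expr[gene_means > cutoff, , drop = FALSE]

# Number of genes remaining after each step.
steps <- c(start = nrow(expr), after_filter = nrow(filtered))
print(steps)
```

Applying the same bookkeeping to the real objects after each normalization and filtering call would show which step accounts for the drop from roughly 56,000 to 17,000 genes.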

Here is some example code with HTSeq - Counts:

query.exp <- GDCquery(project = TCGAprj,
...


