I have a question about accessing the subtype information associated with TCGA projects using the
TCGAbiolinks package (in this example specifically COAD, but my question applies to other projects as well, for instance SKCM).
When I download the RNAseq experiment as a SummarizedExperiment object I can access the metadata associated with the samples by calling
colData(coad). In this data frame, there is information regarding MSI (microsatellite instability) status of tumors. The information I get from there is the following:
    # Prepared coad object previously by using GDCdownload and GDCprepare functions
    meta <- as.data.frame(colData(coad))
    dim(meta)
    #> 521 102
    summary(meta$subtype_MSI_status)
    #> MSI-H MSI-L MSS Not Evaluable NA's
    #> 0 40 42 126 0 313
Alternatively, I can download the subtype information using the
TCGAquery_subtype function. When I do that and look at the MSI data in the returned data frame, this is what I see:
    subtype <- TCGAbiolinks::TCGAquery_subtype("COAD")
    dim(subtype)
    #> 276 45
    summary(subtype$MSI_status)
    #> MSI-H MSI-L MSS Not Evaluable
    #> 0 38 44 193 1
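To make the comparison concrete, this is roughly how the two tables could be lined up patient by patient. It is only a sketch on made-up toy data: `meta_toy` and `subtype_toy` are hypothetical stand-ins for `meta` and `subtype`, and I am assuming both tables carry a `patient` column holding the patient barcode. The sample-level table is collapsed to one row per patient first, since the SummarizedExperiment metadata has one row per sample while the subtype table has one row per patient.

```r
# Toy stand-ins for colData(coad) and TCGAquery_subtype("COAD").
# All values are invented for illustration; they are not TCGA data.
meta_toy <- data.frame(
  patient = c("P1", "P1", "P2", "P3", "P4"),   # P1 has two samples
  subtype_MSI_status = c("MSI-H", "MSI-H", "MSS", NA, "MSI-L"),
  stringsAsFactors = FALSE
)
subtype_toy <- data.frame(
  patient = c("P1", "P2", "P3"),
  MSI_status = c("MSI-H", "MSI-L", "MSS"),
  stringsAsFactors = FALSE
)

# Collapse the sample-level table to one row per patient.
meta_pat <- unique(meta_toy[, c("patient", "subtype_MSI_status")])

# Full outer join so patients missing from either table stay visible.
both <- merge(meta_pat, subtype_toy, by = "patient", all = TRUE)

# Cross-tabulate the two MSI calls, keeping NA as its own category.
xtab <- table(se = both$subtype_MSI_status,
              subtype = both$MSI_status,
              useNA = "ifany")
print(xtab)

# Patients whose MSI call differs or is missing in one of the tables.
mismatch <- both[is.na(both$subtype_MSI_status) |
                 is.na(both$MSI_status) |
                 both$subtype_MSI_status != both$MSI_status, ]
print(mismatch)
```

On the real objects the same steps would apply with `meta` and `subtype` in place of the toy tables, provided the assumed column names match.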
A similar discrepancy is also present when comparing survival times between the SummarizedExperiment and TCGAquery_subtype data frames. For some patients, one has a shorter follow-up time than the other (i.e. the patient is censored at an early date with an alive vital_status in one data frame, whereas the same patient appears deceased at a later time point in the other).
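The survival mismatch can be checked the same way. Again a sketch on invented toy data: the column names `days_to_death` and `days_to_last_follow_up`, the `"Alive"`/`"Dead"` coding, and the helper `overall_time()` are assumptions made for illustration and may not match the actual columns in either table.

```r
# Toy stand-ins for the survival columns of the two tables.
# All values are invented for illustration; they are not TCGA data.
se_toy <- data.frame(
  patient = c("P1", "P2", "P3"),
  vital_status = c("Alive", "Alive", "Dead"),
  days_to_last_follow_up = c(300, 150, NA),
  days_to_death = c(NA, NA, 420),
  stringsAsFactors = FALSE
)
sub_toy <- data.frame(
  patient = c("P1", "P2", "P3"),
  vital_status = c("Alive", "Dead", "Dead"),
  days_to_last_follow_up = c(300, NA, NA),
  days_to_death = c(NA, 610, 420),
  stringsAsFactors = FALSE
)

# Hypothetical helper: time to death if deceased, else time to last follow-up.
overall_time <- function(d) {
  ifelse(d$vital_status == "Dead", d$days_to_death, d$days_to_last_follow_up)
}
se_toy$time <- overall_time(se_toy)
sub_toy$time <- overall_time(sub_toy)

# Join on patient and keep both versions side by side.
keep <- c("patient", "vital_status", "time")
surv <- merge(se_toy[, keep], sub_toy[, keep],
              by = "patient", suffixes = c("_se", "_subtype"))
surv$time_diff <- surv$time_subtype - surv$time_se

# Patients whose status or follow-up time differs between the two tables.
changed <- surv[surv$vital_status_se != surv$vital_status_subtype |
                surv$time_diff != 0, ]
print(changed)
```

In the toy data, P2 is censored at day 150 in one table and deceased at day 610 in the other, which is the pattern I am seeing in the real data.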
What is the reason for the discrepancy between the two sources of subtype data? I remember having similar issues with SKCM (for both subtype and survival data). I would appreciate it if you could let me know which version is the more accurate one to use.