Posted 2 hours ago by atakanekiz

I have a question about accessing the subtype information associated with TCGA projects using the TCGAbiolinks package (in this example COAD specifically, but my question applies to other projects as well, SKCM for instance).

When I download the RNA-seq experiment as a SummarizedExperiment object, I can access the metadata associated with the samples by calling colData(coad). This data frame includes the MSI (microsatellite instability) status of the tumors. This is what I get from it:

# Prepared coad object previously by using GDCdownload and GDCprepare functions

meta <-

#>[1] 521 102

#>                      MSI-H         MSI-L           MSS Not Evaluable          NA's 
#>            0            40            42           126             0           313

Alternatively, I can download subtype information using the TCGAquery_subtype function. When I do that and look at the MSI data in the returned data frame, this is what I see:

subtype <- TCGAbiolinks::TCGAquery_subtype("COAD")

#>[1] 276  45

#>                      MSI-H         MSI-L           MSS Not Evaluable 
#>           0            38            44           193             1
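
To see exactly where the two tables disagree, the MSI calls can be lined up per patient rather than compared as totals. This is a minimal sketch on toy data: the column names `patient` and `MSI_status` and the barcodes are placeholders I made up for illustration, not necessarily the columns TCGAbiolinks returns.

```r
# Toy stand-ins for colData(coad) and the TCGAquery_subtype("COAD") result.
# Column names and values are hypothetical placeholders.
meta_toy <- data.frame(
  patient    = c("P1", "P2", "P3", "P4"),
  MSI_status = c("MSI-H", "MSS", NA, "MSI-L"),
  stringsAsFactors = FALSE
)
subtype_toy <- data.frame(
  patient    = c("P1", "P2", "P3", "P5"),
  MSI_status = c("MSI-H", "MSI-L", "MSS", "MSS"),
  stringsAsFactors = FALSE
)

# Keep every patient from both sources; suffixes mark the origin of each call.
both <- merge(meta_toy, subtype_toy, by = "patient", all = TRUE,
              suffixes = c(".meta", ".subtype"))

# Cross-tabulate the two calls, counting missing values as their own category.
xtab <- table(meta = both$MSI_status.meta,
              subtype = both$MSI_status.subtype,
              useNA = "ifany")
print(xtab)

# Patients whose call differs, or is missing in only one of the sources.
a <- both$MSI_status.meta
b <- both$MSI_status.subtype
mismatch <- both[xor(is.na(a), is.na(b)) | (!is.na(a) & !is.na(b) & a != b), ]
print(mismatch)
```

With the real objects, the same merge would show whether the totals differ because the patient sets differ, because some calls are missing in one source, or because the calls themselves changed.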

A similar discrepancy is present when comparing survival times between the SummarizedExperiment and TCGAquery_subtype data frames. For some patients, one has a shorter follow-up time than the other (i.e., the patient is censored at an early date with an alive vital_status in one data frame, whereas he/she appears as deceased at a later time point in the other).
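
The survival mismatch can be listed per patient in the same way. Again a sketch on toy data only: `vital_status` is the field mentioned above, but `time_days`, the barcodes, and all values are hypothetical and would need to be mapped to whatever columns the two real data frames use.

```r
# Toy survival records from the two sources; names and values are hypothetical.
src_a <- data.frame(
  patient      = c("P1", "P2", "P3"),
  vital_status = c("Alive", "Dead", "Alive"),
  time_days    = c(300, 820, 150),
  stringsAsFactors = FALSE
)
src_b <- data.frame(
  patient      = c("P1", "P2", "P3"),
  vital_status = c("Dead", "Dead", "Alive"),
  time_days    = c(910, 820, 150),
  stringsAsFactors = FALSE
)

# Join on patient so each row carries both versions of the record.
surv <- merge(src_a, src_b, by = "patient", suffixes = c(".a", ".b"))

# Flag changes in status and measure the difference in follow-up time.
surv$status_differs <- surv$vital_status.a != surv$vital_status.b
surv$time_diff_days <- surv$time_days.b - surv$time_days.a

# Patients whose record is not identical in the two sources.
changed <- surv[surv$status_differs | surv$time_diff_days != 0, ]
print(changed)
```

In this toy example P1 is censored at day 300 in one source and recorded as deceased at day 910 in the other, which is the pattern I am describing.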

What is the reason for the discrepancy between the two sources of subtype data? I remember having similar issues with SKCM (for both subtype and survival data). I would appreciate it if you could let me know which version is more accurate to use.



modified 1 hour ago
