Hi everyone,

I'm working on copy number data from TCGA. I downloaded the "Gene Level Copy Number Variation" data using the TCGAbiolinks R package with the following code:


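library(TCGAbiolinks)

# Query the gene-level copy number scores for the TCGA-KICH project
query_cnv <- GDCquery(project = "TCGA-KICH",
                      data.category = "Copy Number Variation",
                      data.type = "Gene Level Copy Number Scores")
GDCdownload(query_cnv)          # fetch the files before preparing them
data <- GDCprepare(query_cnv)   # load the downloaded files into a data frame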

Everything works great. I get a nice data frame whose first three columns are "Gene.Symbol", "Gene.ID" and "Cytoband".
To facilitate the analysis and to be able to merge it with data from other sources such as RNA-Seq, I tried to convert the Ensembl gene IDs contained in Gene.Symbol to HUGO symbols using BioMart:

library(biomaRt)

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# Reformat the Ensembl IDs stored in Gene.Symbol before querying BioMart
genes <- gsub(".\\.", "\\1", data$Gene.Symbol)
geneIDs <- getBM(filters = "ensembl_gene_id",
                 attributes = c("ensembl_gene_id", "hgnc_symbol"),
                 values = genes, mart = mart)

However, out of 19729 distinct Ensembl IDs, I only get 3269 matches.
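
To see which IDs are dropped rather than only how many, the query vector can be compared with the BioMart result. Below is a minimal base-R sketch; `genes` and `geneIDs` here are small made-up stand-ins for the objects above (the IDs and symbols are hypothetical, not real output):

```r
# Hypothetical stand-ins for the objects built above; values are illustrative only
genes <- c("ENSG00000000001", "ENSG00000000002", "ENSG000000000031")
geneIDs <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002"),
  hgnc_symbol     = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# IDs that were sent to BioMart but did not come back
unmatched <- setdiff(unique(genes), geneIDs$ensembl_gene_id)

# Fraction of distinct query IDs that were matched
match_rate <- mean(unique(genes) %in% geneIDs$ensembl_gene_id)
```

Inspecting `unmatched` directly is usually more informative than the raw count, because a systematic problem in the IDs tends to be visible at a glance.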

What is surprising is that, according to the GDC docs (docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/), this dataset should contain the CNV associated with each gene, so I would expect more matches, at least among protein-coding genes.

When I searched for the Ensembl IDs that BioMart did not find, I got zero hits from both Ensembl and NCBI (examples: "ENSG000000081221" "ENSG000000081314" "ENSG000000676014" "ENSG000000783616" "ENSG000000788015"). It is as if these IDs do not exist in any database. So where are they coming from?
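
One thing that may be worth checking, offered as a sketch and not a confirmed diagnosis: human Ensembl gene IDs are "ENSG" followed by exactly 11 digits, whereas the examples above have 12. That would be consistent with a versioned ID (such as `ENSG00000008130.14`) being mangled instead of having its version suffix stripped. The versioned ID below is illustrative (the `.14` is an assumption), and `is_ensembl_gene_id` and `strip_version` are hypothetical helper names:

```r
# Example IDs quoted in the post
not_found <- c("ENSG000000081221", "ENSG000000081314", "ENSG000000676014",
               "ENSG000000783616", "ENSG000000788015")

# Human Ensembl gene IDs are "ENSG" followed by exactly 11 digits
is_ensembl_gene_id <- function(ids) grepl("^ENSG[0-9]{11}$", ids)
valid <- is_ensembl_gene_id(not_found)   # all FALSE: each ID has 12 digits
id_lengths <- nchar(not_found)           # 16 each, one more than expected

# Illustrative versioned ID; the ".14" version number is an assumption
versioned <- "ENSG00000008130.14"

# A pattern matching "one character plus the dot" also eats the last digit
# of the stable ID and glues the version number onto what is left
mangled <- sub(".\\.", "", versioned)    # "ENSG000000081314"

# Dropping everything from the dot onward keeps the stable ID intact
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)
clean <- strip_version(versioned)        # "ENSG00000008130"
```

If the IDs in Gene.Symbol do carry a version suffix, passing the output of something like `strip_version(data$Gene.Symbol)` to `getBM` should be a fairer test of how many genes really fail to map.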

Did I miss something? Is it normal to have so few protein-coding genes in this kind of dataset? Should I process the data differently for this kind of analysis?

Any suggestions or comments will be really helpful.
