I'm having trouble resolving NAs while annotating a list of differentially expressed genes with Entrez IDs, using Ensembl IDs as keys. I would be surprised if this hadn't been asked before, but finding answers and suggestions is half the battle if you're not quite sure what to look for. Main questions in bold.
I've followed the steps outlined in the Bioconductor RNA-seq DESeq2 tutorial. The reference transcriptome is Homo_sapiens.GRCh38.v100, with the coding and non-coding sets combined. I have a list of differentially expressed genes for each of several compound-treatment experiments. I need Entrez IDs for a chemistry process downstream from here, so I ran what you would expect:
    library("AnnotationDbi")
    library("org.Hs.eg.db")
    # ens.str holds the Ensembl gene IDs for the rows of resAmi
    resAmi$entrez <- mapIds(org.Hs.eg.db,
                            keys=ens.str,
                            column="ENTREZID",
                            keytype="ENSEMBL",
                            multiVals="first")
A good proportion of the differentially expressed genes in each list are given an Entrez ID of NA. **Firstly, why are there NAs?** Is it something to do with Ensembl dropping gene mappings after a particular version of their database? That was a random comment I found on Biostars!
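One check that could narrow this down is sketched below, using toy vectors rather than the real data. The names `ens` and `entrez` are hypothetical stand-ins for `ens.str` and the `mapIds()` result, and the IDs are made up. It measures the fraction of unmapped keys and looks for Ensembl IDs that still carry a version suffix, since the ENSEMBL keys in org.Hs.eg.db are unversioned and a versioned key would not match. Whether that applies here is an assumption, not something confirmed.

```r
# Toy stand-ins (hypothetical values, not real gene IDs):
# 'ens' plays the role of ens.str, 'entrez' the role of the mapIds() result.
ens    <- c("ENSG00000000001.5", "ENSG00000000002", "ENSG00000000003.12")
entrez <- c(NA, "1234", NA)

# Fraction of keys that came back without an Entrez ID
na_frac <- mean(is.na(entrez))

# Flag keys that end in a version suffix such as ".5" or ".12"
has_version <- grepl("\\.[0-9]+$", ens)

# Strip the suffix so the keys have the plain ENSG... form
ens_clean <- sub("\\.[0-9]+$", "", ens)

# Cross-tabulate unmapped keys against versioned keys
print(table(unmapped = is.na(entrez), versioned = has_version))
```

If the unmapped keys are all unversioned already, the NAs more likely reflect Ensembl genes with no corresponding Entrez record in that annotation package.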
Secondly, I tried annotating with EnsDb.Hsapiens.v86 instead; a complete stab in the dark in an effort to understand more. This resolved some of the NAs, but many of the Entrez values I previously got from org.Hs.eg.db are now different. In fact, I can see that two different Ensembl IDs are assigned the same Entrez ID, depending on which annotation DB is used. **Which annotation DB is appropriate?**
Here's just a snippet of what I'm seeing (sorry about the formatting; the two Entrez IDs in question are in bold):
Symbol | Entrez | Entrez_ens | txbiotype | LFC
--- | --- | --- | --- | ---
NA | NA | 2920 | protein_coding | -26.4976570057622
TAS2R3 | **50831** | 1417 | protein_coding | -16.5810022683443
NA | NA | **50831** | protein_coding | 17.5184870830614
NA | NA | 102724652 | protein_coding | 14.3289350041311
ARHGAP11B | 89839 | NA | processed_pseudogene | -16.7365692557264
The two Entrez columns came from these sources:
Entrez = org.Hs.eg.db.
Entrez_ens = EnsDb.Hsapiens.v86.
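To make the disagreement concrete, here is a minimal base R sketch that rebuilds the snippet above as a data frame and flags the conflicting rows. The column names `entrez_org` and `entrez_ens` are labels chosen for this illustration; only the values come from the table.

```r
# Values copied from the snippet; IDs kept as character, NA where missing.
snippet <- data.frame(
  symbol     = c(NA, "TAS2R3", NA, NA, "ARHGAP11B"),
  entrez_org = c(NA, "50831", NA, NA, "89839"),           # org.Hs.eg.db
  entrez_ens = c("2920", "1417", "50831", "102724652", NA), # EnsDb.Hsapiens.v86
  stringsAsFactors = FALSE
)

# Rows where both databases returned an ID but the IDs differ
both     <- !is.na(snippet$entrez_org) & !is.na(snippet$entrez_ens)
disagree <- both & snippet$entrez_org != snippet$entrez_ens

# Entrez IDs that occur in both columns (possibly on different rows)
org_ids    <- snippet$entrez_org[!is.na(snippet$entrez_org)]
ens_ids    <- snippet$entrez_ens[!is.na(snippet$entrez_ens)]
shared_ids <- intersect(org_ids, ens_ids)

# Rows that involve one of the shared IDs in either column
conflict_rows <- snippet[snippet$entrez_org %in% shared_ids |
                         snippet$entrez_ens %in% shared_ids, ]

print(snippet[disagree, ])
print(conflict_rows)
```

Running the same comparison on the full results table would show how widespread the conflicts are, rather than relying on the few rows shown here.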
As much information as you can spare would be greatly appreciated.