How to convert bulk UniProt IDs to GO terms/IDs?



I have worked on a transcriptome and obtained UniProt IDs from blastx output (nearly 20K UniProt accessions). For my project I need to do GO analysis and pathway analysis for them, and I could not use Trinotate because I did the analysis with different software.

How can I extract GO IDs/terms from a bulk list of UniProt accessions, and then run enrichment on them?






To extract GO terms for a list of UniProtKB identifiers, use the UniProt batch retrieve tool as suggested above, but instead of mapping UniProtKB IDs to an external database, map from UniProtKB to UniProtKB.

Once you have your result, you can click on "Columns" and customize your result table layout, as described here or here.

The customization interface contains a "Gene Ontology" section, where you can choose a complete list, separate columns for the three ontologies (molecular function, biological process, cellular component), or a list of identifiers only.

You can remove all columns that are not relevant in this context, and then download the results in tab-delimited format.
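Once the tab-delimited file is downloaded, it can be turned into a lookup from accession to GO IDs. The following is only a minimal sketch: the column headers `Entry` and `Gene Ontology IDs`, the `;` separator inside a cell, and the function name `go_ids_by_accession` are assumptions, so check them against the header line of your own export. The example rows are made-up placeholders, not real annotations.

```python
import csv
import io


def go_ids_by_accession(tsv_text, acc_col="Entry", go_col="Gene Ontology IDs"):
    """Map each accession to its list of GO IDs from a tab-delimited export."""
    mapping = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        raw = row.get(go_col) or ""
        # Assumption: several GO IDs in one cell are separated by ";".
        mapping[row[acc_col]] = [t.strip() for t in raw.split(";") if t.strip()]
    return mapping


# Made-up placeholder rows in the assumed layout (not real annotation data).
example = (
    "Entry\tGene Ontology IDs\n"
    "ACC0001\tGO:0000001; GO:0000002\n"
    "ACC0002\t\n"
)
result = go_ids_by_accession(example)
print(result)
```

For a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the text to the function.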

Or you can access the UniProt website programmatically, with one query per accession number:
for a given UniProtKB identifier, e.g. Q9ZUA2, you can use this URL
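As a hedged illustration of one query per accession (not necessarily the URL this answer refers to), the sketch below builds a search URL and downloads a small tab-delimited result. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the field names `accession` and `go_id` are assumptions to verify against the current UniProt API documentation; the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the current UniProt API documentation.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_go_query_url(accession, fields=("accession", "go_id")):
    """Build a URL asking for the given fields of one accession as TSV."""
    params = {
        "query": f"accession:{accession}",
        "fields": ",".join(fields),  # assumed field names
        "format": "tsv",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_go_tsv(accession, timeout=30):
    """Download the TSV for one accession (performs a network request)."""
    with urlopen(build_go_query_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_go_tsv() to actually query the server.
url = build_go_query_url("Q9ZUA2")
print(url)
```

With around 20K accessions, one request per accession is slow, so the batch route described above is likely the better fit for the full list.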

Please don't hesitate to contact the UniProt helpdesk if you have any additional questions.

1) Convert your UniProt IDs to gene names/HGNC symbols or Gene IDs (Entrez IDs) using UniProt ID mapping.

2) Use the Entrez IDs or gene names (symbols) in GeneSCF for enrichment analysis (KEGG and GO) or annotation.

You can also use the EBI QuickGO tools to fetch GO terms/IDs programmatically.
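To make the idea of enrichment concrete, here is a minimal sketch of a generic over-representation test based on the hypergeometric distribution. It is not GeneSCF's implementation, and the function names, gene labels and term labels are hypothetical placeholders. It assumes you already have a mapping from each gene to its GO terms.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M, of which n carry the term."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def enrich(study, background, go_by_gene):
    """Return {term: p_value} for terms annotated to at least one study gene."""
    background = set(background)
    study = set(study) & background
    # Invert the gene -> terms mapping, restricted to the background set.
    term_to_genes = {}
    for gene in background:
        for term in go_by_gene.get(gene, ()):
            term_to_genes.setdefault(term, set()).add(gene)
    pvalues = {}
    for term, genes in term_to_genes.items():
        k = len(genes & study)
        if k:
            pvalues[term] = hypergeom_sf(k, len(background), len(genes), len(study))
    return pvalues


# Made-up toy data: 10 genes, TERM_ALL on every gene, TERM_RARE on three.
genes = [f"g{i}" for i in range(10)]
go_by_gene = {g: ["TERM_ALL"] for g in genes}
for g in ("g0", "g1", "g2"):
    go_by_gene[g].append("TERM_RARE")

pvals = enrich(study=["g0", "g1", "g2"], background=genes, go_by_gene=go_by_gene)
print(pvals)
```

In a real analysis many terms are tested at once, so the raw p-values would still need a multiple-testing correction.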

Dear mirzaei86.vahid,

You can use the query functions of the Python library pyuniprot.

Install it (with pip or git clone) and run an update. Find out which taxonomy identifiers fit your organisms; the example here uses human, mouse, and rat. Don't run a full update for all organisms (it takes very long).

Python code:

import pyuniprot
pyuniprot.update(taxids=[9606, 10090, 10116])  # human, mouse, rat

Use the following Python code for your problem.

If 1433E_HUMAN and A4_HUMAN are the identifiers you are looking for:

Python code:

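import pyuniprot

query = pyuniprot.query()
# Look up the entries by their UniProt entry names
entries = query.entry(name=('1433E_HUMAN', 'A4_HUMAN'))
# Take the first accession of each entry
first_accessions = [entry.accessions[0] for entry in entries]
# Fetch the GO cross-references of the same entries
gos = query.db_reference(entry_name=('1433E_HUMAN', 'A4_HUMAN'), type_='GO')
go_ids = [x.identifier for x in gos]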

Best regards

I've been doing something similar recently. Here's a way to make a data frame with the UniProt ID and all the gene ontology information. I'm not sure if this is exactly what is needed, and of course you'd still have to figure out a way to graph the information...

Download this and extract:

I extracted to my desktop. The following should make the data frame.

system("awk 'NR>=42' ~/Desktop/goa_human.gaf > ~/Desktop/goa_human_no_header.txt")
GO <- read.csv("~/Desktop/goa_human_no_header.txt", header = FALSE, sep = "\t")

GOdb <-
GO$V10 <- NULL
GO$V13 <- NULL
GO$V14 <- NULL
GO$V16 <- NULL
GO$V17 <- NULL
GO$V12 <- NULL
GO$V15 <- NULL
GO$V11 <- NULL
colnames(GO) <- c("UNIPROTID", "GOID")
colnames(GOdb)[1] <- c("GOID")
GOdb <- head(GOdb,-1)
UPwithGO <- merge(GO, GOdb, by = "GOID")
rm(GOdb, GO)
UPwithGO$go_id <- NULL

It's kind of messy to be honest, but I tried lol
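For comparison, the same first step (pairing each UniProt ID with its GO IDs from a GAF file) can be sketched in Python. This is an illustration, not a translation of the R code above: it skips header lines by their leading `!` rather than by a fixed line count, and it assumes the standard GAF layout with the object ID in column 2 and the GO ID in column 5. The function name and the sample lines are made-up placeholders.

```python
import csv
import io


def read_gaf_pairs(handle):
    """Yield (db_object_id, go_id) pairs from an open GAF file handle."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # GAF header/comment lines start with "!"; also skip short rows.
        if len(row) < 5 or row[0].startswith("!"):
            continue
        # Assumed GAF layout: column 2 = object ID, column 5 = GO ID.
        yield row[1], row[4]


# Made-up placeholder lines in the assumed layout (not real annotations).
example = (
    "!gaf-version: 2.2\n"
    "UniProtKB\tACC0001\tSYM1\tenables\tGO:0000001\tREF:1\tIEA\t\tF\n"
    "UniProtKB\tACC0001\tSYM1\tinvolved_in\tGO:0000002\tREF:2\tIEA\t\tP\n"
    "UniProtKB\tACC0002\tSYM2\tlocated_in\tGO:0000003\tREF:3\tIEA\t\tC\n"
)
pairs = list(read_gaf_pairs(io.StringIO(example)))
print(pairs)
```

For the extracted file, pass an open handle instead, for example `with open("goa_human.gaf") as fh: pairs = list(read_gaf_pairs(fh))`.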
