gravatar for martin.busch

2 hours ago by

Dear community,

I have a list of approximately 1500 RNA sequences (from Sus scrofa) that I want to BLAST against the human refseq_rna database and since the number is more or less big, I cannot use NCBI's web GUI but I have to do that locally. My goal is to get the best matching human mRNA Accession number for each pig RNA I query. Unfortunately, I struggle with thres questions:

  1. Which DB from ftp.ncbi.nlm.nih.gov/blast/db should I download?
  2. How do I restrict my BLAST to human RNA hits only?
  3. How can I get the accession number of my hit?

In the example below you can see that I managed to run BLAST, but I struggeling with the best DB to download and the best parameters to run BLAST. Ideally, my result would include not only scoring values, but also the accession number of the human RNA that matches my pig RNA query best.

blastn = "/usr/bin/blastn" blast_db = "/mnt/sdb/projects/genomic_data/blast/human/GCF_000001405.38_top_level" input = "data/fasta_for_blast.fas" evalue = 1e-6 format = 6 colnames <- c("qseqid",
              "sseqid",
              "pident",
              "length",
              "mismatch",
              "gapopen",
              "qstart",
              "qend",
              "sstart",
              "send",
              "evalue",
              "bitscore")

blast_out <- system2(command = "blastn", 
                     args = c("-db", blast_db, 
                              "-query", input, 
                              "-outfmt", format, 
                              "-evalue", evalue,
                              "-num_threads", 18,
                              "-ungapped"),
                     wait = TRUE,
                     stdout = TRUE) %>%   as_tibble() %>%    separate(col = value, 
           into = colnames,
           sep = "t",
           convert = TRUE) blast_out

> blast_out
# A tibble: 5 x 12
  qseqid sseqid       pident length mismatch gapopen qstart  qend    sstart      send   evalue bitscore
  <chr>  <chr>         <dbl>  <int>    <int>   <int>  <int> <int>     <int>     <int>    <dbl>    <dbl>
1 Test1  NC_000006.12   94.5    183       10       0    762   944 170569597 170569779 9.97e-77      294
2 Test1  NC_000006.12   97      100        3       0    942  1041 170571407 170571506 7.65e-41      175
3 Test1  NC_000006.12   95.6     68        3       0   1482  1549 170572632 170572699 2.54e-22      114
4 Test1  NC_000006.12  100       35        0       0   1758  1792 170572908 170572942 1.98e- 8       68
5 Test2  NC_000006.12   98      100        2       0   1604  1703 169704007 169704106 1.96e-42      181

(adapted from palfalvi.org/post/local-blast-from-r/)

Thank you so much for your help, any input is highly appreciated!

Best,
Martin



Source link