How to get the genes that contain the highest number of isoforms from a fasta file generated by trinity de novo?


I have a trinity file in a fasta format and I want to get the gene(s) with the highest no. of isoforms.






this could work.

You should have a file called: Trinity.fasta.gene_trans_map link

Upload the file in R:

Trinity_fasta_gene_trans_map <- read_delim("D:/Download/Trinity.fasta.gene_trans_map.txt", 
    "t", escape_double = FALSE, col_names = FALSE, 
    trim_ws = TRUE)


# A tibble: 6 x 2
  #X1                 X2                   
  #<chr>              <chr>                
#1 TRINITY_DN1_c0_g1  TRINITY_DN1_c0_g1_i1 
#2 TRINITY_DN1_c0_g2  TRINITY_DN1_c0_g2_i1 
#3 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i1
#4 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i2
#5 TRINITY_DN14_c0_g1 TRINITY_DN14_c0_g1_i1
#6 TRINITY_DN15_c0_g1 TRINITY_DN15_c0_g1_i1

out<-as.matrix(table(Trinity_fasta_gene_trans_map$X1)) #get the number of isoform per gene
TRINITY_DN1_c0_g1      1
TRINITY_DN1_c0_g2      1
TRINITY_DN101_c0_g1    1
TRINITY_DN101_c0_g2    1
TRINITY_DN102_c0_g1    1
TRINITY_DN102_c0_g2    1

The output of a Trinity fasta assembly has headers following a convention, see the documentation for details:

>TRINITY_DN17755_c0_g1_i1 len=221 path=[0:0-220]
>TRINITY_DN17876_c0_g1_i1 len=254 path=[0:0-253]
>TRINITY_DN17736_c0_g1_i1 len=413 path=[0:0-412]

The c, g and i stand for clusters, genes and isoforms - note those are somewhat loosely determined, as short reads transcriptome assemblies are very noisy in general.

To get the single "gene" with the most "isoforms", you can use standard linux command line utilities:

grep ">" Trinity.fasta 
  | cut -d " " -f1 
  | sort -t"_" -n -r -k5.2 
  | head -n1

To get a list of the "N" genes with the most isoforms would be a bit more complicated, and probably involve some scripting, parsing, sorting and filtering the fields from the fasta headers.

before adding your answer.

Traffic: 1740 users visited in the last hour

Source link