RNAseq

We want to identify lncRNAs in rainbow trout (2 treatments, 6 samples; paired-end, Illumina HiSeq 2500).
After running FastQC and HISAT2, we assembled transcripts with StringTie, merged them with stringtie --merge, and then ran gffcompare to try to get our final annotated GTF file. My commands are below (shown for one sample).

java -jar trimmomatic-0.32.jar PE -threads 8 -phred33 /home/user/MahmoodPhd/Trimmomatic-master/R_19191_1.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_2.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_1_paired.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_1_unpaired.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_2_paired.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_2_unpaired.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
gzip -d Oncorhynchus_mykiss.Omyk_1.0.dna.toplevel.fa.gz
mv GCF_002163495.1_Omyk_1.0_genomic.fna GCF_002163495.1_Omyk_1.0_genomic.fa
hisat2-build  GCF_002163495.1_Omyk_1.0_genomic.fa Omykindex
./hisat2 -t --known-splicesite-infile splicesites.txt --dta --summary-file R_19191_summary.txt -x Omykindex -1 R_19191_1.fq.gz -2 R_19191_2.fq.gz -p 8 -S R_19191.sam
samtools sort -@ 8 -o R_19340_sort.bam R_19340.sam
./stringtie /home/user/MahmoodPhd/hisat2-2.1.0-Linux_x86_64/hisat2-2.1.0/R_19340_sort.bam -l R_19340 -p 8 -G GCF_002163495.1_Omyk_1.0_genomic.gff -o R_19340.gtf
./stringtie --merge -p 8 -G /home/user/MahmoodPhd/Stringtie/GCF_002163495.1_Omyk_1.0_genomic.gtf  -o stringtie_merged1.gtf list.txt
stringtie_merged1.gtf:
# ./stringtie --merge -p 8 -G /home/user/MahmoodPhd/Stringtie/GCF_002163495.1_Omyk_1.0_genomic.gtf -o stringtie_merged1.gtf list.txt
# StringTie version 2.1.4
NC_001717.1 StringTie   transcript  1004    16642   1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    1004    4890    1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "1"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    4958    6147    1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "2"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    6468    8018    1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "3"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    8094    8166    1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "4"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    8181    14770   1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "5"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    15361   16642   1000    +   .   gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "6"; ref_gene_id ""; 
NC_001717.1 StringTie   transcript  4888    16642   1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    4888    4958    1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "1"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    6149    6291    1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "2"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    6329    6466    1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "3"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    8019    8089    1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "4"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    14767   15357   1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "5"; ref_gene_id ""; 
NC_001717.1 StringTie   exon    16573   16642   1000    -   .   gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "6"; ref_gene_id ""; 
NC_035077.1 StringTie   transcript  26931   50257   1000    +   .   gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; gene_name "LOC110523613"; ref_gene_id "LOC110523613"; 
NC_035077.1 StringTie   exon    26931   27080   1000    +   .   gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "1"; gene_name "LOC110523613"; ref_gene_id "LOC110523613"; 
NC_035077.1 StringTie   exon    33577   33719   1000    +   .   gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "2"; gene_name "LOC110523613"; ref_gene_id "LOC110523613"; 
NC_035077.1 StringTie   exon    34556   34671   1000    +   .   gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "3"; gene_name "LOC110523613"; ref_gene_id "LOC110523613"; 
NC_035077.1 StringTie   exon    40440   40620   1000    +   .   gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "4"; gene_name "LOC110523613"; ref_gene_id "LOC110523613"; 
NC_035077.1 StringTie   exon    47972   50257   1000    +   .   gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "5"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
…..
./gffcompare -r /home/user/MahmoodPhd/Stringtie/GCF_002163495.1_Omyk_1.0_genomic.gtf -G -o finalgffestimation.gtf /home/user/MahmoodPhd/Stringtie/stringtie_merged1.gtf
Our problem is with the final GTF file (finalgffestimation.gtf), shown below. When I filter it by size (200 nt) and class code (u, i, o, j, x) and then concatenate the pieces with cat, I cannot convert the result to FASTA for downstream analysis.
finalgffestimation.gtf (exons appear on separate rows from their transcript, and the filtering itself seems to work):
NC_001717.1 StringTie   transcript  1004    16642   .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; xloc "XLOC_000001"; ref_gene_id ""; cmp_ref "unknown_transcript_1"; class_code "j"; tss_id "TSS1";
NC_001717.1 StringTie   exon    1004    4890    .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "1";
NC_001717.1 StringTie   exon    4958    6147    .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "2";
NC_001717.1 StringTie   exon    6468    8018    .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "3";
NC_001717.1 StringTie   exon    8094    8166    .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "4";
NC_001717.1 StringTie   exon    8181    14770   .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "5";
NC_001717.1 StringTie   exon    15361   16642   .   +   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "6";
NC_001717.1 StringTie   transcript  4888    16642   .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; xloc "XLOC_000002"; ref_gene_id ""; cmp_ref "unknown_transcript_1"; class_code "j"; tss_id "TSS2";
NC_001717.1 StringTie   exon    4888    4958    .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "1";
NC_001717.1 StringTie   exon    6149    6291    .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "2";
NC_001717.1 StringTie   exon    6329    6466    .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "3";
NC_001717.1 StringTie   exon    8019    8089    .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "4";
NC_001717.1 StringTie   exon    14767   15357   .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "5";
NC_001717.1 StringTie   exon    16573   16642   .   -   .   transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "6";
NC_035077.1 StringTie   transcript  26931   50257   .   +   .   transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; gene_name "LOC110523613"; xloc "XLOC_000003"; ref_gene_id "LOC110523613"; cmp_ref "XM_021602508.1"; class_code "="; tss_id "TSS3";
NC_035077.1 StringTie   exon    26931   27080   .   +   .   transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "1";
NC_035077.1 StringTie   exon    33577   33719   .   +   .   transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "2";
NC_035077.1 StringTie   exon    34556   34671   .   +   .   transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "3";
NC_035077.1 StringTie   exon    40440   40620   .   +   .   transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "4";
NC_035077.1 StringTie   exon    47972   50257   .   +   .   transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "5";
NC_035077.1 StringTie   transcript  32704   50257   .   +   .   transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; gene_name "LOC110523613"; xloc "XLOC_000003"; cmp_ref "XM_021602508.1"; class_code "j"; tss_id "TSS4";
NC_035077.1 StringTie   exon    32704   33070   .   +   .   transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; exon_number "1";
NC_035077.1 StringTie   exon    40440   40620   .   +   .   transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; exon_number "2";
NC_035077.1 StringTie   exon    47972   50257   .   +   .   transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; exon_number "3";
NC_035077.1 StringTie   transcript  145370  152927  .   +   .   transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; gene_name "LOC110523749"; xloc "XLOC_000004"; ref_gene_id "LOC110523749"; cmp_ref "XM_021602762.1"; class_code "="; tss_id "TSS5";
NC_035077.1 StringTie   exon    145370  145969  .   +   .   transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; exon_number "1";
NC_035077.1 StringTie   exon    146059  146626  .   +   .   transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; exon_number "2";
NC_035077.1 StringTie   exon    146738  152927  .   +   .   transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; exon_number "3";
NC_035077.1 Gnomon  transcript  177504  190839  .   +   .   transcript_id "XM_021603017.1"; gene_id "LOC110523873"; gene_name "LOC110523873"; xloc "XLOC_000005"; ref_gene_id "LOC110523873"; cmp_ref "XM_021603017.1"; class_code "="; tss_id "TSS6";
NC_035077.1 Gnomon  exon    177504  177974  .   +   .   transcript_id "XM_021603017.1"; gene_id "LOC110523873"; exon_number "1";
NC_035077.1 Gnomon  exon    190591  190839  .   +   .   transcript_id "XM_021603017.1"; gene_id "LOC110523873"; exon_number "2";
NC_035077.1 Gnomon  transcript  201613  226609  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; gene_name "LOC110523811"; xloc "XLOC_000006"; ref_gene_id "LOC110523811"; cmp_ref "XM_021602890.1"; class_code "="; tss_id "TSS7";
NC_035077.1 Gnomon  exon    201613  202200  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "1";
NC_035077.1 Gnomon  exon    206366  206502  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "2";
NC_035077.1 Gnomon  exon    209520  209564  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "3";
NC_035077.1 Gnomon  exon    209917  209985  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "4";
NC_035077.1 Gnomon  exon    217023  217200  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "5";
NC_035077.1 Gnomon  exon    226478  226609  .   +   .   transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "6";
NC_035077.1 BestRefSeq  transcript  355926  382310  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; gene_name "LOC100135976"; xloc "XLOC_000007"; ref_gene_id "LOC100135976"; cmp_ref "NM_001124315.1"; class_code "="; tss_id "TSS8";
NC_035077.1 BestRefSeq  exon    355926  356160  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "1";
NC_035077.1 BestRefSeq  exon    365943  366101  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "2";
NC_035077.1 BestRefSeq  exon    367207  367323  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "3";
NC_035077.1 BestRefSeq  exon    368982  369110  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "4";
NC_035077.1 BestRefSeq  exon    370615  370674  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "5";
NC_035077.1 BestRefSeq  exon    374855  375049  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "6";
NC_035077.1 BestRefSeq  exon    375557  375658  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "7";
NC_035077.1 BestRefSeq  exon    375795  375901  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "8";
NC_035077.1 BestRefSeq  exon    379788  382310  .   +   .   transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "9";
NC_035077.1 Gnomon  transcript  422138  465051  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; gene_name "LOC110523159"; xloc "XLOC_000008"; ref_gene_id "LOC110523159"; cmp_ref "XM_021601631.1"; class_code "="; tss_id "TSS9";
NC_035077.1 Gnomon  exon    422138  422304  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "1";
NC_035077.1 Gnomon  exon    430842  430903  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "2";
NC_035077.1 Gnomon  exon    431903  431943  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "3";
NC_035077.1 Gnomon  exon    452095  452215  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "4";
NC_035077.1 Gnomon  exon    456772  456888  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "5";
NC_035077.1 Gnomon  exon    457131  457323  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "6";
NC_035077.1 Gnomon  exon    465044  465051  .   +   .   transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "7";
….
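In the listing above, class_code appears only on the transcript rows, while the exon rows carry just transcript_id, gene_id and exon_number. A filter that tests the class code row by row will therefore keep transcript rows and silently drop their exons. The following is a minimal sketch of one way around that, not a tested part of this pipeline: filter_by_class_code is a hypothetical helper name, it assumes a tab-separated GTF, and it assumes transcript_id values are unique (the repeated unknown_transcript_1 rows above suggest that may not hold everywhere).

```shell
# Hypothetical helper: keep transcript AND exon rows for selected class codes.
# $1: class codes to keep, e.g. "uiojx"; $2: gffcompare annotated GTF (tab-separated)
filter_by_class_code() {
    awk -F'\t' -v codes="$1" '
        function attr(key,    s) {
            # Return the quoted value of attribute `key` from column 9, or "".
            if (match($9, key " \"[^\"]*\"")) {
                s = substr($9, RSTART, RLENGTH)
                sub(/^[^"]*"/, "", s); sub(/"$/, "", s)
                return s
            }
            return ""
        }
        /^#/ { next }
        # Pass 1: remember transcript IDs whose class code is wanted.
        NR == FNR {
            if ($3 == "transcript") {
                c = attr("class_code")
                if (c != "" && index(codes, c) > 0) keep[attr("transcript_id")] = 1
            }
            next
        }
        # Pass 2: print every row (transcript and exon) of the kept transcripts.
        (attr("transcript_id") in keep)
    ' "$2" "$2"
}
```

It could be called as filter_by_class_code uiojx finalgffestimation.gtf > ujiox.gtf, which replaces the per-code awk calls and the cat step with a single pass that keeps the exon rows.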
Code for filtering:
awk '{ if ($5-$4>200) print $0 }'  /home/user/MahmoodPhd/FEELnc/merged.annotated.gtf > merged.annotated_200.gtf
awk '$20 ~ /"x"/ { print }' '/home/user/MahmoodPhd/merged.annotated_200.gtf' > x20.gtf
cat u16.gtf j20.gtf i20.gtf o20.gtf x20.gtf > ujiox.gtf
gffread /home/user/MahmoodPhd/cuffcom/ujiox.gtf  -g /home/user/MahmoodPhd/files/GCF_002163495.1_Omyk_1.0_genomic.fa -w /home/user/MahmoodPhd/ujiox.fasta
The resulting ujiox.fasta is empty.
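One likely contributor to the empty FASTA is that the size filter ($5-$4>200) is applied to each row separately, so it tests the genomic span of individual transcript and exon rows rather than the spliced transcript length. In the listing, the second exon of MSTRG.3.2 (40440 to 40620) is 181 nt and would be dropped, even though the three exons of that transcript sum to 2834 nt. gffread -w builds each sequence from exon rows, so a GTF that has lost its exons may yield nothing. Below is a sketch of a length filter that sums exons per transcript; filter_by_spliced_length is a hypothetical helper name, and it assumes a tab-separated GTF with unique transcript_id values.

```shell
# Hypothetical helper: keep transcripts whose summed exon length exceeds MIN nt.
# $1: minimum spliced length in nt (e.g. 200); $2: GTF containing exon rows
filter_by_spliced_length() {
    awk -F'\t' -v min="$1" '
        function tid(    s) {
            # Return the transcript_id value from column 9, or "".
            if (match($9, /transcript_id "[^"]*"/)) {
                s = substr($9, RSTART, RLENGTH)
                sub(/^[^"]*"/, "", s); sub(/"$/, "", s)
                return s
            }
            return ""
        }
        /^#/ { next }
        # Pass 1: add up exon lengths (GTF coordinates are 1-based, inclusive).
        NR == FNR { if ($3 == "exon") len[tid()] += $5 - $4 + 1; next }
        # Pass 2: print all rows of transcripts that are long enough.
        (len[tid()] > min)
    ' "$2" "$2"
}
```

For example, filter_by_spliced_length 200 ujiox.gtf > ujiox_200.gtf would keep whole transcripts, and the output could then be passed to the same gffread -g ... -w ... command as above, provided ujiox.gtf still contains the exon rows.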


gffcompare


error


stringtie


rnaseq
