How can I extract protein sequences from a GFF file?


I have a GFF file containing genes predicted by AUGUSTUS; the file already contains CDS, exons, and protein sequences.
I need to extract the protein sequences from the file using bash.





Oh, I see. Strange, and more complicated output. Not impossible though. I'm assuming you also want the hash (#) and space removed from the beginning of each protein sequence line.

Let me know if this does the trick for you:

 awk '/# protein sequence/{a=1}/# Evidence/{a=0}a' Genes.gff | sed 's/^# //'
  • /# protein sequence/ and /# Evidence/ are patterns that match lines containing that text.
  • /# protein sequence/{a=1} sets the flag a when a line containing # protein sequence is found.
  • /# Evidence/{a=0} clears the flag when a line containing # Evidence is found.
  • The final a is a pattern with the default action, which is to print $0: the line is printed whenever the flag equals 1. Because the flag is cleared before this pattern is tested, the # Evidence line itself is not printed.
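
If you would rather end up with one FASTA-style record per gene, here is a rough sketch along the same lines. It is not the command above: instead of relying on the # Evidence marker it stops at the closing ], and it assumes each gene block begins with a "# start gene <id>" comment and that the sequence is wrapped in [ ... ], as in typical AUGUSTUS output. The function name extract_proteins and the sample records are made up for illustration, so check it against your own file first.

```shell
# Hypothetical helper: reads AUGUSTUS-style GFF from a file argument or stdin
# and prints one FASTA-like record per protein sequence block.
extract_proteins() {
  awk '
    /^# start gene /      { gene = $4 }             # remember the current gene ID
    /^# protein sequence/ { inseq = 1; seq = "" }   # a new sequence block begins
    inseq {
      line = $0
      sub(/^# /, "", line)                          # drop the leading "# "
      sub(/^protein sequence = \[/, "", line)       # drop the header text and "["
      closed = sub(/\]$/, "", line)                 # 1 if this line ended with "]"
      seq = seq line                                # join wrapped lines
      if (closed) { print ">" gene; print seq; inseq = 0 }
    }
  ' "$@"
}

# Made-up sample input (feature lines omitted); real AUGUSTUS output will differ.
result=$(extract_proteins <<'EOF'
# start gene g1
# protein sequence = [MSTA
# VLKQ]
# Evidence for and against this transcript:
# end gene g1
# start gene g2
# protein sequence = [MKK]
# Evidence for and against this transcript:
# end gene g2
EOF
)

printf '%s\n' "$result"
# >g1
# MSTAVLKQ
# >g2
# MKK
```

On a real file the call would be something like extract_proteins Genes.gff > proteins.fa. Note that the header is just the gene ID, so if a gene has several transcripts they would all get the same header here.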

