How to separate transposon sub-families from a FASTA sequence file?


I'm working on the classification of transposable elements and want to write the sequences of each sub-class to a separate file. Is there any code or tool that can split the entries by sub-family? The dataset contains thousands of sequence entries from different species.

I really appreciate any help or suggestion!

Dataset source: pgsb.helmholtz-muenchen.de/plant/recat/index.jsp
For example, I want to put all RLC sequences in one file, and likewise the other entries such as RLX and TXX in files of their own:

>RLC_163294|LTR_Gr_chr_04_982|LTR/Copia|02.01.01.05|29730|Gossypium
tgttagagtagttagtaaagttgttagtagttaaaactgttgtacgttcagttaacagttgagctgttaaatagttgacctgttagttatgcattcatttgagtataaaactatgagaagtctgtacttaaagatatgagttttataatgaagaaattctaagtctttgtttttaagctgcttgtttagcttaacatggtatcag
>RLX_163369|LTR_Gr_chr_10_2326|LTR|02.01.01|29730|Gossypium
tgtcacgggcaaaagtgcaaagcccgtgaccatggcataagatgtgccccatggaggtctatcgattagacaaggaacatttagcccacgagaacttgcccgattcaaaaaactgttggagaagcctgtcagattgaagcctggttggcccgataatgaagacgtggcaacttaggccaattttggt
>TXX_174935|TXX_Gr_DX404975.1_8351|MobileElement|02|29730|Gossypium
atccgtgcccatgccatgtcccagacatggtcttatgggggactctcatctcggtgccaacgccatatcccagacatggtcttacatgggacctctcataatctcaattatgccaatgccatgtcccagacatggtcttacatgggatctctttacccaaatatcatgacatttgtatccattacattcccaatgtttcaacggggcttttatcactgattctctgtcatctcatacttgagttaacattagatattttcatgaaataaatacataattgctggaaaatagcagcattaa



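You can linearize the FASTA file (code by @Pierre) so that each record sits on a single line with the header and sequence separated by a tab, search for the header pattern you want, and then convert the matching lines back to FASTA.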

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' your.fa | grep "^>RLC" | tr "\t" "\n" > RLC.fa
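If you want every sub-family written out in one pass, rather than running a separate grep for RLC, RLX, TXX and so on, something along these lines might work. It is only a sketch: the function name split_fasta_by_prefix is made up for illustration, and it assumes every header follows the >PREFIX_id|... layout shown in the question, so that the text between ">" and the first underscore can be used as the output file name.

```shell
# Hypothetical helper (not part of any tool): split a multi-FASTA file into
# one file per header prefix, e.g. RLC.fa, RLX.fa, TXX.fa.
# Assumption: headers look like ">RLC_163294|..." so the prefix is the text
# between ">" and the first underscore.
split_fasta_by_prefix() {
    input="$1"          # path to the multi-FASTA file
    outdir="${2:-.}"    # output directory, defaults to the current one
    mkdir -p "$outdir" || return 1
    awk -v outdir="$outdir" '
        /^>/ {
            # Strip the leading ">" and everything from the first "_" onward.
            prefix = substr($0, 2)
            sub(/_.*/, "", prefix)
            out = outdir "/" prefix ".fa"
        }
        # Write the header and its sequence lines to the current prefix file.
        # Within one awk run, ">" truncates a file only when it is first opened.
        out != "" { print > out }
    ' "$input"
}
```

Calling split_fasta_by_prefix your.fa split_out would then leave one .fa file per prefix in split_out, overwriting files of the same name from an earlier run. Sequences keep their original line wrapping because the records are not linearized first. With a very large number of distinct prefixes, some awk implementations may hit a limit on simultaneously open files.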

