Hi, I have discovered that there is a problem with the database I created, which is that my pipeline is not handling multiple genes in a single fasta entry. Carbohydrate active enzyme (CAZyme) genes often contain a GH and a CBM close to one another, and in my current database that looks like:

>NZ_LR732827.1_2243__GH81(350-867)+CBM32(1336-1449)+CBM32(1571-1687)+CBM32(1705-1820) Curtobacterium_sp_8I-2
WGDWNVSWRMPQDGPGGQYMDVTSVQGSPY... [truncated for ease of viewing]

The number in parentheses reflects the position of the gene in the sequence that follows it. I have read through lots of solutions to similar problems but I can't work out how to apply them to my specific situation.

My required output would contain:


With the sequence corresponding to the position in parentheses. I understand that bedtools getfasta will let me split based on position so if it's not possible to get the above in one step, if I could simply duplicate the whole sequence for each new entry then I will be able to extract the relevant part afterwards.

I also would like this code to look through my whole fasta which contains mostly one gene per header and ignore those, for those are already in the format I want, I just want it to create a new line and duplicate sequence for every "+" it sees.

I believe the best tool for this would be biopython, but my current attempt:

from Bio import SeqIO

for record in SeqIO.parse("file.fa", "fasta"):
    species = record.id.split('+')[0]
    with open(f"{species}.fa", "a") as f:
        SeqIO.write(record, f, "fasta")

Is simply cutting off everything after the first "+" and also is missing the Curtobacterium name at the end as it is separated by a space I think.

Thank you so much for your help!! Please let me know if there is anything that is unclear

Source link