How to concatenate FASTA files in a specific order based on sequence ID

I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).

I have two fasta files and what I want to do is to merge them but in a way that similar sequence ids should be printed back to back - and any sequence ids in one of the files that do not exist in another file then those should not be printed.

cat f1.fa

>seq1
ATCGTCA
>seq2
AAAAACT
>seq3
AACATCA
>seq71
CCCGA

cat f2.fa

>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA

so the desired output should look like :

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

I'd prefer to use python as that is the language I'm learning but any solution will suffice.

Thanks for helping.


sequence


fasta


awk


python


pipeline

• 102 views

using linearizefasta.awk

join -t $'t' -1 1 -2 1 
         <(awk -f linearizefasta.awk in1.fa | sort -t $'t' -k1,1) 
         <(awk -f linearizefasta.awk in2.fa | sort -t $'t' -k1,1) |
awk '{printf("%sn%sn%sn%sn",$1,$2,$1,$3);}'

The following script should do what you want as long as you don't have duplicate sequence names in the same file (as you do in f2.fa in your example but hopefully not in your real data). You will need to install the pyfaidx module (e.g. with pip3 install faidx --user).

#!/usr/bin/env python3
import sys
from pyfaidx import Fasta


def cat_fastas(*fasta_files):
    fas = list()
    all_contigs = set()
    for f in fasta_files:
        faidx = Fasta(f)
        fas.append(faidx)
        contigs = set(faidx.keys())
        all_contigs.update(contigs)
    for c in sorted(all_contigs):
        for f in fas:
            if c in f:
                print(">" + f[c].long_name)
                print(f[c])


if __name__ == '__main__':
    if len(sys.argv) < 3:
        sys.exit("Usage: {} f1.fa f2.fa ...fN.fa".format(sys.argv[0]))
    cat_fastas(*sys.argv[1:])

Because you technically shouldn't have two fasta records with the same sequence name in the same file you might want to consider checking for duplicates and adding a suffix to any duplicate sequences too.

I missed the part of your question

and any sequence ids in one of the files that do not exist in another file then those should not be printed

In which case:

#!/usr/bin/env python3
import sys
from pyfaidx import Fasta


def cat_fastas(f1, f2):
    fas1 = Fasta(f1)
    fas2 = Fasta(f2)
    overlapping_contigs = set(fas1.keys()).intersection(set(fas2.keys()))
    for c in sorted(overlapping_contigs):
        for f in (fas1, fas2):
            print(">" + f[c].long_name)
            print(f[c])


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit("Usage: {} f1.fa f2.fa".format(sys.argv[0]))
    cat_fastas(*sys.argv[1:]

Should do what you want.

if the order of the elements with the same names does not really matters you could do the following:

sort -V <(cat <yourFile1> |paste - -) <(cat <yourFile2> |paste - -) | sed 's/t/n/g'

[replace <yourFile> parts by your input fasta files]

the part between the brackets is a bash subprocess and will transform the fasta file to ID and seq on 1 line, for both files. Then sort will order the combination of both files and finally the sed part will transform the output back to fasta format (where ID on a line and next line is the seq)

OK, missed this part as well :

and any sequence ids in one of the files that do not exist in another
file then those should not be printed

then it's all for Pierre Lindenbaum 's solution 🙂

with OP data

$ cat seq1.fa seq2.fa | paste - -  | sort -sk1,1 | tr "t" "n"                                                                                                                     

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq3
AACATCA
>seq71
CCCGA
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA

If fasta files are not linearized, try with seqkit:

$ seqkit fx2tab seq1.fa seq2.fa | sort -sk1,1 | seqkit tab2fx                                                                                                                         

Round about is way due to limitations in seqkit sort. Seqkit doesn't allow duplicate headers while sorting.


Login
before adding your answer.

Traffic: 2131 users visited in the last hour



Source link