2 hours ago by KHP

Hello,

I have some Sanger sequence data and a database of Illumina sequences. I want to match each Sanger sequence to its corresponding Illumina sequence. This is the code I used:

from Bio import SeqIO
from Bio import pairwise2

# The Sanger sequence; strip the trailing newline so it is not aligned as a character
with open("isolation-round1/778291/High_Intensity/18_F.ab1.seq") as myfile:
    isolate = myfile.readline().strip()

score = []
name = []

# Database of Illumina sequences; SeqIO.parse accepts a filename directly
for fasta in SeqIO.parse("taxa_all.fasta", "fasta"):
    n, sequence = fasta.id, str(fasta.seq)
    # One entry is appended per alignment returned for this pair
    for a in pairwise2.align.localxx(isolate, sequence):
        al1, al2, s, begin, end = a
        score.append(s)
        name.append(n)

When I ran this code, which took about 5 min, the scores I obtained were very close to each other, and I was therefore unable to say with certainty what the correct mapping was.
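A possible reason for the near-identical scores: as far as I understand, localxx gives 1 point per identical character and applies no mismatch or gap penalties, so the score reduces to the length of the longest common subsequence of the two sequences, which is large even for unrelated DNA of similar length. Below is a minimal pure-Python sketch of that quantity. The function name lcs_length is hypothetical and not part of Biopython; it only illustrates the scoring, not Biopython's implementation.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Assumption: this mirrors a match=1, no-penalty alignment score.
    It is an illustration only, not Biopython's own code.
    """
    # prev[j] holds the LCS length of the processed prefix of a and b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character in one sequence at no cost
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ACCGGT", "ACGT"))
    print(lcs_length("AGCAT", "GAC"))
```

Because skipping characters costs nothing under this scheme, any two long sequences over a four-letter alphabet share a long common subsequence, which would make the scores hard to tell apart.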

So I changed the alignment call to:

for a in pairwise2.align.localms(isolate,sequence,2, -1, -.5, -.1):

keeping everything else the same. This version took about 3 h to run! On examining the results, I noticed that there were 1,000 repeats of each name and score, and I don't understand why. Am I doing something wrong? Is there a way to speed up the process?
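A possible explanation for the repeats: pairwise2 returns every equally scoring alignment it finds, up to a cap (pairwise2.MAX_ALIGNMENTS, which I believe defaults to 1000), and the loop appends one entry per alignment. Since only the score is used here, passing score_only=True should skip the traceback and return a single number per pair. The sketch below works under those assumptions. The helper names rank_matches and biopython_local_score are hypothetical, and the scoring function is passed in as an argument so the ranking logic can be tried without Biopython installed.

```python
from typing import Callable, Iterable, List, Tuple


def biopython_local_score(query: str, target: str) -> float:
    """Score-only local alignment using the parameters from the post.

    Assumes Biopython is installed and still ships the pairwise2 module
    (deprecated in recent releases, as far as I know).
    """
    from Bio import pairwise2  # imported lazily; only this scorer needs it

    # score_only=True returns just the best score instead of a list of alignments
    return pairwise2.align.localms(query, target, 2, -1, -0.5, -0.1,
                                   score_only=True)


def rank_matches(
    query: str,
    database: Iterable[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return one (name, score) pair per database entry, best score first."""
    scored = [(name, score_fn(query, seq)) for name, seq in database]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy scorer (count of identical positions) standing in for the aligner
    def toy_score(q: str, t: str) -> int:
        return sum(x == y for x, y in zip(q, t))

    db = [("taxonA", "ACGTAC"), ("taxonB", "ACGTTT"), ("taxonC", "TTTTTT")]
    print(rank_matches("ACGTTT", db, toy_score))
```

With Biopython available, the database argument could be built from the FASTA file as ((r.id, str(r.seq)) for r in SeqIO.parse("taxa_all.fasta", "fasta")) and biopython_local_score passed as score_fn. If the alignment itself is needed, one_alignment_only=True is another pairwise2 option that should limit the output to a single alignment per pair.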
