I have some Sanger sequence data and a database of Illumina sequences. I want to match each Sanger sequence to its corresponding Illumina sequence. This is the code I used:

    from Bio import SeqIO
    from Bio import pairwise2
    from Bio import Seq

    # Database of Illumina sequences
    fasta_sequences = SeqIO.parse(open("taxa_all.fasta"), 'fasta')

    # The Sanger sequence
    with open("isolation-round1/778291/High_Intensity/18_F.ab1.seq") as myfile:
        isolate = myfile.readline()

    score = []
    name = []
    for fasta in fasta_sequences:
        n, sequence = fasta.id, str(fasta.seq)
        for a in pairwise2.align.localxx(isolate, sequence):
            al1, al2, s, begin, end = a
            score.append(s)
            name.append(n)

When I ran this code, which took about 5 min, the scores I obtained were very close to each other, and I was therefore unable to say with certainty what the correct mapping was.
So then I changed it to

    for a in pairwise2.align.localms(isolate, sequence, 2, -1, -.5, -.1):

keeping everything else the same. This code took about 3 h to run! On examining the results, I noticed that there were 1000 repeats of each score, but I don't understand why. Am I doing something wrong? Is there a way to speed up the process?
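One possible reading of the repeats, offered as a guess rather than a confirmed diagnosis: the inner `for a in ...` loop appends the score once per returned alignment. As far as I know, `pairwise2` returns every co-optimal alignment up to a cap (`MAX_ALIGNMENTS`, 1000 by default), so each database sequence would contribute the same score up to 1000 times. I believe `pairwise2` also accepts `score_only=True` and `one_alignment_only=True`, which should avoid enumerating those alignments. The sketch below shows the "one score per database sequence" idea in plain Python so it runs without Biopython. The names `local_score` and `rank_hits` and the toy sequences are hypothetical, and it uses a single linear gap penalty instead of the open/extend penalties passed to `localms` above.

```python
def local_score(a, b, match=2.0, mismatch=-1.0, gap=-0.5):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Only the score is computed; no traceback is done, so no list of
    co-optimal alignments is ever built.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for ca in a:
        curr = [0.0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0.0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            if cell > best:
                best = cell
        prev = curr
    return best


def rank_hits(query, database):
    """Return (name, score) pairs, one per database entry, best first."""
    query = query.strip()  # readline() keeps the trailing newline
    hits = [(name, local_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, made up for illustration only (not from taxa_all.fasta).
database = {"taxonA": "ACGTACGTTT", "taxonB": "GGGGCCCCAA"}
print(rank_hits("ACGTACGT\n", database))
```

With one entry per database sequence, the `score` and `name` lists stay the same length as the database, and the best hit is simply the first element of the ranked list. A pure-Python loop like this is not fast on its own; the point is only the score-only, one-result-per-sequence structure.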