Hey, I'm a new bioinformatics student working on a project. I want to replace some nucleotides with a gap character, "-", to mark them as missing: say, a stretch at the beginning of each sequence and a stretch at the end. How should I go about this in a scalable way?
This is the code I have so far. I'm not sure how to edit these sequences. Is it better to use a NumPy array? And what should I use to write the edited sequences back out?
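import numpy as np

fasta = {}
with open('example.fasta') as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start (or reuse) the entry for this sequence name
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        # Sequence line: append it to the current record
        sequence = line
        fasta[active_sequence_name].append(sequence)
# Note: np.array on a dict gives a 0-d object array, not a character matrix
seqMat = np.array(fasta)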
Output:
{'seq1': ['AAATATATATATATATATTATATATTATATATATTATATATATAT'],
 'seq2': ['GCGCGAGATAGGGCGCGCGCGCGCGATTAGCGAGGCGCGCGCGGC'],
 'seq3': ['TCTCTCTCTCTCTCTTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC']}
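For the dict version, one possible approach (a minimal sketch, not the only way) is to join each record's lines into a single string and rebuild it with slicing, since Python strings can't be edited in place. The names `mask_ends`, `to_fasta_text`, `n_start` and `n_end` are made up for illustration, the mask sizes 3 and 5 are arbitrary, and the `fasta` dict is hard-coded here in the same shape as the output above.

```python
def mask_ends(seq, n_start, n_end, gap="-"):
    """Replace the first n_start and last n_end characters of seq with gap."""
    # Clamp so the two masked regions never overlap or exceed the sequence.
    n_start = min(n_start, len(seq))
    n_end = min(n_end, len(seq) - n_start)
    middle = seq[n_start:len(seq) - n_end]
    return gap * n_start + middle + gap * n_end


def to_fasta_text(records):
    """Format a {name: sequence} dict as FASTA text, one line per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Same shape as the dict printed above: name -> list of sequence lines.
fasta = {
    "seq1": ["AAATATATATATATATATTATATATTATATATATTATATATATAT"],
    "seq2": ["GCGCGAGATAGGGCGCGCGCGCGCGATTAGCGAGGCGCGCGCGGC"],
    "seq3": ["TCTCTCTCTCTCTCTTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC"],
}

# Mask 3 positions at the start and 5 at the end of every sequence.
masked = {name: mask_ends("".join(lines), 3, 5) for name, lines in fasta.items()}
fasta_text = to_fasta_text(masked)
```

The resulting `fasta_text` string can then be written with a plain `open(..., "w")` and `write`. Each sequence is handled independently, so this also works when the sequences have different lengths.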
And this is what I have as an array. What is the best way to replace nucleotides?
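from Bio import SeqIO
import numpy as np

allSeqs = []
# Parse every record and keep only its sequence
with open("example.fasta") as pathToFile:
    for seq_record in SeqIO.parse(pathToFile, "fasta"):
        allSeqs.append(seq_record.seq)
# One row per sequence, one column per position (needs equal-length sequences)
seqMat = np.array(allSeqs)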
Output:
array([['A', 'A', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A',
'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T',
'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T',
'A', 'T', 'A', 'T', 'A', 'T'],
['G', 'C', 'G', 'C', 'G', 'A', 'G', 'A', 'T', 'A', 'G', 'G', 'G',
'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'A',
'T', 'T', 'A', 'G', 'C', 'G', 'A', 'G', 'G', 'C', 'G', 'C', 'G',
'C', 'G', 'C', 'G', 'G', 'C'],
['T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T',
'C', 'T', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T',
'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C',
'T', 'C', 'T', 'C', 'T', 'C']], dtype='<U1')
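With a character matrix like this one, the replacement can be done with slice assignment on the columns, which masks every sequence at once. Below is a small sketch under some assumptions: it uses three short made-up sequences of equal length instead of the real file, the names `n_start` and `n_end` and their values are arbitrary, and it builds the matrix from plain strings so it runs without Biopython.

```python
import numpy as np

# Toy, equal-length sequences (made up for illustration).
seqs = ["ACGTACGTAC", "GGGCCCAAAT", "TTTTAAAACC"]

# One row per sequence, one column per position; dtype becomes '<U1'.
seq_mat = np.array([list(s) for s in seqs])

n_start, n_end = 2, 3  # how many positions to mask at each end

masked = seq_mat.copy()                      # keep the original matrix untouched
masked[:, :n_start] = "-"                    # first n_start columns of every row
masked[:, masked.shape[1] - n_end:] = "-"    # last n_end columns of every row

# Turn each row back into a string, e.g. for writing out as FASTA.
masked_seqs = ["".join(row) for row in masked]
print(masked_seqs)  # ['--GTACG---', '--GCCCA---', '--TTAAA---']
```

This only works when all sequences have the same length (for example, an alignment). For sequences of different lengths, per-sequence string slicing is probably the simpler route.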