I want to write a program that
- calculates the number of all kmers of a given length across all DNA sequences
- displays just the ones that occur more than a given a number of times.
I have tried this script:
import os
import sys
import shutil
# convert command line arguments to variables
kmer_size = int(sys.argv[1])
print(kmer_size)
count_cutoff = int(sys.argv[2])
# define the function to split dna
def split_dna(dna, kmer_size):
kmers = []
for start in range(0,len(dna)-kmer_size-1,1):
kmer = dna[start:start+kmer_size]
kmers.append(kmer)
return kmers
# create an empty dictionary to hold the counts
kmer_counts = {}
# process each file with the right name
for file_name in os.listdir("."):
if file_name.endswith(".fastq"):
dna_file = open(file_name)
# process each DNA sequence in a file
for line in dna_file:
dna = line.rstrip("n")
# increase the count for each k-mer that we find
for kmer in split_dna(dna, kmer_size):
current_count = kmer_counts.get(kmer, 0)
new_count = current_count + 1
kmer_counts[kmer] = new_count
# print k-mers whose counts are above the cutoff
for kmer, count in kmer_counts.items():
if count > count_cutoff:
print(kmer + " : " + str(count))
But it gives an error:
ValueError Traceback (most recent call last)
<ipython-input-42-02b791e42fca> in <module>()
4
5 # convert command line arguments to variables
----> 6 kmer_size = int(sys.argv[1])
7 print(kmer_size)
8 count_cutoff = int(sys.argv[2])
ValueError: invalid literal for int() with base 10: '-f'
I have been trying from last 3 months I don't know he can I execute it? I can't change the type of any variable
Kindly help me