Posted 6 hours ago by anasjamshed1994

I want to write a program that

  1. counts every k-mer of a given length across all DNA sequences
  2. displays only the k-mers that occur more than a given number of times.
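
The two steps above can be sketched in a few lines. This is only a minimal illustration, assuming the sequences are already available as plain Python strings; the function names `count_kmers` and `frequent_kmers` and the example reads are made up for this sketch and are not part of the script below.

```python
from collections import Counter

def count_kmers(sequences, kmer_size):
    """Count every k-mer of length kmer_size across all sequences."""
    counts = Counter()
    for dna in sequences:
        # a sequence of length n contains n - k + 1 overlapping k-mers
        for start in range(len(dna) - kmer_size + 1):
            counts[dna[start:start + kmer_size]] += 1
    return counts

def frequent_kmers(counts, count_cutoff):
    """Keep only the k-mers seen more than count_cutoff times."""
    return {kmer: n for kmer, n in counts.items() if n > count_cutoff}

# hypothetical example reads
example = ["ATGATG", "ATGC"]
counts = count_kmers(example, 3)
print(frequent_kmers(counts, 1))  # prints {'ATG': 3}
```

`Counter` is used here only to avoid the explicit `get(kmer, 0)` bookkeeping; a plain dictionary gives the same counts.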

I have tried this script:

import os
import sys
import shutil

# convert command line arguments to variables
kmer_size = int(sys.argv[1])
print(kmer_size)
count_cutoff = int(sys.argv[2])

# define the function to split dna
def split_dna(dna, kmer_size):
    kmers = []
    # a sequence of length n has n - kmer_size + 1 k-mers, so the last start is n - kmer_size
    for start in range(0, len(dna) - kmer_size + 1, 1):
        kmer = dna[start:start+kmer_size]
        kmers.append(kmer)
    return kmers

# create an empty dictionary to hold the counts
kmer_counts = {}

# process each file with the right name
for file_name in os.listdir("."):
    if file_name.endswith(".fastq"):
        dna_file = open(file_name)

        # process each DNA sequence in a file
        for line in dna_file:
            dna = line.rstrip("\n")

            # increase the count for each k-mer that we find
            for kmer in split_dna(dna, kmer_size):
                current_count = kmer_counts.get(kmer, 0)
                new_count = current_count + 1
                kmer_counts[kmer] = new_count

# print k-mers whose counts are above the cutoff
for kmer, count in kmer_counts.items():
    if count > count_cutoff:
        print(kmer + " : " + str(count))

But it gives an error:

ValueError                                Traceback (most recent call last)
<ipython-input-42-02b791e42fca> in <module>()
      4 
      5 # convert command line arguments to variables
----> 6 kmer_size = int(sys.argv[1])
      7 print(kmer_size)
      8 count_cutoff = int(sys.argv[2])

ValueError: invalid literal for int() with base 10: '-f'

I have been trying for the last 3 months and I don't know how I can execute it. I can't change the type of any variable.

Kindly help me
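
A note on the traceback: the `<ipython-input-...>` frame suggests the code was run in a Jupyter/IPython session. There, `sys.argv` holds the arguments the kernel itself was started with (typically `-f` followed by a connection file path), so `sys.argv[1]` is `'-f'` and `int()` cannot convert it. Nothing is wrong with the variable types. One possible way around it, sketched here with a hypothetical helper called `parse_settings`, is to parse an explicit list of arguments so that the same code works both in a notebook and from a terminal:

```python
import argparse

def parse_settings(argv):
    """Parse the k-mer size and count cutoff from an explicit list of strings."""
    parser = argparse.ArgumentParser(description="Count frequent k-mers")
    parser.add_argument("kmer_size", type=int, help="length of each k-mer")
    parser.add_argument("count_cutoff", type=int,
                        help="report k-mers seen more often than this")
    return parser.parse_args(argv)

# In a notebook: pass the values yourself instead of reading sys.argv
settings = parse_settings(["4", "10"])
kmer_size = settings.kmer_size
count_cutoff = settings.count_cutoff
print(kmer_size, count_cutoff)  # prints 4 10

# In a script run from a terminal, call parse_settings(sys.argv[1:]) instead
```

Alternatively, the script can be saved to a file (for example `kmer_counter.py`, a name assumed here) and run from a terminal as `python kmer_counter.py 4 10`, or the two values can simply be assigned directly in the notebook cell (`kmer_size = 4`, `count_cutoff = 10`).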


