gravatar for xatabadich

2 hours ago by

Hello everyone!

I developed a little script to search for two primers in reads which are converted into .fa format. so the script is able to find both primers in one read using bbduk.

The problem is that the script is not as efficient as it could be. It takes a lot of time to process. According to the code below(Python) it searches for the first primer in reads that contain the primer, then deletes the initial file that contains all the reads, after that it searches for the second primer and gives and an empty files that do not contain both primers.

The question is: how can I improve the processing speed by changing the code or using a different approach. Or does something similar already exist as an example?

 #!/usr/bin/env python3
import csv
import os


#determine tool directories
bbduk = '/home/vladislav.shevtsov/miniconda3/envs/mlvapython3/bin/bbduk.sh'

#determine files directories
data_in_dir = '/home/vladislav.shevtsov/bbduk_search/reads'
data_out_dir = '/home/vladislav.shevtsov/bbduk_search/output/'

fastq_gz_row_list = sorted([os.path.join(data_in_dir, i) for i in os.listdir(data_in_dir) if i.endswith('.gz')])

def generate_bbduk_output_name(name, suffix = 'OUT'): #Generates the output name
    a, b = os.path.splitext(name)
    new_name = a + '_'+ suffix + b
    return new_name

doc='/home/vladislav.shevtsov/bbduk_search/Primers_Pseudo_7fix_run.txt' # Path to Primers.txt



bash_file = 'bbdu1.sh'
with open(bash_file, 'w') as f, open(doc,'r') as primer:
    file_reader=csv.reader(primer,delimiter='t')
    f.write('#!/bin/bash n')
    for lines in file_reader:

        for sample in fastq_gz_row_list:
            name = os.path.basename(sample)
            name, _ = os.path.splitext(name)
            out_sample = os.path.join(data_out_dir, name)
            out_sample = generate_bbduk_output_name(out_sample, lines[0])
            out_sample2 = generate_bbduk_output_name(out_sample)
            #print(out_sample)
            if not os.path.isfile(out_sample):                
              f.write(f'{bbduk} in={sample} outm={out_sample} literal={lines[1]} mm=f k=22 minlen=1 hdist=1 forbidn=t threads=18 n') #searches primer 1 ... run_1
              f.write(f'{bbduk} in={out_sample} outm={out_sample2} literal={lines[2]} mm=f k=10 minlen=1 hdist=1 forbidn=t threads=18 n') #searches primer 2 from previous run
              f.write(f'rm {out_sample} n' ) #removes run_1 files



Source link