I've seen that HMMER can be sped up with GNU Parallel: Speed of hmmsearch

I have around 100,000 sequences and a HMMER database of around 300 HMM profiles.

I’m running everything at once but I’m wondering if it’ll be faster to split up the sequences and/or split up the jobs.

There’s a few ways to do it, if I'm working with 16 threads:

  1. Should I split up the sequences equal to the number of threads and run in parallel with 1 thread each? For example, split sequences into 16 files and then running 16 jobs at 1 thread?

  2. Should I split up into around 1000 files and then run HMMSearch for each of them one at a time but at full 16 threads?

  3. Should I split up into 1000 files, then run 16 of them at a time, each with 1 thread?

What’s the best way to do it?

Are there any wrapper that do this for you?

If I wanted to do option 1, how would I do this?

For simplicity, instead of 16 threads, let's just use 4 threads.

I have my large fasta file (in reality it's a lot of different fasta files of different size that I cat together),

# The pipeline I'm working on uses seqkit as the fasta parsing and manipulation package
cat *.faa | seqkit split -p 4 -O proteins/

In proteins/ there will be 4 fasta files:

stdin.part_001.fasta, stdin.part_002.fasta, stdin.part_003.fasta, stdin.part_004.fasta

Here's my marker databases:


How would I run hmmsearch with GNU Parallel using each of these 4 fasta files that I split up?

I'm trying to adapt this code here from a previous BioStars post to my situation with the multiple files:

function hmmer() {
    n=$(basename "$1")
    hmmsearch -Z 2000000 --domZ 89 --cpu 1 -o $1.output.hmmsearch.txt stockholm89.hmm $1

export -f hmmer
find /where/the/split/files/are/ -maxdepth 1 -type f -name "*specific2splitFiles" | parallel -j 20 hmmer {}

I'm expecting an output like this but I don't understand how to adapt the parallel command:


Source link