GATK HaplotypeCaller takes too much time for variant calling


Hi all,

I am using GATK HaplotypeCaller to genotype ~3000 SNPs from amplicon sequencing data, with the options --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles $vcf --output_mode EMIT_ALL_SITES.

However, it takes a very long time to call these variants: about 20 h.

Adding the -nct option did not help; the run time even increased to 23 h.

Is there any option to accelerate the calling process?

Thanks,

Junfeng


GATK


HaplotypeCaller


long


GATK4 HaplotypeCaller no longer has the -nt or -nct options. A Spark version of HaplotypeCaller is in development so that the program can be parallelised, but it is only in beta and is not yet recommended.

To speed things up, I run HaplotypeCaller with the -L option, once for each chromosome, and then concatenate the results with bcftools (cat won't work because of the file headers).

As a bash script:

files=()
for chr in {1..22} X Y; do  # for each chromosome
    # --native-pair-hmm-threads replaces -nt/-nct; default = 4 (1/3 extra runtime)
    gatk HaplotypeCaller \
        --input example.bam \
        --output "$chr.g.vcf.gz" \
        --reference human.fasta \
        --emit-ref-confidence GVCF \
        --dbsnp knownsnps.vcf \
        --native-pair-hmm-threads 32 \
        -L "$chr"
    files+=("$chr.g.vcf.gz")  # remember the outputs in chromosome order
done
bcftools concat -o example.gvcf "${files[@]}"  # inputs must be in order
rm *.gz *.tbi  # removes every .gz and .tbi file in the working directory
gzip example.gvcf
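
The "must be in order" requirement is easy to break if the inputs are collected with a glob, because a glob sorts names lexicographically (1, 10, 11, ... rather than 1, 2, 3). A minimal sketch of one way around it: write the names to a list file in loop order and pass that file to bcftools concat with its -f/--file-list option. The list name concat_list.txt and the .g.vcf.gz suffix are assumptions made for this sketch, not names from the answer above.

```shell
#!/bin/sh
# Sketch: record the per-chromosome outputs in the order they should be
# concatenated. "concat_list.txt" and the ".g.vcf.gz" suffix are assumptions
# for illustration; use whatever names your HaplotypeCaller loop wrote.
list=concat_list.txt
suffix=.g.vcf.gz

: > "$list"                          # start with an empty list file
for chr in $(seq 1 22) X Y; do
    echo "${chr}${suffix}" >> "$list"
done

# Compare the loop order with plain byte-wise sorting (what a glob would give).
loop_second=$(sed -n 2p "$list")
sorted_second=$(LC_ALL=C sort "$list" | sed -n 2p)
echo "second file in loop order:   $loop_second"
echo "second file in sorted order: $sorted_second"

# With the per-chromosome files present, the list can be handed to bcftools:
# bcftools concat -f "$list" -o example.gvcf
```

The list file also leaves a record of exactly which inputs went into the merged gVCF.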

Use -nt instead of -nct.

Hello. This is a few years late, but I hope it helps someone else. I made an improved version of YaGalbi's code above that is much faster (9-fold faster). Adding & to the end of the command inside the chromosome loop runs each chromosome as a background process, so all of them run simultaneously. A wait after the loop makes the shell pause until every process has finished before it continues to the concatenation step. However, you must set --java-options "-Xmx__g" and --native-pair-hmm-threads so that the totals across all chromosomes being computed do not exceed the machine's limits. For example, I run on an HPC server and give each of my 10 chromosomes 16 GB RAM and 6 CPUs. That comes to 160 GB RAM and 60 CPUs, so I allocated 172 GB RAM and 64 CPUs (a little extra just in case). This only works if you can allocate these kinds of resources. Here is the updated code:

files=()
for chr in {1..22} X Y; do  # for each chromosome
    # the trailing & sends each gatk run to the background
    gatk --java-options "-Xmx16g" HaplotypeCaller \
        --input example.bam \
        --output "$chr.g.vcf.gz" \
        --reference human.fasta \
        --emit-ref-confidence GVCF \
        --dbsnp knownsnps.vcf \
        --native-pair-hmm-threads 6 \
        -L "$chr" &
    files+=("$chr.g.vcf.gz")  # remember the outputs in chromosome order
done
wait  # block until every background job has finished
bcftools concat -o example.gvcf "${files[@]}"  # inputs must be in order
rm *.gz *.tbi  # removes every .gz and .tbi file in the working directory
gzip example.gvcf
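
The resource arithmetic in this answer can be checked before submitting the job. The sketch below multiplies the per-job settings by the number of simultaneous jobs, using the figures quoted above (10 jobs, 16 GB, 6 threads). The variable names are made up for illustration. Note that -Xmx caps only the Java heap, so each JVM uses somewhat more memory than that figure, which is one reason to request a little headroom.

```shell
#!/bin/sh
# Figures taken from the answer above; change them to match your own run.
n_jobs=10        # chromosomes processed at the same time
mem_per_job=16   # GB given to each JVM via -Xmx
cpu_per_job=6    # threads per job via --native-pair-hmm-threads

# Minimum totals if every job is running at once.
min_mem=$(( n_jobs * mem_per_job ))
min_cpu=$(( n_jobs * cpu_per_job ))

echo "request at least ${min_mem} GB RAM and ${min_cpu} CPUs, plus some headroom"
```

If the scheduler cannot grant that much, lowering n_jobs (running the chromosomes in batches) reduces both totals proportionally.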

