Hi!

I tried to follow what other posts describe, but I ran into a problem that I do not fully understand.

1) I downloaded the fastq files from Garvan (ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) together with the bed file. As I understand it, the bed file is in hg19, so I had to convert it to hg38 (my_regions).

2) I got the vcf (truth.vcf) and high-confidence regions (confidence.bed) files from here: ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/

3) I ran the fastq files through the GATK pipeline, up to HaplotypeCaller. Garvan used 2 libraries for the same sample. I processed them separately and did not merge them before the variant calling.

4) To compare my query data (output of HaplotypeCaller) with the GIAB truth set, I ran these two commands, the second one to also get a .html file:

hap.py ${truth.vcf} ${query.vcf} -f ${confident} -o /hap.py_results/${outname} -r ${Hg38}

python rep.py -o /hap.py_results/"${outname}.html" "${outname}"_hap.py:/hap.py_results/"${outname}.roc.all.csv.gz"

5) I also re-ran the comparison with a similar command, which I found and tested in a GA4GH tutorial:

hap.py ${truth.vcf} ${query.vcf} -f ${confident.bed} -o /hap.py_results/"${outname}" --engine=vcfeval --engine-vcfeval-path=rtg --no-decompose --no-leftshift

6) I also ran rtg vcfeval directly, which gave the error shown below the command:

rtg vcfeval -b ${truth.vcf} -c ${query.vcf} -o /hap.py_results/rtgHC/ -t ${ref_sdf} -e ${confidence.bed} --region=${my_regions}

Error: After applying regions there were no sequence names in common
between the reference and the supplied variant sets. Check the regions
supplied by --region or --bed-regions are correct.
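One thing this error message suggests checking is whether the reference, the two VCFs and the region file all use the same sequence names (for example "chr1" versus "1"). The following is only a minimal sketch of such a check, not part of any of the tools above. The helper names and all example records are hypothetical, and the mismatch shown is an assumption for illustration, not a diagnosis of my files.

```python
def vcf_contigs(lines):
    """Collect sequence names from the CHROM column of VCF data lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def bed_contigs(lines):
    """Collect sequence names from the first column of BED lines."""
    return {
        line.split()[0]
        for line in lines
        if line.strip() and not line.startswith(("#", "track", "browser"))
    }


# Hypothetical records, invented for illustration only.
truth_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t826577\t.\tA\tG\t50\tPASS\t.",
]
query_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t826577\t.\tA\tG\t50\tPASS\t.",
]
regions_bed = ["chr1    826206  827522", "chr1    827683  827775"]
# Assumed reference naming without the "chr" prefix (hypothetical).
reference_names = {"1", "2", "3"}

names = {
    "reference": reference_names,
    "truth": vcf_contigs(truth_vcf),
    "query": vcf_contigs(query_vcf),
    "regions": bed_contigs(regions_bed),
}
common = set.intersection(*names.values())

for label, found in names.items():
    print(label, sorted(found))
print("names shared by all inputs:", sorted(common))
```

Both helpers accept any iterable of text lines, so an open file handle (or `gzip.open(path, "rt")` for compressed files) can be passed instead of the lists. An empty shared set would be consistent with the kind of error printed above.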

my_regions:

chr1    826206  827522
chr1    827683  827775

...

confidence regions:

chr1    821881  822046
chr1    823126  823188
chr1    823426  823479
chr1    823580  826321
chr1    826481  827827
chr1    827928  839291

...
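To see whether the two excerpts above share any bases, the intervals can be intersected directly. This is a minimal sketch that only uses the lines shown above (the real files are much longer); the function names are my own, and the pairwise loop is for illustration only. For the full files a dedicated tool such as `bedtools intersect` would be the more sensible choice.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples, skipping short lines."""
    intervals = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def intersect(a, b):
    """Return the overlapping parts of two interval lists (half-open BED coordinates)."""
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                overlaps.append((chrom_a, start, end))
    return sorted(overlaps)


my_regions = parse_bed("""\
chr1    826206  827522
chr1    827683  827775
""")

confidence = parse_bed("""\
chr1    821881  822046
chr1    823126  823188
chr1    823426  823479
chr1    823580  826321
chr1    826481  827827
chr1    827928  839291
""")

overlaps = intersect(my_regions, confidence)
total_bp = sum(end - start for _, start, end in overlaps)

for chrom, start, end in overlaps:
    print(chrom, start, end, end - start)
print("shared bases:", total_bp)
```

For these excerpts the sketch reports three shared segments (chr1:826206-826321, chr1:826481-827522 and chr1:827683-827775), 1248 bases in total, so at least the lines shown here do overlap partially even though the interval boundaries differ.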

RESULTS:

With the commands in 5 and 6 the problem is that I get a recall of 1.65%, while the precision is 97.07% and the F-score is 0.033.
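For reference, the F-score is the harmonic mean of precision and recall, so a very low recall pulls it down no matter how high the precision is. A minimal sketch using the rounded percentages above as inputs (the function name is my own):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.9707  # 97.07 %
recall = 0.0165     # 1.65 %

f1 = f_score(precision, recall)
print(f"F-score from the rounded values: {f1:.4f}")
```

With these rounded inputs the result is about 0.032, close to the reported 0.033; the small difference presumably comes from rounding of the percentages. Since recall is TP / (TP + FN), one possible explanation (an assumption, not something I have verified) is that truth variants lying outside the exome targets are being counted as false negatives.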

What causes the low recall, and with it the low F-score? Is it the intervals?
How can I fix the problem with the bed files? How did they do the analysis, considering that the intervals in the two bed files do not overlap?

Many thanks for your time!


