I have two vcf files, each with the same samples and the same variants, but with different INFO and FORMAT fields. I need to merge them together into a single vcf file.

It seems like the obvious answer would be something like bcftools concat or GATK MergeVcfs, however neither of them are working correctly for this.

The vcf files look like this;

$ cat 1.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=FL_AD,Number=1,Type=Integer,Description="Depth matching alternate (ALT) allele">
##FORMAT=<ID=FL_ADN,Number=1,Type=Integer,Description="Alternate depth on negative strand">
##FORMAT=<ID=FL_ADP,Number=1,Type=Integer,Description="Alternate depth on postitive strand">
##FORMAT=<ID=FL_DP,Number=1,Type=Integer,Description="Total depth">
##FORMAT=<ID=FL_DPN,Number=1,Type=Integer,Description="Depth on negative strand">
##FORMAT=<ID=FL_DPP,Number=1,Type=Integer,Description="Depth on postitive strand">
##FORMAT=<ID=FL_RD,Number=1,Type=Integer,Description="Depth matching reference (REF) allele">
##FORMAT=<ID=FL_RDN,Number=1,Type=Integer,Description="Reference depth on negative strand">
##FORMAT=<ID=FL_RDP,Number=1,Type=Integer,Description="Reference depth on postitive strand">
##FORMAT=<ID=FL_VF,Number=1,Type=Float,Description="Variant frequence (AD/DP)">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     603:374:229:898:546:352:295:172:123:0.671492     222:131:91:390:228:162:168:97:71:0.569231       243:123:120:972:502:470:727:378:349:0.25
2       29551282        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     616:322:294:1346:702:644:729:380:349:0.457652    242:112:130:705:333:372:462:221:241:0.343262    237:124:113:1360:673:687:1120:547:573:0.174265
2       29551282        .       CCT     TCA     .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     593:307:286:1331:699:632:679:364:315:0.44553     236:107:129:709:337:372:447:218:229:0.332863    230:119:111:1373:688:685:1085:538:547:0.167516
$ cat 2.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic Depths of REF and ALT(s) in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       AC=3;AN=6       GT:AD:DP        0/1:284,559:845 0/1:167,219:387 0/1:687,323
2       29551282        .       C       T       .       .       AC=2;AN=4       GT:AD:DP        0/1:651,558:1211        0/1:459,241:702 ./.
2       29551282        .       CCT     TCA     .       .       AC=1;AN=2       GT:AD   ./.     ./.     0/1:825,306
  • notably, 2.vcf has some empty FORMAT fields for some variants for some samples

The results of standard merge operations look like this:

bcftools concat;

$ bcftools concat -a 1.vcf.gz 2.vcf.gz --output bcftools_concat.vcf --output-type v

$ cat bcftools_concat.vcf
##fileformat=VCFv4.2
## ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     603:374:229:898:546:352:295:172:123:0.671492     222:131:91:390:228:162:168:97:71:0.569231       243:123:120:972:502:470:727:378:349:0.25
1       43804359        .       C       T       .       .       AC=3;AN=6       GT:AD:DP        0/1:284,559:845 0/1:167,219:387 0/1:687,323:.
2       29551282        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     616:322:294:1346:702:644:729:380:349:0.457652    242:112:130:705:333:372:462:221:241:0.343262    237:124:113:1360:673:687:1120:547:573:0.174265
2       29551282        .       C       T       .       .       AC=2;AN=4       GT:AD:DP        0/1:651,558:1211        0/1:459,241:702 ./.:.:.
2       29551282        .       CCT     TCA     .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     593:307:286:1331:699:632:679:364:315:0.44553     236:107:129:709:337:372:447:218:229:0.332863    230:119:111:1373:688:685:1085:538:547:0.167516
2       29551282        .       CCT     TCA     .       .       AC=1;AN=2       GT:AD   ./.:.   ./.:.   0/1:825,306

GATK;

$ gatk MergeVcfs -I 1.vcf -I 2.vcf -O gatk_concat.vcf

$ cat gatk_concat.vcf
##fileformat=VCFv4.2
## ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     603:374:229:898:546:352:295:172:123:0.671492     222:131:91:390:228:162:168:97:71:0.569231       243:123:120:972:502:470:727:378:349:0.25
1       43804359        .       C       T       .       .       AC=3;AN=6       GT:AD:DP        0/1:284,559:845 0/1:167,219:387 0/1:687,323
2       29551282        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     616:322:294:1346:702:644:729:380:349:0.457652    242:112:130:705:333:372:462:221:241:0.343262    237:124:113:1360:673:687:1120:547:573:0.174265
2       29551282        .       C       T       .       .       AC=2;AN=4       GT:AD:DP        0/1:651,558:1211        0/1:459,241:702 ./.
2       29551282        .       CCT     TCA     .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     593:307:286:1331:699:632:679:364:315:0.44553     236:107:129:709:337:372:447:218:229:0.332863    230:119:111:1373:688:685:1085:538:547:0.167516
2       29551282        .       CCT     TCA     .       .       AC=1;AN=2       GT:AD   ./.     ./.     0/1:825,306

In both cases, the lines are duplicated, not merged together.

Trying instead with bcftools merge does not seem to work right either;

$ bcftools merge --force-samples 1.vcf.gz 2.vcf.gz --output bcftools_merge.vcf

$ cat bcftools_merge.vcf
##fileformat=VCFv4.2
## ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3 2:Sample1       2:Sample2       2:Sample3
1       43804359        .       C       T       .       .       AN=6;AC=3       GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD:DP     ./.:603:374:229:898:546:352:295:172:123:0.671492:.:.    ./.:222:131:91:390:228:162:168:97:71:0.569231:.:.       ./.:243:123:120:972:502:470:727:378:349:0.25:.:.       0/1:.:.:.:.:.:.:.:.:.:.:284,559:845     0/1:.:.:.:.:.:.:.:.:.:.:167,219:387     0/1:.:.:.:.:.:.:.:.:.:.:687,323:.
2       29551282        .       C       T       .       .       AN=4;AC=2       GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD:DP     ./.:616:322:294:1346:702:644:729:380:349:0.457652:.:.   ./.:242:112:130:705:333:372:462:221:241:0.343262:.:.    ./.:237:124:113:1360:673:687:1120:547:573:0.174265:.:. 0/1:.:.:.:.:.:.:.:.:.:.:651,558:1211    0/1:.:.:.:.:.:.:.:.:.:.:459,241:702     ./.:.:.:.:.:.:.:.:.:.:.:.:.
2       29551282        .       CCT     TCA     .       .       AN=2;AC=1       GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD        ./.:593:307:286:1331:699:632:679:364:315:0.44553:.      ./.:236:107:129:709:337:372:447:218:229:0.332863:.      ./.:230:119:111:1373:688:685:1085:538:547:0.167516:.   ./.:.:.:.:.:.:.:.:.:.:.:.       ./.:.:.:.:.:.:.:.:.:.:.:.       0/1:.:.:.:.:.:.:.:.:.:.:825,306

This time, the variants are not duplicated, but the samples are duplicated instead of being merged.

Ultimately, what I want is something like this;

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1   43804359    .   C   T   .   .   AC=3;AN=6   FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD:DP  603:374:229:898:546:352:295:172:123:0.671492:0/1:284,559:845    222:131:91:390:228:162:168:97:71:0.569231:0/1:167,219:387   243:123:120:972:502:470:727:378:349:0.25:0/1:687,323
2   29551282    .   C   T   .   .   AC=2;AN=4   FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD:DP  616:322:294:1346:702:644:729:380:349:0.457652:0/1:651,558:1211  242:112:130:705:333:372:462:221:241:0.343262:0/1:459,241:702    237:124:113:1360:673:687:1120:547:573:0.174265:./.
2   29551282    .   CCT TCA .   .   AC=1;AN=2   FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD 593:307:286:1331:699:632:679:364:315:0.44553:./.    236:107:129:709:337:372:447:218:229:0.332863:./.    230:119:111:1373:688:685:1085:538:547:0.167516:0/1:825,306

Is there a way to get this output from either of these tools, or some other tool? I was trying to avoid having to resort to sed/awk/etc. if there is something already available that can do this correctly.



Source link