1000G Tabix download: EOF marker is absent


I want to download the latest release of the phased 1000 Genomes data (high coverage), which is on the hg38 build, but only for a set of samples (203 samples, to be precise)...

I have used the command line:

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz chr1 | vcf-subset -c Sample_1kgp.txt | bgzip -c > CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased_out.vcf.gz

It starts to download, but eventually I get this error, which appears to be random (sometimes it happens right after starting the download, and sometimes more than an hour in):

[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
Broken VCF: empty columns (trailing TABs) starting at chr1:35966205.
Wrong number of fields; expected 3211, got 1926.

and this error at the end too:

at /usr/local/Cellar/vcftools/0.1.16/lib/perl5/site_perl/Vcf.pm line 172, <STDIN> line 968801.
    Vcf::throw(Vcf4_1=HASH(0x7fdcbe8b2c40), "Wrong number of fields; expected 3211, got 1926. The offendin"...) called at /usr/local/Cellar/vcftools/0.1.16/lib/perl5/site_perl/Vcf.pm line 507
    VcfReader::next_data_hash(Vcf4_1=HASH(0x7fdcbe8b2c40)) called at /usr/local/Cellar/vcftools/0.1.16/lib/perl5/site_perl/Vcf.pm line 3479
    Vcf4_1::next_data_hash(Vcf4_1=HASH(0x7fdcbe8b2c40)) called at /usr/local/Cellar/vcftools/0.1.16/libexec/bin/vcf-subset line 146
    main::vcf_subset(HASH(0x7fdcbd8243c0)) called at /usr/local/Cellar/vcftools/0.1.16/libexec/bin/vcf-subset line 12

Any input on how to solve this?







It sounds like an incomplete download. I suggest you download the file first and run the command on the local file instead of the URL.
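As background, the "EOF marker is absent" warning means the stream ended before the fixed 28-byte empty BGZF block that terminates every complete BGZF file, i.e. the download was truncated. On a downloaded copy you can verify completeness with bgzip -t; if htslib tools are not at hand, the marker can also be checked directly. A minimal bash sketch (the check_bgzf_eof helper and the complete.bgz/truncated.bgz demo files are illustrative, not part of any tool):

```shell
# A complete BGZF file ends with this fixed 28-byte empty block (BGZF spec):
EOF_HEX='1f8b08040000000000ff0600424302001b0003000000000000000000'

check_bgzf_eof() {
  # Compare the file's last 28 bytes (as hex) against the expected EOF block.
  [ "$(tail -c 28 "$1" | od -An -tx1 | tr -d ' \n')" = "$EOF_HEX" ]
}

# Demo on throwaway files: a complete file is just the EOF block itself;
# a truncated copy is missing its final bytes.
printf "$(printf '%s' "$EOF_HEX" | sed 's/../\\x&/g')" > complete.bgz
head -c 20 complete.bgz > truncated.bgz

check_bgzf_eof complete.bgz  && echo "complete.bgz: EOF marker present"
check_bgzf_eof truncated.bgz || echo "truncated.bgz: EOF marker absent"
```

Running bgzip -t on your downloaded VCF performs the same kind of integrity check (and decompresses the whole file, catching mid-file corruption too).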

Your command downloads all the chr1 data through tabix and pipes it to vcf-subset and then to bgzip, which is not very efficient.

I would suggest doing it all at once with bcftools:

bcftools view -Oz -S Sample_1kgp.txt -o CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased_out.vcf.gz http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz chr1

Having said that, I tried it myself with the first 2 samples (HG00096 and HG00097) and received a similar error message the first time I ran it (it completed successfully in ~30 minutes when I tried again):

[E::bgzf_read_block] Failed to read BGZF block data at offset 26325506 expected 3300 bytes; hread returned 2888
[E::vcf_parse_format] Couldn't read GT data: value not a number or '.' at chr1:2195994
Error: VCF parse error

I would rule out a RAM problem, because bcftools works fine with a minimal memory footprint, and I agree that the problem must be in the data retrieval. Queries against such large files can be demanding on the server, and it may simply fail to respond properly at times. In fact, a simple sample-name query such as bcftools query -l remotefile failed the first time I ran it, but worked when I ran it again a few seconds later.
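Since the failures look transient, one pragmatic workaround is to wrap the remote query in a retry loop and re-run it on failure. A minimal bash sketch (the retry wrapper and the flaky demo command are illustrative, not part of bcftools):

```shell
retry() {
  # Run "$@" until it succeeds, up to 5 attempts, backing off between tries.
  n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge 5 ] && return 1
    sleep "$n"
  done
}

# Demo: a command that fails on its first two calls and succeeds on the third.
rm -f attempts
flaky() {
  echo x >> attempts
  [ "$(wc -l < attempts)" -ge 3 ]
}
retry flaky && echo "flaky succeeded after $(wc -l < attempts) attempts"
```

In practice you would wrap the real call, e.g. retry bcftools view -Oz -S Sample_1kgp.txt -o out.vcf.gz "$URL" chr1, so a transient server hiccup costs a re-run rather than a truncated file.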
