Checking integrity of SRA downloaded fastq files
I have been downloading data using the SRA toolkit with prefetch + fastq-dump and, more recently, fasterq-dump. I saw a variety of messages while troubleshooting, but I still get this one regularly:
fasterq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned-76 ( NET - Reading information from the socket failed )
The process continues anyway, and eventually I get the readout stating the number of spots and reads read.
I'm now curious about checking the integrity of the downloaded data - I've read that you can use MD5 checksums for the SRA file but not the fastq files. Now that I already have the fastq files, is there another way I can check the integrity? Is having the correct number of "reads read/written" printed out after fasterq-dump is finished enough to confirm correct downloading and processing or is there something I'm not thinking of?
I ran into the same issue: vdb-validate does not work if you download the fastq files without prefetching the SRA file, because you are left with only a fastq file, which cannot be validated. It is strange that no simple checksum is provided; maybe we can get the admins to include one? Anyhow, if you download the SRA metadata file, one of its columns gives the total number of bases for the run. What I did next was count the total bases in either the _1 or the _2 file and multiply by 2 if the run is paired (this assumes both mates have the same read length). For single-end data, leave the count as is.
Here is a simple bash script that you can use. It prints the total number of bases in a gzipped fastq file, which you can then compare against the value in the metadata.
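#!/bin/bash
# Usage: pass a gzipped fastq file as the first argument.
F=$1
# Sequence lines are every 4th line starting at line 2; sum their lengths.
gzip -dc "$F" |
awk 'NR%4==2 { l += length($0) }
END {
    print l + 0   # "+ 0" prints 0 instead of an empty line for an empty file
}'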
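If you also want the read count and a basic sanity check for truncation, the same idea can be extended as below. This is only a sketch: fastq_stats is a name I made up, and it assumes plain four-line fastq records (no wrapped sequence lines), which is what fasterq-dump normally writes. It prints "reads bases" and returns a non-zero status if the last record is incomplete or a record looks malformed.

```shell
#!/bin/bash
# Sketch only: fastq_stats is a hypothetical helper, not part of the SRA toolkit.
# Assumes four-line fastq records (header, sequence, "+", quality).
fastq_stats() {
    # Decompress on the fly when the file name ends in .gz
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header must start with @
        NR % 4 == 2 { reads++; slen = length($0); bases += slen }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator must start with +
        NR % 4 == 0 && length($0) != slen      { bad = 1 }   # quality length must match sequence
        END {
            if (NR % 4 != 0) bad = 1                         # incomplete final record
            print reads + 0, bases + 0
            exit (bad ? 1 : 0)
        }'
}
```

After sourcing the file, something like `fastq_stats reads_1.fastq.gz || echo "file looks damaged"` (the file name is a placeholder) gives both numbers in one pass.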
Another option is to also download the metadata file associated with your SRA run. I am currently doing this with grabseqs, which uses fasterq-dump in the background. I used something like this:
grabseqs sra -m SRR10229698.meta.csv \
    -o results/main/HeLa_Kyoto_MboI_G1_Snyc_STAG2_Depleted/reads/ \
    -r 4 \
    -t 4 \
    SRR10229698
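To pull a single value out of the metadata file without opening it by hand, a small lookup like the following might help. It is a sketch under assumptions I have not verified against grabseqs output: that the file is comma-separated with a header line, that the accession column is called "Run", and that no field contains a quoted comma. meta_value is a made-up helper name.

```shell
#!/bin/bash
# Sketch only: meta_value is a hypothetical helper.
# Assumptions: header on line 1, accession column named "Run",
# and no quoted fields containing commas.
# Arguments: <metadata.csv> <run accession> <column name>
meta_value() {
    awk -F',' -v run="$2" -v col="$3" '
        NR == 1 {
            # Locate the requested column and the accession column by name
            for (i = 1; i <= NF; i++) {
                if ($i == col)   c = i
                if ($i == "Run") r = i
            }
            next
        }
        c && r && $r == run { print $c; found = 1; exit }
        END { exit (found ? 0 : 1) }' "$1"
}
```

For example, `meta_value SRR10229698.meta.csv SRR10229698 spots` would print the spot count if the file has a column with that name, and return a non-zero status otherwise.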
Afterwards, the metadata file lists the number of spots for the given SRR ID, and you can compare that against a count taken from the fastq files, in the same way simplitia describes in the answer above.
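A rough way to do that comparison for one or more fastq files is sketched below. The function names and argument order are my own invention. It assumes four-line fastq records and, for paired runs, that every spot has both mates, so that each of the _1 and _2 files should hold one read per spot; runs with orphan reads will not match exactly.

```shell
#!/bin/bash
# Sketch only: count_reads and check_spots are hypothetical helpers.
count_reads() {
    # One read per four lines; assumes unwrapped four-line fastq records.
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Arguments: <expected spots> <fastq file> [<fastq file> ...]
check_spots() {
    local expected=$1 f n status=0
    shift
    for f in "$@"; do
        n=$(count_reads "$f")
        if [ "$n" -eq "$expected" ]; then
            echo "OK    $f: $n reads"
        else
            echo "FAIL  $f: $n reads, expected $expected"
            status=1
        fi
    done
    return $status
}
```

Usage would look like `check_spots "$SPOTS" reads_1.fastq.gz reads_2.fastq.gz`, where SPOTS holds the value from the metadata file and the file names are placeholders.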