Checking integrity of SRA downloaded fastq files

I have been downloading data using the SRA toolkit with prefetch + fastq-dump and, more recently, fasterq-dump. Various messages came up while I was troubleshooting, but I still get this one regularly: fasterq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed ). The process continues, and eventually I get the readout stating the number of spots and reads read.

I'm now curious about checking the integrity of the downloaded data - I've read that you can use MD5 checksums for the SRA file but not the fastq files. Now that I already have the fastq files, is there another way I can check the integrity? Is having the correct number of "reads read/written" printed out after fasterq-dump is finished enough to confirm correct downloading and processing or is there something I'm not thinking of?


Tags: sra toolkit, fastq-dump, fasterq-dump, sequence

I ran into the same issue: vdb-validate does not work if you download the fastq files without prefetching the SRA file first, because you are left with only a fastq file, which cannot be validated. It is strange that no simple checksum is provided; maybe we can get the admins to include one? Anyhow, if you download the SRA metadata file, one of its columns lists the total number of bases for the run. What I did next was count the total bases in either the _1 or the _2 file and multiply by 2 if the data is paired (this assumes both mates have the same total length). For single-end data, leave the count as is.

Here is a simple shell script that you can use. It outputs the total number of bases in a gzipped fastq file, which you can then match against the value in the metadata.

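#!/bin/bash
# Prints the total number of bases in a gzipped fastq file.
# The file is passed as the first argument.
F=$1

# Sequence lines are lines 2, 6, 10, ... (every 4th line, starting at line 2).
gzip -dc "$F" |
     awk 'NR%4==2{l+=length($0)}
          END{
                # printf keeps large totals in plain integer form
                printf "%.0f\n", l;
              }'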
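For paired data, the comparison against the metadata could be sketched as below. This is only a sketch: the function names count_bases and check_bases are made up for illustration, and the expected total is whatever the bases column of your metadata file says. Instead of doubling the count of one mate file, it sums both files, which also covers mates of unequal length.

```shell
#!/bin/bash
# Sum the bases (sequence-line lengths) over one or more gzipped fastq files.
count_bases() {
    local total=0 n f
    for f in "$@"; do
        # Sequence lines are every 4th line, starting at line 2.
        n=$(gzip -dc "$f" | awk 'NR % 4 == 2 { l += length($0) } END { printf "%.0f\n", l }')
        total=$((total + n))
    done
    echo "$total"
}

# Hypothetical helper: compare the counted total with an expected value.
# Usage: check_bases EXPECTED_BASES file_1.fastq.gz [file_2.fastq.gz]
check_bases() {
    local expected=$1
    shift
    local observed
    observed=$(count_bases "$@")
    if [ "$observed" -eq "$expected" ]; then
        echo "OK: $observed bases"
    else
        echo "MISMATCH: counted $observed bases, expected $expected"
        return 1
    fi
}
```

After sourcing the functions, call check_bases with the total from the metadata followed by the fastq files of the run. A non-zero exit status means the counts differ.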

Another option is to also download the metadata file associated with your SRA run. I am currently doing this with grabseqs, which uses fasterq-dump in the background. I used something like this:

grabseqs sra -m SRR10229698.meta.csv \
             -o results/main/HeLa_Kyoto_MboI_G1_Snyc_STAG2_Depleted/reads/ \
             -r 4 \
             -t 4 \
             SRR10229698

Afterwards you will find the number of spots for the given SRR ID in the metadata file, and you can then count the reads in your fastq files in the same way simplitia counts bases in the answer above.
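The spot comparison could be sketched roughly like this. The sketch makes several assumptions: the metadata file is a plain CSV with a header row and a column named spots (check the header of your own file, the name may differ), no field contains a quoted comma, and the file describes a single run on its second line. The helper names csv_field, count_reads and check_spots are made up for illustration. For paired data where every spot has both mates, each of the _1 and _2 files should hold one read per spot.

```shell
#!/bin/bash
# Print the value of a named column from the first data row of a CSV file.
# Naive comma splitting: assumes no quoted fields that contain commas.
csv_field() {
    local file=$1 column=$2
    awk -F',' -v col="$column" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx && NR == 2 { print $idx; exit }
    ' "$file"
}

# Count reads in a gzipped fastq file (4 lines per record).
# Fails if the line count is not a multiple of 4, e.g. for a truncated file.
count_reads() {
    gzip -dc "$1" | awk 'END {
        if (NR % 4 != 0) { print "incomplete record in input" > "/dev/stderr"; exit 1 }
        printf "%.0f\n", NR / 4
    }'
}

# Hypothetical helper: compare the read count with the spots column.
# Usage: check_spots metadata.csv reads_1.fastq.gz
check_spots() {
    local meta=$1 fastq=$2 expected observed
    expected=$(csv_field "$meta" spots)
    observed=$(count_reads "$fastq") || return 1
    if [ -n "$expected" ] && [ "$observed" = "$expected" ]; then
        echo "OK: $observed reads"
    else
        echo "MISMATCH: counted $observed reads, metadata lists '$expected' spots"
        return 1
    fi
}
```

Matching counts only show that no records were lost. They do not prove that every base is identical to the archive copy, so where you still have the .sra file, vdb-validate remains the stronger check.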

