Suppose you want to download some raw sequence data in fastq format from GEO/SRA and run it through an appropriate aligner (BWA, TopHat, STAR, etc.) and then a variant caller (Strelka, etc.) or another analysis pipeline. How do you get started? First things first: you need the sequence data.

I will use the data released along with the following publication as an example:
Daemen A*, Griffith OL* et al. 2013. Modeling precision treatment of breast cancer. Genome Biology. 14:R110.

Data were deposited at GEO/SRA and are accessible through the GEO super-series GSE48216, which comprises one sub-series for RNA-seq (GSE48213) and one for Exome-seq (GSE48215). From there you can link to the relevant SRA projects: SRP026537 for RNA-seq and SRP026538 for Exome-seq.

You can download the raw data using the SRA toolkit. Please read:
www.ncbi.nlm.nih.gov/books/NBK47540/
www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

For example, to get fastq files for the T47D cell line exome data you could do something like the following.
First, find the appropriate GEO record for T47D on the GEO sub-series page for GSE48215 listed above:
www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1173000

Under 'Relations' is a link to the corresponding SRA page:
www.ncbi.nlm.nih.gov/sra?term=SRX317818

Note: You can also find this SRX record page directly from the SRA project page for SRP026538 listed above.
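The steps so far pass through several kinds of accession (GSE, GSM, SRP, SRX), while the download command in this walk-through is given an SRR run accession. As a rough sketch, a small shell helper can tell them apart by prefix. The function name accession_type is hypothetical and not part of the SRA toolkit, and it only covers the prefixes mentioned in this post.

```shell
# Hypothetical helper (not part of the SRA toolkit): classify an accession
# by its prefix. Only the prefixes used in this post are recognised.
accession_type() {
    case "$1" in
        GSE[0-9]*) echo "GEO series" ;;
        GSM[0-9]*) echo "GEO sample" ;;
        SRP[0-9]*) echo "SRA project" ;;
        SRX[0-9]*) echo "SRA experiment" ;;
        SRR[0-9]*) echo "SRA run" ;;
        *)         echo "unknown"; return 1 ;;
    esac
}

accession_type GSM1173000   # prints: GEO sample
accession_type SRX317818    # prints: SRA experiment
accession_type SRR925811    # prints: SRA run
```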

Determine the SRR number from that page and then download the data at the command line with:

prefetch -v SRR925811
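After the download finishes you need the path to the resulting .sra file. The following is only a sketch: find_sra is a hypothetical helper, and the two layouts it checks (the file directly inside a directory, or inside a subdirectory named after the accession) are assumptions about how your toolkit version and configuration store downloads.

```shell
# Hypothetical helper: print the first matching .sra file for an accession.
# Usage: find_sra ACCESSION DIR [DIR...]
# Both layouts checked below are assumptions; adjust them to your setup.
find_sra() {
    sra_acc=$1
    shift
    for sra_dir in "$@"; do
        for sra_candidate in "$sra_dir/$sra_acc.sra" "$sra_dir/$sra_acc/$sra_acc.sra"; do
            if [ -f "$sra_candidate" ]; then
                echo "$sra_candidate"
                return 0
            fi
        done
    done
    echo "no .sra file found for $sra_acc" >&2
    return 1
}

# Example call (directories are guesses, not guaranteed locations):
# find_sra SRR925811 "$HOME/ncbi/public/sra" .
```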

Note where the .sra file is downloaded (by default to /home/[USER]/ncbi/public/sra/, although the location depends on your toolkit version and configuration) and then convert it to fastq with something like the following.

fastq-dump --outdir /opt/fastq/ --split-files /home/[USER]/ncbi/public/sra/SRR925811.sra

This should produce two fastq files in /opt/fastq/ (SRR925811_1.fastq for R1 and SRR925811_2.fastq for R2). That will give you the raw exome sequence data for the T47D cell line. A very similar process should work for any RNA-seq samples that you want.
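A quick sanity check after the conversion is to confirm that both mate files contain the same number of reads. The sketch below assumes the usual four lines per fastq record. The helper names count_fastq_reads and check_pairs are hypothetical.

```shell
# Hypothetical helpers, assuming four lines per fastq record.

# Print the number of reads in one fastq file.
count_fastq_reads() {
    fq_lines=$(wc -l < "$1")
    echo $(( fq_lines / 4 ))
}

# Compare read counts of the R1 and R2 files; returns non-zero on mismatch.
check_pairs() {
    n_r1=$(count_fastq_reads "$1")
    n_r2=$(count_fastq_reads "$2")
    if [ "$n_r1" -eq "$n_r2" ]; then
        echo "OK: $n_r1 read pairs"
    else
        echo "MISMATCH: R1=$n_r1 R2=$n_r2" >&2
        return 1
    fi
}

# Example call, using the output directory from the fastq-dump command:
# check_pairs /opt/fastq/SRR925811_1.fastq /opt/fastq/SRR925811_2.fastq
```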

If you want to start with sam/bam files you can use sam-dump instead of fastq-dump. Note, however, that for this data set the output will still contain only the unaligned raw sequence data, so you will still need to run it through an aligner and variant caller.
www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=sam-dump
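To see for yourself that sam-dump output is unaligned, you can count how many records have the "unmapped" bit (0x4) set in the SAM FLAG column. This is a sketch only: sam_mapping_summary is a hypothetical helper that reads SAM text on standard input.

```shell
# Hypothetical helper: summarise SAM records read from standard input.
# A record counts as unmapped when bit 0x4 of the FLAG field (column 2) is set.
sam_mapping_summary() {
    awk -F '\t' '
        /^@/ { next }                                  # skip header lines
        { total++; if (int($2 / 4) % 2 == 1) unmapped++ }
        END { printf "total=%d unmapped=%d\n", total, unmapped }
    '
}

# Example call (assumes sam-dump is on your PATH; depending on the toolkit
# version you may need its option for including unaligned reads):
# sam-dump SRR925811 | sam_mapping_summary
```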

If you just want to write the first X raw (fastq) reads from a particular run to standard output, you can use a command like the following. This is useful for taking a quick look at some reads, obtaining a few reads for testing purposes, or simply checking whether the SRA toolkit is working for you.

fastq-dump -X 5 -Z SRR925811
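Because the reads arrive on standard output, they can be piped straight into other tools. As a small sketch, the hypothetical helper read_lengths below prints the length of each sequence line, assuming four lines per fastq record.

```shell
# Hypothetical helper: print the length of every read in fastq text on
# standard input, assuming four lines per record (sequence is line 2 of 4).
read_lengths() {
    awk 'NR % 4 == 2 { print length($0) }'
}

# Example call (requires the SRA toolkit and network access):
# fastq-dump -X 5 -Z SRR925811 | read_lengths
```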

 


