Normalize fastq files of different sizes

Hi
I have fastq files of different sizes, e.g. one is 3 GB and another is 18 GB. For the analysis I want to randomly select reads from the larger files to bring them down to about 5 GB. Is there a program or script that can do this?

Thank You
Saraswati


Tags: metagenome, data, genomics, normalization


By size, probably not, but there are many options to select by read number. Testing a few read counts or fractions will let you home in on the one that gives approximately the file size you want (see seqtk or seqkit for details).

seqtk sample

prints:

Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>

Options: -s INT       RNG seed [11]
         -2           2-pass mode: twice as slow but with much reduced memory
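
For example, to bring the 18 GB file down to roughly 5 GB you could sample a fraction of about 5/18 ≈ 0.28. A minimal sketch, assuming gzipped paired-end files (file names, seed, and the exact fraction are placeholders to adjust):

# keep ~28% of reads; reuse the same seed (-s) for R1 and R2 so pairs stay in sync
seqtk sample -s100 big_R1.fastq.gz 0.28 | gzip > sub_R1.fastq.gz
seqtk sample -s100 big_R2.fastq.gz 0.28 | gzip > sub_R2.fastq.gz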

or using seqkit:

seqkit sample -h

prints:

sample sequences by number or proportion.

Usage:
  seqkit sample [flags]

Flags:
  -h, --help               help for sample
  -n, --number int         sample by number (result may not exactly match)
  -p, --proportion float   sample by proportion
  -s, --rand-seed int      rand seed (default 11)
  -2, --two-pass           2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
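
A hedged sketch of the equivalent seqkit calls (file names, seed, and numbers are placeholders; -o is seqkit's general output flag):

# sample ~28% of reads with a fixed seed
seqkit sample -p 0.28 -s 11 big.fastq.gz -o sub.fastq.gz

# or sample an approximate read count, using 2-pass mode to keep memory usage low
seqkit sample -2 -n 20000000 -s 11 big.fastq.gz -o sub.fastq.gz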

You are asking for two different operations: down-sampling and normalization are not the same thing.

You can use reformat.sh from the BBMap suite to simply down-sample the larger file. The following options are relevant (a short example follows the list):

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
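
For example, a sketch for paired-end input (paths, seed, and the base target are placeholders; a FASTQ file's size in bytes is roughly twice its base count because of the quality lines, and gzip compression changes that again, so expect some trial and error):

# down-sample to a fixed number of output bases; use samplereadstarget=N for an exact read count instead
reformat.sh in1=big_R1.fastq.gz in2=big_R2.fastq.gz \
    out1=sub_R1.fastq.gz out2=sub_R2.fastq.gz \
    samplebasestarget=2500000000 sampleseed=13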

Depending on the aim of the analysis, it may be more appropriate to normalize the data with bbnorm.sh instead. A guide is available.
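
For reference, the basic normalization call from the BBNorm guide looks like this (file names are placeholders; the target depth and minimum-depth cutoff should be chosen for your data):

# normalize to ~100x target coverage, discarding reads with apparent depth below 5
bbnorm.sh in=reads.fastq.gz out=normalized.fastq.gz target=100 min=5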

