3 hours ago by Earendil


I have downloaded FASTQ files from two completely unrelated projects that both use 16S amplicon sequencing.

Both have an odd read length distribution. Forward reads are mostly 301 bp long, as shown by: awk 'NR%4==2{print length}' fw_1.fastq | sort | uniq -c | less

Reverse reads are mostly 300 bp long, as shown by: awk 'NR%4==2{print length}' rv_1.fastq | sort | uniq -c | less
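For easier comparison, the two one-liners above can be wrapped in a small helper that prints the most frequent length first. This is only a minimal sketch: the function name read_length_table is made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines, no Windows line endings).

```shell
# Tabulate read lengths in a FASTQ file as "count length", most frequent first.
# read_length_table is a hypothetical helper name, not part of any tool.
# Assumes every record is exactly 4 lines, so sequences sit on lines 2, 6, 10, ...
read_length_table() {
    awk 'NR % 4 == 2 { counts[length($0)]++ }
         END { for (len in counts) print counts[len], len }' "$1" |
        sort -k1,1nr -k2,2n
}

# Usage (file names taken from the commands above):
# read_length_table fw_1.fastq
# read_length_table rv_1.fastq
```

Each output line is a count followed by a read length, so the dominant length in each file appears on the first line.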

Why would that be so? I want to analyse the datasets with DADA2 and FIGARO (FIGARO detects optimal trim parameters for DADA2), and FIGARO requires all reads to be the same length. If I filter by length and keep only pairs in which both reads are, for instance, 300 bp, I will lose the vast majority of reads.
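To put a number on that loss, the pairs that would survive such a filter can be counted directly. This is a rough sketch under a few assumptions: the helper name count_pairs_with_length is invented for illustration, both files are 4-line FASTQ, and they list the same reads in the same order.

```shell
# Count read pairs in which BOTH mates have exactly the requested length.
# count_pairs_with_length is a hypothetical helper name.
# $1 = required length, $2 = forward FASTQ, $3 = reverse FASTQ
# Assumes both files hold the same reads in the same order (4 lines per record).
count_pairs_with_length() {
    # paste joins line i of both files with a tab; sequence lines contain no tabs
    paste "$2" "$3" | awk -F '\t' -v n="$1" '
        NR % 4 == 2 {
            total++
            if (length($1) == n && length($2) == n) kept++
        }
        END { printf "%d of %d pairs kept\n", kept, total }'
}

# Usage (file names taken from the commands above):
# count_pairs_with_length 300 fw_1.fastq rv_1.fastq
```

With forward reads mostly 301 bp and reverse reads mostly 300 bp, very few pairs would be expected to pass.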

I am at a complete loss as to why the forward and reverse reads would mostly differ in length.

