I am working with paired-end RNA-seq reads (250 bp) intended for de novo assembly, and ran fastp on them for quality control and diagnostics.
The insert size distribution plot looks like this:
I presume the broad peak between ca. 100 and 300 bp corresponds to "proper" reads. What are the additional peaks, though? I am referring to the sharp peak(s) at ca. 30 bp (to the left of the "main" peak) and the roughly bimodal peak at the far right, around 480 bp.
Would it be safe to exclude those reads using length-based filtering criteria?
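To gauge how much data such a filter would discard, I could tally the insert size histogram first. Here is a minimal sketch in Python, assuming the histogram is available as a plain list of counts in which index i holds the number of read pairs with insert size i. I believe fastp's JSON report stores it roughly like that, but I have not verified the exact layout, so treat it as an assumption. The function name and the toy numbers are hypothetical and are not my real data:

```python
def fraction_in_range(histogram, lo, hi):
    """Return the fraction of counted pairs with insert size in [lo, hi].

    Assumption: histogram[i] is the number of read pairs whose
    estimated insert size is i bp.
    """
    total = sum(histogram)
    if total == 0:
        return 0.0
    # Slicing clamps to the list length, so hi may exceed the last index.
    return sum(histogram[lo:hi + 1]) / total


if __name__ == "__main__":
    # Toy histogram (made up for illustration): a small peak at 30 bp,
    # a main peak at 200 bp, and a small peak at 480 bp.
    toy = [0] * 500
    toy[30] = 10
    toy[200] = 80
    toy[480] = 10
    kept = fraction_in_range(toy, 100, 300)
    print(f"fraction of pairs kept by a 100-300 bp filter: {kept:.2f}")
```

This only shows how many pairs fall inside a candidate window; it does not say whether the pairs outside it are artifacts.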
As for the sequence quality plot, this is what it looks like after running fastp (shown for the second read of each pair):
Why is there a distinct dip in quality at ca. 120 bp? Is this normal, and is it an artifact of the sequencing chemistry? (The data is, I presume, pretty good otherwise.)