NovaSeq Duplicates

Dear community,

Our group switched to NovaSeq, and since then we have seen very strange data: extremely high read counts (over 200 million reads in the BAM files for paired-end data), and the number of reads remaining after Picard MarkDuplicates drops drastically. The most extreme case went from 270 million to 9 million reads. This is ChIP-seq data (TFs).

I am not on the experimental side, but as far as I know they did not change the library prep protocol. Is there something we should know about, and how can we solve this issue? Can the data be used as it is now?
Before NovaSeq (with HiSeq) we did not have this kind of problem.

Thank you for any advice.

Regards from our mini group




It may be that, because of the much higher throughput of the NovaSeq machines, your libraries are not complex enough, so you sequence the same molecules over and over, which would indeed show up as duplicates.

200-300 million reads is much more than is normally sequenced for a ChIP-seq sample. Since for ChIP-seq you can start out with a fairly low number of molecules going into PCR, your initial library complexity is probably low, meaning a low number of unique molecules to sequence per sample. Since PCR duplicates tend to have a fairly uniform distribution, you have probably sequenced most unique molecules by 10 million or so reads, and for the remainder you are just sequencing PCR duplicates of those molecules.
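To put a rough number on that saturation argument, here is a minimal sketch using the Lander-Waterman relation C = X * (1 - exp(-N / X)), where N is the number of reads sequenced, C the number of unique reads observed, and X the unknown number of unique molecules in the library. The function names are made up for illustration, and feeding in the read counts from the question (rather than read pairs) is a simplification, so treat the result as an order-of-magnitude estimate only.

```python
import math


def expected_unique(library_size, reads_sequenced):
    """Unique molecules expected after sampling `reads_sequenced` reads
    from a library of `library_size` equally likely molecules."""
    return library_size * (1.0 - math.exp(-reads_sequenced / library_size))


def estimate_library_size(reads_sequenced, unique_reads, tol=1e-9):
    """Solve C = X * (1 - exp(-N / X)) for X by bisection.

    Assumes uniform sampling of molecules, which real libraries only
    approximate, so the result is a rough estimate.
    """
    if not 0 < unique_reads < reads_sequenced:
        raise ValueError("need 0 < unique_reads < reads_sequenced")
    # expected_unique() increases with library size, and a library of
    # exactly `unique_reads` molecules always yields slightly fewer uniques.
    lo = float(unique_reads)
    hi = lo * 2.0
    while expected_unique(hi, reads_sequenced) < unique_reads:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_unique(mid, reads_sequenced) < unique_reads:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Numbers from the question: 270 million reads, 9 million left
    # after duplicate marking.
    size = estimate_library_size(270e6, 9e6)
    print(f"estimated library size: {size / 1e6:.2f} million molecules")
    # Fraction of that library already seen after only 30 million reads.
    print(f"seen after 30M reads: {expected_unique(size, 30e6) / size:.1%}")
```

Under this model the estimated library size comes out essentially equal to the 9 million unique reads, which is another way of saying the library was sequenced to exhaustion long before 270 million reads.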

In addition to what others have said, my recommendation is to run Clumpify (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files") on this dataset. This will allow you to identify duplicates without doing any alignments. Find out how many optical duplicates you have as opposed to other types of duplicates. It is possible that your facility is overloading these flow cells (FC), in addition to the fact that these are low-complexity libraries to begin with.
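To illustrate what separates an optical duplicate from other duplicates, here is a small sketch of the idea, not how Clumpify or Picard implement it. It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y` and calls two identical reads "optical" when they come from the same flow cell, lane, and tile and lie within a distance threshold on each axis. The read names below are invented, the helper names are hypothetical, and the 2500 default is only a placeholder; the appropriate distance is platform-specific and should be taken from your tool's documentation.

```python
from typing import NamedTuple


class ClusterPosition(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name):
    """Extract the cluster position from an Illumina-style read name.

    Assumes the first seven colon-separated fields are
    instrument:run:flowcell:lane:tile:x:y; anything after the first
    whitespace (the FASTQ comment) is ignored.
    """
    fields = name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {name!r}")
    _instrument, _run, flowcell, lane, tile, x, y = fields[:7]
    return ClusterPosition(flowcell, int(lane), int(tile), int(x), int(y))


def is_optical_pair(name_a, name_b, max_dist=2500):
    """Return True if two duplicate reads sit close together on one tile.

    `max_dist` is an assumed placeholder, and comparing each axis
    separately is a simplification chosen for this sketch.
    """
    a = parse_read_name(name_a)
    b = parse_read_name(name_b)
    if (a.flowcell, a.lane, a.tile) != (b.flowcell, b.lane, b.tile):
        return False
    return abs(a.x - b.x) <= max_dist and abs(a.y - b.y) <= max_dist


if __name__ == "__main__":
    # Invented read names for two reads with identical sequence.
    first = "@A00123:45:HABCDEFXX:1:1101:1000:2000 1:N:0:ACGT"
    second = "@A00123:45:HABCDEFXX:1:1101:1500:2100 1:N:0:ACGT"
    print(is_optical_pair(first, second))
```

Duplicates that fail this test (different tile or lane, or far apart on the same tile) are more likely to be PCR duplicates, which is the distinction that tells you whether the problem is flow cell loading or library complexity.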
