I have paired-end reads that were sequenced on the Illumina HiSeq 2500 system. In a subset of samples, I noticed that a small proportion of FASTQ entries have duplicate read identifiers, sequences, quality scores and barcodes. For example, the same header appears in five samples:
sample1:@HISEQ:664:HYGKJBCXY:2:2104:21100:19520 1:N:0:CTGAGCCA
sample2:@HISEQ:664:HYGKJBCXY:2:2104:21100:19520 1:N:0:CTGAGCCA
sample3:@HISEQ:664:HYGKJBCXY:2:2104:21100:19520 1:N:0:CTGAGCCA
sample4:@HISEQ:664:HYGKJBCXY:2:2104:21100:19520 1:N:0:CTGAGCCA
sample5:@HISEQ:664:HYGKJBCXY:2:2104:21100:19520 1:N:0:CTGAGCCA
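To quantify how widespread this is, one option is to tally headers across samples. The following is only a minimal sketch: the function names `fastq_headers` and `shared_headers` are hypothetical, and it assumes strict four-line FASTQ records (no wrapped sequences) and enough memory to hold every header. Taking every fourth line, rather than matching lines that start with `@`, avoids miscounting quality strings that happen to begin with `@`.

```python
from collections import defaultdict
from itertools import islice


def fastq_headers(lines):
    """Yield the header (1st of every 4 lines) of each FASTQ record.

    Assumes strict 4-line records; `lines` is any iterable of text lines,
    such as an open file handle.
    """
    for header in islice(lines, 0, None, 4):
        header = header.rstrip("\n")
        if header:
            yield header


def shared_headers(samples):
    """Map each header that occurs more than once to the samples it occurs in.

    `samples` maps a sample name to an iterable of FASTQ lines.
    A sample name is listed once per occurrence, so within-sample
    duplicates show up as a repeated name.
    """
    seen = defaultdict(list)
    for name, lines in samples.items():
        for header in fastq_headers(lines):
            seen[header].append(name)
    return {h: names for h, names in seen.items() if len(names) > 1}
```

For real data, each value in `samples` could be a handle from `open(path)` or, for compressed files, `gzip.open(path, "rt")`. Comparing the sequence and quality lines of the flagged records would then confirm whether the entries are fully identical or only share a header.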
Libraries were prepared and sequenced by a third-party company. Does anyone have a possible explanation for this and suggestions for moving forward?
Thanks, Chris