I used Magic-BLAST to map SRA paired reads (Illumina, 150 bp) to the corresponding whole genome assembly (WGA). I expected the two reads of each pair to map near each other on a scaffold; however, as shown below, they typically overlap. Here is one example from each Magic-BLAST output file, showing how a read pair (query) from each of four related libraries maps to assembled contigs (refID) of the WGA:

queryID             refID   %_ident  q_start  q_end  r_start  r_end

SRR_1.sra.6388.1    S107.1  100      27       127    80397    80497
SRR_1.sra.6388.2    S107.1  100      1        101    80497    80397

SRR_2.sra.576423.1  S007.1  100      1        151    297238   297388
SRR_2.sra.576423.2  S007.1  100      58       151    297455   297362

SRR_3.sra.4219.1    S516.1  99.0654  45       151    40745    40639
SRR_3.sra.4219.2    S516.1  99.1379  1        116    40630    40745

SRR_4.sra.3159.1    S557.1  99.3333  1        150    37510    37659
SRR_4.sra.3159.2    S557.1  100      1        151    37706    37556

As the table shows, the two reads of the SRR_1 and SRR_3 pairs map to (nearly) the same coordinates of the corresponding contig, on opposite strands. The mapped coordinates of the SRR_2 and SRR_4 pairs are offset but still overlap (for other read pairs from these libraries, the overlap is as extensive as for the SRR_1 and SRR_3 pairs).
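To put numbers on the overlap, here is a minimal Python sketch that counts how many reference bases are covered by both mates of each pair. The coordinates are copied from the table above; the `pairs` dictionary and the `overlap_bp` helper are names made up for this illustration, not part of Magic-BLAST.

```python
# Reference coordinates (r_start, r_end) for mate 1 and mate 2 of each pair,
# copied from the table above. r_start > r_end means a minus-strand alignment.
pairs = {
    "SRR_1.sra.6388": ((80397, 80497), (80497, 80397)),
    "SRR_2.sra.576423": ((297238, 297388), (297455, 297362)),
    "SRR_3.sra.4219": ((40745, 40639), (40630, 40745)),
    "SRR_4.sra.3159": ((37510, 37659), (37706, 37556)),
}


def overlap_bp(mate1, mate2):
    """Number of reference bases covered by both mates (0 if they are disjoint)."""
    lo1, hi1 = sorted(mate1)  # ignore strand, keep the covered interval
    lo2, hi2 = sorted(mate2)
    return max(0, min(hi1, hi2) - max(lo1, lo2) + 1)


for name, (m1, m2) in pairs.items():
    print(name, overlap_bp(m1, m2))
# SRR_1.sra.6388 101
# SRR_2.sra.576423 27
# SRR_3.sra.4219 107
# SRR_4.sra.3159 104
```

Only the aligned part of each read is counted, so any unaligned read ends (for example, the first 26 bases of SRR_1.sra.6388.1) are not reflected in these numbers.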

I suppose this mapping of read pairs is possible if the library fragments used to create the paired-end library were about the same size as the read length (150 bp), but I would have expected longer library fragments (300-600 bp). Unfortunately, the SRA metadata doesn't include these details about library construction. The SRA metadata does note that two of the four libraries are paired-end and the other two are mate-pair, which is even more perplexing. If the read pairs were from a mate-pair library, I would expect them to map to the same scaffold but hundreds of base pairs from each other.
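One way to check the fragment-size idea against the example pairs is to compute the outer span of each pair on the reference (a rough stand-in for fragment length) and the relative orientation of the mates. The Python sketch below is only an illustration: `outer_span` and `orientation` are hypothetical helpers, and "FR" is defined here as the plus-strand mate's first aligned base lying at or to the left of the minus-strand mate's first aligned base (inward-facing), with "RF" as the opposite. Mate-pair protocols are commonly described as yielding outward-facing (RF) pairs, but treat that as an assumption to verify for these particular libraries.

```python
# Reference coordinates (r_start, r_end) for mate 1 and mate 2 of each pair,
# copied from the table above. r_start > r_end means a minus-strand alignment.
pairs = {
    "SRR_1.sra.6388": ((80397, 80497), (80497, 80397)),
    "SRR_2.sra.576423": ((297238, 297388), (297455, 297362)),
    "SRR_3.sra.4219": ((40745, 40639), (40630, 40745)),
    "SRR_4.sra.3159": ((37510, 37659), (37706, 37556)),
}


def outer_span(mate1, mate2):
    """Reference bases from the leftmost to the rightmost aligned position."""
    coords = mate1 + mate2
    return max(coords) - min(coords) + 1


def orientation(mate1, mate2):
    """Return 'FR' (inward), 'RF' (outward), or 'same-strand'."""
    plus = [m for m in (mate1, mate2) if m[0] <= m[1]]
    minus = [m for m in (mate1, mate2) if m[0] > m[1]]
    if len(plus) != 1 or len(minus) != 1:
        return "same-strand"
    # r_start is the reference position of each read's first aligned base.
    return "FR" if plus[0][0] <= minus[0][0] else "RF"


for name, (m1, m2) in pairs.items():
    print(name, outer_span(m1, m2), orientation(m1, m2))
# SRR_1.sra.6388 101 FR
# SRR_2.sra.576423 218 FR
# SRR_3.sra.4219 116 FR
# SRR_4.sra.3159 197 FR
```

For these four examples the spans come out between roughly 100 and 220 bp, all inward-facing. Four pairs are far too few to characterize a library, so the same calculation would need to be run over the full output files before drawing conclusions.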

My working hypothesis is that all four libraries were made with very small fragments, but perhaps there is another likely explanation. Or perhaps my expectations about how paired-end and mate-pair reads should map are wrong. Any insights?
