I used Magic-BLAST to map SRA paired reads (Illumina, 150 bp) to the corresponding whole genome assembly (WGA). I expected the two reads of a pair to map near each other on a scaffold; however, as shown below, the paired reads typically overlap. Here is one example from each Magic-BLAST output file, showing how a read pair (query) from each of four related libraries maps to assembled contigs (refID) of the WGA:
queryID               refID    %_ident   q_start  q_end  r_start  r_end
SRR_1.sra.6388.1      S107.1   100       27       127    80397    80497
SRR_1.sra.6388.2      S107.1   100       1        101    80497    80397
SRR_2.sra.576423.1    S007.1   100       1        151    297238   297388
SRR_2.sra.576423.2    S007.1   100       58       151    297455   297362
SRR_3.sra.4219.1      S516.1   99.0654   45       151    40745    40639
SRR_3.sra.4219.2      S516.1   99.1379   1        116    40630    40745
SRR_4.sra.3159.1      S557.1   99.3333   1        150    37510    37659
SRR_4.sra.3159.2      S557.1   100       1        151    37706    37556
As seen, the two reads of the SRR_1 and SRR_3 pairs map to (nearly) the same coordinates of the corresponding contig, in opposite orientations. The mapped coordinates of the SRR_2 and SRR_4 read pairs are offset but still overlap (for other read pairs in these libraries, the overlap is as extensive as for the SRR_1 and SRR_3 pairs).
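To put numbers on the overlap described above, here is a minimal Python sketch that computes, for each example pair, how many reference bases the two mates share and the total reference span they cover. It assumes the r_start/r_end coordinates are 1-based and inclusive, and that r_start > r_end simply marks a reverse-strand alignment. The names `Hit`, `pair_stats`, and `summarize` are my own and not part of Magic-BLAST.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment row from the table: query name and reference coordinates."""
    query: str
    r_start: int
    r_end: int


def interval(hit):
    """Return (left, right) on the reference, ignoring strand."""
    return min(hit.r_start, hit.r_end), max(hit.r_start, hit.r_end)


def pair_stats(mate1, mate2):
    """Return (overlap_bp, span_bp) for two mates on the same contig.

    Assumes 1-based inclusive coordinates. span_bp is the reference
    distance from the leftmost to the rightmost aligned base of the pair.
    """
    s1, e1 = interval(mate1)
    s2, e2 = interval(mate2)
    overlap = max(0, min(e1, e2) - max(s1, s2) + 1)
    span = max(e1, e2) - min(s1, s2) + 1
    return overlap, span


# The four example pairs, copied from the table above.
PAIRS = {
    "SRR_1": (Hit("SRR_1.sra.6388.1", 80397, 80497),
              Hit("SRR_1.sra.6388.2", 80497, 80397)),
    "SRR_2": (Hit("SRR_2.sra.576423.1", 297238, 297388),
              Hit("SRR_2.sra.576423.2", 297455, 297362)),
    "SRR_3": (Hit("SRR_3.sra.4219.1", 40745, 40639),
              Hit("SRR_3.sra.4219.2", 40630, 40745)),
    "SRR_4": (Hit("SRR_4.sra.3159.1", 37510, 37659),
              Hit("SRR_4.sra.3159.2", 37706, 37556)),
}


def summarize(pairs):
    """Map each library name to its (overlap_bp, span_bp)."""
    return {name: pair_stats(m1, m2) for name, (m1, m2) in pairs.items()}


if __name__ == "__main__":
    for name, (overlap, span) in summarize(PAIRS).items():
        print(f"{name}: overlap={overlap} bp, aligned span={span} bp")
```

For the four pairs in the table this gives aligned spans of 101, 218, 116, and 197 bp. The aligned span is only a rough proxy for fragment length: it ignores any unaligned read ends (for example, SRR_1 read 1 aligns only from query position 27 to 127), so it should be read as an approximation rather than a measured insert size.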
I suppose this mapping of read pairs is possible if the library fragments used to create the paired-end library were about the same size as the read length (150 bp), but I would expect that longer library fragments (300-600 bp) would have been made. Unfortunately, the SRA metadata doesn't include these details about library construction. The SRA metadata does note that two of the four libraries are paired-end and the other two are mate-pair, which is even more perplexing. If the read pairs were from a mate-pair library, I would expect them to map to the same scaffold but hundreds of base pairs from each other.
My working hypothesis is that all four libraries were made with very small fragments, but perhaps there is another likely explanation. Or perhaps my expectations about how paired-end and mate-pair reads map are wrong. Any insights?