I'm hoping to identify read pairs in which the two mates disagree with each other at a particular position in my genome.
I have a BAM file of paired-end (PE) reads mapped to my genome. The library prep was pretty shoddy (it was supposed to be 150bp PE with an insert size of 150bp), so there's massive variation in insert size, and many of the mates overlap with each other.
I'm hoping to make the best of a bad situation and use this data to identify mismatches between the two reads of a pair at a specific position in my genome. The key point is that I'm looking for mismatches between the mates of a pair, not mismatches to the genome.
My thinking was as follows:
1) Identify read pairs that map correctly (properly paired).
2) Of the read pairs from 1, find the pairs for which both mates cover the position of interest.
3) For each of these pairs, compare the bases of the two mates at this position and count the pairs where they differ.
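The three steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions, not a tested pipeline: it assumes the pysam library (not mentioned above), a coordinate-sorted and indexed BAM file, and a 0-based position. The function names `count_mate_mismatches` and `observations_from_bam` are hypothetical, and base qualities are ignored.

```python
from collections import defaultdict


def count_mate_mismatches(observations):
    """Step 3: count pairs whose two mates show different bases.

    observations: iterable of (read_name, base) tuples, one tuple per
    read covering the position of interest.
    Returns (pairs_compared, pairs_mismatched).
    """
    bases_by_pair = defaultdict(list)
    for name, base in observations:
        bases_by_pair[name].append(base.upper())

    compared = mismatched = 0
    for bases in bases_by_pair.values():
        if len(bases) != 2:
            # Only one mate covers the position, so there is nothing to compare.
            continue
        compared += 1
        if bases[0] != bases[1]:
            mismatched += 1
    return compared, mismatched


def observations_from_bam(bam_path, contig, pos):
    """Steps 1 and 2: yield (read_name, base) for reads covering pos.

    Assumes pysam is installed and the BAM is sorted and indexed.
    pos is 0-based. Reads with a deletion or skip at pos yield nothing.
    """
    import pysam  # imported here so the counting function works without it

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, pos, pos + 1):
            # Step 1: keep only primary alignments flagged as properly paired.
            if (read.is_unmapped or not read.is_proper_pair
                    or read.is_secondary or read.is_supplementary):
                continue
            # Step 2: find the read base aligned to the position, if any.
            for qpos, rpos in read.get_aligned_pairs(matches_only=True):
                if rpos == pos:
                    yield read.query_name, read.query_sequence[qpos]
                    break


# Toy data: three pairs covering the position, one of which disagrees.
toy = [("pair1", "A"), ("pair1", "A"),
       ("pair2", "A"), ("pair2", "G"),
       ("pair3", "A"), ("pair3", "A")]
print(count_mate_mismatches(toy))  # (3, 1)
```

With a real file, the call would look something like `count_mate_mismatches(observations_from_bam("reads.bam", "chr1", 12345))`, where the file name, contig and position are placeholders. Reading each mate separately with `fetch` is a deliberate choice here: as far as I know, pileup-based tools may merge overlapping mates by default, which would hide exactly the disagreements being counted.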
I've made a quick drawing which may help
In these three pairs, there is one mismatch at the position, which is between the mates of Pair2, so for this example the mismatch count would be 1. I'm hoping to count the number of such mismatches at a position across as many pairs as I can.
Any suggestions on how I can do this?