Hi, I have the following toy SAM file entry.
1_7906 0 chr1 4771271 60 146M1D37M * 0 0 AAAAGTTTTTCTGCTTGGGGAAGAAGTTGCCCAGTATGACGGTGCATACAAGGTTAGCAGAGGCCTGTGGAAGAAATACGGTGACAAGAGGATCATCGACACCCCCATCTCAGAGATGGGCTTTGCTGGAATTGCTGTTGGTGCAGCTATGGCTGGGTTGCGGCCCATCTGTGAATTTATGAC FFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:1785 NM:i:2 NH:i:1 XI:f:0.9891 X0:i:1 XE:i:53 XR:i:183 MD:Z:146^C2G34 TC:i:0 RA:Z:44,0,0,0,0,0,35,0,0,0,1,0,56,0,0,0,0,0,47,0,0,0,0,0,0, MP:Z:10:149:150
MP tags, but I need a script to get these tags in files that lack them, but have
MD:Z tag, CIGAR string, and read sequence.
RA encodes the type of matches/mismathed in the read, as follows:
# Read # A C G T N # A 0 1 2 3 4 # R C 5 6 7 8 9 # e G 10 11 12 13 14 # f T 15 16 17 18 19 # N 20 21 22 23 24
RA:Z:44,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, would correspond to 44bp read with only As, and 100% match.
MP:Z:type of mismatch in RA:position in the reference:position in the read
MP:Z:2:10:10 means a mismatch A->C at position 10, and no indels before this position in the alignment (10 and 10)
I thought of doing it in Awk, in the following order:
1 convert cigar to a string equal to the read length (e.g., 44M -> [MMM..M]44)
2 get the mismatch positions from
MD:Z tag (using the above toy SAM entry:
150=146+1del+2 (# of bases before the mismatch) + 1 (mismatch position), and the mismatch is G -> A)
3 use the converted cigar to get the corresponding reference index # (in this case one deletion shifts by 1 to the left, so 149) and find the type of the mismatched pair (A to G, C to T, etc.)
4 write the mismatch code to the respective position (10 in the toy case) in
RA construction by adding the "matches", calculating each by subtracting total counts of each base and the mismatches thereof
Before I dive deep into Awk struggle, is that a good idea to use Awk here, and if not, what would be the better way to handle it (would be nice if it could process 50M reads in less than a couple of hrs)