Hi, I have the following toy SAM file entry.

1_7906  0   chr1    4771271 60  146M1D37M   *   0   0   AAAAGTTTTTCTGCTTGGGGAAGAAGTTGCCCAGTATGACGGTGCATACAAGGTTAGCAGAGGCCTGTGGAAGAAATACGGTGACAAGAGGATCATCGACACCCCCATCTCAGAGATGGGCTTTGCTGGAATTGCTGTTGGTGCAGCTATGGCTGGGTTGCGGCCCATCTGTGAATTTATGAC FFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:1785   NM:i:2  NH:i:1  XI:f:0.9891 X0:i:1  XE:i:53 XR:i:183    MD:Z:146^C2G34  TC:i:0  RA:Z:44,0,0,0,0,0,35,0,0,0,1,0,56,0,0,0,0,0,47,0,0,0,0,0,0, MP:Z:10:149:150

It has RA and MP tags, but I need a script to get these tags in files that lack them, but have MD:Z tag, CIGAR string, and read sequence.

RA encodes the type of matches/mismathed in the read, as follows:

 #                   Read
 #             A     C     G     T     N
 #      A      0     1     2     3     4
 # R    C      5     6     7     8     9
 # e    G     10    11    12    13    14
 # f    T     15    16    17    18    19
 #      N     20    21    22    23    24

So, RA:Z:44,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, would correspond to 44bp read with only As, and 100% match.

Now MP:

MP:Z:type of mismatch in RA:position in the reference:position in the read

MP:Z:2:10:10 means a mismatch A->C at position 10, and no indels before this position in the alignment (10 and 10)

I thought of doing it in Awk, in the following order:

1 convert cigar to a string equal to the read length (e.g., 44M -> [MMM..M]44)

2 get the mismatch positions from MD:Z tag (using the above toy SAM entry: MD:Z:146^C2G34 gives 150=146+1del+2 (# of bases before the mismatch) + 1 (mismatch position), and the mismatch is G -> A)

3 use the converted cigar to get the corresponding reference index # (in this case one deletion shifts by 1 to the left, so 149) and find the type of the mismatched pair (A to G, C to T, etc.)

4 write the mismatch code to the respective position (10 in the toy case) in RA tag

5 finalize RA construction by adding the "matches", calculating each by subtracting total counts of each base and the mismatches thereof

Before I dive deep into Awk struggle, is that a good idea to use Awk here, and if not, what would be the better way to handle it (would be nice if it could process 50M reads in less than a couple of hrs)



Source link