Hello,

I am working on parsing some BAM files from pysam and I am was looking for some information on what the output from pysam fetch produdes.

This is what I am working with code wise:

import pysam
import sys
from ast import literal_eval as make_tuple
samfile = pysam.AlignmentFile('sample1.bam', "rb")
for c in contigs:
    for read in samfile.fetch(c):
        r=str(read).split("t")
        cig=read.cigarstring
        MD_TAG=make_tuple(r[-1])[-2][1] 
        print(read)

This pulls some information such as the CIGAR string and MD Tag. I want to know what the other features that are being output are though from fetch (I looked around but I haven't found a satisfactory explanation). This is the output example from one of my BAM files:

SRR5720260.4404562.1    83      0       1135779 40      144M    0       1135607 144     AATTATTGTCACTTTGCCAATATTTTCATTTAATTCCTCAAATCCAATTATGATTGAAGGGTCAATTGGTGTGGTAAGGCTAGAAGGTTTGTAAATAGTAGATTCAGTTTGAGTCGGTTGATTTTCCCCAGTCCAGTTTGATAG       array('B', [36, 36, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 36, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 32, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 32, 32, 32, 32])  [('AS', -30), ('XN', 0), ('XM', 6), ('XO', 0), ('XG', 0), ('NM', 6), ('MD', '9A58T0C8T31C32A0'), ('YS', -10), ('YT', 'CP')]

These are the columns I have been able to extrapolate:

  1. Read --> SRR5720260.4404562.1
  2. unknowcn --> 83
  3. unknown --> 0
  4. unknown --> 1135779 40
  5. CIGAR --> 144M
  6. unknown --> 0
  7. unknown --> 1135607 144
  8. query sequence: AATTATTGTCACTTTGCCAATATTTTCATTTAATTCCTCAAATCCAATTATGATTGAAGGGTCAATTGGTGTGGTAAGGCTAGAAGGTTTGTAAATAGTAGATTCAGTTTGAGTCGGTTGATTTTCCCCAGTCCAGTTTGATAG
  9. unknown (I think this is quality scores) --> array('B', [36, 36, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 36, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 32, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 32, 32, 32, 32])
  10. Read info --> [('AS', -30), ('XN', 0), ('XM', 6), ('XO', 0), ('XG', 0), ('NM', 6), ('MD', '9A58T0C8T31C32A0'), ('YS', -10), ('YT', 'CP')]
        Known values: 
        - Paired-end alignment quality (AS)
        - Amplicon name tag (XN)
        - Edit distance tag (NM)
        - the number of mismatches in the alignment(XM) 
        - the number of gap opens, for both read and reference gaps, in the alignment (XO)
        - the number of gap extensions, for both read and reference gaps, in the alignment.(XG)
        - edit distance to the reference, including ambiguous bases but excluding clipping(NM) 
        - MD tag (MD)
        Unknown values:
        - YS
        - YT
    

Any help would be appreciated and please correct me if I have misunderstood any of these values!



Source link