I have various BAM files for which I want to plot the distribution of fragment lengths. I know that deepTools has the bamPEFragmentSize tool, but this only samples the reads from a binned genome. We have relatively small files--maybe only 100k reads that map to the viral genome of interest--so the binning and sampling causes a loss of resolution.

I wanted to just take the 9th column from the SAM file, then turn that into a Numpy array and create a histogram from there. However, I'm getting a histogram that doesn't make sense to me. I would like fragment length on the x axis and frequency of that fragment length on the y axis.

After cutting the 9th column from the SAM file, I loaded this file into Python and converted it into a Numpy array:

#read file line by line, then remove /n
with open(samFile) as f:
    content = f.readlines()
content = [x.strip() for x in content]

#convert string list to integer list
for i in range(0, len(content)):
    content[i] = int(content[i])
>>> [68, 0, 0, 147, 347, 177, 347, 246, 245, -68]

#convert integer list to numpy array
arr = np.array(content)
abArr = np.absolute(arr)
>>>[ 68   0   0 147 347 177 347 246 245  68]

#make histogram
readLengths = np.histogram(abArr)
>>>(array([39010090,      111,       68,       59,       41,       15,
              10,       16,        7,        4]),
 array([0.00000000e+00, 2.27951284e+07, 4.55902568e+07, 6.83853852e+07,
        9.11805136e+07, 1.13975642e+08, 1.36770770e+08, 1.59565899e+08,
        1.82361027e+08, 2.05156156e+08, 2.27951284e+08]))



enter image description here

Source link