I have various BAM files for which I want to plot the distribution of fragment lengths. I know that deepTools has the bamPEFragmentSize tool, but this only samples the reads from a binned genome. We have relatively small files--maybe only 100k reads that map to the viral genome of interest--so the binning and sampling causes a loss of resolution.
I wanted to just take the 9th column from the SAM file, then turn that into a Numpy array and create a histogram from there. However, I'm getting a histogram that doesn't make sense to me. I would like fragment length on the x axis and frequency of that fragment length on the y axis.
After cutting the 9th column from the SAM file, I loaded this file into Python and converted it into a Numpy array:
samFile="ninth-column.txt" #read file line by line, then remove /n with open(samFile) as f: content = f.readlines() content = [x.strip() for x in content] #convert string list to integer list for i in range(0, len(content)): content[i] = int(content[i]) content[0:10] >>> [68, 0, 0, 147, 347, 177, 347, 246, 245, -68] #convert integer list to numpy array arr = np.array(content) abArr = np.absolute(arr) print(abArr[0:10]) >>>[ 68 0 0 147 347 177 347 246 245 68] #make histogram readLengths = np.histogram(abArr) readLengths >>>(array([39010090, 111, 68, 59, 41, 15, 10, 16, 7, 4]), array([0.00000000e+00, 2.27951284e+07, 4.55902568e+07, 6.83853852e+07, 9.11805136e+07, 1.13975642e+08, 1.36770770e+08, 1.59565899e+08, 1.82361027e+08, 2.05156156e+08, 2.27951284e+08])) plt.hist(readLengths) plt.show()