P-value calculation for DNA motifs
I want to calculate the statistical significance of finding a DNA motif in a promoter sequence of a specific length. For example:
The motif "ATCGAT" occurs 5 times in a 2000 bp promoter. What kind of statistical test can be used to assess the enrichment of this motif in the promoter sequence?
Is there a Python script for this?
I will be very grateful for this.
Thanks in advance
The number of distinct DNA k-mers is 4 raised to the power of k (4^k). That means there are 256 possible tetramers, 1024 pentamers and 4096 hexamers. Assuming a random sequence with equal base frequencies, any given hexamer is expected to start at about one position in 4096, so roughly 0.5 occurrences are expected in 2000 bp. Finding 5 is therefore about ten times the expectation and would be statistically significant under that simple model.
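To put a number on that argument, here is a minimal Python sketch. It assumes independent bases with equal frequencies (0.25 each) and treats the motif count as approximately Poisson distributed, which is only an approximation because "ATCGAT" can overlap itself. The function names are my own, not from any library, and a background built from real or shuffled promoter sequences would usually be more realistic.

```python
import math


def count_occurrences(seq, motif):
    """Count occurrences of motif in seq, allowing overlapping matches."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def expected_count(seq_len, k, base_prob=0.25):
    """Expected number of motif starts under a uniform, independent base model."""
    positions = seq_len - k + 1  # number of windows of length k
    return positions * base_prob ** k


def poisson_tail(observed, lam, terms=200):
    """Approximate P(X >= observed) for X ~ Poisson(lam) by summing the upper tail."""
    term = math.exp(-lam) * lam ** observed / math.factorial(observed)
    total = 0.0
    for i in range(observed, observed + terms):
        total += term
        term *= lam / (i + 1)  # next Poisson probability from the previous one
    return total


if __name__ == "__main__":
    lam = expected_count(2000, 6)   # about 0.49 expected occurrences
    p_value = poisson_tail(5, lam)  # chance of seeing 5 or more
    print(f"expected = {lam:.3f}, p-value = {p_value:.2e}")
```

With the numbers from the question this gives a p-value on the order of 1e-4. If you scan many motifs or many promoters, remember to correct for multiple testing.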
Whether it is biologically significant is a different question. Your motif is short, which is usually the case for eukaryotic TF binding sites. Yet your motif is also a palindrome, which is usually not the case for eukaryotic TFs. All that, plus a neat 2000 bp promoter size, sounds like a made-up example rather than a real one, so I think this might be homework. I will let you figure out the rest.