Computing significance of overlap between two gene lists in Python


I have two gene lists derived from two independent datasets, and I want to compute the significance of the overlap between two subgroups, one from each list. Each subgroup is the list of differentially expressed genes in its dataset. I want to know whether the overlap between the two subgroups could be explained by chance. For instance:

Dataset1 total: 500
Dataset1 subgroup: 100

Dataset2 total: 300
Dataset2 subgroup: 50

Intersection between subgroups: 25
Union of datasets 1 and 2 (no duplicates): 600

I want to compute how significant the overlap between the subgroups is compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test and the hypergeometric test, but I have trouble fitting my data into these analyses.

From what I understand, the contingency table would be:

```
                Dataset1    Dataset2
In_subgroup     100         50
Not_subgroup    400         250
Total           500         300
```
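As a point of comparison, here is a minimal sketch of a different 2x2 layout, one that cross-classifies every gene by membership in each subgroup, which is the form Fisher's exact test expects. It assumes the 600-gene union is the right universe and uses `scipy.stats.fisher_exact`; the variable names are illustrative only.

```python
from scipy.stats import fisher_exact

# Counts from the example; treating the 600-gene union as the
# universe is an assumption, not something the data confirms.
universe = 600
subgroup1 = 100
subgroup2 = 50
overlap = 25

# Rows: in / not in subgroup 1. Columns: in / not in subgroup 2.
table = [
    [overlap, subgroup1 - overlap],
    [subgroup2 - overlap, universe - subgroup1 - subgroup2 + overlap],
]

# One-sided test: is the overlap larger than expected by chance?
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(table)
print(odds_ratio, p_value)
```

With `alternative="greater"`, the p-value is the probability of seeing an overlap at least this large if subgroup membership were independent across the two datasets.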

Note that the universe comprises 600 unique elements, not 500 + 300, because some genes appear in both dataset1 and dataset2. Given this, and based on another post, I would do this in R:

```
phyper(24, 100, 500, 50, lower.tail = FALSE)
[1] 9.15e-19
```

Translating this into Python with scipy, I would use:

```
>>> import scipy.stats
>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)
0.0
```

Can I assume that the difference between the two results is due to numerical error?
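For reference, a minimal sketch of how the R call's arguments would map onto scipy, assuming the 600-gene union is the universe. `scipy.stats.hypergeom` takes its parameters as `(k, M, n, N)`: `M` is the population size, `n` the number of marked items (here subgroup 1), and `N` the number of draws (here subgroup 2). The survival function `sf` plays the role of `lower.tail = FALSE`. The variable names are illustrative only.

```python
from scipy.stats import hypergeom

universe = 600   # M: population size (assumed to be the union)
subgroup1 = 100  # n: genes marked as "in subgroup 1"
subgroup2 = 50   # N: genes drawn, i.e. the size of subgroup 2
overlap = 25

# P(X >= 25) = P(X > 24), the counterpart of lower.tail = FALSE in R.
p_overlap = hypergeom.sf(overlap - 1, universe, subgroup1, subgroup2)

# The call from the question, for comparison: 500 marked genes and
# 300 draws from 600 force at least 200 marked genes into the draw,
# so P(X <= 24) is exactly zero rather than a rounding artifact.
p_question = hypergeom.cdf(24, 600, 500, 300)

print(p_overlap, p_question)
```

Under these assumptions, the `0.0` would come from the argument order and the use of `cdf` instead of `sf`, not from floating-point error.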