I have two gene lists derived from two independent datasets, and I want to compute the significance of the overlap between two subgroups: the lists of differentially expressed genes in each dataset. I want to know whether the overlap between the two subgroups could be explained by chance. For instance:
Dataset1 total: 500
Dataset1 subgroup: 100
Dataset2 total: 300
Dataset2 subgroup: 50
Intersection between subgroups: 25
Union 1 and 2 (no duplicates): 600
I want to compute how significant the overlap between the subgroups is compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test or the hypergeometric test, but I am having trouble putting my data into either analysis.
From what I understand, the contingency table would be:
              Dataset1  Dataset2
In_subgroup        100        50
Not_subgroup       400       250
Total              500       300
Note that the universe comprises 600 unique elements, not 500+300, because some genes appear in both dataset1 and dataset2. Given this, and based on another post, I would do this in R:
phyper(24, 100, 500, 50, lower.tail = FALSE)
9.15e-19
Translating this into Python, I would use scipy:
>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)
0.0
Can I assume that the difference between the two results is numerical error?
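For comparison, here is a minimal sketch of how the R call's arguments would map onto scipy, assuming the 600 unique genes are the right universe (that choice is an assumption, and the variable names are only illustrative). R's `phyper(q, m, n, k)` takes the white balls, black balls, and number of draws, whereas `scipy.stats.hypergeom` takes `(k, M, n, N)`: population size, successes in the population, and number of draws. `lower.tail = FALSE` is the upper tail, which in scipy is `sf` rather than `cdf`. The sketch also cross-checks against a one-sided Fisher's exact test on a 2x2 table of in/out of each subgroup. It does not try to reproduce the values reported above.

```python
from scipy.stats import fisher_exact, hypergeom

# Numbers from the example; using the union (600) as the universe is an assumption.
universe = 600   # unique genes across both datasets
subgroup1 = 100  # differentially expressed in dataset 1
subgroup2 = 50   # differentially expressed in dataset 2
overlap = 25     # genes present in both subgroups

# sf(k) = P(X > k), so overlap - 1 gives P(X >= overlap).
# Argument order: (k, M = population, n = successes in population, N = draws).
p_hyper = hypergeom.sf(overlap - 1, universe, subgroup1, subgroup2)

# The same one-sided test as a 2x2 table:
# rows = in / not in subgroup 1, columns = in / not in subgroup 2.
table = [
    [overlap, subgroup1 - overlap],
    [subgroup2 - overlap, universe - subgroup1 - subgroup2 + overlap],
]
_, p_fisher = fisher_exact(table, alternative="greater")

print(p_hyper, p_fisher)
```

With the arguments in the order `(24, 600, 500, 300)`, scipy reads the call as 300 draws from a population of 600 containing 500 successes. Under that reading at least 200 successes must be drawn, so a lower-tail probability at 24 is exactly zero, not a rounding artifact.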