I am attempting to processes some ChIP-seq data. One objective of this processing is to identify differential peaks between my groups of interest.

I initially performed differential peak identification using DiffBind2 using both Deseq2 and EdgeR. Deseq2 identified far few differential regions than EdgeR for almost all comparisons of interest, but for the most part they were a subset of the Deseq2 peaks. There were obviously some peaks where this was not the case, but it holds true for the majority of called differential peaks. Example code for how I processed this data for this select comparison is below:

P23M_VS_P30F <- read.csv("P23M_VS_P30F_DS.csv")
P23M_VS_P30M <- dba(sampleSheet = P23M_VS_P30M)
P23M_VS_P30M <- dba.count(P23M_VS_P30M, bParallel = FALSE)
P23M_VS_P30M <- dba.contrast(P23M_VS_P30M, categories=DBA_CONDITION, minMembers = 2)
P23M_VS_P30M <- dba.analyze(P23M_VS_P30M, method = DBA_ALL_METHODS)
dba.plotVenn(P23M_VS_P30M, contrast = 1, method = DBA_ALL_METHODS)

enter image description here

Visual investigation of these peaks in IGV supports them as being differential, however, visual comparison of .bam and peak files is not the most reliable thing, at least in my limited experience.

Upon the DiffBind3 release, I reran all the comparisons using the new program to see how outputs might change. DiffBind3 identified far fewer differential peaks than what were returned using DiffBind2. Additionally, EdgeR is now identifying fewer differential peaks than Deseq2 for each comparison. To me this appears as some sort of shift in which program is more stringent. I understand that each of these programs has also undergone updates over the course of me trying to learn how to process this data as well, so I am not trying to attribute all of the change to DiffBind alone. This was the code I used to run the data in DiffBind3 for this comparison:

P23M_VS_P30M <- read.csv("P23M_VS_P30M_DS.csv")
P23M_VS_P30M <- dba(sampleSheet = P23M_VS_P30M)
P23M_VS_P30M_3.0 <- dba.count(P23M_VS_P30M)
P23M_VS_P30M_3.0.N <- dba.normalize(P23M_VS_P30M_3.0, normalize = DBA_NORM_NATIVE)
P23M_VS_P30M_3.0.C <- dba.contrast(P23M_VS_P30M_3.0.N)
P23M_VS_P30M_3.0.A <- dba.analyze(P23M_VS_P30M_3.0.C, method =DBA_ALL_METHODS)
dba.plotPCA(P23M_VS_P30M_3.0.A, contrast =1, DBA_CONDITION, label=DBA_REPLICATE)

enter image description here

I thought this was a little strange to observe such a drastic drop in the number of differential regions being identified by each program, but it sounds like it may be expected based on the changes that have been made to DiffBind3. Any advice on how to best values these different outputs would be greatly appreciated. Should I hold one in higher regard? Are both sets valid?

A further question that I have is based on an attempt to replicate the DiffBind2 outputs in DiffBind3 according to the release notes and information outlined in the vignettes. I have run the code below in an attempt to do this. The Deseq2 results from DiffBind3 mirror the DiffBind2 results well, but there are still notable differences. The EdgeR results do not replicate in DiffBind3 at all when run in this way. Am I doing something wrong here? Does this have to do with an EdgeR update or have I done something wrong when normalizing that I have missed? Any information to this end would be appreciated!

P23M_VS_P30M <- read.csv("P23M_VS_P30M_DS.csv")
P23M_VS_P30M <- dba(sampleSheet = P23M_VS_P30M)
P23M_VS_P30M_old <- dba.count(P23M_VS_P30M, summits = FALSE, filter = 0, minCount = 1, bParallel = FALSE)
P23M_VS_P30M_old <- dba.normalize(P23M_VS_P30M_old, normalize =DBA_NORM_DEFAULT, library = DBA_LIBSIZE_FULL, background = FALSE)
P23M_VS_P30M_old <- dba.contrast(P23M_VS_P30M_old, categories=DBA_CONDITION, minMembers = 2)
P23M_VS_P30M_old <- dba.analyze(P23M_VS_P30M_old, method = DBA_ALL_METHODS)
dba.plotVenn(P23M_VS_P30M_old, contrast = 1, method = DBA_ALL_METHODS)

enter image description here

Additionally, I have run this analysis in CSAW and the results do not particularly match up with any of the methodologies described above. CSAW identified a small set of peaks, a majority of which did not overlap with those previously called and all of which appeared differential upon visual investigation.

I entirely understand that these are not the most clear cut questions, so for that I apologize. More or less, I am trying to figure out how to move forward while having all of these different results. Are differences like this to be expected with program updates? Is there a simple explaination for the decrease in identified differential peaks I am observing? Is there a set I should value most highly? Should I do some sort of homogenization; if so is that even statistically sound or scientifically correct. Why do the Deseq2, but not the EdgeR, results replicate (at least to a much higher degree) in DiffBind3?

If you got here, thanks for taking the time to read the post. I hope that this explanation makes sense. If further detail would be helpful, I am obviously more than happy to provide. Any assistance in addressing these questions is beyond appreciated!


Source link