Ensure identical output between platforms and operating systems



This question is a generalisation of this one and the related GitHub issue:

I am looking for suggestions on what I could investigate to find out what causes differences in R function output when running identical code on identical input data with identical software versions on two different machines. Specifically, I am looking for general factors that are not specific to the code of the function I ran.

More specifically:
I was running the R function sctransform::vst, which performs a variance-stabilizing transformation and then reports a per-gene residual variance.
I ran it on two machines: the first a MacBook Pro running Mojave, with the relevant R packages installed into a local user library, and the second a Skylake node running an Ubuntu-based Singularity image in which I installed the R packages via the renv lock file created from the MacBook user library. The R version is the same as well. As far as I can tell, the input data, software versions and code are all identical.

Still, I get outputs that differ in the decimal places, which has an impact on the downstream analyses based on them.
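(As context for why such differences can appear at all: floating-point addition is not associative, so anything that reorders operations — a different BLAS, different SIMD code paths on different CPUs, or multithreaded reductions — can shift the trailing digits even with identical inputs and code. A minimal base-R illustration:)

```r
# Floating-point addition is not associative: the same three numbers
# summed in a different order give results that differ in the last bits.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)
a == b                   # FALSE
print(a, digits = 17)    # 0.60000000000000009
print(b, digits = 17)    # 0.59999999999999998
```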

Please throw me buzzwords on what I could check and investigate to make the outputs 100% identical, related to rounding and the handling of decimals.

What I checked:

  • I call set.seed() before running the function
  • .Machine$double.eps is identical
  • options()$digits is 7 on both machines
  • I set options(scipen = 999) on both machines
  • I disabled implicit BLAS and OpenMP multithreading on the Linux node via the RhpcBLASctl package
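(Beyond the items above, it may help to dump the lower-level numeric and RNG environment on both machines and diff the two outputs. This is a sketch using only base-R calls; note that sessionInfo() reports the BLAS/LAPACK libraries R is actually linked against in R >= 3.4, and the sample.kind element of RNGkind() only exists in R >= 3.6.)

```r
# Sketch: collect the numeric/RNG environment on each machine, then diff.
diagnostics <- list(
  session      = sessionInfo(),         # OS, locale, BLAS/LAPACK paths, package versions
  ext_software = extSoftVersion(),      # versions of zlib, PCRE, ICU, etc.
  lapack       = La_version(),          # LAPACK version in use
  rng          = RNGkind(),             # RNG kind, normal kind, sample kind
  long_double  = capabilities("long.double"),  # extended-precision accumulators?
  epsilon      = .Machine$double.eps
)
str(diagnostics)
```

Differences in the linked BLAS/LAPACK or in long-double support between the MacBook and the container would be prime suspects for last-decimal discrepancies.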

...and please let's not discuss whether decimal differences are important or not, that is not the point here 😉



Random seeds are different?

I have two identical Linux computers that were purchased and put into production a week apart. I saved all commands used to install Python, R and ML packages on the first machine, and repeated them verbatim on the second. I update them in parallel, and reboot them in parallel as well. Whenever I install a package on one of them, I run the same commands on the other. For 3-4 months they appeared identical in all respects, but after a while a package would install without a hiccup on one but not the other. A year-plus into their existence I know for a fact that they are no longer identical, even though I still update them simultaneously.

The reason I am writing all this is to emphasize that it is almost impossible for two different computers to have identical configurations unless they are literally being booted for the first time. In your case that is almost a guarantee, because your machines could only be identical in terms of software configuration and not in any other way. I have accepted this as a fact: no matter how hard I try to keep the two configurations identical, they will only ever be near-identical.

I don't know exactly what kind of analysis you are doing, but in many machine learning applications setting the random seed once will not do the trick. For example, in Keras setting the NumPy random seed is not enough, as it uses many other libraries which have their own random seeds. I don't remember exactly since it has been several years, but for Keras one has to set a NumPy seed, a TensorFlow seed and at least one other random seed to even hope for reproducibility. Not to mention that this has to be run using a single thread, which is not worth it for most ML applications.

Below is an example of a t-SNE plot that was run on the same dataset and on the same machine, maybe a month apart. By design the runs do not use the same random seed. It is normal for embedding runs not to produce identical plots, but usually they are related by some kind of simple rotation. The two plots were aligned afterwards according to the global transformation, but there are still parts that need to be rotated or flipped locally in order to overlap. For example, the part above 150 on the Y-axis needs to rotate by ~30 degrees clockwise to overlap, and the part between 100-150 on the Y-axis needs to flip 180° around the vertical axis. Yet there is no doubt that these two embeddings are equivalent when it comes to subsequent clustering, which is all that matters to me.

There are too many packages involved in calculation of these embeddings, and some may use their own random seeds explicitly. C'est la vie.

(image: the two t-SNE embeddings, aligned as described above)
