Re: Reproducibility of guix shell container across different host OS

Timothee Mathieu Wed, 14 May 2025 00:45:26 -0700

Hello,

After a lot of experimentation and discussion with colleagues, I found the 
culprit: it seems to be AVX-512. Apparently, the physics behind my 
simulator uses AVX (cf. 
https://mujoco.readthedocs.io/en/stable/programming/index.html).
The result of my script on a computer that has AVX-512 differs from the result 
on one that does not have it (as verified through lscpu).
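
For anyone who wants to repeat the check on several machines, here is a minimal 
Python sketch of it. It assumes an x86 Linux host where /proc/cpuinfo has a 
"flags" line (the same information lscpu reports); the helper names avx_flags 
and has_avx512 are made up for this example.

```python
from pathlib import Path


def avx_flags(cpuinfo_text):
    """Return the sorted AVX-related flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        # On x86 Linux the CPU feature list is on the line starting with "flags".
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            return sorted(f for f in flags if f.startswith("avx"))
    return []


def has_avx512(cpuinfo_text):
    """True if any AVX-512 feature flag (avx512f, avx512dq, ...) is listed."""
    return any(f.startswith("avx512") for f in avx_flags(cpuinfo_text))


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        text = path.read_text()
        print("AVX flags:", " ".join(avx_flags(text)) or "none")
        print("AVX-512 available:", has_avx512(text))
```

Running it on each machine and comparing the "AVX-512 available" line with the 
result of the script gives the same classification as reading lscpu by hand.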


I am not super familiar with such low-level instructions, but I verified that 
I got the same result on three separate AVX-512 computers and the other result 
on five separate non-AVX-512 computers.
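
One plausible mechanism, as far as I understand it, is that floating point 
addition is not associative, so a sum accumulated in 8 lanes (AVX-512 with 
doubles) can round differently from the same sum accumulated in 4 lanes (AVX). 
The Python sketch below only illustrates that effect: the data is contrived to 
make the difference large, lane_sum is a made-up helper, and none of this is 
MuJoCo's actual code.

```python
def lane_sum(values, lanes):
    """Sum values the way a SIMD loop with `lanes` accumulators might."""
    acc = [0.0] * lanes
    for i, v in enumerate(values):
        acc[i % lanes] += v  # each lane accumulates every lanes-th element
    total = 0.0
    for a in acc:  # horizontal reduction, left to right
        total += a
    return total


# Contrived input: the large terms cancel exactly, the small ones should add
# up to 6. Whether they survive depends on the order of the additions.
values = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0]

four = lane_sum(values, 4)   # 4 doubles per 256-bit register
eight = lane_sum(values, 8)  # 8 doubles per 512-bit register
print(four, eight, four == eight)  # prints: 6.0 3.0 False
```

In real code the per-operation differences are usually in the last bits only, 
but a long simulation may amplify them.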

I am not sure I understand everything about AVX. To get reproducible results, 
I tried to tune the compilation for a CPU without AVX, following 
https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/
but it did not work, maybe because only a few of the dependency packages are 
tunable. Is there a way to force everything to use AVX and not AVX-512? I 
understand that AVX-512 is meant to be faster, but in my case, before being 
faster, I want to see whether it is possible to be reproducible.

Thanks,
Timothée 


----- Original message -----
> From: "Timothee Mathieu" <timothee.math...@inria.fr>
> To: "Andreas Enge" <andr...@enge.fr>
> Cc: "Ludovic Courtès" <ludovic.cour...@inria.fr>, "Steve George" 
> <st...@futurile.net>, "Cayetano Santos"
> <csant...@inventati.org>, "help-guix" <help-guix@gnu.org>
> Sent: Wednesday 7 May 2025 09:34:44
> Subject: Re: Reproducibility of guix shell container across different host OS

> I checked and I am now convinced that the fault lies in the physics simulator,
> as I tried other, simpler reinforcement learning environments and everything
> was reproducible, so it is not due to the neural network part (which is
> already impressive, I guess, as neural network libraries tend to be quite a
> mess reproducibility-wise).
> 
> So it seems that something weird is going on with mujoco, the physics
> simulator for which we made a package. More precisely, it seems to be the
> interaction between mujoco and the pytorch neural network, because using
> random actions seems reproducible.
> I guess this could be due to floating point rounding errors, although the
> difference seems too large for that. The computation is quite long, so maybe
> the errors amplify, but I am a bit doubtful because I found complete
> reproducibility between my laptop and some powerful servers with very
> different hardware. Wouldn't the results differ across very different
> hardware if the problem were rounding errors?
> 
> Is there a way to check whether this is due to floating point rounding
> errors? I tried to use Float64 instead of Float32, and the results are still
> non-reproducible (although the values change a little, on the scale of
> 10^{-5}).
> 
> Thanks,
> Timothée
> 
> ----- Original message -----
>> From: "Andreas Enge" <andr...@enge.fr>
>> To: "Ludovic Courtès" <ludovic.cour...@inria.fr>
>> Cc: "Timothee Mathieu" <timothee.math...@inria.fr>, "Steve George"
>> <st...@futurile.net>, "Cayetano Santos"
>> <csant...@inventati.org>, "help-guix" <help-guix@gnu.org>
>> Sent: Tuesday 6 May 2025 10:30:12
>> Subject: Re: Reproducibility of guix shell container across different host OS
> 
>> On Tue, May 06, 2025 at 09:26:51AM +0200, Ludovic Courtès wrote:
>>> Do you have evidence that the problem is a leak like this?  Or could it
>>> be that the Python code being run is non-deterministic?
>>> If you run ‘guix shell -CN --no-cwd coreutils’, you can see with ‘ls’
>>> etc. that nothing leaks from the host OS (apart of course from the
>>> kernel).
>> 
>> Or maybe the hardware "leaks"? Are the two machines exactly identical,
>> in particular, do they have the exact same processor? Since the
>> differences involve floating point computations, I would not be
>> surprised if the precise processor architecture made a difference.
>> 
>> Someone mentioned the IEEE-754 standard in the thread, which mandates
>> that basic arithmetic operations follow a precise, deterministic
>> semantics, but not necessarily trigonometric functions.
>> 
>> Also, if I remember well, special flags are required to make GCC emit
>> IEEE-conforming code; otherwise the old, but faster, x86 80-bit extended
>> precision built into the processor is used. I have seen a case where
>> *printing* a variable changed its value, because this meant it would be
>> moved from an 80-bit processor register to a 64-bit memory location.
>> In other words, something like the following code:
>> double x = ...;
>> if (x!=some value) {
>>   printf ("%f", x);
>>   if (x!=some value) // the same value as above, of course
>>      printf ("0");
>>   else
>>      printf ("1");
>> }
>> would print x, followed by "1"...
>> 
>> See this thread:
>>   https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00277.html
>> and commit 098bd280f82350073e8280e37d56a14162eed09c .
>> 
>> If you want deterministic, reproducible floating point computations,
>> I am afraid you would need to use the (comparatively slow in low precision)
>> GNU MPFR and GNU MPC libraries; or use interval arithmetic from FLINT
>> and replace exact comparisons by looking at intersections of intervals.
>> 
>> Andreas
