Hello, 

Yes, mujoco is packaged in Guix. I did the packaging myself, so I hope it is
correct :)
I checked that the compiled package has exactly the same hash on all the
computers, so it should be identical on every machine. I also tried simply
copying a guix pack tar.gz file to another machine, uncompressing it and
running the code, so there really should be no difference.
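
A minimal sketch of that kind of check, assuming the thing being compared is a single file such as the guix pack tarball (the Guix store hash itself is computed differently; the helper names `sha256_of_file` and `all_identical` are made up for illustration):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(digests):
    """digests maps a machine name to the digest computed on that machine."""
    return len(set(digests.values())) <= 1
```

Running `sha256_of_file` on each machine and passing the collected digests to `all_identical` tells whether the copies are byte-for-byte the same.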

Timothée 

> From: "Etienne B. Roesch" <etienne.roe...@gmail.com>
> To: "Timothee Mathieu" <timothee.math...@inria.fr>
> Cc: "Andreas Enge" <andr...@enge.fr>, "Ludovic Courtès"
> <ludovic.cour...@inria.fr>, "Steve George" <st...@futurile.net>, "Cayetano
> Santos" <csant...@inventati.org>, "help-guix" <help-guix@gnu.org>
> Sent: Wednesday, 14 May 2025 12:19:44
> Subject: Re: Reproducibility of guix shell container across different host OS

> Very interesting.
> Is it the case that mujoco is packaged correctly in Guix, but itself calls
> different routines depending on the architecture it runs on? (Alternatively,
> it might not be packaged "correctly" (or not at all!) and be compiled with
> different flags on different architectures, but then I think that would have
> shown up in your investigation with diff.)

> Etienne

> On Wed, May 14, 2025 at 8:45 AM Timothee Mathieu
> <timothee.math...@inria.fr> wrote:

>> Hello,

>> After a lot of experimentation and discussion with colleagues, I found the
>> culprit! It seems to be AVX-512. Apparently, the physics behind my simulator
>> uses AVX (cf.
>> https://mujoco.readthedocs.io/en/stable/programming/index.html ).
>> The result of my script is different on a computer that has AVX-512 compared
>> to one that does not have it (as verified through lscpu).
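
One plausible mechanism, sketched here as an assumption rather than as what mujoco actually does: a wider vector register changes how many partial sums a reduction keeps, and floating-point addition is not associative, so the final value can change. The function `lane_sum` below is a hypothetical pure-Python stand-in for a SIMD reduction, and the input values are chosen to make the effect obvious.

```python
import math


def lane_sum(values, lanes):
    """Sum values with `lanes` independent partial sums, then combine them."""
    partial = [0.0] * lanes
    for index, value in enumerate(values):
        partial[index % lanes] += value
    total = 0.0
    for lane_total in partial:
        total += lane_total
    return total


values = [1e16, 1.0, -1e16, 1.0] * 2

sum_4_lanes = lane_sum(values, 4)  # 4 doubles fit in a 256-bit AVX register
sum_8_lanes = lane_sum(values, 8)  # 8 doubles fit in a 512-bit AVX-512 register
exact = math.fsum(values)          # correctly rounded reference sum

print(sum_4_lanes, sum_8_lanes, exact)  # 2.0 1.0 4.0
```

Both lane counts follow IEEE-754 exactly for each individual addition; only the order of the additions differs, and that is enough to change the result.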

>> I am not very familiar with such low-level instructions, but I verified
>> that I got the same result on three separate AVX-512 computers, and the
>> other result on five separate non-AVX-512 computers.

>> I am not sure I understand everything about AVX. I tried to tune the
>> compilation for a CPU without AVX, following
>> https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/
>> in order to get reproducible results, but it did not work, maybe because
>> only a few of the dependency packages are tunable. Is there a way to force
>> everything to use AVX and not AVX-512? I understand that AVX-512 is meant to
>> be faster, but in my case, before being faster, I want to see whether it is
>> possible to be reproducible.

>> Thanks,
>> Timothée

>> ----- Original message -----
>> > From: "Timothee Mathieu" <timothee.math...@inria.fr>
>> > To: "Andreas Enge" <andr...@enge.fr>
>> > Cc: "Ludovic Courtès" <ludovic.cour...@inria.fr>, "Steve George"
>> > <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>,
>> > "help-guix" <help-guix@gnu.org>
>> > Sent: Wednesday, 7 May 2025 09:34:44
>> > Subject: Re: Reproducibility of guix shell container across different host OS

>> > I checked, and I am now convinced that the fault lies in the physics
>> > simulator: I tried other, simpler reinforcement learning environments and
>> > everything was reproducible, so it is not due to the neural network part
>> > (which is already impressive, I guess, as neural network libraries tend to
>> > be quite a mess reproducibility-wise).

>> > So it seems that something weird is going on with mujoco, the physics
>> > simulator that we packaged. More precisely, it seems to come from the
>> > interaction between mujoco and the pytorch neural network, because runs
>> > with random actions seem reproducible.
>> > I guess this could be due to floating-point rounding errors, although the
>> > difference seems too large for that. The computation is quite long, so
>> > maybe the errors amplify, but I am a bit doubtful because I found complete
>> > reproducibility between my laptop and some powerful servers with very
>> > different hardware. Wouldn't the results differ across very different
>> > hardware if the problem were rounding error?
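
On the question of whether a tiny error can grow into a huge one over a long computation, here is a toy illustration. It uses the logistic map, a standard chaotic system chosen purely as an example; it says nothing about mujoco itself, and the starting value and perturbation are arbitrary.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return every intermediate value."""
    trajectory = []
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        trajectory.append(x)
    return trajectory


reference = logistic_trajectory(0.3, 100)
perturbed = logistic_trajectory(0.3 + 1e-15, 100)  # rounding-sized perturbation
gaps = [abs(u - v) for u, v in zip(reference, perturbed)]

print("gap after 1 step:", gaps[0])
print("largest gap within 100 steps:", max(gaps))
```

The gap starts at the scale of a rounding error and ends up comparable to the values themselves, while rerunning the same code on identical inputs still gives bit-identical trajectories. A large final difference is therefore compatible with a rounding-level cause.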

>> > Is there a way to check whether this is due to floating-point rounding
>> > errors? I tried using Float64 instead of Float32 and the results are still
>> > non-reproducible (although the values change a little, on the order of
>> > 10^{-5}).
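
One possible way to judge this, given as a sketch only: compare the first few intermediate values from two machines bit for bit, and measure how many representable doubles lie between them. The helper `ulp_distance` is a hypothetical name; a distance of a few units early in the run would point to rounding, while a large distance from the very first step would point elsewhere.

```python
import struct


def ulp_distance(a, b):
    """Number of representable doubles between a and b (0 if identical)."""

    def ordered(x):
        # Reinterpret the double as a signed 64-bit integer, then remap
        # negative floats so the integer order matches the float order.
        (bits,) = struct.unpack("<q", struct.pack("<d", x))
        return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

    return abs(ordered(a) - ordered(b))


print(ulp_distance((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)))  # 1
print((0.1 + 0.2 + 0.3).hex())  # exact bit pattern, safe to diff across machines
```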

>> > Thanks,
>> > Timothée

>> > ----- Original message -----
>> >> From: "Andreas Enge" <andr...@enge.fr>
>> >> To: "Ludovic Courtès" <ludovic.cour...@inria.fr>
>> >> Cc: "Timothee Mathieu" <timothee.math...@inria.fr>, "Steve George"
>> >> <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>,
>> >> "help-guix" <help-guix@gnu.org>
>> >> Sent: Tuesday, 6 May 2025 10:30:12
>> >> Subject: Re: Reproducibility of guix shell container across different host
>> >> OS

>> >> On Tue, May 06, 2025 at 09:26:51AM +0200, Ludovic Courtès wrote:
>> >>> Do you have evidence that the problem is a leak like this? Or could it
>> >>> be that the Python code being run is non-deterministic?
>> >>> If you run ‘guix shell -CN --no-cwd coreutils’, you can see with ‘ls’
>> >>> etc. that nothing leaks from the host OS (apart of course from the
>> >>> kernel).

>> >> Or maybe the hardware "leaks"? Are the two machines exactly identical,
>> >> in particular, do they have the exact same processor? Since the
>> >> differences involve floating point computations, I would not be
>> >> surprised if the precise processor architecture made a difference.

>> >> Someone mentioned the IEEE-754 standard in the thread, which mandates
>> >> that basic arithmetic operations follow a precise, deterministic
>> >> semantics, but not necessarily trigonometric functions.

>> >> Also, if I remember correctly, special flags are required to make GCC emit
>> >> IEEE-conforming code; otherwise the old but faster x86 80-bit extended
>> >> precision built into the processor is used. I have seen a case where
>> >> *printing* a variable changed its value, because this meant it would be
>> >> moved from an 80-bit processor register to a 64-bit memory location.
>> >> In other words, something like the following code:
>> >>   double x = ...;
>> >>   if (x != some value) {
>> >>     printf ("%f", x);
>> >>     if (x != some value) // the same value as above, of course
>> >>       printf ("0");
>> >>     else
>> >>       printf ("1");
>> >>   }
>> >> would print x, followed by "1"...

>> >> See this thread:
>> >> https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00277.html
>> >> and commit 098bd280f82350073e8280e37d56a14162eed09c.

>> >> If you want deterministic, reproducible floating point computations,
>> >> I am afraid you would need to use the GNU MPFR and GNU MPC libraries
>> >> (comparatively slow in low precision); or use interval arithmetic from
>> >> FLINT and replace exact comparisons by looking at intersections of
>> >> intervals.
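
To make the interval idea concrete, here is a minimal pure-Python sketch. It is not the FLINT or MPFR API; the `Interval` class is hypothetical, supports only addition, and widens each result by one step outward with `math.nextafter` (Python 3.9+) so that the exact sum is always enclosed.

```python
import math


class Interval:
    """Closed interval [lo, hi] that encloses the exact real result."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def __add__(self, other):
        # Round the lower bound down and the upper bound up by one step.
        return Interval(
            math.nextafter(self.lo + other.lo, -math.inf),
            math.nextafter(self.hi + other.hi, math.inf),
        )

    def overlaps(self, other):
        return self.lo <= other.hi and other.lo <= self.hi


p, q, s = Interval(0.1), Interval(0.2), Interval(0.3)
left = (p + q) + s
right = p + (q + s)

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False: exact comparison fails
print(left.overlaps(right))                    # True: the intervals intersect
```

The two evaluation orders give different doubles, but both intervals contain the true sum, so testing for intersection gives the same answer whatever the order of operations.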

>> >> Andreas
