Hello,

Yes, mujoco is packaged in Guix. I did the packaging myself, so I hope it is correct :) I checked that the resulting compiled package has exactly the same hash on all the computers, so it should be identical on every machine. I also tried simply copying a guix pack tar.gz file, uncompressing it, and running the code, so there really should be no difference.
Timothée

> From: "Etienne B. Roesch" <etienne.roe...@gmail.com>
> To: "Timothee Mathieu" <timothee.math...@inria.fr>
> Cc: "Andreas Enge" <andr...@enge.fr>, "Ludovic Courtès"
> <ludovic.cour...@inria.fr>, "Steve George" <st...@futurile.net>,
> "Cayetano Santos" <csant...@inventati.org>, "help-guix" <help-guix@gnu.org>
> Sent: Wednesday, 14 May 2025 12:19:44
> Subject: Re: Reproducibility of guix shell container across different host OS
>
> Very interesting.
>
> Is it the case that mujoco is packaged correctly in guix, but then itself
> calls different routines depending on the running architecture? (Or,
> alternatively, it would not be packaged "correctly" (or not at all!) and
> would be compiled with different flags on different architectures; but then
> I think that would have shown up in your investigation of diff.)
>
> Etienne
>
> On Wed, May 14, 2025 at 8:45 AM Timothee Mathieu
> <timothee.math...@inria.fr> wrote:
>
>> Hello,
>>
>> After a lot of experimentation and discussion with colleagues, I found the
>> culprit! It seems to be AVX-512. Apparently, the physics behind my simulator
>> uses AVX (cf. https://mujoco.readthedocs.io/en/stable/programming/index.html).
>>
>> The result of my script is different on a computer that has AVX-512 compared
>> to one that does not have it (as verified through lscpu).
>>
>> I am not very familiar with such low-level instructions, but I verified that
>> on three separate AVX-512 computers I got the same result, and on five
>> separate non-AVX-512 computers I got the other result.
>> I am not sure I understand everything about AVX. I tried to tune the
>> compilation for a CPU without AVX, following
>> https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/
>> in order to get reproducible results, but it did not work, maybe because
>> only a few of the dependency packages are tunable. Is there a way to force
>> everything to use AVX and not AVX-512? I understand that AVX-512 is meant to
>> be faster, but in my case, before being faster, I want to see whether it is
>> possible to be reproducible.
>>
>> Thanks,
>> Timothée
>>
>> ----- Original message -----
>> > From: "Timothee Mathieu" <timothee.math...@inria.fr>
>> > To: "Andreas Enge" <andr...@enge.fr>
>> > Cc: "Ludovic Courtès" <ludovic.cour...@inria.fr>, "Steve George"
>> > <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>,
>> > "help-guix" <help-guix@gnu.org>
>> > Sent: Wednesday, 7 May 2025 09:34:44
>> > Subject: Re: Reproducibility of guix shell container across different host OS
>> >
>> > I checked, and I am now convinced that the fault lies in the physics
>> > simulator: I tried other, simpler reinforcement learning environments and
>> > everything was reproducible, so it is not due to the neural network part
>> > (which is already impressive, I guess, as neural network libraries tend to
>> > be quite a mess reproducibility-wise).
>> >
>> > So it seems that something weird is going on with mujoco, the physics
>> > simulator for which we made a package. And it seems that it is the
>> > interaction between mujoco and the neural network from pytorch, because
>> > using random actions seems reproducible.
>> > I guess this could be due to floating point rounding error, although the
>> > difference seems too large to be rounding error. The computation is quite
>> > long, so maybe the errors amplify, but I am a bit doubtful about this
>> > because I found complete reproducibility between my laptop and some
>> > powerful servers with very different hardware. Wouldn't the results
>> > differ across very different hardware if the problem was rounding error?
>> >
>> > Is there a way to check whether this is due to floating point rounding
>> > error? I tried to use Float64 instead of Float32, and I still get
>> > non-reproducible results (although it changes the value a little bit, on
>> > the order of 10^{-5}).
>> >
>> > Thanks,
>> > Timothée
>> >
>> > ----- Original message -----
>> >> From: "Andreas Enge" <andr...@enge.fr>
>> >> To: "Ludovic Courtès" <ludovic.cour...@inria.fr>
>> >> Cc: "Timothee Mathieu" <timothee.math...@inria.fr>, "Steve George"
>> >> <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>,
>> >> "help-guix" <help-guix@gnu.org>
>> >> Sent: Tuesday, 6 May 2025 10:30:12
>> >> Subject: Re: Reproducibility of guix shell container across different host OS
>> >>
>> >> On Tue, May 06, 2025 at 09:26:51AM +0200, Ludovic Courtès wrote:
>> >>> Do you have evidence that the problem is a leak like this? Or could it
>> >>> be that the Python code being run is non-deterministic?
>> >>>
>> >>> If you run ‘guix shell -CN --no-cwd coreutils’, you can see with ‘ls’
>> >>> etc. that nothing leaks from the host OS (apart of course from the
>> >>> kernel).
>> >>
>> >> Or maybe the hardware "leaks"? Are the two machines exactly identical,
>> >> in particular, do they have the exact same processor?
>> >> Since the differences involve floating point computations, I would not
>> >> be surprised if the precise processor architecture made a difference.
>> >> Someone mentioned the IEEE-754 standard in the thread, which mandates
>> >> that basic arithmetic operations follow precise, deterministic
>> >> semantics, but not necessarily trigonometric functions.
>> >>
>> >> Also, if I remember well, special flags are required to make GCC emit
>> >> IEEE-conforming code; otherwise the old, but faster, x86 80-bit extended
>> >> precision built into the processor is used. I have seen a case where
>> >> *printing* a variable changed its value, because this meant it would be
>> >> moved from an 80-bit processor register to a 64-bit memory location.
>> >> In other words, something like the following code:
>> >>
>> >>    double x = ...;
>> >>    if (x != some value) {
>> >>       printf ("%f", x);
>> >>       if (x != some value) // the same value as above, of course
>> >>          printf ("0");
>> >>       else
>> >>          printf ("1");
>> >>    }
>> >>
>> >> would print x, followed by "1"...
>> >>
>> >> See this thread:
>> >> https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00277.html
>> >> and commit 098bd280f82350073e8280e37d56a14162eed09c.
>> >>
>> >> If you want deterministic, reproducible floating point computations,
>> >> I am afraid you would need to use the (comparatively slow in low
>> >> precision) GNU MPFR and GNU MPC libraries; or use interval arithmetic
>> >> from FLINT and replace exact comparisons by looking at intersections of
>> >> intervals.
>> >>
>> >> Andreas
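
On the AVX versus AVX-512 observation quoted above, here is an illustrative sketch, not a diagnosis of what mujoco actually does. Floating-point addition is not associative, so a sum accumulated in 8 parallel lanes can round differently from the same sum accumulated in 16 lanes, even when every single addition is IEEE-754 conforming. The C code below mimics this with a plain scalar loop. The names lane_sum, MAX_LANES and example are hypothetical and are not taken from MuJoCo or PyTorch. It assumes single-precision evaluation without excess precision (FLT_EVAL_METHOD 0, as with SSE on x86-64) and no -ffast-math.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 16

/* Example data (hypothetical): one large value, its negation eight
   positions later, and a small value in between. Unlisted elements
   are zero-initialized. */
const float example[16] = {
    1e8f, 1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f,
    -1e8f
};

/* Sum n floats using `lanes` independent partial sums, roughly the way
   a SIMD reduction holding `lanes` floats per register would, then add
   the partial sums together in lane order. */
float lane_sum(const float *v, size_t n, size_t lanes)
{
    assert(lanes >= 1 && lanes <= MAX_LANES);

    float acc[MAX_LANES] = {0.0f};

    /* Element i goes to lane i % lanes, as in a strided vector loop. */
    for (size_t i = 0; i < n; i++)
        acc[i % lanes] += v[i];

    /* Horizontal reduction of the partial sums. */
    float total = 0.0f;
    for (size_t k = 0; k < lanes; k++)
        total += acc[k];
    return total;
}
```

Under those assumptions, lane_sum(example, 16, 8) returns 1.0f, because 1e8f and -1e8f land in the same lane and cancel before the 1.0f is added. lane_sum(example, 16, 16) returns 0.0f, because 1.0f is absorbed when it is added to 1e8f (the spacing between adjacent floats near 1e8 is 8). Both are valid roundings of the same mathematical sum, so a library that selects a wider code path at run time could give different results from an identical binary.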