Hello,

We’ve noticed the problem below on clusters running a foreign distro when slurmd is version 19.x and our clients are version 20.x:

--8<---------------cut here---------------start------------->8---
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- package -A slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
python-slurm-magic  0.0-0.73dd1a2  out  gnu/packages/parallel.scm:225:4
slurm               20.02.5        out  gnu/packages/parallel.scm:109:2
slurm-drmaa         1.1.1          out  gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- environment --ad-hoc slurm -- squeue
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
slurm_load_jobs error: Zero Bytes were transmitted or received
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- package -A slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
python-slurm-magic  0.0-0.73dd1a2  out  gnu/packages/parallel.scm:225:4
slurm               19.05.3-2      out  gnu/packages/parallel.scm:109:2
slurm-drmaa         1.1.1          out  gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- environment --ad-hoc slurm -- squeue
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[courtes@devel01 ~]$ /usr/bin/squeue --version
slurm 19.05.2
--8<---------------cut here---------------end--------------->8---

This means that, in general, we cannot use the Guix-provided SLURM on clusters running foreign distros.

<https://slurm.schedmd.com/troubleshoot.html#network> reads:

  Slurm daemons will support RPCs and state files from the two
  previous major releases (e.g. a version 17.11.x SlurmDBD will support
  slurmctld daemons and commands with a version of 17.11.x, 17.02.x or
  16.05.x).

In other words, daemons accept commands from their own or an older major release; nothing is promised for a client that is newer than the daemon, which is the situation above (20.02 client, 19.05 daemon).
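
To make the rule quoted above concrete, here is a minimal Python sketch of it. It is only an illustration under stated assumptions: the function names and the ‘MAJOR_RELEASES’ list are mine, not part of Slurm or Guix, and the list should be checked against <https://download.schedmd.com/slurm/> before relying on it.

```python
# Ordered list of Slurm major releases, oldest first.
# Assumption: hand-written for illustration; verify it against
# https://download.schedmd.com/slurm/ before relying on it.
MAJOR_RELEASES = ["16.05", "17.02", "17.11", "18.08", "19.05", "20.02"]


def major_release(version):
    """Return the 'YY.MM' major release of a version such as '19.05.3-2'."""
    return ".".join(version.split(".")[:2])


def daemon_accepts(daemon_version, client_version, window=2):
    """Return True if the daemon is documented to support the client.

    Per the quoted documentation, a daemon supports its own major release
    and the `window` previous ones; clients newer than the daemon are not
    covered by that statement.
    """
    daemon = MAJOR_RELEASES.index(major_release(daemon_version))
    client = MAJOR_RELEASES.index(major_release(client_version))
    return 0 <= daemon - client <= window


if __name__ == "__main__":
    # The two cases from the transcript above, against the 19.05.2 daemon.
    for client in ("19.05.3-2", "20.02.5"):
        print(client, daemon_accepts("19.05.2", client))
```

With the versions from the transcript, the 19.05.3-2 client falls inside the documented window of the 19.05.2 daemon, while the 20.02.5 client does not, which matches what ‘squeue’ showed.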

Looking at <https://download.schedmd.com/slurm/>, there have been quite a few releases between 19.05.3-2 and 20.02.5, which may explain the problem I described.

Apparently the only .so in Open MPI linked against SLURM is ‘lib/openmpi/mca_pmix_s1.so’.

The diff suggests that the two versions are not ABI-compatible, so one wouldn’t be able to use ‘--with-graft’ to graft one version in lieu of the other:

--8<---------------cut here---------------start------------->8---
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- build slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
/gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- build slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
/gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5
[courtes@devel01 ~]$ abidiff --stat /gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2/lib/slurm/libslurmfull.so /gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5/lib/slurm/libslurmfull.so
Functions changes summary: 0 Removed, 0 Changed, 0 Added function
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
Function symbols changes summary: 80 Removed, 162 Added function symbols not referenced by debug info
Variable symbols changes summary: 3 Removed, 0 Added variable symbols not referenced by debug info
--8<---------------cut here---------------end--------------->8---

What can we do about it?

At least, we should package several known-useful versions, so that people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly refer to the version they want in their profile. I’ll work on that.

Anything else?

I heard that PMIx, a scheduler-independent API, will eventually supersede SLURM in Open MPI. Let’s see if that loosens version requirements.

Thanks,
Ludo’.