Good day,

We have a cluster with several MPI distributions and SLURM as the queue
manager. We also have a SLURM SPANK plugin; it is simple: you define some
callback functions in a shared library, and SLURM loads the library and
calls them at the appropriate points.
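To illustrate, a minimal plugin along these lines (a sketch, with a hypothetical plugin name; our real plugin does more than log) looks roughly like this, and is how the traces below were produced:

```c
/* Minimal SPANK plugin sketch ("calllog" is a hypothetical name): each
 * callback just logs which function was invoked.  Build it as a shared
 * library and register it in plugstack.conf. */
#include <slurm/spank.h>

SPANK_PLUGIN(calllog, 1);

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    slurm_info("function slurm_spank_init");
    return ESPANK_SUCCESS;
}

int slurm_spank_local_user_init(spank_t sp, int ac, char **av)
{
    slurm_info("function slurm_spank_local_user_init");
    return ESPANK_SUCCESS;
}

int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    /* This is the callback that never fires on the batch node when the
     * step is launched through mpirun from OpenMPI 4.0.1. */
    slurm_info("function slurm_spank_user_init");
    return ESPANK_SUCCESS;
}

int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    slurm_info("slurm_spank_exit");
    return ESPANK_SUCCESS;
}
```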

The issue arises with OpenMPI 4.0.1 (and possibly later) and the mpirun
command. As far as I know, no other combination reproduces it. If a job
reserves 2 or more nodes and the sbatch script launches mpirun inside, the
node executing the sbatch script does not follow the normal SPANK plugin
call pipeline: the "slurm_spank_user_init" callback is skipped after an
mpirun:

With OpenMPI 3.1.4:
srun: function slurm_spank_init
srun: function slurm_spank_init_post_opt
srun: function slurm_spank_local_user_init
remote: function slurm_spank_user_init
srun: slurm_spank_exit

With OpenMPI 4.0.1 (note the missing slurm_spank_user_init):
srun: function slurm_spank_init
srun: function slurm_spank_init_post_opt
srun: function slurm_spank_local_user_init
srun: slurm_spank_exit

It is as if SLURM decided: "OK, I am currently executing the sbatch script,
so I do not have to initialize the user again." This is causing us problems
because our plugin relies on that specific callback.

In addition, the application processes on that node are missing some of
SLURM's environment variables, such as SLURM_STEP_ID, SLURM_STEP_NODELIST,
SLURM_STEP_NUM_NODES, SLURM_STEP_NUM_TASKS, SLURM_STEP_TASKS_PER_NODE...
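A quick way to see the difference is to dump the step-related environment from inside the application (a hypothetical one-liner, not our actual plugin check):

```shell
# Print the SLURM step-related environment seen by the application.
# Under OpenMPI 4.0.1 + mpirun on the batch node, the list comes up empty.
env | sort | grep '^SLURM_STEP_' || echo "no SLURM_STEP_* variables set"
```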

I would like to know more about this behavior: if it is expected and here
to stay, I will have to adapt the plugin accordingly.

Thank you,
Jordi.


http://bsc.es/disclaimer
