Good day,

We have a cluster with several MPI distributions and SLURM serving as the queue manager. We also have a SLURM SPANK plugin. It is simple: you define some functions in a shared library, and SLURM loads the library and calls them at the appropriate points.
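For context, a minimal plugin looks roughly like this (a sketch; the hook names, types, and return codes are the standard ones from slurm/spank.h, while the plugin name and log messages are illustrative):

    #include <slurm/spank.h>

    /* Required boilerplate: registers this shared library as a
     * SPANK plugin. The name "example" is illustrative. */
    SPANK_PLUGIN(example, 1);

    /* Called in every context when SLURM loads the plugin. */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        slurm_info("function slurm_spank_init");
        return ESPANK_SUCCESS;
    }

    /* Called on the compute node, in remote (slurmstepd) context,
     * as the user -- this is the hook that goes missing in the
     * case described below. */
    int slurm_spank_user_init(spank_t sp, int ac, char **av)
    {
        slurm_info("function slurm_spank_user_init");
        return ESPANK_SUCCESS;
    }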
The issue arises with OpenMPI 4.0.1 (and possibly later) and the mpirun command; as far as I know, no other combination reproduces it. If your job has a reservation of 2 or more nodes and you launch an sbatch script with mpirun commands inside, the node that executes the sbatch script does not follow the normal SPANK plugin call pipeline: it skips the function "slurm_spank_user_init" after an mpirun.

This is the OpenMPI 3.1.4 version:

    srun: function slurm_spank_init
    srun: function slurm_spank_init_post_opt
    srun: function slurm_spank_local_user_init
    remote: function slurm_spank_user_init
    srun: slurm_spank_exit

This is the OpenMPI 4.0.1 version:

    srun: function slurm_spank_init
    srun: function slurm_spank_init_post_opt
    srun: function slurm_spank_local_user_init
    srun: slurm_spank_exit

It is as if SLURM decided "OK, I am currently reading the sbatch script, so I don't have to initialize the user again." This is causing us problems because we rely on that specific function. In addition, in the plugin instance on that node we are missing some of SLURM's environment variables, such as SLURM_STEP_ID, SLURM_STEP_NODELIST, SLURM_STEP_NUM_NODES, SLURM_STEP_NUM_TASKS, SLURM_STEP_TASKS_PER_NODE...

I would like to know more about this, because if it is perfectly normal behavior and will remain, I will have to make changes to the plugin.
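In case it helps to narrow this down, one thing we can do while investigating is log the SPANK context on each node, along these lines (a sketch; spank_context() and the S_CTX_* values come from slurm/spank.h, while the plugin name and messages are illustrative):

    #include <slurm/spank.h>

    SPANK_PLUGIN(context_probe, 1);  /* illustrative name */

    /* Log which SPANK context this instance of the plugin runs in,
     * so the batch-script node can be compared with the others. */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        switch (spank_context()) {
        case S_CTX_LOCAL:      slurm_info("context: local (srun)");        break;
        case S_CTX_REMOTE:     slurm_info("context: remote (slurmstepd)"); break;
        case S_CTX_ALLOCATOR:  slurm_info("context: allocator (salloc)");  break;
        case S_CTX_SLURMD:     slurm_info("context: slurmd");              break;
        case S_CTX_JOB_SCRIPT: slurm_info("context: batch job script");    break;
        default:               slurm_info("context: unknown/error");       break;
        }
        return ESPANK_SUCCESS;
    }

On the affected node this should at least tell us in which context the plugin is loaded while the sbatch script runs.

Thank you,
Jordi.

http://bsc.es/disclaimer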