Occasionally, when our Open MPI SHMEM jobs exit, we see the following message:

srun: fatal: ../../../src/api/step_launch.c:1037 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy

Our environment:

 * 100+ node KNL cluster
 * CentOS 7.4
 * Open MPI 3.x (an interim kit between 3.0 and 3.1)
 * Slurm 17.11.0

This was reported at <https://bugs.schedmd.com/show_bug.cgi?id=4333> against a 17.11.0 RC kit, but we are seeing it now in the 17.11.0 released kit (I confirmed that Moe's fix appears in our sources). Has anyone else seen this? Or better yet, has anyone found a way to fix it?

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
