We have a user who wants to run multiple instances of a single-process job across a cluster, using a loop like

-----
# Launch one background srun per node
for N in $nodelist; do
    srun -w "$N" program &
done
# Wait for all of the background srun invocations to finish
wait
-----

This works up to roughly a thousand nodes (jobs are allocated by node here), but as the number of jobs submitted grows, we intermittently see a variety of error messages, such as

 * srun: error: Ignoring job_complete for job 100035 because our job ID
   is 102937
 * srun: error: io_init_msg_read too small
 * srun: error: task 0 launch failed: Unspecified error
 * srun: error: Unable to allocate resources: Job/step already
   completing or completed
 * srun: error: Unable to allocate resources: No error
 * srun: error: unpack error in io_init_msg_unpack
 * srun: Job step 211042.0 aborted before step completely launched.

We have tried setting

   ulimit -n 500000    # max open file descriptors
   ulimit -u 64000     # max user processes

but that wasn't sufficient.
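
(For context, a rough way to compare the limits seen by the launching shell with those seen by a remotely launched step would be something like the sketch below; "somenode" is just a placeholder for one of the nodes in $nodelist.)

-----
# Sketch: report the file-descriptor and process limits locally,
# then for a step launched on a compute node
ulimit -n; ulimit -u
srun -w somenode bash -c 'ulimit -n; ulimit -u'
-----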

The environment:

 * CentOS 7.3 (x86_64)
 * Slurm 17.11.0

Does this ring any bells? Any thoughts about how we should proceed?

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
