How large is very large? Where is the executable being started from? On
the parallel filesystem/NFS? If that is the case, you may be able to trim
start times by using sbcast to transfer the executable (and its
dependencies, if it is dynamically linked) to a node-local resource such
as /tmp or /dev/shm, depending on your local configuration.
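As a minimal sketch of that approach (the executable name "my_app", the
node count, and the /tmp destination below are placeholders; adjust for
your site):

    #!/bin/bash
    #SBATCH --nodes=1024
    # Stage the binary onto node-local storage on every allocated node,
    # then launch the local copy instead of the shared-filesystem one.
    # ("my_app" and the node count are placeholders for illustration.)
    sbcast --force ./my_app /tmp/my_app
    srun /tmp/my_app

For a dynamically linked binary, "ldd ./my_app" will list the shared
libraries it needs; those can be staged with sbcast the same way, with
LD_LIBRARY_PATH pointed at the node-local directory.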
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacob...@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.ri...@hpe.com> wrote:
>
> Hi All,
>
> We've got a very large x86_64 cluster with lots of cores on each node,
> and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI
> 4.x on CentOS 7.6.
>
> We have a job that reports
>
> srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> srun: Job step 291963.0 aborted before step completely launched.
>
> when we try to run it at large scale. We anticipate that it could take
> as long as 15 minutes for the job to launch, based on our experience
> with smaller numbers of nodes.
>
> Is there a timeout setting that we're missing that can be changed to
> accommodate a lengthy startup time like this?
>
> Andy
>
> --
>
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!