How large is very large? Where is the executable being started from? On
the parallel filesystem/NFS? If that is the case, you may be able to trim
start times by using sbcast to transfer the executable (and its
dependencies, if it is dynamically linked) to a node-local resource such
as /tmp or /dev/shm, depending on your local configuration.
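As a minimal sketch of that approach (the executable name "my_app", the
node count, and the /tmp destination below are placeholders; adjust for
your site):

    #!/bin/bash
    #SBATCH --nodes=1024
    # Stage the binary onto node-local storage on every allocated node,
    # then launch the local copy instead of the shared-filesystem one.
    # ("my_app" and the node count are placeholders for illustration.)
    sbcast --force ./my_app /tmp/my_app
    srun /tmp/my_app

For a dynamically linked binary, "ldd ./my_app" will list the shared
libraries it needs; those can be staged with sbcast the same way, with
LD_LIBRARY_PATH pointed at the node-local directory.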
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacob...@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.ri...@hpe.com> wrote:
>
> Hi All,
>
> We've got a very large x86_64 cluster with lots of cores on each node,
> and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI
> 4.x on CentOS 7.6.
>
> We have a job that reports
>
> srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> srun: Job step 291963.0 aborted before step completely launched.
>
> when we try to run it at large scale. We anticipate that it could take
> as long as 15 minutes for the job to launch, based on our experience
> with smaller numbers of nodes.
>
> Is there a timeout setting that we're missing that can be changed to
> accommodate a lengthy startup time like this?
>
> Andy
>
> --
>
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!