It's a DNS problem, isn't it? Seriously though - how long does srun hostname take for a single system?
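A quick way to measure that (a sketch; node counts and the hostname are placeholders for your site):

```shell
# Time a trivial single-task launch -- this isolates Slurm's launch path
# from the application, MPI, and the parallel filesystem.
time srun -N 1 -n 1 hostname

# Grow the node count to see how launch time scales.
time srun -N 100 -n 100 hostname

# If DNS is the suspect, time a lookup on a compute node as well
# ("node0001" is a placeholder hostname):
time getent hosts node0001
```

If the single-node `srun hostname` is already slow, the problem is in the launch/communication path rather than in application startup.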
On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:
> We have 12,000 nodes in our system, 9,600 of which are KNL. We can
> start a parallel application within a few seconds in most cases (when
> the machine is dedicated to this task), even at full scale. So I
> don't think there is anything intrinsic to Slurm that would
> necessarily be limiting you, though we have seen cases in the past
> where arbitrary task distribution has caused controller slow-down
> issues as the detailed scheme was parsed.
>
> Do you know if all the slurmstepd's are starting quickly on the
> compute nodes? How is the OS/Slurm/executable delivered to the node?
> ----
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> Acting Group Lead, Computational Systems Group
> National Energy Research Scientific Computing Center
> dmjacob...@lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/ (_)__________________________
>
>
> On Fri, Apr 26, 2019 at 7:40 AM Riebs, Andy <andy.ri...@hpe.com> wrote:
> >
> > Thanks for the quick response, Doug!
> >
> > Unfortunately, I can't be specific about the cluster size, other than
> > to say it's got more than a thousand nodes.
> >
> > In a separate test that I had missed, even "srun hostname" took 5
> > minutes to run. So there was no remote file system or MPI involvement.
> >
> > Andy
> >
> > -----Original Message-----
> > From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On
> > Behalf Of Douglas Jacobsen
> > Sent: Friday, April 26, 2019 9:24 AM
> > To: Slurm User Community List <slurm-users@lists.schedmd.com>
> > Subject: Re: [slurm-users] job startup timeouts?
> >
> > How large is very large? Where is the executable being started? In
> > the parallel filesystem/NFS? If that is the case, you may be able to
> > trim start times by using sbcast to transfer the executable (and its
> > dependencies, if dynamically linked) into a node-local resource, such
> > as /tmp or /dev/shm, depending on your local configuration.
> >
> > On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.ri...@hpe.com> wrote:
> > >
> > > Hi All,
> > >
> > > We've got a very large x86_64 cluster with lots of cores on each
> > > node, and hyper-threading enabled. We're running Slurm 18.08.7 with
> > > Open MPI 4.x on CentOS 7.6.
> > >
> > > We have a job that reports
> > >
> > >     srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> > >     srun: Job step 291963.0 aborted before step completely launched.
> > >
> > > when we try to run it at large scale. We anticipate that it could
> > > take as long as 15 minutes for the job to launch, based on our
> > > experience with smaller numbers of nodes.
> > >
> > > Is there a timeout setting that we're missing that can be changed to
> > > accommodate a lengthy startup time like this?
> > >
> > > Andy
> > >
> > > --
> > > Andy Riebs
> > > andy.ri...@hpe.com
> > > Hewlett-Packard Enterprise
> > > High Performance Computing Software Engineering
> > > +1 404 648 9024
> > > My opinions are not necessarily those of HPE
> > > May the source be with you!
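To make the sbcast suggestion concrete, a batch script along these lines would stage the binary before the big launch (a sketch; the node count, task count, and executable path are placeholders, not from the thread):

```shell
#!/bin/bash
#SBATCH -N 1024
#SBATCH --ntasks-per-node=64

# Copy the executable from the shared filesystem to node-local /tmp
# on every node in the allocation, once, before the parallel launch.
sbcast /global/project/myapp /tmp/myapp

# Launch from the node-local copy so thousands of nodes don't all
# hit the parallel filesystem simultaneously at startup.
srun /tmp/myapp
```

Dynamically linked dependencies can be staged the same way with additional sbcast calls (or avoided by linking statically). As for the original timeout question: my understanding is that the launch timeout srun applies is derived from MessageTimeout in slurm.conf, so raising that value may buy more startup time, at the cost of slower failure detection; verify against the documentation for your Slurm version.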