Dependencies are not an appropriate approach.

---
Professor Laurence Marks (Laurie)
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi

On Wed, Dec 20, 2023, 14:40 Renfro, Michael <ren...@tntech.edu> wrote:

> Is this Northwestern’s Quest HPC or another one? I know at least a few of
> the people involved with Quest, and I wouldn’t have thought they’d be in
> dire need of coaching.
>
> And to follow on with Davide’s point, this really sounds like a case for
> submitting multiple jobs with dependencies between them, as per [1, 2, 3].
>
> [1] https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1795
> [2] https://bioinformaticsworkbook.org/Appendix/HPC/SLURM/submitting-dependency-jobs-using-slurm.html#gsc.tab=0
> [3] https://slurm.schedmd.com/sbatch.html#OPT_dependency
>
> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Laurence Marks <laurence.ma...@gmail.com>
> *Date: *Wednesday, December 20, 2023 at 1:40 PM
> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject: *Re: [slurm-users] Reproducible irreproducible problem (timeout?)
>
> *External Email Warning*
>
> *This email originated from outside the university. Please use caution
> when opening attachments, clicking links, or responding to requests.*
> ------------------------------
>
> It is a University "supercomputer", not a national facility. Hence they
> are not that expert, which is why I am asking here. I am pretty certain
> that it is some form of communication issue, but beyond that it is not
> clear.
>
> If I get suggestions such as "why don't they look for ABC in XYZ" then I
> may persuade them to look at specifics. They will need the coaching, alas.
>
> On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar <g...@arcor.de> wrote:
>
> > Laurence Marks wrote:
> >
> > > After some (irreproducible) time, often one of the three slow tasks hangs.
> > > A symptom is that if I try and ssh into the main node of the subtask (which
> > > is running 128 mpi on the 4 nodes) I get "Authentication failed".
> >
> > How about asking an admin to check why it hangs?
>
> --
> Emeritus Professor Laurence Marks (Laurie)
> Northwestern University
> Webpage <http://www.numis.northwestern.edu/> and Google Scholar link
> <http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought", Albert Szent-Györgyi