It is a university "supercomputer", not a national facility, so the admins are not that expert, which is why I am asking here. I am fairly certain that it is some form of communication issue, but beyond that it is not clear.
If I get suggestions such as "why don't they look for ABC in XYZ" then I may persuade them to look at specifics. They will need the coaching, alas.

On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar <g...@arcor.de> wrote:

> Laurence Marks wrote:
>
> > After some (irreproducible) time, often one of the three slow tasks hangs.
> > A symptom is that if I try and ssh into the main node of the subtask (which
> > is running 128 mpi on the 4 nodes) I get "Authentication failed".
>
> How about asking an admin to check why it hangs?

--
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
Webpage <http://www.numis.northwestern.edu> and Google Scholar link <http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi