The problem is that GridEngine doesn’t tell me the context of the error. It could be returning from one of many things that are happening under qrsh but it doesn’t specify at what stage the error happened and who reported it. That makes it very difficult to troubleshoot. How do I even know if the error is coming from getent and not something else? This is like trying to debug an exception in a Java app without having the info of the exception chain.
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 12.04.17, 15:34, "William Hay" <w....@ucl.ac.uk> wrote:

On Wed, Apr 12, 2017 at 12:33:07PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
> We're still in the same boat.
>
> What I am trying to figure out is why QRSH is looking for any password in the first place when the system is configured to use SSH keys, not passwords.

The passwd file and the corresponding databases in NIS/LDAP/AD also contain mappings between a username and various useful pieces of information like uid, home directory and shell. These days the actual password is rarely contained therein. Grid engine needs that information (rather than the password) to start a job.

With the scripts below I would be surprised if qrsh were invoked at all (that is it won't be unless stress is a different program than the one I think it is).

Your problem appears to be that the machine is not reliably looking the information in the passwd database up. Since this appears to happen flakily when you are deliberately putting stress on the nodes my suspicion is that this is load related. Which is why I suggested putting stress on a node outside grid engine's control and then running getent passwd <username> a few times (preferably with different usernames) which is a very thin wrapper around the function call that is returning an error to grid engine. If this also errors then you can reproduce the problem without the involvement of grid engine.

Solution probably involves tweaking the username lookup process to retry a bit more or allow longer timeouts. Alternatively if the nodes only exhibit this problem when significantly overloaded you could setup load_thresholds to prevent gridengine sending jobs to overloaded nodes.

William

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
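
The reproduction William describes (repeated passwd lookups on a stressed node, outside Grid Engine) could be scripted roughly as follows. This is only a sketch under assumptions: it uses Python's pwd.getpwnam rather than the getent command, on the understanding that both go through the same system lookup path, and the helper names probe_lookups and summarize, the round count and the delay are made up for illustration. Run it while the node is under load and compare against an idle node.

```python
import pwd
import sys
import time


def probe_lookups(usernames, rounds=5, lookup=pwd.getpwnam, delay=0.0):
    """Look up each username `rounds` times, recording outcome and timing.

    Returns a list of (username, ok, seconds, error) tuples.
    `lookup` is injectable so the loop can be exercised without a real
    passwd database (hypothetical parameter, for illustration only).
    """
    results = []
    for _ in range(rounds):
        for name in usernames:
            start = time.monotonic()
            try:
                lookup(name)
                ok, error = True, None
            except KeyError as exc:
                # pwd.getpwnam raises KeyError when no entry comes back,
                # which is also what a failed lookup would look like here.
                ok, error = False, repr(exc)
            except OSError as exc:
                # Caught defensively in case the lookup surfaces an OS error.
                ok, error = False, repr(exc)
            elapsed = time.monotonic() - start
            results.append((name, ok, elapsed, error))
            if delay:
                time.sleep(delay)
    return results


def summarize(results):
    """Count lookups and failures and report the slowest single lookup."""
    failed = sum(1 for r in results if not r[1])
    slowest = max((r[2] for r in results), default=0.0)
    return {"total": len(results), "failed": failed, "slowest_s": slowest}


if __name__ == "__main__":
    # Usernames come from the command line; "root" is only a fallback.
    names = sys.argv[1:] or ["root"]
    res = probe_lookups(names, rounds=20, delay=0.1)
    print(summarize(res))
    for name, ok, elapsed, error in res:
        if not ok:
            print(f"FAILED {name} after {elapsed:.3f}s: {error}")
```

If known-good usernames fail intermittently, or lookups become very slow only under load, that points at the name service lookup rather than at Grid Engine itself.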