Are all nodes affected or only a single one? I would try adding the troublesome user to the local /etc/passwd file to see if it makes any difference.

Ondrej

> -----Original Message-----
> From: SGE-discuss [mailto:sge-discuss-boun...@liverpool.ac.uk] On Behalf
> Of juanesteban.jime...@mdc-berlin.de
> Sent: Wednesday, April 12, 2017 5:15 PM
> To: William Hay <w....@ucl.ac.uk>
> Cc: SGE-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] Kerberos authentication
>
> The problem is that GridEngine doesn’t tell me the context of the error. It
> could be returning from one of many things that are happening under qrsh
> but it doesn’t specify at what stage the error happened and who reported it.
> That makes it very difficult to troubleshoot. How do I even know if the error
> is coming from getent and not something else? This is like trying to debug an
> exception in a Java app without having the info of the exception chain.
>
> Best regards,
> Juan Jimenez
> System Administrator, BIH HPC Cluster
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
> On 12.04.17, 15:34, "William Hay" <w....@ucl.ac.uk> wrote:
>
> On Wed, Apr 12, 2017 at 12:33:07PM +0000, JuanEsteban.Jimenez@mdc-berlin.de wrote:
> > We're still in the same boat.
> >
> > What I am trying to figure out is why QRSH is looking for any password in
> > the first place when the system is configured to use SSH keys, not
> > passwords.
>
> The passwd file and the corresponding databases in NIS/LDAP/AD
> also contain mappings between a username and various useful pieces
> of information like uid, home directory and shell. These days the
> actual password is rarely contained therein. Grid engine needs that
> information (rather than the password) to start a job. With the scripts
> below I would be surprised if qrsh were invoked at all (that is, it won't
> be unless stress is a different program than the one I think it is).
>
> Your problem appears to be that the machine is not reliably looking
> up the information in the passwd database. Since this appears to
> happen flakily when you are deliberately putting stress on the nodes, my
> suspicion is that this is load related. Which is why I suggested putting
> stress on a node outside grid engine's control and then running getent
> passwd <username> a few times (preferably with different usernames),
> which is a very thin wrapper around the function call that is returning
> an error to grid engine. If this also errors then you can reproduce
> the problem without the involvement of grid engine. The solution probably
> involves tweaking the username lookup process to retry a bit more or
> allow longer timeouts.
>
> Alternatively, if the nodes only exhibit this problem when significantly
> overloaded, you could set up load_thresholds to prevent gridengine sending
> jobs to overloaded nodes.
>
> William
>
> _______________________________________________
> SGE-discuss mailing list
> SGE-discuss@liv.ac.uk
> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
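
The repeated getent test William describes can be sketched in Python as below. This is only an illustration, not something posted in the thread: the function name `probe_lookups` and its parameters are made up for the example, and it assumes that Python's `pwd.getpwnam` resolves names through the same libc/NSS lookup path that `getent passwd <username>` uses. Run it on a node while it is under load, outside grid engine's control, with usernames that are known to exist.

```python
import pwd
import time


def probe_lookups(usernames, attempts=20, delay=0.0, lookup=pwd.getpwnam):
    """Look up every username `attempts` times and count failed lookups.

    `lookup` defaults to pwd.getpwnam; it is a parameter only so that the
    loop can be exercised with a stand-in function.
    """
    failures = {name: 0 for name in usernames}
    for _ in range(attempts):
        for name in usernames:
            try:
                lookup(name)
            except (KeyError, OSError):
                # For a user that really exists, a failure here means the
                # passwd lookup itself was unreliable at that moment.
                failures[name] += 1
        time.sleep(delay)
    return failures
```

For example, `probe_lookups(["alice", "bob"], attempts=50, delay=0.1)` (hypothetical usernames) returns a dictionary of failure counts per user. Non-zero counts for existing users would reproduce the problem without involving grid engine; all zeros under load would point away from the name lookup.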