The problem is that GridEngine doesn’t tell me the context of the error. It 
could be returning from one of many things that are happening under qrsh but it 
doesn’t specify at what stage the error happened and who reported it. That 
makes it very difficult to troubleshoot. How do I even know if the error is 
coming from getent and not something else? This is like trying to debug an 
exception in a Java app without having the info of the exception chain.

Best regards,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


 

On 12.04.17, 15:34, "William Hay" <w....@ucl.ac.uk> wrote:

    On Wed, Apr 12, 2017 at 12:33:07PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
    > We’re still in the same boat.
    > 
    > What I am trying to figure out is why QRSH is looking for any password in
    > the first place when the system is configured to use SSH keys, not passwords.
    
    The passwd file and the corresponding databases in NIS/LDAP/AD
    also contain mappings between a username and various useful pieces
    of information like uid, home directory and shell.  These days the
    actual password is rarely contained therein.  Grid engine needs that
    information (rather than the password) to start a job.  With the scripts
    below I would be surprised if qrsh were invoked at all (that is, it won't
    be unless stress is a different program than the one I think it is).
    
    Your problem appears to be that the machine is not reliably looking
    up the information in the passwd database.  Since this appears to
    happen flakily when you are deliberately putting stress on the nodes, my
    suspicion is that this is load related.  Which is why I suggested putting
    stress on a node outside grid engine's control and then running getent
    passwd <username> a few times (preferably with different usernames).
    getent is a very thin wrapper around the function call that is returning
    an error to grid engine.  If this also errors then you can reproduce
    the problem without the involvement of grid engine.  The solution probably
    involves tweaking the username lookup process to retry a bit more or
    allow longer timeouts.
    
    Alternatively, if the nodes only exhibit this problem when significantly
    overloaded, you could set up load_thresholds to prevent gridengine sending
    jobs to overloaded nodes.
    
    
    William
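
The repeated-lookup test suggested above (run the passwd lookup several times for several usernames while the node is under stress) could be sketched roughly as follows. This is only an illustration, not something taken from the thread: the helper name probe_lookups and its parameters are made up here, and it uses Python's standard pwd.getpwnam, which, as far as I know, goes through the same C library passwd lookup that getent passwd <username> uses. Note that a KeyError cannot distinguish a user that does not exist from a lookup that failed, so it only makes sense with usernames that are known to exist.

```python
import pwd
import time


def probe_lookups(usernames, rounds=5, delay=0.2, lookup=pwd.getpwnam):
    """Look up every username `rounds` times and return the failures.

    `probe_lookups` and its parameters are hypothetical names for this
    sketch. `lookup` defaults to pwd.getpwnam and can be swapped out
    for testing.
    """
    failures = []
    for attempt in range(rounds):
        for name in usernames:
            try:
                lookup(name)
            except (KeyError, OSError) as exc:
                # KeyError: no entry was returned for this name.
                # OSError is caught as well in case the lookup itself
                # reports a system-level error.
                failures.append((attempt, name, repr(exc)))
        time.sleep(delay)  # short pause between rounds
    return failures
```

Calling probe_lookups(["alice", "bob"]) (placeholder usernames) on a stressed node outside grid engine's control and getting a non-empty list back would point at the username lookup rather than at grid engine.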
    

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
