This issue has been resolved. For the benefit of those who don’t like to bang 
their heads against the wall just to see how good it feels when you stop, 
here’s the root cause. Please chime in with corrections if I say something that 
doesn’t sound kosher.

Some of our nodes were not joined correctly into Active Directory, and had 
duplicate entries for the server. I was told that I could verify the join with 
“net ads testjoin” but that is not correct – apparently that only tells you if 
you _can_ join, not if you _are_ joined.

The correct way to figure out if the node has a good join is “getent passwd 
<username>”. If it returns nothing, the node cannot resolve the user and you 
are dead in the water.
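
For anyone who wants to script that check across several users, here is a minimal sketch. It assumes Python 3 is available on the node. The function names (can_resolve, check_users) are made up for illustration and are not part of GridEngine or Samba. It relies on the standard pwd module, which as far as I know goes through the same system lookup that getent passwd uses, so a KeyError here should correspond to an empty getent response.

```python
import pwd
import sys


def can_resolve(username):
    """Return True if the system can look up this user (what getent passwd tests)."""
    try:
        pwd.getpwnam(username)  # raises KeyError when no entry is found
        return True
    except KeyError:
        return False


def check_users(usernames, attempts=3):
    """Look up each user several times; return {username: failed_lookup_count}."""
    results = {}
    for name in usernames:
        failed = 0
        for _ in range(attempts):
            if not can_resolve(name):
                failed += 1
        results[name] = failed
    return results


if __name__ == "__main__":
    # Hypothetical usage: python3 check_join.py alice bob
    for name, failed in check_users(sys.argv[1:]).items():
        status = "OK" if failed == 0 else "%d failed lookups" % failed
        print("%s: %s" % (name, status))
```

Repeating each lookup a few times follows William's suggestion from earlier in the thread, since a flaky lookup may succeed on one attempt and fail on the next.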

This was causing not only the error I described at the beginning of the 
thread, but also the failed qrsh sessions. I think, but I am not sure, that 
it is also behind the stat() failures on our GPFS storage cluster when a ton 
of jobs try to access the same log folder within the GPFS file system; we do 
not see those failures on NFS.

Anyway, resolving the AD issue by deleting the duplicate server entries in AD 
and then rejoining has cleared this up.

The ultimate root cause of all this is clearly lack of training. I need more 
knowledge about Linux AD integration. No problem admitting that. :)

THANK YOU to everyone who chimed in and tried to help. Much appreciated. :)

Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


 

On 13.04.17, 09:29, "Ondrej Valousek" <ondrej.valou...@s3group.com> wrote:

    Are all nodes affected or only a single one?
    I would try to add the troublesome user in the local /etc/passwd file to
    see if it makes any difference.
    Ondrej
    
    > -----Original Message-----
    > From: SGE-discuss [mailto:sge-discuss-boun...@liverpool.ac.uk] On Behalf
    > Of juanesteban.jime...@mdc-berlin.de
    > Sent: Wednesday, April 12, 2017 5:15 PM
    > To: William Hay <w....@ucl.ac.uk>
    > Cc: SGE-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
    > Subject: Re: [SGE-discuss] Kerberos authentication
    > 
    > The problem is that GridEngine doesn’t tell me the context of the error. It
    > could be returning from one of many things that are happening under qrsh
    > but it doesn’t specify at what stage the error happened and who reported it.
    > That makes it very difficult to troubleshoot. How do I even know if the error
    > is coming from getent and not something else? This is like trying to debug an
    > exception in a Java app without having the info of the exception chain.
    > 
    > Mfg,
    > Juan Jimenez
    > System Administrator, BIH HPC Cluster
    > MDC Berlin / IT-Dept.
    > Tel.: +49 30 9406 2800
    > 
    > 
    > 
    > 
    > On 12.04.17, 15:34, "William Hay" <w....@ucl.ac.uk> wrote:
    > 
    >     On Wed, Apr 12, 2017 at 12:33:07PM +0000, JuanEsteban.Jimenez@mdc-berlin.de wrote:
    >     > We’re still in the same boat.
    >     >
    >     > What I am trying to figure out is why QRSH is looking for any password in
    >     > the first place when the system is configured to use SSH keys, not
    >     > passwords.
    > 
    >     The passwd file and the corresponding databases in NIS/LDAP/AD
    >     also contain mappings between a username and various useful pieces
    >     of information like uid, home directory and shell.  These days the
    >     actual password is rarely contained therein.  Grid engine needs that
    >     information (rather than the password) to start a job.  With the scripts
    >     below I would be surprised if qrsh were invoked at all (that is it won't
    >     be unless stress is a different program than the one I think it is).
    > 
    >     Your problem appears to be that the machine is not reliably looking
    >     the information in the passwd database up.  Since this appears to
    >     happen flakily when you are deliberately putting stress on the nodes my
    >     suspicion is that this is load related.  Which is why I suggested putting
    >     stress on a node outside grid engine's control and then running getent
    >     passwd <username> a few times (preferably with different usernames)
    >     which is a very thin wrapper around the function call that is returning
    >     an error to grid engine.  If this also errors then you can reproduce
    >     the problem without the involvement of grid engine.  Solution probably
    >     involves tweaking the username lookup process to retry a bit more or
    >     allow longer timeouts.
    > 
    >     Alternatively if the nodes only exhibit this problem when significantly
    >     overloaded you could set up load_thresholds to prevent gridengine sending
    >     jobs to overloaded nodes.
    > 
    > 
    >     William
    > 
    > 
    > _______________________________________________
    > SGE-discuss mailing list
    > SGE-discuss@liv.ac.uk
    > https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
    

