SGE is NOT looking for any passwords. The error you receive says only one thing:

> getent passwd <username>

is failing on your nodes.
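A minimal sketch of that check, meant to be run on an execution node. The function name `check_user` is hypothetical; it only wraps `getent passwd`, which asks the node's configured name services for the user's passwd entry.

```shell
#!/bin/bash
# check_user (hypothetical helper): report whether this host can resolve
# a user name to a passwd entry via `getent passwd`.
check_user() {
    local user
    user=$1
    if getent passwd "$user" > /dev/null; then
        echo "OK: $(uname -n) resolves $user"
    else
        echo "FAIL: $(uname -n) cannot resolve $user" >&2
        return 1
    fi
}
```

Running `check_user jjimene` on each node where a subjob failed should show whether the lookup itself is the problem, independent of SGE.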
You need to ensure sssd/ypbind/whatever is running on your nodes so that they can recognize the user. SGE does not use passwords to authenticate users.

Ondrej

> -----Original Message-----
> From: SGE-discuss [mailto:sge-discuss-boun...@liverpool.ac.uk] On Behalf Of juanesteban.jime...@mdc-berlin.de
> Sent: Wednesday, April 12, 2017 2:33 PM
> To: William Hay <w....@ucl.ac.uk>
> Cc: SGE-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] Kerberos authentication
>
> We’re still in the same boat.
>
> What I am trying to figure out is why QRSH is looking for any password in the first place when the system is configured to use SSH keys, not passwords.
>
> Best regards,
> Juan Jimenez
> System Administrator, BIH HPC Cluster
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
> On 12.04.17, 10:21, "William Hay" <w....@ucl.ac.uk> wrote:
>
> On Tue, Apr 11, 2017 at 05:11:58PM +0000, JuanEsteban.Jimenez@mdc-berlin.de wrote:
> > I've got a serious problem here with authentication with AD and Kerberos. I have already ruled out all the possibilities I can think of outside of SGE and I can't find a solution.
> >
> > The following scripts show how to reproduce the problem:
> >
> > #!/bin/bash -x
> > # Usage ./qsub.sh
> > set -euo pipefail
> > host_cores=$(qhost | grep med0 | grep lx-amd64 | awk '{ sum += $3 } END { print sum }')
> > job_cores=2
> > num_jobs=$(( host_cores / 2 ))
> > logdir=sge_log/$(date +%Y-%m-%d__%H-%M-%S)
> > mkdir -p $logdir
> > qsub -o $logdir -v JOB_CORES=${job_cores} -t 1-${num_jobs} array_stress.sh
> >
> >
> > #!/bin/bash -x
> > #$ -S /bin/bash
> > #$ -o sge_log
> > #$ -cwd
> > #$ -pe smp 2
> > #$ -l h_vmem=2.75G,h_rt=02:00:00
> > #$ -j y
> > set -xeuo pipefail
> > echo "Beginning..."
> > hostname
> > date
> > stress \
> >     --verbose \
> >     --cpu ${JOB_CORES} \
> >     --vm ${JOB_CORES} \
> >     --vm-bytes $(( 2 * 1024 * 1024 * 1024 )) \
> >     -t 600
> > echo "Done."
> > hostname
> > date
> >
> >
> > These two scripts 100% of the time create this error in some of the subjobs.
> >
> > error reason ###: can't get password entry for user "jjimene". Either user does not exist or error with NIS/LDAP etc.
>
> You're running a program called stress and getting an error. The obvious conclusion is that something is failing your stress test.
>
> You are running 4 workers while requesting only 2 cores. Sure, two of them are spinning on malloc/free, but if that is non-blocking because your malloc is just allocating and de-allocating the same memory over and over without bothering to return it to the kernel, then I could easily see that being CPU-bound as well.
>
> Grid Engine will try to limit your job to a certain number of cores, but depending on the version and settings it may not be able to restrain a program that attempts to explicitly set the cores which it uses.
>
> It doesn't look like there is anything stopping multiple instances of this stress test trying to run on the same node, so possibly if you are already running several copies on a node and another one tries to start, you get a timeout due to an overloaded machine.
>
> To test this I would try logging directly into a node (i.e. outside Grid Engine) and running N-1 copies of the stress test, where N is the maximum number that Grid Engine would run on the node simultaneously, and then try getent passwd <user> on a few users to see if it fails.
>
> If you can reproduce the failure outside Grid Engine then I'd try playing with timeout and retry settings for anything in the password lookup process (e.g. sssd).
>
> William
>
> _______________________________________________
> SGE-discuss mailing list
> SGE-discuss@liv.ac.uk
> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
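William's suggested test (load a node outside Grid Engine, then see whether user lookups start failing) could be sketched roughly as follows. The function name `probe_lookups` and the `PROBE_INTERVAL` variable are hypothetical and not part of SGE or sssd; the sketch only repeats `getent passwd` and counts failures, and it assumes the stress copies have already been started by hand on the node.

```shell
#!/bin/bash
# probe_lookups (hypothetical helper): run `getent passwd USER` COUNT times,
# pausing PROBE_INTERVAL seconds (default 1) between attempts, and print
# the number of lookups that failed.
probe_lookups() {
    local user count failures i
    user=$1
    count=$2
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! getent passwd "$user" > /dev/null; then
            failures=$(( failures + 1 ))
        fi
        i=$(( i + 1 ))
        sleep "${PROBE_INTERVAL:-1}"
    done
    echo "$failures"
}
```

With N-1 copies of the stress command from the job script running on the node, `probe_lookups jjimene 60` would print how many of 60 lookups failed. A non-zero count outside Grid Engine would point at the lookup service (e.g. sssd) rather than at SGE.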