We’re still in the same boat. What I am trying to figure out is why QRSH is looking for any password in the first place when the system is configured to use SSH keys, not passwords.

Kind regards,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 12.04.17, 10:21, "William Hay" <w....@ucl.ac.uk> wrote:

On Tue, Apr 11, 2017 at 05:11:58PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
> I've got a serious problem here with authentication with AD and Kerberos. I have already done away with all the possibilities I can think of outside of SGE and I can't find a solution.
>
> The following scripts show how to reproduce the problem:
>
> #!/bin/bash -x
> # Usage ./qsub.sh
> set -euo pipefail
> host_cores=$(qhost | grep med0 | grep lx-amd64 | awk '{ sum += $3 } END { print sum }')
> job_cores=2
> num_jobs=$(( host_cores / 2 ))
> logdir=sge_log/$(date +%Y-%m-%d__%H-%M-%S)
> mkdir -p $logdir
> qsub -o $logdir -v JOB_CORES=${job_cores} -t 1-${num_jobs} array_stress.sh
>
>
> #!/bin/bash -x
> #$ -S /bin/bash
> #$ -o sge_log
> #$ -cwd
> #$ -pe smp 2
> #$ -l h_vmem=2.75G,h_rt=02:00:00
> #$ -j y
> set -xeuo pipefail
> echo "Beginning..."
> hostname
> date
> stress \
>     --verbose \
>     --cpu ${JOB_CORES} \
>     --vm ${JOB_CORES} \
>     --vm-bytes $(( 2 * 1024 * 1024 * 1024 )) \
>     -t 600
> echo "Done."
> hostname
> date
>
>
> These two scripts 100% of the time create this error in some of the subjobs.
>
> error reason ###: can't get password entry for user "jjimene". Either user does not exist or error with NIS/LDAP etc.

You're running a program called stress and getting an error. The obvious conclusion is that something is failing your stress test. You are running 4 workers while requesting only 2 cores. Sure, two of them are spinning on malloc/free, but if that is non-blocking because your malloc is just allocating and de-allocating the same memory over and over without bothering to return it to the kernel, then I could easily see that being CPU bound as well.

Grid Engine will try to limit your job to a certain number of cores, but depending on the version and settings it may not be able to restrain a program that attempts to explicitly set the cores which it uses. It doesn't look like there is anything stopping multiple instances of this stress test trying to run on the same node, so possibly if you are already running several copies on a node and another one tries to start you get a timeout due to an overloaded machine.

To test this I would try logging directly into a node (i.e. outside Grid Engine) and running N-1 copies of the stress test, where N is the maximum number that Grid Engine would run on the node simultaneously, and then try getent passwd <user> on a few users to see if it fails. If you can reproduce the failure outside Grid Engine then I'd try playing with timeout and retry settings for anything in the password lookup process (e.g. sssd).

William

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
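
The test William describes (load a node with N-1 copies of the stress run, then probe user lookups) could be scripted roughly as follows. This is only a sketch under assumptions: the function names and the MAX_JOBS, ROUNDS and 5-second polling values are hypothetical and do not come from the thread; only the stress options and getent passwd are taken from the messages above. Note that each copy allocates 2 x 2 GiB, as in the original job script.

```shell
#!/bin/sh
# Sketch only: MAX_JOBS, ROUNDS and the 5 s polling interval are assumed values.
MAX_JOBS=${MAX_JOBS:-8}     # N: assumed maximum simultaneous jobs on the node
JOB_CORES=${JOB_CORES:-2}   # same value the qsub wrapper passes to the job
ROUNDS=${ROUNDS:-20}        # how many times to probe the user lookups

# Print how many of the given users cannot be resolved with getent.
count_lookup_failures() {
    failures=0
    for user in "$@"; do
        if ! getent passwd "$user" > /dev/null; then
            echo "lookup failed for $user at $(date +%T)" >&2
            failures=$(( failures + 1 ))
        fi
    done
    echo "$failures"
}

# Start N-1 background stress instances, then probe lookups under load.
run_load_test() {
    pids=""
    i=1
    while [ "$i" -lt "$MAX_JOBS" ]; do
        stress --cpu "$JOB_CORES" --vm "$JOB_CORES" \
               --vm-bytes $(( 2 * 1024 * 1024 * 1024 )) -t 600 > /dev/null &
        pids="$pids $!"
        i=$(( i + 1 ))
    done
    total=0
    round=0
    while [ "$round" -lt "$ROUNDS" ]; do
        n=$(count_lookup_failures "$@")
        total=$(( total + n ))
        round=$(( round + 1 ))
        sleep 5
    done
    # Ask the stress instances to stop; -t 600 is the backstop if they do not.
    [ -n "$pids" ] && kill $pids 2>/dev/null
    echo "failed lookups: $total"
}
```

Sourced on a compute node outside Grid Engine, it might be used as: MAX_JOBS=8 run_load_test jjimene root. A non-zero count would suggest the lookup failure is reproducible without SGE, which points at the name service (e.g. sssd timeouts) rather than at qrsh.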