We’re still in the same boat.

What I am trying to figure out is why QRSH is looking for any password in the
first place when the system is configured to use SSH keys, not passwords.

Best regards,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 12.04.17, 10:21, "William Hay" <w....@ucl.ac.uk> wrote:

    On Tue, Apr 11, 2017 at 05:11:58PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
    > I've got a serious problem here with authentication with AD and Kerberos. I have already ruled out all the possibilities I can think of outside of SGE and I can't find a solution.
    > 
    > The following scripts show how to reproduce the problem:
    > 
    > #!/bin/bash -x
    > # Usage ./qsub.sh
    > set -euo pipefail
    > host_cores=$(qhost | grep med0 | grep lx-amd64 | awk '{ sum += $3 } END { print sum }')
    > job_cores=2
    > num_jobs=$(( host_cores / job_cores ))
    > logdir=sge_log/$(date +%Y-%m-%d__%H-%M-%S)
    > mkdir -p "$logdir"
    > qsub -o "$logdir" -v JOB_CORES=${job_cores} -t 1-${num_jobs} array_stress.sh
    > 
    > 
    > #!/bin/bash -x
    > #$ -S /bin/bash
    > #$ -o sge_log
    > #$ -cwd
    > #$ -pe smp 2
    > #$ -l h_vmem=2.75G,h_rt=02:00:00
    > #$ -j y
    > set -xeuo pipefail
    > echo "Beginning..."
    > hostname
    > date
    > stress \
    >     --verbose \
    >     --cpu ${JOB_CORES} \
    >     --vm ${JOB_CORES} \
    >     --vm-bytes $(( 2 * 1024 * 1024 * 1024 )) \
    >     -t 600
    > echo "Done."
    > hostname
    > date
    > 
    > 
    > These two scripts produce this error in some of the subjobs 100% of the time.
    > 
    > error reason  ###: can't get password entry for user "jjimene". Either user does not exist or error with NIS/LDAP etc.
    You're running a program called stress and getting an error.  The obvious conclusion is that something is failing your stress test.
    
    You are running 4 workers while requesting only 2 cores.  Sure, two of them are spinning on malloc/free, but if that is non-blocking
    because your malloc is just allocating and de-allocating the same memory over and over without bothering to return it to
    the kernel, then I could easily see that being CPU-bound as well.
    
    Grid Engine will try to limit your job to a certain number of cores but depending on the version and settings
    it may not be able to restrain a program that attempts to explicitly set the cores which it uses.
    
    It doesn't look like there is anything stopping multiple instances of this stress test from trying to run on the same node, so possibly
    if you are already running several copies on a node and another one tries to start, you get a timeout due to an overloaded machine.
    
    To test this I would try logging directly into a node (i.e. outside grid-engine) and running N-1
    copies of the stress test, where N is the maximum number that grid engine would run on the node
    simultaneously, and then try getent passwd <user> on a few users to see if it fails.
    
    If you can reproduce the failure outside grid engine then I'd try playing with timeout and retry settings for anything in the
    password lookup process (e.g. sssd).
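
A minimal sketch of the lookup check described above, assuming a POSIX shell and that getent is on the PATH. The function name count_lookup_failures and the default of 20 attempts are illustrative and do not come from the thread. It repeats the passwd lookup and prints how many attempts failed, so an intermittent failure under load would show up as a non-zero count.

```shell
# Hypothetical helper (not from the thread): repeat a passwd lookup
# and print the number of attempts that failed.
#   $1 = user name to look up
#   $2 = number of attempts (assumed default: 20)
count_lookup_failures() {
    user=$1
    attempts=${2:-20}
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # getent exits non-zero when the entry cannot be resolved
        if ! getent passwd "$user" > /dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

With N-1 copies of the stress command from array_stress.sh already running on the node, a call such as count_lookup_failures jjimene 50 could then be compared against the same call on an idle node. A count of 0 when idle and a non-zero count under load would point at the lookup path rather than at Grid Engine itself.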
    
    William
    
    

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
