I've got a serious authentication problem with AD and Kerberos. I 
have already ruled out every cause I can think of outside of SGE 
and I can't find a solution. 

The following scripts show how to reproduce the problem:

#!/bin/bash -x
# Usage ./qsub.sh
set -euo pipefail
# Sum the core count (qhost column 3, NCPU) over the lx-amd64 hosts matching med0
host_cores=$(qhost | grep med0 | grep lx-amd64 | awk '{ sum += $3 } END { print sum }')
job_cores=2
num_jobs=$(( host_cores / job_cores ))
logdir=sge_log/$(date +%Y-%m-%d__%H-%M-%S)
mkdir -p "$logdir"
qsub -o "$logdir" -v JOB_CORES=${job_cores} -t 1-${num_jobs} array_stress.sh


#!/bin/bash -x
#$ -S /bin/bash
#$ -o sge_log
#$ -cwd
#$ -pe smp 2
#$ -l h_vmem=2.75G,h_rt=02:00:00
#$ -j y
set -xeuo pipefail
echo "Beginning..."
hostname
date
stress \
    --verbose \
    --cpu ${JOB_CORES} \
    --vm ${JOB_CORES} \
    --vm-bytes $(( 2 * 1024 * 1024 * 1024 )) \
    -t 600
echo "Done."
hostname
date


On every run, these two scripts produce the following error in some of the subjobs.

error reason  ###: can't get password entry for user "jjimene". Either user 
does not exist or error with NIS/LDAP etc.

All my AD configuration is correct. The firewall is not blocking any traffic 
to AD. A test join to ADS works fine. I have run authconfig, etc. and 
restarted the sssd daemon. The OS is CentOS 7.1.
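
The error text suggests that the passwd lookup for the job owner failed on 
the exec host at the moment the subjob started. Below is a minimal sketch for 
catching an intermittent failure of that kind. It assumes that `id` on the 
node goes through the same NSS/SSSD path as the execution daemon. The function 
name and the attempt count are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: repeat the passwd lookup for a user and print
# how many of the attempts failed.
count_lookup_failures() {
    local user=$1 attempts=${2:-20} failures=0 i
    for (( i = 0; i < attempts; i++ )); do
        # id -u resolves the name through NSS (and so through SSSD, if configured)
        if ! id -u "$user" >/dev/null 2>&1; then
            failures=$(( failures + 1 ))
        fi
    done
    echo "$failures"
}

# Example, to be run on a suspect exec node:
# count_lookup_failures jjimene 50
```

A non-zero count on some nodes but not others would point at the name service 
on those nodes rather than at SGE itself.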

What's going on?

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of 
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Sunday, April 09, 2017 12:28
To: Orion Poplawski; sge-disc...@liverpool.ac.uk
Subject: Re: [SGE-discuss] Kerberos authentication

I have a lot of problems with AD, Kerberos, SSSD, LDAP and GridEngine, but I 
think they are related to the fact that I connect to AD servers that do not 
synchronize with the master quickly enough. Once in a while I have to clear 
the SSSD cache and restart the SSSD services on all the nodes, and until the 
caches manage to repopulate, qrsh refuses to open a new shell unless I point 
it to a node that I know is working.
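
The cache reset described above could look roughly like this for each node. 
`sss_cache -E` invalidates all cached entries (it does not delete the cache 
files) and `systemctl restart sssd` restarts the service. The wrapper 
function, the `DRY_RUN` switch, root ssh access and the node name are 
assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper: invalidate the SSSD cache and restart sssd on one node.
refresh_sssd() {
    local node=$1
    local cmd='sss_cache -E && systemctl restart sssd'
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print what would be run
        echo "ssh root@${node} \"${cmd}\""
    else
        ssh "root@${node}" "$cmd"
    fi
}

# Example with a hypothetical node name:
# DRY_RUN=0 refresh_sssd med0101
```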

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of Orion 
Poplawski [or...@cora.nwra.com]
Sent: Thursday, April 06, 2017 22:46
To: sge-disc...@liverpool.ac.uk
Subject: [SGE-discuss] Kerberos authentication

    I've built the gss utils with 'aimk -gss' and am testing with
security_mode set to kerberos.
   In my first attempt I tried to make use of gssproxy to store the
sge/qmaster principal, but unfortunately it appears that gssproxy is too old
on EL7 to handle storing the delegated credential for us:

put_cred stderr: GSS-API error copying delegated creds to ccache: The
operation or option is not available or unsuppo

    Next attempt was to set:

KRB5_KTNAME=FILE:/var/spool/gridengine/sge.keytab

in the environment of the daemons and store the sge/host principals there.
This avoids needing to run qmaster as root to access /etc/krb5.keytab.  Need a
sge service principal for the qmaster and each of the exec hosts, which seems
appropriate.
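
A sketch of checking that keytab from the shell, assuming the MIT Kerberos 
`klist` is installed. The helper name is hypothetical, and the `sge/<fqdn>` 
principal format in the example is an assumption based on the description 
above.

```shell
#!/bin/bash
export KRB5_KTNAME=FILE:/var/spool/gridengine/sge.keytab

# Hypothetical helper: succeed if the keytab lists the given service principal.
keytab_has_principal() {
    local principal=$1
    # klist -k lists the entries of the named keytab
    klist -k "$KRB5_KTNAME" 2>/dev/null | grep -qF -- "$principal"
}

# Example: check the sge principal for this host
# keytab_has_principal "sge/$(hostname -f)" || echo "missing sge principal" >&2
```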

   Another issue I ran into is that I'm running in an IPA/Active Directory
trust setup where the users are stored in the AD domain, and the hosts are in
the IPA domain.  Therefore the code in gsslib_put_credentials that was using
gss_compare_name() to compare users ended up comparing "orion" to
"or...@ad.nwra.com".  I changed that to also try using gss_localname() to
convert the client principal to a local username and comparing that.

   Also, the later code that called krb5_kuserok() segfaulted because it was
erroneously casting gss_name_t to krb5_principal.  I've started work changing
that to do the conversion properly but as of now that is untested.

  There are also a bunch of memory leaks in this code that probably should be
cleaned up, although at the moment this is all run in short lived executables.

  Finally, I needed to tweak my peopen() patch to run put_cred and delete_cred
as root on the exec hosts since they need to change the ownership and remove
files of the user running the job.

  At least for a simple test case, this appears to be working now for me, so
I'm fairly pleased.  Next issue I expect to face is renewing and expiring user
credentials for long running jobs.

--
Orion Poplawski
Technical Manager                          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                   http://www.nwra.com
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss