I've got a serious problem here with authentication with AD and Kerberos. I have already ruled out every possibility I can think of outside of SGE, and I can't find a solution.
The following scripts show how to reproduce the problem:

#!/bin/bash -x
# Usage ./qsub.sh

set -euo pipefail

host_cores=$(qhost | grep med0 | grep lx-amd64 | awk '{ sum += $3 } END { print sum }')
job_cores=2
num_jobs=$(( host_cores / 2 ))

logdir=sge_log/$(date +%Y-%m-%d__%H-%M-%S)
mkdir -p "$logdir"
qsub -o "$logdir" -v JOB_CORES=${job_cores} -t 1-${num_jobs} array_stress.sh

#!/bin/bash -x
#$ -S /bin/bash
#$ -o sge_log
#$ -cwd
#$ -pe smp 2
#$ -l h_vmem=2.75G,h_rt=02:00:00
#$ -j y

set -xeuo pipefail

echo "Beginning..."
hostname
date

stress \
    --verbose \
    --cpu ${JOB_CORES} \
    --vm ${JOB_CORES} \
    --vm-bytes $(( 2 * 1024 * 1024 * 1024 )) \
    -t 600

echo "Done."
hostname
date

Every time I run them, these two scripts produce this error in some of the subjobs:

error reason ###: can't get password entry for user "jjimene". Either user does not exist or error with NIS/LDAP etc.

All my AD configuration is correct. The firewall is not blocking anything to AD. A test join to ADS works fine. I have done authconfig, etc. and reset the sssd daemon. The OS is CentOS 7.1.

What's going on?

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Sunday, April 09, 2017 12:28
To: Orion Poplawski; sge-disc...@liverpool.ac.uk
Subject: Re: [SGE-discuss] Kerberos authentication

I have a lot of problems with AD, Kerberos, SSSD, LDAP and GridEngine, but I think they are related to the fact that I connect to AD servers that do not synchronize with the master quickly enough. Once in a while I have to clear the SSSD cache and restart the SSSD services on all the nodes, and until they manage to repopulate, qrsh refuses to open a new shell unless I point it to a node that I know is working.

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of Orion Poplawski [or...@cora.nwra.com]
Sent: Thursday, April 06, 2017 22:46
To: sge-disc...@liverpool.ac.uk
Subject: [SGE-discuss] Kerberos authentication

I've built the gss utils with 'aimk -gss' and am testing with security_mode set to kerberos. In my first attempt I tried to make use of gssproxy to store the sge/qmaster principal, but unfortunately it appears that gssproxy is too old on EL7 to handle storing the delegated credential for us:

put_cred stderr: GSS-API error copying delegated creds to ccache: The operation or option is not available or unsuppo

My next attempt was to set:

KRB5_KTNAME=FILE:/var/spool/gridengine/sge.keytab

in the environment of the daemons and store the sge/host principals there. This avoids needing to run qmaster as root to access /etc/krb5.keytab. It needs an sge service principal for the qmaster and for each of the exec hosts, which seems appropriate.

Another issue I ran into is that I'm running in an IPA/Active Directory trust setup where the users are stored in the AD domain and the hosts are in the IPA domain. Therefore the code in gsslib_put_credentials that was using gss_compare_name() to compare users ended up comparing "orion" to "or...@ad.nwra.com". I changed that to also try using gss_localname() to convert the client principal to a local username and compare that. Also, the later code that called krb5_kuserok() segfaulted because it was erroneously casting gss_name_t to krb5_principal. I've started work on changing that to do the conversion properly, but as of now that is untested. There are also a bunch of memory leaks in this code that probably should be cleaned up, although at the moment this is all run in short-lived executables.
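For the KRB5_KTNAME approach described above, a quick sanity check that the daemon user can actually read the keytab might look like the sketch below. The helper names keytab_path and check_keytab are hypothetical and not part of Grid Engine; the sketch only assumes the FILE: prefix form shown in the message.

```shell
#!/bin/bash
# Sketch: sanity-check the keytab named by a KRB5_KTNAME value.
# keytab_path and check_keytab are hypothetical helpers, not Grid Engine tools.

# Strip an optional "FILE:" type prefix from a KRB5_KTNAME value.
keytab_path() {
    local ktname=$1
    echo "${ktname#FILE:}"
}

# Return 0 if the keytab exists and is readable by the current user.
check_keytab() {
    local path
    path=$(keytab_path "$1")
    if [ ! -r "$path" ]; then
        echo "keytab not readable: $path" >&2
        return 1
    fi
    echo "keytab ok: $path"
}
```

Run as the user the daemon runs as, something like check_keytab "FILE:/var/spool/gridengine/sge.keytab" followed by klist -k on the resulting path should show whether the sge/host principals are visible to that user.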
Finally, I needed to tweak my peopen() patch to run put_cred and delete_cred as root on the exec hosts since they need to change the ownership and remove files of the user running the job.

At least for a simple test case, this appears to be working now for me, so I'm fairly pleased. Next issue I expect to face is renewing and expiring user credentials for long running jobs.

--
Orion Poplawski
Technical Manager                     720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                    or...@nwra.com
Boulder, CO 80301                     http://www.nwra.com
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
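One way to narrow down the "can't get password entry" error reported at the top of this thread is to check whether user lookups fail intermittently on an exec host, independent of SGE. The sketch below assumes the error comes from a transient name-service (SSSD/AD) lookup failure, which the thread does not confirm; count_lookup_failures is a hypothetical helper, not an SGE or SSSD tool.

```shell
#!/bin/bash
# Sketch: count how many of N consecutive passwd lookups fail for a user.
# count_lookup_failures is a hypothetical helper, not an SGE or SSSD tool.
# Usage: count_lookup_failures USER [ATTEMPTS]
count_lookup_failures() {
    local user=$1
    local attempts=${2:-20}
    local failures=0
    local i
    for (( i = 0; i < attempts; i++ )); do
        # id resolves the name through NSS (and therefore sssd, if configured).
        if ! id -u "$user" > /dev/null 2>&1; then
            failures=$(( failures + 1 ))
        fi
    done
    echo "$failures"
}
```

Running this on an exec host while the stress array job is active, with the affected username and a larger attempt count, would give a rough signal: a non-zero result would point at the name service under load rather than at SGE itself.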