In addition, if I then run qmod -cj xxx,xxx, etc., on the stalled jobs, they run
fine. And if the user throttles his job submission, all of the jobs run fine.
Why would that make a difference?
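
(To be concrete: the job IDs below are placeholders, and by "throttling" I just
mean the user pausing between qsub calls instead of submitting back-to-back, as
in the loop sketched in my original message below.)

# Clear the error state on the stalled jobs; after that they run and
# finish without complaint (job IDs are placeholders):
qmod -cj 2084885,2084886

# Throttled submission, roughly:
#   qsub ... ; sleep 1    # short pause after each submission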

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of 
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Wednesday, February 01, 2017 23:56
To: sge-discuss@liv.ac.uk
Subject: [SGE-discuss] qsub permission denied

Hi Folks,

New to the list! I am the sysadmin of an HPC cluster running SGE 8.1.8. The
cluster has 100+ nodes running CentOS 7, with a shared DDN storage cluster
configured as a GPFS device and a number of NFS mounts to a CentOS 7 server.

Some of my users are reporting problems with qsub that have turned out to be
very difficult to pin down. One user can replicate the problem reliably with a
submission script that creates some 470+ jobs, submitted as fast as qsub will
accept them. Every time he runs it, a number of the jobs end up stalled in Eqw
state because they can't stat() the folder on the GPFS file system that has
been designated to hold the logs for each job. No log is created for those jobs
because SGE apparently thinks it can't get to the shared folder.
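
His reproducer boils down to a loop like the sketch below. This is my
reconstruction, not his actual script: the merge, resource, PE and output
settings are taken from the qstat -j output further down, and the log directory
is the one the failing jobs complain about.

#!/bin/bash
# Reconstruction of the reproducer: submit ~470 jobs back-to-back, all of
# them writing their logs into the same directory on the GPFS file system.
LOGDIR=/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log
for i in $(seq 1 470); do
    qsub -j y -l h_vmem=100M -pe smp 1 -o "$LOGDIR" run_GuessEncoding.sh
done

Nothing about the loop itself looks unusual; it simply hands qmaster the
submissions as fast as qsub returns.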

This is one example of those stalled jobs (irrelevant info removed). The /fast
file system is the GPFS storage subsystem:

exec_file:                  job_scripts/2084885
submission_time:            Tue Jan 31 14:55:16 2017
owner:                      xxxxx
uid:                        22933
group:                      xxxxx
gid:                        22933
sge_o_home:                 /fast/users/xxxxx
sge_o_log_name:             xxxxx
sge_o_shell:                /bin/bash
sge_o_workdir:              /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
sge_o_host:                 med02xx
account:                    sge
cwd:                        /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
merge:                      y
hard resource_list:         h_vmem=100M
stdout_path_list:           NONE:NONE:/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log
jobshare:                   0
env_list:                   TERM=NONE
script_file:                run_GuessEncoding.sh
parallel environment:       smp range: 1
binding:                    NONE
job_type:                   NONE
error reason          1:    01/31/2017 14:55:26 [22933:24246]: can't stat() "/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log" as stdout_path: Permission denied KRB5CCNAME=none uid=22933 gid=22933 20032 22933
scheduling info:            (Collecting of scheduler job information is turned off)
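
As far as I can tell, that stat() happens on the execution host when SGE sets
up the job's stdout, not at qsub time. The by-hand equivalent of the failing
check would be something like the line below (host and username are the
redacted ones from the output above, standing in for the real values):

# On the execution host (med02xx), as the job owner, stat the exact path
# the error reason complains about:
sudo -u xxxxx stat /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log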

Has anybody seen this before? Any insights? I've seen this issue mentioned a
few times on the Internet, but I've never seen a resolution. I thought it had
to do with LDAP access authorization, but it doesn't seem to be related to
that. My cluster gets access authorization from a campus-wide Active Directory
that is designed for very high performance (mine is not the only HPC cluster on
campus, and there are a ton of desktops as well).

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss