Today I did some more testing and the problem appears to be specific to GPFS. I changed the script to put the logs in a folder on an NFS share and *without* the throttling, there are no errors.
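For anyone who wants to repeat the GPFS vs. NFS comparison outside of SGE, here is a minimal sketch. It assumes the failure can be provoked by plain stat() calls made as the affected user on an execution host, which the thread does not confirm. The function name probe_stat and the attempt count are made up for illustration.

```python
import os
import sys


def probe_stat(path, attempts=1000):
    """Call os.stat() on path repeatedly and count failures by exception name."""
    failures = {}
    for _ in range(attempts):
        try:
            os.stat(path)
        except OSError as exc:
            # PermissionError and FileNotFoundError are both OSError subclasses.
            name = type(exc).__name__
            failures[name] = failures.get(name, 0) + 1
    return failures


if __name__ == "__main__":
    # Paths come from the command line, e.g. one log folder on GPFS and one on NFS.
    for path in sys.argv[1:]:
        print(path, probe_stat(path) or "no errors")
```

Running it with both log folders as arguments prints one line per path. An empty result on both would suggest that the error depends on something the probe does not reproduce, such as the way the execution daemon switches to the job owner.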
Juan

On 02/02/2017, 00:23, "SGE-discuss on behalf of juanesteban.jime...@mdc-berlin.de" <sge-discuss-boun...@liverpool.ac.uk on behalf of juanesteban.jime...@mdc-berlin.de> wrote:

...and even stranger, if I throttle the qsubs by putting a usleep 50 in the loop, everything works fine.

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Thursday, February 02, 2017 00:19
To: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] qsub permission denied

In addition, if I then run qmod -cj xxx,xxx, etc., the jobs run fine. If the user throttles the job submission, then all jobs run fine. ??

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

________________________________________
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Wednesday, February 01, 2017 23:56
To: sge-discuss@liv.ac.uk
Subject: [SGE-discuss] qsub permission denied

Hi Folks,

New to the list! I am the sysadmin of an HPC cluster using SGE 8.1.8. The cluster has 100+ nodes running CentOS 7, with a shared DDN storage cluster configured as a GPFS device and a number of NFS mounts to a CentOS 7 server.

Some of my users are reporting problems with qsub that have turned out to be very difficult to pin down. One user was able to replicate the problem reliably with a job submission script that creates some 470+ jobs and submits them as fast as qsub will handle them. Every time he runs it, a number of the jobs stall in Eqw state because they can't stat() the folder on the GPFS file system that has been designated to hold the logs for each job. No log is created for these jobs because qsub apparently thinks it can't get to the shared folder.
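The user's actual submission script is not shown in this thread. As a rough sketch of the kind of loop described above, with an optional pause between submissions, something like the following could be used. The function name submit_all, the delay value and the qsub options are assumptions: the options are reconstructed from the job record (cwd, merge: y, the sge_log output folder) and are not taken from the real script.

```python
import subprocess
import time


def submit_all(commands, delay_s=0.0, run=subprocess.run):
    """Run each submission command in turn and return the list of exit codes.

    delay_s is the pause between submissions (0 means as fast as possible).
    run is injectable so the loop can be tried without a real qsub.
    """
    codes = []
    for i, cmd in enumerate(commands):
        if i and delay_s > 0:
            time.sleep(delay_s)  # throttle: pause before every submission after the first
        codes.append(run(cmd).returncode)
    return codes


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def dry_run(cmd):
        print(" ".join(cmd))
        return subprocess.CompletedProcess(cmd, 0)

    # Hypothetical job list; the real script submitted 470+ jobs.
    jobs = [["qsub", "-cwd", "-j", "y", "-o", "sge_log", "run_GuessEncoding.sh"]] * 3
    submit_all(jobs, delay_s=0.05, run=dry_run)
```

Leaving out the run argument makes the loop call qsub for real. Comparing delay_s=0 with a small positive delay should then show whether the Eqw jobs track the submission rate, as the usleep experiment suggests.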
This is one example of those stalled jobs (irrelevant info removed). The /fast file system is the GPFS storage subsystem:

exec_file:                  job_scripts/2084885
submission_time:            Tue Jan 31 14:55:16 2017
owner:                      xxxxx
uid:                        22933
group:                      xxxxx
gid:                        22933
sge_o_home:                 /fast/users/xxxxx
sge_o_log_name:             xxxxx
sge_o_shell:                /bin/bash
sge_o_workdir:              /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
sge_o_host:                 med02xx
account:                    sge
cwd:                        /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
merge:                      y
hard resource_list:         h_vmem=100M
stdout_path_list:           NONE:NONE:/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log
jobshare:                   0
env_list:                   TERM=NONE
script_file:                run_GuessEncoding.sh
parallel environment:       smp range: 1
binding:                    NONE
job_type:                   NONE
error reason    1:          01/31/2017 14:55:26 [22933:24246]: can't stat() "/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log" as stdout_path: Permission denied KRB5CCNAME=none uid=22933 gid=22933 20032 22933
scheduling info:            (Collecting of scheduler job information is turned off)

Anybody seen this before? Any insights? I've seen this issue mentioned a few times on the Internet, but I've never seen a resolution. I thought it had to do with LDAP access authorization, but it doesn't seem to be related to that. My cluster gets access authorization from a campus-wide Active Directory which is designed for very high performance (mine is not the only HPC cluster on the campus, and there's a ton of desktops as well).

Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss