Juan,

Can you share the IBM ticket information? Email me privately if you want.
We have recently been looking at a similar problem with Grid Engine and GPFS, and I sent out a query on the grid-engine mailing list last week. In our case we don't qsub lots of jobs, but instead see the problem with large array jobs (-t 1-100000) starting on a near-idle system. When 2000+ array tasks all start at once across 300+ nodes, we see GPFS start to expel nodes and jobs go into E state. Once the initial spike is over, things seem to return to normal. The jobs do still tend to all run about the same length of time, but there is enough spread of start times that the second round of jobs usually works okay.

I see a couple of solutions:

- I can modulate the jobs started with -tc, but need to manually increase the -tc value with qalter over a period of time (seems to work, but is awkward to explain to a user)
- Get the user to put the log files on /tmp or /dev/null (suggested but not yet tested)
- Write a prolog to put a random delay in it (this might still have problems when the log files all get created at the same time)
- Only start a defined number of tasks on each scheduling step (probably requires a change to GE)
- Start all tasks with a small sequential delay (probably requires a change to GE)

Thanks,
Stuart Barkley

On Thu, 9 Feb 2017 at 02:47 -0000, juanesteban.jime...@mdc-berlin.de wrote:

> Date: Thu, 9 Feb 2017 02:47:44
> From: "juanesteban.jime...@mdc-berlin.de" <juanesteban.jime...@mdc-berlin.de>
> To: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] qsub permission denied
>
> I've now confirmed that this is a bug in GPFS. A heavy number of stat() requests for the same folder or file (inode?) from many processes on many nodes causes some of them to fail and return permission denied. A ticket is being opened with IBM.
>
> The workaround in our case is to use an NFS share instead, or put a minimum delay "usleep 5" or longer between qsub calls (our error happens with qsub, but I can also replicate it with a C program that only makes stat() calls to a single folder).
>
> Mfg,
> Juan Jimenez
> System Administrator, HPC
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
>
> ________________________________________
> From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
> Sent: Thursday, February 02, 2017 00:23
> To: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] qsub permission denied
>
> ...and even stranger, if I throttle the qsubs by putting a usleep 50 in the loop, everything works fine.
>
> Mfg,
> Juan Jimenez
> System Administrator, HPC
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
>
> ________________________________________
> From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
> Sent: Thursday, February 02, 2017 00:19
> To: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] qsub permission denied
>
> In addition, if I then qmod -cj xxx,xxx, etc., the jobs run fine. If the user throttles the job submission, then all jobs run fine. ??
>
> Mfg,
> Juan Jimenez
> System Administrator, HPC
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
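As a concrete illustration of the throttling workaround described in the quoted mails, a minimal sketch of such a submission loop follows. The job script name, log directory and job count are placeholders rather than details from the thread; only the idea of a small delay between qsub calls comes from the messages above.

    #!/bin/bash
    # Minimal sketch of a throttled submission loop (assumed names throughout).
    # LOGDIR and job.sh are placeholders; the delay mirrors the "usleep 50"
    # workaround mentioned in the thread.
    LOGDIR=/fast/groups/example/sge_log   # placeholder shared GPFS log directory
    NJOBS=470                             # placeholder job count

    for i in $(seq 1 "$NJOBS"); do
        qsub -j y -o "$LOGDIR" ./job.sh "$i"
        usleep 50   # tiny pause between qsub calls; the right value is site-specific
    done

Stuart's first option is similar in spirit but applied after submission: the task concurrency of a running array job can be raised in steps with something like "qalter -tc 500 <jobid>", as described in his list of solutions.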
> ________________________________________
> From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
> Sent: Wednesday, February 01, 2017 23:56
> To: sge-discuss@liv.ac.uk
> Subject: [SGE-discuss] qsub permission denied
>
> Hi Folks,
>
> New to the list! I am the sysadmin of an HPC cluster using SGE 8.1.8. The cluster has 100+ nodes running CentOS 7, with a shared DDN storage cluster configured as a GPFS device and a number of NFS mounts to a CentOS 7 server.
>
> Some of my users are reporting problems with qsub that have turned out to be very difficult to pin down. One user was able to replicate the problem reliably by creating a job submission script that creates some 470+ jobs submitted as fast as qsub will handle them. Every time he runs it, a number of the jobs are stalled in Eqw state because they can't stat() the folder on the GPFS file system that has been designated to hold the logs for each job. No log is created for these jobs because qsub apparently thinks it can't get to the shared folder.
>
> This is one example of those stalled jobs (non-relevant info removed). The /fast file system is the GPFS storage subsystem:
>
> exec_file:            job_scripts/2084885
> submission_time:      Tue Jan 31 14:55:16 2017
> owner:                xxxxx
> uid:                  22933
> group:                xxxxx
> gid:                  22933
> sge_o_home:           /fast/users/xxxxx
> sge_o_log_name:       xxxxx
> sge_o_shell:          /bin/bash
> sge_o_workdir:        /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
> sge_o_host:           med02xx
> account:              sge
> cwd:                  /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
> merge:                y
> hard resource_list:   h_vmem=100M
> stdout_path_list:     NONE:NONE:/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log
> jobshare:             0
> env_list:             TERM=NONE
> script_file:          run_GuessEncoding.sh
> parallel environment: smp range: 1
> binding:              NONE
> job_type:             NONE
> error reason 1:       01/31/2017 14:55:26 [22933:24246]: can't stat() "/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log" as stdout_path: Permission denied KRB5CCNAME=none uid=22933 gid=22933 20032 22933
> scheduling info:      (Collecting of scheduler job information is turned off)
>
> Anybody seen this before? Any insights? I've seen this issue mentioned a few times on the Internet, but I've never seen a resolution. I thought it had to do with LDAP access authorization, but it doesn't seem to be related to that. My cluster gets access authorization from a campus-wide Active Directory which is designed for very high performance (mine is not the only HPC cluster on the campus, and there's a ton of desktops as well).
>
> Mfg,
> Juan Jimenez
> System Administrator, HPC
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
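The C reproducer Juan mentions ("a C program that only makes stat() calls to a single folder") is not included in the thread. As a rough shell analogue, with all paths and counts assumed rather than taken from the messages, a burst of concurrent stat() calls against one shared directory could be generated like this, run on several nodes at once to approximate the original load:

    #!/bin/bash
    # Rough analogue of the reproducer described above: many processes
    # stat()ing the same shared directory at once. TARGET and the counts
    # are placeholders, not values from the thread.
    TARGET=/fast/groups/example/sge_log   # placeholder GPFS directory
    NPROCS=200                            # concurrent processes per node
    NCALLS=1000                           # stat calls per process

    for p in $(seq 1 "$NPROCS"); do
        (
            for i in $(seq 1 "$NCALLS"); do
                stat "$TARGET" > /dev/null
            done
        ) &
    done
    wait

If the underlying issue is present, some of the calls should fail with "Permission denied", matching the error reason recorded for the stalled jobs above.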