Today I did some more testing, and the problem appears to be specific to GPFS.
I changed the script to put the logs in a folder on an NFS share, and *without*
the throttling there are no errors.
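
Concretely, the only change was the qsub stdout target, roughly like this
(the NFS path is a placeholder, not our real mount):

    # same submission loop, but logs on NFS instead of GPFS
    qsub -j y -o /nfs/scratch/sge_log run_GuessEncoding.sh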

Juan

On 02/02/2017, 00:23, "SGE-discuss on behalf of 
juanesteban.jime...@mdc-berlin.de" <sge-discuss-boun...@liverpool.ac.uk on 
behalf of juanesteban.jime...@mdc-berlin.de> wrote:

    ...and even stranger, if I throttle the qsub calls by putting a usleep 50
    in the loop, everything works fine.
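
    Roughly, the throttled loop looks like this (a sketch, not the user's
    exact script; $LOGDIR stands for the sge_log directory on GPFS):

        for i in $(seq 1 470); do
            qsub -j y -o "$LOGDIR" run_GuessEncoding.sh
            usleep 50   # 50 microseconds between submissions avoids the errors
        done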
    
    Kind regards,
    Juan Jimenez
    System Administrator, HPC
    MDC Berlin / IT-Dept.
    Tel.: +49 30 9406 2800
    
    
    ________________________________________
    From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of 
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
    Sent: Thursday, February 02, 2017 00:19
    To: sge-discuss@liv.ac.uk
    Subject: Re: [SGE-discuss] qsub permission denied
    
    In addition, if I then qmod -cj xxx,xxx, etc., the jobs run fine. If the
    user throttles the job submission, then all jobs run fine. ??
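
    To clear the stuck jobs in bulk, something like this works (assuming the
    default qstat column layout, where the state is column 5):

        # clear the error state on every job currently in Eqw
        qmod -cj $(qstat | awk '$5 == "Eqw" {print $1}' | paste -sd, -)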
    
    Kind regards,
    Juan Jimenez
    System Administrator, HPC
    MDC Berlin / IT-Dept.
    Tel.: +49 30 9406 2800
    
    
    ________________________________________
    From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of 
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
    Sent: Wednesday, February 01, 2017 23:56
    To: sge-discuss@liv.ac.uk
    Subject: [SGE-discuss] qsub permission denied
    
    Hi Folks,
    
    New to the list! I am the sysadmin of an HPC cluster running SGE 8.1.8.
    The cluster has 100+ nodes running CentOS 7, with a shared DDN storage
    cluster configured as a GPFS device and a number of NFS mounts to a
    CentOS 7 server.
    
    Some of my users are reporting problems with qsub that have turned out
    to be very difficult to pin down. One user can replicate the problem
    reliably with a job submission script that submits 470+ jobs as fast as
    qsub will accept them. Every time he runs it, a number of the jobs stall
    in Eqw state because they can't stat() the directory on the GPFS file
    system designated to hold each job's logs. No log is created for these
    jobs because SGE apparently decides it can't reach the shared folder.
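
    The reproduction boils down to a tight loop along these lines (a sketch
    reconstructed from the job details below, not the user's actual script):

        #!/bin/bash
        # submit ~470 jobs back-to-back, stdout merged and written to GPFS
        LOGDIR=/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log
        for i in $(seq 1 470); do
            qsub -j y -o "$LOGDIR" -l h_vmem=100M -pe smp 1 run_GuessEncoding.sh
        done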
    
    This is one example of those stalled jobs (irrelevant info removed); the
    /fast file system is the GPFS storage subsystem:
    
    exec_file:                  job_scripts/2084885
    submission_time:            Tue Jan 31 14:55:16 2017
    owner:                      xxxxx
    uid:                        22933
    group:                      xxxxx
    gid:                        22933
    sge_o_home:                 /fast/users/xxxxx
    sge_o_log_name:             xxxxx
    sge_o_shell:                /bin/bash
    sge_o_workdir:              /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
    sge_o_host:                 med02xx
    account:                    sge
    cwd:                        /fast/groups/cubi/projects/2017-01-13_ddn_sge_problems
    merge:                      y
    hard resource_list:         h_vmem=100M
    stdout_path_list:           NONE:NONE:/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log
    jobshare:                   0
    env_list:                   TERM=NONE
    script_file:                run_GuessEncoding.sh
    parallel environment:  smp range: 1
    binding:                    NONE
    job_type:                   NONE
    error reason          1:      01/31/2017 14:55:26 [22933:24246]: can't stat() "/fast/groups/cubi/projects/2017-01-13_ddn_sge_problems/sge_log" as stdout_path: Permission denied KRB5CCNAME=none uid=22933 gid=22933 20032 22933
    scheduling info:            (Collecting of scheduler job information is turned off)
    
    Anybody seen this before? Any insights? I've seen this issue mentioned a
    few times on the Internet, but I've never seen a resolution. I thought it
    had to do with LDAP access authorization, but it doesn't seem to be
    related to that. My cluster gets access authorization from a campus-wide
    Active Directory that is designed for very high performance (mine is not
    the only HPC cluster on campus, and there are a ton of desktops as well).
    
    Kind regards,
    Juan Jimenez
    System Administrator, HPC
    MDC Berlin / IT-Dept.
    Tel.: +49 30 9406 2800
    

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
