Hi Reuti,

I think the process tree looks fine:

69847 ?        S      0:00  \_ sge_shepherd-15272 -bg
 69848 ?        SNs    0:00      \_ /bin/sh 
/opt/sge/default/spool/perf-hpc04/job_scripts/15272
 69850 ?        DN     0:01          \_ mnc2nii -short -nii 
/NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_
______________


With the original PID of the shepherd (69847), I get:

perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69847"

Dec 19 11:23:48 perf-hpc04 kernel: [251479.437471] audit: type=1400 
audit(1482164628.332:69847): apparmor="ALLOWED" operation="open" 
profile="/usr/sbin/sssd//null-/usr/sbin/adcli" 
name="/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2" pid=31820 comm="adcli" 
requested_mask="r" denied_mask="r" fsuid=0 ouid=0
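One caveat about that kern.log hit: the grep for "69847" matched the audit event serial (`audit(1482164628.332:69847)`), not a process ID; the pid on that line is actually 31820. A sketch of a tighter search anchors on the `pid=` field instead, illustrated here on a sample line like the one above:

```shell
# The earlier grep "69847" matched the audit *serial* ("...:69847");
# the process id on that line is pid=31820. Extracting the pid= field
# makes the distinction visible:
line='audit(1482164628.332:69847): apparmor="ALLOWED" pid=31820 comm="adcli"'
echo "$line" | grep -o 'pid=[0-9]*'   # prints pid=31820
```

So a search such as `sudo grep 'pid=69847' /var/log/kern.log` would avoid this kind of false positive.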


Are we getting somewhere?

Thanks so much!
Thomas
__________________________
From: Reuti <re...@staff.uni-marburg.de>
Sent: Wednesday, December 21, 2016 2:38 PM
To: Thomas Beaudry
Cc: Hanby, Mike; sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Am 21.12.2016 um 18:41 schrieb Thomas Beaudry:

> Hi Hanby,
>
> Yes, I've checked before - no need to excuse yourself; any suggestion is 
> helpful because I'm really stumped on finding a solution. This is what I've 
> tried on a machine that has a job in the 'r' state:
>
> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"

How is "perform-admin" related to the job? The job should be a child of the 
sge_shepherd. You can get a tree of all processes with:

$ ps -e f

(that's f without a leading -). The relevant SGE processes are the sge_execd 
and the sge_shepherds it spawns for each job started thereon, plus their 
children.
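Once a shepherd's PID is known, its children can also be listed by walking /proc. A minimal sketch, Linux-only; `children` is a hypothetical helper name, not an SGE tool:

```shell
# List the direct children of a given PID by scanning /proc/<pid>/status
# for matching PPid fields (Linux-only sketch).
children() {
  parent=$1
  for status in /proc/[0-9]*/status; do
    pid=${status#/proc/}
    pid=${pid%/status}
    # PPid: line holds the parent PID; entry may vanish mid-scan, hence 2>/dev/null
    ppid=$(awk '/^PPid:/ {print $2}' "$status" 2>/dev/null)
    if [ "$ppid" = "$parent" ]; then
      echo "$pid"
    fi
  done
  return 0
}

# e.g. children "$(pgrep -x sge_execd)" would list that execd's shepherds
```

This gives one level at a time, whereas `ps -e f` shows the whole forest at once.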

-- Reuti


> perform+  69850  0.0  0.0  73656 56664 ?        DN   11:45   0:01 mnc2nii 
> -short -nii 
> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii
>  
> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
>
> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
>
> I'm not finding anything helpful.
>
> Thanks so much guys!
> Thomas
> ________________________________________
> From: Hanby, Mike <mha...@uab.edu>
> Sent: Wednesday, December 21, 2016 12:34 PM
> To: Thomas Beaudry; Reuti
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> Please excuse if you've already checked this, but are you sure that all job 
> related processes have terminated on the compute nodes?
>
> Just a thought.
>
> -----Original Message-----
> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas 
> Beaudry <thomas.beau...@concordia.ca>
> Date: Wednesday, December 21, 2016 at 11:58
> To: Reuti <re...@staff.uni-marburg.de>
> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
>    Hi Reuti,
>
>    Setting the loglevel to log_info didn't add any additional warnings to my 
> spool messages file.
>
>    Any other ideas as to what I can do?
>
>    Thanks!
>    Thomas
>    ________________________________________
>    From: Reuti <re...@staff.uni-marburg.de>
>    Sent: Wednesday, December 21, 2016 5:48 AM
>    To: Thomas Beaudry
>    Cc: sge-discuss@liv.ac.uk
>    Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
>> Am 21.12.2016 um 04:03 schrieb Thomas Beaudry <thomas.beau...@concordia.ca>:
>>
>> Hi Reuti,
>>
>> It is:   loglevel                     log_warning
>
>    Please set it to log_info; then you will get more output in the messages 
> file (`man sge_conf`). Maybe you will get some hints then.
>
>
>> In case it helps, here is the full output:
>>
>> #global:
>> execd_spool_dir              /opt/sge/default/spool
>
>    This can be set so that the spool directories are local, to save some 
> network traffic. My favorite place is /var/spool/sge, owned by the account 
> that owns SGE (for me sgeadmin:gridware).
>
>    - Create the local spool directories
>    - Adjust the setting in the configuration to read /var/spool/sge
>    - Shut down the execd's
>    - Start the execd's
>
>    This will then automatically create the "nodeXY" subdirectory therein on 
> each exechost.
>
>    https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
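The relocation steps quoted above might look roughly like this on a node (a sketch only: it assumes the sgeadmin:gridware ownership from the example, and the init-script name and path vary by installation):

```shell
# Sketch of relocating execd spooling locally (assumed account/paths;
# adapt to your site).
sudo mkdir -p /var/spool/sge
sudo chown sgeadmin:gridware /var/spool/sge

# Point execd_spool_dir at the local directory in the cluster configuration:
qconf -mconf            # edit: execd_spool_dir  /var/spool/sge

# Restart the execd on each node so it recreates its spool subdirectory:
sudo /etc/init.d/sgeexecd stop     # hypothetical init-script location
sudo /etc/init.d/sgeexecd start
```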
>
>    -- Reuti
>
>
>> mailer                       /bin/mail
>> xterm                        /usr/bin/xterm
>> load_sensor                  none
>> prolog                       none
>> epilog                       none
>> shell_start_mode             posix_compliant
>> login_shells                 sh,bash,ksh,csh,tcsh
>> min_uid                      100
>> min_gid                      100
>> user_lists                   none
>> xuser_lists                  none
>> projects                     none
>> xprojects                    none
>> enforce_project              false
>> enforce_user                 auto
>> load_report_time             00:00:40
>> max_unheard                  00:05:00
>> reschedule_unknown           00:00:00
>> loglevel                     log_warning
>> administrator_mail           thomas.beau...@concordia.ca
>> set_token_cmd                none
>> pag_cmd                      none
>> token_extend_time            none
>> shepherd_cmd                 none
>> qmaster_params               none
>> execd_params                 none
>> reporting_params             accounting=true reporting=false \
>>                            flush_time=00:00:15 joblog=false sharelog=00:00:00
>> finished_jobs                100
>> gid_range                    20000-20100
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> max_aj_instances             2000
>> max_aj_tasks                 75000
>> max_u_jobs                   0
>> max_jobs                     0
>> max_advance_reservations     0
>> auto_user_oticket            0
>> auto_user_fshare             0
>> auto_user_default_project    none
>> auto_user_delete_time        86400
>> delegated_file_staging       false
>> reprioritize                 0
>> jsv_url                      none
>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>
>> Thanks!
>> Thomas
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Tuesday, December 20, 2016 5:35 PM
>> To: Thomas Beaudry
>> Cc: sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> Am 20.12.2016 um 22:37 schrieb Thomas Beaudry:
>>
>>> Hi Reuti,
>>>
>>> The jobs stay in the queue forever - and don't get processed.  There are no 
>>> messages in the spool directory for these jobs.
>>
>> The "r" state comes after the "t" state. With NFS problems, jobs are often 
>> stuck in the "t" state. What is your setting of:
>>
>> $ qconf -sconf
>> ...
>> loglevel                     log_info
>>
>> -- Reuti
>>
>>>
>>> Thomas
>>> ________________________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>> To: Thomas Beaudry
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> Hi,
>>>
>>> Am 20.12.2016 um 22:20 schrieb Thomas Beaudry:
>>>
>>>> Hi,
>>>>
>>>> I've run into a problem recently where users' jobs are stuck in the 'r' 
>>>> state.  It doesn't always happen, but it happens often enough to be a 
>>>> persistent error. My guess is that it is I/O related (the jobs access an 
>>>> NFS 4.1 share on a Windows 2012 file server).  I really don't know how 
>>>> to debug this, since I'm not getting any useful info from 
>>>> qstat -j <jobid> and the /var/log/* logs don't seem to give me any clues 
>>>> - or maybe I'm missing something.
>>>>
>>>> I would be very grateful if anyone has any suggestions as to where I can 
>>>> start debugging this issue.  My cluster is unusable because of this error.
>>>
>>> You mean the job has already exited but is not removed from `qstat`? 
>>> Usually there is a delay of a few minutes for parallel jobs.
>>>
>>> What does the messages file in the spool directory of the nodes say? 
>>> Unless spooling is local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks,
>>>> Thomas
>>>> _______________________________________________
>>>> SGE-discuss mailing list
>>>> SGE-discuss@liv.ac.uk
>>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
>>>
>>>
>>
>>
>
>
>
>

