On 21.12.2016 at 18:41, Thomas Beaudry wrote:

> Hi Hanby,
> 
> Yes, I've checked this before - and no need to excuse yourself; any 
> suggestion is helpful because I'm really stumped on finding a solution.  This 
> is what I've tried on a machine that has a job in the 'r' state:
> 
> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"

How is "perform-admin" related to the job? The job should be a child of the 
sge_shepherd. A tree of all processes can be obtained with:

$ ps -e f

(note: `f` without a leading `-`). The relevant SGE processes are the 
sge_execd and the sge_shepherd it spawns for each started job, plus their 
children.

-- Reuti


> perform+  69850  0.0  0.0  73656 56664 ?        DN   11:45   0:01 mnc2nii 
> -short -nii 
> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii
>  
> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
> 
> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
> 
> I'm not finding anything helpful.
> 
> Thanks so much guys!
> Thomas
> ________________________________________
> From: Hanby, Mike <mha...@uab.edu>
> Sent: Wednesday, December 21, 2016 12:34 PM
> To: Thomas Beaudry; Reuti
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
> 
> Please excuse me if you've already checked this, but are you sure that all 
> job-related processes have terminated on the compute nodes?
> 
> Just a thought.
> 
> -----Original Message-----
> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas 
> Beaudry <thomas.beau...@concordia.ca>
> Date: Wednesday, December 21, 2016 at 11:58
> To: Reuti <re...@staff.uni-marburg.de>
> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
> 
>    Hi Reuti,
> 
>    Setting the loglevel to log_info didn't add any additional warnings to my 
> spool messages file.
> 
>    Any other ideas as to what I can do?
> 
>    Thanks!
>    Thomas
>    ________________________________________
>    From: Reuti <re...@staff.uni-marburg.de>
>    Sent: Wednesday, December 21, 2016 5:48 AM
>    To: Thomas Beaudry
>    Cc: sge-discuss@liv.ac.uk
>    Subject: Re: [SGE-discuss] jobs stuck in 'r' state
> 
>> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>> 
>> Hi Reuti,
>> 
>> It is:   loglevel                     log_warning
> 
>    Please set it to log_info; then you will get more output in the messages 
> file (`man sge_conf`). Maybe that will give you some hints.
> 
> 
>> In case it helps, here is the full output:
>> 
>> #global:
>> execd_spool_dir              /opt/sge/default/spool
> 
>    This can be set to have the spool directories local, to save some network 
> traffic. My favorite place is /var/spool/sge, which is owned by the account 
> owning SGE (for me sgeadmin:gridware).
> 
>    - Create the local spool directories
>    - Adjust the setting in the configuration to read /var/spool/sge
>    - Shut down the execd's
>    - Start the execd's
> 
>    This will then automatically create the subdirectory "nodeXY" therein on 
> each exec host.
> 
>    https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
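>    On each exec host, the steps above might look like the following sketch 
> (ownership and the init script name are assumptions; adapt them to your 
> installation):

```shell
# create the local spool directory, owned by the account owning SGE:
mkdir -p /var/spool/sge
chown sgeadmin:gridware /var/spool/sge

# in `qconf -mconf`, adjust:
#   execd_spool_dir   /var/spool/sge

# then restart the execd (init script name varies per installation):
/etc/init.d/sgeexecd stop
/etc/init.d/sgeexecd start
```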
> 
>    -- Reuti
> 
> 
>> mailer                       /bin/mail
>> xterm                        /usr/bin/xterm
>> load_sensor                  none
>> prolog                       none
>> epilog                       none
>> shell_start_mode             posix_compliant
>> login_shells                 sh,bash,ksh,csh,tcsh
>> min_uid                      100
>> min_gid                      100
>> user_lists                   none
>> xuser_lists                  none
>> projects                     none
>> xprojects                    none
>> enforce_project              false
>> enforce_user                 auto
>> load_report_time             00:00:40
>> max_unheard                  00:05:00
>> reschedule_unknown           00:00:00
>> loglevel                     log_warning
>> administrator_mail           thomas.beau...@concordia.ca
>> set_token_cmd                none
>> pag_cmd                      none
>> token_extend_time            none
>> shepherd_cmd                 none
>> qmaster_params               none
>> execd_params                 none
>> reporting_params             accounting=true reporting=false \
>>                            flush_time=00:00:15 joblog=false sharelog=00:00:00
>> finished_jobs                100
>> gid_range                    20000-20100
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> max_aj_instances             2000
>> max_aj_tasks                 75000
>> max_u_jobs                   0
>> max_jobs                     0
>> max_advance_reservations     0
>> auto_user_oticket            0
>> auto_user_fshare             0
>> auto_user_default_project    none
>> auto_user_delete_time        86400
>> delegated_file_staging       false
>> reprioritize                 0
>> jsv_url                      none
>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>> 
>> Thanks!
>> Thomas
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Tuesday, December 20, 2016 5:35 PM
>> To: Thomas Beaudry
>> Cc: sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>> 
>> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>> 
>>> Hi Reuti,
>>> 
>>> The jobs stay in the queue forever - and don't get processed.  There are no 
>>> messages in the spool directory for these jobs.
>> 
>> The "r" state comes after the "t" state. With NFS problems, jobs are 
>> often stuck in the "t" state. What is your setting of:
>> 
>> $ qconf -sconf
>> ...
>> loglevel                     log_info
>> 
>> -- Reuti
>> 
>>> 
>>> Thomas
>>> ________________________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>> To: Thomas Beaudry
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>> 
>>> Hi,
>>> 
>>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I've run into a problem recently where users' jobs are stuck in the 'r' 
>>>> state.  It doesn't always happen, but it's happening often enough to be a 
>>>> persistent error. My guess is that it is I/O related (the jobs are 
>>>> accessing an NFS 4.1 share off of a Windows 2012 file server).  I really 
>>>> don't know how to debug this since I'm not getting any useful info from 
>>>> qstat -j <jobid>, and the /var/log/* logs don't seem to give me any clues 
>>>> - or maybe I'm missing something.
>>>> 
>>>> I would be very grateful if anyone has any suggestions as to where I can 
>>>> start to debug this issue.  My cluster is unusable because of this error.
>>> 
>>> You mean the job has already exited and is not removed from `qstat`? Usually 
>>> there is a delay of some minutes for parallel jobs.
>>> 
>>> What does the messages file in the spool directory of the nodes say? Unless 
>>> it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Thanks,
>>>> Thomas
>>>> _______________________________________________
>>>> SGE-discuss mailing list
>>>> SGE-discuss@liv.ac.uk
>>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
