Re: [SGE-discuss] jobs stuck in 'r' state

Thomas Beaudry Wed, 21 Dec 2016 08:59:18 -0800

Hi Reuti,

Setting the loglevel to log_info didn't add any additional warnings to my spool 
messages file.


Any other ideas as to what I can do?

Thanks!
Thomas
________________________________________
From: Reuti <re...@staff.uni-marburg.de>
Sent: Wednesday, December 21, 2016 5:48 AM
To: Thomas Beaudry
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>
> Hi Reuti,
>
> It is:   loglevel                     log_warning

Please set it to log_info, then you will get more output in the messages file 
(`man sge_conf`). Maybe you get some hints then.
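
To confirm which loglevel is actually in effect after the change, the output 
of `qconf -sconf` can be checked with a few lines of Python. This is only a 
minimal sketch, not part of SGE: the names `parse_sge_conf` and `SAMPLE` are 
made up for illustration, and the sample text is an excerpt of the 
configuration quoted in this thread. It assumes the output is plain 
"name value" lines, with a trailing backslash marking a continued value.

```python
def parse_sge_conf(text):
    """Parse `qconf -sconf`-style text into a dict of parameter -> value."""
    joined = []
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            # A trailing backslash means the value continues on the next line.
            pending += line[:-1].rstrip() + " "
            continue
        joined.append(pending + line)
        pending = ""
    if pending:
        joined.append(pending.strip())

    conf = {}
    for line in joined:
        # Skip blank lines and header comments such as "#global:".
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1] if len(parts) > 1 else ""
    return conf


# Excerpt of the configuration posted in this thread (illustration only).
SAMPLE = """\
#global:
execd_spool_dir              /opt/sge/default/spool
loglevel                     log_warning
reporting_params             accounting=true reporting=false \\
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
"""

if __name__ == "__main__":
    conf = parse_sge_conf(SAMPLE)
    if conf.get("loglevel") != "log_info":
        print("loglevel is", conf.get("loglevel"), "- expected log_info")
```

On a live system the text would come from running `qconf -sconf` (for example 
via `subprocess.run`) instead of the `SAMPLE` string.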


> In case it helps, here is the full output:
>
> #global:
> execd_spool_dir              /opt/sge/default/spool

This can be changed so that the spool directories are local to each node, 
which saves some network traffic. My favorite place is /var/spool/sge, owned 
by the account that owns SGE (for me sgeadmin:gridware).

- Create the local spool directories
- Adjust the setting in the configuration to read /var/spool/sge
- Shut down the execd's
- Start the execd's

Each exechost will then automatically create its own subdirectory "nodeXY" 
therein.

https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
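
Once the spool directory is local, the messages file to inspect moves with 
it. The following Python sketch shows where it should end up and how to pull 
out the lines that mention a given job. The helper names 
`execd_messages_path` and `matching_lines` are hypothetical. The sketch also 
assumes that the per-host subdirectory is named after the unqualified host 
name and that relevant lines contain the job id as plain text. Neither 
assumption is confirmed in this thread.

```python
import posixpath


def execd_messages_path(execd_spool_dir, hostname):
    """Build the expected path of an exechost's messages file."""
    # Assumption: the subdirectory uses the short host name ("nodeXY"),
    # not the fully qualified one.
    short_name = hostname.split(".", 1)[0]
    return posixpath.join(execd_spool_dir, short_name, "messages")


def matching_lines(lines, needle, limit=20):
    """Return at most `limit` of the last lines that contain `needle`."""
    hits = [line.rstrip("\n") for line in lines if needle in line]
    return hits[-limit:] if limit > 0 else []


if __name__ == "__main__":
    path = execd_messages_path("/var/spool/sge", "nodeXY")
    print(path)  # /var/spool/sge/nodeXY/messages
    # On the node itself, the file could then be scanned for a job id:
    # with open(path, errors="replace") as fh:
    #     for line in matching_lines(fh, "<jobid>"):
    #         print(line)
```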

-- Reuti


> mailer                       /bin/mail
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,bash,ksh,csh,tcsh
> min_uid                      100
> min_gid                      100
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           thomas.beau...@concordia.ca
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=false \
>                             flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-20100
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> max_advance_reservations     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> Thanks!
> Thomas
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Tuesday, December 20, 2016 5:35 PM
> To: Thomas Beaudry
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>
>> Hi Reuti,
>>
>> The jobs stay in the queue forever - and don't get processed.  There are no 
>> messages in the spool directory for these jobs.
>
> The "r" state already comes after the "t" state. With NFS problems, jobs are 
> often stuck in the "t" state. What is your setting of:
>
> $ qconf -sconf
> ...
> loglevel                     log_info
>
> -- Reuti
>
>>
>> Thomas
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Tuesday, December 20, 2016 4:25 PM
>> To: Thomas Beaudry
>> Cc: sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> Hi,
>>
>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>
>>> Hi,
>>>
>>> I've run into a problem recently where users' jobs are stuck in the 'r' 
>>> state.  It doesn't always happen, but it's happening often enough to be a 
>>> persistent problem. My guess is that it is I/O related (the jobs are 
>>> accessing an NFS 4.1 share on a Windows 2012 file server).  I really don't 
>>> know how to debug this, since I'm not getting any useful info from qstat -j 
>>> <jobid> and the /var/log/* logs don't seem to give me any clues - or maybe 
>>> I'm missing something.
>>>
>>> I would be very grateful if anyone has any suggestions as to where I can 
>>> start to debug this issue.  My cluster is unusable because of this error.
>>
>> You mean the job exited already and is not removed from `qstat`? Usually 
>> there is a delay of some minutes for parallel jobs.
>>
>> What does the messages file in the spool directory of the nodes say? Unless 
>> the spool directory is local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>
>> -- Reuti
>>
>>
>>> Thanks,
>>> Thomas
>>> _______________________________________________
>>> SGE-discuss mailing list
>>> SGE-discuss@liv.ac.uk
>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
>>
>>
>
>
