Hi Hanby,

Yes, I've checked before. No need to excuse yourself; any suggestion is helpful because I am really stumped on finding a solution. This is what I've tried on a machine that has a job in the 'r' state:
perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"
perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii -short -nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"

I'm not finding anything helpful.

Thanks so much, guys!
Thomas
________________________________________
From: Hanby, Mike <mha...@uab.edu>
Sent: Wednesday, December 21, 2016 12:34 PM
To: Thomas Beaudry; Reuti
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Please excuse me if you've already checked this, but are you sure that all job-related processes have terminated on the compute nodes? Just a thought.

-----Original Message-----
From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas Beaudry <thomas.beau...@concordia.ca>
Date: Wednesday, December 21, 2016 at 11:58
To: Reuti <re...@staff.uni-marburg.de>
Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Hi Reuti,

Setting the loglevel to log_info didn't add any additional warnings to my spool messages file. Any other ideas as to what I can do?

Thanks!
Thomas
________________________________________
From: Reuti <re...@staff.uni-marburg.de>
Sent: Wednesday, December 21, 2016 5:48 AM
To: Thomas Beaudry
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>
> Hi Reuti,
>
> It is: loglevel log_warning

Please set it to log_info, then you will get more output in the messages file (`man sge_conf`). Maybe you get some hints then.
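[Editor's note: the DN state in the ps output above means uninterruptible sleep (D) at reduced priority (N), which fits an NFS hang. A minimal sketch for seeing what such a process is blocked on, assuming a Linux /proc filesystem; set STUCK_PID to the stuck PID (e.g. 69850 above), otherwise it inspects the current shell so the sketch runs anywhere:]

```shell
# Show the state and kernel wait channel of a possibly-stuck process.
# STUCK_PID is a hypothetical variable; default to our own shell's PID.
pid="${STUCK_PID:-$$}"

# Field 3 of /proc/<pid>/stat is the state: D = uninterruptible sleep
# (typical of a task blocked on NFS I/O), S = sleeping, R = running.
# (Naive parse: assumes the command name contains no spaces.)
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state: $state"

# The kernel symbol the task sleeps in; NFS hangs usually show an
# nfs_* or rpc_* function here. Prints 0 when the task is runnable.
wchan=$(cat "/proc/$pid/wchan")
echo "wchan: $wchan"

# With root, /proc/<pid>/stack gives the full kernel stack trace.
```

A task stuck in D cannot be killed; the wchan/stack symbols tell you which kernel path (e.g. an NFS RPC wait) it is parked in.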
> In case it helps, here is the full output:
>
> #global:
> execd_spool_dir /opt/sge/default/spool

This can be set to have the spool directories local, to save some network traffic. My favorite place is /var/spool/sge, which is owned by the account owning SGE (for me sgeadmin:gridware).

- Create the local spool directories
- Adjust the setting in the configuration to read /var/spool/sge
- Shut down the execd's
- Start the execd's

This will then create the subdirectory "nodeXY" therein automatically on each exec host.

https://arc.liv.ac.uk/SGE/howto/nfsreduce.html

-- Reuti

> mailer /bin/mail
> xterm /usr/bin/xterm
> load_sensor none
> prolog none
> epilog none
> shell_start_mode posix_compliant
> login_shells sh,bash,ksh,csh,tcsh
> min_uid 100
> min_gid 100
> user_lists none
> xuser_lists none
> projects none
> xprojects none
> enforce_project false
> enforce_user auto
> load_report_time 00:00:40
> max_unheard 00:05:00
> reschedule_unknown 00:00:00
> loglevel log_warning
> administrator_mail thomas.beau...@concordia.ca
> set_token_cmd none
> pag_cmd none
> token_extend_time none
> shepherd_cmd none
> qmaster_params none
> execd_params none
> reporting_params accounting=true reporting=false \
>                  flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs 100
> gid_range 20000-20100
> qlogin_command builtin
> qlogin_daemon builtin
> rlogin_command builtin
> rlogin_daemon builtin
> rsh_command builtin
> rsh_daemon builtin
> max_aj_instances 2000
> max_aj_tasks 75000
> max_u_jobs 0
> max_jobs 0
> max_advance_reservations 0
> auto_user_oticket 0
> auto_user_fshare 0
> auto_user_default_project none
> auto_user_delete_time 86400
> delegated_file_staging false
> reprioritize 0
> jsv_url none
> jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
>
> Thanks!
> Thomas
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Tuesday, December 20, 2016 5:35 PM
> To: Thomas Beaudry
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>
>> Hi Reuti,
>>
>> The jobs stay in the queue forever and don't get processed. There are no messages in the spool directory for these jobs.
>
> The "r" state comes after the "t" state. With NFS problems they are often stuck in the "t" state. What is your setting of:
>
> $ qconf -sconf
> ...
> loglevel log_info
>
> -- Reuti
>
>>
>> Thomas
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Tuesday, December 20, 2016 4:25 PM
>> To: Thomas Beaudry
>> Cc: sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> Hi,
>>
>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>
>>> Hi,
>>>
>>> I've run into a problem recently where users' jobs are stuck in the 'r' state. It doesn't always happen, but it happens often enough to be a persistent error. My guess is that it is I/O related (the jobs are accessing an NFS 4.1 share off of a Windows 2012 file server). I really don't know how to debug this, since I'm not getting any useful info from qstat -j <jobid> and the /var/log/* logs don't seem to give me any clues; or maybe I'm missing something.
>>>
>>> I would be very grateful if anyone has any suggestions as to where I can start to debug this issue. My cluster is unusable because of this error.
>>
>> You mean the job has already exited and is not removed from `qstat`? Usually there is a delay of some minutes for parallel jobs.
>>
>> What does the messages file in the spool directory of the nodes say?
>> Unless it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>
>> -- Reuti
>>
>>> Thanks,
>>> Thomas
>>> _______________________________________________
>>> SGE-discuss mailing list
>>> SGE-discuss@liv.ac.uk
>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
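[Editor's note: the execd messages file Reuti points at can be checked per job with a short sketch. "nodeXY" is the thread's own placeholder for the exec host's spool subdirectory; the job id here is hypothetical, and SGE_ROOT defaults to /opt/sge to match the execd_spool_dir shown in the configuration above:]

```shell
# Pull everything the execd logged about one job from the node's
# messages file. Substitute the real spool subdir and job id.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
node="nodeXY"     # placeholder: the exec host's spool subdirectory
jobid="12345"     # hypothetical: the stuck job's id

msgfile="$SGE_ROOT/default/spool/$node/messages"
result=$(grep -w "$jobid" "$msgfile" 2>/dev/null \
         || echo "no entries for job $jobid in $msgfile")
echo "$result"
```

If the file has nothing for the job even at loglevel log_info, the shepherd likely never got far enough to log, which again points at the node (or its NFS mounts) rather than the qmaster.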