On 21.12.2016 at 21:22, Thomas Beaudry wrote:

> These are the last lines of qstat -j:
>
> script_file:                STDIN
> binding:                    NONE
> job_type:                   NONE
> usage    1:                 cpu=00:00:01, mem=0.04828 GBs, io=0.02693 GB,
>                             vmem=56.121M, maxvmem=57.477M

Aha, and the cpu time never increases? Then it's really a problem with the application's disk access to the NAS.

Depending on the application: would it help to copy the file from the NAS to the $TMPDIR on the node beforehand, and then perform the computation on the local file? A copy process might access the file just sequentially, while the application could do random seeks.
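A minimal sketch of such a staging job script (illustrative only: the paths, file names, and the mnc2nii argument order are placeholders to be adapted to the real job):

#!/bin/sh
# $TMPDIR points to the node-local scratch directory SGE creates per job.
# 1. Copy the input from the NAS sequentially to the local disk:
cp /NAS/path/to/input.mnc "$TMPDIR/"
# 2. Compute on the local copy, so any random seeks hit the local disk only:
mnc2nii -short -nii "$TMPDIR/input.mnc" "$TMPDIR/output.nii"
# 3. Copy the result back to the NAS in one sequential write:
cp "$TMPDIR/output.nii" /NAS/path/to/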
-- Reuti

> binding    1:               NONE
> scheduling info:            (Collecting of scheduler job information is turned off)
>
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Wednesday, December 21, 2016 3:20 PM
> To: Thomas Beaudry
> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 21.12.2016 at 21:11, Thomas Beaudry wrote:
>
>> Hi Reuti,
>>
>> I think it's good:
>>
>> 69847 ?  S    0:00  \_ sge_shepherd-15272 -bg
>> 69848 ?  SNs  0:00      \_ /bin/sh /opt/sge/default/spool/perf-hpc04/job_scripts/15272
>> 69850 ?  DN   0:01          \_ mnc2nii -short -nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_
>
> State D means an uninterruptible kernel task, often caused by disk IO. So the job looks like it's running, but it is waiting for the disk. And `qstat -j 15272` has no output? It should have a line:
>
> usage    1: cpu=…, mem=… GBs, io=…, vmem=…M, maxvmem=…M
>
> -- Reuti
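> For example, to check whether the job ever accumulates cpu time, one could watch that line (a sketch; the interval is arbitrary):
>
> $ watch -n 60 'qstat -j 15272 | grep "^usage"'
>
> And to see which kernel function the blocked process is sleeping in (the wchan column), using the pid from the tree above:
>
> $ ps -o pid,stat,wchan:32,cmd -p 69850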
>
>> ______________
>>
>> With the original pid of the shepherd (69847), I get:
>>
>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69847"
>>
>> Dec 19 11:23:48 perf-hpc04 kernel: [251479.437471] audit: type=1400 audit(1482164628.332:69847): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/sbin/adcli" name="/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2" pid=31820 comm="adcli" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
>>
>> Are we getting somewhere?
>>
>> Thanks so much!
>> Thomas
>> __________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Wednesday, December 21, 2016 2:38 PM
>> To: Thomas Beaudry
>> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> On 21.12.2016 at 18:41, Thomas Beaudry wrote:
>>
>>> Hi Hanby,
>>>
>>> Yes, I've checked before - no need to excuse yourself, any suggestion is helpful because I am really stumped on finding a solution. This is what I've tried on a machine that has a job in the 'r' state:
>>>
>>> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"
>>
>> How is "perform-admin" related to the job? The job should be a child of the sge_shepherd. You can get a tree of all processes with:
>>
>> $ ps -e f
>>
>> (that's f without a leading dash). The relevant SGE processes are the sge_execd, the sge_shepherds it spawns for the jobs started on that node, and their children.
>>
>> -- Reuti
>>
>>> perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii -short -nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
>>>
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
>>>
>>> I'm not finding anything helpful.
>>>
>>> Thanks so much, guys!
>>> Thomas
>>> ________________________________________
>>> From: Hanby, Mike <mha...@uab.edu>
>>> Sent: Wednesday, December 21, 2016 12:34 PM
>>> To: Thomas Beaudry; Reuti
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> Please excuse me if you've already checked this, but are you sure that all job-related processes have terminated on the compute nodes?
>>>
>>> Just a thought.
>>>
>>> -----Original Message-----
>>> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas Beaudry <thomas.beau...@concordia.ca>
>>> Date: Wednesday, December 21, 2016 at 11:58
>>> To: Reuti <re...@staff.uni-marburg.de>
>>> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> Hi Reuti,
>>>
>>> Setting the loglevel to log_info didn't add any additional warnings to my spool messages file.
>>>
>>> Any other ideas as to what I can do?
>>>
>>> Thanks!
>>> Thomas
>>> ________________________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Wednesday, December 21, 2016 5:48 AM
>>> To: Thomas Beaudry
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>>> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>>>>
>>>> Hi Reuti,
>>>>
>>>> It is: loglevel log_warning
>>>
>>> Please set it to log_info; then you will get more output in the messages file (`man sge_conf`). Maybe you'll get some hints then.
>>>
>>>> In case it helps, here is the full output:
>>>>
>>>> #global:
>>>> execd_spool_dir              /opt/sge/default/spool
>>>
>>> This can be set to have the spool directories local, to save some network traffic. My favorite place is /var/spool/sge, which is owned by the account owning SGE (for me sgeadmin:gridware).
>>>
>>> - Create the local spool directories
>>> - Adjust the setting in the configuration to read /var/spool/sge
>>> - Shut down the execd's
>>> - Start the execd's
>>>
>>> This will then create the subdirectory "nodeXY" therein automatically on each exechost (see the sketch below).
>>>
>>> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>>>
>>> -- Reuti
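>>> A rough sketch of these steps (assuming the execd's are controlled via the sgeexecd init script of a standard installation; the owning account and the script's path depend on your setup):
>>>
>>> # on each exechost, create the local spool directory:
>>> $ sudo mkdir -p /var/spool/sge
>>> $ sudo chown sgeadmin:gridware /var/spool/sge
>>>
>>> # as SGE admin, edit the global configuration and set
>>> # execd_spool_dir to /var/spool/sge:
>>> $ qconf -mconf
>>>
>>> # then on each exechost, restart the execution daemon:
>>> $ sudo /etc/init.d/sgeexecd stop
>>> $ sudo /etc/init.d/sgeexecd start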
>>>
>>>> mailer                       /bin/mail
>>>> xterm                        /usr/bin/xterm
>>>> load_sensor                  none
>>>> prolog                       none
>>>> epilog                       none
>>>> shell_start_mode             posix_compliant
>>>> login_shells                 sh,bash,ksh,csh,tcsh
>>>> min_uid                      100
>>>> min_gid                      100
>>>> user_lists                   none
>>>> xuser_lists                  none
>>>> projects                     none
>>>> xprojects                    none
>>>> enforce_project              false
>>>> enforce_user                 auto
>>>> load_report_time             00:00:40
>>>> max_unheard                  00:05:00
>>>> reschedule_unknown           00:00:00
>>>> loglevel                     log_warning
>>>> administrator_mail           thomas.beau...@concordia.ca
>>>> set_token_cmd                none
>>>> pag_cmd                      none
>>>> token_extend_time            none
>>>> shepherd_cmd                 none
>>>> qmaster_params               none
>>>> execd_params                 none
>>>> reporting_params             accounting=true reporting=false \
>>>>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
>>>> finished_jobs                100
>>>> gid_range                    20000-20100
>>>> qlogin_command               builtin
>>>> qlogin_daemon                builtin
>>>> rlogin_command               builtin
>>>> rlogin_daemon                builtin
>>>> rsh_command                  builtin
>>>> rsh_daemon                   builtin
>>>> max_aj_instances             2000
>>>> max_aj_tasks                 75000
>>>> max_u_jobs                   0
>>>> max_jobs                     0
>>>> max_advance_reservations     0
>>>> auto_user_oticket            0
>>>> auto_user_fshare             0
>>>> auto_user_default_project    none
>>>> auto_user_delete_time        86400
>>>> delegated_file_staging       false
>>>> reprioritize                 0
>>>> jsv_url                      none
>>>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>>>
>>>> Thanks!
>>>> Thomas
>>>> ________________________________________
>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>> Sent: Tuesday, December 20, 2016 5:35 PM
>>>> To: Thomas Beaudry
>>>> Cc: sge-discuss@liv.ac.uk
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> The jobs stay in the queue forever - and don't get processed. There are no messages in the spool directory for these jobs.
>>>>
>>>> The "r" state comes after the "t" state; with NFS problems jobs are often stuck in the "t" state. What is your setting of:
>>>>
>>>> $ qconf -sconf
>>>> ...
>>>> loglevel log_info
>>>>
>>>> -- Reuti
>>>>
>>>>> Thomas
>>>>> ________________________________________
>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>>>> To: Thomas Beaudry
>>>>> Cc: sge-discuss@liv.ac.uk
>>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've run into a problem recently where users' jobs are stuck in the 'r' state. It doesn't always happen, but it happens often enough to be a persistent error. My guess is that it is IO related (the jobs are accessing an NFS 4.1 share off of a Windows 2012 file server). I really don't know how to debug this, since I'm not getting any useful info from qstat -j <jobid>, and the /var/log/* logs don't seem to give me any clues - or maybe I'm missing something.
>>>>>>
>>>>>> I would be very grateful if anyone has any suggestions as to where I can start to debug this issue. My cluster is unusable because of this error.
>>>>>
>>>>> You mean the job already exited and is not removed from `qstat`? Usually there is a delay of some minutes for parallel jobs.
>>>>>
>>>>> What does the messages file in the spool directory of the nodes say? Unless it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Thanks,
>>>>>> Thomas

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss