On 21.12.2016 at 21:47, Thomas Beaudry wrote:
> Hi Reuti,
>
> My initial guess was that it was the disk access to the NAS, since if I run
> the job several times, it will only fail a few times. I'm not quite sure how
> to troubleshoot it since I can't find logs.
As I suggested, can you insert a `cp` in the job script before the computation:

cp /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256… $TMPDIR
mnc2nii -short -nii $TMPDIR/tal_t1_00256… (or the final filename in the long path)

-- Reuti

> Thanks,
> Thomas
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Wednesday, December 21, 2016 3:36 PM
> To: Thomas Beaudry
> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 21.12.2016 at 21:22, Thomas Beaudry wrote:
>
>> These are the last lines of qstat -j:
>>
>> script_file:     STDIN
>> binding:         NONE
>> job_type:        NONE
>> usage    1:      cpu=00:00:01, mem=0.04828 GB s, io=0.02693 GB,
>>                  vmem=56.121M, maxvmem=57.477M
>
> Aha, and the cpu time never increases? Then it's really a problem with the
> disk access of the application to the NAS. Depending on the application:
> would it help to copy the file from the NAS to the $TMPDIR on the node
> beforehand and then perform the computation with the local file? A copy
> process might access the file just sequentially, while the application
> could do random seeks.
>
> -- Reuti
>
>> binding    1:    NONE
>> scheduling info: (Collecting of scheduler job information is turned off)
>>
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Wednesday, December 21, 2016 3:20 PM
>> To: Thomas Beaudry
>> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> On 21.12.2016 at 21:11, Thomas Beaudry wrote:
>>
>>> Hi Reuti,
>>>
>>> I think it's good:
>>>
>>> 69847 ?  S   0:00  \_ sge_shepherd-15272 -bg
>>> 69848 ?  SNs 0:00      \_ /bin/sh /opt/sge/default/spool/perf-hpc04/job_scripts/15272
>>> 69850 ?  DN  0:01          \_ mnc2nii -short -nii
>>> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_
>>
>> State D means an uninterruptible kernel task, often because of disk I/O.
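Reuti's stage-in suggestion could be sketched as a job-script fragment. This is a minimal sketch, not the actual job script from the thread: the NAS path and file names are stand-ins created on the fly so the sketch is self-contained, outside SGE `$TMPDIR` may be unset (hence the `mktemp` fallback), and the real mnc2nii call is left as a comment.

```shell
#!/bin/sh
# Sketch of the stage-in idea: copy the input from the NAS (which the
# application may hit with random seeks) to node-local $TMPDIR with one
# sequential read, then run the converter on the local copy.
TMPDIR="${TMPDIR:-$(mktemp -d)}"   # SGE sets $TMPDIR inside a real job

NAS_DIR=$(mktemp -d)               # stand-in for the long /NAS/... directory
printf 'dummy volume\n' > "$NAS_DIR/tal_t1_input.mnc"

# Stage-in: one sequential copy instead of random seeks over NFS
cp "$NAS_DIR/tal_t1_input.mnc" "$TMPDIR/"

# The real computation would then use the local file, e.g.:
# mnc2nii -short -nii "$TMPDIR/tal_t1_input.mnc" "$TMPDIR/output.nii"
ls -l "$TMPDIR/tal_t1_input.mnc"
```

If the copy itself also hangs in state D, that points at the NAS rather than at the application's access pattern.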
>> So it looks running, but is waiting for the disk. And `qstat -j 15272` has
>> no output? It should have a line:
>>
>> usage    1: cpu=…, mem=… GBs, io=…, vmem=…M, maxvmem=…M
>>
>> -- Reuti
>>
>>> ______________
>>>
>>> With the original pid of the shepherd (69847), I get:
>>>
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69847"
>>>
>>> Dec 19 11:23:48 perf-hpc04 kernel: [251479.437471] audit: type=1400
>>> audit(1482164628.332:69847): apparmor="ALLOWED" operation="open"
>>> profile="/usr/sbin/sssd//null-/usr/sbin/adcli"
>>> name="/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2" pid=31820
>>> comm="adcli" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
>>>
>>> Are we getting somewhere?
>>>
>>> Thanks so much!
>>> Thomas
>>> __________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Wednesday, December 21, 2016 2:38 PM
>>> To: Thomas Beaudry
>>> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> On 21.12.2016 at 18:41, Thomas Beaudry wrote:
>>>
>>>> Hi Hanby,
>>>>
>>>> Yes, I've checked before - no need to excuse yourself; any suggestion is
>>>> helpful because I am really stumped on finding a solution. This is what
>>>> I've tried on a machine that has a job in the 'r' state:
>>>>
>>>> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"
>>>
>>> How is "perform-admin" related to the job? The job should be a child of
>>> the sge_shepherd. You can get a tree of all processes with:
>>>
>>> $ ps -e f
>>>
>>> (f without -). The relevant SGE processes are the sge_execd and the
>>> spawned sge_shepherds for all processes started thereon, plus their
>>> children.
>>>
>>> -- Reuti
>>>
>>>> perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii
>>>> -short -nii
>>>> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii
>>>> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
>>>>
>>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
>>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
>>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
>>>>
>>>> I'm not finding anything helpful.
>>>>
>>>> Thanks so much, guys!
>>>> Thomas
>>>> ________________________________________
>>>> From: Hanby, Mike <mha...@uab.edu>
>>>> Sent: Wednesday, December 21, 2016 12:34 PM
>>>> To: Thomas Beaudry; Reuti
>>>> Cc: sge-discuss@liv.ac.uk
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>> Please excuse me if you've already checked this, but are you sure that
>>>> all job-related processes have terminated on the compute nodes?
>>>>
>>>> Just a thought.
>>>>
>>>> -----Original Message-----
>>>> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of
>>>> Thomas Beaudry <thomas.beau...@concordia.ca>
>>>> Date: Wednesday, December 21, 2016 at 11:58
>>>> To: Reuti <re...@staff.uni-marburg.de>
>>>> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>> Hi Reuti,
>>>>
>>>> Setting the loglevel to log_info didn't add any additional warnings to
>>>> my spool messages file.
>>>>
>>>> Any other ideas as to what I can do?
>>>>
>>>> Thanks!
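The `ps` listings quoted above show the mnc2nii process in state `DN` (uninterruptible sleep, niced). A quick way to spot such processes on a node is to filter the STAT column. The listing below is canned to mirror the one in the thread so the example is self-contained; the live command is shown in a comment.

```shell
#!/bin/sh
# On a live node you would run:  ps -eo pid,stat,comm | awk '$2 ~ /^D/'
# Here the same awk filter is applied to a canned listing (pids taken from
# the thread) so the example runs anywhere.
listing='  PID STAT COMMAND
69848 SNs  sh
69850 DN   mnc2nii'

stuck=$(printf '%s\n' "$listing" | awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }')
echo "$stuck"   # prints: 69850 mnc2nii
```

Processes stuck in D cannot be killed until the I/O they are waiting on completes, which is why such jobs sit in 'r' indefinitely.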
>>>> Thomas
>>>> ________________________________________
>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>> Sent: Wednesday, December 21, 2016 5:48 AM
>>>> To: Thomas Beaudry
>>>> Cc: sge-discuss@liv.ac.uk
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>>> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca>
>>>>> wrote:
>>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> It is: loglevel log_warning
>>>>
>>>> Please set it to log_info; then you will get more output in the messages
>>>> file (`man sge_conf`). Maybe you will get some hints then.
>>>>
>>>>> In case it helps, here is the full output:
>>>>>
>>>>> #global:
>>>>> execd_spool_dir    /opt/sge/default/spool
>>>>
>>>> This can be set to have the spool directories local, to save some network
>>>> traffic. My favorite place is /var/spool/sge, which is owned by the
>>>> account owning SGE (for me sgeadmin:gridware).
>>>>
>>>> - Create the local spool directories
>>>> - Adjust the setting in the configuration to read /var/spool/sge
>>>> - Shut down the execd's
>>>> - Start the execd's
>>>>
>>>> This will then create the subdirectory "nodeXY" therein automatically on
>>>> each exechost.
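The four steps above might look like the following; this is a sketch, not a tested recipe - the ownership and the init-script location are assumptions matching the advice above, and `qconf -mconf` opens the global configuration in an editor rather than taking the value on the command line.

```sh
# 1. On each exechost, create the local spool directory (ownership as above):
mkdir -p /var/spool/sge
chown sgeadmin:gridware /var/spool/sge

# 2. Point execd_spool_dir at it in the global configuration:
qconf -mconf            # edit: execd_spool_dir  /var/spool/sge

# 3./4. Shut down and restart the execd on each node, e.g. via the init
#       script the installer puts into $SGE_ROOT/$SGE_CELL/common:
$SGE_ROOT/$SGE_CELL/common/sgeexecd stop
$SGE_ROOT/$SGE_CELL/common/sgeexecd start
```

The nfsreduce howto linked below covers the same procedure in more detail.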
>>>>
>>>> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>>>>
>>>> -- Reuti
>>>>
>>>>> mailer                       /bin/mail
>>>>> xterm                        /usr/bin/xterm
>>>>> load_sensor                  none
>>>>> prolog                       none
>>>>> epilog                       none
>>>>> shell_start_mode             posix_compliant
>>>>> login_shells                 sh,bash,ksh,csh,tcsh
>>>>> min_uid                      100
>>>>> min_gid                      100
>>>>> user_lists                   none
>>>>> xuser_lists                  none
>>>>> projects                     none
>>>>> xprojects                    none
>>>>> enforce_project              false
>>>>> enforce_user                 auto
>>>>> load_report_time             00:00:40
>>>>> max_unheard                  00:05:00
>>>>> reschedule_unknown           00:00:00
>>>>> loglevel                     log_warning
>>>>> administrator_mail           thomas.beau...@concordia.ca
>>>>> set_token_cmd                none
>>>>> pag_cmd                      none
>>>>> token_extend_time            none
>>>>> shepherd_cmd                 none
>>>>> qmaster_params               none
>>>>> execd_params                 none
>>>>> reporting_params             accounting=true reporting=false \
>>>>>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
>>>>> finished_jobs                100
>>>>> gid_range                    20000-20100
>>>>> qlogin_command               builtin
>>>>> qlogin_daemon                builtin
>>>>> rlogin_command               builtin
>>>>> rlogin_daemon                builtin
>>>>> rsh_command                  builtin
>>>>> rsh_daemon                   builtin
>>>>> max_aj_instances             2000
>>>>> max_aj_tasks                 75000
>>>>> max_u_jobs                   0
>>>>> max_jobs                     0
>>>>> max_advance_reservations     0
>>>>> auto_user_oticket            0
>>>>> auto_user_fshare             0
>>>>> auto_user_default_project    none
>>>>> auto_user_delete_time        86400
>>>>> delegated_file_staging       false
>>>>> reprioritize                 0
>>>>> jsv_url                      none
>>>>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>>>>
>>>>> Thanks!
>>>>> Thomas
>>>>> ________________________________________
>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>> Sent: Tuesday, December 20, 2016 5:35 PM
>>>>> To: Thomas Beaudry
>>>>> Cc: sge-discuss@liv.ac.uk
>>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>>
>>>>> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> The jobs stay in the queue forever - and don't get processed. There
>>>>>> are no messages in the spool directory for these jobs.
>>>>> The "r" state is already after the "t" state. With NFS problems they
>>>>> are often stuck in the "t" state. What is your setting of:
>>>>>
>>>>> $ qconf -sconf
>>>>> ...
>>>>> loglevel log_info
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>>
>>>>>> Thomas
>>>>>> ________________________________________
>>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>>>>> To: Thomas Beaudry
>>>>>> Cc: sge-discuss@liv.ac.uk
>>>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've run into a problem recently where users' jobs are stuck in the
>>>>>>> 'r' state. It doesn't always happen, but it's happening often enough
>>>>>>> to be a persistent error. My guess is that it is I/O related (the
>>>>>>> jobs are accessing an NFS 4.1 share off of a Windows 2012 file
>>>>>>> server). I really don't know how to debug this since I'm not getting
>>>>>>> any useful info from qstat -j <jobid>, and the /var/log/* logs don't
>>>>>>> seem to give me any clues - or maybe I'm missing something.
>>>>>>>
>>>>>>> I would be very grateful if anyone has any suggestions as to where I
>>>>>>> can start to debug this issue. My cluster is unusable because of
>>>>>>> this error.
>>>>>>
>>>>>> You mean the job exited already and is not removed from `qstat`?
>>>>>> Usually there is a delay of some minutes for parallel jobs.
>>>>>>
>>>>>> What does the messages file in the spool directory of the nodes say?
>>>>>> Unless it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>> _______________________________________________
>>>>>>> SGE-discuss mailing list
>>>>>>> SGE-discuss@liv.ac.uk
>>>>>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss