On 21.12.2016 at 21:47, Thomas Beaudry wrote:
> Hi Reuti,
>
> My initial guess was that it was the disk access to the NAS, since if I run
> the job several times, it will only fail a few times. I'm not quite sure how
> to troubleshoot it since I can't find logs.
As I suggested, can you insert a `cp` in the job script before the computation:

cp /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256… $TMPDIR
mnc2nii -short -nii $TMPDIR/tal_t1_00256… (or the final filename in the long path)

-- Reuti

> Thanks,
> Thomas
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Wednesday, December 21, 2016 3:36 PM
> To: Thomas Beaudry
> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 21.12.2016 at 21:22, Thomas Beaudry wrote:
>
>> These are the last lines of qstat -j:
>>
>> script_file:     STDIN
>> binding:         NONE
>> job_type:        NONE
>> usage    1:      cpu=00:00:01, mem=0.04828 GB s, io=0.02693 GB,
>>                  vmem=56.121M, maxvmem=57.477M
>
> Aha, and the cpu time never increases? Then it's really a problem with the
> disk access of the application to the NAS. Depending on the application:
> would it help to copy the file from the NAS to the $TMPDIR on the node
> beforehand and then perform the computation with the local file? A copy
> process might access the file just sequentially, while the application
> could do random seeks.
>
> -- Reuti
>
>> binding    1:    NONE
>> scheduling info: (Collecting of scheduler job information is turned off)
>>
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Wednesday, December 21, 2016 3:20 PM
>> To: Thomas Beaudry
>> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> On 21.12.2016 at 21:11, Thomas Beaudry wrote:
>>
>>> Hi Reuti,
>>>
>>> I think it's good:
>>>
>>> 69847 ?  S   0:00  \_ sge_shepherd-15272 -bg
>>> 69848 ?  SNs 0:00      \_ /bin/sh /opt/sge/default/spool/perf-hpc04/job_scripts/15272
>>> 69850 ?  DN  0:01          \_ mnc2nii -short -nii
>>> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_
>>
>> State D means an uninterruptible kernel task, often because of disk I/O.
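Reuti's stage-in suggestion could be sketched as a job-script fragment. This is a minimal sketch, not the actual job script from the thread: the NAS path and file names are stand-ins created on the fly so the sketch is self-contained, outside SGE `$TMPDIR` may be unset (hence the `mktemp` fallback), and the real mnc2nii call is left as a comment.

```shell
#!/bin/sh
# Sketch of the stage-in idea: copy the input from the NAS (which the
# application may hit with random seeks) to node-local $TMPDIR with one
# sequential read, then run the converter on the local copy.
TMPDIR="${TMPDIR:-$(mktemp -d)}"   # SGE sets $TMPDIR inside a real job

NAS_DIR=$(mktemp -d)               # stand-in for the long /NAS/... directory
printf 'dummy volume\n' > "$NAS_DIR/tal_t1_input.mnc"

# Stage-in: one sequential copy instead of random seeks over NFS
cp "$NAS_DIR/tal_t1_input.mnc" "$TMPDIR/"

# The real computation would then use the local file, e.g.:
# mnc2nii -short -nii "$TMPDIR/tal_t1_input.mnc" "$TMPDIR/output.nii"
ls -l "$TMPDIR/tal_t1_input.mnc"
```

If the copy itself also hangs in state D, that points at the NAS rather than at the application's access pattern.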
>> So it looks running, but is waiting for the disk. And `qstat -j 15272` has
>> no output? It should have a line:
>>
>> usage    1: cpu=…, mem=… GBs, io=…, vmem=…M, maxvmem=…M
>>
>> -- Reuti
>>
>>> ______________
>>>
>>> With the original pid of the shepherd (69847), I get:
>>>
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69847"
>>>
>>> Dec 19 11:23:48 perf-hpc04 kernel: [251479.437471] audit: type=1400
>>> audit(1482164628.332:69847): apparmor="ALLOWED" operation="open"
>>> profile="/usr/sbin/sssd//null-/usr/sbin/adcli"
>>> name="/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2" pid=31820
>>> comm="adcli" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
>>>
>>> Are we getting somewhere?
>>>
>>> Thanks so much!
>>> Thomas
>>> __________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Wednesday, December 21, 2016 2:38 PM
>>> To: Thomas Beaudry
>>> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> On 21.12.2016 at 18:41, Thomas Beaudry wrote:
>>>
>>>> Hi Hanby,
>>>>
>>>> Yes, I've checked before - no need to excuse yourself; any suggestion is
>>>> helpful because I am really stumped on finding a solution. This is what
>>>> I've tried on a machine that has a job in the 'r' state:
>>>>
>>>> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"
>>>
>>> How is "perform-admin" related to the job? The job should be a child of
>>> the sge_shepherd. You can get a tree of all processes with:
>>>
>>> $ ps -e f
>>>
>>> (f without -). The relevant SGE processes are the sge_execd and the
>>> spawned sge_shepherds for all processes started thereon, plus their
>>> children.
>>>
>>> -- Reuti
>>>
>>>> perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii
>>>> -short -nii
>>>> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii
>>>> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
>>>>
>>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
>>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
>>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
>>>>
>>>> I'm not finding anything helpful.
>>>>
>>>> Thanks so much, guys!
>>>> Thomas
>>>> ________________________________________
>>>> From: Hanby, Mike <mha...@uab.edu>
>>>> Sent: Wednesday, December 21, 2016 12:34 PM
>>>> To: Thomas Beaudry; Reuti
>>>> Cc: sge-discuss@liv.ac.uk
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>> Please excuse me if you've already checked this, but are you sure that
>>>> all job-related processes have terminated on the compute nodes?
>>>>
>>>> Just a thought.
>>>>
>>>> -----Original Message-----
>>>> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of
>>>> Thomas Beaudry <thomas.beau...@concordia.ca>
>>>> Date: Wednesday, December 21, 2016 at 11:58
>>>> To: Reuti <re...@staff.uni-marburg.de>
>>>> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>> Hi Reuti,
>>>>
>>>> Setting the loglevel to log_info didn't add any additional warnings to
>>>> my spool messages file.
>>>>
>>>> Any other ideas as to what I can do?
>>>>
>>>> Thanks!
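The `ps` listings quoted above show the mnc2nii process in state `DN` (uninterruptible sleep, niced). A quick way to spot such processes on a node is to filter the STAT column. The listing below is canned to mirror the one in the thread so the example is self-contained; the live command is shown in a comment.

```shell
#!/bin/sh
# On a live node you would run:  ps -eo pid,stat,comm | awk '$2 ~ /^D/'
# Here the same awk filter is applied to a canned listing (pids taken from
# the thread) so the example runs anywhere.
listing='  PID STAT COMMAND
69848 SNs  sh
69850 DN   mnc2nii'

stuck=$(printf '%s\n' "$listing" | awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }')
echo "$stuck"   # prints: 69850 mnc2nii
```

Processes stuck in D cannot be killed until the I/O they are waiting on completes, which is why such jobs sit in 'r' indefinitely.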
>>>> Thomas
>>>> ________________________________________
>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>> Sent: Wednesday, December 21, 2016 5:48 AM
>>>> To: Thomas Beaudry
>>>> Cc: sge-discuss@liv.ac.uk
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>>> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca>
>>>>> wrote:
>>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> It is: loglevel log_warning
>>>>
>>>> Please set it to log_info; then you will get more output in the messages
>>>> file (`man sge_conf`). Maybe you will get some hints then.
>>>>
>>>>> In case it helps, here is the full output:
>>>>>
>>>>> #global:
>>>>> execd_spool_dir    /opt/sge/default/spool
>>>>
>>>> This can be set to have the spool directories local, to save some network
>>>> traffic. My favorite place is /var/spool/sge, which is owned by the
>>>> account owning SGE (for me sgeadmin:gridware).
>>>>
>>>> - Create the local spool directories
>>>> - Adjust the setting in the configuration to read /var/spool/sge
>>>> - Shut down the execd's
>>>> - Start the execd's
>>>>
>>>> This will then create the subdirectory "nodeXY" therein automatically on
>>>> each exechost.
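The four steps above might look like the following; this is a sketch, not a tested recipe - the ownership and the init-script location are assumptions matching the advice above, and `qconf -mconf` opens the global configuration in an editor rather than taking the value on the command line.

```sh
# 1. On each exechost, create the local spool directory (ownership as above):
mkdir -p /var/spool/sge
chown sgeadmin:gridware /var/spool/sge

# 2. Point execd_spool_dir at it in the global configuration:
qconf -mconf            # edit: execd_spool_dir  /var/spool/sge

# 3./4. Shut down and restart the execd on each node, e.g. via the init
#       script the installer puts into $SGE_ROOT/$SGE_CELL/common:
$SGE_ROOT/$SGE_CELL/common/sgeexecd stop
$SGE_ROOT/$SGE_CELL/common/sgeexecd start
```

The nfsreduce howto linked below covers the same procedure in more detail.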
>>>>
>>>> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>>>>
>>>> -- Reuti
>>>>
>>>>> mailer                       /bin/mail
>>>>> xterm                        /usr/bin/xterm
>>>>> load_sensor                  none
>>>>> prolog                       none
>>>>> epilog                       none
>>>>> shell_start_mode             posix_compliant
>>>>> login_shells                 sh,bash,ksh,csh,tcsh
>>>>> min_uid                      100
>>>>> min_gid                      100
>>>>> user_lists                   none
>>>>> xuser_lists                  none
>>>>> projects                     none
>>>>> xprojects                    none
>>>>> enforce_project              false
>>>>> enforce_user                 auto
>>>>> load_report_time             00:00:40
>>>>> max_unheard                  00:05:00
>>>>> reschedule_unknown           00:00:00
>>>>> loglevel                     log_warning
>>>>> administrator_mail           thomas.beau...@concordia.ca
>>>>> set_token_cmd                none
>>>>> pag_cmd                      none
>>>>> token_extend_time            none
>>>>> shepherd_cmd                 none
>>>>> qmaster_params               none
>>>>> execd_params                 none
>>>>> reporting_params             accounting=true reporting=false \
>>>>>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
>>>>> finished_jobs                100
>>>>> gid_range                    20000-20100
>>>>> qlogin_command               builtin
>>>>> qlogin_daemon                builtin
>>>>> rlogin_command               builtin
>>>>> rlogin_daemon                builtin
>>>>> rsh_command                  builtin
>>>>> rsh_daemon                   builtin
>>>>> max_aj_instances             2000
>>>>> max_aj_tasks                 75000
>>>>> max_u_jobs                   0
>>>>> max_jobs                     0
>>>>> max_advance_reservations     0
>>>>> auto_user_oticket            0
>>>>> auto_user_fshare             0
>>>>> auto_user_default_project    none
>>>>> auto_user_delete_time        86400
>>>>> delegated_file_staging       false
>>>>> reprioritize                 0
>>>>> jsv_url                      none
>>>>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>>>>
>>>>> Thanks!
>>>>> Thomas
>>>>> ________________________________________
>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>> Sent: Tuesday, December 20, 2016 5:35 PM
>>>>> To: Thomas Beaudry
>>>>> Cc: sge-discuss@liv.ac.uk
>>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>>
>>>>> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> The jobs stay in the queue forever - and don't get processed. There
>>>>>> are no messages in the spool directory for these jobs.
>>>>> The "r" state is already after the "t" state. With NFS problems they
>>>>> are often stuck in the "t" state. What is your setting of:
>>>>>
>>>>> $ qconf -sconf
>>>>> ...
>>>>> loglevel log_info
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>>
>>>>>> Thomas
>>>>>> ________________________________________
>>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>>>>> To: Thomas Beaudry
>>>>>> Cc: sge-discuss@liv.ac.uk
>>>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've run into a problem recently where users' jobs are stuck in the
>>>>>>> 'r' state. It doesn't always happen, but it's happening often enough
>>>>>>> to be a persistent error. My guess is that it is I/O related (the
>>>>>>> jobs are accessing an NFS 4.1 share off of a Windows 2012 file
>>>>>>> server). I really don't know how to debug this since I'm not getting
>>>>>>> any useful info from qstat -j <jobid>, and the /var/log/* logs don't
>>>>>>> seem to give me any clues - or maybe I'm missing something.
>>>>>>>
>>>>>>> I would be very grateful if anyone has any suggestions as to where I
>>>>>>> can start to debug this issue. My cluster is unusable because of
>>>>>>> this error.
>>>>>>
>>>>>> You mean the job exited already and is not removed from `qstat`?
>>>>>> Usually there is a delay of some minutes for parallel jobs.
>>>>>>
>>>>>> What does the messages file in the spool directory of the nodes say?
>>>>>> Unless it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>> _______________________________________________
>>>>>>> SGE-discuss mailing list
>>>>>>> SGE-discuss@liv.ac.uk
>>>>>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss