On 21.12.2016 at 21:22, Thomas Beaudry wrote:

> These are the last lines of qstat -j:
>
> script_file:                STDIN
> binding:                    NONE
> job_type:                   NONE
> usage    1:                 cpu=00:00:01, mem=0.04828 GBs, io=0.02693 GB,
>                             vmem=56.121M, maxvmem=57.477M

Aha, and the cpu time never increases? Then it's really a problem with the application's disk access to the NAS.

Depending on the application: would it help to copy the file from the NAS to the $TMPDIR on the node beforehand, and then perform the computation on the local file? A copy process might access the file just sequentially, while the application could do random seeks.
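A minimal sketch of such a staging job script (illustrative only: the paths, file names, and the mnc2nii argument order are placeholders to be adapted to the real job):

#!/bin/sh
# $TMPDIR points to the node-local scratch directory SGE creates per job.
# 1. Copy the input from the NAS sequentially to the local disk:
cp /NAS/path/to/input.mnc "$TMPDIR/"
# 2. Compute on the local copy, so any random seeks hit the local disk only:
mnc2nii -short -nii "$TMPDIR/input.mnc" "$TMPDIR/output.nii"
# 3. Copy the result back to the NAS in one sequential write:
cp "$TMPDIR/output.nii" /NAS/path/to/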
-- Reuti

> binding    1:               NONE
> scheduling info:            (Collecting of scheduler job information is turned off)
>
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Wednesday, December 21, 2016 3:20 PM
> To: Thomas Beaudry
> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 21.12.2016 at 21:11, Thomas Beaudry wrote:
>
>> Hi Reuti,
>>
>> I think it's good:
>>
>> 69847 ?  S    0:00  \_ sge_shepherd-15272 -bg
>> 69848 ?  SNs  0:00      \_ /bin/sh /opt/sge/default/spool/perf-hpc04/job_scripts/15272
>> 69850 ?  DN   0:01          \_ mnc2nii -short -nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_
>
> State D means an uninterruptible kernel task, often caused by disk IO. So the job looks like it's running, but it is waiting for the disk. And `qstat -j 15272` has no output? It should have a line:
>
> usage    1: cpu=…, mem=… GBs, io=…, vmem=…M, maxvmem=…M
>
> -- Reuti
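> For example, to check whether the job ever accumulates cpu time, one could watch that line (a sketch; the interval is arbitrary):
>
> $ watch -n 60 'qstat -j 15272 | grep "^usage"'
>
> And to see which kernel function the blocked process is sleeping in (the wchan column), using the pid from the tree above:
>
> $ ps -o pid,stat,wchan:32,cmd -p 69850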
>
>> ______________
>>
>> With the original pid of the shepherd (69847), I get:
>>
>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69847"
>>
>> Dec 19 11:23:48 perf-hpc04 kernel: [251479.437471] audit: type=1400 audit(1482164628.332:69847): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/sbin/adcli" name="/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2" pid=31820 comm="adcli" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
>>
>> Are we getting somewhere?
>>
>> Thanks so much!
>> Thomas
>> __________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Wednesday, December 21, 2016 2:38 PM
>> To: Thomas Beaudry
>> Cc: Hanby, Mike; sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> On 21.12.2016 at 18:41, Thomas Beaudry wrote:
>>
>>> Hi Hanby,
>>>
>>> Yes, I've checked before - no need to excuse yourself, any suggestion is helpful because I am really stumped on finding a solution. This is what I've tried on a machine that has a job in the 'r' state:
>>>
>>> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"
>>
>> How is "perform-admin" related to the job? The job should be a child of the sge_shepherd. You can get a tree of all processes with:
>>
>> $ ps -e f
>>
>> (that's f without a leading dash). The relevant SGE processes are the sge_execd, the sge_shepherds it spawns for the jobs started on that node, and their children.
>>
>> -- Reuti
>>
>>> perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii -short -nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
>>>
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
>>> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
>>>
>>> I'm not finding anything helpful.
>>>
>>> Thanks so much, guys!
>>> Thomas
>>> ________________________________________
>>> From: Hanby, Mike <mha...@uab.edu>
>>> Sent: Wednesday, December 21, 2016 12:34 PM
>>> To: Thomas Beaudry; Reuti
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> Please excuse me if you've already checked this, but are you sure that all job-related processes have terminated on the compute nodes?
>>>
>>> Just a thought.
>>>
>>> -----Original Message-----
>>> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas Beaudry <thomas.beau...@concordia.ca>
>>> Date: Wednesday, December 21, 2016 at 11:58
>>> To: Reuti <re...@staff.uni-marburg.de>
>>> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> Hi Reuti,
>>>
>>> Setting the loglevel to log_info didn't add any additional warnings to my spool messages file.
>>>
>>> Any other ideas as to what I can do?
>>>
>>> Thanks!
>>> Thomas
>>> ________________________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Wednesday, December 21, 2016 5:48 AM
>>> To: Thomas Beaudry
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>>> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>>>>
>>>> Hi Reuti,
>>>>
>>>> It is: loglevel log_warning
>>>
>>> Please set it to log_info; then you will get more output in the messages file (`man sge_conf`). Maybe you'll get some hints then.
>>>
>>>> In case it helps, here is the full output:
>>>>
>>>> #global:
>>>> execd_spool_dir              /opt/sge/default/spool
>>>
>>> This can be set to have the spool directories local, to save some network traffic. My favorite place is /var/spool/sge, which is owned by the account owning SGE (for me sgeadmin:gridware).
>>>
>>> - Create the local spool directories
>>> - Adjust the setting in the configuration to read /var/spool/sge
>>> - Shut down the execd's
>>> - Start the execd's
>>>
>>> This will then create the subdirectory "nodeXY" therein automatically on each exechost (see the sketch below).
>>>
>>> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>>>
>>> -- Reuti
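>>> A rough sketch of these steps (assuming the execd's are controlled via the sgeexecd init script of a standard installation; the owning account and the script's path depend on your setup):
>>>
>>> # on each exechost, create the local spool directory:
>>> $ sudo mkdir -p /var/spool/sge
>>> $ sudo chown sgeadmin:gridware /var/spool/sge
>>>
>>> # as SGE admin, edit the global configuration and set
>>> # execd_spool_dir to /var/spool/sge:
>>> $ qconf -mconf
>>>
>>> # then on each exechost, restart the execution daemon:
>>> $ sudo /etc/init.d/sgeexecd stop
>>> $ sudo /etc/init.d/sgeexecd start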
>>>
>>>> mailer                       /bin/mail
>>>> xterm                        /usr/bin/xterm
>>>> load_sensor                  none
>>>> prolog                       none
>>>> epilog                       none
>>>> shell_start_mode             posix_compliant
>>>> login_shells                 sh,bash,ksh,csh,tcsh
>>>> min_uid                      100
>>>> min_gid                      100
>>>> user_lists                   none
>>>> xuser_lists                  none
>>>> projects                     none
>>>> xprojects                    none
>>>> enforce_project              false
>>>> enforce_user                 auto
>>>> load_report_time             00:00:40
>>>> max_unheard                  00:05:00
>>>> reschedule_unknown           00:00:00
>>>> loglevel                     log_warning
>>>> administrator_mail           thomas.beau...@concordia.ca
>>>> set_token_cmd                none
>>>> pag_cmd                      none
>>>> token_extend_time            none
>>>> shepherd_cmd                 none
>>>> qmaster_params               none
>>>> execd_params                 none
>>>> reporting_params             accounting=true reporting=false \
>>>>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
>>>> finished_jobs                100
>>>> gid_range                    20000-20100
>>>> qlogin_command               builtin
>>>> qlogin_daemon                builtin
>>>> rlogin_command               builtin
>>>> rlogin_daemon                builtin
>>>> rsh_command                  builtin
>>>> rsh_daemon                   builtin
>>>> max_aj_instances             2000
>>>> max_aj_tasks                 75000
>>>> max_u_jobs                   0
>>>> max_jobs                     0
>>>> max_advance_reservations     0
>>>> auto_user_oticket            0
>>>> auto_user_fshare             0
>>>> auto_user_default_project    none
>>>> auto_user_delete_time        86400
>>>> delegated_file_staging       false
>>>> reprioritize                 0
>>>> jsv_url                      none
>>>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>>>
>>>> Thanks!
>>>> Thomas
>>>> ________________________________________
>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>> Sent: Tuesday, December 20, 2016 5:35 PM
>>>> To: Thomas Beaudry
>>>> Cc: sge-discuss@liv.ac.uk
>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>
>>>> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> The jobs stay in the queue forever - and don't get processed. There are no messages in the spool directory for these jobs.
>>>>
>>>> The "r" state comes after the "t" state; with NFS problems jobs are often stuck in the "t" state. What is your setting of:
>>>>
>>>> $ qconf -sconf
>>>> ...
>>>> loglevel log_info
>>>>
>>>> -- Reuti
>>>>
>>>>> Thomas
>>>>> ________________________________________
>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>>>> To: Thomas Beaudry
>>>>> Cc: sge-discuss@liv.ac.uk
>>>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've run into a problem recently where users' jobs are stuck in the 'r' state. It doesn't always happen, but it happens often enough to be a persistent error. My guess is that it is IO related (the jobs are accessing an NFS 4.1 share off of a Windows 2012 file server). I really don't know how to debug this, since I'm not getting any useful info from qstat -j <jobid>, and the /var/log/* logs don't seem to give me any clues - or maybe I'm missing something.
>>>>>>
>>>>>> I would be very grateful if anyone has any suggestions as to where I can start to debug this issue. My cluster is unusable because of this error.
>>>>>
>>>>> You mean the job already exited and is not removed from `qstat`? Usually there is a delay of some minutes for parallel jobs.
>>>>>
>>>>> What does the messages file in the spool directory of the nodes say? Unless it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Thanks,
>>>>>> Thomas

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss