Hi Hanby,

Yes, I've checked before. No need to excuse yourself; any suggestion is helpful because I am really stumped on finding a solution. This is what I've tried on a machine that has a job in the 'r' state:
perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"
perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii -short -nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"

I'm not finding anything helpful.

Thanks so much, guys!
Thomas
________________________________________
From: Hanby, Mike <mha...@uab.edu>
Sent: Wednesday, December 21, 2016 12:34 PM
To: Thomas Beaudry; Reuti
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Please excuse me if you've already checked this, but are you sure that all job-related processes have terminated on the compute nodes? Just a thought.

-----Original Message-----
From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas Beaudry <thomas.beau...@concordia.ca>
Date: Wednesday, December 21, 2016 at 11:58
To: Reuti <re...@staff.uni-marburg.de>
Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Hi Reuti,

Setting the loglevel to log_info didn't add any additional warnings to my spool messages file. Any other ideas as to what I can do?

Thanks!
Thomas
________________________________________
From: Reuti <re...@staff.uni-marburg.de>
Sent: Wednesday, December 21, 2016 5:48 AM
To: Thomas Beaudry
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>
> Hi Reuti,
>
> It is: loglevel log_warning

Please set it to log_info, then you will get more output in the messages file (`man sge_conf`). Maybe you get some hints then.
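[Editor's note: the DN state in the ps output above means uninterruptible sleep (D) at reduced priority (N), which fits an NFS hang. A minimal sketch for seeing what such a process is blocked on, assuming a Linux /proc filesystem; set STUCK_PID to the stuck PID (e.g. 69850 above), otherwise it inspects the current shell so the sketch runs anywhere:]

```shell
# Show the state and kernel wait channel of a possibly-stuck process.
# STUCK_PID is a hypothetical variable; default to our own shell's PID.
pid="${STUCK_PID:-$$}"

# Field 3 of /proc/<pid>/stat is the state: D = uninterruptible sleep
# (typical of a task blocked on NFS I/O), S = sleeping, R = running.
# (Naive parse: assumes the command name contains no spaces.)
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state: $state"

# The kernel symbol the task sleeps in; NFS hangs usually show an
# nfs_* or rpc_* function here. Prints 0 when the task is runnable.
wchan=$(cat "/proc/$pid/wchan")
echo "wchan: $wchan"

# With root, /proc/<pid>/stack gives the full kernel stack trace.
```

A task stuck in D cannot be killed; the wchan/stack symbols tell you which kernel path (e.g. an NFS RPC wait) it is parked in.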
> In case it helps, here is the full output:
>
> #global:
> execd_spool_dir /opt/sge/default/spool

This can be set to have the spool directories local, to save some network traffic. My favorite place is /var/spool/sge, which is owned by the account owning SGE (for me sgeadmin:gridware).

- Create the local spool directories
- Adjust the setting in the configuration to read /var/spool/sge
- Shut down the execd's
- Start the execd's

This will then create the subdirectory "nodeXY" therein automatically on each exec host.

https://arc.liv.ac.uk/SGE/howto/nfsreduce.html

-- Reuti

> mailer /bin/mail
> xterm /usr/bin/xterm
> load_sensor none
> prolog none
> epilog none
> shell_start_mode posix_compliant
> login_shells sh,bash,ksh,csh,tcsh
> min_uid 100
> min_gid 100
> user_lists none
> xuser_lists none
> projects none
> xprojects none
> enforce_project false
> enforce_user auto
> load_report_time 00:00:40
> max_unheard 00:05:00
> reschedule_unknown 00:00:00
> loglevel log_warning
> administrator_mail thomas.beau...@concordia.ca
> set_token_cmd none
> pag_cmd none
> token_extend_time none
> shepherd_cmd none
> qmaster_params none
> execd_params none
> reporting_params accounting=true reporting=false \
>                  flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs 100
> gid_range 20000-20100
> qlogin_command builtin
> qlogin_daemon builtin
> rlogin_command builtin
> rlogin_daemon builtin
> rsh_command builtin
> rsh_daemon builtin
> max_aj_instances 2000
> max_aj_tasks 75000
> max_u_jobs 0
> max_jobs 0
> max_advance_reservations 0
> auto_user_oticket 0
> auto_user_fshare 0
> auto_user_default_project none
> auto_user_delete_time 86400
> delegated_file_staging false
> reprioritize 0
> jsv_url none
> jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
>
> Thanks!
> Thomas
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Tuesday, December 20, 2016 5:35 PM
> To: Thomas Beaudry
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>
>> Hi Reuti,
>>
>> The jobs stay in the queue forever and don't get processed. There are no messages in the spool directory for these jobs.
>
> The "r" state comes after the "t" state. With NFS problems they are often stuck in the "t" state. What is your setting of:
>
> $ qconf -sconf
> ...
> loglevel log_info
>
> -- Reuti
>
>>
>> Thomas
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Tuesday, December 20, 2016 4:25 PM
>> To: Thomas Beaudry
>> Cc: sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> Hi,
>>
>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>
>>> Hi,
>>>
>>> I've run into a problem recently where users' jobs are stuck in the 'r' state. It doesn't always happen, but it happens often enough to be a persistent error. My guess is that it is I/O related (the jobs are accessing an NFS 4.1 share off of a Windows 2012 file server). I really don't know how to debug this, since I'm not getting any useful info from qstat -j <jobid> and the /var/log/* logs don't seem to give me any clues; or maybe I'm missing something.
>>>
>>> I would be very grateful if anyone has any suggestions as to where I can start to debug this issue. My cluster is unusable because of this error.
>>
>> You mean the job has already exited and is not removed from `qstat`? Usually there is a delay of some minutes for parallel jobs.
>>
>> What does the messages file in the spool directory of the nodes say?
>> Unless it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages
>>
>> -- Reuti
>>
>>> Thanks,
>>> Thomas
>>> _______________________________________________
>>> SGE-discuss mailing list
>>> SGE-discuss@liv.ac.uk
>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
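[Editor's note: the execd messages file Reuti points at can be checked per job with a short sketch. "nodeXY" is the thread's own placeholder for the exec host's spool subdirectory; the job id here is hypothetical, and SGE_ROOT defaults to /opt/sge to match the execd_spool_dir shown in the configuration above:]

```shell
# Pull everything the execd logged about one job from the node's
# messages file. Substitute the real spool subdir and job id.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
node="nodeXY"     # placeholder: the exec host's spool subdirectory
jobid="12345"     # hypothetical: the stuck job's id

msgfile="$SGE_ROOT/default/spool/$node/messages"
result=$(grep -w "$jobid" "$msgfile" 2>/dev/null \
         || echo "no entries for job $jobid in $msgfile")
echo "$result"
```

If the file has nothing for the job even at loglevel log_info, the shepherd likely never got far enough to log, which again points at the node (or its NFS mounts) rather than the qmaster.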