On 21.12.2016 at 18:41, Thomas Beaudry wrote:
> Hi Hanby,
>
> Yes, I've checked before - no need to excuse yourself, any suggestion is
> helpful because I am really stumped on finding a solution. This is what I've
> tried on a machine that has a job in the 'r' state:
>
> perform-admin@perf-hpc04:~$ ps aux | grep "perform-admin"

How is "perform-admin" related to the job? The job should be a child of the
sge_shepherd. You can get a tree of all processes with: $ ps -e f (that's f
without the -). The relevant SGE processes are the sge_execd and the
sge_shepherd it spawns for each started job, plus their children.
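As a sketch for the job quoted below (pid 69850 is taken from that ps output;
/proc/<pid>/stack needs root and is not available on every kernel build):

$ pstree -ap $(pgrep -x sge_execd)    (the job should appear under an sge_shepherd)
$ ps -o pid,stat,wchan:32 -p 69850    (wchan = the kernel function the process sleeps in)
$ sudo cat /proc/69850/stack          (nfs_* frames here would point at an NFS hang)

Note also the "DN" in the STAT column below: D means uninterruptible sleep,
i.e. the process is blocked in the kernel, usually on I/O - which would fit
the NFS theory.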
-- Reuti

> perform+  69850  0.0  0.0  73656 56664 ?  DN  11:45  0:01 mnc2nii
> -short -nii
> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks-short.nii
> /NAS/home/perform-admin/ICBM-new-atlas/ICBM/process/transformbylandmark/t1/tal_t1_00256_V1-10landmarks.nii
>
> perform-admin@perf-hpc04:~$ sudo cat /var/log/auth.log | grep "69850"
> perform-admin@perf-hpc04:~$ sudo cat /var/log/syslog | grep "69850"
> perform-admin@perf-hpc04:~$ sudo cat /var/log/kern.log | grep "69850"
>
> I'm not finding anything helpful.
>
> Thanks so much guys!
> Thomas
> ________________________________________
> From: Hanby, Mike <mha...@uab.edu>
> Sent: Wednesday, December 21, 2016 12:34 PM
> To: Thomas Beaudry; Reuti
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> Please excuse me if you've already checked this, but are you sure that all
> job-related processes have terminated on the compute nodes?
>
> Just a thought.
>
> -----Original Message-----
> From: SGE-discuss <sge-discuss-boun...@liverpool.ac.uk> on behalf of Thomas
> Beaudry <thomas.beau...@concordia.ca>
> Date: Wednesday, December 21, 2016 at 11:58
> To: Reuti <re...@staff.uni-marburg.de>
> Cc: "sge-discuss@liv.ac.uk" <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
> Hi Reuti,
>
> Setting the loglevel to log_info didn't add any additional warnings to my
> spool messages file.
>
> Any other ideas as to what I can do?
>
> Thanks!
> Thomas
> ________________________________________
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: Wednesday, December 21, 2016 5:48 AM
> To: Thomas Beaudry
> Cc: sge-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>
>> On 21.12.2016 at 04:03, Thomas Beaudry <thomas.beau...@concordia.ca> wrote:
>>
>> Hi Reuti,
>>
>> It is: loglevel log_warning
>
> Please set it to log_info; then you will get more output in the messages
> file (`man sge_conf`). Maybe you will get some hints then.
>
>> In case it helps, here is the full output:
>>
>> #global:
>> execd_spool_dir              /opt/sge/default/spool
>
> This can be set to have the spool directories local, to save some network
> traffic. My favorite place is /var/spool/sge, owned by the account that owns
> SGE (for me sgeadmin:gridware).
>
> - Create the local spool directories
> - Adjust the setting in the configuration to read /var/spool/sge
> - Shut down the execd's
> - Start the execd's
>
> This will then create the subdirectory "nodeXY" therein automatically on
> each exec host. A rough command sketch follows below.
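> Something like the following (the ownership follows the example above; the
> init-script name and location vary by installation, so treat it as a sketch):
>
> $ sudo mkdir -p /var/spool/sge                   (on each exec host)
> $ sudo chown sgeadmin:gridware /var/spool/sge
> $ qconf -mconf                                   (as SGE admin: set execd_spool_dir to /var/spool/sge)
> $ sudo /etc/init.d/sgeexecd stop                 (on each exec host)
> $ sudo /etc/init.d/sgeexecd start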
>
> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>
> -- Reuti
>
>
>> mailer                       /bin/mail
>> xterm                        /usr/bin/xterm
>> load_sensor                  none
>> prolog                       none
>> epilog                       none
>> shell_start_mode             posix_compliant
>> login_shells                 sh,bash,ksh,csh,tcsh
>> min_uid                      100
>> min_gid                      100
>> user_lists                   none
>> xuser_lists                  none
>> projects                     none
>> xprojects                    none
>> enforce_project              false
>> enforce_user                 auto
>> load_report_time             00:00:40
>> max_unheard                  00:05:00
>> reschedule_unknown           00:00:00
>> loglevel                     log_warning
>> administrator_mail           thomas.beau...@concordia.ca
>> set_token_cmd                none
>> pag_cmd                      none
>> token_extend_time            none
>> shepherd_cmd                 none
>> qmaster_params               none
>> execd_params                 none
>> reporting_params             accounting=true reporting=false \
>>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
>> finished_jobs                100
>> gid_range                    20000-20100
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> max_aj_instances             2000
>> max_aj_tasks                 75000
>> max_u_jobs                   0
>> max_jobs                     0
>> max_advance_reservations     0
>> auto_user_oticket            0
>> auto_user_fshare             0
>> auto_user_default_project    none
>> auto_user_delete_time        86400
>> delegated_file_staging       false
>> reprioritize                 0
>> jsv_url                      none
>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>
>> Thanks!
>> Thomas
>> ________________________________________
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: Tuesday, December 20, 2016 5:35 PM
>> To: Thomas Beaudry
>> Cc: sge-discuss@liv.ac.uk
>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>
>> On 20.12.2016 at 22:37, Thomas Beaudry wrote:
>>
>>> Hi Reuti,
>>>
>>> The jobs stay in the queue forever - and don't get processed. There are no
>>> messages in the spool directory for these jobs.
>>
>> The "r" state comes after the "t" state. With NFS problems, jobs are often
>> stuck in the "t" state instead. What is your setting of:
>>
>> $ qconf -sconf
>> ...
>> loglevel log_info
>>
>> -- Reuti
>>
>>>
>>> Thomas
>>> ________________________________________
>>> From: Reuti <re...@staff.uni-marburg.de>
>>> Sent: Tuesday, December 20, 2016 4:25 PM
>>> To: Thomas Beaudry
>>> Cc: sge-discuss@liv.ac.uk
>>> Subject: Re: [SGE-discuss] jobs stuck in 'r' state
>>>
>>> Hi,
>>>
>>> On 20.12.2016 at 22:20, Thomas Beaudry wrote:
>>>
>>>> Hi,
>>>>
>>>> I've run into a problem recently where users' jobs are stuck in the 'r'
>>>> state. It doesn't always happen, but it happens often enough to be a
>>>> persistent problem. My guess is that it is I/O related (the jobs are
>>>> accessing an NFS 4.1 share off of a Windows 2012 file server). I really
>>>> don't know how to debug this, since I'm not getting any useful info from
>>>> qstat -j <jobid>, and the /var/log/* logs don't seem to give me any clues
>>>> - or maybe I'm missing something.
>>>>
>>>> I would be very grateful if anyone has any suggestions as to where I can
>>>> start to debug this issue. My cluster is unusable because of this error.
>>>
>>> You mean the job has already exited but is not removed from `qstat`?
>>> Usually there is a delay of some minutes for parallel jobs.
>>>
>>> What does the messages file in the spool directory of the nodes say? Unless
>>> it's local, it's in $SGE_ROOT/default/spool/nodeXY/messages.
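>>> For example (job id 12345 is just a placeholder, and the active_jobs
>>> layout can differ between SGE versions, so take this as a sketch):
>>>
>>> $ grep 12345 $SGE_ROOT/default/spool/nodeXY/messages
>>> $ cat $SGE_ROOT/default/spool/nodeXY/active_jobs/12345.1/trace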
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks,
>>>> Thomas

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss