Another angle to consider is from the MATLAB side: using the '-singleCompThread' option in your matlab and worker scripts ($MATLAB_ROOT/bin/(matlab|worker)):

$ diff matlab.dist matlab
497c497
< arglist=""
---
> arglist="-singleCompThread"
$ diff worker.dist worker
19c19
< exec "${bindir}/matlab" -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
---
> exec "${bindir}/matlab" -dmlworker -logfile /dev/null -singleCompThread -nodisplay -r distcomp_evaluate_filetask $*

MATLAB will still launch an ungodly number of threads, but only use one (or so :)) for computation.

-Hugh

________________________________
From: users-boun...@gridengine.org <users-boun...@gridengine.org> on behalf of William Hay <w....@ucl.ac.uk>
Sent: Friday, June 10, 2016 5:13 AM
To: berg...@merctech.com
Cc: Gridengine Users Group
Subject: Re: [gridengine users] core binding stopped working

On Tue, May 24, 2016 at 07:20:48PM -0400, berg...@merctech.com wrote:
> We're running SoGE 8.1.6 under CentOS6 and had successfully been using
> qstat -j 2005747
> ==============================================================
> job_number:          2005747
> exec_file:           job_scripts/2005747
> submission_time:     Tue May 24 12:19:22 2016
> sge_o_log_name:      foobarmultimodal
> account:             sge
> hard resource_list:  h_stack=256m,centos6=TRUE,h_vmem=10G
> notify:              FALSE
> job_name:            run_func.sh
> priority:            -100
> jobshare:            0
> shell_list:          NONE:/bin/bash
> env_list:
> TERM=NONE,SGE_CELL=default,SGE_ARCH=lx-amd64,SGE_EXECD_PORT=16445,SGE_QMASTER_PORT=16444,SGE_ROOT=/cbica/home/sge/centos6/8.1.6,SGE_VER=8.1.6,OMP_NUM_THREADS=1,ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1,numMaxCompThreads=1,MKL_NUM_THREADS=1,MKL_DYNAMIC=FALSE
> job_args:            -
> script_file:         STDIN
> binding:             set linear:1
> ---------------
>
> The job is aggressively multi-threaded (it is based on Matlab). In the
> past, this kind of job would be bound to the requested number of CPUs
> (defaulting to 1). If there were too few CPUs requested, the job would
> run very slowly as threads waited for each other, but other processes
> on the same node would be fine.
>
> Now the job is using more than 1 CPU (I've seen it spike up to 9 cores)
> and overloading the compute node.
>
> [root@c1-17 log]# ps -fp 4588
> UID        PID  PPID  C STIME TTY          TIME CMD
> 32226     4588  4586  0 12:25 ?        00:00:00 /bin/bash /var/tmp/gridengine/8.1.6/default/spool/c1-17/job_scripts/2005747 -
> [root@c1-17 log]# pstree -p 4588
> 2005747(4588)---run_runGdCMFreg(4851)---runGdCMFreg_fun(4853)-+-{runGdCMFreg_fu}(4859)
>                                                               |-{runGdCMFreg_fu}(4860)
>                                                               :
>                                                               :
>                                                               `-{runGdCMFreg_fu}(5028)
> [root@c1-17 log]# pstree -p 4588 | wc -l
> 77
> [root@c1-17 log]# taskset -c -p 4588
> pid 4588's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4851
> pid 4851's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4853
> pid 4853's current affinity list: 0-39
> --------------------------------
>
> Any suggestions about troubleshooting this in order to re-enable the core
> binding?
>

Check that ENABLE_BINDING is set in the execd_params. The other possibility
is that you've been bitten by this bug:
https://arc.liv.ac.uk/trac/SGE/ticket/1479
which can cause an mpi style job to over allocate cores. If this results in
there being no cores available on the node then any other job that ends up
there won't be bound.

I'm working on a fix but first I have to shave some Yaks.

William
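
For anyone wanting to script the two checks discussed above (is ENABLE_BINDING present in execd_params, and what affinity each thread of a job actually has), here is a minimal sketch. The function names has_enable_binding and thread_affinity are made up for illustration and are not part of Grid Engine. It assumes the setting shows up as ENABLE_BINDING=true or ENABLE_BINDING=1 somewhere in the `qconf -sconf` output, and that the exec host is Linux with /proc mounted.

```shell
# Sketch only; both function names are hypothetical helpers.

# Succeeds if the configuration text on stdin contains ENABLE_BINDING=true
# (or =1), matched case-insensitively. Reading stdin means it also works on
# a saved copy of the configuration.
has_enable_binding() {
    grep -Eiq 'ENABLE_BINDING=(1|true)'
}

# Prints "TID CPU-list" for every thread of the given PID, taken from the
# Cpus_allowed_list field of /proc/PID/task/TID/status (Linux only).
thread_affinity() {
    for status in /proc/"$1"/task/*/status; do
        tid=${status%/status}
        tid=${tid##*/}
        cpus=$(awk '/^Cpus_allowed_list:/ { print $2 }' "$status")
        printf '%s %s\n' "$tid" "$cpus"
    done
}

# Possible usage on a cluster (not run here):
#   qconf -sconf | has_enable_binding && echo "binding enabled"
#   thread_affinity 4853
```

On a node where binding is working, a job submitted with linear:1 would be expected to show a single core for every thread rather than 0-39. Note that `qconf -sconf` with no argument prints the global configuration; a host-local configuration (`qconf -sconf <hostname>`) may also carry execd_params, so it is worth checking both.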