Re: [gridengine users] core binding stopped working

MacMullan IV, Hugh Fri, 10 Jun 2016 05:49:39 -0700

Another angle is to approach this from the MATLAB side: add the
'-singleCompThread' option to your matlab and worker launcher scripts
($MATLAB_ROOT/bin/(matlab|worker)):



$ diff matlab.dist matlab
497c497
<     arglist=""
---
>     arglist="-singleCompThread"
$ diff worker.dist worker
19c19
< exec "${bindir}/matlab" -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
---
> exec "${bindir}/matlab" -dmlworker -logfile /dev/null -singleCompThread -nodisplay -r distcomp_evaluate_filetask $*


MATLAB will still launch an ungodly number of threads, but it will only use
one (or so :)) for computation.

-Hugh

________________________________
From: users-boun...@gridengine.org <users-boun...@gridengine.org> on behalf of 
William Hay <w....@ucl.ac.uk>
Sent: Friday, June 10, 2016 5:13 AM
To: berg...@merctech.com
Cc: Gridengine Users Group
Subject: Re: [gridengine users] core binding stopped working

On Tue, May 24, 2016 at 07:20:48PM -0400, berg...@merctech.com wrote:
> We're running SoGE 8.1.6 under CentOS6 and had successfully been using
> qstat -j 2005747
> ==============================================================
> job_number:                 2005747
> exec_file:                  job_scripts/2005747
> submission_time:            Tue May 24 12:19:22 2016
> sge_o_log_name:             foobarmultimodal
> account:                    sge
> hard resource_list:         h_stack=256m,centos6=TRUE,h_vmem=10G
> notify:                     FALSE
> job_name:                   run_func.sh
> priority:                   -100
> jobshare:                   0
> shell_list:                 NONE:/bin/bash
> env_list:                   TERM=NONE,SGE_CELL=default,SGE_ARCH=lx-amd64,SGE_EXECD_PORT=16445,SGE_QMASTER_PORT=16444,SGE_ROOT=/cbica/home/sge/centos6/8.1.6,SGE_VER=8.1.6,OMP_NUM_THREADS=1,ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1,numMaxCompThreads=1,MKL_NUM_THREADS=1,MKL_DYNAMIC=FALSE
> job_args:                   -
> script_file:                STDIN
> binding:                    set linear:1
> ---------------
>
> The job is aggressively multi-threaded (it is based on Matlab). In the
> past, this kind of job would be bound to the requested number of CPUs
> (defaulting to 1). If there were too few CPUs requested, the job would
> run very slowly as threads waited for each other, but other processes
> on the same node would be fine.
>
> Now the job is using more than 1 CPU (I've seen it spike up to 9 cores)
> and overloading the compute node.
>
> [root@c1-17 log]# ps -fp 4588
> UID        PID  PPID  C STIME TTY          TIME CMD
> 32226     4588  4586  0 12:25 ?        00:00:00 /bin/bash /var/tmp/gridengine/8.1.6/default/spool/c1-17/job_scripts/2005747 -
> [root@c1-17 log]# pstree -p 4588
> 2005747(4588)───run_runGdCMFreg(4851)───runGdCMFreg_fun(4853)─┬─{runGdCMFreg_fu}(4859)
>                                                               ├─{runGdCMFreg_fu}(4860)
>                                                               :
>                                                               :
>                                                               └─{runGdCMFreg_fu}(5028)
> [root@c1-17 log]# pstree -p 4588 | wc -l
> 77
> [root@c1-17 log]# taskset -c -p 4588
> pid 4588's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4851
> pid 4851's current affinity list: 0-39
> [root@c1-17 log]# taskset -c -p 4853
> pid 4853's current affinity list: 0-39
> --------------------------------
>
> Any suggestions about troubleshooting this in order to re-enable the core
> binding?
>

Check that ENABLE_BINDING is set in the execd_params.
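
One way to do that check is sketched below. It assumes the configuration is dumped with `qconf -sconf` (global) or `qconf -sconf <hostname>` (host-local) and piped into a small filter. The helper name `binding_status` is made up for illustration, and the filter only looks for an explicit `ENABLE_BINDING=true` or `ENABLE_BINDING=1` entry; it says nothing about what a given build does by default.

```shell
# binding_status: read `qconf -sconf`-style text on stdin and report whether
# ENABLE_BINDING is explicitly switched on in the execd_params entry.
# Hypothetical helper, not part of Grid Engine.
binding_status() {
    awk '
        { line = line $0 }                      # accumulate the current entry
        /\\$/ { sub(/\\$/, "", line); next }    # backslash-continued: keep reading
        {
            split(line, f, /[ \t]+/)
            if (f[1] == "execd_params") {
                seen = 1
                if (tolower(line) ~ /enable_binding=(true|1)/) on = 1
            }
            line = ""
        }
        END {
            if (on)        print "ENABLE_BINDING is set"
            else if (seen) print "execd_params present, ENABLE_BINDING not set"
            else           print "no execd_params line found"
            exit (on ? 0 : 1)
        }'
}

# Usage on a cluster node (commented out so the snippet can be sourced anywhere):
#   qconf -sconf | binding_status
#   qconf -sconf c1-17 | binding_status
```

The exit status is 0 only when the setting is found, so the function can also be used in a monitoring script.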

The other possibility is that you've been bitten by this bug:
https://arc.liv.ac.uk/trac/SGE/ticket/1479 which can cause an MPI-style
job to over-allocate cores.  If that leaves no cores free on the node,
then any other job that lands there won't be bound.
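
To see whether a particular job ended up unbound, the per-thread affinity can be read straight from /proc, which gives the same information as the taskset calls in the quoted message. This is a rough sketch for Linux only; the function name `allowed_cpus` is made up for illustration.

```shell
# allowed_cpus PID: print each distinct allowed-CPU list found among the
# threads of PID, prefixed by the number of threads that share it.
# Linux only: relies on the Cpus_allowed_list field in /proc/PID/task/TID/status.
allowed_cpus() {
    pid=$1
    for status in /proc/"$pid"/task/*/status; do
        [ -r "$status" ] || continue            # thread exited, or no such PID
        awk '$1 == "Cpus_allowed_list:" { print $2 }' "$status"
    done | sort | uniq -c
}

# Usage (PID taken from the quoted ps output):
#   allowed_cpus 4853
```

A bound job should show a short list (a single core for `linear:1`), whereas a list covering the whole node, like the 0-39 reported above, means no binding was applied.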

I'm working on a fix but first I have to shave some Yaks.

William

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
