> On 15.08.2016 at 17:03, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
> 
> Hello,
> 
> thank you for the clarification. I must have misunderstood you.
> Now I did it. In the example I am sending now, the master node was exec-node01
> (it varied from attempt to attempt). The output is in the master-node
> file. The qstat file is the output of
> qstat -g t -u '*'
> That seems to look normal.
> 
> Now I created a simple C file with an endless loop.
> #include <stdio.h>
> int main()
> {
>     int x;
>     /* the condition is the assignment x = 10, which is always true,
>        so the loop runs forever, as intended */
>     for (x = 0; x = 10; x = x + 1)
>     {
>         puts("Hello");
>     }
>     return 0;
> }
> 
> and compiled it:
> mpicc mpihello.c -o mpihello
> and started qsub:
> qsub -pe orte 300 -j yes -cwd -S /bin/bash <<< "mpiexec -n 300 mpihello"
> The outputs look the same as for the sleep command above.
> But now I counted the jobs:
> 
> qstat -g t -u '*' | grep -ic slave
> This results in the number '300', which I expected.
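(Side note: the mpihello.c above is plain C and never calls MPI_Init, so the 300 started processes are not real MPI ranks. For the distribution test, a variant that initializes MPI and prints its rank and host makes it easier to match processes to the granted slots. Just a minimal sketch; the file name and compile line below are only suggestions:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char host[256];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));
    printf("rank %d of %d on %s\n", rank, size, host);
    fflush(stdout);

    for (;;)          /* keep the rank busy so it stays visible in ps */
        sleep(1);

    MPI_Finalize();   /* never reached; kill the job with qdel */
    return 0;
}

compiled with e.g. `mpicc -O0 mpihello_mpi.c -o mpihello_mpi` and submitted the same way as above.)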
> 
> On the execute nodes I did:
> ps -ef | grep mpihello | grep -v grep | grep -vc mpiexec

The f goes without a leading "-":

$ ps -e f

will list a nice tree of the processes.


> (I counted the 'mpihello' processes)
> This is the result:
> exec-node01: 43
> exec-node02: 82
> exec-node03: 83
> exec-node04: 82
> exec-node05: 82
> exec-node06: 80
> exec-node07: 64
> exec-node08: 64

To investigate this it would be good to post the complete slot allocation from 
`qstat -g t -u <your user>`, plus `ps -e f --cols=500` from the master node of 
the MPI application and from one of the slave nodes. Any "mpihello" in the path?

-- Reuti


> Which gives a sum of 580.
> When I count the number of free slots together (from 'qhost -q') I also
> get 300, which I expect.
> Where do the extra processes on the nodes come from?
> 
> This difference is reproducible.
> 
> The libgomp.so.1.0.0 library is installed, but apart from that nothing with
> OpenMP.
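(As a quick check whether OpenMP threading kicks in at all, a tiny probe like the one below, compiled with `gcc -fopenmp`, prints how many threads the runtime would start by default on a node. This is only a sketch to rule OpenMP in or out; it is not something your application necessarily does:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* one thread of the parallel region reports the team size,
       i.e. how many threads OpenMP starts by default on this node */
    #pragma omp parallel
    {
        #pragma omp single
        printf("OpenMP default team size: %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}

Running it with and without OMP_NUM_THREADS set shows whether the environment limits the thread count.)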
> 
> With kind regards, ulrich
> 
> On 08/15/2016 02:30 PM, Ulrich Hiller wrote:
>> Hello,
>> 
>>> The other issue seems to be that in fact your job is using only one
>>> machine, which means that it is essentially ignoring any granted slot
>>> allocation. While the job is running, can you please execute on the
>>> master node of the parallel job:
>>> 
>>> $ ps -e f
>>> 
>>> (f w/o -) and post the relevant lines belonging to either sge_execd or
>>> just running as kids of the init process, in case they jumped out of the
>>> process tree. Maybe a good start would be to execute something like
>>> `mpiexec sleep 300` in the jobscript.
>>> 
>> 
>> I invoked
>> qsub -pe orte 160 -j yes -cwd -S /bin/bash <<< "mpiexec -n 160 sleep 300"
>> 
>> the only line ('ps -e f') on the master node was:
>> 55722 ?        Sl     3:42 /opt/sge/bin/lx-amd64/sge_qmaster
>> 
>> No other sge lines, no child processes of it, and no other children of
>> init leading to sge, while at the same time the sleep processes were
>> running on the nodes (checked with the ps command on the nodes).
>> 
>> The qstat command gave:
>>    264 0.60500 STDIN      ulrich       r     08/15/2016 11:33:02
>> all.q@exec-node01                  MASTER
>> all.q@exec-node01                  SLAVE
>> all.q@exec-node01                  SLAVE
>> all.q@exec-node01                  SLAVE
>> [ ... ]
>>    264 0.60500 STDIN      ulrich       r     08/15/2016 11:33:02
>> all.q@exec-node03                  SLAVE
>> all.q@exec-node03                  SLAVE
>> all.q@exec-node03                  SLAVE
>> [ ... ]
>>    264 0.60500 STDIN      ulrich       r     08/15/2016 11:33:02
>> all.q@exec-node05                  SLAVE
>> all.q@exec-node05                  SLAVE
>> [ ... ]
>> 
>> 
>> Because only the master daemon was running on the master node, and you
>> were talking about child processes: was this the normal behaviour for my
>> cluster, or is there something wrong?
>> 
>> Kind regards, ulrich
>> 
>> 
>> 
>> On 08/12/2016 07:11 PM, Reuti wrote:
>>> Hi,
>>> 
>>>> On 12.08.2016 at 18:48, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I have a strange effect, where I am not sure whether it is "only" a
>>>> misconfiguration or a bug.
>>>> 
>>>> First: I run Son of Grid Engine 8.1.9-1.el6.x86_64 (I installed the RHEL
>>>> rpm on an openSUSE 13.1 machine. This should not matter in this case,
>>>> and it is reported to run on openSUSE).
>>>> 
>>>> mpirun and mpiexec are from openmpi-1.10.3 (no other MPI was installed,
>>>> neither on the master nor on the slaves). The installation was made with:
>>>> ./configure --prefix=`pwd`/build --disable-dlopen --disable-mca-dso
>>>> --with-orte --with-sge --with-x --enable-mpi-thread-multiple
>>>> --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default
>>>> --enable-orte-static-ports --enable-mpi-cxx --enable-mpi-cxx-seek
>>>> --enable-oshmem --enable-java --enable-mpi-java
>>>> make
>>>> make install
>>>> 
>>>> I attached the outputs of 'qconf -ap all.q', 'qconf -sconf' and 'qconf
>>>> -sp orte' as text files.
>>>> 
>>>> Now my problem:
>>>> I asked for 20 cores, and if I run qstat -u '*' it shows that this job
>>>> is running on slave07 using 20 cores, but that is not true! If I run qstat
>>>> -f -u '*' I see that this job is only using 3 cores on slave07 and
>>>> there are 17 cores on other nodes allocated to this job which are in fact
>>>> unused!
>>> 
>>> qstat will list only the master node of the parallel job and the overall
>>> number of slots. You can check the granted allocation with:
>>> 
>>> $ qstat -g t -u '*'
>>> 
>>> The other issue seems to be that in fact your job is using only one 
>>> machine, which means that it is essentially ignoring any granted slot 
>>> allocation. While the job is running, can you please execute on the master 
>>> node of the parallel job:
>>> 
>>> $ ps -e f
>>> 
>>> (f w/o -) and post the relevant lines belonging to either sge_execd or just 
>>> running as kids of the init process, in case they jumped out of the process 
>>> tree. Maybe a good start would be to execute something like `mpiexec sleep 
>>> 300` in the jobscript.
>>> 
>>> A next step could be a `mpihello.c` where you put an almost endless loop 
>>> inside and switch off all optimizations during compilation to check 
>>> whether these slave processes are distributed in the correct way.
>>> 
>>> Note that some applications will check the number of cores they are running 
>>> on and, via OpenMP (not Open MPI), start as many threads as cores are found. 
>>> Could this be the case for your application too?
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Another example:
>>>> My job took, say, 6 CPUs on slave07 and 14 on slave06, but nothing was
>>>> running on 06; so resources are wasted on 06 and an overload on
>>>> 07 becomes highly possible (the numbers are made up).
>>>> If I ran many independent 1-CPU jobs that would not be an issue, but
>>>> imagine I now request 60 CPUs on slave07; that would seriously overload
>>>> the node in many cases.
>>>> 
>>>> Another example:
>>>> If I ask for, say, 50 CPUs, the job will start on one node, e.g.
>>>> slave01, but reserve only, say, 15 CPUs out of 64 there, and reserve the
>>>> rest on many other nodes (which obviously sit reserved doing nothing).
>>>> This has the bad consequence of allocating many more CPUs than are available
>>>> when many jobs are running; imagine you have 10 jobs like this one...
>>>> some nodes will run maybe 3 of them even if they only have 24 CPUs...
>>>> 
>>>> I hope that i have made clear what the issue is.
>>>> 
>>>> I also see that `qstat` and `qstat -f` are in disagreement. The
>>>> latter is correct; I checked the processes running on the nodes.
>>>> 
>>>> 
>>>> Has somebody already encountered such a problem? Does somebody have an
>>>> idea where to look or what to test?
>>>> 
>>>> With kind regards, ulrich
>>>> 
>>>> 
>>>> 
>>>> <qhost.txt><qconf-sconf.txt><qconf-mp-orte.txt><qconf-all.q>
>>> 
> <qstat.txt><master-node.txt>

