Hi,

> On 15.08.2016 at 18:48, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>
> Excuse me, I made a stupid mistake. The extra mpihello processes were
> leftovers from previous runs (SGE processes aborted by the qdel
> command). So in this respect the world is as it should be: the number
> of processes on the nodes now sums to the number of allocated slots.
> I have attached the output of 'ps -e f' on the master node and the
> output of 'qstat -g t -u ulrich'.
>
> This looks correct to me.
>
> The original problem remains: why do jobs allocate cores on a node
> but do nothing there?
> As I wrote before, OpenMP is probably not involved.
> The qmaster/messages file says nothing about hanging/pending jobs.
>
> The problem is that today I could not reproduce nodes that do nothing
> although their cores are allocated. Let me test a bit until I can
> reproduce the problem; then I will send you the output of 'ps -e f'
> and qstat.
Fine.

> Is there anything else which I could test?

Not for now.

-- Reuti

> With kind regards, and thanks a lot for your help so far, ulrich
>
>
> On 08/15/2016 05:37 PM, Reuti wrote:
>>
>>> On 15.08.2016 at 17:03, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>>>
>>> Hello,
>>>
>>> thank you for the clarification. I must have misunderstood you.
>>> Now I did it. In the example I am sending now, the master node was
>>> exec-node01 (it varied from attempt to attempt). The output is in
>>> the master-node file. The qstat file is the output of
>>>
>>>    qstat -g t -u '*'
>>>
>>> That seems to look normal.
>>>
>>> Then I created a simple C file with an intentionally endless loop:
>>>
>>>    #include <stdio.h>
>>>
>>>    int main(void)
>>>    {
>>>        /* endless loop on purpose: keep every rank busy */
>>>        for (;;) {
>>>            puts("Hello");
>>>        }
>>>        return 0;
>>>    }
>>>
>>> and compiled it:
>>>
>>>    mpicc mpihello.c -o mpihello
>>>
>>> and started qsub:
>>>
>>>    qsub -pe orte 300 -j yes -cwd -S /bin/bash <<< "mpiexec -n 300 mpihello"
>>>
>>> The outputs look the same as for the sleep command above.
>>> But now I counted the jobs:
>>>
>>>    qstat -g t -u '*' | grep -ic slave
>>>
>>> This results in the number 300, which I expected.
>>>
>>> On the execute nodes I did:
>>>
>>>    ps -ef | grep mpihello | grep -v grep | grep -vc mpiexec
>>
>> f w/o -:
>>
>> $ ps -e f
>>
>> will list a nice tree of the processes.
>>
>>> (i.e. I counted the mpihello processes). This is the result:
>>>
>>>    exec-node01: 43
>>>    exec-node02: 82
>>>    exec-node03: 83
>>>    exec-node04: 82
>>>    exec-node05: 82
>>>    exec-node06: 80
>>>    exec-node07: 64
>>>    exec-node08: 64
>>
>> To investigate this it would be good to post the complete slot
>> allocation by `qstat -g t -u <your user>`, plus `ps -e f --cols=500`
>> from the master of the MPI application and from one of the slave
>> nodes. Any "mpihello" in the path?
>>
>> -- Reuti
>>
>>> This gives a sum of 580. When I add up the free slots (from
>>> 'qhost -q') I also get 300, which I expect.
>>> Where do the extra processes on the nodes come from?
>>>
>>> This difference is reproducible.
>>>
>>> The libgomp.so.1.0.0 library is installed, but apart from that
>>> nothing with OpenMP.
>>>
>>> With kind regards, ulrich
>>>
>>> On 08/15/2016 02:30 PM, Ulrich Hiller wrote:
>>>> Hello,
>>>>
>>>>> The other issue seems to be that in fact your job is using only
>>>>> one machine, which means that it is essentially ignoring any
>>>>> granted slot allocation. While the job is running, can you please
>>>>> execute on the master node of the parallel job:
>>>>>
>>>>> $ ps -e f
>>>>>
>>>>> (f w/o -) and post the relevant lines belonging to either
>>>>> sge_execd or just running as kids of the init process, in case
>>>>> they jumped out of the process tree. Maybe a good start would be
>>>>> to execute something like `mpiexec sleep 300` in the jobscript.
>>>>
>>>> I invoked
>>>>
>>>>    qsub -pe orte 160 -j yes -cwd -S /bin/bash <<< "mpiexec -n 160 sleep 300"
>>>>
>>>> The only line ('ps -e f') on the master node was:
>>>>
>>>>    55722 ?  Sl  3:42 /opt/sge/bin/lx-amd64/sge_qmaster
>>>>
>>>> No other SGE lines, no child processes of it, and no other init
>>>> processes leading to SGE, while at the same time the sleep
>>>> processes were running on the nodes (checked with ps on the nodes).
>>>>
>>>> The qstat command gave:
>>>>
>>>>    264 0.60500 STDIN ulrich r 08/15/2016 11:33:02  all.q@exec-node01 MASTER
>>>>                                                    all.q@exec-node01 SLAVE
>>>>                                                    all.q@exec-node01 SLAVE
>>>>                                                    all.q@exec-node01 SLAVE
>>>>    [...]
>>>>    264 0.60500 STDIN ulrich r 08/15/2016 11:33:02  all.q@exec-node03 SLAVE
>>>>                                                    all.q@exec-node03 SLAVE
>>>>                                                    all.q@exec-node03 SLAVE
>>>>    [...]
>>>>    264 0.60500 STDIN ulrich r 08/15/2016 11:33:02  all.q@exec-node05 SLAVE
>>>>                                                    all.q@exec-node05 SLAVE
>>>>    [...]
>>>>
>>>> Because only the master daemon was running on the master node, and
>>>> you were talking about child processes: is this the normal
>>>> behaviour for my cluster, or is something wrong?
>>>>
>>>> Kind regards, ulrich
>>>>
>>>> On 08/12/2016 07:11 PM, Reuti wrote:
>>>>> Hi,
>>>>>
>>>>>> On 12.08.2016 at 18:48, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I see a strange effect, and I am not sure whether it is "only" a
>>>>>> misconfiguration or a bug.
>>>>>>
>>>>>> First: I run Son of Grid Engine 8.1.9-1.el6.x86_64 (I installed
>>>>>> the RHEL rpm on an openSUSE 13.1 machine; this should not matter
>>>>>> here, and it is reported to run on openSUSE).
>>>>>>
>>>>>> mpirun and mpiexec are from openmpi-1.10.3 (no other MPI was
>>>>>> installed, neither on the master nor on the slaves). The
>>>>>> installation was made with:
>>>>>>
>>>>>>    ./configure --prefix=`pwd`/build --disable-dlopen --disable-mca-dso \
>>>>>>        --with-orte --with-sge --with-x --enable-mpi-thread-multiple \
>>>>>>        --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default \
>>>>>>        --enable-orte-static-ports --enable-mpi-cxx --enable-mpi-cxx-seek \
>>>>>>        --enable-oshmem --enable-java --enable-mpi-java
>>>>>>    make
>>>>>>    make install
>>>>>>
>>>>>> I attached the outputs of 'qconf -sq all.q', 'qconf -sconf' and
>>>>>> 'qconf -sp orte' as text files.
>>>>>>
>>>>>> Now my problem: I asked for 20 cores, and if I run qstat -u '*'
>>>>>> it shows this job running on slave07 with 20 cores, but that is
>>>>>> not true! If I run qstat -f -u '*' I see that this job is using
>>>>>> only 3 cores on slave07, and the 17 cores allocated to this job
>>>>>> on other nodes are in fact unused!
>>>>>
>>>>> qstat will list only the master node of the parallel job and the
>>>>> number of overall slots. The granted allocation you can check with:
>>>>>
>>>>> $ qstat -g t -u '*'
>>>>>
>>>>> The other issue seems to be that in fact your job is using only
>>>>> one machine, which means that it is essentially ignoring any
>>>>> granted slot allocation. While the job is running, can you please
>>>>> execute on the master node of the parallel job:
>>>>>
>>>>> $ ps -e f
>>>>>
>>>>> (f w/o -) and post the relevant lines belonging to either
>>>>> sge_execd or just running as kids of the init process, in case
>>>>> they jumped out of the process tree. Maybe a good start would be
>>>>> to execute something like `mpiexec sleep 300` in the jobscript.
>>>>>
>>>>> A next step could be a `mpihello.c` with an almost endless loop
>>>>> inside, compiled with all optimizations switched off, to check
>>>>> whether the slave processes are distributed in the correct way.
>>>>>
>>>>> Note that some applications will check the number of cores they
>>>>> are running on and start via OpenMP (not Open MPI) as many threads
>>>>> as cores are found. Could this be the case for your application
>>>>> too?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Another example: my job took, say, 6 CPUs on slave07 and 14 on
>>>>>> slave06, but nothing was running on 06; therefore resources are
>>>>>> wasted on 06 and an overload on 07 becomes highly possible (the
>>>>>> numbers are made up).
>>>>>> If I ran 1-CPU jobs independently that would not be an issue,
>>>>>> but imagine I now request 60 CPUs on slave07; that would
>>>>>> seriously overload the node in many cases.
>>>>>>
>>>>>> Another example: if I ask for, say, 50 CPUs, the job will start
>>>>>> on one node, e.g. slave01, but reserve only, say, 15 of its 64
>>>>>> CPUs and reserve the rest on many other nodes (obviously wasting
>>>>>> resources doing nothing). This has the bad consequence of
>>>>>> allocating many more CPUs than are available when many jobs are
>>>>>> running. Imagine you have 10 jobs like this one... some nodes
>>>>>> will run maybe 3 of them even if they only have 24 CPUs...
>>>>>>
>>>>>> I hope that I have made clear what the issue is.
>>>>>>
>>>>>> I also see that `qstat` and `qstat -f` are in disagreement. The
>>>>>> latter is correct; I checked the processes running on the nodes.
>>>>>>
>>>>>> Did somebody already encounter such a problem? Does somebody have
>>>>>> an idea where to look or what to test?
>>>>>>
>>>>>> With kind regards, ulrich
>>>>>>
>>>>>> <qhost.txt> <qconf-sconf.txt> <qconf-mp-orte.txt> <qconf-all.q>
>>>
>>> <qstat.txt> <master-node.txt>
>
> <ps.txt> <qstat.txt>
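---

A few sketches, distilled from the commands discussed in this thread.

Since the suggestion was to switch off all optimizations for the test
program, the compile line for the endless-loop mpihello.c above would
become:

    mpicc -O0 mpihello.c -o mpihello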
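With an Open MPI built --with-sge, as in this thread, a tightly
integrated jobscript does not even need an explicit -n: mpiexec takes
the granted slot count from the parallel environment. A minimal sketch
(the PE name orte and the program name mpihello are taken from the
thread; the rest is illustrative):

    #!/bin/bash
    #$ -pe orte 300        # request 300 slots in the 'orte' PE
    #$ -j yes              # merge stdout and stderr
    #$ -cwd                # run in the submission directory
    #$ -S /bin/bash        # interpret the jobscript with bash

    # With SGE support compiled in, Open MPI starts one process per
    # granted slot, so no hard-coded "-n 300" is needed:
    mpiexec ./mpihello

    # Equivalent, using the slot count SGE exports to the job:
    # mpiexec -n $NSLOTS ./mpihello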
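Counting the per-node processes by hand is error-prone; a small loop
over the execution hosts makes the comparison with the granted
allocation repeatable. A sketch, assuming passwordless ssh to the
nodes and the host names used above:

    #!/bin/bash
    # Count the running mpihello processes on each execution host and
    # compare the total with the slots granted by SGE.
    total=0
    for h in exec-node0{1..8}; do
        n=$(ssh "$h" pgrep -xc mpihello)
        echo "$h: $n"
        total=$((total + n))
    done
    echo "total processes: $total"
    echo "granted slots:   $(qstat -g t -u "$USER" | grep -ic slave)"

If the two numbers disagree, the surplus processes are often leftovers
from aborted runs, which is exactly what the qdel leftovers at the top
of this thread turned out to be.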
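And to collect in one go exactly what was asked for, the slot
allocation plus wide process trees from the master of the MPI job and
from one slave, something along these lines (exec-node01 and
exec-node03 stand in for the actual master and slave of the job):

    #!/bin/bash
    # Snapshot the debugging information requested in the thread: the
    # granted slot allocation and full process trees, wide enough that
    # long mpiexec command lines are not truncated.
    qstat -g t -u "$USER"              > qstat-g-t.txt
    ssh exec-node01 ps -e f --cols=500 > ps-master.txt  # master of the MPI job
    ssh exec-node03 ps -e f --cols=500 > ps-slave.txt   # one of the slave nodes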