Hi,

it turned out that the problem was caused by the program code itself: there is an interaction between the subprocesses. The solution was to configure the PE with allocation_rule "$pe_slots", so that SGE schedules the job onto a single node only.
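For reference, a PE set up that way could look roughly like the sketch below (the PE name "smp" and the slot count are only illustrative, not the actual values from my cluster); the decisive line is allocation_rule:

    $ qconf -sp smp
    pe_name            smp
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $pe_slots
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

As far as I understand it, a job submitted with e.g. "qsub -pe smp 20 ..." then either gets all 20 slots on a single host or stays pending until such a host is free.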
I am sorry to have bothered you with this "false alarm". At least I have learned some things from you. Thanks a lot for all your help, Reuti.

With kind regards, ulrich

On 08/16/2016 12:16 PM, Reuti wrote:
> Hi,
>
>> On 15.08.2016 at 18:48, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>>
>> Excuse me, I have made a stupid mistake. The extra mpihello processes
>> were leftovers from previous runs (sge processes aborted by the qdel
>> command). Now, in this detail, the world is as it should be. The number
>> of processes on the nodes now sums to the number of allocated slots.
>> I have attached the output of the 'ps -e f' command on the master node
>> and the output of the 'qstat -g t -u ulrich' command.
>>
>> This seems to me to be correct.
>>
>> The original problem remains: why do jobs allocate cores on a node but
>> do nothing?
>> As I wrote before, OpenMP is probably not involved.
>> The qmaster/messages file does not say anything about hanging/pending jobs.
>>
>> The problem is that today I could not reproduce nodes which do nothing
>> although their cores are allocated. Let me test a bit until I can
>> reproduce the problem. Then I will send you the output of 'ps -e f' and
>> qstat.
>
> Fine.
>
>
>> Is there anything else I could test?
>
> Not for now.
>
> -- Reuti
>
>
>> With kind regards, and thanks a lot for your help so far, ulrich
>>
>>
>> On 08/15/2016 05:37 PM, Reuti wrote:
>>>
>>>> On 15.08.2016 at 17:03, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>>>>
>>>> Hello,
>>>>
>>>> thank you for the clarification. I must have misunderstood you.
>>>> Now I did it. The master node in the example I am sending now was
>>>> exec-node01 (it varied from attempt to attempt). The output is in the
>>>> master-node file. The qstat file is the output of
>>>> qstat -g t -u '*'
>>>> That seems to look normal.
>>>>
>>>> Now I created a simple C file with an endless loop:
>>>> #include <stdio.h>
>>>> int main()
>>>> {
>>>>     int x;
>>>>     for(x=0;x=10;x=x+1)
>>>>     {
>>>>         puts("Hello");
>>>>     }
>>>>     return(0);
>>>> }
>>>>
>>>> and compiled it:
>>>> mpicc mpihello.c -o mpihello
>>>> and started qsub:
>>>> qsub -pe orte 300 -j yes -cwd -S /bin/bash <<< "mpiexec -n 300 mpihello"
>>>> The outputs look the same as for the sleep command above.
>>>> But now I counted the jobs:
>>>>
>>>> qstat -g t -u '*' | grep -ic slave
>>>> This results in the number '300', which I expected.
>>>>
>>>> On the execute nodes I did:
>>>> ps -ef | grep mpihello | grep -v grep | grep -vc mpiexec
>>>
>>> f w/o -
>>>
>>> $ ps -e f
>>>
>>> will list a nice tree of the processes.
>>>
>>>
>>>> (I counted the 'mpihello' processes.)
>>>> This is the result:
>>>> exec-node01: 43
>>>> exec-node02: 82
>>>> exec-node03: 83
>>>> exec-node04: 82
>>>> exec-node05: 82
>>>> exec-node06: 80
>>>> exec-node07: 64
>>>> exec-node08: 64
>>>
>>> To investigate this it would be good to post the complete slot
>>> allocation by `qstat -g t -u <your user>`, the master of the MPI
>>> application and one of the slave nodes' `ps -e f --cols=500`. Any
>>> "mpihello" in the path?
>>>
>>> -- Reuti
>>>
>>>
>>>> Which gives a sum of 580.
>>>> When I add up the number of free slots (from 'qhost -q') I also get
>>>> 300, which I expect.
>>>> Where do the extra processes on the nodes come from?
>>>>
>>>> This difference is reproducible.
>>>>
>>>> The libgomp.so.1.0.0 library is installed, but apart from that there
>>>> is nothing OpenMP-related.
>>>>
>>>> With kind regards, ulrich
>>>>
>>>>
>>>> On 08/15/2016 02:30 PM, Ulrich Hiller wrote:
>>>>> Hello,
>>>>>
>>>>>> The other issue seems to be, that in fact your job is using only one
>>>>>> machine, which means that it is essentially ignoring any granted slot
>>>>>> allocation. While the job is running, can you please execute on the
>>>>>> master node of the parallel job:
>>>>>>
>>>>>> $ ps -e f
>>>>>>
>>>>>> (f w/o -) and post the relevant lines belonging to either sge_execd
>>>>>> or just running as kids of the init process, in case they jumped out
>>>>>> of the process tree. Maybe a good start would be to execute something
>>>>>> like `mpiexec sleep 300` in the jobscript.
>>>>>
>>>>> I invoked
>>>>> qsub -pe orte 160 -j yes -cwd -S /bin/bash <<< "mpiexec -n 160 sleep 300"
>>>>>
>>>>> The only line ('ps -e f') on the master node was:
>>>>> 55722 ?  Sl  3:42 /opt/sge/bin/lx-amd64/sge_qmaster
>>>>>
>>>>> No other sge lines, no child processes from it, and no other init
>>>>> processes leading to sge, while at the same time the sleep processes
>>>>> were running on the nodes (checked with the ps command on the nodes).
>>>>>
>>>>> The qstat command gave:
>>>>> 264 0.60500 STDIN ulrich r 08/15/2016 11:33:02 all.q@exec-node01 MASTER
>>>>>                                                all.q@exec-node01 SLAVE
>>>>>                                                all.q@exec-node01 SLAVE
>>>>>                                                all.q@exec-node01 SLAVE
>>>>> [...]
>>>>> 264 0.60500 STDIN ulrich r 08/15/2016 11:33:02 all.q@exec-node03 SLAVE
>>>>>                                                all.q@exec-node03 SLAVE
>>>>>                                                all.q@exec-node03 SLAVE
>>>>> [...]
>>>>> 264 0.60500 STDIN ulrich r 08/15/2016 11:33:02 all.q@exec-node05 SLAVE
>>>>>                                                all.q@exec-node05 SLAVE
>>>>> [...]
>>>>>
>>>>> Because there was only the master daemon running on the master node,
>>>>> and you were talking about child processes: was this normal behaviour
>>>>> of my cluster, or is there something wrong?
>>>>>
>>>>> Kind regards, ulrich
>>>>>
>>>>>
>>>>> On 08/12/2016 07:11 PM, Reuti wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On 12.08.2016 at 18:48, Ulrich Hiller <hil...@mpia-hd.mpg.de> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a strange effect, where I am not sure whether it is "only" a
>>>>>>> misconfiguration or a bug.
>>>>>>>
>>>>>>> First: I run Son of Grid Engine 8.1.9-1.el6.x86_64 (I installed the
>>>>>>> RHEL rpm on an openSUSE 13.1 machine. This should not matter in this
>>>>>>> case, and it is reported to run on openSUSE).
>>>>>>>
>>>>>>> mpirun and mpiexec are from openmpi-1.10.3 (no other MPI was
>>>>>>> installed, neither on the master nor on the slaves). The
>>>>>>> installation was made with:
>>>>>>> ./configure --prefix=`pwd`/build --disable-dlopen --disable-mca-dso --with-orte --with-sge --with-x --enable-mpi-thread-multiple --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --enable-orte-static-ports --enable-mpi-cxx --enable-mpi-cxx-seek --enable-oshmem --enable-java --enable-mpi-java
>>>>>>> make
>>>>>>> make install
>>>>>>>
>>>>>>> I attached the outputs of 'qconf -ap all.q', 'qconf -sconf' and
>>>>>>> 'qconf -sp orte' as text files.
>>>>>>>
>>>>>>> Now my problem:
>>>>>>> I asked for 20 cores, and if I run qstat -u '*' it shows that this
>>>>>>> job is running on slave07 using 20 cores, but that is not true! If I
>>>>>>> run qstat -f -u '*' I see that this job is only using 3 cores on
>>>>>>> slave07, and there are 17 cores on other nodes allocated to this job
>>>>>>> which are in fact unused!
>>>>>>
>>>>>> qstat will list only the master node of the parallel job and the
>>>>>> number of overall slots. The granted allocation you can check with:
>>>>>>
>>>>>> $ qstat -g t -u '*'
>>>>>>
>>>>>> The other issue seems to be, that in fact your job is using only one
>>>>>> machine, which means that it is essentially ignoring any granted slot
>>>>>> allocation. While the job is running, can you please execute on the
>>>>>> master node of the parallel job:
>>>>>>
>>>>>> $ ps -e f
>>>>>>
>>>>>> (f w/o -) and post the relevant lines belonging to either sge_execd
>>>>>> or just running as kids of the init process, in case they jumped out
>>>>>> of the process tree. Maybe a good start would be to execute something
>>>>>> like `mpiexec sleep 300` in the jobscript.
>>>>>>
>>>>>> The next step could be a `mpihello.c` where you put an almost endless
>>>>>> loop inside and switch off all optimizations during compilation, to
>>>>>> check whether these slave processes are distributed in the correct
>>>>>> way.
>>>>>>
>>>>>> Note that some applications will check the number of cores they are
>>>>>> running on and start via OpenMP (not Open MPI) as many threads as
>>>>>> cores are found. Could this be the case for your application too?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> Or another example:
>>>>>>> My job took, say, 6 CPUs on slave07 and 14 on slave06, but nothing
>>>>>>> was running on 06; therefore a waste of resources on 06 and an
>>>>>>> overload on 07 becomes highly possible (the numbers are made up).
>>>>>>> If I ran 1-CPU jobs as many independent jobs that would not be an
>>>>>>> issue, but imagine I now request 60 CPUs on slave07; that would
>>>>>>> seriously overload the node in many cases.
>>>>>>>
>>>>>>> Or another example:
>>>>>>> If I ask for, say, 50 CPUs, the job will start on one node, e.g.
>>>>>>> slave01, but reserve only, say, 15 CPUs out of 64 and reserve the
>>>>>>> rest on many other nodes (obviously wasting space doing nothing).
>>>>>>> This has the bad consequence of allocating many more CPUs than
>>>>>>> available when many jobs are running. Imagine you have 10 jobs like
>>>>>>> this one... some nodes will run maybe 3, even if they only have 24
>>>>>>> CPUs...
>>>>>>>
>>>>>>> I hope that I have made clear what the issue is.
>>>>>>>
>>>>>>> I also see that `qstat` and `qstat -f` are in disagreement. The
>>>>>>> latter is correct; I checked the processes running on the nodes.
>>>>>>>
>>>>>>> Did somebody already encounter such a problem? Does somebody have an
>>>>>>> idea where to look into or what to test?
>>>>>>>
>>>>>>> With kind regards, ulrich
>>>>>>>
>>>>>>> <qhost.txt><qconf-sconf.txt><qconf-mp-orte.txt><qconf-all.q>
>>>>
>>>> <qstat.txt><master-node.txt>
>>
>> <ps.txt><qstat.txt>