On 21.08.2014 at 15:45, Ralph Castain wrote:

> On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> On 20.08.2014 at 23:16, Ralph Castain wrote:
>>
>>> On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>>> On 20.08.2014 at 19:05, Ralph Castain wrote:
>>>>
>>>>>> <snip>
>>>>>> Aha, this is quite interesting - how do you do this: by scanning /proc/<pid>/status or the like? What happens if you don't find enough free cores because they are already used up by other applications?
>>>>>
>>>>> Remember, when you use mpirun to launch, we launch our own daemons using the native launcher (e.g., qsub). So the external RM will bind our daemons to the specified cores on each node. We use hwloc to determine what cores our daemons are bound to, and then bind our own child processes to cores within that range.
>>>>
>>>> Thanks for reminding me of this. Indeed, I mixed up two different aspects in this discussion.
>>>>
>>>> a) What will happen in case no binding was done by the RM (hence Open MPI could use all cores) and two Open MPI jobs (or something completely different besides one Open MPI job) are running on the same node (due to the Tight Integration with two different Open MPI directories in /tmp and two `orted`, unique for each job)? Will the second Open MPI job know what the first Open MPI job used up already? Or will both use the same set of cores, since "-bind-to none" can't be set in the given `mpiexec` command because "-map-by slot:pe=$OMP_NUM_THREADS" was used - which makes "-bind-to core" mandatory and can't be switched off? I see the same cores being used for both jobs.
>>>
>>> Yeah, each mpirun executes completely independently of the other, so they have no idea what the other is doing. So the cores will be overloaded. Multiple pe's require bind-to core, otherwise there is no way to implement the request.
>>
>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow "-bind-to none" here?
>
> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you are running on a mixed cluster and don't want binding, then just say bind-to none and leave the pe argument out entirely, as it wouldn't mean anything unless you are bound.
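
For reference, the kind of hybrid launch being debated here looks roughly like the sketch below inside the SGE job script. The value 8 for OMP_NUM_THREADS is only illustrative, and --report-bindings is added merely as a convenient way to see which cores each rank ends up on; this is not the exact test_openmpi.sh used in the thread.

    #!/bin/sh
    # test_openmpi.sh (sketch): hybrid MPI+OpenMP start under SGE tight integration
    export OMP_NUM_THREADS=8

    # pe=$OMP_NUM_THREADS reserves that many slots per rank and implies -bind-to core;
    # --report-bindings prints each rank's resulting core assignment to stderr
    mpiexec --report-bindings -map-by slot:pe=$OMP_NUM_THREADS ./mpihello

With two such jobs on one node and no binding done by the RM, each mpirun starts counting cores from the same place, which is the overlap described above.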
It would mean: divide the overall number of slots/cores in the machinefile by N (i.e. $OMP_NUM_THREADS).

- Request made to the queuing system: I need 80 cores in total.
- The machinefile will contain 80 cores.
- Open MPI will divide it by N, i.e. 8 here.
- Open MPI will start only 10 processes, one on each node.
- The application will use 8 threads per started MPI process.

-- Reuti

>>>> Altering the machinefile instead: the processes are not bound to any core, and the OS takes care of a proper assignment.
>>
>> Here the ordinary user has to mangle the hostfile, which is not good (but it allows several jobs per node, as the OS shifts the processes around). Could/should it be put into the "gridengine" module in Open MPI, to divide the slot count per node automatically when $OMP_NUM_THREADS is found, or to generate an error if it's not divisible?
>
> Sure, that could be done - but it will only help if OMP_NUM_THREADS is set when someone spins off threads. So far as I know, that's only used for OpenMP - so we'd get a little help, but it wouldn't be full coverage.
>
>> ===
>>
>>>>> If the cores we are bound to are the same on each node, then we will do this with no further instruction. However, if the cores are different on the individual nodes, then you need to add --hetero-nodes to your command line (as the nodes appear to be heterogeneous to us).
>>>>
>>>> b) Aha, so it's not only about different CPU types, but also about the same CPU type with different allocations between the nodes? It's not in the `mpiexec` man page of 1.8.1 though. I'll have a look at it.
>>
>> I tried:
>>
>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q parallel@node0[1-4] test_openmpi.sh
>> Your job 247109 ("test_openmpi.sh") has been submitted
>> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q parallel@node0[1-4] test_openmpi.sh
>> Your job 247110 ("test_openmpi.sh") has been submitted
>>
>> Getting on node03:
>>
>> 6733 ?  Sl   0:00 \_ sge_shepherd-247109 -bg
>> 6734 ?  SNs  0:00 |  \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node03/active_jobs/247109.1/1.node03
>> 6741 ?  SN   0:00 |     \_ orted -mca orte_hetero_nodes 1 -mca ess env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
>> 6742 ?  RNl  0:31 |        \_ ./mpihello
>> 6745 ?  Sl   0:00 \_ sge_shepherd-247110 -bg
>> 6746 ?  SNs  0:00    \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node03/active_jobs/247110.1/1.node03
>> 6753 ?  SN   0:00       \_ orted -mca orte_hetero_nodes 1 -mca ess env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
>> 6754 ?  RNl  0:25          \_ ./mpihello
>>
>> reuti@node03:~> cat /proc/6741/status | grep Cpus_
>> Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>> Cpus_allowed_list:      0-1
>> reuti@node03:~> cat /proc/6753/status | grep Cpus_
>> Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000030
>> Cpus_allowed_list:      4-5
>>
>> Hence, "orted" got two cores assigned for each of them.
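
A quick note on reading those masks, for anyone following along: each hex digit of Cpus_allowed covers four logical CPUs, so a mask ending in 03 is binary 0011 (CPUs 0-1) and one ending in 30 is binary 0011 0000 (CPUs 4-5), matching the Cpus_allowed_list lines. If the hwloc command-line utilities are installed, the same information can be queried per PID without reading /proc by hand; the PIDs below are simply the two orted PIDs from the listing above (a sketch, assuming the hwloc tools are in the PATH):

    # report each orted's current binding as a cpuset mask, via the same hwloc
    # machinery Open MPI itself consults
    hwloc-bind --get --pid 6741   # job 247109's orted: expect a mask ending in 03, i.e. cores 0-1
    hwloc-bind --get --pid 6753   # job 247110's orted: expect a mask ending in 30, i.e. cores 4-5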
>> But:
>>
>> reuti@node03:~> cat /proc/6742/status | grep Cpus_
>> Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>> Cpus_allowed_list:      0-1
>> reuti@node03:~> cat /proc/6754/status | grep Cpus_
>> Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>> Cpus_allowed_list:      0-1
>>
>> What I see here (and in `top` after pressing "1") is that only two cores are used, and Open MPI assigns 0-1 to both jobs. Is the information in "status" not the same as what Open MPI gets from hwloc?
>>
>> -- Reuti
>>
>>> The man page is probably a little out-of-date in this area - but yes, --hetero-nodes is required for *any* difference in the way the nodes appear to us (cpus, slot assignments, etc.). The 1.9 series may remove that requirement - still looking at it.
>>>
>>>>> So it is up to the RM to set the constraint - we just live within it.
>>>>
>>>> Fine.
>>>>
>>>> -- Reuti
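
For completeness, this is roughly how the experiment above can be verified end to end from inside the job script: first check what SGE bound the job shell to, then launch with the options discussed in this thread. The OMP_NUM_THREADS value of 2 is only a guess matching the two cores per node granted by "-binding linear:2", and --report-bindings is an extra convenience; treat the whole script as a sketch, not the actual test_openmpi.sh.

    #!/bin/sh
    # sketch of a verification run submitted with: qsub -binding linear:2:<offset> -pe smp2 8 ...
    export OMP_NUM_THREADS=2   # assumption: matches the 2 cores per node granted by -binding linear:2

    # what SGE bound this job shell to; it should agree with the orted's Cpus_allowed seen above
    grep Cpus_ /proc/self/status

    # --hetero-nodes: needed whenever the granted cores differ from node to node
    # --report-bindings: log where each rank actually lands, for comparison with /proc
    mpiexec --hetero-nodes --report-bindings -map-by slot:pe=$OMP_NUM_THREADS ./mpihello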