On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> wrote:

> On Aug 20, 2014, at 11:16 PM, Ralph Castain wrote:
> 
>> 
>> On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>> 
>>> On Aug 20, 2014, at 7:05 PM, Ralph Castain wrote:
>>> 
>>>>> <snip>
>>>>> Aha, this is quite interesting - how do you do this: by scanning 
>>>>> /proc/<pid>/status or something similar? What happens if you don't find 
>>>>> enough free cores because they are already in use by other applications?
>>>>> 
>>>> 
>>>> Remember, when you use mpirun to launch, we launch our own daemons using 
>>>> the native launcher (e.g., qsub). So the external RM will bind our daemons 
>>>> to the specified cores on each node. We use hwloc to determine what cores 
>>>> our daemons are bound to, and then bind our own child processes to cores 
>>>> within that range.
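As a rough illustration of that detection step, the hwloc command-line tools 
expose the same information the daemon reads through the hwloc library (the 
commands below are only illustrative; the real logic lives inside Open MPI, 
and the PID and cpuset arguments are placeholders):

$ hwloc-bind --get                   # cpuset the calling process is bound to
$ hwloc-bind --get --pid <pid>       # same query for another process, e.g. an orted
$ hwloc-bind core:0-1 -- ./mpihello  # launch a child bound to a subset of that cpuset
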
>>> 
>>> Thanks for reminding me of this. Indeed, I mixed up two different aspects 
>>> in this discussion.
>>> 
>>> a) What will happen in case no binding was done by the RM (hence Open MPI 
>>> could use all cores) and two Open MPI jobs (or something completely 
>>> different besides one Open MPI job) are running on the same node (due to 
>>> the Tight Integration with two different Open MPI directories in /tmp and 
>>> two `orted`, one unique to each job)? Will the second Open MPI job know what 
>>> the first Open MPI job has already used up? Or will both use the same set of 
>>> cores, as "-bind-to none" can't be set in the given `mpiexec` command 
>>> because "-map-by slot:pe=$OMP_NUM_THREADS" was used - which unavoidably 
>>> triggers "-bind-to core" and can't be switched off? I see the same 
>>> cores being used for both jobs.
>> 
>> Yeah, each mpirun executes completely independently of the other, so they 
>> have no idea what the other is doing. So the cores will be overloaded. 
>> A multi-pe mapping requires bind-to core; otherwise there is no way to 
>> implement the request.
> 
> Yep, and so it's not an option in a mixed cluster. Why would it hurt to allow 
> "-bind-to none" here?

Guess I'm confused here - what does pe=N mean if we bind to none? If you are 
running on a mixed cluster and don't want binding, then just say bind-to none 
and leave the pe argument out entirely, as it wouldn't mean anything unless you 
are bound.
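To make the two alternatives concrete, the command lines would look roughly 
like this for the 1.8 series (./mpihello is just a stand-in for the actual 
application):

$ mpiexec -map-by slot:pe=$OMP_NUM_THREADS ./mpihello  # each rank is bound to OMP_NUM_THREADS cores
$ mpiexec -bind-to none ./mpihello                     # no binding at all; pe= is left out, as it would be meaningless here
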

> 
> 
>>> Altering the machinefile instead: the processes are not bound to any core, 
>>> and the OS takes care of a proper assignment.
> 
> Here the ordinary user has to mangle the hostfile, which is not good (but it 
> allows several jobs per node, as the OS shifts the processes around). 
> Could/should this be put into the "gridengine" module in Open MPI, to divide 
> the slot count per node automatically when $OMP_NUM_THREADS is found, or to 
> generate an error if it's not divisible?

Sure, that could be done - but it will only help if OMP_NUM_THREADS is set when 
someone spins off threads. So far as I know, that's only used for OpenMP - so 
we'd get a little help, but it wouldn't be full coverage.
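For reference, a minimal sketch of such a per-node division done by hand in 
the job script, assuming the usual four-column $PE_HOSTFILE layout 
("host slots queue processor-range") and the standard SGE variables $JOB_ID 
and $NSLOTS; the file name hosts.$JOB_ID is only an example:

$ awk -v t="$OMP_NUM_THREADS" '$2 % t { exit 1 } { print $1, "slots=" $2/t }' "$PE_HOSTFILE" > hosts.$JOB_ID
$ mpiexec -bind-to none -hostfile hosts.$JOB_ID -np $((NSLOTS / OMP_NUM_THREADS)) ./mpihello

The awk exit provides the "error if not divisible" behavior; each rank is then 
free to start $OMP_NUM_THREADS threads and the OS places them.
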


> 
> ===
> 
>>>> If the cores we are bound to are the same on each node, then we will do 
>>>> this with no further instruction. However, if the cores are different on 
>>>> the individual nodes, then you need to add --hetero-nodes to your command 
>>>> line (as the nodes appear to be heterogeneous to us).
>>> 
>>> b) Aha, so it's not only about different CPU types, but also about the same 
>>> CPU type with different core allocations between the nodes? It's not in the 
>>> `mpiexec` man page of 1.8.1 though. I'll have a look at it.
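In command-line terms, that amounts to something like the following 
(illustrative; the second form uses the equivalent MCA parameter, which is 
also what shows up in the ps output further below):

$ mpiexec --hetero-nodes -map-by slot:pe=$OMP_NUM_THREADS ./mpihello
$ mpiexec -mca orte_hetero_nodes 1 -map-by slot:pe=$OMP_NUM_THREADS ./mpihello
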
> 
> I tried:
> 
> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
> parallel@node0[1-4] test_openmpi.sh 
> Your job 247109 ("test_openmpi.sh") has been submitted
> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q 
> parallel@node0[1-4] test_openmpi.sh 
> Your job 247110 ("test_openmpi.sh") has been submitted
> 
> 
> Getting on node03:
> 
> 
> 6733 ?        Sl     0:00  \_ sge_shepherd-247109 -bg
> 6734 ?        SNs    0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
> /var/spool/sge/node03/active_jobs/247109.1/1.node03
> 6741 ?        SN     0:00  |       \_ orted -mca orte_hetero_nodes 1 -mca ess 
> env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
> 6742 ?        RNl    0:31  |           \_ ./mpihello
> 6745 ?        Sl     0:00  \_ sge_shepherd-247110 -bg
> 6746 ?        SNs    0:00      \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
> /var/spool/sge/node03/active_jobs/247110.1/1.node03
> 6753 ?        SN     0:00          \_ orted -mca orte_hetero_nodes 1 -mca ess 
> env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
> 6754 ?        RNl    0:25              \_ ./mpihello
> 
> 
> reuti@node03:~> cat /proc/6741/status | grep Cpus_
> Cpus_allowed: 
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
> Cpus_allowed_list:    0-1
> reuti@node03:~> cat /proc/6753/status | grep Cpus_
> Cpus_allowed: 
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000030
> Cpus_allowed_list:    4-5
> 
> Hence, each "orted" got two cores assigned. But:
> 
> 
> reuti@node03:~> cat /proc/6742/status | grep Cpus_
> Cpus_allowed: 
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
> Cpus_allowed_list:    0-1
> reuti@node03:~> cat /proc/6754/status | grep Cpus_
> Cpus_allowed: 
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
> Cpus_allowed_list:    0-1
> 
> What I see here (and in `top` after pressing "1") is that only two cores are 
> used, and Open MPI assigns cores 0-1 to both jobs. Is the information in 
> "status" not the one Open MPI gets from hwloc?
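One way to cross-check what hwloc itself reports for those processes is the 
hwloc command-line tools (only an illustration; the PIDs are the ones from the 
listing above):

$ hwloc-bind --get --pid 6742   # binding of the first job's mpihello as hwloc sees it
$ hwloc-bind --get --pid 6754   # binding of the second job's mpihello
$ hwloc-ps                      # list bound processes together with their cpusets
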
> 
> -- Reuti
> 
> 
>> The man page is probably a little out-of-date in this area - but yes, 
>> --hetero-nodes is required for *any* difference in the way the nodes appear 
>> to us (cpus, slot assignments, etc.). The 1.9 series may remove that 
>> requirement - still looking at it.
>> 
>>> 
>>> 
>>>> So it is up to the RM to set the constraint - we just live within it.
>>> 
>>> Fine.
>>> 
>>> -- Reuti
