On 28.02.2013 at 19:50, Reuti wrote:

> On 28.02.2013 at 19:21, Ralph Castain wrote:
> 
>> 
>> On Feb 28, 2013, at 9:53 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>> 
>>> On 28.02.2013 at 17:54, Ralph Castain wrote:
>>> 
>>>> Hmmm....the problem is that we are mapping procs using the provided slots 
>>>> instead of dividing the slots by cpus-per-proc. So we put too many on the 
>>>> first node, and the backend daemon aborts the job because it lacks 
>>>> sufficient processors for cpus-per-proc=2.
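
Just to spell out the arithmetic for my failing case as I understand it (my 
reading of the explanation above, not of the actual code): with "node006 
slots=64" and the default byslot mapping, all 64 ranks land on node006 first; 
with -cpus-per-proc 2 they would need 64 x 2 = 128 cores, but the node only 
has 64, hence the abort. Presumably the mapper would have to divide the 64 
slots by the 2 cpus-per-proc and stop after 32 ranks per node.
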
>>> 
>>> Ok, this I would understand. But why does it then work if no maximum 
>>> number of slots is given? Will it then just fill the node up to the number 
>>> of cores it finds inside, subtract from that correctly each time a new 
>>> process is started, and jump to the next machine if necessary?
>> 
>> Not exactly. If no max slots is given, then we assume a value of one. This 
>> effectively converts byslot mapping to bynode - i.e., we place one proc on a 
>> node, and that meets its #slots, so we place the next proc on the next node. 
>> So you wind up balancing across the two nodes.
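
If I follow that correctly, the numbers for my two-node hostfile without slot 
counts would be: each node is assumed to have 1 slot, the mapper balances the 
ranks across the nodes, and with -np 64 I get 32 ranks per node; 32 ranks x 2 
cores = 64 cores, which exactly matches the 64 integer cores per machine - 
which would explain why that variant happens to work.
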
> 
> Ok, now I understand the behavior - Thx.
> 
> 
>> If you specify slots=64, then we'll try to place all 64 procs on the first 
>> node because we are using byslot mapping by default. You could make it work 
>> by just adding -bynode to your command line.
>> 
>> 
>>> 
>>> For now it is of course a feasible workaround to get the intended 
>>> behavior by just supplying an additional hostfile.
>> 
>> Or use bynode mapping
> 
> You mean a line like:
> 
> mpiexec -cpus-per-proc 2 -bynode -report-bindings ./mpihello
> 
> For me this results in the same error as without "-bynode" at all.
> 
> 
>>> But regarding my recent email I also wonder about the difference between 
>>> running on the command line and inside SGE. In the latter case the overall 
>>> universe is correct.
>> 
>> If you don't provide a slots value in the hostfile, we assume 1 - and so the 
>> universe size is 2, and you are heavily oversubscribed. Inside SGE, we see 
>> 128 slots assigned to you, and you are not oversubscribed.
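
For the numbers, as I understand it: the plain hostfile gives 2 nodes x 1 
assumed slot = universe size 2 while I start 64 processes, so formally 
oversubscribed by a factor of 32; under SGE the 2 x 64 granted slots give a 
universe size of 128 and no oversubscription.
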
> 
> Yes, but the "fake" hostfile I provide on the command line to `mpiexec` has 
> only the plain names inside. Somehow this changes the way the processes are 
> distributed to "-bynode" behavior, but not the overall slot count - interesting.

Oops: I meant "overall universe size", as I reduce the slot count (i.e. the 
number of processes) with the -np option.

-- Reuti


> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Given that there are no current plans for a 1.6.5, this may not get fixed.
>>>> 
>>>> On Feb 27, 2013, at 3:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer 
>>>>> machines and I want only one process per FP core, I thought using 
>>>>> -cpus-per-proc 2 would be the way to go. Initially I had this issue 
>>>>> inside GridEngine but then tried it outside any queuing system and faced 
>>>>> exactly the same behavior.
>>>>> 
>>>>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 
>>>>> integer cores per machine in total. The Open MPI version used is 1.6.4.
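
(My assumption about the hardware, for context: on these Bulldozer CPUs two 
integer cores form a module sharing one FP unit, so the 64 integer cores 
correspond to 32 FP units per machine; binding each rank to 2 cores via 
-cpus-per-proc 2 should therefore give one rank per FP unit. That is the 
intent behind the option choice, based on my understanding of the topology.)
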
>>>>> 
>>>>> 
>>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 
>>>>> ./mpihello
>>>>> 
>>>>> and a hostfile containing only the two lines listing the machines:
>>>>> 
>>>>> node006
>>>>> node007
>>>>> 
>>>>> This works as I would like it (see working.txt) when initiated on node006.
>>>>> 
>>>>> 
>>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 
>>>>> ./mpihello
>>>>> 
>>>>> But changing the hostfile so that it has a slot count, which might 
>>>>> mimic the behavior of a machinefile parsed out of a queuing 
>>>>> system:
>>>>> 
>>>>> node006 slots=64
>>>>> node007 slots=64
>>>>> 
>>>>> This fails with:
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor on node:
>>>>> 
>>>>> Node: node006
>>>>> 
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M), or that the node has an unexpectedly different topology.
>>>>> 
>>>>> Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host, and that all nodes
>>>>> have identical topologies.
>>>>> 
>>>>> You job will now abort.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> (see failed.txt)
>>>>> 
>>>>> 
>>>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 
>>>>> ./mpihello
>>>>> 
>>>>> This works and the found universe is 128 as expected (see only32.txt).
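
(This would be consistent with the explanation above: with byslot mapping and 
slots=64, all 32 ranks presumably go onto node006 first, and 32 x 2 = 64 cores 
fit exactly into its 64 integer cores, so the -np 32 case just happens not to 
overflow the first node - at least that is my interpretation.)
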
>>>>> 
>>>>> 
>>>>> c) Maybe the machinefile used is not parsed in the correct way, so I 
>>>>> checked:
>>>>> 
>>>>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>>>> 
>>>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>>>> 
>>>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>>>> 
>>>>> So, it got the slot counts correctly.
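
(That is, 2 x 64 = 128 slots in total, so -np 128 is the largest count that 
fits and -np 129 is rejected, as expected.)
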
>>>>> 
>>>>> What do I miss?
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> <failed.txt><only32.txt><working.txt>