On Feb 28, 2013, at 19:50, Reuti wrote:

> On Feb 28, 2013, at 19:21, Ralph Castain wrote:
> 
>> 
>> On Feb 28, 2013, at 9:53 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>> 
>>> On Feb 28, 2013, at 17:54, Ralph Castain wrote:
>>> 
>>>> Hmmm....the problem is that we are mapping procs using the provided slots
>>>> instead of dividing the slots by cpus-per-proc. So we put too many on the
>>>> first node, and the backend daemon aborts the job because it lacks
>>>> sufficient processors for cpus-per-proc=2.
>>> 
>>> Ok, this I would understand. But why does it then work if no maximum
>>> number of slots is given? Will it then just fill the node up to the number
>>> of cores found inside, subtract this correctly each time a new process
>>> is started, and jump to the next machine if necessary?
>> 
>> Not exactly. If no max slots is given, then we assume a value of one. This
>> effectively converts byslot mapping to bynode - i.e., we place one proc on a
>> node, and that meets its #slots, so we place the next proc on the next node.
>> So you wind up balancing across the two nodes.
> 
> Ok, now I understand the behavior - Thx.
> 
> 
>> If you specify slots=64, then we'll try to place all 64 procs on the first
>> node because we are using byslot mapping by default. You could make it work
>> by just adding -bynode to your command line.
>> 
>> 
>>> 
>>> It is of course for now a feasible workaround to get the intended behavior
>>> by supplying just an additional hostfile.
>> 
>> Or use bynode mapping
> 
> You mean a line like:
> 
> mpiexec -cpus-per-proc 2 -bynode -report-bindings ./mpihello
> 
> For me this results in the same error as without "-bynode".
> 
> 
>>> But regarding my recent email I also wonder about the difference between
>>> running on the command line and inside SGE. In the latter case the overall
>>> universe is correct.
>> 
>> If you don't provide a slots value in the hostfile, we assume 1 - and so the
>> universe size is 2, and you are heavily oversubscribed. Inside SGE, we see
>> 128 slots assigned to you, and you are not oversubscribed.
> 
> Yes, but the "fake" hostfile I provide on the command line to `mpiexec` has
> only the plain names inside. Somehow this switches the way the processes are
> distributed to "-bynode" behavior, but not the overall slot count - interesting.
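
To make the mapping behavior described above concrete, here is a sketch of the
two hostfile variants from this thread and what should happen with them under
the default byslot mapping (the placement notes are inferred from Ralph's
explanation, not taken from the attached -report-bindings output):

  # "machines" without slot counts: one slot is assumed per node, so byslot
  # mapping effectively degenerates to bynode - the 64 ranks end up balanced
  # round-robin across node006 and node007.
  node006
  node007

  # "machines" with explicit slot counts: byslot mapping tries to place all
  # 64 ranks on node006 first; with -cpus-per-proc 2 that would need 128
  # cores on a node that has only 64, so the backend daemon aborts the job.
  node006 slots=64
  node007 slots=64

  # invocation used with both variants
  mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
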
Oops: I meant "overall universe size", as the slot count (i.e., the number of
processes) is what I reduce via the -np option.

-- Reuti

> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Given that there are no current plans for a 1.6.5, this may not get fixed.
>>>> 
>>>> On Feb 27, 2013, at 3:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer
>>>>> machines and I want only one process per FP core, I thought using
>>>>> -cpus-per-proc 2 would be the way to go. Initially I had this issue
>>>>> inside GridEngine, but then I tried it outside any queuing system and
>>>>> faced exactly the same behavior.
>>>>> 
>>>>> @) Each machine has 4 CPUs, each with 16 integer cores, hence 64
>>>>> integer cores per machine in total. The Open MPI version used is 1.6.4.
>>>>> 
>>>>> 
>>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64
>>>>> ./mpihello
>>>>> 
>>>>> and a hostfile containing only the two lines listing the machines:
>>>>> 
>>>>> node006
>>>>> node007
>>>>> 
>>>>> This works as I would like it (see working.txt) when initiated on node006.
>>>>> 
>>>>> 
>>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64
>>>>> ./mpihello
>>>>> 
>>>>> But changing the hostfile so that it has slot counts, which might
>>>>> mimic the behavior of a machinefile parsed out of a queuing
>>>>> system:
>>>>> 
>>>>> node006 slots=64
>>>>> node007 slots=64
>>>>> 
>>>>> This fails with:
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor on node:
>>>>> 
>>>>> Node: node006
>>>>> 
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M), or that the node has an unexpectedly different topology.
>>>>> 
>>>>> Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host, and that all nodes
>>>>> have identical topologies.
>>>>> 
>>>>> You job will now abort.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> (see failed.txt)
>>>>> 
>>>>> 
>>>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32
>>>>> ./mpihello
>>>>> 
>>>>> This works and the found universe is 128 as expected (see only32.txt).
>>>>> 
>>>>> 
>>>>> c) Maybe the machinefile is not parsed in the correct way, so I checked:
>>>>> 
>>>>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>>>> 
>>>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>>>> 
>>>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>>>> 
>>>>> So, it got the slot counts in the correct way.
>>>>> 
>>>>> What am I missing?
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> <failed.txt><only32.txt><working.txt>
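
For completeness: the ./mpihello binary used in all of the commands above is
not included in the thread. Presumably it is a trivial MPI hello-world; the
sketch below is a hypothetical stand-in that also queries MPI_UNIVERSE_SIZE,
since the thread compares the reported universe size (2 vs. 128) between the
plain-hostfile and SGE/slots=64 cases:

  /* Hypothetical stand-in for the ./mpihello test program referenced above
   * (the real source is not part of the thread). Prints each rank's host
   * and, on rank 0, the universe size, i.e. the slot count mpiexec derived
   * from the hostfile or from SGE. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, len, flag = 0;
      int *universe = NULL;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(host, &len);
      MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe, &flag);

      if (rank == 0 && flag)
          printf("universe size: %d, world size: %d\n", *universe, size);
      printf("Hello from rank %d of %d on %s\n", rank, size, host);

      MPI_Finalize();
      return 0;
  }

Compiled with the Open MPI wrapper, e.g. "mpicc -o mpihello mpihello.c", it can
be used to reproduce the cases a) through c3) quoted above.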