On 28.02.2013 at 08:58, Reuti wrote:

> On 28.02.2013 at 06:55, Ralph Castain wrote:
> 
>> I don't off-hand see a problem, though I do note that your "working" version
>> incorrectly reports the universe size as 2!
> 
> Yes, it was 2 in the case where it was working, i.e. when only the two hostnames
> were given without any dedicated slot count. What should it be in this case -
> "unknown", "infinity"?
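(For reference: the "Total: ... Universe: ..." lines below come from a trivial test
program. A minimal sketch of what mpihello presumably does - just the size of
MPI_COMM_WORLD plus the MPI_UNIVERSE_SIZE attribute; the actual source isn't
attached, so this is only an assumption about its contents:)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, flag;
    int *usize;                 /* MPI hands back a pointer to the attribute value */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_UNIVERSE_SIZE: the number of slots mpiexec believes it has available */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &usize, &flag);

    if (rank == 0)
        printf("Total: %d Universe: %d\n", size, flag ? *usize : -1);

    MPI_Finalize();
    return 0;
}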
As an add-on:

a) I tried it again on the command line and still get:

Total: 64 Universe: 2

with a hostfile containing:

node006
node007

b) In a job script under SGE, with Open MPI compiled --with-sge, I get the
following after mangling the hostfile:

#!/bin/sh
#$ -pe openmpi* 128
#$ -l exclusive
cut -f 1 -d" " $PE_HOSTFILE > $TMPDIR/machines
mpiexec -cpus-per-proc 2 -report-bindings -hostfile $TMPDIR/machines -np 64 ./mpihello

Here:

Total: 64 Universe: 128

Maybe the allocation found by SGE and the one from the -hostfile argument on the
command line are getting mixed up here. (As an independent check of the actual
bindings, see the sketch after the quoted mail below.)

-- Reuti


> -- Reuti
> 
> 
>> 
>> I'll have to take a look at this and get back to you on it.
>> 
>> On Feb 27, 2013, at 3:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>> 
>>> Hi,
>>> 
>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer
>>> machines and I want only one process per FP core, I thought using
>>> -cpus-per-proc 2 would be the way to go. Initially I had this issue inside
>>> GridEngine, but then I tried it outside any queuing system and see exactly
>>> the same behavior.
>>> 
>>> @) Each machine has 4 CPUs, each with 16 integer cores, hence 64 integer
>>> cores per machine in total. The Open MPI version used is 1.6.4.
>>> 
>>> 
>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>> 
>>> with a hostfile containing only the two lines listing the machines:
>>> 
>>> node006
>>> node007
>>> 
>>> This works as I would like (see working.txt) when initiated on node006.
>>> 
>>> 
>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>> 
>>> but with the hostfile changed to carry a slot count, which mimics the
>>> behavior in case of a machinefile parsed out of a queuing system:
>>> 
>>> node006 slots=64
>>> node007 slots=64
>>> 
>>> This fails with:
>>> 
>>> --------------------------------------------------------------------------
>>> An invalid physical processor ID was returned when attempting to bind
>>> an MPI process to a unique processor on node:
>>> 
>>> Node: node006
>>> 
>>> This usually means that you requested binding to more processors than
>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>> M), or that the node has an unexpectedly different topology.
>>> 
>>> Double check that you have enough unique processors for all the
>>> MPI processes that you are launching on this host, and that all nodes
>>> have identical topologies.
>>> 
>>> Your job will now abort.
>>> --------------------------------------------------------------------------
>>> 
>>> (see failed.txt)
>>> 
>>> 
>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>> 
>>> This works, and the found universe is 128 as expected (see only32.txt).
>>> 
>>> 
>>> c) Maybe the machinefile is not parsed in the correct way, so I checked:
>>> 
>>> c1) mpiexec -hostfile machines -np 64 ./mpihello  => works
>>> 
>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>> 
>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>> 
>>> So it got the slot counts the correct way.
>>> 
>>> What am I missing?
>>> 
>>> -- Reuti
>>> 
>>> <failed.txt><only32.txt><working.txt>
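PS: Regarding the bindings mentioned above - as an independent cross-check of what
-report-bindings claims, each rank can also ask the kernel for its own affinity
mask. A minimal, Linux-only sketch (assuming glibc's sched_getaffinity; this is
not part of the original mpihello):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, c;
    cpu_set_t mask;
    char cores[4096] = "";
    char buf[16];
    char host[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ask the kernel which cores this rank may actually run on */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (c = 0; c < CPU_SETSIZE; c++) {
            if (CPU_ISSET(c, &mask)) {
                snprintf(buf, sizeof(buf), " %d", c);
                strncat(cores, buf, sizeof(cores) - strlen(cores) - 1);
            }
        }
    }

    gethostname(host, sizeof(host));
    printf("rank %d on %s is allowed on cores:%s\n", rank, host, cores);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with the same mpiexec options, the printed masks
should match what -report-bindings shows.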