On 28.02.2013 at 17:54, Ralph Castain wrote:

> Hmmm... the problem is that we are mapping procs using the provided slots
> instead of dividing the slots by cpus-per-proc. So we put too many on the
> first node, and the backend daemon aborts the job because it lacks sufficient
> processors for cpus-per-proc=2.
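To make the arithmetic behind that explanation concrete: with "slots=64" on a node and -cpus-per-proc 2, a mapper that divides the slots by cpus-per-proc would place at most 32 processes on node006, whereas one that consumes the raw slot count tries to place 64 there, which needs 128 cores on a 64-core machine. The following is only an illustrative sketch of that calculation, not Open MPI's actual mapper code; all variable names are invented for the example.

/* Illustrative sketch only -- not Open MPI's mapper. Shows why 64 slots
 * on node006 plus -cpus-per-proc 2 cannot fit on a 64-core machine. */
#include <stdio.h>

int main(void)
{
    int slots_per_node = 64;  /* from "node006 slots=64"    */
    int cpus_per_proc  = 2;   /* from -cpus-per-proc 2      */
    int cores_per_node = 64;  /* 4 CPUs x 16 integer cores  */

    /* Expected behavior: divide the slots by cpus-per-proc. */
    int procs_if_divided = slots_per_node / cpus_per_proc;   /* 32 */

    /* Reported behavior: the raw slot count becomes the process count. */
    int procs_if_raw = slots_per_node;                        /* 64 */

    printf("divided: %d procs need %d cores (fits: %s)\n",
           procs_if_divided, procs_if_divided * cpus_per_proc,
           procs_if_divided * cpus_per_proc <= cores_per_node ? "yes" : "no");
    printf("raw:     %d procs need %d cores (fits: %s)\n",
           procs_if_raw, procs_if_raw * cpus_per_proc,
           procs_if_raw * cpus_per_proc <= cores_per_node ? "yes" : "no");
    return 0;
}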
Ok, this I would understand. But why does it then work if no maximum number of slots is given? Does it simply fill the node up to the number of cores it detects, subtract correctly each time a new process is started, and jump to the next machine when necessary? Supplying just an additional hostfile is, for now, a feasible workaround to get the intended behavior. But regarding my recent email, I also wonder about the difference between running on the command line and inside SGE: in the latter case the overall universe is correct.

-- Reuti

> Given that there are no current plans for a 1.6.5, this may not get fixed.
>
> On Feb 27, 2013, at 3:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer
>> machines and I want only one process per FP core, I thought using
>> -cpus-per-proc 2 would be the way to go. Initially I had this issue inside
>> GridEngine, but then tried it outside any queuing system and faced exactly
>> the same behavior.
>>
>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64
>> integer cores per machine in total. The Open MPI version used is 1.6.4.
>>
>>
>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>
>> with a hostfile containing only the two lines listing the machines:
>>
>> node006
>> node007
>>
>> This works as I would like it to (see working.txt) when initiated on node006.
>>
>>
>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>
>> But changing the hostfile so that it has a slot count, which might mimic
>> the behavior of a machinefile parsed out of a queuing system:
>>
>> node006 slots=64
>> node007 slots=64
>>
>> This fails with:
>>
>> --------------------------------------------------------------------------
>> An invalid physical processor ID was returned when attempting to bind
>> an MPI process to a unique processor on node:
>>
>>   Node: node006
>>
>> This usually means that you requested binding to more processors than
>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>> M), or that the node has an unexpectedly different topology.
>>
>> Double check that you have enough unique processors for all the
>> MPI processes that you are launching on this host, and that all nodes
>> have identical topologies.
>>
>> Your job will now abort.
>> --------------------------------------------------------------------------
>>
>> (see failed.txt)
>>
>>
>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>
>> This works, and the found universe is 128 as expected (see only32.txt).
>>
>>
>> c) Maybe the machinefile is not parsed in the correct way, so I checked:
>>
>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>
>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>
>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>
>> So it got the slot counts in the correct way.
>>
>> What am I missing?
>>
>> -- Reuti
>>
>> <failed.txt><only32.txt><working.txt>
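The source of the ./mpihello binary used in the tests above is not included in the thread. For the purpose of reproducing the mapping behavior, any minimal MPI program that reports its rank and host would do; the following is only an assumed stand-in, not the program Reuti actually ran, built with something like "mpicc -o mpihello mpihello.c".

/* Assumed stand-in for the ./mpihello referenced above; the original
 * source is not part of the thread. Prints rank, size, and host name. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}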