On Oct 3, 2012, at 6:19 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> I recognized another problem with process bindings. The command
> works, if I use "-host" and it breaks, if I use "-hostfile" with
> the same machines.
>
> tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
>   -cpus-per-proc 2 -bind-to-core hostname
> sunpc1
> [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> sunpc0
> [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> sunpc0
> [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> sunpc1
>

Yes, this works because you told us there is only ONE slot on each host. As a result, we split the 4 processes across the two hosts (both of which are now oversubscribed), resulting in TWO processes running on each host. Since there are 4 cores on each host, and you asked for 2 cores/process, we can make this work.

> tyr fd1026 179 cat host_sunpc0_1
> sunpc0 slots=4
> sunpc1 slots=4
>
>
> tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
>   -cpus-per-proc 2 -bind-to-core hostname

And this will of course not work. In your hostfile, you told us there are FOUR slots on each host. Since the default is to map by slot, we correctly mapped all four processes to the first node. We then tried to bind 2 cores to each process, which requires 8 cores on that node - more than the 4 it has.

> --------------------------------------------------------------------------
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor.
>
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M). Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host.
>
> You job will now abort.
> --------------------------------------------------------------------------
> sunpc0
> [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> sunpc0
> [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered
> an error
> on node sunpc0. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> Perhaps this error is related to the other errors. Thank you very
> much for any help in advance.
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
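
To make the arithmetic in the two cases concrete, here is a rough Python sketch of by-slot mapping followed by a per-node core count. It is only a simplified model for illustration, not Open MPI's actual mapper: the function names `map_by_slot` and `cores_needed` are made up, and the wrap-around behaviour when there are more processes than slots is an assumption chosen to match the rank placement reported in the "-host" run. The figure of 4 cores per host comes from the discussion above.

```python
def map_by_slot(hosts, nprocs):
    """Assign ranks to hosts by filling each host's slots in order.

    hosts is a list of (name, slots, cores) tuples. If nprocs exceeds the
    total slot count (oversubscription), this model simply wraps around
    and makes another pass over the hosts.
    """
    if sum(slots for _name, slots, _cores in hosts) <= 0:
        raise ValueError("at least one slot is required")
    placement = {name: [] for name, _slots, _cores in hosts}
    rank = 0
    while rank < nprocs:
        for name, slots, _cores in hosts:
            for _ in range(slots):
                if rank == nprocs:
                    break
                placement[name].append(rank)
                rank += 1
    return placement


def cores_needed(placement, cpus_per_proc):
    """Number of cores each host must provide to bind every local rank."""
    return {name: len(ranks) * cpus_per_proc for name, ranks in placement.items()}


if __name__ == "__main__":
    cases = {
        "-host sunpc0,sunpc1": [("sunpc0", 1, 4), ("sunpc1", 1, 4)],
        "-hostfile host_sunpc0_1": [("sunpc0", 4, 4), ("sunpc1", 4, 4)],
    }
    for label, hosts in cases.items():
        placement = map_by_slot(hosts, nprocs=4)
        needed = cores_needed(placement, cpus_per_proc=2)
        for name, _slots, cores in hosts:
            verdict = "ok" if needed[name] <= cores else "too many"
            print(label, name, "ranks", placement[name],
                  "needs", needed[name], "of", cores, "cores:", verdict)
```

In this model the "-host" case puts two ranks on each node (4 cores needed, 4 available), while the hostfile case puts all four ranks on sunpc0 (8 cores needed, 4 available), which is the situation the error message complains about.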