Hi,

the man page for "mpiexec" shows the following:

         cat myrankfile
         rank 0=aa slot=1:0-2
         rank 1=bb slot=0:0,1
         rank 2=cc slot=1-2
         mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
       So that

       Rank 0 runs on node aa, bound to socket 1, cores 0-2.
       Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
       Rank 2 runs on node cc, bound to cores 1 and 2.

Does it mean that the process with rank 0 should be bound to
core 0, 1, or 2 of socket 1?
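
To make my reading of the syntax concrete, here is a small Python sketch of
how I interpret a rankfile line, i.e. "slot=socket:cores", or just
"slot=cores" when no socket is given. It is only my interpretation of the
man page excerpt above, not Open MPI's actual parser; the names
parse_rankfile_line and expand_cores are hypothetical, and slot forms other
than the ones shown in the excerpt are not handled.

```python
import re

# Hypothetical helper, not part of Open MPI: it only understands the
# "rank N=host slot=[socket:]cores" forms from the man page excerpt.
LINE_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def expand_cores(spec):
    """Expand a core list such as '0-2' or '0,1' into sorted integers."""
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def parse_rankfile_line(line):
    """Return (rank, host, socket, cores), or None for blank/comment lines.

    socket is None when the slot specification names cores only.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized rankfile line: {line!r}")
    rank, host, slot = int(m.group(1)), m.group(2), m.group(3)
    if ":" in slot:
        socket, cores = slot.split(":", 1)
        return rank, host, int(socket), expand_cores(cores)
    return rank, host, None, expand_cores(slot)


if __name__ == "__main__":
    for text in ("rank 0=aa slot=1:0-2",
                 "rank 1=bb slot=0:0,1",
                 "rank 2=cc slot=1-2"):
        print(parse_rankfile_line(text))
```

Under this reading, the first line yields rank 0 on host aa with socket 1
and the candidate cores 0, 1, and 2, which is what my question is about.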

I tried to use a rankfile and ran into a problem. My rankfile contains
the following lines:

rank 0=tyr.informatik.hs-fulda.de slot=0:0
rank 1=tyr.informatik.hs-fulda.de slot=1:0
#rank 2=rs0.informatik.hs-fulda.de slot=0:0


Everything is fine if I use the file with just my local machine
(the first two lines).

tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
  odls:default:fork binding child [[9849,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
  odls:default:fork binding child [[9849,1],1] to slot_list 1:0
I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
tyr small_prog 116 


I can also change the socket number and the processes will be attached
to the correct cores. Unfortunately, it doesn't work if I add another
machine by enabling the third line.


tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
--------------------------------------------------------------------------
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: 0:0
Error: Cross-device link

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
  odls:default:fork binding child [[10212,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
  odls:default:fork binding child [[10212,1],1] to slot_list 1:0
[rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
  odls:default:fork binding child [[10212,1],2] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
  ORTE_ERROR_LOG: A message is attempting to be sent to a process
  whose contact information is unknown in file
  ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
  to [[10212,1],0]: tag 20
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
  A message is attempting to be sent to a process whose contact
  information is unknown in file
  ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
  at line 2501
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it
  encountered an error:

Error name: Error 0
Node: rs0.informatik.hs-fulda.de

when attempting to start process rank 2.
--------------------------------------------------------------------------
tyr small_prog 113 



The other machine has two 8-core processors.

tyr small_prog 121 ssh rs0 psrinfo -v
Status of virtual processor 0 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:14.
  The sparcv9 processor operates at 2400 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 09/03/2012 19:51:15
...
Status of virtual processor 15 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:16.
  The sparcv9 processor operates at 2400 MHz,
        and has a sparcv9 floating point processor.
tyr small_prog 122 



Is it necessary to specify another option on the command line or
is my rankfile faulty? Thank you very much in advance for any
suggestions.


Kind regards

Siegmar

