Hi, the man page for "mpiexec" shows the following:

cat myrankfile
rank 0=aa slot=1:0-2
rank 1=bb slot=0:0,1
rank 2=cc slot=1-2
mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out

So that

Rank 0 runs on node aa, bound to socket 1, cores 0-2.
Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
Rank 2 runs on node cc, bound to cores 1 and 2.

Does it mean that the process with rank 0 should be bound to
core 0, 1, or 2 of socket 1?

I tried to use a rankfile and have a problem. My rankfile contains
the following lines.

rank 0=tyr.informatik.hs-fulda.de slot=0:0
rank 1=tyr.informatik.hs-fulda.de slot=1:0
#rank 2=rs0.informatik.hs-fulda.de slot=0:0

Everything is fine if I use the file with just my local machine
(the first two lines).

tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0] odls:default:fork binding child [[9849,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0] odls:default:fork binding child [[9849,1],1] to slot_list 1:0
I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
tyr small_prog 116

I can also change the socket number and the processes will be attached
to the correct cores. Unfortunately it doesn't work if I add one
other machine (third line).

tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
--------------------------------------------------------------------------
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: 0:0
Error: Cross-device link

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] odls:default:fork binding child [[10212,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] odls:default:fork binding child [[10212,1],1] to slot_list 1:0
[rs0.informatik.hs-fulda.de:12047] [[10212,0],1] odls:default:fork binding child [[10212,1],2] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send to [[10212,1],0]: tag 20
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c at line 2501
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Error 0
Node: rs0.informatik.hs-fulda.de

when attempting to start process rank 2.
--------------------------------------------------------------------------
tyr small_prog 113

The other machine has two 8 core processors.

tyr small_prog 121 ssh rs0 psrinfo -v
Status of virtual processor 0 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:14.
  The sparcv9 processor operates at 2400 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 09/03/2012 19:51:15
...
Status of virtual processor 15 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:16.
  The sparcv9 processor operates at 2400 MHz,
        and has a sparcv9 floating point processor.
tyr small_prog 122

Is it necessary to specify another option on the command line or
is my rankfile faulty? Thank you very much for any suggestions in
advance.

Kind regards

Siegmar