Hi,

I'm new to rankfiles, so I played a little bit with different options. I thought that the following entry would be similar to an entry in an appfile and that MPI could place the process with rank 0 on any core of any processor.

rank 0=tyr.informatik.hs-fulda.de

Unfortunately it's not allowed and I got an error. Can somebody add the
missing help to the file?

tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
    no-slot-list
from the file:
    help-rmaps_rank_file.txt
But I couldn't find that topic in the file. Sorry!
--------------------------------------------------------------------------

As you can see below I could use a rankfile on my old local machine
(Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I
logged into the machine via ssh and tried the same command once more
as a local user without success. It's more or less the same error as
before when I tried to bind the process to a remote machine.

rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
[rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork binding child [[19734,1],0] to slot_list 0:0
--------------------------------------------------------------------------
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: 0:0
Error: Cross-device link

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: No such file or directory
Node: rs0.informatik.hs-fulda.de

when attempting to start process rank 0.
--------------------------------------------------------------------------
rs0 small_prog 119

The application is available.

rs0 small_prog 119 which rank_size
/home/fd1026/SunOS/sparc/bin/rank_size

Is it a problem in the Open MPI implementation or in my rankfile?
How can I request which sockets and cores per socket are available so
that I can use correct values in my rankfile? In lam-mpi I had a
command "lamnodes" which I could use to get such information. Thank
you very much for any help in advance.

Kind regards

Siegmar


> > Are *all* the machines Sparc? Or just the 3rd one (rs0)?
>
> Yes, both machines are Sparc. I tried first in a homogeneous
> environment.
>
> tyr fd1026 106 psrinfo -v
> Status of virtual processor 0 as of: 09/04/2012 07:32:14
>   on-line since 08/31/2012 15:44:42.
>   The sparcv9 processor operates at 1600 MHz,
>   and has a sparcv9 floating point processor.
> Status of virtual processor 1 as of: 09/04/2012 07:32:14
>   on-line since 08/31/2012 15:44:39.
>   The sparcv9 processor operates at 1600 MHz,
>   and has a sparcv9 floating point processor.
> tyr fd1026 107
>
> My local machine (tyr) is a dual processor machine and the
> other one is equipped with two quad-core processors each
> capable of running two hardware threads.
>
>
> Kind regards
>
> Siegmar
>
>
> > On Sep 3, 2012, at 12:43 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >
> > > Hi,
> > >
> > > the man page for "mpiexec" shows the following:
> > >
> > > cat myrankfile
> > > rank 0=aa slot=1:0-2
> > > rank 1=bb slot=0:0,1
> > > rank 2=cc slot=1-2
> > > mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out So that
> > >
> > > Rank 0 runs on node aa, bound to socket 1, cores 0-2.
> > > Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
> > > Rank 2 runs on node cc, bound to cores 1 and 2.
> > >
> > > Does it mean that the process with rank 0 should be bound to
> > > core 0, 1, or 2 of socket 1?
> > >
> > > I tried to use a rankfile and have a problem. My rankfile contains
> > > the following lines.
> > >
> > > rank 0=tyr.informatik.hs-fulda.de slot=0:0
> > > rank 1=tyr.informatik.hs-fulda.de slot=1:0
> > > #rank 2=rs0.informatik.hs-fulda.de slot=0:0
> > >
> > >
> > > Everything is fine if I use the file with just my local machine
> > > (the first two lines).
> > >
> > > tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
> > > [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> > > odls:default:fork binding child [[9849,1],0] to slot_list 0:0
> > > [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> > > odls:default:fork binding child [[9849,1],1] to slot_list 1:0
> > > I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> > > MPI standard 2.1 is supported.
> > > I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> > > MPI standard 2.1 is supported.
> > > tyr small_prog 116
> > >
> > >
> > > I can also change the socket number and the processes will be attached
> > > to the correct cores. Unfortunately it doesn't work if I add one
> > > other machine (third line).
> > >
> > >
> > > tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
> > > --------------------------------------------------------------------------
> > > We were unable to successfully process/set the requested processor
> > > affinity settings:
> > >
> > > Specified slot list: 0:0
> > > Error: Cross-device link
> > >
> > > This could mean that a non-existent processor was specified, or
> > > that the specification had improper syntax.
> > > --------------------------------------------------------------------------
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > > odls:default:fork binding child [[10212,1],0] to slot_list 0:0
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > > odls:default:fork binding child [[10212,1],1] to slot_list 1:0
> > > [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
> > > odls:default:fork binding child [[10212,1],2] to slot_list 0:0
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > > ORTE_ERROR_LOG: A message is attempting to be sent to a process
> > > whose contact information is unknown in file
> > > ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
> > > to [[10212,1],0]: tag 20
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
> > > A message is attempting to be sent to a process whose contact
> > > information is unknown in file
> > > ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> > > at line 2501
> > > --------------------------------------------------------------------------
> > > mpiexec was unable to start the specified application as it
> > > encountered an error:
> > >
> > > Error name: Error 0
> > > Node: rs0.informatik.hs-fulda.de
> > >
> > > when attempting to start process rank 2.
> > > --------------------------------------------------------------------------
> > > tyr small_prog 113
> > >
> > >
> > >
> > > The other machine has two 8 core processors.
> > >
> > > tyr small_prog 121 ssh rs0 psrinfo -v
> > > Status of virtual processor 0 as of: 09/03/2012 19:51:15
> > > on-line since 07/26/2012 15:03:14.
> > > The sparcv9 processor operates at 2400 MHz,
> > > and has a sparcv9 floating point processor.
> > > Status of virtual processor 1 as of: 09/03/2012 19:51:15
> > > ...
> > > Status of virtual processor 15 as of: 09/03/2012 19:51:15
> > > on-line since 07/26/2012 15:03:16.
> > > The sparcv9 processor operates at 2400 MHz,
> > > and has a sparcv9 floating point processor.
> > > tyr small_prog 122
> > >
> > >
> > >
> > > Is it necessary to specify another option on the command line or
> > > is my rankfile faulty? Thank you very much for any suggestions in
> > > advance.
> > >
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
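
As a side note on the rankfile syntax quoted from the mpiexec man page in this thread, the following is a minimal Python sketch of how such lines could be read. It is not Open MPI's parser; `parse_rankfile_line` and `expand` are hypothetical helper names, and the reading of `slot=socket:cores` versus `slot=cores` is taken only from the man page excerpt quoted in this thread. An entry without `slot=` comes back with no cores, which appears to be the case the "no-slot-list" message refers to.

```python
import re

# Hypothetical helpers, not part of Open MPI. They only mirror the rankfile
# syntax quoted from the mpiexec man page in this thread.
LINE_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)(?:\s+slot\s*=\s*(\S+))?\s*$")


def expand(spec):
    """Expand a list such as '0-2' or '0,1' into a list of integers."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids


def parse_rankfile_line(line):
    """Return a dict for one rankfile line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("unrecognized rankfile line: " + line)
    entry = {
        "rank": int(match.group(1)),
        "host": match.group(2),
        "socket": None,
        "cores": None,
    }
    slot = match.group(3)
    if slot is None:
        return entry  # no slot list given for this rank
    if ":" in slot:
        # Assumed form "socket:cores", e.g. "1:0-2".
        socket, cores = slot.split(":", 1)
        entry["socket"] = int(socket)
        entry["cores"] = expand(cores)
    else:
        # Assumed form "cores" without a socket, e.g. "1-2".
        entry["cores"] = expand(slot)
    return entry


if __name__ == "__main__":
    for text in (
        "rank 0=aa slot=1:0-2",
        "rank 1=bb slot=0:0,1",
        "rank 2=cc slot=1-2",
        "rank 0=tyr.informatik.hs-fulda.de",
    ):
        print(parse_rankfile_line(text))
```

Running a rankfile through a check like this before calling mpiexec only catches syntax slips and missing slot lists; it cannot tell whether the named socket and core actually exist on the target node.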