Hi

I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
it works for my previous rankfile.


> #3493: Handle the case where rankfile provides the allocation
> -----------------------------------+----------------------------
> Reporter:  rhc                     |      Owner:  jsquyres
>     Type:  changeset move request  |     Status:  new
> Priority:  critical                |  Milestone:  Open MPI 1.6.4
>  Version:  trunk                   |   Keywords:
> -----------------------------------+----------------------------
>  Please apply the attached patch that corrects the rmaps function for
>  obtaining the available nodes when rankfile is providing the allocation.


tyr rankfiles 129 more rf_linpc1
# mpiexec -report-bindings -rf rf_linpc1 hostname
rank 0=linpc1 slot=0:0-1,1:0-1

tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
[linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)



Unfortunately I don't get the expected result for the following
rankfile.

tyr rankfiles 114 more rf_bsp 
# mpiexec -report-bindings -rf rf_bsp hostname
rank 0=linpc1 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

I would expect rank 0 to get all four cores of linpc1, rank 1 both
cores of socket 0 on sunpc1, rank 2 core 0 of socket 1 on sunpc1, and
rank 3 core 1 of socket 1 on sunpc1. Ranks 0 and 1 are bound as
expected, but ranks 2 and 3 are wrong: both are bound to all four
cores of sunpc1. Is something wrong with my rankfile or with the
mapping of processes to cores? I have removed the output of
"hostname" and wrapped long lines.
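
To make the expectation concrete, here is a minimal Python sketch that
expands each slot list into the binding mask I would expect to see. It
is not Open MPI code: the helper names (parse_rankfile,
expand_slot_list, expected_mask) are made up for illustration, it only
understands the "socket:core" form used in my rankfile, and it assumes
two sockets with two cores each, as the reported masks suggest.

```python
import re


def parse_rankfile(text):
    """Return {rank: (host, slot_list)} for 'rank N=host slot=...' lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = re.match(r"rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$", line)
        if m is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        entries[int(m.group(1))] = (m.group(2), m.group(3))
    return entries


def expand_slot_list(slot_list):
    """Expand e.g. '0:0-1,1:0' into a set of (socket, core) pairs."""
    cores = set()
    for item in slot_list.split(","):
        socket, core_range = item.split(":")
        first, _, last = core_range.partition("-")
        last = last or first  # a single core such as '1:0'
        for core in range(int(first), int(last) + 1):
            cores.add((int(socket), core))
    return cores


def expected_mask(slot_list, sockets=2, cores_per_socket=2):
    """Render the mask in the style of -report-bindings, e.g. '[B B][. .]'."""
    bound = expand_slot_list(slot_list)
    return "".join(
        "["
        + " ".join("B" if (s, c) in bound else "." for c in range(cores_per_socket))
        + "]"
        for s in range(sockets)
    )
```

With these assumptions, expected_mask("1:0") gives "[. .][B .]" and
expected_mask("1:1") gives "[. .][. B]", which is what I would expect
for ranks 2 and 3.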

tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
[linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:0)
[sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:1)


I get the following output if I add the options that you mentioned
in a previous email.

tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
  -display-allocation -mca ras_base_verbose 5 hostname
[tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
  Querying component [cm]
[tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
  Skipping component [cm]. Query failed to return a module
[tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
  No component selected!
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
  nothing found in module - proceeding to hostfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
  parsing default hostfile
   /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
  nothing found in hostfiles or dash-host - checking for rankfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
  ras:base:node_insert inserting 2 nodes
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
  ras:base:node_insert node linpc1
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
  ras:base:node_insert node sunpc1

======================   ALLOCATED NODES   ======================

 Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
 Data for node: linpc1  Num slots: 1    Max slots: 0
 Data for node: sunpc1  Num slots: 3    Max slots: 0

=================================================================
[linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:0)
[sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:1)


Thank you very much in advance for any suggestions and help.


Kind regards

Siegmar
