Hi Ralph,
I tried to patch trunk/orte/mca/plm/base/plm_base_launch_support.c. I didn't touch the debugging part of plm_base_launch_support.c, or any of trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c, because rmaps_base_support_fns.c seems to contain only debugging updates. With that, it works! Here is the result.

Regards,
Tetsuya Mishima

mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
[node05.cluster:22522] mca:base:select:( ras) Querying component [loadleveler]
[node05.cluster:22522] [[58229,0],0] ras:loadleveler: NOT available for selection
[node05.cluster:22522] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[node05.cluster:22522] mca:base:select:( ras) Querying component [simulator]
[node05.cluster:22522] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
[node05.cluster:22522] mca:base:select:( ras) Querying component [slurm]
[node05.cluster:22522] [[58229,0],0] ras:slurm: NOT available for selection
[node05.cluster:22522] mca:base:select:( ras) Skipping component [slurm].
Query failed to return a module
[node05.cluster:22522] mca:base:select:( ras) Querying component [tm]
[node05.cluster:22522] mca:base:select:( ras) Query of component [tm] set priority to 100
[node05.cluster:22522] mca:base:select:( ras) Selected component [tm]
[node05.cluster:22522] mca:rmaps:select: checking available component ppr
[node05.cluster:22522] mca:rmaps:select: Querying component [ppr]
[node05.cluster:22522] mca:rmaps:select: checking available component rank_file
[node05.cluster:22522] mca:rmaps:select: Querying component [rank_file]
[node05.cluster:22522] mca:rmaps:select: checking available component resilient
[node05.cluster:22522] mca:rmaps:select: Querying component [resilient]
[node05.cluster:22522] mca:rmaps:select: checking available component round_robin
[node05.cluster:22522] mca:rmaps:select: Querying component [round_robin]
[node05.cluster:22522] mca:rmaps:select: checking available component seq
[node05.cluster:22522] mca:rmaps:select: Querying component [seq]
[node05.cluster:22522] [[58229,0],0]: Final mapper priorities
[node05.cluster:22522]  Mapper: ppr Priority: 90
[node05.cluster:22522]  Mapper: seq Priority: 60
[node05.cluster:22522]  Mapper: resilient Priority: 40
[node05.cluster:22522]  Mapper: round_robin Priority: 10
[node05.cluster:22522]  Mapper: rank_file Priority: 0
[node05.cluster:22522] [[58229,0],0] ras:base:allocate
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found -- added to list
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found -- added to list
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
[node05.cluster:22522] [[58229,0],0] ras:base:node_insert inserting 2 nodes
[node05.cluster:22522] [[58229,0],0] ras:base:node_insert updating HNP info to 4 slots
[node05.cluster:22522] [[58229,0],0] ras:base:node_insert node node04

======================   ALLOCATED NODES   ======================

 Data for node: node05  Num slots: 4    Max slots: 0
 Data for node: node04  Num slots: 4    Max slots: 0

=================================================================
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE node04
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE node05
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node04
[node05.cluster:22522] mca:rmaps: mapping job [58229,1]
[node05.cluster:22522] mca:rmaps: creating new map for job [58229,1]
[node05.cluster:22522] mca:rmaps:ppr: job [58229,1] not using ppr mapper
[node05.cluster:22522] [[58229,0],0] rmaps:seq mapping job [58229,1]
[node05.cluster:22522] mca:rmaps:seq: job [58229,1] not using seq mapper
[node05.cluster:22522] mca:rmaps:resilient: cannot perform initial map of job [58229,1] - no fault groups
[node05.cluster:22522] mca:rmaps:rr: mapping job [58229,1]
[node05.cluster:22522] [[58229,0],0] Starting with 2 nodes in list
[node05.cluster:22522] [[58229,0],0] Filtering thru apps
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE node05
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node04
[node05.cluster:22522] [[58229,0],0] Retained 2 nodes in list
[node05.cluster:22522] AVAILABLE NODES FOR MAPPING:
[node05.cluster:22522]     node: node05 daemon: 0
[node05.cluster:22522]     node: node04 daemon: 1
[node05.cluster:22522] [[58229,0],0] Starting bookmark at node node05
[node05.cluster:22522] [[58229,0],0] Starting at node node05
[node05.cluster:22522] mca:rmaps:rr: mapping by slot for job [58229,1] slots 8 num_procs 8
[node05.cluster:22522] mca:rmaps:rr:slot working node node05
[node05.cluster:22522] mca:rmaps:rr:slot working node node04
[node05.cluster:22522] mca:rmaps:base: computing vpids by slot for job [58229,1]
[node05.cluster:22522] mca:rmaps:base: assigning rank 0 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 1 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 2 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 3 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 4 to node node04
[node05.cluster:22522] mca:rmaps:base: assigning rank 5 to node node04
[node05.cluster:22522] mca:rmaps:base: assigning rank 6 to node node04
[node05.cluster:22522] mca:rmaps:base: assigning rank 7 to node node04
[node05.cluster:22522] [[58229,0],0] rmaps:base:compute_usage

> Okay, I found it - fix coming in a bit.
>
> Thanks!
> Ralph
>
> On Mar 21, 2013, at 4:02 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi Ralph,
> >
> > Sorry for late reply. Here is my result.
> >
> > mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
> > [node04.cluster:28175] mca:base:select:( ras) Querying component [loadleveler]
> > [node04.cluster:28175] [[29518,0],0] ras:loadleveler: NOT available for selection
> > [node04.cluster:28175] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
> > [node04.cluster:28175] mca:base:select:( ras) Querying component [simulator]
> > [node04.cluster:28175] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
> > [node04.cluster:28175] mca:base:select:( ras) Querying component [slurm]
> > [node04.cluster:28175] [[29518,0],0] ras:slurm: NOT available for selection
> > [node04.cluster:28175] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> > [node04.cluster:28175] mca:base:select:( ras) Querying component [tm]
> > [node04.cluster:28175] mca:base:select:( ras) Query of component [tm] set priority to 100
> > [node04.cluster:28175] mca:base:select:( ras) Selected component [tm]
> > [node04.cluster:28175] mca:rmaps:select: checking available component ppr
> > [node04.cluster:28175] mca:rmaps:select: Querying component [ppr]
> > [node04.cluster:28175] mca:rmaps:select: checking available component rank_file
> > [node04.cluster:28175] mca:rmaps:select: Querying component [rank_file]
> > [node04.cluster:28175] mca:rmaps:select: checking available component resilient
> > [node04.cluster:28175] mca:rmaps:select: Querying component [resilient]
> > [node04.cluster:28175] mca:rmaps:select: checking available component round_robin
> > [node04.cluster:28175] mca:rmaps:select: Querying component [round_robin]
> > [node04.cluster:28175] mca:rmaps:select: checking available component seq
> > [node04.cluster:28175] mca:rmaps:select: Querying component [seq]
> >
> > [node04.cluster:28175] [[29518,0],0]: Final mapper priorities
> > [node04.cluster:28175]  Mapper: ppr Priority: 90
> > [node04.cluster:28175]  Mapper: seq Priority: 60
> > [node04.cluster:28175]  Mapper: resilient Priority: 40
> > [node04.cluster:28175]  Mapper: round_robin Priority: 10
> > [node04.cluster:28175]  Mapper: rank_file Priority: 0
> > [node04.cluster:28175] [[29518,0],0] ras:base:allocate
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
> > [node04.cluster:28175] [[29518,0],0] ras:base:node_insert inserting 2 nodes
> >
> > [node04.cluster:28175] [[29518,0],0] ras:base:node_insert updating HNP info to 4 slots
> > [node04.cluster:28175] [[29518,0],0] ras:base:node_insert node node03
> >
> > ======================   ALLOCATED NODES   ======================
> >
> >  Data for node: node04  Num slots: 4    Max slots: 0
> >  Data for node: node03  Num slots: 4    Max slots: 0
> >
> > =================================================================
> > [node04.cluster:28175] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node03
> > --------------------------------------------------------------------------
> > A hostfile was provided that contains at least one node not
> > present in the allocation:
> >
> >   hostfile:  pbs_hosts
> >   node:      node04
> >
> > If you are operating in a resource-managed environment, then only
> > nodes that are in the allocation can be used in the hostfile. You
> > may find relative node syntax to be a useful alternative to
> > specifying absolute node names see the orte_hosts man page for
> > further information.
> > --------------------------------------------------------------------------
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Hmmm...okay, let's try one more thing. Can you please add the following to your command line:
> >>
> >> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5
> >>
> >> Appreciate your patience. For some reason, we are losing your head node from the allocation when we start trying to map processes. I'm trying to track down where this is happening so we can figure out why.
> >>
> >>
> >> On Mar 20, 2013, at 10:32 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>> Hi Ralph,
> >>>
> >>> Here is the result on patched openmpi-1.7rc8.
> >>>
> >>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation /home/mishima/Ducom/testbed/mPre m02-ld
> >>>
> >>> ======================   ALLOCATED NODES   ======================
> >>>
> >>>  Data for node: node06  Num slots: 4    Max slots: 0
> >>>  Data for node: node05  Num slots: 4    Max slots: 0
> >>>
> >>> =================================================================
> >>> [node06.cluster:21149] HOSTFILE: CHECKING FILE NODE node06 VS LIST NODE node05
> >>> --------------------------------------------------------------------------
> >>> A hostfile was provided that contains at least one node not
> >>> present in the allocation:
> >>>
> >>>   hostfile:  pbs_hosts
> >>>   node:      node06
> >>>
> >>> If you are operating in a resource-managed environment, then only
> >>> nodes that are in the allocation can be used in the hostfile. You
> >>> may find relative node syntax to be a useful alternative to
> >>> specifying absolute node names see the orte_hosts man page for
> >>> further information.
> >>> --------------------------------------------------------------------------
> >>>
> >>> Regards,
> >>> Tetsuya
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users