Thanks - yes, the problem was in the launch_support.c code. I'll mark this as fixed and apply the patch to the v1.7.0 release.
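For anyone skimming the archives later: the symptom in the quoted thread below is mpirun aborting because the user's hostfile was being checked against an allocation list that had lost the head node. A minimal sketch of that kind of filtering check, in Python rather than the actual ORTE C code, with hypothetical names throughout:

```python
def filter_hostfile(hostfile_nodes, allocated_nodes):
    """Keep only hostfile entries that appear in the RM allocation.

    Illustrative only (not the real ORTE API): every node named in the
    hostfile must be present in the resource manager's allocation,
    otherwise the run is aborted with an error like the one below.
    """
    retained = []
    for node in hostfile_nodes:
        if node in allocated_nodes:
            retained.append(node)
        else:
            raise ValueError(
                f"hostfile node {node!r} not present in the allocation"
            )
    return retained

# Healthy case: the allocation contains both hostfile nodes.
print(filter_hostfile(["node05", "node04"], ["node05", "node04"]))

# Failure mode from the thread: the head node (node06) has been dropped
# from the list being checked, so the first hostfile entry aborts the run.
try:
    filter_hostfile(["node06", "node05"], ["node05"])
except ValueError as err:
    print(err)
```

The point of the sketch is just that the error text "contains at least one node not present in the allocation" can fire even when the hostfile is correct, if the list it is compared against is incomplete.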
Thanks for the help!
Ralph

On Mar 21, 2013, at 9:06 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I tried to patch trunk/orte/mca/plm/base/plm_base_launch_support.c.
> I didn't touch the debugging part of plm_base_launch_support.c, or any of
> trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c, because
> rmaps_base_support_fns.c seems to contain only debugging updates.
>
> Then it works! Here is the result.
>
> Regards,
> Tetsuya Mishima
>
> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
> [node05.cluster:22522] mca:base:select:( ras) Querying component [loadleveler]
> [node05.cluster:22522] [[58229,0],0] ras:loadleveler: NOT available for selection
> [node05.cluster:22522] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:( ras) Querying component [simulator]
> [node05.cluster:22522] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:( ras) Querying component [slurm]
> [node05.cluster:22522] [[58229,0],0] ras:slurm: NOT available for selection
> [node05.cluster:22522] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:( ras) Querying component [tm]
> [node05.cluster:22522] mca:base:select:( ras) Query of component [tm] set priority to 100
> [node05.cluster:22522] mca:base:select:( ras) Selected component [tm]
> [node05.cluster:22522] mca:rmaps:select: checking available component ppr
> [node05.cluster:22522] mca:rmaps:select: Querying component [ppr]
> [node05.cluster:22522] mca:rmaps:select: checking available component rank_file
> [node05.cluster:22522] mca:rmaps:select: Querying component [rank_file]
> [node05.cluster:22522] mca:rmaps:select: checking available component resilient
> [node05.cluster:22522] mca:rmaps:select: Querying component [resilient]
> [node05.cluster:22522] mca:rmaps:select: checking available component round_robin
> [node05.cluster:22522] mca:rmaps:select: Querying component [round_robin]
> [node05.cluster:22522] mca:rmaps:select: checking available component seq
> [node05.cluster:22522] mca:rmaps:select: Querying component [seq]
> [node05.cluster:22522] [[58229,0],0]: Final mapper priorities
> [node05.cluster:22522]   Mapper: ppr  Priority: 90
> [node05.cluster:22522]   Mapper: seq  Priority: 60
> [node05.cluster:22522]   Mapper: resilient  Priority: 40
> [node05.cluster:22522]   Mapper: round_robin  Priority: 10
> [node05.cluster:22522]   Mapper: rank_file  Priority: 0
> [node05.cluster:22522] [[58229,0],0] ras:base:allocate
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found -- added to list
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found -- added to list
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert inserting 2 nodes
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert updating HNP info to 4 slots
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert node node04
>
> ======================   ALLOCATED NODES   ======================
>
>  Data for node: node05  Num slots: 4  Max slots: 0
>  Data for node: node04  Num slots: 4  Max slots: 0
>
> =================================================================
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE node04
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE node05
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node04
> [node05.cluster:22522] mca:rmaps: mapping job [58229,1]
> [node05.cluster:22522] mca:rmaps: creating new map for job [58229,1]
> [node05.cluster:22522] mca:rmaps:ppr: job [58229,1] not using ppr mapper
> [node05.cluster:22522] [[58229,0],0] rmaps:seq mapping job [58229,1]
> [node05.cluster:22522] mca:rmaps:seq: job [58229,1] not using seq mapper
> [node05.cluster:22522] mca:rmaps:resilient: cannot perform initial map of job [58229,1] - no fault groups
> [node05.cluster:22522] mca:rmaps:rr: mapping job [58229,1]
> [node05.cluster:22522] [[58229,0],0] Starting with 2 nodes in list
> [node05.cluster:22522] [[58229,0],0] Filtering thru apps
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE node05
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node04
> [node05.cluster:22522] [[58229,0],0] Retained 2 nodes in list
> [node05.cluster:22522] AVAILABLE NODES FOR MAPPING:
> [node05.cluster:22522]     node: node05 daemon: 0
> [node05.cluster:22522]     node: node04 daemon: 1
> [node05.cluster:22522] [[58229,0],0] Starting bookmark at node node05
> [node05.cluster:22522] [[58229,0],0] Starting at node node05
> [node05.cluster:22522] mca:rmaps:rr: mapping by slot for job [58229,1] slots 8 num_procs 8
> [node05.cluster:22522] mca:rmaps:rr:slot working node node05
> [node05.cluster:22522] mca:rmaps:rr:slot working node node04
> [node05.cluster:22522] mca:rmaps:base: computing vpids by slot for job [58229,1]
> [node05.cluster:22522] mca:rmaps:base: assigning rank 0 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 1 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 2 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 3 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 4 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 5 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 6 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 7 to node node04
> [node05.cluster:22522] [[58229,0],0] rmaps:base:compute_usage
>
>> Okay, I found it - fix coming in a bit.
>>
>> Thanks!
>> Ralph
>>
>> On Mar 21, 2013, at 4:02 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph,
>>>
>>> Sorry for the late reply. Here is my result.
>>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component [loadleveler]
>>> [node04.cluster:28175] [[29518,0],0] ras:loadleveler: NOT available for selection
>>> [node04.cluster:28175] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component [simulator]
>>> [node04.cluster:28175] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component [slurm]
>>> [node04.cluster:28175] [[29518,0],0] ras:slurm: NOT available for selection
>>> [node04.cluster:28175] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component [tm]
>>> [node04.cluster:28175] mca:base:select:( ras) Query of component [tm] set priority to 100
>>> [node04.cluster:28175] mca:base:select:( ras) Selected component [tm]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component ppr
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [ppr]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component rank_file
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [rank_file]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component resilient
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [resilient]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component round_robin
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [round_robin]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component seq
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [seq]
>>> [node04.cluster:28175] [[29518,0],0]: Final mapper priorities
>>> [node04.cluster:28175]   Mapper: ppr  Priority: 90
>>> [node04.cluster:28175]   Mapper: seq  Priority: 60
>>> [node04.cluster:28175]   Mapper: resilient  Priority: 40
>>> [node04.cluster:28175]   Mapper: round_robin  Priority: 10
>>> [node04.cluster:28175]   Mapper: rank_file  Priority: 0
>>> [node04.cluster:28175] [[29518,0],0] ras:base:allocate
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert inserting 2 nodes
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert updating HNP info to 4 slots
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert node node03
>>>
>>> ======================   ALLOCATED NODES   ======================
>>>
>>>  Data for node: node04  Num slots: 4  Max slots: 0
>>>  Data for node: node03  Num slots: 4  Max slots: 0
>>>
>>> =================================================================
>>> [node04.cluster:28175] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node03
>>> --------------------------------------------------------------------------
>>> A hostfile was provided that contains at least one node not
>>> present in the allocation:
>>>
>>>   hostfile:  pbs_hosts
>>>   node:      node04
>>>
>>> If you are operating in a resource-managed environment, then only
>>> nodes that are in the allocation can be used in the hostfile. You
>>> may find relative node syntax to be a useful alternative to
>>> specifying absolute node names - see the orte_hosts man page for
>>> further information.
>>> --------------------------------------------------------------------------
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> Hmmm...okay, let's try one more thing. Can you please add the following to your command line:
>>>>
>>>>   -mca ras_base_verbose 5 -mca rmaps_base_verbose 5
>>>>
>>>> Appreciate your patience. For some reason, we are losing your head node from the allocation when we start trying to map processes. I'm trying to track down where this is happening so we can figure out why.
>>>>
>>>> On Mar 20, 2013, at 10:32 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> Here is the result on patched openmpi-1.7rc8.
>>>>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation /home/mishima/Ducom/testbed/mPre m02-ld
>>>>>
>>>>> ======================   ALLOCATED NODES   ======================
>>>>>
>>>>>  Data for node: node06  Num slots: 4  Max slots: 0
>>>>>  Data for node: node05  Num slots: 4  Max slots: 0
>>>>>
>>>>> =================================================================
>>>>> [node06.cluster:21149] HOSTFILE: CHECKING FILE NODE node06 VS LIST NODE node05
>>>>> --------------------------------------------------------------------------
>>>>> A hostfile was provided that contains at least one node not
>>>>> present in the allocation:
>>>>>
>>>>>   hostfile:  pbs_hosts
>>>>>   node:      node06
>>>>>
>>>>> If you are operating in a resource-managed environment, then only
>>>>> nodes that are in the allocation can be used in the hostfile. You
>>>>> may find relative node syntax to be a useful alternative to
>>>>> specifying absolute node names - see the orte_hosts man page for
>>>>> further information.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Regards,
>>>>> Tetsuya
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
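For reference when reading the verbose logs in this thread: the ras:tm discover phase derives slot counts simply by counting repeated hostnames in the PBS node list, and the round_robin mapper then fills each node's slots in order before moving on, which is why ranks 0-3 land on the first node and 4-7 on the second in the successful run. A rough sketch of both steps (illustrative Python only, not the ORTE implementation):

```python
def discover_slots(pbs_nodes):
    """Count slots per node the way the ras:tm:allocate:discover log
    suggests: every repeated hostname in the PBS node list bumps that
    node's slot count by one (dicts preserve insertion order)."""
    slots = {}
    for host in pbs_nodes:
        slots[host] = slots.get(host, 0) + 1
    return slots

def map_by_slot(slots, num_procs):
    """Assign ranks node by node, filling all of a node's slots before
    moving on, mimicking the default map-by-slot behavior shown in the
    mca:rmaps:base 'assigning rank N to node ...' lines."""
    mapping = []
    for host, count in slots.items():
        for _ in range(count):
            if len(mapping) == num_procs:
                return mapping
            mapping.append(host)
    return mapping

# PBS node list implied by the successful log: node05 x4, then node04 x4.
slots = discover_slots(["node05"] * 4 + ["node04"] * 4)
print(slots)                    # {'node05': 4, 'node04': 4}
print(map_by_slot(slots, 8))    # ranks 0-3 on node05, ranks 4-7 on node04
```

This reproduces the "bumped slots to 2/3/4" progression and the final rank placement from the first log in miniature.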