Thanks - yes, the problem was in the launch_support.c code. I'll mark this as 
verified and apply the fix to the v1.7.0 release.

Thanks for the help!
Ralph

On Mar 21, 2013, at 9:06 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> I applied the patch to trunk/orte/mca/plm/base/plm_base_launch_support.c.
> 
> I didn't touch the debugging part of plm_base_launch_support.c, nor any of
> trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c, because
> rmaps_base_support_fns.c seems to contain only debugging-related updates.
> 
> Then, it works! Here is the result.
> 
> Regards,
> Tetsuya Mishima
> 
> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation
> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
> [node05.cluster:22522] mca:base:select:(  ras) Querying component
> [loadleveler]
> [node05.cluster:22522] [[58229,0],0] ras:loadleveler: NOT available for
> selection
> [node05.cluster:22522] mca:base:select:(  ras) Skipping component
> [loadleveler]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:(  ras) Querying component
> [simulator]
> [node05.cluster:22522] mca:base:select:(  ras) Skipping component
> [simulator]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:(  ras) Querying component [slurm]
> [node05.cluster:22522] [[58229,0],0] ras:slurm: NOT available for selection
> [node05.cluster:22522] mca:base:select:(  ras) Skipping component [slurm].
> Query failed to return a module
> [node05.cluster:22522] mca:base:select:(  ras) Querying component [tm]
> [node05.cluster:22522] mca:base:select:(  ras) Query of component [tm] set
> priority to 100
> [node05.cluster:22522] mca:base:select:(  ras) Selected component [tm]
> [node05.cluster:22522] mca:rmaps:select: checking available component ppr
> [node05.cluster:22522] mca:rmaps:select: Querying component [ppr]
> [node05.cluster:22522] mca:rmaps:select: checking available component
> rank_file
> [node05.cluster:22522] mca:rmaps:select: Querying component [rank_file]
> [node05.cluster:22522] mca:rmaps:select: checking available component
> resilient
> [node05.cluster:22522] mca:rmaps:select: Querying component [resilient]
> [node05.cluster:22522] mca:rmaps:select: checking available component
> round_robin
> [node05.cluster:22522] mca:rmaps:select: Querying component [round_robin]
> [node05.cluster:22522] mca:rmaps:select: checking available component seq
> [node05.cluster:22522] mca:rmaps:select: Querying component [seq]
> [node05.cluster:22522] [[58229,0],0]: Final mapper priorities
> [node05.cluster:22522]  Mapper: ppr Priority: 90
> [node05.cluster:22522]  Mapper: seq Priority: 60
> [node05.cluster:22522]  Mapper: resilient Priority: 40
> [node05.cluster:22522]  Mapper: round_robin Priority: 10
> [node05.cluster:22522]  Mapper: rank_file Priority: 0
> [node05.cluster:22522] [[58229,0],0] ras:base:allocate
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found --
> added to list
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 2
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 3
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 4
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found --
> added to list
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 2
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 3
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 4
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert inserting 2 nodes
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert updating HNP info
> to 4 slots
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert node node04
> 
> ======================   ALLOCATED NODES   ======================
> 
> Data for node: node05  Num slots: 4    Max slots: 0
> Data for node: node04  Num slots: 4    Max slots: 0
> 
> =================================================================
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
> node04
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
> node05
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
> node04
> [node05.cluster:22522] mca:rmaps: mapping job [58229,1]
> [node05.cluster:22522] mca:rmaps: creating new map for job [58229,1]
> [node05.cluster:22522] mca:rmaps:ppr: job [58229,1] not using ppr mapper
> [node05.cluster:22522] [[58229,0],0] rmaps:seq mapping job [58229,1]
> [node05.cluster:22522] mca:rmaps:seq: job [58229,1] not using seq mapper
> [node05.cluster:22522] mca:rmaps:resilient: cannot perform initial map of
> job [58229,1] - no fault groups
> [node05.cluster:22522] mca:rmaps:rr: mapping job [58229,1]
> [node05.cluster:22522] [[58229,0],0] Starting with 2 nodes in list
> [node05.cluster:22522] [[58229,0],0] Filtering thru apps
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
> node05
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
> node04
> [node05.cluster:22522] [[58229,0],0] Retained 2 nodes in list
> [node05.cluster:22522] AVAILABLE NODES FOR MAPPING:
> [node05.cluster:22522]     node: node05 daemon: 0
> [node05.cluster:22522]     node: node04 daemon: 1
> [node05.cluster:22522] [[58229,0],0] Starting bookmark at node node05
> [node05.cluster:22522] [[58229,0],0] Starting at node node05
> [node05.cluster:22522] mca:rmaps:rr: mapping by slot for job [58229,1]
> slots 8 num_procs 8
> [node05.cluster:22522] mca:rmaps:rr:slot working node node05
> [node05.cluster:22522] mca:rmaps:rr:slot working node node04
> [node05.cluster:22522] mca:rmaps:base: computing vpids by slot for job
> [58229,1]
> [node05.cluster:22522] mca:rmaps:base: assigning rank 0 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 1 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 2 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 3 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 4 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 5 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 6 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 7 to node node04
> [node05.cluster:22522] [[58229,0],0] rmaps:base:compute_usage
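The log above shows the two steps that matter here: ras:tm builds the allocation by reading the PBS-provided node file, adding a node the first time its hostname appears and bumping its slot count on every repeat, and rmaps:rr then assigns ranks "by slot", filling node05's four slots before moving on to node04. Below is a minimal standalone sketch of that behavior; it is not the actual ORTE code, and the node-file contents, the struct, and all names are invented for illustration.

    /* Simplified sketch (not the actual ORTE code): count slots by bumping a
     * node each time its hostname repeats in the PBS node list, then assign
     * ranks by slot in round-robin fashion, as the verbose output above shows. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 16

    struct node { char name[64]; int slots; };

    int main(void)
    {
        /* Hypothetical PBS_NODEFILE contents: one line per slot. */
        const char *nodefile[] = { "node05", "node05", "node05", "node05",
                                   "node04", "node04", "node04", "node04" };
        int nentries = 8, np = 8;
        struct node nodes[MAX_NODES];
        int nnodes = 0;

        /* Discovery: "not found -- added to list" vs. "found -- bumped slots". */
        for (int i = 0; i < nentries; i++) {
            int j;
            for (j = 0; j < nnodes; j++) {
                if (0 == strcmp(nodes[j].name, nodefile[i])) {
                    nodes[j].slots++;                 /* found -- bump slot count */
                    break;
                }
            }
            if (j == nnodes && nnodes < MAX_NODES) {  /* not found -- add to list */
                strncpy(nodes[nnodes].name, nodefile[i], sizeof nodes[nnodes].name - 1);
                nodes[nnodes].name[sizeof nodes[nnodes].name - 1] = '\0';
                nodes[nnodes].slots = 1;
                nnodes++;
            }
        }

        /* Mapping by slot: fill each node's slots before moving to the next. */
        int rank = 0;
        for (int j = 0; j < nnodes && rank < np; j++)
            for (int s = 0; s < nodes[j].slots && rank < np; s++)
                printf("assigning rank %d to node %s\n", rank++, nodes[j].name);

        return 0;
    }

With the two-node, four-slots-each list above, this prints the same placement the log reports: ranks 0-3 on node05 and ranks 4-7 on node04.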
> 
> 
>> Okay, I found it - fix coming in a bit.
>> 
>> Thanks!
>> Ralph
>> 
>> On Mar 21, 2013, at 4:02 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Ralph,
>>> 
>>> Sorry for the late reply. Here is my result.
>>> 
>>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation
>>> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
>>> [node04.cluster:28175] mca:base:select:(  ras) Querying component
>>> [loadleveler]
>>> [node04.cluster:28175] [[29518,0],0] ras:loadleveler: NOT available for
>>> selection
>>> [node04.cluster:28175] mca:base:select:(  ras) Skipping component
>>> [loadleveler]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:(  ras) Querying component
>>> [simulator]
>>> [node04.cluster:28175] mca:base:select:(  ras) Skipping component
>>> [simulator]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:(  ras) Querying component [slurm]
>>> [node04.cluster:28175] [[29518,0],0] ras:slurm: NOT available for selection
>>> [node04.cluster:28175] mca:base:select:(  ras) Skipping component [slurm]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:(  ras) Querying component [tm]
>>> [node04.cluster:28175] mca:base:select:(  ras) Query of component [tm] set priority to 100
>>> [node04.cluster:28175] mca:base:select:(  ras) Selected component [tm]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component ppr
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [ppr]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component rank_file
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [rank_file]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component resilient
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [resilient]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component round_robin
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [round_robin]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component seq
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [seq]
>>> [node04.cluster:28175] [[29518,0],0]: Final mapper priorities
>>> [node04.cluster:28175]  Mapper: ppr Priority: 90
>>> [node04.cluster:28175]  Mapper: seq Priority: 60
>>> [node04.cluster:28175]  Mapper: resilient Priority: 40
>>> [node04.cluster:28175]  Mapper: round_robin Priority: 10
>>> [node04.cluster:28175]  Mapper: rank_file Priority: 0
>>> [node04.cluster:28175] [[29518,0],0] ras:base:allocate
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert inserting 2 nodes
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert updating HNP info to 4 slots
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert node node03
>>> 
>>> ======================   ALLOCATED NODES   ======================
>>> 
>>> Data for node: node04  Num slots: 4    Max slots: 0
>>> Data for node: node03  Num slots: 4    Max slots: 0
>>> 
>>> =================================================================
>>> [node04.cluster:28175] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node03
>>> 
>>> --------------------------------------------------------------------------
>>> A hostfile was provided that contains at least one node not
>>> present in the allocation:
>>> 
>>> hostfile:  pbs_hosts
>>> node:      node04
>>> 
>>> If you are operating in a resource-managed environment, then only
>>> nodes that are in the allocation can be used in the hostfile. You
>>> may find relative node syntax to be a useful alternative to
>>> specifying absolute node names see the orte_hosts man page for
>>> further information.
>>> 
>>> --------------------------------------------------------------------------
>>> 
>>> Regards,
>>> Tetsuya Mishima
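For reference, the "HOSTFILE: CHECKING FILE NODE ... VS LIST NODE ..." messages come from the step that filters the allocation through the hostfile: every node named in the hostfile must match some node in the allocated list, and a miss produces the abort shown above. The sketch below is a simplified standalone illustration of that check, not the actual Open MPI filter code; the node lists are invented to mirror the failing run (where the head node had dropped out of the list) and the working one.

    /* Simplified sketch (not the actual Open MPI filter): each hostfile entry
     * must match a node in the allocation, otherwise the job is aborted with
     * the "not present in the allocation" error shown above. */
    #include <stdio.h>
    #include <string.h>

    /* Returns 0 on success, -1 if a hostfile node is missing from the allocation. */
    static int filter_hostfile(const char **hostfile, int nhosts,
                               const char **alloc, int nalloc)
    {
        for (int i = 0; i < nhosts; i++) {
            int found = 0;
            for (int j = 0; j < nalloc; j++) {
                printf("HOSTFILE: CHECKING FILE NODE %s VS LIST NODE %s\n",
                       hostfile[i], alloc[j]);
                if (0 == strcmp(hostfile[i], alloc[j])) {
                    found = 1;
                    break;
                }
            }
            if (!found) {
                fprintf(stderr, "node %s is not present in the allocation\n",
                        hostfile[i]);
                return -1;
            }
        }
        return 0;
    }

    int main(void)
    {
        const char *hostfile[] = { "node04", "node03" };
        /* Failing run: the head node (node04) had dropped out of the list. */
        const char *alloc_missing_hnp[] = { "node03" };
        /* Working run: both allocated nodes are present. */
        const char *alloc_full[] = { "node04", "node03" };

        filter_hostfile(hostfile, 2, alloc_missing_hnp, 1);  /* reproduces the abort */
        filter_hostfile(hostfile, 2, alloc_full, 2);         /* passes the check */
        return 0;
    }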
>>> 
>>>> Hmmm...okay, let's try one more thing. Can you please add the following to your command line:
>>>> 
>>>> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5
>>>> 
>>>> Appreciate your patience. For some reason, we are losing your head node from the allocation when we start trying to map processes. I'm trying to track down where this is happening so we can figure out why.
>>>> 
>>>> 
>>>> On Mar 20, 2013, at 10:32 PM, tmish...@jcity.maeda.co.jp wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> Hi Ralph,
>>>>> 
>>>>> Here is the result on patched openmpi-1.7rc8.
>>>>> 
>>>>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS
>>>>> --display-allocation /home/mishima/Ducom/testbed/mPre m02-ld
>>>>> 
>>>>> ======================   ALLOCATED NODES   ======================
>>>>> 
>>>>> Data for node: node06  Num slots: 4    Max slots: 0
>>>>> Data for node: node05  Num slots: 4    Max slots: 0
>>>>> 
>>>>> =================================================================
>>>>> [node06.cluster:21149] HOSTFILE: CHECKING FILE NODE node06 VS LIST NODE node05
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> A hostfile was provided that contains at least one node not
>>>>> present in the allocation:
>>>>> 
>>>>> hostfile:  pbs_hosts
>>>>> node:      node06
>>>>> 
>>>>> If you are operating in a resource-managed environment, then only
>>>>> nodes that are in the allocation can be used in the hostfile. You
>>>>> may find relative node syntax to be a useful alternative to
>>>>> specifying absolute node names see the orte_hosts man page for
>>>>> further information.
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> Regards,
>>>>> Tetsuya
>>>>> 

