FWIW: I verified that this works fine under a slurm allocation of 2 nodes, each with 12 slots. I filled the node without getting an "oversubscribed" error message.
[rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
[bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
[bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
[bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
bend001
bend001
bend001

where

[rhc@bend001 svn-trunk]$ cat hosts
bend001 slots=12

The only way I get the "out of resources" error is if I ask for more processes than I have slots - i.e., I give it the hosts file as shown, but ask for 13 or more processes.

BTW: note one important issue with cpus-per-proc, as shown above. Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one of my procs wound up being split across the two sockets (2 cores on each). That's about the worst situation you can have. So a word of caution: it is up to the user to ensure that the mapping is "good". We just do what you asked us to do.
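As an aside: on this particular layout (two 6-core sockets per node), the split can be avoided by picking a cpus-per-proc value that divides the socket size. An untested sketch on the same hosts file:

    mpirun -n 2 --bind-to core --cpus-per-proc 6 --report-bindings -hostfile hosts hostname

Six cpus per proc fills each socket exactly, so neither rank should straddle a socket; with 4 cpus/proc on 6-core sockets, a third rank has no choice but to either straddle the sockets or leave cores idle.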
On Nov 13, 2013, at 8:30 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Guess I don't see why modifying the allocation is required - we have mapping
> options that should support such things. If you specify the total number of
> procs you want, and cpus-per-proc=4, it should do the same thing I would
> think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 slot nodes, and up
> to 6 on the 64 slot nodes (since you specified np=16). So I guess I don't
> understand the issue.
>
> Regardless, if NPROCS=8 (and you verified that by printing it out, not just
> assuming wc -l got that value), then it shouldn't think it is oversubscribed.
> I'll take a look under a slurm allocation as that is all I can access.
>
>
> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>
>> Our cluster consists of three types of nodes. They have 8, 32 and 64 slots
>> respectively. Since the performance of each core is almost the same, mixed
>> use of these nodes is possible.
>>
>> Furthermore, in this case, for a hybrid application with openmpi+openmp,
>> modification of the hostfile is necessary, as follows:
>>
>> #PBS -l nodes=1:ppn=32+4:ppn=8
>> export OMP_NUM_THREADS=4
>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>>
>> That's why I want to do that.
>>
>> Of course I know that if I give up mixed use, -npernode is better for this
>> purpose.
>>
>> (The script I showed you first is just a simplified one to clarify the
>> problem.)
>>
>> tmishima
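The "modify $PBS_NODEFILE pbs_hosts" step above is not spelled out in the thread; one hypothetical way to do that condensation, keeping every 4th entry so that 64 entries become 16 when OMP_NUM_THREADS=4, would be:

    awk 'NR % 4 == 1' $PBS_NODEFILE > pbs_hosts

This assumes the Torque nodefile lists each node on ppn consecutive lines, which is its usual format.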
>>> Why do it the hard way? I'll look at the FAQ because that definitely isn't
>>> a recommended thing to do - better to use -host to specify the subset, or
>>> just specify the desired mapping using all the various mappers we provide.
>>>
>>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>>> Sorry for the cross-post.
>>>>
>>>> The nodefile is very simple and consists of 8 lines:
>>>>
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>>
>>>> Therefore, NPROCS=8
>>>>
>>>> My aim is to modify the allocation as you pointed out. According to the
>>>> Open MPI FAQ, using a proper subset of the hosts allocated to the
>>>> Torque / PBS Pro job should be allowed.
>>>>
>>>> tmishima
>>>>
>>>>> Please - can you answer my question on script2? What is the value of
>>>>> NPROCS?
>>>>>
>>>>> Why would you want to do it this way? Are you planning to modify the
>>>>> allocation?? That generally is a bad idea as it can confuse the system.
>>>>>
>>>>>
>>>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>>> Since what I really want is to run script2 correctly, please let us
>>>>>> concentrate on script2.
>>>>>>
>>>>>> I'm not an expert on the inside of openmpi. What I can do is just
>>>>>> observation from the outside. I suspect these lines are strange,
>>>>>> especially the last one.
>>>>>>
>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>
>>>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
>>>>>> in rmaps_base_support_fns.c:
>>>>>>
>>>>>>     } else if (node->slots <= node->slots_inuse &&
>>>>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE &
>>>>>>                 ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>>>>>>         /* remove the node as fully used */
>>>>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>>>>>>                              "%s Removing node %s slots %d inuse %d",
>>>>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>>>>                              node->name, node->slots, node->slots_inuse));
>>>>>>         opal_list_remove_item(allocated_nodes, item);
>>>>>>         OBJ_RELEASE(item);  /* "un-retain" it */
>>>>>>
>>>>>> I wonder why node->slots and node->slots_inuse are 0, which I can read
>>>>>> from the above line "Removing node node08 slots 0 inuse 0".
>>>>>>
>>>>>> Or, I'm not sure, but should
>>>>>> "else if (node->slots <= node->slots_inuse &&" be
>>>>>> "else if (node->slots < node->slots_inuse &&" ?
>>>>>>
>>>>>> tmishima
>>>>>>
>>>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>
>>>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
>>>>>>>>
>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>
>>>>>>>> Therefore, I think it should allow this allocation. Is that right?
>>>>>>>
>>>>>>> Correct
>>>>>>>
>>>>>>>> My question is why script1 works and script2 does not. They are
>>>>>>>> almost the same.
>>>>>>>>
>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>
>>>>>>>> #SCRIPT1
>>>>>>>> mpirun -report-bindings -bind-to core Myprog
>>>>>>>>
>>>>>>>> #SCRIPT2
>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
>>>>>>>
>>>>>>> This version is not only reading the PBS allocation, but also invoking
>>>>>>> the hostfile filter on top of it. Different code path. I'll take a look -
>>>>>>> it should still match up assuming NPROCS=8. Any possibility that it is a
>>>>>>> different number? I don't recall, but aren't there some extra lines in
>>>>>>> the nodefile - e.g., comments?
>>>>>>>
>>>>>>>> Myprog
>>>>>>>>
>>>>>>>> tmishima
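A quick sanity check for the NPROCS question just above (a suggested check, not taken from the scripts shown):

    echo "NPROCS=${NPROCS}"        # should print 8 for nodes=node08:ppn=8
    grep -vc '^node' pbs_hosts     # counts comment/blank/stray lines; should print 0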
>>>>>>>>> I guess here's my confusion. If you are using only one node, and that
>>>>>>>>> node has 8 allocated slots, then we will not allow you to run more than
>>>>>>>>> 8 processes on that node unless you specifically provide the
>>>>>>>>> --oversubscribe flag. This is because you are operating in a managed
>>>>>>>>> environment (in this case, under Torque), and so we treat the
>>>>>>>>> allocation as "mandatory" by default.
>>>>>>>>>
>>>>>>>>> I suspect that is the issue here, in which case the system is behaving
>>>>>>>>> as it should.
>>>>>>>>>
>>>>>>>>> Is the above accurate?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
>>>>>>>>>>
>>>>>>>>>> How many nodes are in this allocation?
>>>>>>>>>>
>>>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ralph, here is some additional information.
>>>>>>>>>>>
>>>>>>>>>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
>>>>>>>>>>>
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>>>>>
>>>>>>>>>>> From this result, I guess it's related to oversubscription.
>>>>>>>>>>> So I added "-oversubscribe" and reran, and then it worked well as shown below:
>>>>>>>>>>>
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
>>>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
>>>>>>>>>>> [node08.cluster:27019] node: node08 daemon: 0
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>>>>>>>>>>>
>>>>>>>>>>> I think something is wrong with the treatment of oversubscription, which
>>>>>>>>>>> might be related to "#3893: LAMA mapper has problems".
>>>>>>>>>>>
>>>>>>>>>>> tmishima
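For reference, the mapping trace above was produced with "-mca rmaps_base_verbose 50"; the various verbosity levels mentioned in this thread can also be combined into a single debug run. An untested sketch:

    mpirun -mca ras_base_verbose 50 -mca rmaps_base_verbose 50 -mca plm_base_verbose 5 \
           -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog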
>>>>>>>>>>>> Hmmm... looks like we aren't getting your allocation. Can you rerun
>>>>>>>>>>>> and add -mca ras_base_verbose 50?
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
>>>>>>>>>>>>>
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> All nodes which are allocated for this job are already filled.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here, openmpi's configuration is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./configure \
>>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>>> --disable-vt \
>>>>>>>>>>>>> --enable-debug \
>>>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the output.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I can
>>>>>>>>>>>>>>> do is a code review. If you can build --enable-debug and add -mca
>>>>>>>>>>>>>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for your quick response.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support of
>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
>>>>>>>>>>>>>>>> has problems", which I reported a few days ago.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>>>>>>>>>>>> although it worked with openmpi-1.7.3 as I told you before.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #!/bin/sh
>>>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
>>>>>>>>>>>>>>>> Since this happens without a lama request, I guess it's not a problem
>>>>>>>>>>>>>>>> in lama itself. Anyway, please look into this issue as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Tetsuya Mishima
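To spell out the failing and working cases described above (as reported by the poster, not re-verified here):

    mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog   # fails under 1.7.4a1r29646
    mpirun -report-bindings -bind-to core Myprog                                        # works, reading the Torque allocation directly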
>>>>>>>>>>>>>>>>> Done - thanks!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Dear openmpi developers,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
>>>>>>>>>>>>>>>>>> built with PGI 13.10, as shown below:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>>>>>>>>>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>>>>>>>>>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>>>>>>>>>>>>>>>>     at ./event.c:1435
>>>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>>>>>>>>>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary (it
>>>>>>>>>>>>>>>>>> dereferences peer while peer is still NULL at that point), and it
>>>>>>>>>>>>>>>>>> causes the segfault.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 624         /* lookup the corresponding process */
>>>>>>>>>>>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>>>>>>>>>>> 626         if (NULL == peer) {
>>>>>>>>>>>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>>>>>>>>>>>                                     orte_oob_base_framework.framework_output,
>>>>>>>>>>>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>>>>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>> 632             peer->mod = mod;
>>>>>>>>>>>>>>>>>> 633             peer->name = hdr->origin;
>>>>>>>>>>>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>>>>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>>>>>>>>>>>> 637                 OBJ_RELEASE(peer);
>>>>>>>>>>>>>>>>>> 638                 return;
>>>>>>>>>>>>>>>>>> 639             }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Tetsuya Mishima