Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should do the same thing, I would think. You'd get 2 procs on the 8-slot nodes, 8 on the 32-slot nodes, and up to 6 on the 64-slot nodes (since you specified np=16). So I guess I don't understand the issue.
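(For concreteness, under the mixed allocation tmishima quotes below, the approach Ralph describes would amount to something like the following - a sketch assembled from flags that appear elsewhere in this thread, not a tested command line:)

    #PBS -l nodes=1:ppn=32+4:ppn=8
    export OMP_NUM_THREADS=4
    # No hostfile edit: mpirun reads the Torque allocation directly,
    # and the mapper hands each process a block of 4 cpus.
    mpirun -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
    # Expected placement: 32/4 = 8 procs on the 32-slot node and
    # 8/4 = 2 procs on each of the four 8-slot nodes, i.e. the
    # requested 16 procs, with no condensed hostfile needed.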
Regardless, if NPROCS=8 (and you verified that by printing it out, not just assuming wc -l got that value), then it shouldn't think it is oversubscribed. I'll take a look under a Slurm allocation, as that is all I can access.

On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:

> Our cluster consists of three types of nodes. They have 8, 32 and 64 slots
> respectively. Since the performance of each core is almost the same, mixed
> use of these nodes is possible.
>
> Furthermore, in this case, for a hybrid application with openmpi+openmp,
> modification of the hostfile is necessary, as follows:
>
> #PBS -l nodes=1:ppn=32+4:ppn=8
> export OMP_NUM_THREADS=4
> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>
> That's why I want to do that.
>
> Of course I know that if I quit mixed use, -npernode is better for this
> purpose.
>
> (The script I showed you first is just a simplified one to clarify the
> problem.)
>
> tmishima
>
>> Why do it the hard way? I'll look at the FAQ, because that definitely
>> isn't a recommended thing to do - better to use -host to specify the
>> subset, or just specify the desired mapping using all the various mappers
>> we provide.
>>
>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Sorry for the cross-post.
>>>
>>> The nodefile is very simple and consists of 8 lines:
>>>
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>>
>>> Therefore, NPROCS=8.
>>>
>>> My aim is to modify the allocation, as you pointed out. According to the
>>> Open MPI FAQ, using a proper subset of the hosts allocated to the
>>> Torque / PBS Pro job should be allowed.
>>>
>>> tmishima
>>>
>>>> Please - can you answer my question on script2? What is the value of
>>>> NPROCS?
>>>>
>>>> Why would you want to do it this way? Are you planning to modify the
>>>> allocation?? That generally is a bad idea, as it can confuse the system.
>>>>
>>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Since what I really want is to run script2 correctly, please let us
>>>>> concentrate on script2.
>>>>>
>>>>> I'm not an expert on the inside of openmpi. What I can do is just
>>>>> observation from the outside. I suspect these lines are strange,
>>>>> especially the last one:
>>>>>
>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>
>>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
>>>>> in rmaps_base_support_fns.c:
>>>>>
>>>>>     } else if (node->slots <= node->slots_inuse &&
>>>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE &
>>>>>                 ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>>>>>         /* remove the node as fully used */
>>>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>>>>>                              "%s Removing node %s slots %d inuse %d",
>>>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>>>                              node->name, node->slots, node->slots_inuse));
>>>>>         opal_list_remove_item(allocated_nodes, item);
>>>>>         OBJ_RELEASE(item);  /* "un-retain" it */
>>>>>
>>>>> I wonder why node->slots and node->slots_inuse are 0, which I can read
>>>>> from the above line "Removing node node08 slots 0 inuse 0".
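(Why "slots 0 inuse 0" trips this check: annotating the quoted condition makes the failure mode visible. This is the code above with explanatory comments added, nothing more:)

    } else if (node->slots <= node->slots_inuse &&
               /* with slots == 0 and slots_inuse == 0, as in the log,
                * 0 <= 0 is true: a node whose slot count was never
                * filled in counts as "fully used" even though nothing
                * is running on it */
               (ORTE_MAPPING_NO_OVERSUBSCRIBE &
                ORTE_GET_MAPPING_DIRECTIVE(policy))) {
               /* ...and this directive test is why adding the
                * -oversubscribe flag (see below) lets the job run */
        /* remove the node as fully used */
        opal_list_remove_item(allocated_nodes, item);
        OBJ_RELEASE(item);  /* "un-retain" it */
    }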
>>>>> Or, I'm not sure, but should
>>>>> "else if (node->slots <= node->slots_inuse &&" be
>>>>> "else if (node->slots < node->slots_inuse &&" ?
>>>>>
>>>>> tmishima
>>>>>
>>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
>>>>>>>
>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>
>>>>>>> Therefore, I think it should allow this allocation. Is that right?
>>>>>>
>>>>>> Correct
>>>>>>
>>>>>>> My question is why script1 works and script2 does not. They are
>>>>>>> almost the same.
>>>>>>>
>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>> export OMP_NUM_THREADS=1
>>>>>>> cd $PBS_O_WORKDIR
>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>
>>>>>>> #SCRIPT1
>>>>>>> mpirun -report-bindings -bind-to core Myprog
>>>>>>>
>>>>>>> #SCRIPT2
>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>
>>>>>> This version is not only reading the PBS allocation, but also invoking
>>>>>> the hostfile filter on top of it. Different code path. I'll take a
>>>>>> look - it should still match up, assuming NPROCS=8. Any possibility
>>>>>> that it is a different number? I don't recall, but aren't there some
>>>>>> extra lines in the nodefile - e.g., comments?
>>>>>>
>>>>>>> tmishima
>>>>>>>
>>>>>>>> I guess here's my confusion. If you are using only one node, and
>>>>>>>> that node has 8 allocated slots, then we will not allow you to run
>>>>>>>> more than 8 processes on that node unless you specifically provide
>>>>>>>> the --oversubscribe flag. This is because you are operating in a
>>>>>>>> managed environment (in this case, under Torque), and so we treat
>>>>>>>> the allocation as "mandatory" by default.
>>>>>>>>
>>>>>>>> I suspect that is the issue here, in which case the system is
>>>>>>>> behaving as it should.
>>>>>>>>
>>>>>>>> Is the above accurate?
>>>>>>>>
>>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> It has nothing to do with LAMA, as you aren't using that mapper.
>>>>>>>>>
>>>>>>>>> How many nodes are in this allocation?
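(A quick way to settle Ralph's question above about stray lines in the nodefile - a hypothetical check, not something actually run in this thread:)

    cat -A pbs_hosts           # makes blank lines or stray characters visible
    NPROCS=`wc -l < pbs_hosts`
    echo "NPROCS=${NPROCS}"    # verify the count by printing it, per Ralph's suggestion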
>>>>>>>>>
>>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ralph, here is some additional information.
>>>>>>>>>>
>>>>>>>>>> This is the main part of the output after adding "-mca rmaps_base_verbose 50":
>>>>>>>>>>
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
>>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>>>>
>>>>>>>>>> From this result, I guess it's related to oversubscription.
>>>>>>>>>> So I added "-oversubscribe" and reran, and then it worked well, as
>>>>>>>>>> shown below:
>>>>>>>>>>
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
>>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
>>>>>>>>>> [node08.cluster:27019] node: node08 daemon: 0
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>>>>>>>>>>
>>>>>>>>>> I think something is wrong with the treatment of oversubscription,
>>>>>>>>>> which might be related to "#3893: LAMA mapper has problems".
>>>>>>>>>>
>>>>>>>>>> tmishima
>>>>>>>>>>
>>>>>>>>>>> Hmmm... looks like we aren't getting your allocation. Can you
>>>>>>>>>>> rerun and add -mca ras_base_verbose 50?
>>>>>>>>>>>
>>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5":
>>>>>>>>>>>>
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> All nodes which are allocated for this job are already filled.
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> Here, openmpi's configuration is as follows:
>>>>>>>>>>>>
>>>>>>>>>>>> ./configure \
>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>> --disable-vt \
>>>>>>>>>>>> --enable-debug \
>>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the
>>>>>>>>>>>>> output.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so
>>>>>>>>>>>>>> all I can do is a code review. If you can build --enable-debug
>>>>>>>>>>>>>> and add -mca plm_base_verbose 5 to your cmd line, I'd
>>>>>>>>>>>>>> appreciate seeing the output.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your quick response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support
>>>>>>>>>>>>>>> of openmpi-1.7.4a1r29646, which might be related to "#3893:
>>>>>>>>>>>>>>> LAMA mapper has problems", which I reported a few days ago.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>>>>>>>>>>> although it worked with openmpi-1.7.3, as I told you before.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #!/bin/sh
>>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it
>>>>>>>>>>>>>>> works fine. Since this happens without a lama request, I
>>>>>>>>>>>>>>> guess the problem is not in lama itself. Anyway, please look
>>>>>>>>>>>>>>> into this issue as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Done - thanks!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Dear openmpi developers,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I got a segmentation fault in a trial use of
>>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646 built by PGI 13.10, as shown below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>>>>>>>>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>>>>>>>>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>>>>>>>>>>>>>>>     at ./event.c:1435
>>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>>>>>>>>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently
>>>>>>>>>>>>>>>>> unnecessary, and it causes the segfault:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 624         /* lookup the corresponding process */
>>>>>>>>>>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>>>>>>>>>> 626         if (NULL == peer) {
>>>>>>>>>>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>>>>>>>>>>                                     orte_oob_base_framework.framework_output,
>>>>>>>>>>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>>>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>> 632             peer->mod = mod;
>>>>>>>>>>>>>>>>> 633             peer->name = hdr->origin;
>>>>>>>>>>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>>>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>>>>>>>>>>> 637                 OBJ_RELEASE(peer);
>>>>>>>>>>>>>>>>> 638                 return;
>>>>>>>>>>>>>>>>> 639             }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Tetsuya Mishima
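(Applying the fix tmishima describes - the same block with the premature line 627 deleted. A sketch of the obvious correction, not an official patch:)

        /* lookup the corresponding process */
        peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
        if (NULL == peer) {
            /* line 627's "ui64 = (uint64_t*)(&peer->name);" is gone:
             * it computed an address off the NULL peer; ui64 is set
             * again below, once peer actually exists */
            opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                                orte_oob_base_framework.framework_output,
                                "%s mca_oob_tcp_recv_connect: connection from new peer",
                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
            peer = OBJ_NEW(mca_oob_tcp_peer_t);
            peer->mod = mod;
            peer->name = hdr->origin;
            peer->state = MCA_OOB_TCP_ACCEPTING;
            ui64 = (uint64_t*)(&peer->name);
            if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
                OBJ_RELEASE(peer);
                return;
            }
        }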