Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

tmishima Wed, 13 Nov 2013 22:22:58 -0500 (EST)


Our cluster consists of three types of nodes, with 8, 32
and 64 slots respectively. Since the performance of each core is
almost the same, mixed use of these nodes is possible.


Furthermore, in this case, for a hybrid application using openmpi+openmp,
the hostfile must be modified as follows:

#PBS -l nodes=1:ppn=32+4:ppn=8
export OMP_NUM_THREADS=4
modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog

That is why I want to modify the hostfile.
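
The "modify" step in the script above is shorthand, not a real command.
A minimal sketch of one way to do it, assuming the nodefile lists each
host once per slot and that every host's slot count is a multiple of
OMP_NUM_THREADS. The function name condense_nodefile is made up for
illustration:

```sh
#!/bin/sh
# Hypothetical helper (not part of Open MPI or Torque): condense a
# PBS-style nodefile so that each host keeps one line per group of
# N slots, where N is the number of threads per MPI rank.
# Usage: condense_nodefile <nodefile> <threads-per-rank>
condense_nodefile() {
    awk -v n="$2" '
        { seen[$1]++ }                        # slots seen so far for this host
        (seen[$1] - 1) % n == 0 { print $1 }  # keep slot 1, n+1, 2n+1, ...
    ' "$1"
}
```

In the job script this would stand in for the "modify" line, e.g.
condense_nodefile "$PBS_NODEFILE" "$OMP_NUM_THREADS" > pbs_hosts.
For nodes=1:ppn=32+4:ppn=8 with 4 threads it keeps 8 + 4*2 = 16 lines.
If a host's slot count is not a multiple of N, the last partial group
still yields a line, so that host would end up oversubscribed.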

Of course, I know that if I give up mixed use, -npernode is better for this
purpose.

(The script I showed you first is just a simplified one to clarify the
problem.)

tmishima


> Why do it the hard way? I'll look at the FAQ because that definitely
> isn't a recommended thing to do - better to use -host to specify the
> subset, or just specify the desired mapping using all the various
> mappers we provide.
>
> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Sorry for cross-post.
> >
> > The nodefile is very simple; it consists of 8 lines:
> >
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> >
> > Therefore, NPROCS=8
> >
> > My aim is to modify the allocation, as you pointed out. According to the
> > Open MPI FAQ, using a proper subset of the hosts allocated to the
> > Torque / PBS Pro job should be allowed.
> >
> > tmishima
> >
> >> Please - can you answer my question on script2? What is the value of
> >> NPROCS?
> >>
> >> Why would you want to do it this way? Are you planning to modify the
> >> allocation?? That generally is a bad idea as it can confuse the system
> >>
> >>
> >> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Since what I really want is to run script2 correctly, please let us
> >>> concentrate on script2.
> >>>
> >>> I'm not an expert on the internals of openmpi. All I can do is
> >>> observe from the outside. I suspect these lines are strange,
> >>> especially the last one.
> >>>
> >>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>
> >>> These lines come from this part of orte_rmaps_base_get_target_nodes
> >>> in rmaps_base_support_fns.c:
> >>>
> >>>       } else if (node->slots <= node->slots_inuse &&
> >>>                  (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
> >>>           /* remove the node as fully used */
> >>>           OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
> >>>                                "%s Removing node %s slots %d inuse %d",
> >>>                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> >>>                                node->name, node->slots, node->slots_inuse));
> >>>           opal_list_remove_item(allocated_nodes, item);
> >>>           OBJ_RELEASE(item);  /* "un-retain" it */
> >>>
> >>> I wonder why node->slots and node->slots_inuse are both 0, as can be read
> >>> from the line above, "Removing node node08 slots 0 inuse 0".
> >>>
> >>> Or, though I'm not sure, perhaps
> >>>  "else if (node->slots <= node->slots_inuse &&" should be
> >>>  "else if (node->slots < node->slots_inuse &&" ?
> >>>
> >>> tmishima
> >>>
> >>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
> >>>>>
> >>>>> #PBS -l nodes=node08:ppn=8
> >>>>>
> >>>>> Therefore, I think it should allow this allocation. Is that right?
> >>>>
> >>>> Correct
> >>>>
> >>>>>
> >>>>> My question is why script1 works and script2 does not. They are
> >>>>> almost the same.
> >>>>>
> >>>>> #PBS -l nodes=node08:ppn=8
> >>>>> export OMP_NUM_THREADS=1
> >>>>> cd $PBS_O_WORKDIR
> >>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>
> >>>>> #SCRIPT1
> >>>>> mpirun -report-bindings -bind-to core Myprog
> >>>>>
> >>>>> #SCRIPT2
> >>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
> >>>>
> >>>> This version is not only reading the PBS allocation, but also invoking
> >>>> the hostfile filter on top of it. Different code path. I'll take a look -
> >>>> it should still match up assuming NPROCS=8. Any possibility that it is a
> >>>> different number? I don't recall, but isn't there some extra lines in
> >>>> the nodefile - e.g., comments?
> >>>>
> >>>>
> >>>>> Myprog
> >>>>>
> >>>>> tmishima
> >>>>>
> >>>>>> I guess here's my confusion. If you are using only one node, and that
> >>>>>> node has 8 allocated slots, then we will not allow you to run more than 8
> >>>>>> processes on that node unless you specifically provide
> >>>>>> the --oversubscribe flag. This is because you are operating in a managed
> >>>>>> environment (in this case, under Torque), and so we treat the allocation as
> >>>>>> "mandatory" by default.
> >>>>>>
> >>>>>> I suspect that is the issue here, in which case the system is behaving as
> >>>>>> it should.
> >>>>>>
> >>>>>> Is the above accurate?
> >>>>>>
> >>>>>>
> >>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>
> >>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
> >>>>>>>
> >>>>>>> How many nodes are in this allocation?
> >>>>>>>
> >>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Ralph, here is some additional information.
> >>>>>>>>
> >>>>>>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
> >>>>>>>>
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> >>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> >>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> >>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> >>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> >>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> >>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> >>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>>>>>
> >>>>>>>> From this result, I guess it's related to oversubscription.
> >>>>>>>> So I added "-oversubscribe" and reran; it then worked, as shown below:
> >>>>>>>>
> >>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> >>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> >>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> >>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> >>>>>>>> [node08.cluster:27019]     node: node08 daemon: 0
> >>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> >>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> >>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> >>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> >>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> >>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
> >>>>>>>>
> >>>>>>>> I think something is wrong with the treatment of oversubscription, which
> >>>>>>>> might be related to "#3893: LAMA mapper has problems"
> >>>>>>>>
> >>>>>>>> tmishima
> >>>>>>>>
> >>>>>>>>> Hmmm...looks like we aren't getting your allocation. Can you rerun and
> >>>>>>>>> add -mca ras_base_verbose 50?
> >>>>>>>>>
> >>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Ralph,
> >>>>>>>>>>
> >>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
> >>>>>>>>>>
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [rsh]
> >>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [slurm]
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [tm]
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Query of component [tm] set priority to 75
> >>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Selected component [tm]
> >>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> >>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> >>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> >>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> >>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> >>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> >>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> >>>>>>>>>>
> >>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>> All nodes which are allocated for this job are already filled.
> >>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> Here, openmpi's configuration is as follows:
> >>>>>>>>>>
> >>>>>>>>>> ./configure \
> >>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> >>>>>>>>>> --with-tm \
> >>>>>>>>>> --with-verbs \
> >>>>>>>>>> --disable-ipv6 \
> >>>>>>>>>> --disable-vt \
> >>>>>>>>>> --enable-debug \
> >>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
> >>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> >>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
> >>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
> >>>>>>>>>>
> >>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>
> >>>>>>>>>>> Okay, I can help you. Please give me some time to report the output.
> >>>>>>>>>>>
> >>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>
> >>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I can do
> >>>>>>>>>>>> is a code review. If you can build --enable-debug and add -mca
> >>>>>>>>>>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the
> >>>>>>>>>>>> output.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you for your quick response.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to report one more regression in the Torque support of
> >>>>>>>>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>>>>>>>>>>> has problems", which I reported a few days ago.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
> >>>>>>>>>>>>> although it worked with openmpi-1.7.3, as I told you before.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> #!/bin/sh
> >>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>>>>>>>> export OMP_NUM_THREADS=1
> >>>>>>>>>>>>> cd $PBS_O_WORKDIR
> >>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS} ", then it works fine.
> >>>>>>>>>>>>> Since this happens without requesting lama, I guess the problem is not
> >>>>>>>>>>>>> in lama itself. Anyway, please look into this issue as well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Done - thanks!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Dear openmpi developers,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646
> >>>>>>>>>>>>>>> built with PGI13.10, as shown below:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
> >>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
> >>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
> >>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
> >>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
> >>>>>>>>>>>>>>> Segmentation fault (core dumped)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
> >>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>>>>>>>> (gdb) where
> >>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> >>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>>>>>>>>>>>>> (gdb) quit
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary,
> >>>>>>>>>>>>>>> and it causes the segfault.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 624      /* lookup the corresponding process */
> >>>>>>>>>>>>>>> 625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>>>>>>>>>>>>> 626      if (NULL == peer) {
> >>>>>>>>>>>>>>> 627          ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>>>>>>>> 628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>>>>>>>>>>>>>> 629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>>>>>>>>>>>>> 630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>>>>>>>>>>>>> 631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>>>>>>>> 632          peer->mod = mod;
> >>>>>>>>>>>>>>> 633          peer->name = hdr->origin;
> >>>>>>>>>>>>>>> 634          peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>>>>>>>>>>>>> 635          ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>>>>>>>> 636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>>>>>>>>>>>>> 637              OBJ_RELEASE(peer);
> >>>>>>>>>>>>>>> 638              return;
> >>>>>>>>>>>>>>> 639          }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Please fix this mistake in the next release.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>>> users mailing list
> >>>>>>>>>>>>>>> us...@open-mpi.org
> >>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
