Thank you, Ralph!

I didn't know about that function of -cpus-per-proc.
As far as I know, it didn't work like that in openmpi-1.6.x;
it just bound each process to 4 cores...

I don't have much time today, so I'll check it tomorrow.
And thank you again for looking into the oversubscription problem.

tmishima


> Guess I don't see why modifying the allocation is required - we have
> mapping options that should support such things. If you specify the total
> number of procs you want, and cpus-per-proc=4, it should do the same thing
> I would think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 proc
> nodes, and up to 6 on the 64 slot nodes (since you specified np=16). So I
> guess I don't understand the issue.
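
(For reference, my reading of this suggestion is a job script like the sketch
below, run against the unmodified Torque allocation. This is only my
interpretation of the 1.7-series options, which I have not verified yet:

  #!/bin/sh
  #PBS -l nodes=1:ppn=32+4:ppn=8
  export OMP_NUM_THREADS=4
  cd $PBS_O_WORKDIR
  # No hostfile editing: mpirun reads the Torque allocation directly and
  # packs 4 cores per rank until the requested 16 ranks are placed.
  mpirun -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
)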
>
> Regardless, if NPROCS=8 (and you verified that by printing it out, not
> just assuming wc -l got that value), then it shouldn't think it is
> oversubscribed. I'll take a look under a slurm allocation as that is all
> I can access.
>
>
> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Our cluster consists of three types of nodes, which have 8, 32,
> > and 64 slots respectively. Since the performance of each core is
> > almost the same, mixed use of these nodes is possible.
> >
> > Furthermore, in this case, for a hybrid application using openmpi+openmp,
> > the hostfile has to be modified as follows:
> >
> > #PBS -l nodes=1:ppn=32+4:ppn=8
> > export OMP_NUM_THREADS=4
> > modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
> > mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
> >
> > That's why I want to do that.
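
(The "modify" step above is site-specific; as a minimal sketch, one way to do
the condensing is to keep every OMP_NUM_THREADS-th line of the Torque
nodefile, so that 64 entries become 16. The awk one-liner below is only a
hypothetical illustration, not the actual script used here:

  awk -v n=$OMP_NUM_THREADS 'NR % n == 0' $PBS_NODEFILE > pbs_hosts

This assumes the usual $PBS_NODEFILE layout, where each slot of a node
appears on its own consecutive line.)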
> >
> > Of course I know that if I give up mixed use, -npernode is better for
> > this purpose.
> >
> > (The script I showed you first is just a simplified one to clarify the
> > problem.)
> >
> > tmishima
> >
> >
> >> Why do it the hard way? I'll look at the FAQ because that definitely
> >> isn't a recommended thing to do - better to use -host to specify the
> >> subset, or just specify the desired mapping using all the various
> >> mappers we provide.
> >>
> >> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Sorry for cross-post.
> >>>
> >>> Nodefile is very simple which consists of 8 lines:
> >>>
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>>
> >>> Therefore, NPROCS=8
> >>>
> >>> My aim is to modify the allocation as you pointed out. According to the
> >>> Open MPI FAQ, using a proper subset of the hosts allocated to the
> >>> Torque / PBS Pro job should be allowed.
> >>>
> >>> tmishima
> >>>
> >>>> Please - can you answer my question on script2? What is the value of
> >>>> NPROCS?
> >>>>
> >>>> Why would you want to do it this way? Are you planning to modify the
> >>>> allocation?? That generally is a bad idea as it can confuse the system.
> >>>>
> >>>>
> >>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> Since what I really want is to run script2 correctly, please let us
> >>>>> concentrate on script2.
> >>>>>
> >>>>> I'm not an expert on the internals of openmpi. What I can do is just
> >>>>> observe from the outside. I suspect these lines are strange, especially
> >>>>> the last one.
> >>>>>
> >>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>>
> >>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
> >>>>> in rmaps_base_support_fns.c:
> >>>>>
> >>>>>      } else if (node->slots <= node->slots_inuse &&
> >>>>>                 (ORTE_MAPPING_NO_OVERSUBSCRIBE &
> >>>>>                  ORTE_GET_MAPPING_DIRECTIVE(policy))) {
> >>>>>          /* remove the node as fully used */
> >>>>>          OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
> >>>>>                               "%s Removing node %s slots %d inuse %d",
> >>>>>                               ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> >>>>>                               node->name, node->slots, node->slots_inuse));
> >>>>>          opal_list_remove_item(allocated_nodes, item);
> >>>>>          OBJ_RELEASE(item);  /* "un-retain" it */
> >>>>>
> >>>>> I wonder why node->slots and node->slots_inuse are both 0, which I can
> >>>>> read from the above line "Removing node node08 slots 0 inuse 0".
> >>>>>
> >>>>> Or, I'm not sure, but perhaps
> >>>>> "else if (node->slots <= node->slots_inuse &&" should be
> >>>>> "else if (node->slots < node->slots_inuse &&" ?
> >>>>>
> >>>>> tmishima
> >>>>>
> >>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
> >>>>>>>
> >>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>>
> >>>>>>> Therefore, I think it should allow this allocation. Is that right?
> >>>>>>
> >>>>>> Correct
> >>>>>>
> >>>>>>>
> >>>>>>> My question is why script1 works and script2 does not. They are
> >>>>>>> almost the same.
> >>>>>>>
> >>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>> export OMP_NUM_THREADS=1
> >>>>>>> cd $PBS_O_WORKDIR
> >>>>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>>>
> >>>>>>> #SCRIPT1
> >>>>>>> mpirun -report-bindings -bind-to core Myprog
> >>>>>>>
> >>>>>>> #SCRIPT2
> >>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
> >>>>>>
> >>>>>> This version is not only reading the PBS allocation, but also invoking
> >>>>>> the hostfile filter on top of it. Different code path. I'll take a look -
> >>>>>> it should still match up assuming NPROCS=8. Any possibility that it is a
> >>>>>> different number? I don't recall, but aren't there some extra lines in
> >>>>>> the nodefile - e.g., comments?
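
(A quick way to check that, just a sketch: print the value and look for
anything in the file that is not a plain hostname:

  echo "NPROCS=${NPROCS}"
  cat -A pbs_hosts     # makes blank lines or stray whitespace visible
  grep -c . pbs_hosts  # number of non-empty lines

In my case this should show NPROCS=8 and eight identical node08 lines.)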
> >>>>>>
> >>>>>>
> >>>>>>> Myprog
> >>>>>>>
> >>>>>>> tmishima
> >>>>>>>
> >>>>>>>> I guess here's my confusion. If you are using only one node, and that
> >>>>>>>> node has 8 allocated slots, then we will not allow you to run more than
> >>>>>>>> 8 processes on that node unless you specifically provide the
> >>>>>>>> --oversubscribe flag. This is because you are operating in a managed
> >>>>>>>> environment (in this case, under Torque), and so we treat the
> >>>>>>>> allocation as "mandatory" by default.
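
(In other words, running more processes than the allocated slots would
require giving the flag explicitly, for example:

  mpirun --oversubscribe -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog

This is only to illustrate the flag; with NPROCS=8 on an 8-slot node it
should not be needed at all.)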
> >>>>>>>>
> >>>>>>>> I suspect that is the issue here, in which case the system is behaving
> >>>>>>>> as it should.
> >>>>>>>>
> >>>>>>>> Is the above accurate?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>
> >>>>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
> >>>>>>>>>
> >>>>>>>>> How many nodes are in this allocation?
> >>>>>>>>>
> >>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Ralph, this is an additional information.
> >>>>>>>>>>
> >>>>>>>>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
> >>>>>>>>>>
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>>>>>>>
> >>>>>>>>>> From this result, I guess it's related to oversubscription.
> >>>>>>>>>> So I added "-oversubscribe" and reran; then it worked well, as shown below:
> >>>>>>>>>>
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> >>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> >>>>>>>>>> [node08.cluster:27019]     node: node08 daemon: 0
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
> >>>>>>>>>>
> >>>>>>>>>> I think something is wrong with the treatment of oversubscription,
> >>>>>>>>>> which might be related to "#3893: LAMA mapper has problems".
> >>>>>>>>>>
> >>>>>>>>>> tmishima
> >>>>>>>>>>
> >>>>>>>>>>> Hmmm...looks like we aren't getting your allocation. Can you rerun
> >>>>>>>>>>> and add -mca ras_base_verbose 50?
> >>>>>>>>>>>
> >>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
> >>>>>>>>>>>>
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [rsh]
> >>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [slurm]
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [tm]
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Query of component [tm] set priority to 75
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Selected component [tm]
> >>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> >>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> >>>>>>>>>>>>
> >>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>> All nodes which are allocated for this job are already filled.
> >>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here, openmpi's configuration is as follows:
> >>>>>>>>>>>>
> >>>>>>>>>>>> ./configure \
> >>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> >>>>>>>>>>>> --with-tm \
> >>>>>>>>>>>> --with-verbs \
> >>>>>>>>>>>> --disable-ipv6 \
> >>>>>>>>>>>> --disable-vt \
> >>>>>>>>>>>> --enable-debug \
> >>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
> >>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> >>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
> >>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the output.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I
> >>>>>>>>>>>>>> can do is a code review. If you can build --enable-debug and add
> >>>>>>>>>>>>>> -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing the
> >>>>>>>>>>>>>> output.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you for your quick response.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support of
> >>>>>>>>>>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>>>>>>>>>>>>> has problems", which I reported a few days ago.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
> >>>>>>>>>>>>>>> although it worked with openmpi-1.7.3, as I told you before.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> #!/bin/sh
> >>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
> >>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
> >>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> >>>>>>>>>>>>>>> Since this happens without a lama request, I guess it's not a problem
> >>>>>>>>>>>>>>> in lama itself. Anyway, please look into this issue as well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Done - thanks!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Dear openmpi developers,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
> >>>>>>>>>>>>>>>>> built with PGI 13.10, as shown below:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
> >>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
> >>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
> >>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
> >>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
> >>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
> >>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>>>>>>>>>> (gdb) where
> >>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
> >>>>>>>>>>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f,
> >>>>>>>>>>>>>>>>>     activeq=0x848eb27000007fff) at ./event.c:1366
> >>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> >>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f,
> >>>>>>>>>>>>>>>>>     flags=32767) at ./event.c:1645
> >>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>>>>>>>>>>>>>>> (gdb) quit
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary,
> >>>>>>>>>>>>>>>>> and it is what causes the segfault.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 624      /* lookup the corresponding process */
> >>>>>>>>>>>>>>>>> 625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>>>>>>>>>>>>>>> 626      if (NULL == peer) {
> >>>>>>>>>>>>>>>>> 627          ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>>>>>>>>>> 628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>>>>>>>>>>>>>>>> 629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>>>>>>>>>>>>>>> 630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>>>>>>>>>>>>>>> 631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>>>>>>>>>> 632          peer->mod = mod;
> >>>>>>>>>>>>>>>>> 633          peer->name = hdr->origin;
> >>>>>>>>>>>>>>>>> 634          peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>>>>>>>>>>>>>>> 635          ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>>>>>>>>>> 636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>>>>>>>>>>>>>>> 637              OBJ_RELEASE(peer);
> >>>>>>>>>>>>>>>>> 638              return;
> >>>>>>>>>>>>>>>>> 639          }
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>>>