Hmmm... it looks like we aren't getting your allocation. Can you rerun and add -mca ras_base_verbose 50 to your command line?
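For example, something like this (the same command line as in your job script, with the extra verbosity added; keeping plm_base_verbose on as well is fine):

mpirun -mca plm_base_verbose 5 -mca ras_base_verbose 50 -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog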
On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Here is the output of "-mca plm_base_verbose 5":
>
> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> Open MPI was configured as follows:
>
> ./configure \
>     --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>     --with-tm \
>     --with-verbs \
>     --disable-ipv6 \
>     --disable-vt \
>     --enable-debug \
>     CC=pgcc CFLAGS="-tp k8-64e" \
>     CXX=pgCC CXXFLAGS="-tp k8-64e" \
>     F77=pgfortran FFLAGS="-tp k8-64e" \
>     FC=pgfortran FCFLAGS="-tp k8-64e"
>
>> Hi Ralph,
>>
>> Okay, I can help you. Please give me some time to report the output.
>>
>> Tetsuya Mishima
>>
>>> I can try, but I have no way of testing Torque any more - so all I can do is a code review. If you can build with --enable-debug and add -mca plm_base_verbose 5 to your command line, I'd appreciate seeing the output.
>>>
>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> Thank you for your quick response.
>>>>
>>>> I'd like to report one more regression in the Torque support of openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper has problems", which I reported a few days ago.
>>>>
>>>> The script below does not work with openmpi-1.7.4a1r29646, although it worked with openmpi-1.7.3 as I told you before.
>>>>
>>>> #!/bin/sh
>>>> #PBS -l nodes=node08:ppn=8
>>>> export OMP_NUM_THREADS=1
>>>> cd $PBS_O_WORKDIR
>>>> cp $PBS_NODEFILE pbs_hosts
>>>> NPROCS=`wc -l < pbs_hosts`
>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>
>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine (see the note on pbs_hosts below). Since this happens without any LAMA request, I guess the problem is not in LAMA itself. Anyway, please look into this issue as well.
>>>>
>>>> Regards,
>>>> Tetsuya Mishima
>>>>
>>>>> Done - thanks!
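A note on the job script quoted above (assuming standard Torque behavior): with "#PBS -l nodes=node08:ppn=8", $PBS_NODEFILE lists the node once per allocated slot, so the copied pbs_hosts should contain

node08
node08
... (eight identical lines in total)

and NPROCS evaluates to 8. The explicit "-machinefile pbs_hosts -np 8" therefore requests exactly the slots Torque allocated, which is why the "All nodes which are allocated for this job are already filled" message looks like a regression in mapping the machinefile against the Torque allocation rather than a genuinely oversubscribed job.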
>>>>>
>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>>> Dear openmpi developers,
>>>>>>
>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built with PGI 13.10, as shown below:
>>>>>>
>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>> [manage:23082] *** Process received signal ***
>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>> [manage:23082] Failing at address: 0x34
>>>>>> [manage:23082] *** End of error message ***
>>>>>> Segmentation fault (core dumped)
>>>>>>
>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>> ...
>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>> (gdb) where
>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>> (gdb) quit
>>>>>>
>>>>>> Line 627 of orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary; it takes &peer->name while peer is still NULL, and that is what causes the segfault:
>>>>>>
>>>>>> 624        /* lookup the corresponding process */
>>>>>> 625        peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>> 626        if (NULL == peer) {
>>>>>> 627            ui64 = (uint64_t*)(&peer->name);
>>>>>> 628            opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>>>>>> 629                                "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>> 630                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>> 631            peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>> 632            peer->mod = mod;
>>>>>> 633            peer->name = hdr->origin;
>>>>>> 634            peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>> 635            ui64 = (uint64_t*)(&peer->name);
>>>>>> 636            if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>> 637                OBJ_RELEASE(peer);
>>>>>> 638                return;
>>>>>> 639            }
>>>>>>
>>>>>> Please fix this mistake in the next release.
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
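For reference, a minimal sketch of the suggested fix, assuming the intent is simply to delete the premature assignment on line 627 so that ui64 is only taken from peer after it has been allocated (an untested sketch, not the committed patch):

    /* lookup the corresponding process */
    peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
    if (NULL == peer) {
        opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                            orte_oob_base_framework.framework_output,
                            "%s mca_oob_tcp_recv_connect: connection from new peer",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
        /* allocate the peer object first, then derive the hash key from it */
        peer = OBJ_NEW(mca_oob_tcp_peer_t);
        peer->mod = mod;
        peer->name = hdr->origin;
        peer->state = MCA_OOB_TCP_ACCEPTING;
        ui64 = (uint64_t*)(&peer->name);
        if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
            OBJ_RELEASE(peer);
            return;
        }
    }

With peer still NULL on line 627, &peer->name evaluates to the offset of the name member from address zero, which is consistent with the "Failing at address: 0x34" in the signal report above.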