Hi Ralph,

Okay, I can help you. Please give me some time to report the output.

Tetsuya Mishima

> I can try, but I have no way of testing Torque any more - so all I can do
> is a code review. If you can build --enable-debug and add -mca
> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the
> output.
>
>
> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph,
> >
> > Thank you for your quick response.
> >
> > I'd like to report one more regression in the Torque support of
> > openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> > has problems", which I reported a few days ago.
> >
> > The script below does not work with openmpi-1.7.4a1r29646,
> > although it worked with openmpi-1.7.3 as I told you before.
> >
> > #!/bin/sh
> > #PBS -l nodes=node08:ppn=8
> > export OMP_NUM_THREADS=1
> > cd $PBS_O_WORKDIR
> > cp $PBS_NODEFILE pbs_hosts
> > NPROCS=`wc -l < pbs_hosts`
> > mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >
> > If I drop "-machinefile pbs_hosts -np ${NPROCS} ", then it works fine.
> > Since this happens even without requesting the LAMA mapper, I guess the
> > problem is not in LAMA itself. Anyway, please look into this issue as well.
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Done - thanks!
> >>
> >> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Dear openmpi developers,
> >>>
> >>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646 built by
> >>> PGI13.10, as shown below:
> >>>
> >>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2
> >>> -report-bindings mPre
> >>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>> [manage:23082] *** Process received signal ***
> >>> [manage:23082] Signal: Segmentation fault (11)
> >>> [manage:23082] Signal code: Address not mapped (1)
> >>> [manage:23082] Failing at address: 0x34
> >>> [manage:23082] *** End of error message ***
> >>> Segmentation fault (core dumped)
> >>>
> >>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>> ...
> >>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings
> >>> mPre'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>> hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>> (gdb) where
> >>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>> hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
> >>> cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>> #2  0x00002b5f848eb06a in event_process_active_single_queue
> >>> (base=0x5f848eb27000007f, activeq=0x848eb27000007fff)
> >>>   at ./event.c:1366
> >>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
> >>> at ./event.c:1435
> >>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
> >>> (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8)
> >>> at ./orterun.c:1030
> >>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>> (gdb) quit
> >>>
> >>>
> >>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary and
> >>> causes the segfault.
> >>>
> >>>  624      /* lookup the corresponding process */
> >>>  625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>  626      if (NULL == peer) {
> >>>  627          ui64 = (uint64_t*)(&peer->name);
> >>>  628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>>  629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>  630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>  631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>  632          peer->mod = mod;
> >>>  633          peer->name = hdr->origin;
> >>>  634          peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>  635          ui64 = (uint64_t*)(&peer->name);
> >>>  636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>  637              OBJ_RELEASE(peer);
> >>>  638              return;
> >>>  639          }
> >>>
> >>>
> >>> Please fix this mistake in the next release.
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users

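As a side note to the oob_tcp.c excerpt quoted above, here is a minimal,
self-contained sketch of the lookup-or-create pattern in which nothing is
accessed through the pointer before it has been assigned. All names in it
(name_t, peer_t, module_t, peer_lookup, peer_lookup_or_create, PEER_ACCEPTING)
are hypothetical stand-ins, and a linked list takes the place of the hash
table. It is an illustration only, not the actual Open MPI code or the fix
that was committed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real types; not the Open MPI API. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} name_t;

typedef struct peer {
    name_t name;
    int state;
    struct peer *next;
} peer_t;

typedef struct {
    peer_t *peers; /* simple linked list instead of a hash table */
} module_t;

enum { PEER_ACCEPTING = 1 };

/* Return the known peer matching origin, or NULL if there is none. */
static peer_t *peer_lookup(const module_t *mod, const name_t *origin)
{
    for (peer_t *p = mod->peers; NULL != p; p = p->next) {
        if (p->name.jobid == origin->jobid && p->name.vpid == origin->vpid) {
            return p;
        }
    }
    return NULL;
}

/* Return the peer for origin, creating and registering it if needed. */
static peer_t *peer_lookup_or_create(module_t *mod, const name_t *origin)
{
    peer_t *peer = peer_lookup(mod, origin);
    if (NULL == peer) {
        /* peer is NULL here, so allocate first and only then fill it in. */
        peer = calloc(1, sizeof(*peer));
        if (NULL == peer) {
            return NULL; /* out of memory */
        }
        peer->name = *origin;
        peer->state = PEER_ACCEPTING;
        peer->next = mod->peers;
        mod->peers = peer;
    }
    return peer;
}
```

Calling peer_lookup_or_create() twice with the same name returns the same
object, so the caller never has to handle a half-initialized peer.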