Hi Ralph,
Okay, I can help you. Please give me some time to report the output.

Tetsuya Mishima

> I can try, but I have no way of testing Torque any more - so all I can do is a code review. If you can build --enable-debug and add -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
>
> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > Thank you for your quick response.
> >
> > I'd like to report one more regression in the Torque support of
> > openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> > has problems", which I reported a few days ago.
> >
> > The script below does not work with openmpi-1.7.4a1r29646,
> > although it worked with openmpi-1.7.3 as I told you before.
> >
> > #!/bin/sh
> > #PBS -l nodes=node08:ppn=8
> > export OMP_NUM_THREADS=1
> > cd $PBS_O_WORKDIR
> > cp $PBS_NODEFILE pbs_hosts
> > NPROCS=`wc -l < pbs_hosts`
> > mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >
> > If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> > Since this happens without a LAMA request, I guess the problem is not
> > in LAMA itself. Anyway, please look into this issue as well.
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Done - thanks!
> >>
> >> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>> Dear openmpi developers,
> >>>
> >>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built by
> >>> PGI 13.10, as shown below:
> >>>
> >>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>> [manage:23082] *** Process received signal ***
> >>> [manage:23082] Signal: Segmentation fault (11)
> >>> [manage:23082] Signal code: Address not mapped (1)
> >>> [manage:23082] Failing at address: 0x34
> >>> [manage:23082] *** End of error message ***
> >>> Segmentation fault (core dumped)
> >>>
> >>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>> ...
> >>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>> (gdb) where
> >>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> >>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>> (gdb) quit
> >>>
> >>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary (it reads a field of peer while peer is still NULL), and it causes the segfault:
> >>>
> >>> 624         /* lookup the corresponding process */
> >>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>> 626         if (NULL == peer) {
> >>> 627             ui64 = (uint64_t*)(&peer->name);
> >>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>> 632             peer->mod = mod;
> >>> 633             peer->name = hdr->origin;
> >>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
> >>> 635             ui64 = (uint64_t*)(&peer->name);
> >>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>> 637                 OBJ_RELEASE(peer);
> >>> 638                 return;
> >>> 639             }
> >>>
> >>> Please fix this mistake in the next release.
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users