I can try, but I have no way of testing Torque anymore, so all I can do is a code review. If you can build with --enable-debug and add -mca plm_base_verbose 5 to your command line, I'd appreciate seeing the output.
On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Thank you for your quick response.
>
> I'd like to report one more regression in the Torque support of
> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> has problems", which I reported a few days ago.
>
> The script below does not work with openmpi-1.7.4a1r29646,
> although it worked with openmpi-1.7.3 as I told you before.
>
> #!/bin/sh
> #PBS -l nodes=node08:ppn=8
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> Since this happens without a LAMA request, I guess the problem is not
> in LAMA itself. Anyway, please look into this issue as well.
>
> Regards,
> Tetsuya Mishima
>
>> Done - thanks!
>>
>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Dear openmpi developers,
>>>
>>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646
>>> built by PGI 13.10, as shown below:
>>>
>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>> [manage:23082] *** Process received signal ***
>>> [manage:23082] Signal: Segmentation fault (11)
>>> [manage:23082] Signal code: Address not mapped (1)
>>> [manage:23082] Failing at address: 0x34
>>> [manage:23082] *** End of error message ***
>>> Segmentation fault (core dumped)
>>>
>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>> ...
>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>> (gdb) where
>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>     at ./event.c:1435
>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8)
>>>     at ./orterun.c:1030
>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>> (gdb) quit
>>>
>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary, and it
>>> causes the segfault.
>>>
>>> 624        /* lookup the corresponding process */
>>> 625        peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>> 626        if (NULL == peer) {
>>> 627            ui64 = (uint64_t*)(&peer->name);
>>> 628            opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>                    orte_oob_base_framework.framework_output,
>>> 629                    "%s mca_oob_tcp_recv_connect: connection from new peer",
>>> 630                    ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>> 631            peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>> 632            peer->mod = mod;
>>> 633            peer->name = hdr->origin;
>>> 634            peer->state = MCA_OOB_TCP_ACCEPTING;
>>> 635            ui64 = (uint64_t*)(&peer->name);
>>> 636            if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>> 637                OBJ_RELEASE(peer);
>>> 638                return;
>>> 639            }
>>>
>>> Please fix this mistake in the next release.
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users