FWIW: I verified that this works fine under a slurm allocation of 2 nodes, each with 12 slots. I filled the node without getting an "oversubscribed" error message.
[rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
[bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
[bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
[bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
bend001
bend001
bend001

where

[rhc@bend001 svn-trunk]$ cat hosts
bend001 slots=12

The only way I get the "out of resources" error is if I ask for more processes than I have slots - i.e., I give it the hosts file as shown, but ask for 13 or more processes.

BTW: note one important issue with cpus-per-proc, as shown above. Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one of my procs wound up being split across the two sockets (2 cores on each). That's about the worst situation you can have. So a word of caution: it is up to the user to ensure that the mapping is "good". We just do what you asked us to do.
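As an aside: on this particular layout (two 6-core sockets per node), the split can be avoided by picking a cpus-per-proc value that divides the socket size. An untested sketch on the same hosts file:

    mpirun -n 2 --bind-to core --cpus-per-proc 6 --report-bindings -hostfile hosts hostname

Six cpus per proc fills each socket exactly, so neither rank should straddle a socket; with 4 cpus/proc on 6-core sockets, a third rank has no choice but to either straddle the sockets or leave cores idle.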
On Nov 13, 2013, at 8:30 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Guess I don't see why modifying the allocation is required - we have mapping
> options that should support such things. If you specify the total number of
> procs you want, and cpus-per-proc=4, it should do the same thing I would
> think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 slot nodes, and up
> to 6 on the 64 slot nodes (since you specified np=16). So I guess I don't
> understand the issue.
>
> Regardless, if NPROCS=8 (and you verified that by printing it out, not just
> assuming wc -l got that value), then it shouldn't think it is oversubscribed.
> I'll take a look under a slurm allocation as that is all I can access.
>
>
> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>
>> Our cluster consists of three types of nodes. They have 8, 32 and 64 slots
>> respectively. Since the performance of each core is almost the same, mixed
>> use of these nodes is possible.
>>
>> Furthermore, in this case, for a hybrid application with openmpi+openmp,
>> modification of the hostfile is necessary, as follows:
>>
>> #PBS -l nodes=1:ppn=32+4:ppn=8
>> export OMP_NUM_THREADS=4
>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>>
>> That's why I want to do that.
>>
>> Of course I know that if I give up mixed use, -npernode is better for this
>> purpose.
>>
>> (The script I showed you first is just a simplified one to clarify the
>> problem.)
>>
>> tmishima
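The "modify $PBS_NODEFILE pbs_hosts" step above is not spelled out in the thread; one hypothetical way to do that condensation, keeping every 4th entry so that 64 entries become 16 when OMP_NUM_THREADS=4, would be:

    awk 'NR % 4 == 1' $PBS_NODEFILE > pbs_hosts

This assumes the Torque nodefile lists each node on ppn consecutive lines, which is its usual format.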
>>> Why do it the hard way? I'll look at the FAQ because that definitely isn't
>>> a recommended thing to do - better to use -host to specify the subset, or
>>> just specify the desired mapping using all the various mappers we provide.
>>>
>>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>>> Sorry for the cross-post.
>>>>
>>>> The nodefile is very simple and consists of 8 lines:
>>>>
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>> node08
>>>>
>>>> Therefore, NPROCS=8
>>>>
>>>> My aim is to modify the allocation as you pointed out. According to the
>>>> Open MPI FAQ, using a proper subset of the hosts allocated to the
>>>> Torque / PBS Pro job should be allowed.
>>>>
>>>> tmishima
>>>>
>>>>> Please - can you answer my question on script2? What is the value of
>>>>> NPROCS?
>>>>>
>>>>> Why would you want to do it this way? Are you planning to modify the
>>>>> allocation?? That generally is a bad idea as it can confuse the system.
>>>>>
>>>>>
>>>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>>> Since what I really want is to run script2 correctly, please let us
>>>>>> concentrate on script2.
>>>>>>
>>>>>> I'm not an expert on the inside of openmpi. What I can do is just
>>>>>> observation from the outside. I suspect these lines are strange,
>>>>>> especially the last one.
>>>>>>
>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>
>>>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
>>>>>> in rmaps_base_support_fns.c:
>>>>>>
>>>>>>     } else if (node->slots <= node->slots_inuse &&
>>>>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE &
>>>>>>                 ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>>>>>>         /* remove the node as fully used */
>>>>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>>>>>>                              "%s Removing node %s slots %d inuse %d",
>>>>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>>>>                              node->name, node->slots, node->slots_inuse));
>>>>>>         opal_list_remove_item(allocated_nodes, item);
>>>>>>         OBJ_RELEASE(item);  /* "un-retain" it */
>>>>>>
>>>>>> I wonder why node->slots and node->slots_inuse are 0, which I can read
>>>>>> from the above line "Removing node node08 slots 0 inuse 0".
>>>>>>
>>>>>> Or, I'm not sure, but should
>>>>>> "else if (node->slots <= node->slots_inuse &&" be
>>>>>> "else if (node->slots < node->slots_inuse &&" ?
>>>>>>
>>>>>> tmishima
>>>>>>
>>>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>
>>>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
>>>>>>>>
>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>
>>>>>>>> Therefore, I think it should allow this allocation. Is that right?
>>>>>>>
>>>>>>> Correct
>>>>>>>
>>>>>>>> My question is why script1 works and script2 does not. They are
>>>>>>>> almost the same.
>>>>>>>>
>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>
>>>>>>>> #SCRIPT1
>>>>>>>> mpirun -report-bindings -bind-to core Myprog
>>>>>>>>
>>>>>>>> #SCRIPT2
>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
>>>>>>>
>>>>>>> This version is not only reading the PBS allocation, but also invoking
>>>>>>> the hostfile filter on top of it. Different code path. I'll take a look -
>>>>>>> it should still match up assuming NPROCS=8. Any possibility that it is a
>>>>>>> different number? I don't recall, but aren't there some extra lines in
>>>>>>> the nodefile - e.g., comments?
>>>>>>>
>>>>>>>> Myprog
>>>>>>>>
>>>>>>>> tmishima
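A quick sanity check for the NPROCS question just above (a suggested check, not taken from the scripts shown):

    echo "NPROCS=${NPROCS}"        # should print 8 for nodes=node08:ppn=8
    grep -vc '^node' pbs_hosts     # counts comment/blank/stray lines; should print 0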
>>>>>>>>> I guess here's my confusion. If you are using only one node, and that
>>>>>>>>> node has 8 allocated slots, then we will not allow you to run more than
>>>>>>>>> 8 processes on that node unless you specifically provide the
>>>>>>>>> --oversubscribe flag. This is because you are operating in a managed
>>>>>>>>> environment (in this case, under Torque), and so we treat the
>>>>>>>>> allocation as "mandatory" by default.
>>>>>>>>>
>>>>>>>>> I suspect that is the issue here, in which case the system is behaving
>>>>>>>>> as it should.
>>>>>>>>>
>>>>>>>>> Is the above accurate?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
>>>>>>>>>>
>>>>>>>>>> How many nodes are in this allocation?
>>>>>>>>>>
>>>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ralph, here is some additional information.
>>>>>>>>>>>
>>>>>>>>>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
>>>>>>>>>>>
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>>>>>
>>>>>>>>>>> From this result, I guess it's related to oversubscription.
>>>>>>>>>>> So I added "-oversubscribe" and reran, and then it worked well as shown below:
>>>>>>>>>>>
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
>>>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
>>>>>>>>>>> [node08.cluster:27019] node: node08 daemon: 0
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>>>>>>>>>>>
>>>>>>>>>>> I think something is wrong with the treatment of oversubscription, which
>>>>>>>>>>> might be related to "#3893: LAMA mapper has problems".
>>>>>>>>>>>
>>>>>>>>>>> tmishima
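For reference, the mapping trace above was produced with "-mca rmaps_base_verbose 50"; the various verbosity levels mentioned in this thread can also be combined into a single debug run. An untested sketch:

    mpirun -mca ras_base_verbose 50 -mca rmaps_base_verbose 50 -mca plm_base_verbose 5 \
           -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog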
>>>>>>>>>>>> Hmmm... looks like we aren't getting your allocation. Can you rerun
>>>>>>>>>>>> and add -mca ras_base_verbose 50?
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
>>>>>>>>>>>>>
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> All nodes which are allocated for this job are already filled.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here, openmpi's configuration is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./configure \
>>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>>> --disable-vt \
>>>>>>>>>>>>> --enable-debug \
>>>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the output.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I can
>>>>>>>>>>>>>>> do is a code review. If you can build --enable-debug and add -mca
>>>>>>>>>>>>>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for your quick response.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support of
>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
>>>>>>>>>>>>>>>> has problems", which I reported a few days ago.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>>>>>>>>>>>> although it worked with openmpi-1.7.3 as I told you before.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #!/bin/sh
>>>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
>>>>>>>>>>>>>>>> Since this happens without a lama request, I guess it's not a problem
>>>>>>>>>>>>>>>> in lama itself. Anyway, please look into this issue as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Tetsuya Mishima
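To spell out the failing and working cases described above (as reported by the poster, not re-verified here):

    mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog   # fails under 1.7.4a1r29646
    mpirun -report-bindings -bind-to core Myprog                                        # works, reading the Torque allocation directly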
>>>>>>>>>>>>>>>>> Done - thanks!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Dear openmpi developers,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
>>>>>>>>>>>>>>>>>> built with PGI 13.10, as shown below:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>>>>>>>>>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>>>>>>>>>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>>>>>>>>>>>>>>>>     at ./event.c:1435
>>>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>>>>>>>>>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary (it
>>>>>>>>>>>>>>>>>> dereferences peer while peer is still NULL at that point), and it
>>>>>>>>>>>>>>>>>> causes the segfault.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 624         /* lookup the corresponding process */
>>>>>>>>>>>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>>>>>>>>>>> 626         if (NULL == peer) {
>>>>>>>>>>>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>>>>>>>>>>>                                     orte_oob_base_framework.framework_output,
>>>>>>>>>>>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>>>>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>> 632             peer->mod = mod;
>>>>>>>>>>>>>>>>>> 633             peer->name = hdr->origin;
>>>>>>>>>>>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>>>>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>>>>>>>>>>>> 637                 OBJ_RELEASE(peer);
>>>>>>>>>>>>>>>>>> 638                 return;
>>>>>>>>>>>>>>>>>> 639             }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Tetsuya Mishima