Thank you, Ralph!
I didn't know about that function of cpus-per-proc. As far as I know, it didn't work that way in openmpi-1.6.x; it simply bound each process to 4 cores. I don't have much time today, so I'll check it tomorrow. And thank you again for checking the oversubscription problem.

tmishima

> Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should
> do the same thing, I would think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 proc nodes, and up to 6 on the 64 slot nodes (since you specified np=16). So I guess I don't understand the issue.
>
> Regardless, if NPROCS=8 (and you verified that by printing it out, not just assuming wc -l got that value), then it shouldn't think it is oversubscribed. I'll take a look under a slurm allocation, as
> that is all I can access.
>
> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Our cluster consists of three types of nodes. They have 8, 32
> > and 64 slots respectively. Since the performance of each core is
> > almost the same, mixed use of these nodes is possible.
> >
> > Furthermore, in this case, for a hybrid application with openmpi+openmp,
> > a modification of the hostfile is necessary, as follows:
> >
> > #PBS -l nodes=1:ppn=32+4:ppn=8
> > export OMP_NUM_THREADS=4
> > modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines
> > mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
> >
> > That's why I want to do that.
> >
> > Of course I know that, if I give up mixed use, -npernode is better for this
> > purpose.
> >
> > (The script I showed you first is just a simplified one to clarify the
> > problem.)
> >
> > tmishima
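
By the way, the "modify $PBS_NODEFILE pbs_hosts" line above is only pseudocode; a minimal sketch of that condensation step (assuming each node's slot count is a multiple of OMP_NUM_THREADS, as it is here with 8/32/64-slot nodes and 4 threads) would look like:

#!/bin/sh
#PBS -l nodes=1:ppn=32+4:ppn=8
export OMP_NUM_THREADS=4
cd $PBS_O_WORKDIR
# keep one nodefile entry per MPI rank, i.e. every OMP_NUM_THREADS-th line
awk -v t=$OMP_NUM_THREADS 'NR % t == 1' $PBS_NODEFILE > pbs_hosts
NPROCS=`wc -l < pbs_hosts`   # 64 slots / 4 threads = 16 ranks
mpirun -hostfile pbs_hosts -np $NPROCS -cpus-per-proc $OMP_NUM_THREADS \
       -x OMP_NUM_THREADS Myprog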

> >> Why do it the hard way? I'll look at the FAQ, because that definitely isn't a recommended thing to do - better to use -host to specify the subset, or just specify the desired mapping using all the
> >> various mappers we provide.
> >>
> >> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>> Sorry for the cross-post.
> >>>
> >>> The nodefile is very simple and consists of 8 lines:
> >>>
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>> node08
> >>>
> >>> Therefore, NPROCS=8.
> >>>
> >>> My aim is to modify the allocation, as you pointed out. According to the Open MPI
> >>> FAQ, a proper subset of the hosts allocated to the Torque / PBS Pro job should be allowed.
> >>>
> >>> tmishima
> >>>
> >>>> Please - can you answer my question on script2? What is the value of NPROCS?
> >>>>
> >>>> Why would you want to do it this way? Are you planning to modify the
> >>>> allocation?? That generally is a bad idea as it can confuse the system.
> >>>>
> >>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>>> Since what I really want is to run script2 correctly, please let us
> >>>>> concentrate on script2.
> >>>>>
> >>>>> I'm not an expert on the internals of openmpi. What I can do is just
> >>>>> observe it from the outside. I suspect these lines are strange, especially the last one.
> >>>>>
> >>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>>
> >>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
> >>>>> in rmaps_base_support_fns.c:
> >>>>>
> >>>>>     } else if (node->slots <= node->slots_inuse &&
> >>>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
> >>>>>         /* remove the node as fully used */
> >>>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
> >>>>>                              "%s Removing node %s slots %d inuse %d",
> >>>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> >>>>>                              node->name, node->slots, node->slots_inuse));
> >>>>>         opal_list_remove_item(allocated_nodes, item);
> >>>>>         OBJ_RELEASE(item);  /* "un-retain" it */
> >>>>>
> >>>>> I wonder why node->slots and node->slots_inuse are 0, which is what I read
> >>>>> from the above line "Removing node node08 slots 0 inuse 0".
> >>>>>
> >>>>> Or, I'm not sure, but should
> >>>>> "else if (node->slots <= node->slots_inuse &&" perhaps be
> >>>>> "else if (node->slots < node->slots_inuse &&" ?
> >>>>>
> >>>>> tmishima
> >>>>>
> >>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
> >>>>>>>
> >>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>>
> >>>>>>> Therefore, I think it should allow this allocation. Is that right?
> >>>>>>
> >>>>>> Correct
> >>>>>>
> >>>>>>> My question is why script1 works and script2 does not. They are
> >>>>>>> almost the same.
> >>>>>>>
> >>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>> export OMP_NUM_THREADS=1
> >>>>>>> cd $PBS_O_WORKDIR
> >>>>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>>>
> >>>>>>> #SCRIPT1
> >>>>>>> mpirun -report-bindings -bind-to core Myprog
> >>>>>>>
> >>>>>>> #SCRIPT2
> >>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
> >>>>>>
> >>>>>> This version is not only reading the PBS allocation, but also invoking
> >>>>>> the hostfile filter on top of it. Different code path. I'll take a look -
> >>>>>> it should still match up assuming NPROCS=8. Any possibility that it is a
> >>>>>> different number? I don't recall, but aren't there some extra lines in the
> >>>>>> nodefile - e.g., comments?
> >>>>>>
> >>>>>>> Myprog
> >>>>>>>
> >>>>>>> tmishima
> >>>>>>>
> >>>>>>>> I guess here's my confusion. If you are using only one node, and that
> >>>>>>>> node has 8 allocated slots, then we will not allow you to run more than 8
> >>>>>>>> processes on that node unless you specifically provide the --oversubscribe
> >>>>>>>> flag. This is because you are operating in a managed environment (in this
> >>>>>>>> case, under Torque), and so we treat the allocation as "mandatory" by default.
> >>>>>>>>
> >>>>>>>> I suspect that is the issue here, in which case the system is behaving as
> >>>>>>>> it should.
> >>>>>>>>
> >>>>>>>> Is the above accurate?
> >>>>>>>>
> >>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>
> >>>>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
> >>>>>>>>>
> >>>>>>>>> How many nodes are in this allocation?
> >>>>>>>>>
> >>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ralph, here is some additional information.
> >>>>>>>>>>
> >>>>>>>>>> This is the main part of the output after adding "-mca rmaps_base_verbose 50":
> >>>>>>>>>>
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> >>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>>>>>>>
> >>>>>>>>>> From this result, I guess it's related to oversubscription.
> >>>>>>>>>> So I added "-oversubscribe" and reran; then it worked well, as shown below:
> >>>>>>>>>>
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> >>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> >>>>>>>>>> [node08.cluster:27019] node: node08 daemon: 0
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> >>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> >>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
> >>>>>>>>>>
> >>>>>>>>>> I think something is wrong with the treatment of oversubscription, which might
> >>>>>>>>>> be related to "#3893: LAMA mapper has problems".
> >>>>>>>>>>
> >>>>>>>>>> tmishima
> >>>>>>>>>>
> >>>>>>>>>>> Hmmm...looks like we aren't getting your allocation. Can you rerun and
> >>>>>>>>>>> add -mca ras_base_verbose 50?
> >>>>>>>>>>>
> >>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
> >>>>>>>>>>>>
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> >>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm].
> >>>>>>>>>>>> Query failed to return a module
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> >>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> >>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> >>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> >>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> >>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>> All nodes which are allocated for this job are already filled.
> >>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here, openmpi's configuration is as follows:
> >>>>>>>>>>>>
> >>>>>>>>>>>> ./configure \
> >>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> >>>>>>>>>>>> --with-tm \
> >>>>>>>>>>>> --with-verbs \
> >>>>>>>>>>>> --disable-ipv6 \
> >>>>>>>>>>>> --disable-vt \
> >>>>>>>>>>>> --enable-debug \
> >>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
> >>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> >>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
> >>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the output.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I can do
> >>>>>>>>>>>>>> is a code review. If you can build --enable-debug and add -mca
> >>>>>>>>>>>>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you for your quick response.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'd like to report one more regression concerning the Torque support in
> >>>>>>>>>>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>>>>>>>>>>>>> has problems", which I reported a few days ago.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
> >>>>>>>>>>>>>>> although it worked with openmpi-1.7.3, as I told you before.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> #!/bin/sh
> >>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
> >>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
> >>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> >>>>>>>>>>>>>>> Since this happens without a lama request, I guess it's not a problem
> >>>>>>>>>>>>>>> in lama itself. Anyway, please look into this issue as well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Done - thanks!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Dear openmpi developers,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built
> >>>>>>>>>>>>>>>>> by PGI 13.10, as shown below:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
> >>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
> >>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
> >>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
> >>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
> >>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
> >>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>>>>>>>>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>>>>>>>>>> (gdb) where
> >>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
> >>>>>>>>>>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
> >>>>>>>>>>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
> >>>>>>>>>>>>>>>>>     at ./event.c:1435
> >>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
> >>>>>>>>>>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>>>>>>>>>>>>>>> (gdb) quit
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary and
> >>>>>>>>>>>>>>>>> causes the segfault:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 624         /* lookup the corresponding process */
> >>>>>>>>>>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>>>>>>>>>>>>>>> 626         if (NULL == peer) {
> >>>>>>>>>>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>>>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
> >>>>>>>>>>>>>>>>>                                     orte_oob_base_framework.framework_output,
> >>>>>>>>>>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>>>>>>>>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>>>>>>>>>> 632             peer->mod = mod;
> >>>>>>>>>>>>>>>>> 633             peer->name = hdr->origin;
> >>>>>>>>>>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>>>>>>>>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>>>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>>>>>>>>>>>>>>> 637                 OBJ_RELEASE(peer);
> >>>>>>>>>>>>>>>>> 638                 return;
> >>>>>>>>>>>>>>>>> 639             }
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
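
P.S. Until this is fixed, the two invocations reported to work in the quoted messages above (both under the same "#PBS -l nodes=node08:ppn=8" allocation) are:

# let mpirun read the Torque allocation directly, i.e. drop the machinefile
mpirun -report-bindings -bind-to core Myprog

# or keep the machinefile but explicitly allow oversubscription
mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe -report-bindings -bind-to core Myprog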