Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should do the same thing, I would think. You'd get 2 procs on the 8-slot nodes, 8 on the 32-slot nodes, and up to 6 on the 64-slot nodes (since you specified np=16). So I guess I don't understand the issue.
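(For concreteness, under the mixed allocation tmishima quotes below, the approach Ralph describes would amount to something like the following - a sketch assembled from flags that appear elsewhere in this thread, not a tested command line:)

    #PBS -l nodes=1:ppn=32+4:ppn=8
    export OMP_NUM_THREADS=4
    # No hostfile edit: mpirun reads the Torque allocation directly,
    # and the mapper hands each process a block of 4 cpus.
    mpirun -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
    # Expected placement: 32/4 = 8 procs on the 32-slot node and
    # 8/4 = 2 procs on each of the four 8-slot nodes, i.e. the
    # requested 16 procs, with no condensed hostfile needed.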
Regardless, if NPROCS=8 (and you verified that by printing it out, not just assuming wc -l got that value), then it shouldn't think it is oversubscribed. I'll take a look under a Slurm allocation, as that is all I can access.

On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:

> Our cluster consists of three types of nodes. They have 8, 32 and 64 slots
> respectively. Since the performance of each core is almost the same, mixed
> use of these nodes is possible.
>
> Furthermore, in this case, for a hybrid application with openmpi+openmp,
> modification of the hostfile is necessary, as follows:
>
> #PBS -l nodes=1:ppn=32+4:ppn=8
> export OMP_NUM_THREADS=4
> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>
> That's why I want to do that.
>
> Of course I know that if I quit mixed use, -npernode is better for this
> purpose.
>
> (The script I showed you first is just a simplified one to clarify the
> problem.)
>
> tmishima
>
>> Why do it the hard way? I'll look at the FAQ, because that definitely
>> isn't a recommended thing to do - better to use -host to specify the
>> subset, or just specify the desired mapping using all the various mappers
>> we provide.
>>
>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Sorry for the cross-post.
>>>
>>> The nodefile is very simple and consists of 8 lines:
>>>
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>> node08
>>>
>>> Therefore, NPROCS=8.
>>>
>>> My aim is to modify the allocation, as you pointed out. According to the
>>> Open MPI FAQ, using a proper subset of the hosts allocated to the
>>> Torque / PBS Pro job should be allowed.
>>>
>>> tmishima
>>>
>>>> Please - can you answer my question on script2? What is the value of
>>>> NPROCS?
>>>>
>>>> Why would you want to do it this way? Are you planning to modify the
>>>> allocation?? That generally is a bad idea, as it can confuse the system.
>>>>
>>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Since what I really want is to run script2 correctly, please let us
>>>>> concentrate on script2.
>>>>>
>>>>> I'm not an expert on the inside of openmpi. What I can do is just
>>>>> observation from the outside. I suspect these lines are strange,
>>>>> especially the last one:
>>>>>
>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>
>>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
>>>>> in rmaps_base_support_fns.c:
>>>>>
>>>>>     } else if (node->slots <= node->slots_inuse &&
>>>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE &
>>>>>                 ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>>>>>         /* remove the node as fully used */
>>>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>>>>>                              "%s Removing node %s slots %d inuse %d",
>>>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>>>                              node->name, node->slots, node->slots_inuse));
>>>>>         opal_list_remove_item(allocated_nodes, item);
>>>>>         OBJ_RELEASE(item);  /* "un-retain" it */
>>>>>
>>>>> I wonder why node->slots and node->slots_inuse are 0, which I can read
>>>>> from the above line "Removing node node08 slots 0 inuse 0".
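(Why "slots 0 inuse 0" trips this check: annotating the quoted condition makes the failure mode visible. This is the code above with explanatory comments added, nothing more:)

    } else if (node->slots <= node->slots_inuse &&
               /* with slots == 0 and slots_inuse == 0, as in the log,
                * 0 <= 0 is true: a node whose slot count was never
                * filled in counts as "fully used" even though nothing
                * is running on it */
               (ORTE_MAPPING_NO_OVERSUBSCRIBE &
                ORTE_GET_MAPPING_DIRECTIVE(policy))) {
               /* ...and this directive test is why adding the
                * -oversubscribe flag (see below) lets the job run */
        /* remove the node as fully used */
        opal_list_remove_item(allocated_nodes, item);
        OBJ_RELEASE(item);  /* "un-retain" it */
    }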
>>>>> Or, I'm not sure, but should
>>>>> "else if (node->slots <= node->slots_inuse &&" be
>>>>> "else if (node->slots < node->slots_inuse &&" ?
>>>>>
>>>>> tmishima
>>>>>
>>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
>>>>>>>
>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>
>>>>>>> Therefore, I think it should allow this allocation. Is that right?
>>>>>>
>>>>>> Correct
>>>>>>
>>>>>>> My question is why script1 works and script2 does not. They are
>>>>>>> almost the same.
>>>>>>>
>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>> export OMP_NUM_THREADS=1
>>>>>>> cd $PBS_O_WORKDIR
>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>
>>>>>>> #SCRIPT1
>>>>>>> mpirun -report-bindings -bind-to core Myprog
>>>>>>>
>>>>>>> #SCRIPT2
>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>
>>>>>> This version is not only reading the PBS allocation, but also invoking
>>>>>> the hostfile filter on top of it. Different code path. I'll take a
>>>>>> look - it should still match up, assuming NPROCS=8. Any possibility
>>>>>> that it is a different number? I don't recall, but aren't there some
>>>>>> extra lines in the nodefile - e.g., comments?
>>>>>>
>>>>>>> tmishima
>>>>>>>
>>>>>>>> I guess here's my confusion. If you are using only one node, and
>>>>>>>> that node has 8 allocated slots, then we will not allow you to run
>>>>>>>> more than 8 processes on that node unless you specifically provide
>>>>>>>> the --oversubscribe flag. This is because you are operating in a
>>>>>>>> managed environment (in this case, under Torque), and so we treat
>>>>>>>> the allocation as "mandatory" by default.
>>>>>>>>
>>>>>>>> I suspect that is the issue here, in which case the system is
>>>>>>>> behaving as it should.
>>>>>>>>
>>>>>>>> Is the above accurate?
>>>>>>>>
>>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> It has nothing to do with LAMA, as you aren't using that mapper.
>>>>>>>>>
>>>>>>>>> How many nodes are in this allocation?
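(A quick way to settle Ralph's question above about stray lines in the nodefile - a hypothetical check, not something actually run in this thread:)

    cat -A pbs_hosts           # makes blank lines or stray characters visible
    NPROCS=`wc -l < pbs_hosts`
    echo "NPROCS=${NPROCS}"    # verify the count by printing it, per Ralph's suggestion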
>>>>>>>>>
>>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ralph, here is some additional information.
>>>>>>>>>>
>>>>>>>>>> This is the main part of the output after adding "-mca rmaps_base_verbose 50":
>>>>>>>>>>
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
>>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
>>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>>>>
>>>>>>>>>> From this result, I guess it's related to oversubscription.
>>>>>>>>>> So I added "-oversubscribe" and reran, and then it worked well, as
>>>>>>>>>> shown below:
>>>>>>>>>>
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
>>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
>>>>>>>>>> [node08.cluster:27019] node: node08 daemon: 0
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>>>>>>>>>>
>>>>>>>>>> I think something is wrong with the treatment of oversubscription,
>>>>>>>>>> which might be related to "#3893: LAMA mapper has problems".
>>>>>>>>>>
>>>>>>>>>> tmishima
>>>>>>>>>>
>>>>>>>>>>> Hmmm... looks like we aren't getting your allocation. Can you
>>>>>>>>>>> rerun and add -mca ras_base_verbose 50?
>>>>>>>>>>>
>>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5":
>>>>>>>>>>>>
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> All nodes which are allocated for this job are already filled.
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> Here, openmpi's configuration is as follows:
>>>>>>>>>>>>
>>>>>>>>>>>> ./configure \
>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>> --disable-vt \
>>>>>>>>>>>> --enable-debug \
>>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the
>>>>>>>>>>>>> output.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so
>>>>>>>>>>>>>> all I can do is a code review. If you can build --enable-debug
>>>>>>>>>>>>>> and add -mca plm_base_verbose 5 to your cmd line, I'd
>>>>>>>>>>>>>> appreciate seeing the output.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your quick response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support
>>>>>>>>>>>>>>> of openmpi-1.7.4a1r29646, which might be related to "#3893:
>>>>>>>>>>>>>>> LAMA mapper has problems", which I reported a few days ago.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>>>>>>>>>>> although it worked with openmpi-1.7.3, as I told you before.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #!/bin/sh
>>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it
>>>>>>>>>>>>>>> works fine. Since this happens without a lama request, I
>>>>>>>>>>>>>>> guess the problem is not in lama itself. Anyway, please look
>>>>>>>>>>>>>>> into this issue as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Done - thanks!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Dear openmpi developers,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I got a segmentation fault in a trial use of
>>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646 built by PGI 13.10, as shown below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>>>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>>>>>>>>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>>>>>>>>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>>>>>>>>>>>>>>>     at ./event.c:1435
>>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>>>>>>>>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently
>>>>>>>>>>>>>>>>> unnecessary, and it causes the segfault:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 624         /* lookup the corresponding process */
>>>>>>>>>>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>>>>>>>>>> 626         if (NULL == peer) {
>>>>>>>>>>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>>>>>>>>>>                                     orte_oob_base_framework.framework_output,
>>>>>>>>>>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>>>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>> 632             peer->mod = mod;
>>>>>>>>>>>>>>>>> 633             peer->name = hdr->origin;
>>>>>>>>>>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>>>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>>>>>>>>>>> 637                 OBJ_RELEASE(peer);
>>>>>>>>>>>>>>>>> 638                 return;
>>>>>>>>>>>>>>>>> 639             }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Tetsuya Mishima
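(Applying the fix tmishima describes - the same block with the premature line 627 deleted. A sketch of the obvious correction, not an official patch:)

        /* lookup the corresponding process */
        peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
        if (NULL == peer) {
            /* line 627's "ui64 = (uint64_t*)(&peer->name);" is gone:
             * it computed an address off the NULL peer; ui64 is set
             * again below, once peer actually exists */
            opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                                orte_oob_base_framework.framework_output,
                                "%s mca_oob_tcp_recv_connect: connection from new peer",
                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
            peer = OBJ_NEW(mca_oob_tcp_peer_t);
            peer->mod = mod;
            peer->name = hdr->origin;
            peer->state = MCA_OOB_TCP_ACCEPTING;
            ui64 = (uint64_t*)(&peer->name);
            if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
                OBJ_RELEASE(peer);
                return;
            }
        }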