Yes, node08 has 8 slots, but the number of processes I run is also 8.
#PBS -l nodes=node08:ppn=8

Therefore, I think it should allow this allocation. Is that right?

My question is why script1 works and script2 does not; they are almost the same.
(A sketch of script2 with the --oversubscribe workaround is appended after the
quoted thread at the end of this message.)

#PBS -l nodes=node08:ppn=8
export OMP_NUM_THREADS=1
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE pbs_hosts
NPROCS=`wc -l < pbs_hosts`

#SCRIPT1
mpirun -report-bindings -bind-to core Myprog

#SCRIPT2
mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog

tmishima

> I guess here's my confusion. If you are using only one node, and that node
> has 8 allocated slots, then we will not allow you to run more than 8
> processes on that node unless you specifically provide the --oversubscribe
> flag. This is because you are operating in a managed environment (in this
> case, under Torque), and so we treat the allocation as "mandatory" by default.
>
> I suspect that is the issue here, in which case the system is behaving as
> it should.
>
> Is the above accurate?
>
>
> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > It has nothing to do with LAMA as you aren't using that mapper.
> >
> > How many nodes are in this allocation?
> >
> > On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> >> Hi Ralph, this is additional information.
> >>
> >> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
> >>
> >> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> >> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> >> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> >> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> >> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> >> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> >> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> >> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> >> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> >> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> >> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>
> >> From this result, I guess it's related to oversubscribe.
> >> So I added "-oversubscribe" and reran, then it worked well as shown below:
> >>
> >> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> >> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> >> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> >> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> >> [node08.cluster:27019]     node: node08 daemon: 0
> >> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> >> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> >> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> >> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> >> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> >> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> >> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> >> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
> >>
> >> I think something is wrong with the treatment of oversubscription, which
> >> might be related to "#3893: LAMA mapper has problems".
> >>
> >> tmishima
> >>
> >>> Hmmm...looks like we aren't getting your allocation. Can you rerun and add -mca ras_base_verbose 50?
> >>>
> >>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>
> >>>> Hi Ralph,
> >>>>
> >>>> Here is the output of "-mca plm_base_verbose 5".
> >>>>
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> >>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> >>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> >>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm].
> >>>> Query failed to return a module
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> >>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> >>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> >>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> >>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> >>>> --------------------------------------------------------------------------
> >>>> All nodes which are allocated for this job are already filled.
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> Here, openmpi's configuration is as follows:
> >>>>
> >>>> ./configure \
> >>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> >>>> --with-tm \
> >>>> --with-verbs \
> >>>> --disable-ipv6 \
> >>>> --disable-vt \
> >>>> --enable-debug \
> >>>> CC=pgcc CFLAGS="-tp k8-64e" \
> >>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> >>>> F77=pgfortran FFLAGS="-tp k8-64e" \
> >>>> FC=pgfortran FCFLAGS="-tp k8-64e"
> >>>>
> >>>>> Hi Ralph,
> >>>>>
> >>>>> Okay, I can help you. Please give me some time to report the output.
> >>>>>
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>> I can try, but I have no way of testing Torque any more - so all I can do
> >>>>>> is a code review. If you can build --enable-debug and add -mca
> >>>>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
> >>>>>>
> >>>>>>
> >>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>> Hi Ralph,
> >>>>>>>
> >>>>>>> Thank you for your quick response.
> >>>>>>>
> >>>>>>> I'd like to report one more regressive issue about Torque support in
> >>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>>>>> has problems", which I reported a few days ago.
> >>>>>>>
> >>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
> >>>>>>> although it worked with openmpi-1.7.3 as I told you before.
> >>>>>>>
> >>>>>>> #!/bin/sh
> >>>>>>> #PBS -l nodes=node08:ppn=8
> >>>>>>> export OMP_NUM_THREADS=1
> >>>>>>> cd $PBS_O_WORKDIR
> >>>>>>> cp $PBS_NODEFILE pbs_hosts
> >>>>>>> NPROCS=`wc -l < pbs_hosts`
> >>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>>>>
> >>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> >>>>>>> Since this happens without a lama request, I guess it's not a problem
> >>>>>>> in lama itself. Anyway, please look into this issue as well.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tetsuya Mishima
> >>>>>>>
> >>>>>>>> Done - thanks!
> >>>>>>>>
> >>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>
> >>>>>>>>> Dear openmpi developers,
> >>>>>>>>>
> >>>>>>>>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646 built by
> >>>>>>>>> PGI13.10 as shown below:
> >>>>>>>>>
> >>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>>>>>>> [manage:23082] *** Process received signal ***
> >>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
> >>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
> >>>>>>>>> [manage:23082] Failing at address: 0x34
> >>>>>>>>> [manage:23082] *** End of error message ***
> >>>>>>>>> Segmentation fault (core dumped)
> >>>>>>>>>
> >>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>>>>>>> ...
> >>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>>>>>>> Program terminated with signal 11, Segmentation fault.
> >>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>> (gdb) where
> >>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> >>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
> >>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
> >>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
> >>>>>>>>>     at ./event.c:1435
> >>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
> >>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>>>>>>> (gdb) quit
> >>>>>>>>>
> >>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary, and it
> >>>>>>>>> causes the segfault.
> >>>>>>>>>
> >>>>>>>>> 624        /* lookup the corresponding process */
> >>>>>>>>> 625        peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>>>>>>> 626        if (NULL == peer) {
> >>>>>>>>> 627            ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>> 628            opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
> >>>>>>>>>                                    orte_oob_base_framework.framework_output,
> >>>>>>>>> 629                                "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>>>>>>> 630                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>>>>>>> 631            peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>>>>> 632            peer->mod = mod;
> >>>>>>>>> 633            peer->name = hdr->origin;
> >>>>>>>>> 634            peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>>>>>>> 635            ui64 = (uint64_t*)(&peer->name);
> >>>>>>>>> 636            if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>>>>>>> 637                OBJ_RELEASE(peer);
> >>>>>>>>> 638                return;
> >>>>>>>>> 639            }
> >>>>>>>>>
> >>>>>>>>> Please fix this mistake in the next release.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Tetsuya Mishima
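
For reference, here is a minimal sketch of what a fix for the recv_connect()
crash quoted above might look like, assuming the fix is simply to drop the
premature dereference at line 627 (the rest of the block is exactly the quoted
code from orte/mca/oob/tcp/oob_tcp.c; the actual patch is of course up to the
developers):

    /* lookup the corresponding process */
    peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
    if (NULL == peer) {
        /* former line 627, "ui64 = (uint64_t*)(&peer->name);", removed:
         * it dereferenced peer while peer was still NULL, which is what
         * produced the "Address not mapped" segfault in the backtrace */
        opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                            orte_oob_base_framework.framework_output,
                            "%s mca_oob_tcp_recv_connect: connection from new peer",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
        peer = OBJ_NEW(mca_oob_tcp_peer_t);
        peer->mod = mod;
        peer->name = hdr->origin;
        peer->state = MCA_OOB_TCP_ACCEPTING;
        ui64 = (uint64_t*)(&peer->name);   /* safe here: peer is a real object now */
        if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
            OBJ_RELEASE(peer);
            return;
        }
    }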
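
And, as promised above, a sketch of script2 with the "-oversubscribe"
workaround that made the job map correctly in the rmaps verbose output. This
is only a workaround sketch under Torque; node08, Myprog and the PBS
directives are as in the original scripts, and whether the flag should really
be needed for an 8-slot, 8-process job is exactly the open question:

    #!/bin/sh
    # script2 plus the -oversubscribe workaround reported above.
    # With 8 slots on node08 and NPROCS=8 this should not normally be
    # required, which is the behaviour under discussion.
    #PBS -l nodes=node08:ppn=8
    export OMP_NUM_THREADS=1
    cd $PBS_O_WORKDIR
    cp $PBS_NODEFILE pbs_hosts
    NPROCS=`wc -l < pbs_hosts`
    mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
           -report-bindings -bind-to core Myprog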