FYI... The problem is discussed further in
Redhat Bugzilla: Bug 1321154 - numa enabled torque don't work
https://bugzilla.redhat.com/show_bug.cgi?id=1321154

I'd seen this previously, as it required me to add "num_node_boards=1" to
each node in /var/lib/torque/server_priv/nodes to get torque to work at
all.  Specifically, by munging "$PBS_NODES" (which comes out correct) into
a host list containing the correct "slots=" counts.

But of course, now that I have compiled OpenMPI using "--with-tm", that
should not have been needed, and in fact it is now ignored by OpenMPI in a
Torque-PBS environment.

However, it seems that ever since "NUMA" support was added to the Torque
RPMs, it has caused the current problems, which are still continuing.  The
last action on the bug is a new EPEL "test" version (August 2017), which I
will try shortly.

Thank you for your help, though I am still open to suggestions for a
replacement.

  Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
 --------------------------------------------------------------------------
   Encryption... is a powerful defensive weapon for free people.
   It offers a technical guarantee of privacy, regardless of who is
   running the government... It's hard to think of a more powerful,
   less dangerous tool for liberty.                    -- Esther Dyson
 --------------------------------------------------------------------------

On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au>
wrote:

> Thank you Gilles.  At least I now have something to follow through with.
>
> As a FYI, the torque is the pre-built version from the Redhat Extras
> (EPEL) archive:
>
>     torque-4.2.10-10.el7.x86_64
>
> Normally pre-built packages have no problems, but not in this case.
>
> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
>> Anthony,
>>
>> we had a similar issue reported some time ago (e.g. Open MPI ignores
>> torque allocation), and after quite some troubleshooting, we ended up
>> with the same behavior (e.g. pbsdsh is not working as expected).
>>
>> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
>> for the last email.
>>
>> From an Open MPI point of view, I would consider the root cause is with
>> your torque install.
>>
>> This case was reported at
>> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
>> and no conclusion was reached.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
>>
>>> The stdin and stdout are saved to separate channels.
>>>
>>> It is interesting that the output from pbsdsh is node21.emperor 5
>>> times, even though $PBS_NODES is the 5 individual nodes.
>>>
>>> Attached are the two compressed files, as well as the pbs_hello batch
>>> used.
>>>
>>> Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>> -----------------------------------------------------------------------
>>>   There are two types of encryption:
>>>     One that will prevent your sister from reading your diary, and
>>>     One that will prevent your government.        -- Bruce Schneier
>>> -----------------------------------------------------------------------
>>>
>>> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>>> wrote:
>>>
>>>     Anthony,
>>>
>>>     in your script, can you
>>>
>>>         set -x
>>>         env
>>>         pbsdsh hostname
>>>         mpirun --display-map --display-allocation \
>>>             --mca ess_base_verbose 10 --mca plm_base_verbose 10 \
>>>             --mca ras_base_verbose 10 hostname
>>>
>>>     and then compress and send the output ?
>>>
>>>     Cheers,
>>>
>>>     Gilles
>>>
>>>     On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
>>>
>>>         I noticed that too.  Though the submitting host for torque is a
>>>         different host (main head node, "shrek"), "node21" is the host
>>>         on which torque runs the batch script (and the mpirun command),
>>>         it being the first node in the "dualcore" resource group.
>>>
>>>         Adding option...
>>>
>>>         It fixed the hostname in the allocation map, though it had no
>>>         effect on the outcome.  The allocation is still simply ignored.
>>>
>>>         =======8<--------CUT HERE----------
>>>         PBS Job Number      9000
>>>         PBS batch run on    node21.emperor
>>>         Time it was started 2017-10-03_14:11:20
>>>         Current Directory   /net/shrek.emperor/home/shrek/anthony
>>>         Submitted work dir  /home/shrek/anthony/mpi-pbs
>>>         Number of Nodes     5
>>>         Nodefile List       /var/lib/torque/aux//9000.shrek.emperor
>>>         node21.emperor
>>>         node25.emperor
>>>         node24.emperor
>>>         node23.emperor
>>>         node22.emperor
>>>         ---------------------------------------
>>>
>>>         ====================   ALLOCATED NODES   ====================
>>>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         =============================================================
>>>         node21.emperor
>>>         node21.emperor
>>>         node21.emperor
>>>         node21.emperor
>>>         node21.emperor
>>>         =======8<--------CUT HERE----------
>>>
>>>         Anthony Thyssen ( System Programmer )  <a.thys...@griffith.edu.au>
>>>         ---------------------------------------------------------------
>>>          The equivalent of an armoured car should always be used to
>>>          protect any secret kept in a cardboard box.
>>>          -- Anthony Thyssen, On the use of Encryption
>>>         ---------------------------------------------------------------
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
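The "munging" workaround described at the top of the thread (needed before
the "--with-tm" rebuild) can be sketched roughly as below: collapse the
per-slot repeats in the Torque nodefile into an Open MPI host list with
"slots=" counts.  This is a minimal illustration, not the exact script used
in the thread; a temporary file with made-up hostnames stands in for the
real "$PBS_NODEFILE".

```shell
# Sketch: build an Open MPI hostfile with "slots=" counts from a Torque
# nodefile, which lists each host once per allocated slot.
# (Illustrative only: a temporary file stands in for "$PBS_NODEFILE".)
nodefile=$(mktemp)
cat > "$nodefile" <<'EOF'
node21.emperor
node21.emperor
node22.emperor
node22.emperor
node23.emperor
EOF

hostfile=$(mktemp)
# Count how many times each host appears and emit "host slots=N" lines.
sort "$nodefile" | uniq -c | awk '{ print $2 " slots=" $1 }' > "$hostfile"
cat "$hostfile"

# The job would then be launched with something like:
#   mpirun --hostfile "$hostfile" ./my_program
# With a --with-tm build of Open MPI none of this should be necessary,
# as mpirun queries the allocation directly from the TM API.
```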
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
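As a footnote on reading the job output quoted in the thread: the symptom
(five ranks all landing on node21.emperor despite a five-node allocation)
is easiest to spot by collapsing the repeated hostnames.  A small
simulation, with the hostnames hard-coded rather than piped from a live
mpirun:

```shell
# Simulated "mpirun ... hostname" output from the failing job above;
# in a real job, pipe the actual mpirun output instead.
observed='node21.emperor
node21.emperor
node21.emperor
node21.emperor
node21.emperor'

# Collapse repeats: a single line with count 5 means every rank ran on
# one host, i.e. the Torque allocation was ignored.  A healthy run would
# show five different hosts, each with count 1.
echo "$observed" | sort | uniq -c
```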