FYI...

The problem is discussed further in

Redhat Bugzilla: Bug 1321154 - numa enabled torque don't work
   https://bugzilla.redhat.com/show_bug.cgi?id=1321154

I'd seen this previously, as it required me to add "num_node_boards=1" to
each node entry in /var/lib/torque/server_priv/nodes just to get torque to
work at all.
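
For reference, each node entry in my nodes file now looks something like
this (the "np=2" core count being what I use for the "dualcore" nodes):

    node21.emperor np=2 num_node_boards=1
    node22.emperor np=2 num_node_boards=1
    ...
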
On the Open MPI side, I had gotten jobs running by munging $PBS_NODEFILE
(which comes out correct) into a host list containing the correct "slots="
counts.  But of course, now that I have compiled OpenMPI using "--with-tm",
that workaround should not have been needed; indeed, such a list is ignored
by OpenMPI in a Torque-PBS environment.
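
In case it is useful, the munging was essentially the following (a sketch
from memory; "my_mpi_prog" stands in for the real program):

    # $PBS_NODEFILE lists one line per allocated slot, so counting the
    # duplicate hostnames gives the per-host slot counts, which can then
    # be written out as an Open MPI hostfile.
    sort "$PBS_NODEFILE" | uniq -c |
        awk '{print $2 " slots=" $1}' > "hostfile.$PBS_JOBID"
    mpirun --hostfile "hostfile.$PBS_JOBID" ./my_mpi_prog

With a "--with-tm" build none of that should be necessary, and running
"ompi_info | grep tm" is a quick way to check that the tm modules are
actually present in a build.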

However, it seems that ever since "NUMA" support was added to the Torque
RPMs, it has been causing the current problems, which are still
continuing.  The last action on the bug is a new EPEL "test" version
(August 2017), which I will try shortly.
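
Presumably picking that up is just a matter of enabling the test
repository, something like:

    # pull the newer torque packages from epel-testing
    # (assuming that is where the August 2017 test build is staged)
    yum --enablerepo=epel-testing update 'torque*'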

Thank you for your help, though I am still open to suggestions for a
replacement.

  Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
 --------------------------------------------------------------------------
   Encryption... is a powerful defensive weapon for free people.
   It offers a technical guarantee of privacy, regardless of who is
   running the government... It's hard to think of a more powerful,
   less dangerous tool for liberty.            --  Esther Dyson
 --------------------------------------------------------------------------



On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au>
wrote:

> Thank you Gilles.  At least I now have something to follow through with.
>
> As an FYI, the torque is the pre-built version from the Red Hat Extras
> (EPEL) archive: torque-4.2.10-10.el7.x86_64
>
> Normally pre-built packages have no problems, but not in this case.
>
>
>
>
> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
>> Anthony,
>>
>>
>> we had a similar issue reported some time ago (e.g. "Open MPI ignores
>> torque allocation"),
>>
>> and after quite some troubleshooting, we ended up with the same behavior
>> (e.g. pbsdsh is not working as expected).
>>
>> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
>> for the last email.
>>
>>
>> from an Open MPI point of view, I would consider the root cause to be in
>> your torque install.
>>
>> this case was reported at
>> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
>>
>> and no conclusion was reached.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
>>
>>> The stdout and stderr are saved to separate channels.
>>>
>>> It is interesting that the output from pbsdsh is "node21.emperor" 5 times,
>>> even though $PBS_NODEFILE lists the 5 individual nodes.
>>>
>>> Attached are the two compressed files, as well as the pbs_hello batch
>>> used.
>>>
>>> Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>>  --------------------------------------------------------------------------
>>>   There are two types of encryption:
>>>     One that will prevent your sister from reading your diary, and
>>>     One that will prevent your government.           -- Bruce Schneier
>>>  --------------------------------------------------------------------------
>>>
>>>
>>>
>>>
>>> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp
>>> <mailto:gil...@rist.or.jp>> wrote:
>>>
>>>     Anthony,
>>>
>>>
>>>     in your script, can you
>>>
>>>
>>>     set -x
>>>
>>>     env
>>>
>>>     pbsdsh hostname
>>>
>>>     mpirun --display-map --display-allocation --mca ess_base_verbose
>>>     10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>>>
>>>
>>>     and then compress and send the output?
>>>
>>>
>>>     Cheers,
>>>
>>>
>>>     Gilles
>>>
>>>
>>>     On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
>>>
>>>         I noticed that too.  Though the submitting host for torque is
>>>         a different host (the main head node, "shrek"), "node21" is the
>>>         host on which torque runs the batch script (and the mpirun
>>>         command), it being the first node in the "dualcore" resource
>>>         group.
>>>
>>>         Adding option...
>>>
>>>         It fixed the hostname in the allocation map, though it had no
>>>         effect on the outcome.  The allocation is still simply ignored.
>>>
>>>         =======8<--------CUT HERE----------
>>>         PBS Job Number       9000
>>>         PBS batch run on     node21.emperor
>>>         Time it was started  2017-10-03_14:11:20
>>>         Current Directory    /net/shrek.emperor/home/shrek/anthony
>>>         Submitted work dir   /home/shrek/anthony/mpi-pbs
>>>         Number of Nodes      5
>>>         Nodefile List       /var/lib/torque/aux//9000.shrek.emperor
>>>         node21.emperor
>>>         node25.emperor
>>>         node24.emperor
>>>         node23.emperor
>>>         node22.emperor
>>>         ---------------------------------------
>>>
>>>         ======================  ALLOCATED NODES  ======================
>>>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>         =================================================================
>>>         node21.emperor
>>>         node21.emperor
>>>         node21.emperor
>>>         node21.emperor
>>>         node21.emperor
>>>         =======8<--------CUT HERE----------
>>>
>>>
>>>           Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>>          --------------------------------------------------------------------------
>>>            The equivalent of an armoured car should always be used to
>>>            protect any secret kept in a cardboard box.
>>>            -- Anthony Thyssen, On the use of Encryption
>>>          --------------------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
