Re: [OMPI users] -host vs -hostfile

2017-08-03 Thread Mahmood Naderan
Well, it seems that the default Rocks-openmpi takes precedence on the systems. So, for
the moment, I will stick with that version, which is 1.6.5 and uses -machinefile.
I will debug later to see why 2.0.1 doesn't work.

Thanks.

Regards,
Mahmood



On Tue, Aug 1, 2017 at 12:30 AM, Gus Correa wrote:

> Maybe something is wrong with the Torque installation?
> Or perhaps with the Open MPI + Torque integration?
>
> 1) Make sure your Open MPI was configured and compiled with the
> Torque "tm" library of your Torque installation.
> In other words:
>
> configure --with-tm=/path/to/your/Torque/tm_library ...
>
> 2) Check if your $TORQUE/server_priv/nodes file has all the nodes
> in your cluster.  If not, edit the file and add the missing nodes.
> Then restart the Torque server (service pbs_server restart).
>
> 3) Run "pbsnodes" to see if all nodes are listed.
>
> 4) Run "hostname" with mpirun in a short Torque script:
>
> #PBS -l nodes=4:ppn=1
> ...
> mpirun hostname
>
> The output should show all four nodes (a combined sketch of steps 1 and 4 follows below).
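As a minimal, untested sketch of steps 1 and 4 above, assuming the Open MPI prefix used
elsewhere in this thread and a hypothetical Torque prefix of /opt/torque (substitute the
actual paths on your cluster):

# Step 1: build Open MPI against the Torque tm library (run from the Open MPI
# 2.0.1 source directory; /opt/torque is an assumed Torque install prefix).
./configure --prefix=/share/apps/computer/openmpi-2.0.1 --with-tm=/opt/torque
make -j 8 && make install
# If tm support was built, ompi_info should list tm components such as "ras: tm".
/share/apps/computer/openmpi-2.0.1/bin/ompi_info | grep tm

# Step 4: minimal Torque script that should print all four hostnames.
cat > tm_test.sh <<'EOF'
#!/bin/bash
#PBS -l nodes=4:ppn=1
#PBS -N tm_test
cd $PBS_O_WORKDIR
/share/apps/computer/openmpi-2.0.1/bin/mpirun hostname
EOF
qsub tm_test.sh

If ompi_info does not list any tm components, mpirun has no way to read the Torque
allocation, which would explain the all-ranks-on-one-node behavior in the job output
further down.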
>
> Good luck!
> Gus Correa
>
> On 07/31/2017 02:41 PM, Mahmood Naderan wrote:
>
>> Well, it is confusing! As you can see, I added four nodes to the host
>> file (the same nodes are used by PBS). Running with --map-by ppr:1:node works well.
>> However, the PBS resource request (nodes=4:ppn=1) is not honored.
>>
>> mahmood@cluster:mpitest$ /share/apps/computer/openmpi-2.0.1/bin/mpirun
>> -hostfile hosts --map-by ppr:1:node a.out
>> ****************************************************************************
>> * hwloc 1.11.2 has encountered what looks like an error from the
>> operating system.
>> *
>> * Package (P#1 cpuset 0x) intersects with NUMANode (P#1 cpuset
>> 0xff00) without inclusion!
>> * Error occurred in topology.c line 1048
>> *
>> * The following FAQ entry in the hwloc documentation may help:
>> *   What should I do when hwloc reports "operating system" warnings?
>> * Otherwise please report this error message to the hwloc user's mailing
>> list,
>> * along with the output+tarball generated by the hwloc-gather-topology
>> script.
>> ****************************************************************************
>> Hello world from processor cluster.hpc.org, rank 0 out of 4 processors
>> Hello world from processor compute-0-0.local, rank 1 out of 4 processors
>> Hello world from processor compute-0-1.local, rank 2 out of 4 processors
>> Hello world from processor compute-0-2.local, rank 3 out of 4 processors
>> mahmood@cluster:mpitest$ cat mmt.sh
>> #!/bin/bash
>> #PBS -V
>> #PBS -q default
>> #PBS -j oe
>> #PBS -l  nodes=4:ppn=1
>> #PBS -N job1
>> #PBS -o .
>> cd $PBS_O_WORKDIR
>> /share/apps/computer/openmpi-2.0.1/bin/mpirun a.out
>> mahmood@cluster:mpitest$ qsub mmt.sh
>> 6428.cluster.hpc.org 
>>
>> mahmood@cluster:mpitest$ cat job1.o6428
>> Hello world from processor compute-0-1.local, rank 0 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 2 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 3 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 4 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 5 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 6 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 8 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 9 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 12 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 15 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 16 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 18 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 19 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 20 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 21 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 22 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 24 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 26 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 27 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 28 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 29 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 30 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 31 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 7 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 10 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 14 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 1 out of 32 processors
>> Hello world from processor compute-0-1.local, rank 11 out of 32 processors
>> Hello 
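Until a tm-enabled 2.0.1 build is in place, one possible workaround is sketched below:
have the job script hand Torque's own node list ($PBS_NODEFILE) to mpirun, reusing the
--map-by ppr:1:node option that already worked interactively above. This is an untested
sketch, not a substitute for rebuilding with --with-tm, and since it bypasses the tm
interface the remote ranks are typically launched over ssh rather than under Torque's
control.

#!/bin/bash
#PBS -V
#PBS -q default
#PBS -j oe
#PBS -l nodes=4:ppn=1
#PBS -N job1
#PBS -o .
cd $PBS_O_WORKDIR
# Without tm support, mpirun cannot see the Torque allocation, so pass it the
# node list Torque writes for this job and place one rank per node.
/share/apps/computer/openmpi-2.0.1/bin/mpirun -hostfile $PBS_NODEFILE --map-by ppr:1:node a.out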

Re: [OMPI users] -host vs -hostfile

2017-08-03 Thread Gilles Gouaillardet
Mahmood,

you might want to have a look at OpenHPC (which comes with a recent Open MPI)

Cheers,

Gilles

On Thu, Aug 3, 2017 at 9:48 PM, Mahmood Naderan wrote:
> Well, it seems that the default Rocks-openmpi takes precedence on the systems. So, for
> the moment, I will stick with that version, which is 1.6.5 and uses -machinefile.
> I will debug later to see why 2.0.1 doesn't work.
>
> Thanks.
>
> Regards,
> Mahmood