Re: [OMPI users] Forcing OpenMPI to use Ethernet interconnect instead of InfiniBand

2014-09-10 Thread Muhammad Ansar Javed
Yes, it is strange. I ran similar benchmarks a few months back in another
environment and achieved the expected results on both the Ethernet and
InfiniBand interconnects. However, I am unable to force Open MPI to use
Ethernet in this particular environment, even though openib is not
configured.

I have tried almost all the mpirun variants that should force Open MPI to
use Ethernet instead of InfiniBand. Moreover, verbose mode shows that the
TCP BTL module is being used, yet the latency is far better than any
plausible value for Ethernet.

--
Ansar
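
A quick way to double-check which interfaces the TCP BTL recognizes,
independent of any benchmark, is ompi_info. For the 1.8 series, something
like the following should list the TCP BTL parameters, including
btl_tcp_if_include:

  ompi_info --param btl tcp --level 9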


On Wed, Sep 10, 2014 at 3:43 AM, George Bosilca  wrote:

> This is strange. I have a similar environment with one eth and one ipoib
> interface. If I manually select the interface I want to use with TCP, I
> get the expected results.
>
>
> Here is over IB:
>
> mpirun -np 2 --mca btl tcp,self -host dancer00,dancer01 --mca
> btl_tcp_if_include ib1 ./NPmpi
> 1: dancer01
> 0: dancer00
> Now starting the main loop
>   0:   1 bytes   3093 times -->  0.24 Mbps in  31.39 usec
>   1:   2 bytes   3185 times -->  0.49 Mbps in  31.30 usec
>   2:   3 bytes   3195 times -->  0.73 Mbps in  31.41 usec
>   3:   4 bytes   2122 times -->  0.97 Mbps in  31.39 usec
>
>
> And here the slightly slower eth0:
>
> mpirun -np 2 --mca btl tcp,self -host dancer00,dancer01 --mca
> btl_tcp_if_include eth0 ./NPmpi
> 0: dancer00
> 1: dancer01
> Now starting the main loop
>   0:   1 bytes   1335 times -->  0.13 Mbps in  60.55 usec
>   1:   2 bytes   1651 times -->  0.28 Mbps in  53.62 usec
>   2:   3 bytes   1864 times -->  0.45 Mbps in  51.29 usec
>   3:   4 bytes   1299 times -->  0.61 Mbps in  50.36 usec
>
>
> George.
>
> On Wed, Sep 10, 2014 at 3:40 AM, Muhammad Ansar Javed <
> muhammad.an...@seecs.edu.pk> wrote:
>
>> Thanks, George. I am selecting the Ethernet device (em1) in the mpirun
>> script.
>>
>> Here is the ifconfig output:
>> em1   Link encap:Ethernet  HWaddr E0:DB:55:FD:38:46
>>   inet addr:10.30.10.121  Bcast:10.30.255.255  Mask:255.255.0.0
>>   inet6 addr: fe80::e2db:55ff:fefd:3846/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>   RX packets:1537270190 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:136123598 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:1000
>>   RX bytes:309333740659 (288.0 GiB)  TX bytes:143480101212 (133.6
>> GiB)
>>   Memory:9182-9184
>>
>> Ifconfig uses the ioctl access method to get the full address
>> information, which limits hardware addresses to 8 bytes.
>> Because Infiniband address has 20 bytes, only the first 8 bytes are
>> displayed correctly.
>> Ifconfig is obsolete! For replacement check ip.
>> ib0   Link encap:InfiniBand  HWaddr
>> 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>   inet addr:10.32.10.121  Bcast:10.32.255.255  Mask:255.255.0.0
>>   inet6 addr: fe80::211:7500:70:6ab4/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>>   RX packets:33621 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:365 errors:0 dropped:5 overruns:0 carrier:0
>>   collisions:0 txqueuelen:256
>>   RX bytes:1882728 (1.7 MiB)  TX bytes:21920 (21.4 KiB)
>>
>> loLink encap:Local Loopback
>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>   inet6 addr: ::1/128 Scope:Host
>>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>   RX packets:66889 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:66889 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:0
>>   RX bytes:19005445 (18.1 MiB)  TX bytes:19005445 (18.1 MiB)
>>
>>
>>
>>
>>
>>
>>> Date: Wed, 10 Sep 2014 00:06:51 +0900
>>> From: George Bosilca 
>>> To: Open MPI Users 
>>> Subject: Re: [OMPI users] Forcing OpenMPI to use Ethernet interconnect
>>> instead of InfiniBand
>>>
>>>
>>> Look at your ifconfig output and select the Ethernet device (instead of
>>> the IPoIB one). Traditionally the name lacks any fanciness; most
>>> distributions use eth0 as the default.
>>>
>>>   George.
>>>
>>>
>>> On Tue, Sep 9, 2014 at 11:24 PM, Muhammad Ansar Javed <
>>> muhammad.an...@seecs.edu.pk> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am currently conducting some testing on a system with Gigabit and
>>> > InfiniBand interconnects. Both the latency and bandwidth benchmarks
>>> > perform as expected on InfiniBand, but the Ethernet interconnect is
>>> > achieving far higher performance than expected; Ethernet and
>>> > InfiniBand are achieving equivalent performance.
>>> >
>>> > For some reason, it looks like Open MPI (v1.8.1) is either using the
>>> > InfiniBand interconnect rather than the Gigabit one, or the TCP
>>> > communication is being emulated over the InfiniBand interconnect.
>>> >
>>> > Here are Latency and Bandwidth benchmark results.

[OMPI users] [Error running] OpenMPI after the installation of Torque (PBS)

2014-09-10 Thread Red Red
Hi,


After the installation of Torque (PBS), when I start a simple program with
mpirun I get this result (I have already tried reinstalling):

[oxygen1:04280] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
[oxygen1:04281] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
[oxygen1:04278] tcp_peer_recv_connect_ack: invalid header type: -236847104
[oxygen1:04280] [[INVALID],INVALID] routed:binomial: Connection to
lifeline [[61922,0],0] lost
[oxygen1:04282] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61922,1],0]
  Exit code:1


Please help me, thank you in advance.

Carmine
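
A generic check worth running here (a suggestion, not from the thread):
make sure every node resolves the same Open MPI installation, since
mismatched mpirun/orted versions are a common cause of this kind of ORTE
startup failure. For example:

  which mpirun orted
  mpirun --version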


Re: [OMPI users] [Error running] OpenMPI after the installation of Torque (PBS)

2014-09-10 Thread Ralph Castain
What OMPI version?

On Sep 10, 2014, at 1:53 AM, Red Red  wrote:

> Hi,
> 
> 
> After the installation of Torque (PBS), when I start a simple program with
> mpirun I get this result (I have already tried reinstalling):
> 
> [oxygen1:04280] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
> [oxygen1:04281] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
> [oxygen1:04278] tcp_peer_recv_connect_ack: invalid header type: -236847104
> [oxygen1:04280] [[INVALID],INVALID] routed:binomial: Connection to lifeline 
> [[61922,0],0] lost
> [oxygen1:04282] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[61922,1],0]
>   Exit code:1
> 
> 
> Please help me, thank you in advance.
> 
> Carmine
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25303.php



Re: [OMPI users] [Error running] OpenMPI after the installation of Torque (PBS)

2014-09-10 Thread Red Red
This is the version: mpirun (Open MPI) 1.7.5a1r30774.

Thank you for your interest.


2014-09-10 10:41 GMT+01:00 Ralph Castain :

> What OMPI version?
>
> On Sep 10, 2014, at 1:53 AM, Red Red  wrote:
>
> Hi,
>
>
> After the installation of Torque (PBS), when I start a simple program with
> mpirun I get this result (I have already tried reinstalling):
>
> [oxygen1:04280] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
> [oxygen1:04281] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
> [oxygen1:04278] tcp_peer_recv_connect_ack: invalid header type:
> -236847104
> [oxygen1:04280] [[INVALID],INVALID] routed:binomial: Connection to
> lifeline [[61922,0],0] lost
> [oxygen1:04282] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
>
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[61922,1],0]
>   Exit code:1
>
>
> Please help me, thank you in advance.
>
> Carmine
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25303.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25304.php
>


Re: [OMPI users] [Error running] OpenMPI after the installation of Torque (PBS)

2014-09-10 Thread Jeff Squyres (jsquyres)
Can you send all the information here:

http://www.open-mpi.org/community/help/


On Sep 10, 2014, at 5:43 AM, Red Red  wrote:

> This is the version: mpirun (Open MPI) 1.7.5a1r30774.
> 
> Thank you for your interest.
> 
> 
> 2014-09-10 10:41 GMT+01:00 Ralph Castain :
> What OMPI version?
> 
> On Sep 10, 2014, at 1:53 AM, Red Red  wrote:
> 
>> Hi,
>> 
>> 
>> After the installation of Torque (PBS), when I start a simple program with
>> mpirun I get this result (I have already tried reinstalling):
>> 
>> [oxygen1:04280] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
>> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
>> [oxygen1:04281] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
>> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
>> [oxygen1:04278] tcp_peer_recv_connect_ack: invalid header type: -236847104
>> [oxygen1:04280] [[INVALID],INVALID] routed:binomial: Connection to lifeline 
>> [[61922,0],0] lost
>> [oxygen1:04282] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
>> ../../../../../../orte/mca/ess/env/ess_env_module.c at line 358
>> ---
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> ---
>> --
>> mpirun detected that one or more processes exited with non-zero status, thus 
>> causing
>> the job to be terminated. The first process to do so was:
>> 
>>   Process name: [[61922,1],0]
>>   Exit code:1
>> 
>> 
>> Please help me, thank you in advance.
>> 
>> Carmine
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/09/25303.php
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25304.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25305.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Forcing OpenMPI to use Ethernet interconnect instead of InfiniBand

2014-09-10 Thread Jeff Squyres (jsquyres)
Are you inadvertently using the MXM MTL?  That's an alternate Mellanox 
transport that may activate itself, even if you've disabled the openib BTL.  
Try this:

  mpirun --mca pml ob1 --mca btl ^openib ...

This forces the use of the ob1 PML (which forces the use of the BTLs, not the 
MTLs), and then disables the openib BTL.
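
Combined with the interface selection used earlier in this thread, the
whole command would look something like this (a sketch; em1 is the
poster's Ethernet device):

  mpirun -np 2 -machinefile machines -map-by node --mca pml ob1 --mca btl tcp,self,sm --mca btl_tcp_if_include em1 ./latency.ompi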


On Sep 9, 2014, at 10:24 AM, Muhammad Ansar Javed  
wrote:

> Hi,
> 
> I am currently conducting some testing on a system with Gigabit and
> InfiniBand interconnects. Both the latency and bandwidth benchmarks
> perform as expected on InfiniBand, but the Ethernet interconnect is
> achieving far higher performance than expected; Ethernet and InfiniBand
> are achieving equivalent performance.
> 
> For some reason, it looks like Open MPI (v1.8.1) is either using the
> InfiniBand interconnect rather than the Gigabit one, or the TCP
> communication is being emulated over the InfiniBand interconnect.
> 
> Here are Latency and Bandwidth benchmark results.
> #---
> # Benchmarking PingPong
> # processes = 2
> # map-by node
> #---
> 
> Hello, world.  I am 1 on node124
> Hello, world.  I am 0 on node123
> Size    Latency (usec)    Bandwidth (Mbps)
> 1       1.65              4.62
> 2       1.67              9.16
> 4       1.66              18.43
> 8       1.66              36.74
> 16      1.85              66.00
> 32      1.83              133.28
> 64      1.83              266.36
> 128     1.88              519.10
> 256     1.99              982.29
> 512     2.23              1752.37
> 1024    2.58              3026.98
> 2048    3.32              4710.76
> 
> I read some of the FAQs and noted that Open MPI prefers the fastest
> available interconnect. In an effort to force it to use the Gigabit
> interconnect, I ran it as follows:
> 
> 1. mpirun -np 2 -machinefile machines -map-by node --mca btl tcp --mca 
> btl_tcp_if_include em1 ./latency.ompi 
> 2. mpirun -np 2 -machinefile machines -map-by node --mca btl tcp,self,sm 
> --mca btl_tcp_if_include em1 ./latency.ompi
> 3. mpirun -np 2 -machinefile machines -map-by node --mca btl ^openib --mca 
> btl_tcp_if_include em1 ./latency.ompi
> 4. mpirun -np 2 -machinefile machines -map-by node --mca btl ^openib 
> ./latency.ompi
> 
> None of them resulted in a significantly different benchmark output. 
> 
> I am using Open MPI by loading a module in a clustered environment and
> don't have admin access. It is configured for both TCP and OpenIB
> (confirmed via ompi_info). After trying all of the above methods without
> success, I installed Open MPI v1.8.2 in my home directory and disabled
> openib with the following configure options:
> 
> --disable-openib-control-hdr-padding --disable-openib-dynamic-sl 
> --disable-openib-connectx-xrc --disable-openib-udcm --disable-openib-rdmacm  
> --disable-btl-openib-malloc-alignment  --disable-io-romio --without-openib 
> --without-verbs  
> 
> Now openib is not enabled (confirmed via ompi_info), and there is no
> "openib.so" file in the $prefix/lib/openmpi directory either. Still, the
> above mpirun commands get the same latency and bandwidth as InfiniBand.
> 
> I tried mpirun in verbose mode with the following command and got the
> output below.
> 
> Command: 
> mpirun -np 2 -machinefile machines -map-by node --mca btl tcp --mca 
> btl_base_verbose 30 --mca btl_tcp_if_include em1 ./latency.ompi 
>  
> Output:
> [node123.prv.sciama.cluster:88310] mca: base: components_register: 
> registering btl components
> [node123.prv.sciama.cluster:88310] mca: base: components_register: found 
> loaded component tcp
> [node123.prv.sciama.cluster:88310] mca: base: components_register: component 
> tcp register function successful
> [node123.prv.sciama.cluster:88310] mca: base: components_open: opening btl 
> components
> [node123.prv.sciama.cluster:88310] mca: base: components_open: found loaded 
> component tcp
> [node123.prv.sciama.cluster:88310] mca: base: components_open: component tcp 
> open function successful
> [node124.prv.sciama.cluster:90465] mca: base: components_register: 
> registering btl components
> [node124.prv.sciama.cluster:90465] mca: base: components_register: found 
> loaded component tcp
> [node124.prv.sciama.cluster:90465] mca: base: components_register: component 
> tcp register function successful
> [node124.prv.sciama.cluster:90465] mca: base: components_open: opening btl 
> components
> [node124.prv.sciama.cluster:90465] mca: base: components_open: found loaded 
> component tcp
> [node124.prv.sciama.cluster:90465] mca: base: components_open: component tcp 
> open function successful
> Hello, world.  I am 1 on node124
> Hello, world.  I am 0 on node123
> Size    Latency (usec)    Bandwidth (Mbps)
> 1       4.18              1.83
> 2       3.66              4.17
> 4       4.08              7.48
> 8       3.12              19.57
> 16      3.83              31.84
> 32      3.40              71.84
> 64      4.10              118.97
> 128     3.89              251.19
> 256     4.22              462.77
> 512     2.95              1325.71
> 1024    2.63              2969.49
> 2048    3.38              4628.29
> [node123.prv.sciama.cluster:88310] mca: base: close: component t
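
For reference, a minimal ping-pong of the shape these benchmarks report,
sketched against the Open MPI Java bindings (mpi.MPI) that also appear
later in this digest. The poster's actual latency.ompi is not shown and
may well be a C binary; the class and variable names here are purely
illustrative:

import mpi.*;

// Minimal round-trip latency probe between ranks 0 and 1.
public class PingPong {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        int reps = 1000;
        byte[] buf = new byte[1];      // message size under test

        exchange(rank, buf);           // untimed warm-up, excludes connection setup

        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++)
            exchange(rank, buf);
        double oneWayUsec = (System.nanoTime() - t0) / 1e3 / reps / 2.0;

        if (rank == 0)
            System.out.printf("%d bytes: %.2f usec one-way%n", buf.length, oneWayUsec);
        MPI.Finalize();
    }

    // One full round trip: rank 0 sends then receives; rank 1 mirrors it.
    static void exchange(int rank, byte[] buf) throws MPIException {
        if (rank == 0) {
            MPI.COMM_WORLD.send(buf, buf.length, MPI.BYTE, 1, 0);
            MPI.COMM_WORLD.recv(buf, buf.length, MPI.BYTE, 1, 0);
        } else if (rank == 1) {
            MPI.COMM_WORLD.recv(buf, buf.length, MPI.BYTE, 0, 0);
            MPI.COMM_WORLD.send(buf, buf.length, MPI.BYTE, 0, 0);
        }
    }
}

Compiled with mpijavac and launched across two nodes (e.g. mpirun -np 2
-machinefile machines -map-by node java PingPong), a run like this makes
it easy to compare interfaces by swapping btl_tcp_if_include values.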

[OMPI users] still SIGSEGV for Java and openmpi-1.8.3a1r32692 on Solaris

2014-09-10 Thread Siegmar Gross
Hi,

today I installed openmpi-1.8.3a1r32692 on my machines (Solaris
10 Sparc (tyr), Solaris 10 x86_64 (sunpc1), and openSUSE Linux 12.1
x86_64 (linpc1)) with Sun C 5.12 and gcc-4.9.0.

I still get a segmentation fault for my small Java program on Solaris.

tyr java 102 ompi_info | grep -e MPI: -e "C compiler:"
Open MPI: 1.8.3a1r32692
  C compiler: cc
tyr java 103 mpijavac InitFinalizeMain.java 
tyr java 104 mpiexec -np 1 java InitFinalizeMain
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=1860, tid=2
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode solaris-sparc 
compressed oops)
# Problematic frame:
# C  [libc.so.1+0x3c7f0]  strlen+0x50
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid1860.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--
mpiexec noticed that process rank 0 with PID 1860 on node tyr exited on signal 
6 
(Abort).
--
tyr java 105 


I have the same problem with my gcc-version.

tyr java 112 ompi_info | grep -e MPI: -e "C compiler:"
Open MPI: 1.8.3a1r32692
  C compiler: gcc
tyr java 113 mpijavac InitFinalizeMain.java 
tyr java 114 mpiexec -np 1 java InitFinalizeMain
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=2489, tid=2
...


Can I provide anything else to help you solve the problem?

Kind regards

Siegmar
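
Siegmar's InitFinalizeMain.java is not shown in this digest, but a minimal
init/finalize test of the kind he describes would look something like the
sketch below (assuming the standard Open MPI Java bindings). Since the
reported crash happens in native code during startup, even this much
should reproduce it:

import mpi.*;

public class InitFinalizeMain {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);    // the SIGSEGV reportedly fires in native code around here
        System.out.println("Hello from rank " + MPI.COMM_WORLD.getRank());
        MPI.Finalize();
    }
}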



Re: [OMPI users] still SIGSEGV for Java and openmpi-1.8.3a1r32692 on Solaris

2014-09-10 Thread Ralph Castain
We're working on the memory alignment issues in the trunk, and the fixes
are being scheduled to come across as we go.

On Sep 10, 2014, at 9:08 AM, Siegmar Gross 
 wrote:

> Hi,
> 
> today I installed openmpi-1.8.3a1r32692 on my machines (Solaris
> 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1), and openSUSE Linux 12.1
> x86_64 (linpc1)) with Sun C 5.12 and gcc-4.9.0.
> 
> I still get a segmentation fault for my small Java program on Solaris.
> 
> tyr java 102 ompi_info | grep -e MPI: -e "C compiler:"
>Open MPI: 1.8.3a1r32692
>  C compiler: cc
> tyr java 103 mpijavac InitFinalizeMain.java 
> tyr java 104 mpiexec -np 1 java InitFinalizeMain
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=1860, tid=2
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode 
> solaris-sparc 
> compressed oops)
> # Problematic frame:
> # C  [libc.so.1+0x3c7f0]  strlen+0x50
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid1860.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> --
> mpiexec noticed that process rank 0 with PID 1860 on node tyr exited on 
> signal 6 
> (Abort).
> --
> tyr java 105 
> 
> 
> I have the same problem with my gcc-version.
> 
> tyr java 112 ompi_info | grep -e MPI: -e "C compiler:"
>Open MPI: 1.8.3a1r32692
>  C compiler: gcc
> tyr java 113 mpijavac InitFinalizeMain.java 
> tyr java 114 mpiexec -np 1 java InitFinalizeMain
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=2489, tid=2
> ...
> 
> 
> Can I provide anything else, so that you can solve the problem?
> 
> Kind regards
> 
> Siegmar
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25308.php