Re: [OMPI users] Help: OpenMPI Compilation in Raspberry Pi

2013-01-17 Thread Lee Eric
Hi all, sorry to reply to this thread so late. I tried it and it works well.
However, it takes about 12 hours to compile the whole package natively, so I'm
going to cross-compile on my laptop with a proper toolchain I created. Here's
the command line I used.

./configure --build=x86_64-redhat-linux
--host=arm-unknown-linux-gnueabi CFLAGS="-Ofast -mfpu=vfp
-mfloat-abi=hard -march=armv6zk -mtune=arm1176jzf-s"

..

*** Assembler
checking dependency style of arm-unknown-linux-gnueabi-gcc... gcc3
checking for BSD- or MS-compatible name lister (nm)...
/home/huli/Projects/arm-devel/bin/arm-unknown-linux-gnueabi-nm -B
checking the name lister
(/home/huli/Projects/arm-devel/bin/arm-unknown-linux-gnueabi-nm -B)
interface... BSD nm
checking for fgrep... /bin/grep -F
checking if need to remove -g from CCASFLAGS... no
checking whether to enable smp locks... yes
checking if .proc/endp is needed... no
checking directive for setting text section... .text
checking directive for exporting symbols... .globl
checking for objdump... objdump
checking if .note.GNU-stack is needed... yes
checking suffix for labels... :
checking prefix for global symbol labels...
checking prefix for lsym labels... .L
checking prefix for function in .type... #
checking if .size is needed... yes
checking if .align directive takes logarithmic value... yes
configure: error: No atomic primitives available for arm-unknown-linux-gnueabi

..

Do we have any way to fix that?

Thanks.

On Sat, Jan 12, 2013 at 3:14 AM, Jeff Squyres (jsquyres)
 wrote:
> Ok, I was able to configure and run successfully on my Raspberry Pi with:
>
> ./configure CCASFLAGS=-march=armv7-a ...
>
> Is that something we should put on a FAQ page?
>
>
>
> On Jan 11, 2013, at 7:11 AM, George Bosilca  wrote:
>
>> This one belongs to the ARMv7 instruction set. Please try one of the 
>> following: `armv7', `armv7-a', `armv7-r'.
>>
>>  George.
>>
>>
>> On Jan 11, 2013, at 00:38 , Jeff Squyres (jsquyres)  
>> wrote:
>>
>>> Sadly, none of these solutions worked for me on my RPi:
>>>
>>> -
>>> pi@raspberrypi ~/openmpi-1.6.3/opal/asm $ make CCASFLAGS=-mcpu=arm1176jzf-s
>>> CPPAS  atomic-asm.lo
>>> atomic-asm.S: Assembler messages:
>>> atomic-asm.S:7: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:15: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:23: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:55: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:70: Error: selected processor does not support ARM mode `dmb'
>>> make: *** [atomic-asm.lo] Error 1
>>> pi@raspberrypi ~/openmpi-1.6.3/opal/asm $ make CCASFLAGS=-march=armv6zk
>>> CPPAS  atomic-asm.lo
>>> atomic-asm.S: Assembler messages:
>>> atomic-asm.S:7: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:15: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:23: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:55: Error: selected processor does not support ARM mode `dmb'
>>> atomic-asm.S:70: Error: selected processor does not support ARM mode `dmb'
>>> make: *** [atomic-asm.lo] Error 1
>>> pi@raspberrypi ~/openmpi-1.6.3/opal/asm $ make CCASFLAGS=-march=argv6k
>>> CPPAS  atomic-asm.lo
>>> cc1: error: bad value (argv6k) for -march switch
>>> make: *** [atomic-asm.lo] Error 1
>>> pi@raspberrypi ~/openmpi-1.6.3/opal/asm $
>>> -
>>>
>>> Although I'm using a slightly different system than the one the original 
>>> user cited (I'm running the latest Raspbian distro):
>>>
>>> -
>>> pi@raspberrypi ~/openmpi-1.6.3/opal/asm $ uname -a
>>> Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l 
>>> GNU/Linux
>>> pi@raspberrypi ~/openmpi-1.6.3/opal/asm $ gcc --version
>>> gcc (Debian 4.6.3-12+rpi1) 4.6.3
>>> -
>>>
>>> On Jan 10, 2013, at 5:39 PM, George Bosilca 
>>> wrote:
>>>
 A little bit of googling shows that this is a known issue. ldrex and strex 
 are not included in the default instruction set gcc uses (armv6). One has 
 to add the compile flag "-march=argv6k" to compile successfully.

 George.

 PS: For more info: 
 http://www.raspberrypi.org/phpBB3/viewtopic.php?f=9&t=4256&start=250


 On Jan 10, 2013, at 16:20 , Jeff Squyres (jsquyres)  
 wrote:

> Mmmm.  Let's rope in our ARM expert here...
>
> Leif, do you know what the issue is here?
>
>
> On Jan 3, 2013, at 4:28 AM, Lee Eric  wrote:
>
>> Hi,
>>
>> I am going to compile OpenMPI 1.6.3 in Raspberry Pi and encounter 
>> following errors.
>>
>> make[2]: Entering directory `/root/openmpi-1.6.3/opal'
>> CC class/opal_bitmap.lo
>> CC class/opal_free_list.lo
>> CC class/opal_hash_table.lo
>> CC class/opal_list.lo
>> CC class/opal_object.lo
>> /tmp/ccniCtj0.s: Assembler messages:
>> /tmp/ccniCtj0.s:83: Error: selected processor does not support ARM mode 
>> `ldrex r3,[r1]

Re: [OMPI users] Error running program : mca_oob_tcp_msg_send_handler: writev:failed: Bad file descriptor

2013-01-17 Thread borja mf
Sorry! I deleted the earlier mails, so I have to post a new one.

I stopped iptables on the three nodes. Ping is working OK
(pruebaborja to clienteprueba / clienteprueba to pruebaborja).

My /etc/networks/interfaces - node:

pruebaborja Masternode
#The loopback network interface
auto lo
iface lo inet loopback
#The primary network interface
auto eth0
iface eth0 inet dhcp

clienteprueba and clientepruebados
auto lo
ifface lo inet loopback

My interface is auto (eth0) on the three nodes.
Do you want to see the "ifconfig" output as well?
Thank you again for your answer.


[OMPI users] Possible memory leak(s) in OpenMPI 1.6.3?

2013-01-17 Thread Victor Vysotskiy
Dear Developers,

I am running into memory problems when frequently creating and freeing an MPI 
window and its memory. Below is a sample code that reproduces the problem:

/* C Example */
#include <stdio.h>
#include <mpi.h>
#define NEL    8
#define NTIMES 100

int main (int argc, char *argv[]) {
  int       i;
  double    w[NEL];
  MPI_Aint  win_size, warr_size;
  MPI_Win  *win;

  win_size  = sizeof(MPI_Win);
  warr_size = sizeof(MPI_DOUBLE)*NEL;

  MPI_Init (&argc, &argv);

  for (i = 0; i < NTIMES; i++) {

Attachment: massif.out.15028


Re: [OMPI users] help me understand these error msgs

2013-01-17 Thread Jure Pečar
On Wed, 16 Jan 2013 07:46:41 -0800
Ralph Castain  wrote:

> This one means that a backend node lost its connection to mpirun. We use a 
> TCP socket between the daemon on a node and mpirun to launch the processes 
> and to detect if/when that node fails for some reason.

Hm. And what would be the reasons for this? Too much load on the node where 
mpirun is running?

-- 

Jure Pečar
http://jure.pecar.org



[OMPI users] OMPI 1.6.3, InfiniBand and MTL MXM; unable to make it work!

2013-01-17 Thread Francesco Simula
I tried building from the OMPI 1.6.3 tarball with the following 
./configure invocation:
./configure 
--prefix=/apotto/home1/homedirs/fsimula/Lavoro/openmpi-1.6.3/install/ \

--disable-mpi-io \
--disable-io-romio \
--enable-dependency-tracking \
--without-slurm \
--with-platform=optimized \
--disable-mpi-f77 \
--disable-mpi-f90 \
--with-openib \
--disable-static \
--enable-shared \
--disable-vt \
--enable-pty-support \
--enable-mca-no-build=btl-ofud,pml-bfo \
--with-mxm=/opt/mellanox/mxm \
--with-mxm-libdir=/opt/mellanox/mxm/lib

As you can see from the last two lines, I want to enable the MXM 
transport layer on a cluster made of SuperMicro X8DTG-D boards with dual 
Xeons and Mellanox MT26428 HCAs; the OS is CentOS 5.8.


I tried two different .rpm's for MXM: first 
'mxm-1.1.ad085ef-1.x86_64-centos5u7.rpm', taken from here:

http://www.mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar

and 'mxm-1.5.f583875-1.x86_64-centos5u7.rpm' taken from here:
http://www.mellanox.com/downloads/hpc/mxm/v1.5/mxm-latest.tar

With both, even though the compilation concludes successfully, a simple 
test (osu_bw from the OSU Micro-Benchmarks 3.8) fails with the sort of 
message reported below; the lines:


rdma_dev.c:122  MXM DEBUG Port 1 on mlx4_0 has a link layer different 
from IB. Skipping it
rdma_dev.c:155  MXM ERROR An active IB port on a Mellanox device, with 
lid [any] gid [any] not found


make it seem like it cannot access the HW for the HCA: is that so? The 
very same test works when using '-mca pml ob1' (thus using the openib 
BTL).


I'm quite ready to start pulling my hair out; any suggestions?

The output of /usr/bin/ibv_devinfo for the two cluster nodes follows:
[cut]
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.7.000
node_guid:  0025:90ff:ff07:0ac4
sys_image_guid: 0025:90ff:ff07:0ac7
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   SM_106101000
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 4
port_lid:   6
port_lmc:   0x00
[/cut]

[cut]
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.7.000
node_guid:  0025:90ff:ff07:0acc
sys_image_guid: 0025:90ff:ff07:0acf
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   SM_106101000
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 4
port_lid:   8
port_lmc:   0x00
[/cut]

The complete output of the failing test follows:

[fsimula@agape5 osu-micro-benchmarks-3.8]$ mpirun -x MXM_LOG_LEVEL=poll 
-mca pml cm -mca mtl_mxm_np 1 -np 2 -host agape4,agape5 
install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H

[1358430343.266782] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
[1358430343.266815] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_HANDLE_ERRORS=bt
[1358430343.266826] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_GDB_PATH=/usr/bin/gdb
[1358430343.266838] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_DUMP_SIGNO=1
[1358430343.266851] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_DUMP_LEVEL=conn
[1358430343.266924] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_ASYNC_MODE=THREAD
[1358430343.266936] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_TIME_ACCURACY=0.1
[1358430343.266956] [agape5:8596 :0] config_parser.c:168  MXM DEBUG 
default: MXM_PTLS=self,shm,rdma
[1358430343.267249] [agape5:8596 :0] mpool.c:265  MXM DEBUG mpool 
'ptl_self_recv_ev': allocated chunk 0xc075f40 of 96016 bytes with 1000 
elements
[1358430343.267308] [agape5:8596 :0] mpool.c:156  MXM DEBUG mpool 
'ptl_self_recv_ev': align 16, maxelems 1000, elemsize 88, padding 8
[1358430343.267316] [agape5:8596 :0]  self.c:410  MXM DEBUG Created 
ptl_self
[1358430343.267333] [agape5:8596 :0]   shm_ptl.c:56   MXM DEBUG Created 
ptl_shm
[1358430343.268457] [agape5:8596 :0]  rdma_ptl.c:65   MXM TRACE Got 1 
IB devices
[1358430343.268640] [agape5:8596 :0]  rdma_ptl.c:112  MXM DEBUG added 
device mlx4_0
[1358430343

Re: [OMPI users] help me understand these error msgs

2013-01-17 Thread Ralph Castain

On Jan 17, 2013, at 2:25 AM, Jure Pečar  wrote:

> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain  wrote:
> 
>> This one means that a backend node lost its connection to mpirun. We use a 
>> TCP socket between the daemon on a node and mpirun to launch the processes 
>> and to detect if/when that node fails for some reason.
> 
> Hm. And what would be the reasons for this? Too much load on node where 
> mpirun is run?

No, the error means the connection was completely lost - i.e., the socket was 
closed. Do I understand correctly that the job runs for awhile and then dies? 
So there are processes executing on the node that reports a lost connection?

Or is this happening on startup of the larger job, or during a call to 
MPI_Comm_spawn?


> 
> -- 
> 
> Jure Pečar
> http://jure.pecar.org
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error running program : mca_oob_tcp_msg_send_handler: writev:failed: Bad file descriptor

2013-01-17 Thread Ralph Castain
Configure OMPI with --enable-debug, and then run

mpirun -n 1 -host clienteprueba -mca plm_base_verbose 5 hostname

You should see a daemon getting launched and successfully reporting back to 
mpirun, and then the application getting launched on the remote node.


On Jan 17, 2013, at 1:25 AM, borja mf  wrote:

> Sorry! I removed the mails so I have to post another one.
> 
> I stopped the iptables on the three nodes. Ping it's working OK (pruebaborja 
> to clienteprueba / clienteprueba to pruebaborja).
>  
> My /etc/networks/interfaces - node:
> 
> pruebaborja Masternode
> #The loopback network interface
> auto lo
> iface lo inet loopback
> #The primary network interface
> auto eth0
> iface eth0 inet dhcp
> 
> clienteprueba and clientepruebados
> auto lo
> ifface lo inet loopback
> 
> My interface is Auto (eth0) on the three nodes.
> Do you want to see "ifconfig" also? 
> Thank you again or answer 




Re: [OMPI users] Problem with mpirun for java codes

2013-01-17 Thread Ralph Castain
Just as an FYI: we have removed the Java bindings from the 1.7.0 release due to 
all the reported errors - looks like that code just isn't ready yet for 
release. It remains available on the nightly snapshots of the developer's trunk 
while we continue to debug it.

With that said, I tried your example using the current developer's trunk - and 
it worked just fine.

I ran it on a single node, however. Were you running this across multiple 
nodes? Is it possible that the "classes" directory wasn't available on the 
remote node?


On Jan 16, 2013, at 4:17 PM, Karos Lotfifar  wrote:

> Hi, 
> The version that I am using is 
> 
> 1.7rc6 (pre-release)
> 
> 
> Regards,
> Karos
> 
> On 16 Jan 2013, at 21:07, Ralph Castain  wrote:
> 
>> Which version of OMPI are you using?
>> 
>> 
>> On Jan 16, 2013, at 11:43 AM, Karos Lotfifar  wrote:
>> 
>>> Hi,
>>> 
>>> I am still struggling with installation problems! Everything is fine when 
>>> I run Open MPI with C codes, but when I try to run a simple Java code I 
>>> get a very strange error. The code is as simple as the following and I 
>>> cannot get it running:
>>> 
>>> import mpi.*;
>>> 
>>> class JavaMPI {
>>>   public static void main(String[] args) throws MPIException {
>>> MPI.Init(args);
>>> System.out.println("Hello world from rank " + 
>>>   MPI.COMM_WORLD.Rank() + " of " +
>>>   MPI.COMM_WORLD.Size() );
>>> MPI.Finalize();
>>>   }
>>> } 
>>> 
>>> Everything is OK with mpijavac, my Java code, etc. When I try to run the 
>>> code with the following commands:
>>> /usr/local/bin/mpijavac -d classes JavaMPI.java   --> FINE
>>> /usr/local/bin/mpirun -np 2 java -cp ./classes JavaMPI  --> *ERROR*
>>> 
>>> I get the following error. Could you please help me with this? (As I 
>>> mentioned, I can run C MPI codes without any problem.) The system 
>>> specifications are:
>>> 
>>> JRE version: 6.0_30-b12 (java-sun-6)
>>> OS: Linux 3.0.0-30-generic-pae #47-Ubuntu
>>> CPU:total 4 (2 cores per cpu, 2 threads per core) family 6 model 42 
>>> stepping 7, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, 
>>> popcnt, ht
>>> 
>>> 
>>> 
>>> 
>>> ##
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGSEGV#
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGSEGV (0xb) at pc=0x70e1dd12, pid=28616, tid=3063311216
>>> #
>>>  (0xb) at pc=0x70f61d12, pid=28615, tid=3063343984
>>> #
>>> # JRE version: 6.0_30-b12
>>> # JRE version: 6.0_30-b12
>>> # Java VM: Java HotSpot(TM) Server VM (20.5-b03 mixed mode linux-x86 )
>>> # Problematic frame:
>>> # C  [libmpi.so.1+0x20d12]  unsigned __int128+0xa2
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/karos/hs_err_pid28616.log
>>> # Java VM: Java HotSpot(TM) Server VM (20.5-b03 mixed mode linux-x86 )
>>> # Problematic frame:
>>> # C  [libmpi.so.1+0x20d12]  unsigned __int128+0xa2
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/karos/hs_err_pid28615.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://java.sun.com/webapps/bugreport/crash.jsp
>>> # The crash happened outside the Java Virtual Machine in native code.
>>> # See problematic frame for where to report the bug.
>>> #
>>> [tulips:28616] *** Process received signal ***
>>> [tulips:28616] Signal: Aborted (6)
>>> [tulips:28616] Signal code:  (-6)
>>> [tulips:28616] [ 0] [0xb777840c]
>>> [tulips:28616] [ 1] [0xb7778424]
>>> [tulips:28616] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0x4f) [0xb75e3cff]
>>> [tulips:28616] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x175) [0xb75e7325]
>>> [tulips:28616] [ 4] 
>>> /usr/lib/jvm/java-6-sun-1.6.0.30/jre/lib/i386/server/libjvm.so(+0x5dcf7f) 
>>> [0xb6f6df7f]
>>> [tulips:28616] [ 5] 
>>> /usr/lib/jvm/java-6-sun-1.6.0.30/jre/lib/i386/server/libjvm.so(+0x724897) 
>>> [0xb70b5897]
>>> [tulips:28616] [ 6] 
>>> /usr/lib/jvm/java-6-sun-1.6.0.30/jre/lib/i386/server/libjvm.so(JVM_handle_linux_signal+0x21c)
>>>  [0xb6f7529c]
>>> [tulips:28616] [ 7] 
>>> /usr/lib/jvm/java-6-sun-1.6.0.30/jre/lib/i386/server/libjvm.so(+0x5dff64) 
>>> [0xb6f70f64]
>>> [tulips:28616] [ 8] [0xb777840c]
>>> [tulips:28616] [ 9] [0xb3891548]
>>> [tulips:28616] *** End of error message ***
>>> [tulips:28615] *** Process received signal ***
>>> [tulips:28615] Signal: Aborted (6)
>>> [tulips:28615] Signal code:  (-6)
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://java.sun.com/webapps/bugreport/crash.jsp
>>> # The crash happened outside the Java Virtual Machine in native code.
>>> # See problematic frame for where to report the bug.
>>> #
>>> [tulips:28615] [ 0] [0xb778040c]
>>> [tulips:28615] [ 1] [0xb7780424]
>>> [tulips:28615] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0x4f) [0xb75ebcff]
>>> [tulips:28615] [ 

Re: [OMPI users] Help: OpenMPI Compilation in Raspberry Pi

2013-01-17 Thread Jeff Squyres (jsquyres)
On Jan 16, 2013, at 6:41 AM, Leif Lindholm  wrote:

> That isn't, technically speaking, correct for the Raspberry Pi - but it is a 
> workaround if you know you will never actually use the asm implementations of 
> the atomics, but only the inline C ones..
> 
> This sort of hides the problem that the dedicated barrier instructions were 
> not available in ARMv6 (it used "system control coprocessor operations" 
> instead).
> 
> If you ever executed the asm implementation, you would trigger an undefined 
> instruction exception on the Pi.

Hah; sweet.  Ok.

So what's the right answer?  Would it be acceptable to use a no-op for this 
operation on such architectures?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/