We (Oracle) have not done much extensive limits testing going between
32- and 64-bit applications. Most of the testing we've done is more
around endianness (SPARC vs. x86_64).
The report below is interesting, though. It sounds like the eager limit
isn't being normalized on the 64-bit machines. That a 32-bit rank 0
avoids the problem is also very interesting; I wonder whether that is
due to which rank is sending and which is receiving?
--td
Message: 3
Date: Sun, 7 Mar 2010 05:34:21 -0600
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
Subject: Re: [OMPI users] Segmentation fault when Send/Recv
on heterogeneous cluster (32/64 bit machines)
To: <us...@open-mpi.org>
Message-ID:
<58d723fe08dc6a4398e6596e38f3fa17056...@xmb-rcd-205.cisco.com>
Content-Type: text/plain; charset="utf-8"
IBM and Sun (Oracle) have probably done the most heterogeneous testing, but it's
probably not as stable as our homogeneous code paths.
Terry/Brad: do you have any insight here?
Yes, setting the eager limit high can impact performance. It's the amount of data that Open MPI will send eagerly, without waiting for an ack from the receiver. There are several secondary performance effects that can occur if you are using sockets for transport and/or your program is only loosely synchronized. If your program is tightly synchronous, it may not have too huge an overall performance impact.
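The eager-vs-rendezvous decision can be pictured with a minimal sketch (hypothetical names, not Open MPI's internals; the 64 kB limit is an illustrative value standing in for btl_tcp_eager_limit):

```python
EAGER_LIMIT = 65536  # bytes; plays the role of btl_tcp_eager_limit here

def choose_protocol(msg_bytes: int, eager_limit: int = EAGER_LIMIT) -> str:
    """Return which wire protocol a message of msg_bytes would use.

    Messages at or below the eager limit are pushed immediately;
    larger ones wait for the receiver to acknowledge (rendezvous).
    """
    return "eager" if msg_bytes <= eager_limit else "rendezvous"

# 1000 doubles (8 kB) fit under this illustrative limit; 10000 doubles
# (80 kB) do not, so they would take the rendezvous path instead.
print(choose_protocol(1000 * 8))    # -> eager
print(choose_protocol(10000 * 8))   # -> rendezvous
```

This is why a large eager limit trades memory and unexpected-message buffering for fewer round trips: every message under the limit lands in receiver-side buffers whether or not a matching receive is posted yet.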
-jms
Sent from my PDA. No type good.
----- Original Message -----
From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: Thu Mar 04 09:02:19 2010
Subject: Re: [OMPI users] Segmentation fault when Send/Recv
on heterogeneous cluster (32/64 bit machines)
Hi,
I have a new discovery about this problem:
It seems that the array size sendable from a 32-bit machine to a 64-bit
machine is proportional to the parameter "btl_tcp_eager_limit".
When I set it to 200,000,000 (2e8 bytes, about 190 MB), I can send an
array of up to 2e7 doubles (152 MB).
I didn't find much information about btl_tcp_eager_limit other than
in the "ompi_info --all" output. If I leave it at 2e8, will it impact
the performance of Open MPI?
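For reference, the parameter can be inspected and overridden per run without recompiling; a sketch (values illustrative, not a recommendation):

```shell
# Show the TCP BTL's current eager limit (value in bytes).
ompi_info --param btl tcp | grep -i eager_limit

# Override it for a single run; 64 kB shown only as an example value.
mpirun --mca btl_tcp_eager_limit 65536 -hetero --app appfile
```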
It may also be noteworthy that if the master (rank 0) is a 32-bit
machine, I don't get a segfault: I can send a big array with a small
"btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
Do I have to move this thread to the devel mailing list?
Regards,
TMHieu
On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com> wrote:
Hello,
Yes, I compiled Open MPI with --enable-heterogeneous. More precisely, I
configured with:
$ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
--enable-cxx-exceptions --enable-shared
--enable-orterun-prefix-by-default
$ make all install
I attach the ompi_info output from my 2 machines.
   TMHieu
On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
Did you configure Open MPI with --enable-heterogeneous?
On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
Hello,
I have some problems running MPI on my heterogeneous cluster. More
precisely, I get a segmentation fault when sending a large array (about
10,000 elements) of double from an i686 machine to an x86_64 machine. It
does not happen with small arrays. Here is the send/recv source code (the
complete source is in the attached file):
======== code ================
    if (me == 0) {
        for (int pe = 1; pe < nprocs; pe++) {
            printf("Receiving from proc %d : ", pe); fflush(stdout);
            d = (double *) malloc(sizeof(double) * n);
            MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
            printf("OK\n"); fflush(stdout);
            free(d);   /* free each receive buffer; without this, one buffer leaks per sender */
        }
        printf("All done.\n");
    }
    else {
        d = (double *) malloc(sizeof(double) * n);
        MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
        free(d);
    }
======== code ================
I get a segmentation fault with n=10000 but no error with n=1000.
I have 2 machines:
sbtn155 : Intel Xeon,      x86_64
sbtn211 : Intel Pentium 4, i686
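For what it's worth, the payload representation itself should be identical on these two machines: on both i686 and x86_64, a C double is 8 bytes of little-endian IEEE 754, which can be checked with a quick sketch (Python used for brevity), suggesting the crash comes from the MPI layer's bookkeeping rather than the data encoding:

```python
import struct
import sys

# A C double is 8 bytes on both i686 and x86_64 (IEEE 754), and both
# architectures are little-endian, so the bytes on the wire match.
print(struct.calcsize("d"))   # -> 8
print(sys.byteorder)          # "little" on x86-family machines
```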
The code is compiled on both the x86_64 and the i686 machine, using
Open MPI 1.4.1 installed in /tmp/openmpi:
[mhtrinh@sbtn211 heterogenous]$ make hetero
gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
/tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
hetero.i686.o -o hetero.i686 -lm
[mhtrinh@sbtn155 heterogenous]$ make hetero
gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
/tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
hetero.x86_64.o -o hetero.x86_64 -lm
I run the code using an appfile and get these errors:
$ cat appfile
--host sbtn155 -np 1 hetero.x86_64
--host sbtn155 -np 1 hetero.x86_64
--host sbtn211 -np 1 hetero.i686
$ mpirun -hetero --app appfile
Input array length :
10000
Receiving from proc 1 : OK
Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
[sbtn155:26386] Signal: Segmentation fault (11)
[sbtn155:26386] Signal code: Address not mapped (1)
[sbtn155:26386] Failing at address: 0x200627bd8
[sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
[sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
[sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
[sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
[sbtn155:26386] [ 4]
/tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
[sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
[sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b)
[0x2aaaaab30f9b]
[sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
[sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
[sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
[sbtn155:26386] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 26386 on node sbtn155
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Am I missing an option needed to run on a heterogeneous cluster?
Do MPI_Send/Recv have a limit on array size when used on a heterogeneous cluster?
Thanks for your help. Regards,
--
============================================
   M. TRINH Minh Hieu
   CEA, IBEB, SBTN/LIRM,
   F-30207 Bagnols-sur-Cèze, FRANCE
============================================
<hetero.c.bz2>
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/