Hello,

I changed the test code (hetero.c, attached) so that the master (the rank where data is centralized) can be rank 1 or rank 2. I tested with the master at rank 1 and at rank 2: same problem. When the master is a 64-bit machine, it gets a segfault as soon as it receives data from a 32-bit machine; there is no problem with a 32-bit master. So the bug does not seem to be rank dependent...
Regards,

On Mon, Mar 8, 2010 at 1:27 PM, Terry Dontje <terry.don...@oracle.com> wrote:

> We (Oracle) have not done that much extensive limits testing going between
> 32- and 64-bit applications. Most of the testing we've done is more around
> endianness (SPARC vs. x86_64).
>
> Though the below is kind of interesting. Sounds like the eager limit isn't
> being normalized on the 64-bit machines. Though a 32-bit rank 0 solving the
> problem is also very interesting; I wonder if that is not more due to which
> rank is sending and which is receiving?
>
> --td
>
>> Message: 3
>> Date: Sun, 7 Mar 2010 05:34:21 -0600
>> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
>> To: <us...@open-mpi.org>
>> Message-ID: <58d723fe08dc6a4398e6596e38f3fa17056...@xmb-rcd-205.cisco.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> IBM and Sun (Oracle) have probably done the most heterogeneous testing,
>> but it's probably not as stable as our homogeneous code paths.
>>
>> Terry/Brad - do you have any insight here?
>>
>> Yes, setting the eager limit high can impact performance. It's the amount
>> of data that OMPI will send eagerly without waiting for an ack from the
>> receiver. There are several secondary performance effects that can occur
>> if you are using sockets for transport and/or your program is only loosely
>> synchronized. If your program is tightly synchronous, it may not have too
>> huge of an overall perf impact.
>> -jms
>> Sent from my PDA. No type good.
>>
>> ----- Original Message -----
>> From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
>> To: Open MPI Users <us...@open-mpi.org>
>> Sent: Thu Mar 04 09:02:19 2010
>> Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
>>
>> Hi,
>>
>> I have made a new discovery about this problem:
>>
>> It seems that the array size sendable from a 32-bit to a 64-bit machine
>> is proportional to the parameter "btl_tcp_eager_limit".
>> When I set it to 200 000 000 (2e08 bytes, about 190 MB), I can send an
>> array of up to 2e07 doubles (152 MB).
>>
>> I didn't find much information about btl_tcp_eager_limit other than
>> in the "ompi_info --all" command. If I leave it at 2e08, will it impact
>> the performance of Open MPI?
>>
>> It may also be noteworthy that if the master (rank 0) is a 32-bit
>> machine, I don't get a segfault. I can send a big array with a small
>> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
>>
>> Do I have to move this thread to the devel mailing list?
>>
>> Regards,
>>
>> TMHieu
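The eager limit discussed here can be inspected with ompi_info and overridden per run on the mpirun command line. A minimal sketch, using the 2e08 value quoted above and the appfile that appears later in the thread:

$ ompi_info --param btl tcp | grep eager_limit
$ mpirun --mca btl_tcp_eager_limit 200000000 -hetero --app appfile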
>> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely I
>>> compiled with:
>>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
>>> --enable-cxx-exceptions --enable-shared
>>> --enable-orterun-prefix-by-default
>>> $ make all install
>>>
>>> I attach the output of ompi_info from my 2 machines.
>>>
>>> TMHieu
>>>
>>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>
>>>> Did you configure Open MPI with --enable-heterogeneous?
>>>>
>>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have some problems running MPI on my heterogeneous cluster. More
>>>>> precisely, I get a segmentation fault when sending a large array
>>>>> (about 10000 elements) of double from an i686 machine to an x86_64
>>>>> machine. It does not happen with a small array. Here is the send/recv
>>>>> code (the complete source is in the attached file):
>>>>>
>>>>> ======== code ================
>>>>>     if (me == 0) {
>>>>>         for (int pe=1; pe<nprocs; pe++)
>>>>>         {
>>>>>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>>>             d = (double *)malloc(sizeof(double)*n);
>>>>>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>>>             printf("OK\n"); fflush(stdout);
>>>>>         }
>>>>>         printf("All done.\n");
>>>>>     }
>>>>>     else {
>>>>>         d = (double *)malloc(sizeof(double)*n);
>>>>>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>>>>     }
>>>>> ======== code ================
>>>>>
>>>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>>>> I have 2 machines:
>>>>> sbtn155 : Intel Xeon,      x86_64
>>>>> sbtn211 : Intel Pentium 4, i686
>>>>>
>>>>> The code is compiled on both the x86_64 and the i686 machine, using
>>>>> Open MPI 1.4.1, installed in /tmp/openmpi:
>>>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
>>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
>>>>>
>>>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
>>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
>>>>>
>>>>> I run the code using an appfile and get these errors:
>>>>> $ cat appfile
>>>>> --host sbtn155 -np 1 hetero.x86_64
>>>>> --host sbtn155 -np 1 hetero.x86_64
>>>>> --host sbtn211 -np 1 hetero.i686
>>>>>
>>>>> $ mpirun -hetero --app appfile
>>>>> Input array length :
>>>>> 10000
>>>>> Receiving from proc 1 : OK
>>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
>>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>>>> [sbtn155:26386] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>>>> exited on signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Am I missing an option needed to run on a heterogeneous cluster?
>>>>> Do MPI_Send/Recv have an array size limit on a heterogeneous cluster?
>>>>> Thanks for your help.
>>>>> Regards
>>>>>
>>>>> --
>>>>> ============================================
>>>>>    M. TRINH Minh Hieu
>>>>>    CEA, IBEB, SBTN/LIRM,
>>>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>>>> ============================================
>>>>>
>>>>> <hetero.c.bz2>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/

--
============================================
   M. TRINH Minh Hieu
   CEA, IBEB, SBTN/LIRM,
   F-30207 Bagnols-sur-Cèze, FRANCE
============================================
/* hetero.c - test program for MPI_Send/MPI_Recv on a heterogeneous
 * (32/64-bit) cluster: the master rank collects an n-element array of
 * doubles from every other rank and checks the payload. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    unsigned int n;
    int me, nprocs;
    int master = 1;              /* rank where the data is centralized */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    if (me == 0) {
        printf("Input array length :\n");
        scanf("%u", &n);         /* %u: n is unsigned int */
        printf("Size in MB: %g\n", (double)n * 8 / 1024 / 1024);
    }
    MPI_Bcast(&n, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);  /* match the C type */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Status status;
    double *d;

    /* Master: receive and verify one array per slave. */
    if (me == master) {
        for (int pe = 0; pe < nprocs; pe++) {
            if (pe == master) continue;
            printf("I am proc %d. Receiving from proc %d : ", me, pe);
            fflush(stdout);
            d = (double *)malloc(sizeof(double) * n);
            if (d == NULL) {
                printf("ERROR : Unable to malloc !\n");
                MPI_Finalize();
                exit(EXIT_FAILURE);
            }
            MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
            printf("OK."); fflush(stdout);
            /* Each element was set to its own index by the sender. */
            int corrupted = 0;
            for (unsigned int i = 0; i < n; i++) {
                if (d[i] != i) {
                    printf("Data corrupted at %u!\n", i);
                    corrupted = 1;
                    break;
                }
            }
            if (!corrupted) printf("Data OK.\n");
            free(d);
        }
        printf("All done.\n"); fflush(stdout);
    }
    /* Slaves: fill the array and send it to the master. */
    else {
        d = (double *)malloc(sizeof(double) * n);
        for (unsigned int i = 0; i < n; i++) d[i] = i;
        MPI_Send(d, n, MPI_DOUBLE, master, 999, MPI_COMM_WORLD);
        free(d);
    }

    MPI_Finalize();
    return 0;
}
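The reports in this thread suggest that messages at or below btl_tcp_eager_limit survive the 32/64-bit crossing while larger messages crash in the rendezvous path. An alternative to raising the limit, then, is to keep every individual message below it by splitting the transfer into chunks. A minimal sketch under that assumption (send_chunked, recv_chunked, and CHUNK_DOUBLES are illustrative names invented here, not part of hetero.c; 64 KiB is the commonly cited default TCP eager limit, so 4096 doubles = 32 KiB per message stays safely under it):

#include <mpi.h>

/* Illustrative chunk size: 4096 doubles = 32 KiB per message, assumed
 * to sit below the default btl_tcp_eager_limit (~64 KiB). */
#define CHUNK_DOUBLES 4096u

/* Send n doubles to rank dest in eager-sized pieces. */
static void send_chunked(const double *d, unsigned int n, int dest)
{
    unsigned int sent = 0;
    while (sent < n) {
        unsigned int len = n - sent;
        if (len > CHUNK_DOUBLES) len = CHUNK_DOUBLES;
        MPI_Send((void *)(d + sent), (int)len, MPI_DOUBLE, dest, 999, MPI_COMM_WORLD);
        sent += len;
    }
}

/* Matching receive: the same chunk schedule on the other side. */
static void recv_chunked(double *d, unsigned int n, int src)
{
    MPI_Status status;
    unsigned int recvd = 0;
    while (recvd < n) {
        unsigned int len = n - recvd;
        if (len > CHUNK_DOUBLES) len = CHUNK_DOUBLES;
        MPI_Recv(d + recvd, (int)len, MPI_DOUBLE, src, 999, MPI_COMM_WORLD, &status);
        recvd += len;
    }
}

In hetero.c above, the MPI_Send/MPI_Recv pair would become send_chunked(d, n, master) and recv_chunked(d, n, pe). This trades one large rendezvous transfer for many eager ones, which is exactly the performance trade-off Jeff describes earlier in the thread.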