Hello,

I changed the test code (hetero.c, attached) so that the master (where the data
is centralized) can be rank 1 or 2.
I tested with the master as rank 2 and as rank 1: same problem. When the master
is a 64-bit machine, it segfaults as soon as it receives data from a 32-bit
machine; there is no problem with a 32-bit master. So the issue does not seem
to be rank dependent.
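
For reference, the failing case can be launched with an appfile along these
lines (appfile shown for illustration only, reusing the hostnames from the
original report; in hetero.c the receiving rank is chosen by the "master"
variable):

$ cat appfile
--host sbtn211 -np 1 hetero.i686
--host sbtn155 -np 1 hetero.x86_64
--host sbtn211 -np 1 hetero.i686

$ mpirun -hetero --app appfile

With master=1, rank 1 (sbtn155, x86_64) receives from the 32-bit ranks 0 and 2,
which is the combination that segfaults here.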

Regards,


On Mon, Mar 8, 2010 at 1:27 PM, Terry Dontje <terry.don...@oracle.com> wrote:

> We (Oracle) have not done much extensive limits testing going between
> 32-bit and 64-bit applications.  Most of the testing we've done is more around
> endianness (SPARC vs. x86_64).
>
> The below is kind of interesting, though.  It sounds like the eager limit isn't
> being normalized on the 64-bit machines.  A 32-bit rank 0 solving the
> problem is also very interesting; I wonder if that is more due to which
> rank is sending and which is receiving?
>
> --td
>
>
>
>> Message: 3
>> Date: Sun, 7 Mar 2010 05:34:21 -0600
>> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] Segmentation fault when Send/Recv
>>        onheterogeneouscluster (32/64 bit machines)
>> To: <us...@open-mpi.org>
>> Message-ID:
>>        <58d723fe08dc6a4398e6596e38f3fa17056...@xmb-rcd-205.cisco.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> IBM and Sun (Oracle) have probably done the most heterogeneous testing,
>> but it's probably not as stable as our homogeneous code paths.
>>
>> Terry/Brad - do you have any insight here?
>>
>> Yes, setting the eager limit high can impact performance. It's the amount of
>> data that OMPI will send eagerly without waiting for an ack from the
>> receiver. There are several secondary performance effects that can occur if
>> you are using sockets for transport and/or your program is only loosely
>> synchronized. If your program is tightly synchronous, it may not have too
>> huge an overall performance impact.
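>>
>> For reference, the usual way to change it for a single run is on the mpirun
>> command line, with the value in bytes, e.g. for the 2e08 value tried below:
>>
>>   mpirun --mca btl_tcp_eager_limit 200000000 -hetero --app appfile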
>> -jms
>> Sent from my PDA.  No type good.
>>
>> ----- Original Message -----
>> From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
>> To: Open MPI Users <us...@open-mpi.org>
>> Sent: Thu Mar 04 09:02:19 2010
>> Subject: Re: [OMPI users] Segmentation fault when Send/Recv
>> onheterogeneouscluster (32/64 bit machines)
>>
>> Hi,
>>
>> I have made a new discovery about this problem:
>>
>> It seems that the array size that can be sent from a 32-bit to a 64-bit
>> machine is proportional to the parameter "btl_tcp_eager_limit".
>> When I set it to 200 000 000 (2e08 bytes, about 190 MB), I can send an
>> array of up to 2e07 doubles (about 152 MB).
>>
>> I didn't find much information about btl_tcp_eager_limit other than
>> in the "ompi_info --all" output. If I leave it at 2e08, will it impact
>> the performance of Open MPI?
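>>
>> (Besides "ompi_info --all", the TCP BTL parameters can be listed on their
>> own with something like:
>>
>>   ompi_info --param btl tcp
>>
>> which also shows the current value of btl_tcp_eager_limit.)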
>>
>> It may also be noteworthy that if the master (rank 0) is a 32-bit
>> machine, I don't get a segfault: I can send a big array with a small
>> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
>>
>> Should I move this thread to the devel mailing list?
>>
>> Regards,
>>
>>   TMHieu
>>
>> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com>
>> wrote:
>>
>>
>>> Hello,
>>>
>>> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely, I
>>> compiled with:
>>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
>>> --enable-cxx-exceptions --enable-shared
>>> --enable-orterun-prefix-by-default
>>> $ make all install
>>>
>>> I attach the output of ompi_info of my 2 machines.
>>>
>>>    TMHieu
>>>
>>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>
>>>
>>>> Did you configure Open MPI with --enable-heterogeneous?
>>>>
>>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>>>
>>>>
>>>>
>>>>> Hello,
>>>>>
>>>>> I have some problems running MPI on my heterogeneous cluster. More
>>>>> precisely, I get a segmentation fault when sending a large array (about
>>>>> 10000 elements) of doubles from an i686 machine to an x86_64 machine. It
>>>>> does not happen with a small array. Here is the send/recv source code
>>>>> (the complete source is in the attached file):
>>>>> ======== code ================
>>>>>     if (me == 0 ) {
>>>>>         for (int pe=1; pe<nprocs; pe++)
>>>>>         {
>>>>>             printf("Receiving from proc %d : ",pe); fflush(stdout);
>>>>>             d=(double *)malloc(sizeof(double)*n);
>>>>>             MPI_Recv(d,n,MPI_DOUBLE,pe,999,MPI_COMM_WORLD,&status);
>>>>>             printf("OK\n"); fflush(stdout);
>>>>>         }
>>>>>         printf("All done.\n");
>>>>>     }
>>>>>     else {
>>>>>         d=(double *)malloc(sizeof(double)*n);
>>>>>         MPI_Send(d,n,MPI_DOUBLE,0,999,MPI_COMM_WORLD);
>>>>>     }
>>>>> ======== code ================
>>>>>
>>>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>>>> I have 2 machines :
>>>>> sbtn155 : Intel Xeon,      x86_64
>>>>> sbtn211 : Intel Pentium 4, i686
>>>>>
>>>>> The code is compiled on both the x86_64 and the i686 machine, using
>>>>> Open MPI 1.4.1, installed in /tmp/openmpi :
>>>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
>>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o
>>>>> hetero.i686.o
>>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>>> hetero.i686.o -o hetero.i686 -lm
>>>>>
>>>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
>>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o
>>>>> hetero.x86_64.o
>>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>>> hetero.x86_64.o -o hetero.x86_64 -lm
>>>>>
>>>>> I run the code using an appfile and get this error:
>>>>> $ cat appfile
>>>>> --host sbtn155 -np 1 hetero.x86_64
>>>>> --host sbtn155 -np 1 hetero.x86_64
>>>>> --host sbtn211 -np 1 hetero.i686
>>>>>
>>>>> $ mpirun -hetero --app appfile
>>>>> Input array length :
>>>>> 10000
>>>>> Receiving from proc 1 : OK
>>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so
>>>>> [0x2aaaad8d7908]
>>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so
>>>>> [0x2aaaae2fc6e3]
>>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>>>> [sbtn155:26386] [ 4]
>>>>> /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so
>>>>> [0x2aaaad8d4b25]
>>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b)
>>>>> [0x2aaaaab30f9b]
>>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4)
>>>>> [0x3fa421e074]
>>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>>>> [sbtn155:26386] *** End of error message ***
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>>>> exited on signal 11 (Segmentation fault).
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Am I missing an option needed to run on a heterogeneous cluster?
>>>>> Do MPI_Send/Recv have an array size limit on a heterogeneous cluster?
>>>>> Thanks for your help. Regards,
>>>>>
>>>>> --
>>>>> ============================================
>>>>>    M. TRINH Minh Hieu
>>>>>    CEA, IBEB, SBTN/LIRM,
>>>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>>>> ============================================
>>>>>
>>>>> <hetero.c.bz2>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
============================================
  M. TRINH Minh Hieu
  CEA, IBEB, SBTN/LIRM,
  F-30207 Bagnols-sur-Cèze, FRANCE
============================================
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    unsigned int n;
    int me, nprocs;
    int master=1;    /* rank of the master (receiver); set to 1 or 2 to test rank dependence */

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank (MPI_COMM_WORLD, &me);
    if (me == 0)
    {
	printf("%s", "Input array length :\n");
	scanf ("%d", &n);
	printf("Size in MB: %g\n",(double) n*8/1024/1024);
    }
    MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier (MPI_COMM_WORLD);
    
    MPI_Status status;
    double *d;

    // Master recv
    if (me == master ) {
	for (int pe=0; pe<nprocs; pe++)
	{
	    if (pe==master) continue;

	    printf("I am proc %d. Receiving from proc %d : ",me,pe); fflush(stdout);
	    d=(double *)malloc(sizeof(double)*n);	   
	    if (d==NULL) 
	    {
		printf("ERROR : Unable to malloc !\n");
		MPI_Finalize();
		exit(EXIT_FAILURE);
	    }
	    
	    MPI_Recv(d,n,MPI_DOUBLE,pe,999,MPI_COMM_WORLD,&status);
	    printf("OK."); fflush(stdout);
	    unsigned int i;
	    for (i=0; i<n; i++)
	    {
		if (d[i] != i) {
		    printf("Data corrupted at %u!\n",i);
		    break;    /* stop at the first corrupted element */
		}
	    }
	    if (i==n) printf("Data OK.\n");    /* loop finished without finding corruption */
	    free(d);
	}
	printf("All done.\n"); fflush(stdout);

    }

    // Slave send
    else {
      d=(double *)malloc(sizeof(double)*n);
      for (unsigned int i=0; i<n; i++) d[i]=i;
      MPI_Send(d,n,MPI_DOUBLE,master,999,MPI_COMM_WORLD);
      free(d);
    }
    
    MPI_Finalize();
    return 0;
} 
