Hi,

Yes, I have multiple clusters: some with InfiniBand, some with MX, some
nodes with both Myrinet and InfiniBand hardware, and others with Ethernet
only.

I reproduced it on vanilla 1.4.1 and 1.4.2 builds, with and without the
--with-mx switch.

This is the output I get on a node with both Ethernet and InfiniBand
hardware; note the error regarding MX.

$ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
[bordeplage-9.bordeaux.grid5000.fr:32365] Error in mx_init (error No MX
device entry in /dev.)
[bordeplage-9.bordeaux.grid5000.fr:32365] mca_btl_mx_component_init:
mx_get_info(MX_NIC_COUNT) failed with status 4(MX not initialized.)
Hello world from process 0 of 1
[bordeplage-9:32365] *** Process received signal ***
[bordeplage-9:32365] Signal: Segmentation fault (11)
[bordeplage-9:32365] Signal code: Address not mapped (1)
[bordeplage-9:32365] Failing at address: 0x7f53bb7bb360
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 32365 on node
bordeplage-9.bordeaux.grid5000.fr exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I recompiled 1.4.2 with --with-openib --without-mx and the problem is gone
(no segfault, no error message).
Seems you aimed at the right spot.

Now the problem is that I need support for both.
I could compile two versions of Open MPI and deploy the appropriate one on
each cluster, with support for either MX or OpenIB... but that's quite
painful, and how would I manage the nodes that have both?
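A thought: assuming a single build with both transports compiled in, Open
MPI's standard MCA selection syntax should let me deselect the mx BTL per
run on clusters without MX hardware. Untested on my side, so just a sketch:

```shell
# Exclude the mx BTL at run time; the remaining BTLs (openib, tcp, sm,
# self) stay eligible for selection:
mpirun --mca btl ^mx -machinefile nodefile ./mpi_helloworld

# Or whitelist explicitly on the InfiniBand clusters:
mpirun --mca btl openib,sm,self -machinefile nodefile ./mpi_helloworld
```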

For now I'll stick with a version of Open MPI compiled with support for
both interconnects and --without-memory-manager, unless the list has a
better idea?
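For reference, a sketch of what I mean, combining my configure line with
the run-time switch Jeff mentioned below (paths are from my setup, adjust
as needed):

```shell
# Single build supporting both interconnects, with the memory manager
# compiled out -- the combination that currently avoids the segfault:
./configure --prefix=/usr --with-openib=/usr --with-mx=/usr \
            --without-memory-manager
make all install

# Alternative: keep the memory manager compiled in and disable it per job
# via the environment on the 1.4 series; -x forwards the variable to all
# remote nodes:
export OMPI_MCA_memory_ptmalloc2_disable=1
mpirun -x OMPI_MCA_memory_ptmalloc2_disable -machinefile nodefile \
       ./mpi_helloworld
```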

Thanks for the input, much appreciated.
If you need further info, I can recompile everything with -g, fire up gdb,
and locate the segfault more precisely.
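Concretely, the debugging plan would look something like this (the sleep
loop in my test program exists precisely so gdb can be attached to a
running rank; hostnames and paths are illustrative):

```shell
# Rebuild with symbols and no optimization:
./configure --prefix=$HOME/openmpi-dbg --with-openib=/usr --with-mx=/usr \
            CFLAGS="-g -O0"
make all install

# Launch, find the rank's PID on its node, and attach:
$HOME/openmpi-dbg/bin/mpirun -machinefile nodefile ./mpi_helloworld &
gdb -p $(pgrep -f mpi_helloworld | head -1)
# inside gdb: 'set var i = 1' then 'continue' lets the rank proceed into
# MPI_Finalize(), where the segfault should be caught with a full backtrace.
```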

On 06/01/2010 03:34 PM, Jeff Squyres wrote:
> Are you running on nodes with both MX and OpenFabrics?
> 
> I don't know if this is a well-tested scenario -- there may be some strange 
> interactions in the registered memory management between MX and OpenFabrics 
> verbs.  
> 
> FWIW, you should be able to disable Open MPI's memory management at run time 
> in the 1.4 series by setting the environment variable 
> OMPI_MCA_memory_ptmalloc2_disable to 1 (for good measure, ensure that it's 
> set on all nodes where you are running Open MPI).
> 
> 
> 
> On May 31, 2010, at 11:02 AM, guillaume ranquet wrote:
> 
> we use a slightly modified openmpi-1.4.1
> 
> the patch is here:
> <diff>
> --- ompi/mca/btl/tcp/btl_tcp_proc.c.orig        2010-03-23
> 14:01:28.000000000 +0100
> +++ ompi/mca/btl/tcp/btl_tcp_proc.c     2010-03-23 14:01:50.000000000 +0100
> @@ -496,7 +496,7 @@
>                                  local_interfaces[i]->ipv4_netmask)) {
>                          weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
>                      } else {
> -                        weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
> +                        weights[i][j] = CQ_NO_CONNECTION;
>                      }
>                      best_addr[i][j] =
> peer_interfaces[j]->ipv4_endpoint_addr;
>                  }
> </diff>
> 
> I actually just discovered the existence of this patch,
> I'm planning to run tests with a vanilla 1.4.1 and if possible a 1.4.2 ASAP.
> 
> 
> On 05/31/2010 04:18 PM, Ralph Castain wrote:
>>>> What OMPI version are you using?
>>>>
>>>> On May 31, 2010, at 5:37 AM, guillaume ranquet wrote:
>>>>
>>>> Hi,
>>>> I'm new to the list and quite new to the world of MPI.
>>>>
>>>> a bit of background:
>>>> I'm a sysadmin and have to provide a working environment (Debian-based)
>>>> for researchers to work with MPI: I'm _NOT_ an Open MPI user - I know
>>>> C, but that's all.
>>>>
>>>> I compile Open MPI with the following configure options: --prefix=/usr
>>>> --with-openib=/usr --with-mx=/usr
>>>> (yes, everything goes in /usr)
>>>>
>>>> when running an mpi application (any application) on a machine equipped
>>>> with infiniband hardware, I get a segmentation fault during the
>>>> MPI_Finalize()
>>>> the code just runs fine on machines that have no Infiniband devices.
>>>>
>>>> <code>
>>>> #include <stdio.h>
>>>> #include <unistd.h>   /* for sleep() */
>>>> #include <mpi.h>
>>>>
>>>> int main (int argc, char *argv[])
>>>> {
>>>>     int i = 0, rank, size;
>>>>
>>>>     MPI_Init (&argc, &argv);               /* starts MPI */
>>>>     MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>>>>     MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>>>>     while (i == 0)                         /* spin so a debugger can   */
>>>>         sleep (5);                         /* attach; set i != 0 from  */
>>>>                                            /* gdb to continue          */
>>>>     printf ("Hello world from process %d of %d\n", rank, size);
>>>>     MPI_Finalize ();
>>>>     return 0;
>>>> }
>>>> </code>
>>>>
>>>> my gdb-fu is quite rusty, but I get the vague idea it happens somewhere
>>>> in the MPI_Finalize(); (I can probably dig a bit there to find exactly
>>>> where, if it's relevant)
>>>>
>>>> I'm running it with:
>>>> $ mpirun --mca orte_base_help_aggregate 0 --mca plm_rsh_agent oarsh
>>>> -machinefile nodefile ./mpi_helloworld
>>>>
>>>>
>>>> After various tests, it was suggested that I try recompiling Open MPI
>>>> with the --without-memory-manager option.
>>>> That actually solves the issue, and everything runs fine.
>>>>
>>>> From what I understand (correct me if I'm wrong), the "memory manager"
>>>> is used with InfiniBand RDMA to keep a somewhat persistent registered
>>>> memory region on the device instead of destroying/recreating it every
>>>> time. So is it only a "performance tuning" issue, in that disabling it
>>>> just disables the Open MPI "leave_pinned" option?
>>>>
>>>> The various questions I have:
>>>> - is this bug/behaviour known?
>>>> - if so, is there a better workaround?
>>>> - as I'm not an Open MPI user, I don't really know: is it considered
>>>>   acceptable to have this option disabled?
>>>> - does the list want more details on this bug?
>>>>
>>>>
>>>> thanks,
>>>> Guillaume Ranquet.
>>>> Grid5000 support-staff.
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
