Hi,
Yes, I have multiple clusters: some with InfiniBand, some with MX, some nodes with both Myrinet and InfiniBand hardware, and others with Ethernet only. I reproduced the problem on a vanilla 1.4.1 and 1.4.2, with and without the --with-mx switch.

This is the output I get on a node with Ethernet and InfiniBand hardware; note the error regarding MX:

$ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
[bordeplage-9.bordeaux.grid5000.fr:32365] Error in mx_init (error No MX device entry in /dev.)
[bordeplage-9.bordeaux.grid5000.fr:32365] mca_btl_mx_component_init: mx_get_info(MX_NIC_COUNT) failed with status 4(MX not initialized.)
Hello world from process 0 of 1
[bordeplage-9:32365] *** Process received signal ***
[bordeplage-9:32365] Signal: Segmentation fault (11)
[bordeplage-9:32365] Signal code: Address not mapped (1)
[bordeplage-9:32365] Failing at address: 0x7f53bb7bb360
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 32365 on node
bordeplage-9.bordeaux.grid5000.fr exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I recompiled 1.4.2 with --with-openib --without-mx and the problem is gone (no segfault, no error message), so it seems you aimed at the right spot. The problem now is that I need support for both. I could compile two versions of Open MPI and deploy the appropriate one on each cluster, with support for either MX or OpenIB, but that is quite painful, and how should I manage the nodes that have both hardware types?

For now I'll stick with a single Open MPI build that supports both kinds of hardware and is configured --without-memory-manager (a rough sketch of that build follows below), unless the list has a better idea.

Thanks for the input, much appreciated. If you need further information, I can recompile everything with -g, attach gdb, and locate the segfault more precisely.
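For reference, here is roughly what I mean by that build; this is only a sketch of what I intend to deploy (the /usr prefix comes from our existing packaging habits, and the parallel make is just a convenience):

<code>
# single Open MPI 1.4.x build supporting both interconnects, with the
# internal ptmalloc2 memory manager left out at configure time
./configure --prefix=/usr \
            --with-openib=/usr \
            --with-mx=/usr \
            --without-memory-manager
make -j4 && make install
</code>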
On 06/01/2010 03:34 PM, Jeff Squyres wrote:
> Are you running on nodes with both MX and OpenFabrics?
>
> I don't know if this is a well-tested scenario -- there may be some strange
> interactions in the registered memory management between MX and OpenFabrics
> verbs.
>
> FWIW, you should be able to disable Open MPI's memory management at run time
> in the 1.4 series by setting the environment variable
> OMPI_MCA_memory_ptmalloc2_disable to 1 (for good measure, ensure that it's
> set on all nodes where you are running Open MPI).
>
> On May 31, 2010, at 11:02 AM, guillaume ranquet wrote:
>
> we use a slightly modified openmpi-1.4.1
> the patch is here:
> <diff>
> --- ompi/mca/btl/tcp/btl_tcp_proc.c.orig    2010-03-23 14:01:28.000000000 +0100
> +++ ompi/mca/btl/tcp/btl_tcp_proc.c 2010-03-23 14:01:50.000000000 +0100
> @@ -496,7 +496,7 @@
>                          local_interfaces[i]->ipv4_netmask)) {
>              weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
>          } else {
> -            weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
> +            weights[i][j] = CQ_NO_CONNECTION;
>          }
>          best_addr[i][j] = peer_interfaces[j]->ipv4_endpoint_addr;
>      }
> </diff>
>
> I actually just discovered the existence of this patch;
> I'm planning to run tests with a vanilla 1.4.1 and, if possible, a 1.4.2 ASAP.
>
> On 05/31/2010 04:18 PM, Ralph Castain wrote:
>>>> What OMPI version are you using?
>>>>
>>>> On May 31, 2010, at 5:37 AM, guillaume ranquet wrote:
>>>>
>>>> Hi,
>>>> I'm new to the list and quite new to the world of MPI.
>>>>
>>>> A bit of background: I'm a sysadmin and have to provide a working
>>>> environment (Debian based) for researchers to work with MPI. I'm _NOT_
>>>> an Open MPI user - I know C, but that's all.
>>>>
>>>> I compile Open MPI with the following selectors: --prefix=/usr
>>>> --with-openib=/usr --with-mx=/usr
>>>> (yes, everything goes in /usr)
>>>>
>>>> When running an MPI application (any application) on a machine equipped
>>>> with InfiniBand hardware, I get a segmentation fault during
>>>> MPI_Finalize(); the code runs fine on machines that have no InfiniBand
>>>> devices.
>>>>
>>>> <code>
>>>> #include <stdio.h>
>>>> #include <unistd.h>   /* for sleep() */
>>>> #include <mpi.h>
>>>>
>>>> int main (int argc, char *argv[])
>>>> {
>>>>     int i = 0, rank, size;
>>>>
>>>>     MPI_Init (&argc, &argv);                /* starts MPI */
>>>>     MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
>>>>     MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
>>>>     while (i == 0)       /* spins until i is changed, e.g. from a debugger */
>>>>         sleep(5);
>>>>     printf( "Hello world from process %d of %d\n", rank, size );
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>> </code>
>>>>
>>>> My gdb-fu is quite rusty, but I get the vague idea it happens somewhere
>>>> in MPI_Finalize() (I can probably dig a bit there to find exactly where,
>>>> if it's relevant).
>>>>
>>>> I'm running it with:
>>>> $ mpirun --mca orte_base_help_aggregate 0 --mca plm_rsh_agent oarsh
>>>>     -machinefile nodefile ./mpi_helloworld
>>>>
>>>> After various tests, I was advised to try recompiling Open MPI with the
>>>> --without-memory-manager selector. That actually solves the issue and
>>>> everything runs fine.
>>>>
>>>> From what I understand (correct me if I'm wrong), the "memory manager"
>>>> is used with InfiniBand RDMA to keep a somewhat persistent memory region
>>>> registered on the device instead of destroying and recreating it every
>>>> time; so is this only a performance-tuning issue, in that it disables
>>>> the Open MPI "leave_pinned" option?
>>>>
>>>> The various questions I have:
>>>> is this bug/behaviour known?
>>>> if so, is there a better workaround?
>>>> as I'm not an Open MPI user, I don't really know whether it's considered
>>>> acceptable to have this option disabled?
>>>> does the list want more details on this bug?
>>>>
>>>> thanks,
>>>> Guillaume Ranquet.
>>>> Grid5000 support-staff.
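P.S. For the archives, the run-time alternative Jeff mentioned would look roughly like the sketch below on our setup; untested on my side, and the oarsh agent and the machinefile are just what we already use on Grid5000:

<code>
# disable the ptmalloc2 memory manager at run time instead of
# rebuilding with --without-memory-manager, and export the
# variable to the remote nodes with -x
export OMPI_MCA_memory_ptmalloc2_disable=1
mpirun --mca orte_base_help_aggregate 0 --mca plm_rsh_agent oarsh \
       -x OMPI_MCA_memory_ptmalloc2_disable \
       -machinefile nodefile ./mpi_helloworld
</code>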