[OMPI users] MPI_CANCEL for nonblocking collective communication
Dear MPI Users and Maintainers,

I am using Open MPI 1.10.4 with multithread support and Java bindings enabled. I use MPI from Java, with one process per machine and multiple threads per process. I was trying to build a broadcast listener thread which calls MPI_IBCAST followed by MPI_WAIT. I use the request object returned by MPI_IBCAST to shut the listener down, calling MPI_CANCEL for that request from the main thread. This results in

[fe-402-1:2972] *** An error occurred in MPI_Cancel
[fe-402-1:2972] *** reported by process [1275002881,17179869185]
[fe-402-1:2972] *** on communicator MPI_COMM_WORLD
[fe-402-1:2972] *** MPI_ERR_REQUEST: invalid request
[fe-402-1:2972] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fe-402-1:2972] *** and potentially your MPI job)

which indicates that the request is invalid in some fashion. I have already checked that it is not MPI_REQUEST_NULL. I have also set up a simple testbed in which nothing else happens except that one broadcast. The request object is always reported as invalid, no matter where I call cancel() from. As far as I understand the MPI specification, cancel is also supposed to work for collective nonblocking communication (which includes my broadcasts). I haven't found any advice yet, so I hope to find some help on this mailing list.

Kind regards,
Markus Jeromin

PS: Testbed for calling MPI_CANCEL, written in Java.
___

package distributed.mpi;

import java.nio.ByteBuffer;

import mpi.MPI;
import mpi.MPIException;
import mpi.Request;

/**
 * Testing MPI_CANCEL on MPI_iBcast.
 * The program does not terminate because the listeners are still running and
 * waiting for the java native call MPI_WAIT to return. MPI_CANCEL is called,
 * but the listener never unblocks (i.e. MPI_WAIT never returns).
 *
 * @author mjeromin
 */
public class BroadcastTestCancel {

  static int myrank;

  /**
   * Listener that waits for incoming broadcasts from the specified root. Uses
   * asynchronous MPI_iBcast and MPI_WAIT.
   */
  static class Listener extends Thread {
    ByteBuffer b = ByteBuffer.allocateDirect(100);
    public Request req = null;

    @Override
    public void run() {
      super.run();
      try {
        req = MPI.COMM_WORLD.iBcast(b, b.limit(), MPI.BYTE, 0);
        System.out.println(myrank + ": waiting for bcast (that will never come)");
        req.waitFor();
      } catch (MPIException e) {
        e.printStackTrace();
      }
      System.out.println(myrank + ": listener unblocked");
    }
  }

  public static void main(String[] args) throws MPIException, InterruptedException {
    // we need full thread support
    int threadSupport = MPI.InitThread(args, MPI.THREAD_MULTIPLE);
    if (threadSupport != MPI.THREAD_MULTIPLE) {
      System.out.println(myrank + ": no multithread support. Aborting.");
      MPI.Finalize();
      return;
    }

    // disable or enable exceptions, it does not matter at all.
    MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);

    myrank = MPI.COMM_WORLD.getRank();

    // start receiving listeners, but no sender (which would be node 0)
    if (myrank > 0) {
      Listener l = new Listener();
      l.start();

      // let the listener reach waitFor()
      Thread.sleep(5000);

      // call MPI_CANCEL (the matching send will never arrive)
      try {
        l.req.cancel();
      } catch (MPIException e) {
        // depends on error handler
        System.out.println(myrank + ": MPI Exception \n" + e.toString());
      }
    }

    // don't call MPI_FINALIZE too early
    // (not that necessary to wait here, but just to be sure)
    Thread.sleep(15000);

    System.out.println(myrank + ": calling finish");
    MPI.Finalize();
    System.out.println(myrank + ": finished");
  }
}
___
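For reference, here is a minimal sketch in C++ using MPI's C API (my own illustration, not part of the original report) of the same pattern the Java testbed exercises: every non-root rank posts a nonblocking broadcast that the root never matches, then tries to cancel the resulting request. For simplicity the cancel is issued from the same thread rather than from a separate listener thread, and, like the testbed, the program is not expected to terminate cleanly.

// Sketch of the pattern from the Java testbed above: post an MPI_Ibcast on
// the non-root ranks (the root never broadcasts), then try to cancel the
// resulting request. The MPI_Cancel call is what fails with MPI_ERR_REQUEST
// in the report.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[100] = {0};
    MPI_Request req = MPI_REQUEST_NULL;

    if (rank > 0) {
        // Broadcast that will never be matched by the root in this sketch.
        MPI_Ibcast(buf, (int)sizeof(buf), MPI_BYTE, 0, MPI_COMM_WORLD, &req);

        // Attempt to cancel the nonblocking-collective request.
        MPI_Cancel(&req);

        // With MPI_ERRORS_ARE_FATAL the job aborts inside MPI_Cancel; with
        // MPI_ERRORS_RETURN this wait blocks, as described in the report.
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        std::printf("%d: wait returned\n", rank);
    }

    MPI_Finalize();
    return 0;
}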
[OMPI users] Deadlocks and warnings from libevent when using MPI_THREAD_MULTIPLE
Hi everyone,

I'm using the current Open MPI 1.8.1 release and observe non-deterministic deadlocks and warnings from libevent when using MPI_THREAD_MULTIPLE. Open MPI has been configured with --enable-mpi-thread-multiple --with-tm --with-verbs (see attached config.log).

Attached is a sample application that spawns a thread for each process after MPI_Init_thread has been called. The thread then calls MPI_Recv, which blocks until the matching MPI_Send is issued just before MPI_Finalize in the main thread. (AFAIK MPICH uses this kind of facility to implement a progress thread.) Meanwhile the main thread exchanges data with its right/left neighbors via MPI_Isend/MPI_Irecv.

I only see this when the MPI processes run on separate nodes, like in the following:

$ mpiexec -n 2 -map-by node ./test
[0] isend/irecv.
[0] progress thread...
[0] waitall.
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
[1] isend/irecv.
[1] progress thread...
[1] waitall.
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.

Can anybody confirm this?

Best regards,
Markus

--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
http://www.rrze.fau.de/hpc/

info.tar.bz2
Description: Binary data

// Compile with: mpicc test.c -pthread -o test
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void * ProgressThread(void * ptRank)
{
    int buffer = 0xCDEFCDEF;
    int rank = *((int *)ptRank);

    printf("[%d] progress thread...\n", rank);
    /* Blocks until the main thread sends the matching message to itself. */
    MPI_Recv(&buffer, 1, MPI_INT, rank, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char * argv[])
{
    int rank = -1;
    int size = -1;
    int bufferSend = 0;
    int bufferRecv = 0;
    int requested = MPI_THREAD_MULTIPLE;
    int provided = -1;
    int error;
    pthread_t thread;
    MPI_Request requests[2];

    MPI_Init_thread(&argc, &argv, requested, &provided);

    if (requested != provided) {
        printf("error: requested %d != provided %d\n", requested, provided);
        exit(1);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    error = pthread_create(&thread, NULL, &ProgressThread, &rank);
    if (error != 0) {
        fprintf(stderr, "pthread_create failed (%d): %s\n", error, strerror(error));
    }

    printf("[%d] isend/irecv.\n", rank);
    MPI_Isend(&bufferSend, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(&bufferRecv, 1, MPI_INT, (rank - 1 + size) % size, 0, MPI_COMM_WORLD, &requests[1]);

    printf("[%d] waitall.\n", rank);
    /* Wait for both the isend and the irecv to complete. */
    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

    printf("[%d] send.\n", rank);
    MPI_Send(&bufferSend, 1, MPI_INT, rank, 999, MPI_COMM_WORLD);

    error = pthread_join(thread, NULL);
    if (error != 0) {
        fprintf(stderr, "pthread_join failed (%d): %s\n", error, strerror(error));
    }

    printf("[%d] done.\n", rank);
    MPI_Finalize();
    return 0;
}
Re: [OMPI users] Deadlocks and warnings from libevent when using MPI_THREAD_MULTIPLE
On 25.04.2014 23:40, Ralph Castain wrote:

We don't fully support THREAD_MULTIPLE, and most definitely not when using IB. We are planning on extending that coverage in the 1.9 series.

Ah OK, thanks for the fast reply.

--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-20104
markus.wittm...@fau.de
http://www.rrze.fau.de/hpc/
[OMPI users] [R] Short survey concerning the use of software engineering in the field of High Performance Computing
Dear Colleagues,

This is a short survey (21 questions that take about 10 minutes to answer) in the context of the research work for my PhD thesis and the Munich Center of Advanced Computing (Project B2). It would be very helpful if you could take the time to answer my questions concerning the use of software engineering in the field of High Performance Computing. Please note that all questions are mandatory.

http://www.q-set.de/q-set.php?sCode=TCSBHMPZAASZ

Thank you very much, kind regards,
Miriam Schmidberger (Dipl. Medien-Inf.)
schmi...@in.tum.de

Technische Universität München
Institut für Informatik
Boltzmannstr. 3
85748 Garching
Germany
Office 01.07.037
Tel: +49 (89) 289-18226
[OMPI users] Running your MPI application on a Computer Cluster in the Cloud - cloudnumbers.com
Dear MPI users and experts,

cloudnumbers.com provides researchers and companies with the resources to perform high performance calculations in the cloud. As cloudnumbers.com's community manager I would like to invite you to register and test your MPI application on a computer cluster in the cloud for free: http://my.cloudnumbers.com/register

Our aim is to change the way research collaboration is done today by bringing together scientists and businesses from all over the world on a single platform. cloudnumbers.com is a Berlin (Germany) based international high-tech startup striving to enable everyone to benefit from the High Performance Computing related advantages of the cloud. We provide easy access to applications running on any kind of computer hardware: from single-core high-memory machines up to 1000-core computer clusters.

Our platform provides several advantages:

* Turn fixed into variable costs and pay only for the capacity you need. Watch our latest saving costs with cloudnumbers.com video: http://www.youtube.com/watch?v=ln_BSVigUhg&feature=player_embedded
* Enter the cloud using an intuitive and user-friendly platform. Watch our latest cloudnumbers.com in a nutshell video: http://www.youtube.com/watch?v=0ZNEpR_ElV0&feature=player_embedded
* Be released from ongoing technological obsolescence and continuous maintenance costs (e.g. linking to libraries or system dependencies).
* Accelerate your C, C++, Fortran, R, Python, ... calculations through parallel processing and great computing capacity - more than 1000 cores are available and GPUs are coming soon.
* Share your results worldwide (coming soon).
* Get high-speed access to public databases (please let us know if your favorite database is missing!).
* We have developed a security architecture that meets high requirements of data security and privacy. Read our security white paper: http://d1372nki7bx5yg.cloudfront.net/wp-content/uploads/2011/06/cloudnumberscom-security.whitepaper.pdf

This is only a selection of our top features. To get more information, check out our web page (http://www.cloudnumbers.com/) or follow our blog about cloud computing, HPC and HPC applications: http://cloudnumbers.com/blog

Register and test for free now at cloudnumbers.com: http://my.cloudnumbers.com/register

We look forward to your feedback and consumer insights. Take the chance to have an impact on the development of a new cloud computing calculation platform.

Best,
Markus

--
Dr. rer. nat. Markus Schmidberger
Senior Community Manager
Cloudnumbers.com GmbH
Chausseestraße 6
10119 Berlin
www.cloudnumbers.com
E-Mail: markus.schmidber...@cloudnumbers.com

Amtsgericht München, HRB 191138
Geschäftsführer: Erik Muttersbach, Markus Fensterer, Moritz v. Petersdorff-Campen
[OMPI users] open-mpi error
Hello,

I have a problem with MPI. I already looked in the FAQ and searched Google but couldn't find a solution.

To build MPI I used this:

shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

That worked fine so far. I am using DL_POLY, with this makefile:

$(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
  FC="mpif90 -c" FCFLAGS="-O3" \
  EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too. The problem occurs when I want to run a job with

mpiexec -n 4 ./DLPOLY.Z
or
mpirun -n 4 ./DLPOLY.Z

I get this error:

--------------------------------------------------------------------------
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543
markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> sudo mpiexec -n 4 ./DLPOLY.Z
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543

Some information: I use Open MPI 1.4.4, Suse 64-bit, AMD quad-core.

make check gives:
make: *** No rule to make target `check'. Stop.

I attached the ompi_info output. Thanks a lot for your help.

Regards,
Markus

markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> ompi_info --all
Package: Open MPI abuild@build08 Distribution
Open MPI: 1.4.3
Open MPI SVN revision: r23834
Open MPI release date: Oct 05, 2010
Open RTE: 1.4.3
Open RTE SVN revision: r23834
Open RTE release date: Oct 05, 2010
OPAL: 1.4.3
OPAL SVN revision: r23834
OPAL release date: Oct 05, 2010
Ident string: 1.4.3
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.3)
MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.3)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.3)
Prefix: /usr/lib64/mpi/gcc/openmpi
Exec_prefix: /usr/lib64/mpi/gcc/openmpi
Bindir: /usr/lib64/mpi/gcc/openmpi/bin
Sbindir: /usr/lib64/mpi/gcc/openmpi/sbin
Libdir: /usr/lib64/mpi/gcc/openmpi/lib64
Incdir: /usr/lib64/mpi/gcc/openmpi/include
Mandir: /usr/lib64/mpi/gcc/openmpi/share/man
Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi
Libexecdir: /usr/lib64/mpi/gcc/openmpi/lib
Datarootdir: /usr/lib64/mpi/gcc/openmpi/share
Datadir: /usr/lib64/mpi/gcc/openmpi/share
Sysconfdir: /etc
Sharedstatedir: /usr/lib64/mpi/gcc/openmpi/com
Localstatedir: /var
Infodir: /usr/lib64/mpi/gcc/openmpi/share/info
Pkgdatadir: /usr/lib64/mpi/gcc/openmpi/share/openmpi
Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi
Pkgincludedir: /usr/lib64/mpi/gcc/openmpi/include/openmpi
Configured architecture: x86_64-suse-linux-gnu
Configure host: build08
Configured by: abuild
Configured on: Sat Oct 29 15:50:22 UTC 2011
Configure host: build08
Built by: abuild
Built on: Sat Oct 29 16:04:18 UTC 2011
Built host: build08
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 8
C float size: 4
C double size: 8
C pointer size: 8
C char align: 1
C bool align: 1
C int align: 4
C float align: 4
C double align: 8
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: gfortran
Fortran90 compiler abs: /usr/bin/gfortran
Fort integer size: 4
Fort logical size: 4
Fort lo
Re: [OMPI users] open-mpi error
On 11/24/2011 10:08 PM, MM wrote:

Hi, I get the same error while linking against home-built 1.5.4 openmpi libs on win32. I didn't get this error against the prebuilt libs. I see you use Suse. There is probably an openmpi.rpm or openmpi.dpkg already available for Suse which contains the libraries; you could link against those, and that may work.

MM

Hi, thanks for your answer. When I try this (with MPICH) I get problems with DL_POLY itself:

/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpi_f90
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpi_f77
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: cannot find -lopen-rte
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: cannot find -lopen-pal

I do not really know how to get rid of this either ^^
Re: [OMPI users] open-mpi error
Now I rebuilt Open MPI, but I am getting errors like this:

/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_param_find'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `asc_parse'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_param_register_string'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_param_register_int'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lampanic'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_thread_self'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_debug_close'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_CONVERSION_FN_NULL'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_read_at_all'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `sfh_sock_set_buf_size'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `blktype'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_preallocate'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `ao_init'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_mutex_destroy'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_iread_shared'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `al_init'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `stoi'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_hostmap'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_FORTRAN_ERRCODES_IGNORE'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_close'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `al_next'
/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_Register_datarep'
/usr/loca
Re: [OMPI users] open-mpi error
Hi Castain,

> You have some major problems with confused installations of MPIs. First, you cannot compile an application against MPICH and expect to run it with OMPI - the two are not binary compatible. You need to compile against the MPI installation you intend to run against.

I did this, sorry I didn't mention it. I tried MPICH and Open MPI, and in each case I of course compiled against MPICH and Open MPI respectively.

> Second, your errors appear to be because you are not pointing your library path at the OMPI installation, and so the libraries are not being found. You need to set LD_LIBRARY_PATH to include the path to where you installed OMPI. Based on the configure line you give, that would mean ensuring that /opt/mpirun/lib was in that envar. Likewise, /opt/mpirun/bin needs to be in your PATH.

Hmm, I installed Open MPI in the standard location, changed the variables accordingly, and this works now. But now I have the same problem again (the problem why I wrote to you in the first place):

markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> sudo mpirun -n 4 ./DLPOLY.Z
root's password:
[linux-6wa6:05565] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[linux-6wa6:05565] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543

What can I do about this?

Thanks,
Markus

On 11/25/2011 03:42 AM, Ralph Castain wrote:

Hi Markus,

You have some major problems with confused installations of MPIs. First, you cannot compile an application against MPICH and expect to run it with OMPI - the two are not binary compatible. You need to compile against the MPI installation you intend to run against.

Second, your errors appear to be because you are not pointing your library path at the OMPI installation, and so the libraries are not being found. You need to set LD_LIBRARY_PATH to include the path to where you installed OMPI. Based on the configure line you give, that would mean ensuring that /opt/mpirun/lib was in that envar. Likewise, /opt/mpirun/bin needs to be in your PATH.

Once you have those correctly set, and build your app against the appropriate mpicc, you should be able to run.

BTW: your last message indicates that you built against an old LAM MPI, so you appear to have some pretty old software laying around. Perhaps cleaning out some of the old MPI installations would help.
[OMPI users] Problems with btl openib and MPI_THREAD_MULTIPLE
Hello,

I've compiled Open MPI 1.6.3 with --enable-mpi-thread-multiple --with-tm --with-openib --enable-opal-multi-threads.

When I use, for example, the pingpong benchmark from the Intel MPI Benchmarks, which calls MPI_Init, the btl openib is used and everything works fine. When instead the benchmark calls MPI_Init_thread with MPI_THREAD_MULTIPLE as the requested threading level, the btl openib fails to load but gives no further hints about the reason:

mpirun -v -n 2 -npernode 1 -gmca btl_base_verbose 200 ./imb-tm-openmpi-ts pingpong
...
[l0519:08267] select: initializing btl component openib
[l0519:08267] select: init of component openib returned failure
[l0519:08267] select: module openib unloaded
...

The question is now: is currently just the support for MPI_THREAD_MULTIPLE missing in the openib module, or are there other errors occurring, and if so, how can they be identified?

Attached are the config.log from the Open MPI build, the ompi_info output and the output of the IMB pingpong benchmarks. The system used were two nodes with:

- OpenFabrics 1.5.3
- CentOS release 5.8 (Final)
- Linux Kernel 2.6.18-308.11.1.el5 x86_64
- OpenSM 3.3.3

[l0519] src > ibv_devinfo
hca_id: mlx4_0
  transport: InfiniBand (0)
  fw_ver: 2.7.000
  node_guid: 0030:48ff:fff6:31e4
  sys_image_guid: 0030:48ff:fff6:31e7
  vendor_id: 0x02c9
  vendor_part_id: 26428
  hw_ver: 0xB0
  board_id: SM_212201000
  phys_port_cnt: 1
  port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 48
    port_lid: 278
    port_lmc: 0x00

Thanks for the help in advance.

Regards,
Markus

--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-20104
markus.wittm...@fau.de
http://www.rrze.fau.de/hpc/

imb.txt.bz2
Description: application/bzip
imb-tm.txt.bz2
Description: application/bzip
ompi_info.txt.bz2
Description: application/bzip
config.log.bz2
Description: application/bzip
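As an illustration (my sketch, not part of the original message), the initialization path in question boils down to requesting MPI_THREAD_MULTIPLE at MPI_Init_thread. With this build the thread level is actually granted and it is the openib BTL that quietly steps aside, as the reply below explains, so checking the provided level will not flag anything; the sketch only documents the call that triggers the behaviour.

// Sketch of the initialization path described above: request
// MPI_THREAD_MULTIPLE and inspect the granted level. The fallback branch is
// illustrative only and does not detect a disabled BTL.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        // A real application could fall back to MPI_THREAD_FUNNELED logic here.
        std::printf("requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
    }

    MPI_Finalize();
    return 0;
}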
Re: [OMPI users] Problems with btl openib and MPI_THREAD_MULTIPLE
Hi,

OK, that makes it clear. Thank you for the fast response.

Regards,
Markus

On 07.11.2012 13:49, Iliev, Hristo wrote:

Hello, Markus,

The openib BTL component is not thread-safe. It disables itself when the thread support level is MPI_THREAD_MULTIPLE. See this rant from one of my colleagues: http://www.open-mpi.org/community/lists/devel/2012/10/11584.php

A message is shown, but only if the library was compiled with developer-level debugging.

Open MPI guys, could the debug-level message in btl_openib_component.c:btl_openib_component_init() be replaced by a help text, e.g. the same way that the help text about the amount of registerable memory not being enough is shown? It looks like the case of openib being disabled for no apparent reason when MPI_THREAD_MULTIPLE is in effect is not isolated to our users only. Or at least could you put somewhere in the FAQ an explicit statement that openib is not only not thread-safe, but that it will disable itself in a multithreaded environment.

Kind regards,
Hristo

--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241 80 24367 -- Fax/UMS: +49 241 80 624367

--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-20104
markus.wittm...@fau.de
http://www.rrze.fau.de/hpc/
Re: [OMPI users] MPI and C++ - now Send and Receive of Classes and STL containers
Hi,

On Mon, Jul 06, 2009 at 03:24:07PM -0400, Luis Vitorio Cargnini wrote:
> Thanks, but I really do not want to use Boost.
> Is it easier? Certainly it is, but I want to do it using only MPI itself
> and not be dependent on a library, or templates like the majority of
> Boost, a huge set of templates and wrappers for different libraries,
> implemented in C, supplying a wrapper for C++.
> I admit Boost is a valuable tool, but in my case, the more independent I
> can be from additional libs, the better.

If you do not want to use Boost, then I suggest not using nested vectors but just ones that contain PODs as value_type (or even C-arrays). If you insist on using complicated containers you will end up writing your own MPI-C++ abstraction (resulting in a library). This will be a lot of (unnecessary and hard) work.

Just my 2 cents.

Cheers,
Markus
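To make the suggestion above concrete, here is a small sketch (mine, not from the thread) of sending a flat std::vector of PODs directly with the MPI C API; a nested vector<vector<double>> has no single contiguous buffer and would first have to be flattened or described with derived datatypes.

// Sketch: sending a flat std::vector of PODs with plain MPI calls, as the
// advice above recommends over nested containers.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> data(100, 0.0);

    if (rank == 0) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = double(i);
        // data.data() points to contiguous storage of PODs, so it can be
        // handed to MPI_Send directly.
        MPI_Send(data.data(), (int)data.size(), MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data.data(), (int)data.size(), MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("received %zu doubles, last = %g\n", data.size(), data.back());
    }

    MPI_Finalize();
    return 0;
}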
[OMPI users] Problem with cascading derived data types
Hi,

In one of my applications I am using cascaded derived MPI datatypes created with MPI_Type_struct. One of these types is used to send just a part (one MPI_CHAR) of a struct consisting of an int followed by two chars, i.e. the int at the beginning is/should be ignored. This works fine if I use this data type on its own.

Unfortunately I need to send another struct that contains an int and the int-char-char struct from above. Again I construct a custom MPI data type for this. When sending this cascaded data type, it seems that the offset of the char in the inner custom type is disregarded on the receiving end, and the received data ('1') is stored in the first int instead of the following char.

I have tested this code with both LAM and MPICH. There it worked as expected (saving the '1' in the first char). The last two lines of the output of the attached test case read

received global=10 attribute=0 (local=1 public=0)
received attribute=1 (local=100 public=0)

for Open MPI instead of

received global=10 attribute=1 (local=100 public=0)
received attribute=1 (local=100 public=0)

for LAM and MPICH. The same problem occurs with version 1.3-2 of Open MPI.

Am I doing something completely wrong or have I accidentally found a bug?

Cheers,
Markus

#include "mpi.h"
#include <iostream>

struct LocalIndex
{
  int local_;
  char attribute_;
  char public_;
};

struct IndexPair
{
  int global_;
  LocalIndex local_;
};

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if(size<2)
  {
    std::cerr<<"no procs has to be >2"<
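Since the attached test case is cut off above, here is a hedged sketch (my reconstruction, not the poster's code, which used the older MPI_Type_struct interface) of one common way to build the two types being described with MPI_Get_address and MPI_Type_create_struct: an inner type that transfers only attribute_ but whose extent covers the whole LocalIndex, nested inside an outer type for IndexPair.

// Hedged reconstruction of the kind of nested datatype described above.
// The struct and member names follow the truncated test case; the
// construction itself is a generic sketch.
#include <mpi.h>

struct LocalIndex { int local_; char attribute_; char public_; };
struct IndexPair  { int global_; LocalIndex local_; };

static void buildPairType(MPI_Datatype* pairType)
{
    // Inner type: a single MPI_CHAR at the offset of attribute_ inside
    // LocalIndex; the leading int is skipped.
    LocalIndex li;
    MPI_Aint base, addr;
    MPI_Get_address(&li, &base);
    MPI_Get_address(&li.attribute_, &addr);

    int blocklen[1] = { 1 };
    MPI_Aint disp[1] = { addr - base };
    MPI_Datatype types[1] = { MPI_CHAR };
    MPI_Datatype attrType, attrTypeResized;
    MPI_Type_create_struct(1, blocklen, disp, types, &attrType);
    // Make the extent span the whole struct so the skipped members are still
    // accounted for when this type is nested inside another one.
    MPI_Type_create_resized(attrType, 0, sizeof(LocalIndex), &attrTypeResized);

    // Outer type: the global_ int followed by the partial LocalIndex member,
    // each at its real offset inside IndexPair.
    IndexPair ip;
    MPI_Aint pbase, d0, d1;
    MPI_Get_address(&ip, &pbase);
    MPI_Get_address(&ip.global_, &d0);
    MPI_Get_address(&ip.local_, &d1);

    int pblock[2] = { 1, 1 };
    MPI_Aint pdisp[2] = { d0 - pbase, d1 - pbase };
    MPI_Datatype ptypes[2] = { MPI_INT, attrTypeResized };
    MPI_Type_create_struct(2, pblock, pdisp, ptypes, pairType);
    MPI_Type_commit(pairType);

    // The intermediate types are no longer needed once pairType is committed.
    MPI_Type_free(&attrType);
    MPI_Type_free(&attrTypeResized);
}

A send of one IndexPair built this way would then simply be MPI_Send(&pair, 1, *pairType, dest, tag, comm), with the skipped members left untouched on the receiving side.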
[OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
d8cb69 in mca_pml_ob1_recv () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
#1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
#1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
#1356867 0xb7f3db6c in PMPI_Reduce () from /home/johnm/local/ompi/lib/libmpi.so.0
#1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
--- snip ---

I poked around in the code, and it looks like the culprit might be in the macros that try to allocate fragments in mca_pml_ob1_recv_frag_match: MCA_PML_OB1_RECV_FRAG_ALLOC and MCA_PML_OB1_RECV_FRAG_INIT use OMPI_FREE_LIST_WAIT, which again can end up calling opal_condition_wait(). opal_condition_wait() calls opal_progress() to "block", which looks like it leads to infinite recursion in this case.

I guess the problem is a race condition when one node is hammered with incoming packets. The stack trace contains about 1.35 million lines, so I won't include all of it here, but here are some statistics to verify that not much else is happening in that stack (I can make the full trace available if anybody needs it):

--- snip ---
Number of callframes: 1356870

Called function statistics (how often in stackdump):
PMPI_Reduce                                     1
_int_malloc                                     1
main                                            1
malloc                                          1
mca_btl_tcp_endpoint_recv_handler          339197
mca_pml_ob1_recv                                1
mca_pml_ob1_recv_frag_match                    72
ompi_coll_tuned_reduce_intra_basic_linear       1
ompi_coll_tuned_reduce_intra_dec_fixed          1
ompi_free_list_grow                             1
opal_event_base_loop                       339197
opal_event_loop                            339197
opal_progress                              339197
sysconf                                         2

Address statistics (how often in stackdump), plus functions with that addr (sanity check):
0x00434184      2 set(['sysconf'])
0x0804899e      1 set(['main'])
0xb7d55913      1 set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
0xb7d5bb1c      1 set(['ompi_coll_tuned_reduce_intra_basic_linear'])
0xb7d74a7d     72 set(['mca_btl_tcp_endpoint_recv_handler'])
0xb7d74e70      1 set(['mca_btl_tcp_endpoint_recv_handler'])
0xb7d74f08 339124 set(['mca_btl_tcp_endpoint_recv_handler'])
0xb7d8cb69      1 set(['mca_pml_ob1_recv'])
0xb7d8f389     72 set(['mca_pml_ob1_recv_frag_match'])
0xb7e5d284 339197 set(['opal_progress'])
0xb7e62b44 339197 set(['opal_event_base_loop'])
0xb7e62cff 339197 set(['opal_event_loop'])
0xb7e78b59      1 set(['_int_malloc'])
0xb7e799ce      1 set(['malloc'])
0xb7f04852      1 set(['ompi_free_list_grow'])
0xb7f3db6c      1 set(['PMPI_Reduce'])
--- snip ---

I don't have any suggestions for a fix though, since this is the first time I've looked into the OpenMPI code.

Btw, in case it makes a difference for triggering the bug: I'm running this on a cluster with 1 frontend and 44 nodes. The cluster runs Rocks 4.1, and each of the nodes is a 3.2 GHz P4 Prescott machine with 2 GB RAM, connected with gigabit Ethernet.

Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/
Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
Hi, and thanks for the feedback everyone.

George Bosilca wrote:

Brian is completely right. Here is a more detailed description of this problem. [...] On the other side, I hope that not many users write such applications. This is the best way to completely kill the performance of any MPI implementation, by overloading one process with messages. This is exactly what MPI_Reduce and MPI_Gather do: one process will get the final result and all other processes only have to send some data. This behavior only arises when the gather or the reduce use a very flat tree, and only for short messages. Because of the short messages there is no handshake between the sender and the receiver, which will make all messages unexpected, and the flat tree guarantees that there will be a lot of small messages. If you add a barrier every now and then (every 100 iterations) this problem will never happen.

I have done some more testing. Of the tested parameters, I'm observing this behaviour with group sizes from 16-44, and from 1 to 32768 integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes 16-44 and from 1 to 4096 integers (per node). In other words, it actually happens with other tree configurations and larger packet sizes :-/

By the way, I'm also observing crashes with MPI_Broadcast (groups of size 4-44 with the root process (rank 0) broadcasting integer arrays of size 16384 and 32768). It looks like the root process is crashing. Can a sender crash because it runs out of buffer space as well?

-- snip --
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
-- snip --

One more thing: doing a lot of collectives in a loop and computing the total time is not the correct way to evaluate the cost of any collective communication, simply because you will favor all algorithms based on pipelining. There is plenty of literature about this topic.

george.

As I said in the original e-mail: I had only thrown them in for a bit of sanity checking. I expected funny numbers, but not that OpenMPI would crash. The original idea was just to make a quick comparison of Allreduce, Allgather and Alltoall in LAM and OpenMPI. The opportunity for pipelining the operations there is rather small since they can't get much out of phase with each other.

Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/
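For illustration (my sketch, not the poster's ompi-crash benchmark), the pattern under discussion and the suggested workaround look roughly like this: many back-to-back small reductions toward rank 0, with an occasional barrier to keep the non-root ranks from flooding the root with unexpected messages. The counts and iteration numbers below are illustrative only.

// Sketch of many back-to-back MPI_Reduce calls toward rank 0, throttled by
// an MPI_Barrier every ~100 iterations as suggested in the reply above.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 16384;   // integers per rank, similar to the report
    const int iters = 3000;
    std::vector<int> in(count, rank), out(count, 0);

    for (int i = 0; i < iters; ++i) {
        MPI_Reduce(in.data(), out.data(), count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (i % 100 == 99) {
            MPI_Barrier(MPI_COMM_WORLD);  // keep senders from running far ahead of the root
        }
    }

    MPI_Finalize();
    return 0;
}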
Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
George Bosilca wrote:

[...] I don't think the root crashed. I guess that one of the other nodes crashed, the root got a bad socket (which is what the first error message seems to indicate), and got terminated. As the output is not synchronized between the nodes, one cannot rely on its order nor its contents. Moreover, mpirun reports that the root was killed with signal 15, which is how we clean up the remaining processes when we detect that something really bad (like a seg fault) happened in the parallel application.

Sorry, I should have rephrased that as a question ("is it the root?"). I'm not that familiar with the debug output of OpenMPI yet, so I included it in case somebody made more sense of it than me.

There are many differences between the routed and non-routed collectives. All errors that you reported so far are related to rooted collectives, which makes sense. I didn't state that it is normal that Open MPI do not behave [sic]. I wonder if you can get such errors with non-rooted collectives (such as allreduce, allgather and alltoall), or with messages larger than the eager size?

You're right, I haven't seen any crashes with the All*-variants. The TCP eager limit is set to 65536 (output from ompi_info):

MCA btl: parameter "btl_tcp_eager_limit" (current value: "65536")
MCA btl: parameter "btl_tcp_min_send_size" (current value: "65536")
MCA btl: parameter "btl_tcp_max_send_size" (current value: "131072")

I observed crashes with Broadcasts and Reduces of 131072 bytes. I'm playing around with larger messages now, and while Reduce with 16 nodes seems stable at 262144-byte messages, it still crashes with 44 nodes.

If you type "ompi_info --param btl tcp", you will see what the eager size for the TCP BTL is. Everything smaller than this size will be sent eagerly; it can become unexpected on the receiver side and can lead to this problem. As a quick test, you can add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and this problem will not happen for sizes over 2K. This was the original solution for the flow control problem. If you know your application will generate thousands of unexpected messages, then you should set the eager limit to zero.

I tried running Reduce with 4096 ints (16384 bytes), 16 nodes and eager limit 2048:

mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 2048 ./ompi-crash 4096 2 3000
{ 'groupsize' : 16, 'count' : 4096, 'bytes' : 16384, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 2
[compute-2-2][0,1,10][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] [compute-3-2][0,1,14][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 30407 on node compute-0-0 exited on signal 15 (Terminated).
15 additional processes aborted (not shown)

This one tries to run Reduce with 1 integer per node and also crashes (with eager size 0):

mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 0 ./ompi-crash 1 2 3000
...

This is puzzling. I'm mostly familiarizing myself with OpenMPI at the moment, as well as poking around to see how the collective operations work and perform compared to LAM. Partly because I have some ideas I'd like to test out, and partly because I'm considering moving some student exercises over from LAM to OpenMPI.

I don't expect to write actual applications that treat MPI like this myself, but on the other hand, not having to do throttling on top of MPI could be an advantage in some application patterns.

Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/