Re: [OMPI users] Do MPI calls ever sleep?
On Jul 21, 2010, at 2:54 PM CDT, Jed Brown wrote: > On Wed, 21 Jul 2010 15:20:24 -0400, David Ronis wrote: >> Hi Jed, >> >> Thanks for the reply and suggestion. I tried adding -mca >> yield_when_idle 1 (and later mpi_yield_when_idle 1 which is what >> ompi_info reports the variable as) but it seems to have had 0 effect. >> My master goes into fftw planning routines for a minute or so (I see the >> threads being created), but the overall usage of the slaves remains >> close to 100% during this time. Just to be sure, I put the slaves into >> a MPI_Barrier(MPI_COMM_WORLD) while they were waiting for the fftw >> planner to finish. It also didn't help. > > They still spin (instead of using e.g. select()), but call sched_yield() > so should only be actively spinning when nothing else is trying to run. > Are you sure that the planner is always running in parallel? What OS > and OMPI version are you using? sched_yield doesn't work as expected in late 2.6 Linux kernels: http://kerneltrap.org/Linux/CFS_and_sched_yield If this scheduling behavior change is affecting you, you might be able to fix it with: echo "1" >/proc/sys/kernel/sched_compat_yield -Dave
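A minimal, application-side sketch of another way to keep idle ranks off the CPU: poll with MPI_Iprobe and nap between polls instead of blocking in MPI_Recv/MPI_Barrier. This is a different technique from yield_when_idle or the kernel scheduler tweak above, and the function and tag names here are illustrative only.

--8<--
#include <unistd.h>   /* usleep */
#include <mpi.h>

/* Wait politely for a message from rank 0 instead of spinning in a
 * blocking call: poll, then sleep ~1 ms between polls. */
void polite_wait_for_master(MPI_Comm comm, int tag)
{
    int flag = 0;
    MPI_Status status;

    while (!flag) {
        MPI_Iprobe(0, tag, comm, &flag, &status);
        if (!flag)
            usleep(1000);   /* keeps the core mostly idle */
    }
    /* a matching MPI_Recv(...) goes here once a message is pending */
}
--8<--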
Re: [OMPI users] Hair depleting issue with Ompi143 and one program
I can't speak to what OMPI might be doing to your program, but I have a few suggestions for looking into the Valgrind issues. Valgrind's "--track-origins=yes" option is usually helpful for figuring out where the uninitialized values came from. However, if I understand you correctly and if you are correct in your assumption that _mm_setzero_ps is not actually zeroing your xEv variable for some reason, then this option will unhelpfully tell you that it was caused by a stack allocation at the entrance to the function where the variable is declared. But it's worth turning on because it's easy to do and it might show you something obvious that you are missing. The next thing you can do is disable optimization when building your code in case GCC is taking a shortcut that is either incorrect or just doesn't play nicely with Valgrind. Valgrind might run pretty slow though, because -O0 code can be really verbose and slow to check. After that, if you really want to dig in, you can try reading the assembly code that is generated for that _mm_setzero_ps line. The easiest way is to pass "-save-temps" to gcc and it will keep a copy of "sourcefile.s" corresponding to "sourcefile.c". Sometimes "-fverbose-asm" helps, sometimes it makes things harder to follow. And the last semi-desperate step is to dig into what Valgrind thinks is going on. You'll want to read up on how memcheck really works [1] before doing this. Then read up on client requests [2,3]. You can then use the VALGRIND_GET_VBITS client request on your xEv variable in order to see which parts of the variable Valgrind thinks are undefined. If the vbits don't match with what you expect, there's a chance that you might have found a bug in Valgrind itself. It doesn't happen often, but the SSE code can be complicated and isn't exercised as often as the non-vector portions of Valgrind. Good luck, -Dave [1] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.machine [2] http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq [3] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs On Jan 20, 2011, at 5:07 PM CST, David Mathog wrote: > I have been working on slightly modifying a software package by Sean > Eddy called Hmmer 3. The hardware acceleration was originally SSE2 but > since most of our compute nodes only have SSE1 and MMX I rewrote a few > small sections to just use those instructions. (And yes, as far as I > can tell it invokes emms before any floating point operations are run > after each MMX usage.) On top of that each binary has 3 options for > running the programs: single threaded, threaded, or MPI (using > Ompi143). For all other programs in this package everything works > everywhere. For one called "jackhmmer" this table results (+=runs > correctly, - = problems), where the exact same problem is run in each > test (theoretically exercising exactly the same routines, just under > different threading control): > > SSE2 SSE1 > Single + + > Threaded+ + > Ompi143 + - > > The negative result for the SSE/Ompi143 combination happens whether the > worker nodes are Athlon MP (SSE1 only) or Athlon64. The test machine > for the single and threaded runs is a two CPU Opteron 280 (4 cores > total). Ompi143 is 32 bit everywhere (local copies though). There have > been no modifications whatsoever made to the main jackhmmer.c file, > which is where the various run methods are implemented. 
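A minimal sketch of the VALGRIND_GET_VBITS client request described above, assuming Valgrind's development headers are installed; the variable name xEv is borrowed from the thread and the rest is illustrative. Build with something like "gcc -msse -g" and run under valgrind to see which bytes memcheck considers undefined (a set bit means "undefined").

--8<--
#include <stdio.h>
#include <string.h>
#include <xmmintrin.h>          /* _mm_setzero_ps, __m128 */
#include <valgrind/memcheck.h>  /* VALGRIND_GET_VBITS */

int main(void)
{
    __m128 xEv = _mm_setzero_ps();
    unsigned char vbits[sizeof(xEv)];
    unsigned int i;

    memset(vbits, 0, sizeof(vbits));

    /* Ask memcheck for the validity bits of xEv; the request returns 0
     * when the program is not running under Valgrind. */
    int rc = VALGRIND_GET_VBITS(&xEv, vbits, sizeof(xEv));
    printf("GET_VBITS rc=%d, vbits:", rc);
    for (i = 0; i < sizeof(xEv); i++)
        printf(" %02x", vbits[i]);
    printf("\n");
    return 0;
}
--8<--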
> > Now if there was some intrinsic problem with my SSE1 code it should > presumably manifest in both the Single and Threaded versions as well > (the thread control is different, but they all feed through the same > underlying functions), or in one of the other programs, which isn't > seen. Running under valgrind using Single or Threaded produces no > warnings. Using mpirun with valgrind on the SSE2 produces 3: two > related to OMPI itself which are seen in every OMPI program run in > valgrind, and one caused by an MPIsend operation where the buffer > contains some uninitialized data (this is nothing toxic, just bytes in > fixed length fields which which were never set because a shorter string > is stored there). > > ==19802== Syscall param writev(vector[...]) points to uninitialised byte(s) > ==19802==at 0x4C77AC1: writev (in /lib/libc-2.10.1.so) > ==19802==by 0x8A069B5: mca_btl_tcp_frag_send (in > /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so) > ==19802==by 0x8A0626E: mca_btl_tcp_endpoint_send (in > /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so) > ==19802==by 0x8A01ADC: mca_btl_tcp_send (in > /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so) > ==19802==by 0x7FA24A9: mca_pml_ob1_send_request_start_prepare (in > /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so) > ==19802==by 0x7F98443: mca_pml_ob1_send (in > /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so) > ==19802==by 0x4A8530F: PMPI_Send (in > /opt/ompi143.X32/lib/libmpi.so.0.0.2) > =
Re: [OMPI users] Deadlock with mpi_init_thread + mpi_file_set_view
FWIW, we solved this problem with ROMIO in MPICH2 by making the "big global lock" a recursive mutex. In the past it was implicitly so because of the way that recursive MPI calls were handled. In current MPICH2 it's explicitly initialized with type PTHREAD_MUTEX_RECURSIVE instead. -Dave On Apr 4, 2011, at 9:28 AM CDT, Ralph Castain wrote: > > On Apr 4, 2011, at 8:18 AM, Rob Latham wrote: > >> On Sat, Apr 02, 2011 at 04:59:34PM -0400, fa...@email.com wrote: >>> >>> opal_mutex_lock(): Resource deadlock avoided >>> #0 0x0012e416 in __kernel_vsyscall () >>> #1 0x01035941 in raise (sig=6) at >>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64 >>> #2 0x01038e42 in abort () at abort.c:92 >>> #3 0x00d9da68 in ompi_attr_free_keyval (type=COMM_ATTR, key=0xbffda0e4, >>> predefined=0 '\000') at attribute/attribute.c:656 >>> #4 0x00dd8aa2 in PMPI_Keyval_free (keyval=0xbffda0e4) at pkeyval_free.c:52 >>> #5 0x01bf3e6a in ADIOI_End_call (comm=0xf1c0c0, keyval=10, >>> attribute_val=0x0, extra_state=0x0) at ad_end.c:82 >>> #6 0x00da01bb in ompi_attr_delete. (type=UNUSED_ATTR, object=0x6, >>> attr_hash=0x2c64, key=14285602, predefined=232 '\350', need_lock=128 >>> '\200') at attribute/attribute.c:726 >>> #7 0x00d9fb22 in ompi_attr_delete_all (type=COMM_ATTR, object=0xf1c0c0, >>> attr_hash=0x8d0fee8) at attribute/attribute.c:1043 >>> #8 0x00dbda65 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:133 >>> #9 0x00dd12c2 in PMPI_Finalize () at pfinalize.c:46 >>> #10 0x00d6b515 in mpi_finalize_f (ierr=0xbffda2b8) at pfinalize_f.c:62 >> >> I guess I need some OpenMPI eyeballs on this... >> >> ROMIO hooks into the attribute keyval deletion mechanism to clean up >> the internal data structures it has allocated. I suppose since this >> is MPI_Finalize, we could just leave those internal data structures >> alone and let the OS deal with it. >> >> What I see happening here is the OpenMPI finalize routine is deleting >> attributes. one of those attributes is ROMIO's, which in turn tries >> to free keyvals. Is the deadlock that noting "under" ompi_attr_delete >> can itself call ompi_* routines? (as ROMIO triggers a call to >> ompi_attr_free_keyval) ? >> >> Here's where ROMIO sets up the keyval and the delete handler: >> https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/romio/mpi-io/mpir-mpioinit.c#L39 >> >> that routine gets called upon any "MPI-IO entry point" (open, delete, >> register-datarep). The keyvals help ensure that ROMIO's internal >> structures get initialized exactly once, and the delete hooks help us >> be good citizens and clean up on exit. > > FWIW: his trace shows that OMPI incorrectly attempts to acquire a thread lock > that has already been locked. This occurs in OMPI's attribute code, probably > surrounding the call to your code. > > In other words, it looks to me like the problem is on our side, not yours. > Jeff is the one who generally handles the attribute code, though, so I'll > ping his eyeballs :-) > > >> >> ==rob >> >> -- >> Rob Latham >> Mathematics and Computer Science Division >> Argonne National Lab, IL USA >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
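For readers unfamiliar with recursive mutexes, here is a minimal sketch of the initialization being described, using plain POSIX threads; the lock name is illustrative, not MPICH2's actual symbol.

--8<--
#include <pthread.h>

static pthread_mutex_t big_global_lock;

static void init_big_global_lock(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* Recursive type: the owning thread may re-acquire the lock, e.g. when
     * ROMIO's attribute delete callback re-enters MPI during MPI_Finalize. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&big_global_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
--8<--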
Re: [OMPI users] data types and alignment to word boundary
On Jun 29, 2011, at 10:56 AM CDT, Jeff Squyres wrote:

> There's probably an alignment gap between the short and char array, and
> possibly an alignment gap between the char array and the double array
> (depending on the value of SHORT_INPUT and your architecture).
>
> So for your displacements, you should probably actually measure what the
> displacements are instead of using sizeof(short), for example.
>
> tVStruct foo;
> aiDispsT5[0] = 0;
> aiDispsT5[1] = ((char*) &(foo.sCapacityFile) - (char*) &foo);

There's a C-standard "offsetof" macro for this calculation. Using it instead of the pointer math above greatly improves readability: http://en.wikipedia.org/wiki/Offsetof

So the second line becomes:

8<
aiDispsT5[1] = offsetof(tVStruct, sCapacityFile);
8<

-Dave
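A slightly fuller sketch of the offsetof approach for building the struct datatype; the field names and sizes here are made up, since the original tVStruct definition isn't shown in the thread.

--8<--
#include <stddef.h>   /* offsetof */
#include <mpi.h>

/* Hypothetical stand-in for the poster's tVStruct. */
typedef struct {
    short  sCount;
    char   acCapacityFile[13];   /* odd length on purpose, to force padding */
    double adValues[4];
} tVStruct;

void build_tvstruct_type(MPI_Datatype *newtype)
{
    int          aiBlockLens[3] = { 1, 13, 4 };
    MPI_Datatype aiTypes[3]     = { MPI_SHORT, MPI_CHAR, MPI_DOUBLE };
    MPI_Aint     aiDisps[3]     = {
        offsetof(tVStruct, sCount),
        offsetof(tVStruct, acCapacityFile),
        offsetof(tVStruct, adValues)
    };

    /* offsetof already accounts for any alignment gaps the compiler inserts. */
    MPI_Type_create_struct(3, aiBlockLens, aiDisps, aiTypes, newtype);
    MPI_Type_commit(newtype);
}
--8<--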
Re: [OMPI users] MPI defined macro
This has been discussed previously in the MPI Forum: http://lists.mpi-forum.org/mpi-forum/2010/11/0838.php I think it resulted in this proposal, but AFAIK it was never pushed forward by a regular attendee of the Forum: https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ReqPPMacro -Dave On Aug 23, 2011, at 6:59 AM CDT, Jeff Squyres wrote: > I unfortunately won't be at the next Forum meeting, but you might want to ask > someone to bring it up for you. > > It might not give you exactly what you want, however, because not all > platforms have "mpicc" (or similar) wrapper compilers. I.e., to compile an > MPI application on some platforms, you just "cc ... -lmpi". Hence, there's > no way for the compiler to know whether to #define MPI or not. > > Such a macro *could* be added to mpi.h (but not Fortran), but then you > wouldn't get at least one of the use cases that you (assumedly :-) ) want: > > #if MPI > #include > #endif > > > On Aug 23, 2011, at 7:46 AM, Gabriele Fatigati wrote: > >> Can I suggest to insert this macro in next MPI 3 standard? >> >> I think It's very useful. >> >> 2011/8/23 Jeff Squyres >> I'm afraid not. Sorry! :-( >> >> We have the OPEN_MPI macro -- it'll be defined to 1 if you compile with Open >> MPI, but that doesn't really help your portability issue. :-\ >> >> On Aug 23, 2011, at 5:19 AM, Gabriele Fatigati wrote: >> >>> Dear OpenMPi users, >>> >>> is there some portable MPI macro to check if a code is compiled with MPI >>> compiler? Something like _OPENMP for OpenMP codes: >>> >>> #ifdef _OPENMP >>> >>> >>> >>> #endif >>> >>> >>> it exist? >>> >>> #ifdef MPI >>> >>> >>> >>> >>> #endif >>> >>> Thanks >>> >>> -- >>> Ing. Gabriele Fatigati >>> >>> HPC specialist >>> >>> SuperComputing Applications and Innovation Department >>> >>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy >>> >>> www.cineca.itTel: +39 051 6171722 >>> >>> g.fatigati [AT] cineca.it >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> >> -- >> Ing. Gabriele Fatigati >> >> HPC specialist >> >> SuperComputing Applications and Innovation Department >> >> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy >> >> www.cineca.itTel: +39 051 6171722 >> >> g.fatigati [AT] cineca.it >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
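Since no standard macro exists, the closest portable pattern today is to define one yourself in the build system and combine it with implementation-specific macros such as OPEN_MPI where needed. A sketch, where HAVE_MPI is a user-chosen name passed via -DHAVE_MPI to the compiler, not anything provided by MPI itself:

--8<--
/* Compile MPI builds with: mpicc -DHAVE_MPI ... (or cc -DHAVE_MPI ... -lmpi) */
#ifdef HAVE_MPI
#include <mpi.h>
#endif

void report_rank(void)
{
#ifdef HAVE_MPI
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#ifdef OPEN_MPI
    /* Open MPI-specific code could go here; OPEN_MPI comes from mpi.h. */
#endif
    (void)rank;
#endif
}
--8<--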
Re: [OMPI users] possible bug exercised by mpi4py
On May 24, 2012, at 10:22 AM CDT, Jeff Squyres wrote: > I read it to be: reduce the data in the local group, scatter the results to > the remote group. > > As such, the reduce COUNT is sum(recvcounts), and is used for the reduction > in the local group. Then use recvcounts to scatter it to the remote group. > > ...right? > right. -Dave
Re: [OMPI users] possible bug exercised by mpi4py
On May 24, 2012, at 10:57 AM CDT, Lisandro Dalcin wrote: > On 24 May 2012 12:40, George Bosilca wrote: > >> I don't see much difference with the other collective. The generic behavior >> is that you apply the operation on the local group but the result is moved >> into the remote group. > > Well, for me this one DO IS different (for example, SCATTER is > unidirectional for intercomunicators, but REDUCE_SCATTER is > bidirectional). The "recvbuff" is a local buffer, but you understand > "recvcounts" as remote. > > Mmm, the standard is really confusing in this point... Don't think of it like an intercommunicator-scatter, think of it more like an intercommunicator-allreduce. The allreduce is also bidirectional. The only difference is that instead of an allreduce (logically reduce+bcast), you instead have a reduce_scatter (logically reduce+scatterv). -Dave
Re: [OMPI users] possible bug exercised by mpi4py
On May 24, 2012, at 8:13 PM CDT, Jeff Squyres wrote: > On May 24, 2012, at 11:57 AM, Lisandro Dalcin wrote: > >> The standard says this: >> >> "Within each group, all processes provide the same recvcounts >> argument, and provide input vectors of sum_i^n recvcounts[i] elements >> stored in the send buffers, where n is the size of the group" >> >> So, I read " Within each group, ... where n is the size of the group" >> as being the LOCAL group size. > > Actually, that seems like a direct contradiction with the prior sentence: > > If comm is an intercommunicator, then the result of the reduction of the data > provided by processes in one group (group A) is scattered among processes in > the other group (group B), and vice versa. > > It looks like the implementors of 2 implementations agree that recvcounts > should be the size of the remote group. Sounds like this needs to be brought > up in front of the Forum... So I take back my prior "right". Upon further inspection of the text and the MPICH2 code I believe it to be true that the number of the elements in the recvcounts array must be equal to the size of the LOCAL group. The text certainly could use a bit of clarification. I'll bring it up at the meeting next week. -Dave
Re: [OMPI users] possible bug exercised by mpi4py
On May 24, 2012, at 10:34 PM CDT, George Bosilca wrote: > On May 24, 2012, at 23:18, Dave Goodell wrote: > >> So I take back my prior "right". Upon further inspection of the text and >> the MPICH2 code I believe it to be true that the number of the elements in >> the recvcounts array must be equal to the size of the LOCAL group. > > This is quite illogical, but it will not be the first time the standard is > lacking some. So, if I understand you correctly, in the case of an > intercommunicator a process doesn't know how much data it has to reduce, at > least not until it receives the array of recvcounts from the remote group. > Weird! No, it knows because of the restriction that $sum_i^n{recvcounts[i]}$ yields the same sum in each group. The way it's implemented in MPICH2, and the way that makes this make a lot more sense to me, is that you first do intercommunicator reductions to temporary buffers on rank 0 in each group. Then rank 0 scatters within the local group. The way I had been thinking about it was to do a local reduction followed by an intercomm scatter, but that isn't what the standard is saying, AFAICS. -Dave
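A minimal sketch of the interpretation described above, under the assumption that recvcounts has one entry per local-group process and that the sums must agree across the two groups. Using remote_size elements per receiver makes both sums equal to local_size * remote_size automatically; the payload is arbitrary.

--8<--
#include <stdlib.h>
#include <mpi.h>

void intercomm_reduce_scatter(MPI_Comm intercomm)
{
    int local_size, remote_size, i;

    MPI_Comm_size(intercomm, &local_size);          /* local group size */
    MPI_Comm_remote_size(intercomm, &remote_size);  /* remote group size */

    int  total      = local_size * remote_size;     /* same in both groups */
    int *recvcounts = malloc(local_size * sizeof(int));
    int *sendbuf    = malloc(total * sizeof(int));
    int *recvbuf    = malloc(remote_size * sizeof(int));

    for (i = 0; i < local_size; i++)
        recvcounts[i] = remote_size;
    for (i = 0; i < total; i++)
        sendbuf[i] = i;                             /* arbitrary payload */

    /* Data contributed by the remote group is reduced and the result is
     * scattered over the local group, and vice versa. */
    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_INT, MPI_SUM, intercomm);

    free(sendbuf); free(recvbuf); free(recvcounts);
}
--8<--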
Re: [OMPI users] MPI_IN_PLACE not working for Fortran-compiled code linked with mpicc on Mac OS X
On Jan 4, 2013, at 2:55 AM CST, Torbjörn Björkman wrote: > It seems that a very old bug (svn.open-mpi.org/trac/ompi/ticket/1982) is > playing up when linking fortran code with mpicc on Mac OS X 10.6 and the > Macports distribution openmpi @1.6.3_0+gcc44. I got it working by reading up > on this discussion thread: > http://www.open-mpi.org/community/lists/users/2011/11/17862.php > and applying the fix given there, add '-Wl,-commons,use_dylibs', to the c > compiler flags solves the problem. I'm not an Open MPI developer (or user, really), but in MPICH we also had to ensure that we passed both "-Wl,-commons,use_dylibs" *and* "-Wl,-flat_namespace" in the end. For MPI users that do not use Fortran (and therefore don't need common blocks to work correctly between the app and the library), we provide a "--enable-two-level-namespace" configure option to allow users to generate two-level namespace dylibs instead. Some combinations of third-party dylibs will require two-level namespaced MPI dylibs. I don't know if Open MPI is using "-Wl,-flat_namespace" or not, but this is something else that any investigation should probably check. For reference on the later MPICH discoveries about dynamically linking common symbols on Darwin: http://trac.mpich.org/projects/mpich/ticket/1590 -Dave
Re: [OMPI users] Progress in MPI_Win_unlock
On Feb 3, 2010, at 6:24 PM, Dorian Krause wrote:

> Unless it is also specified that a process must eventually exit with a call
> to MPI_Finalize (I couldn't find such a requirement), progress for RMA
> access to a passive server which does not participate actively in any MPI
> communication is not guaranteed, right? (Btw. mvapich2 has the same
> behavior in this regard)

For the finalize requirement, see MPI-2.2 page 291, lines 36-38:

--8<--
This routine cleans up all MPI state. Each process must call MPI_FINALIZE
before it exits. Unless there has been a call to MPI_ABORT, each process must
ensure that all pending nonblocking communications are (locally) complete
before calling MPI_FINALIZE.
--8<--

MPI is intentionally vague on progress issues and leaves lots of room for implementation choices. I'll let the Open MPI folks answer the questions about their implementation.

-Dave
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
On Mar 3, 2010, at 11:35 AM, Richard Treumann wrote:

> If the application will make MPI calls from multiple threads and
> MPI_INIT_THREAD has returned FUNNELED, the application must be willing to
> take the steps that ensure there will never be concurrent calls to MPI from
> the threads. The threads will take turns - without fail.

Minor nitpick: if the implementation returns FUNNELED, only the main thread (basically the thread that called MPI_INIT_THREAD, see MPI-2.2 pg 386 for def'n) may make MPI calls. Dick's paragraph above is correct if you replace FUNNELED with SERIALIZED.

-Dave
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
On Mar 4, 2010, at 7:36 AM, Richard Treumann wrote:

> A call to MPI_Init allows the MPI library to return any level of thread
> support it chooses.

This is correct, insofar as the MPI implementation can always choose any level of thread support.

> This MPI 1.1 call does not let the application say what it wants and does
> not let the implementation reply with what it can guarantee.

Well, sort of. MPI-2.2, sec 12.4.3, page 385, lines 24-25:

--8<--
24| A call to MPI_INIT has the same effect as a call to MPI_INIT_THREAD with a required
25| = MPI_THREAD_SINGLE.
--8<--

So even though there is no explicit request and response for thread level support, it is implicitly asking for MPI_THREAD_SINGLE. Since all implementations must be able to support at least SINGLE (0 threads running doesn't really make sense), SINGLE will be provided at a minimum. Callers to plain-old "MPI_Init" should not expect any higher level of thread support if they wish to maintain portability.

[...snip...]

> Consider a made up example: Imagine some system supports Mutex lock/unlock
> but with terrible performance. As a work around, it offers a non-standard
> substitute for malloc called st_malloc (single thread malloc) that does not
> do locking.

[...snip...]

Dick's example is a great illustration of why FUNNELED might be necessary. The moral of the story is "don't lie to the MPI implementation" :)

-Dave
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
On Mar 4, 2010, at 10:52 AM, Anthony Chan wrote:

> - "Yuanyuan ZHANG" wrote:
>
>> For an OpenMP/MPI hybrid program, if I only want to make MPI calls using
>> the main thread, ie., only in between parallel sections, can I just use
>> SINGLE or MPI_Init?
>
> If your MPI calls are NOT within OpenMP directives, MPI does not even know
> you are using threads. So calling MPI_Init is good enough.

This is *not true*. Please read Dick's previous post for a good example of why this is not the case.

In practice, on most platforms, implementation support for SINGLE and FUNNELED are identical (true for stock MPICH2, for example). However Dick's example of thread-safe versus non-thread-safe malloc options clearly shows why programs need to request (and check "provided" for) >=FUNNELED in this scenario if they wish to be truly portable.

-Dave
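A minimal sketch of the portable pattern being recommended here: explicitly request FUNNELED and check what was provided, rather than relying on plain MPI_Init (which implicitly asks only for SINGLE).

--8<--
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "need MPI_THREAD_FUNNELED, got %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* OpenMP parallel regions may run here, as long as only the main
     * (MPI_Init_thread-calling) thread makes MPI calls. */

    MPI_Finalize();
    return 0;
}
--8<--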
Re: [OMPI users] Problem building OpenMPI 1.8 on RHEL6
On Apr 1, 2014, at 10:26 AM, "Blosch, Edwin L" wrote:

> I am getting some errors building 1.8 on RHEL6. I tried autoreconf as
> suggested, but it failed for the same reason. Is there a minimum version of
> m4 required that is newer than that provided by RHEL6?

Don't run "autoreconf" by hand; make sure to run the "./autogen.sh" script that is packaged with OMPI. It will also check your versions and warn you if they are out of date.

Do you need to build OMPI from the SVN source? Or would a (pre-autogen'ed) release tarball work for you?

-Dave
Re: [OMPI users] usNIC point-to-point messaging module
On Apr 1, 2014, at 12:13 PM, Filippo Spiga wrote: > Dear Ralph, Dear Jeff, > > I've just recompiled the latest Open MPI 1.8. I added > "--enable-mca-no-build=btl-usnic" to configure but the message still appear. > Here the output of "--mca btl_base_verbose 100" (trunked immediately after > the application starts) Jeff's on vacation, so I'll see if I can help here. Try deleting all the files in "$PREFIX/lib/openmpi/", where "$PREFIX" is the value you passed to configure with "--prefix=". If you did not pass a value, then it is "/usr/local". Then reinstall (with "make install" in the OMPI build tree). What I think is happening is that you still have an "mca_btl_usnic.so" file leftover from the last time you installed OMPI (before passing "--enable-mca-no-build=btl-usnic"). So OMPI is using this shared library and you get exactly the same problem. -Dave
Re: [OMPI users] usNIC point-to-point messaging module
On Apr 2, 2014, at 12:57 PM, Filippo Spiga wrote: > I still do not understand why this keeps appearing... > > srun: cluster configuration lacks support for cpu binding > > Any clue? I don't know what causes that message. Ralph, any thoughts here? -Dave
Re: [OMPI users] mpirun runs in serial even I set np to several processors
On Apr 14, 2014, at 12:15 PM, Djordje Romanic wrote:

> When I start wrf with mpirun -np 4 ./wrf.exe, I get this:
> -
> starting wrf task 0 of 1
> starting wrf task 0 of 1
> starting wrf task 0 of 1
> starting wrf task 0 of 1
> -
> This indicates that it is not using 4 processors, but 1.
>
> Any idea what might be the problem?

It could be that you compiled WRF with a different MPI implementation than you are using to run it (e.g., MPICH vs. Open MPI).

-Dave
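A tiny diagnostic that can help confirm this: if the program below, built with the same MPI wrappers as WRF and launched with "mpirun -np 4", prints "0 of 1" four times, then the launcher and the MPI library the binary was linked against don't match. (MPI_Get_library_version requires an MPI-3-capable library; drop that call for older stacks.)

--8<--
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int  rank, size, len;
    char ver[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(ver, &len);
    printf("task %d of %d (%s)\n", rank, size, ver);
    MPI_Finalize();
    return 0;
}
--8<--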
Re: [OMPI users] OMPI 1.8.1 Deadlock in mpi_finalize with mpi_init_thread
I don't know of any workaround. I've created a ticket to track this, but it probably won't be very high priority in the short term: https://svn.open-mpi.org/trac/ompi/ticket/4575 -Dave On Apr 25, 2014, at 3:27 PM, Jamil Appa wrote: > > Hi > > The following program deadlocks in mpi_finalize with OMPI 1.8.1 but works > correctly with OMPI 1.6.5 > > Is there a work around? > > Thanks > > Jamil > > program mpiio > use mpi > implicit none > integer(kind=4) :: iprov, fh, ierr > call mpi_init_thread(MPI_THREAD_SERIALIZED, iprov, ierr) > if (iprov < MPI_THREAD_SERIALIZED) stop 'mpi_init_thread' > call mpi_file_open(MPI_COMM_WORLD, 'test.dat', & > MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr) > call mpi_file_close(fh, ierr) > call mpi_finalize(ierr) > end program mpiio > > (gdb) bt > #0 0x003155a0e054 in __lll_lock_wait () from /lib64/libpthread.so.0 > #1 0x003155a09388 in _L_lock_854 () from /lib64/libpthread.so.0 > #2 0x003155a09257 in pthread_mutex_lock () from /lib64/libpthread.so.0 > #3 0x77819f3c in ompi_attr_free_keyval () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1 > #4 0x77857be1 in PMPI_Keyval_free () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1 > #5 0x715b21f2 in ADIOI_End_call () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/openmpi/mca_io_romio.so > #6 0x7781a325 in ompi_attr_delete_impl () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1 > #7 0x7781a4ec in ompi_attr_delete_all () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1 > #8 0x77832ad5 in ompi_mpi_finalize () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1 > #9 0x77b12e59 in pmpi_finalize__ () from > /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi_mpifh.so.2 > #10 0x00400b64 in mpiio () at t.f90:10 > #11 0x00400b9a in main () > #12 0x00315561ecdd in __libc_start_main () from /lib64/libc.so.6 > #13 0x00400a19 in _start () > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] importing to MPI data already in memory from another process
On Jun 27, 2014, at 8:53 AM, Brock Palen wrote:

> Is there a way to import/map memory from a process (data acquisition) such
> that an MPI program could 'take' or see that memory?
>
> We have a need to do data acquisition at the rate of .7TB/s and need to do
> some shuffles/computation on these data, some of the nodes are directly
> connected to the device, and some will do processing.
>
> Here is the proposed flow:
>
> * Data collector nodes runs process collecting data from device
> * Those nodes somehow pass the data to an MPI job running on these nodes and
>   a number of other nodes (cpu need for filtering is greater than what the 16
>   data nodes can provide).

For a non-MPI solution for intranode data transfer in this case, take a look at vmsplice(2): http://man7.org/linux/man-pages/man2/vmsplice.2.html

Pay particular attention to the SPLICE_F_GIFT flag, which will allow you to simply give memory pages away to the MPI process, avoiding unnecessary data copies. You would just need a pipe shared between the data collector process and the MPI process (and to be a bit careful with your memory allocation/management, since any page you gift away should probably come from mmap(2) directly).

Otherwise, as George mentioned, I would investigate converting your current data collector processes to also be MPI processes so that they can simply communicate the data to the rest of the cluster.

-Dave
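A rough sketch of the vmsplice(2)/SPLICE_F_GIFT idea for the collector side, using only plain Linux system calls (nothing Open MPI-specific). Error handling and the receiving end in the MPI process are omitted, and the function name is made up.

--8<--
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice, SPLICE_F_GIFT */
#include <sys/mman.h>   /* mmap */
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

/* Gift page-aligned pages into a pipe shared with the MPI process.
 * nbytes should be a multiple of the page size. */
ssize_t gift_buffer(int pipe_wr_fd, size_t nbytes)
{
    /* Allocate whole pages directly from mmap so they can be gifted away. */
    void *buf = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* ... data acquisition fills buf here ... */

    struct iovec iov = { .iov_base = buf, .iov_len = nbytes };

    /* SPLICE_F_GIFT hands the pages to the pipe; the collector must not
     * touch buf afterwards. */
    return vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_GIFT);
}
--8<--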
Re: [OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?
Looks like boost::mpi and/or your python "mpi" module might be creating a bogus argv array and passing it to OMPI's MPI_Init routine. Note that argv is required by C99 to be terminated with a NULL pointer (that is, (argv[argc]==NULL) must hold). See http://stackoverflow.com/a/3772826/158513. -Dave On Sep 29, 2014, at 1:34 PM, Ralph Castain wrote: > Afraid I cannot replicate a problem with singleton behavior in the 1.8 series: > > 11:31:52 /home/common/openmpi/v1.8/orte/test/mpi$ ./hello foo bar > Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0-23 > OMPI_MCA_orte_default_hostfile=/home/common/hosts > OMPI_COMMAND=./hello > OMPI_ARGV=foo bar > OMPI_NUM_APP_CTX=1 > OMPI_FIRST_RANKS=0 > OMPI_APP_CTX_NUM_PROCS=1 > OMPI_MCA_orte_ess_num_procs=1 > > You can see that the OMPI_ARGV envar (which is the spot you flagged) is > correctly being set and there is no segfault. Not sure what your program may > be doing, though, so I'm not sure I've really tested your scenario. > > > On Sep 29, 2014, at 10:55 AM, Ralph Castain wrote: > >> Okay, so regression-test.py is calling MPI_Init as a singleton, correct? >> Just trying to fully understand the scenario >> >> Singletons are certainly allowed, if that's the scenario >> >> On Sep 29, 2014, at 10:51 AM, Amos Anderson >> wrote: >> >>> I'm not calling mpirun in this case because this particular calculation >>> doesn't use more than one processor. What I'm doing on my command line is >>> this: >>> >>> /home/user/myapp/tools/python/bin/python test/regression/regression-test.py >>> test/regression/regression-jobs >>> >>> and internally I check for rank/size. This command is executed in the >>> context of a souped up LD_LIBRARY_PATH. You can see the variable argv in >>> opal_argv_join is ending up with the last argument on my command line. >>> >>> I suppose your question implies that mpirun is mandatory for executing >>> anything compiled with OpenMPI > 1.6 ? >>> >>> >>> >>> On Sep 29, 2014, at 10:28 AM, Ralph Castain wrote: >>> Can you pass us the actual mpirun command line being executed? Especially need to see the argv being passed to your application. On Sep 27, 2014, at 7:09 PM, Amos Anderson wrote: > FWIW, I've confirmed that the segfault also happens with OpenMPI 1.7.5. > Also, I have some gdb output (from 1.7.5) for your perusal, including a > printout of some of the variables' values. > > > > Starting program: /home/user/myapp/tools/python/bin/python > test/regression/regression-test.py test/regression/regression-jobs > [Thread debugging using libthread_db enabled] > > Program received signal SIGSEGV, Segmentation fault. 
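For illustration, a minimal sketch of a correctly NULL-terminated argv handed to MPI_Init; the argument strings are just the ones from the reported command line and otherwise carry no meaning.

--8<--
#include <mpi.h>

void init_with_constructed_argv(void)
{
    static char arg0[] = "regression-test.py";
    static char arg1[] = "regression-jobs";
    static char *argv[] = { arg0, arg1, NULL };  /* argv[argc] must be NULL */
    int    argc  = 2;
    char **argvp = argv;

    /* Code that walks argv until NULL (as opal_argv_join appears to in the
     * backtrace) stays in bounds only if the terminator is present. */
    MPI_Init(&argc, &argvp);
}
--8<--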
> 0x2bc8df1e in opal_argv_join (argv=0xa39398, delimiter=32) at > argv.c:299 > 299 str_len += strlen(*p) + 1; > (gdb) where > #0 0x2bc8df1e in opal_argv_join (argv=0xa39398, delimiter=32) at > argv.c:299 > #1 0x2ab2ce4e in ompi_mpi_init (argc=2, argv=0xa39390, > requested=0, provided=0x7fffba98) at runtime/ompi_mpi_init.c:450 > #2 0x2ab63e39 in PMPI_Init (argc=0x7fffbb8c, > argv=0x7fffbb80) at pinit.c:84 > #3 0x2aaab7b965d6 in boost::mpi::environment::environment > (this=0xa3a1d0, argc=@0x7fffbb8c, argv=@0x7fffbb80, > abort_on_exception=true) >at ../tools/boost/libs/mpi/src/environment.cpp:98 > #4 0x2aaabc7b311d in boost::mpi::python::mpi_init (python_argv=..., > abort_on_exception=true) at > ../tools/boost/libs/mpi/src/python/py_environment.cpp:60 > #5 0x2aaabc7b33fb in boost::mpi::python::export_environment () at > ../tools/boost/libs/mpi/src/python/py_environment.cpp:94 > #6 0x2aaabc7d5ab5 in boost::mpi::python::init_module_mpi () at > ../tools/boost/libs/mpi/src/python/module.cpp:44 > #7 0x2aaab792a2f2 in > boost::detail::function::void_function_ref_invoker0 void>::invoke (function_obj_ptr=...) >at ../tools/boost/boost/function/function_template.hpp:188 > #8 0x2aaab7929e6b in boost::function0::operator() > (this=0x7fffc110) at > ../tools/boost/boost/function/function_template.hpp:767 > #9 0x2aaab7928f11 in boost::python::handle_exception_impl (f=...) at > ../tools/boost/libs/python/src/errors.cpp:25 > #10 0x2aaab792a54f in boost::python::handle_exception > (f=0x2aaabc7d5746 ) at > ../tools/boost/boost/python/errors.hpp:29 > #11 0x2aaab792a1d9 in boost::python::detail::(anonymous > namespace)::init_module_in_scope (m=0x2aaabc617f68, >init_function=0x2aaabc7d5746 ) > at ../tools/boost/libs/python/src/module.cpp:24 > #12 0x2aaab792a26c in boost::python::detail::init_module > (name=0x2aaabc7f7f4d "mpi", init_function=0x2aaabc7d5746 > ) >at ../tools/boost/libs/python/src/module.cpp:59 > #13 0x0
Re: [OMPI users] mpi_wtime implementation
On Nov 24, 2014, at 12:06 AM, George Bosilca wrote: > https://github.com/open-mpi/ompi/pull/285 is a potential answer. I would like > to hear Dave Goodell comment on this before pushing it upstream. > > George. I'll take a look at it today. My notification settings were messed up when you originally CCed me on the PR, so I didn't see this until now. -Dave
Re: [OMPI users] send and receive vectors + variable length
On Jan 9, 2015, at 7:46 AM, Jeff Squyres (jsquyres) wrote:

> Yes, I know examples 3.8/3.9 are blocking examples.
>
> But it's morally the same as:
>
> MPI_WAITALL(send_requests...)
> MPI_WAITALL(recv_requests...)
>
> Strictly speaking, that can deadlock, too.
>
> In reality, it has far less chance of deadlocking than examples 3.8 and 3.9
> (because you're likely within the general progression engine, and the
> implementation will progress both the send and receive requests while in the
> first WAITALL).
>
> But still, it would be valid for an implementation to *only* progress the
> send requests -- and NOT the receive requests -- while in the first WAITALL.
> Which makes it functionally equivalent to examples 3.8/3.9.

That's not true. The implementation is required to make progress on all outstanding requests (assuming they can be progressed). The following should not deadlock:

✂
for (...)
    MPI_Isend(...)
for (...)
    MPI_Irecv(...)
MPI_Waitall(send_requests...)
MPI_Waitall(recv_requests...)
✂

-Dave
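A concrete version of the sketch above, assuming a simple ring exchange; waiting on the send request first must not deadlock because the implementation is required to progress the outstanding receive as well.

--8<--
#include <mpi.h>

void ring_exchange(MPI_Comm comm)
{
    int rank, size, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    sendval = rank;

    MPI_Isend(&sendval, 1, MPI_INT, (rank + 1) % size, 0, comm, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, (rank + size - 1) % size, 0, comm, &reqs[1]);

    /* Complete the send first, then the receive. */
    MPI_Waitall(1, &reqs[0], MPI_STATUSES_IGNORE);
    MPI_Waitall(1, &reqs[1], MPI_STATUSES_IGNORE);
}
--8<--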
Re: [OMPI users] New to (Open)MPI
Lachlan mentioned that he has "M Series" hardware, which, to the best of my knowledge, does not officially support usNIC. It may not be possible to even configure the relevant usNIC adapter policy in UCSM for M Series modules/chassis. Using the TCP BTL may be the only realistic option here. -Dave > On Sep 2, 2016, at 5:35 AM, Jeff Squyres (jsquyres) > wrote: > > Greetings Lachlan. > > Yes, Gilles and John are correct: on Cisco hardware, our usNIC transport is > the lowest latency / best HPC-performance transport. I'm not aware of any > MPI implementation (including Open MPI) that has support for FC types of > transports (including FCoE). > > I'll ping you off-list with some usNIC details. > > >> On Sep 1, 2016, at 10:06 PM, Lachlan Musicman wrote: >> >> Hola, >> >> I'm new to MPI and OpenMPI. Relatively new to HPC as well. >> >> I've just installed a SLURM cluster and added OpenMPI for the users to take >> advantage of. >> >> I'm just discovering that I have missed a vital part - the networking. >> >> I'm looking over the networking options and from what I can tell we only >> have (at the moment) Fibre Channel over Ethernet (FCoE). >> >> Is this a network technology that's supported by OpenMPI? >> >> (system is running Centos 7, on Cisco M Series hardware) >> >> Please excuse me if I have terms wrong or am missing knowledge. Am new to >> this. >> >> cheers >> Lachlan >> >> >> -- >> The most dangerous phrase in the language is, "We've always done it this >> way." >> >> - Grace Hopper >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] trying to use personal copy of 1.7.4
Perhaps there's an RPATH issue here? I don't fully understand the structure of Rmpi, but is there both an app and a library (or two separate libraries) that are linking against MPI? I.e., what we want is: app -> ~ross/OMPI \ / --> library -- But what we're getting is: app ---> /usr/OMPI \ --> library ---> ~ross/OMPI If one of them was first linked against the /usr/OMPI and managed to get an RPATH then it could override your LD_LIBRARY_PATH. -Dave On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres) wrote: > Generally, all you need to ensure that your personal copy of OMPI is used is > to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI > installation. I do this all the time on my development cluster (where I have > something like 6 billion different installations of OMPI available... mmm... > should probably clean that up...) > > export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH > export PATH=path-to-my-ompi/bin:$PATH > > It should be noted that: > > 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values > 2. you need to set these values in a way that will be picked up on all > servers that you use in your job. The safest way to do this is in your shell > startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell). > > See http://www.open-mpi.org/faq/?category=running#run-prereqs, > http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and > http://www.open-mpi.org/faq/?category=running#mpirun-prefix. > > Note the --prefix option that is described in the 3rd FAQ item I cited -- > that can be a bit easier, too. > > > > On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote: > >> I took the advice here and built a personal copy of the current openmpi, >> to see if the problems I was having with Rmpi were a result of the old >> version on the system. >> >> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically >> by R) everything looks fine; path references that should be local are. >> But when I run the program and do lsof it shows that both the system and >> personal versions of key libraries are opened. >> >> First, does anyone know which library will actually be used, or how to >> tell which library is actually used, in this situation. I'm running on >> linux (Debian squeeze)? >> >> Second, it there some way to prevent the wrong/old/sytem libraries from >> being loaded? >> >> FWIW I'm still seeing the old misbehavior when I run this way, but, as I >> said, I'm really not sure which libraries are being used. Since Rmpi >> was built against the new/local ones, I think the fact that it doesn't >> crash means I really am using the new ones. 
>> >> Here are highlights of lsof on the process running R: >> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME >> R 17634 ross cwdDIR 254,212288 150773764 >> /home/ross/KHC/sunbelt >> R 17634 ross rtdDIR8,1 4096 2 / >> R 17634 ross txtREG8,1 5648 3058294 >> /usr/lib/R/bin/exec/R >> R 17634 ross DELREG8,12416718 >> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100 >> R 17634 ross memREG8,1 335240 3105336 >> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0 >> R 17634 ross memREG8,1 304576 3105337 >> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0 >> R 17634 ross memREG8,1 679992 3105332 >> /usr/lib/openmpi/lib/libmpi.so.0.0.2 >> R 17634 ross memREG8,193936 2967826 >> /usr/lib/libz.so.1.2.3.4 >> R 17634 ross memREG8,110648 3187256 >> /lib/libutil-2.11.3.so >> R 17634 ross memREG8,132320 2359631 >> /usr/lib/libpciaccess.so.0.10.8 >> R 17634 ross memREG8,133368 2359338 >> /usr/lib/libnuma.so.1 >> R 17634 ross memREG 254,2 979113 152045740 >> /home/ross/install/lib/libopen-pal.so.6.1.0 >> R 17634 ross memREG8,1 183456 2359592 >> /usr/lib/libtorque.so.2.0.0 >> R 17634 ross memREG 254,2 1058125 152045781 >> /home/ross/install/lib/libopen-rte.so.7.0.0 >> R 17634 ross memREG8,149936 2359341 >> /usr/lib/libibverbs.so.1.0.0 >> R 17634 ross memREG 254,2 2802579 152045867 >> /home/ross/install/lib/libmpi.so.1.3.0 >> R 17634 ross memREG 254,2 106626 152046481 >> /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so >> >> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and >> two locations. >> >> Thanks. >> Ross Boylan >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman
Re: [OMPI users] Bug: Disabled mpi_leave_pinned for GPUDirect and InfiniBand during run-time caused by GCC optimizations
On Jun 5, 2015, at 8:47 PM, Gilles Gouaillardet wrote: > i did not use the term "pure" properly. > > please read instead "posix_memalign is a function that does not modify any > user variable" > that assumption is correct when there is no wrapper, and incorrect in our > case. My suggestion is to try to create a small reproducer program that we can send to the GCC folks with the claim that we believe it to be a buggy optimization. Then we can see whether they agree and if not, how they defend that behavior. We probably still need a workaround for now though, and the "volatile" approach seems fine to me. -Dave
Re: [OMPI users] Using POSIX shared memory as send buffer
On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski wrote:

> Hello, everyone
>
> I am struggling a bit with IB performance when sending data from a POSIX
> shared memory region (/dev/shm). The memory is shared among many MPI
> processes within the same compute node. Essentially, I see a bit hectic
> performance, but it seems that my code is roughly twice slower than when
> using a usual, malloced send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your shared memory vs. your private (malloced) memory.

If you have a multi-NUMA-domain system (i.e., any 2+ socket server, and even some single-socket servers) then you are likely to run into this sort of issue. The PCI bus on which your IB HCA communicates is almost certainly closer to one NUMA domain than the others, and performance will usually be worse if you are sending/receiving from/to a "remote" NUMA domain.

"lstopo" and other tools can sometimes help you get a handle on the situation, though I don't know if it knows how to show memory affinity. I think you can find memory affinity for a process via "/proc/<pid>/numa_maps". There's lots of info about NUMA affinity here: https://queue.acm.org/detail.cfm?id=2513149

-Dave