Re: [OMPI users] Do MPI calls ever sleep?

2010-07-21 Thread Dave Goodell
On Jul 21, 2010, at 2:54 PM CDT, Jed Brown wrote:

> On Wed, 21 Jul 2010 15:20:24 -0400, David Ronis  wrote:
>> Hi Jed,
>> 
>> Thanks for the reply and suggestion.  I tried adding -mca
>> yield_when_idle 1 (and later mpi_yield_when_idle 1 which is what
>> ompi_info reports the variable as) but it seems to have had 0 effect.
>> My master goes into fftw planning routines for a minute or so (I see the
>> threads being created), but the overall usage of the slaves remains
>> close to 100% during this time.  Just to be sure, I put the slaves into
>> a MPI_Barrier(MPI_COMM_WORLD) while they were waiting for the fftw
>> planner to finish.   It also didn't help.
> 
> They still spin (instead of using e.g. select()), but call sched_yield()
> so should only be actively spinning when nothing else is trying to run.
> Are you sure that the planner is always running in parallel?  What OS
> and OMPI version are you using?

sched_yield doesn't work as expected in late 2.6 Linux kernels: 
http://kerneltrap.org/Linux/CFS_and_sched_yield
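
For context, the kind of progress loop in question looks roughly like this 
(purely illustrative, not Open MPI's actual progress engine):

8<
#include <sched.h>

/* With yield_when_idle set, waiting is conceptually a polite busy-wait. */
static void wait_for_completion(volatile int *done)
{
    while (!*done) {       /* keep polling for progress */
        sched_yield();     /* give the CPU away if anyone else is runnable */
    }
}
8<

Whether that sched_yield() actually lets other work run is exactly what the 
CFS change affected.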

If this scheduling behavior change is affecting you, you might be able to fix 
it with:

echo "1" >/proc/sys/kernel/sched_compat_yield

-Dave




Re: [OMPI users] Hair depleting issue with Ompi143 and one program

2011-01-20 Thread Dave Goodell
I can't speak to what OMPI might be doing to your program, but I have a few 
suggestions for looking into the Valgrind issues.

Valgrind's "--track-origins=yes" option is usually helpful for figuring out 
where the uninitialized values came from.  However, if I understand you 
correctly and if you are correct in your assumption that _mm_setzero_ps is not 
actually zeroing your xEv variable for some reason, then this option will 
unhelpfully tell you that it was caused by a stack allocation at the entrance 
to the function where the variable is declared.  But it's worth turning on 
because it's easy to do and it might show you something obvious that you are 
missing.

The next thing you can do is disable optimization when building your code in 
case GCC is taking a shortcut that is either incorrect or just doesn't play 
nicely with Valgrind.  Valgrind might run pretty slow though, because -O0 code 
can be really verbose and slow to check.

After that, if you really want to dig in, you can try reading the assembly code 
that is generated for that _mm_setzero_ps line.  The easiest way is to pass 
"-save-temps" to gcc and it will keep a copy of "sourcefile.s" corresponding to 
"sourcefile.c".  Sometimes "-fverbose-asm" helps, sometimes it makes things 
harder to follow.

And the last semi-desperate step is to dig into what Valgrind thinks is going 
on.  You'll want to read up on how memcheck really works [1] before doing this. 
 Then read up on client requests [2,3].  You can then use the 
VALGRIND_GET_VBITS client request on your xEv variable in order to see which 
parts of the variable Valgrind thinks are undefined.  If the vbits don't match 
with what you expect, there's a chance that you might have found a bug in 
Valgrind itself.  It doesn't happen often, but the SSE code can be complicated 
and isn't exercised as often as the non-vector portions of Valgrind.
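
For example, a sketch of that check for a 128-bit SSE variable might look like 
the following (only the memcheck.h macro is real; the surrounding names are 
made up):

8<
#include <xmmintrin.h>
#include <valgrind/memcheck.h>

static void check_xEv_definedness(void)
{
    __m128 xEv = _mm_setzero_ps();       /* the variable under suspicion */

    unsigned char vbits[sizeof(xEv)];
    int rc = VALGRIND_GET_VBITS(&xEv, vbits, sizeof(xEv));
    /* rc == 1 means vbits[] was filled in; a set bit marks the corresponding
     * bit of xEv as undefined, so an all-zero vbits array means Valgrind
     * considers xEv fully defined. */
    (void)rc;
}
8<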

Good luck,
-Dave

[1] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.machine
[2] 
http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
[3] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs

On Jan 20, 2011, at 5:07 PM CST, David Mathog wrote:

> I have been working on slightly modifying a software package by Sean
> Eddy called Hmmer 3.  The hardware acceleration was originally SSE2 but
> since most of our compute nodes only have SSE1 and MMX I rewrote a few
> small sections to just use those instructions.  (And yes, as far as I
> can tell it invokes emms before any floating point operations are run
> after each MMX usage.)   On top of that each binary has 3 options for
> running the programs: single threaded, threaded, or MPI (using 
> Ompi143).  For all other programs in this package everything works
> everywhere.  For one called "jackhmmer" this table results (+=runs
> correctly, - = problems), where the exact same problem is run in each
> test (theoretically exercising exactly the same routines, just under
> different threading control):
> 
>            SSE2   SSE1
> Single      +      +
> Threaded    +      +
> Ompi143     +      -
> 
> The negative result for the SSE/Ompi143 combination happens whether the
> worker nodes are Athlon MP (SSE1 only) or Athlon64.  The test machine
> for the single and threaded runs is a two CPU Opteron 280 (4 cores
> total).  Ompi143 is 32 bit everywhere (local copies though).  There have
> been no modifications whatsoever made to the main jackhmmer.c file,
> which is where the various run methods are implemented.
> 
> Now if there was some intrinsic problem with my SSE1 code it should
> presumably manifest in both the Single and Threaded versions as well
> (the thread control is different, but they all feed through the same
> underlying functions), or in one of the other programs, which isn't
> seen.  Running under valgrind using Single or Threaded produces no
> warnings.  Using mpirun with valgrind on the SSE2 produces 3: two
> related to OMPI itself which are seen in every OMPI program run in
> valgrind, and one caused by an MPIsend operation where the buffer
> contains some uninitialized data (this is nothing toxic, just bytes in
> fixed length fields which were never set because a shorter string
> is stored there). 
> 
> ==19802== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==19802==at 0x4C77AC1: writev (in /lib/libc-2.10.1.so)
> ==19802==by 0x8A069B5: mca_btl_tcp_frag_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802==by 0x8A0626E: mca_btl_tcp_endpoint_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802==by 0x8A01ADC: mca_btl_tcp_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802==by 0x7FA24A9: mca_pml_ob1_send_request_start_prepare (in
> /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so)
> ==19802==by 0x7F98443: mca_pml_ob1_send (in
> /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so)
> ==19802==by 0x4A8530F: PMPI_Send (in
> /opt/ompi143.X32/lib/libmpi.so.0.0.2)
> =

Re: [OMPI users] Deadlock with mpi_init_thread + mpi_file_set_view

2011-04-04 Thread Dave Goodell
FWIW, we solved this problem with ROMIO in MPICH2 by making the "big global 
lock" a recursive mutex.  In the past it was implicitly so because of the way 
that recursive MPI calls were handled.  In current MPICH2 it's explicitly 
initialized with type PTHREAD_MUTEX_RECURSIVE instead.
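
A sketch of that approach with plain pthreads (the names here are 
illustrative, not the actual MPICH2 symbols):

8<
#include <pthread.h>

static pthread_mutex_t big_global_lock;

static void init_big_global_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* A recursive mutex lets the same thread re-acquire the lock, so an MPI
     * call made from inside another MPI call (e.g. ROMIO calling back into
     * the attribute/keyval code) no longer self-deadlocks. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&big_global_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
8<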

-Dave

On Apr 4, 2011, at 9:28 AM CDT, Ralph Castain wrote:

> 
> On Apr 4, 2011, at 8:18 AM, Rob Latham wrote:
> 
>> On Sat, Apr 02, 2011 at 04:59:34PM -0400, fa...@email.com wrote:
>>> 
>>> opal_mutex_lock(): Resource deadlock avoided
>>> #0  0x0012e416 in __kernel_vsyscall ()
>>> #1  0x01035941 in raise (sig=6) at 
>>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #2  0x01038e42 in abort () at abort.c:92
>>> #3  0x00d9da68 in ompi_attr_free_keyval (type=COMM_ATTR, key=0xbffda0e4, 
>>> predefined=0 '\000') at attribute/attribute.c:656
>>> #4  0x00dd8aa2 in PMPI_Keyval_free (keyval=0xbffda0e4) at pkeyval_free.c:52
>>> #5  0x01bf3e6a in ADIOI_End_call (comm=0xf1c0c0, keyval=10, 
>>> attribute_val=0x0, extra_state=0x0) at ad_end.c:82
>>> #6  0x00da01bb in ompi_attr_delete. (type=UNUSED_ATTR, object=0x6, 
>>> attr_hash=0x2c64, key=14285602, predefined=232 '\350', need_lock=128 
>>> '\200') at attribute/attribute.c:726
>>> #7  0x00d9fb22 in ompi_attr_delete_all (type=COMM_ATTR, object=0xf1c0c0, 
>>> attr_hash=0x8d0fee8) at attribute/attribute.c:1043
>>> #8  0x00dbda65 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:133
>>> #9  0x00dd12c2 in PMPI_Finalize () at pfinalize.c:46
>>> #10 0x00d6b515 in mpi_finalize_f (ierr=0xbffda2b8) at pfinalize_f.c:62
>> 
>> I guess I need some OpenMPI eyeballs on this...
>> 
>> ROMIO hooks into the attribute keyval deletion mechanism to clean up
>> the internal data structures it has allocated.  I suppose since this
>> is MPI_Finalize, we could just leave those internal data structures
>> alone and let the OS deal with it. 
>> 
>> What I see happening here is the OpenMPI finalize routine is deleting
>> attributes.   one of those attributes is ROMIO's, which in turn tries
>> to free keyvals.  Is the deadlock that noting "under" ompi_attr_delete
>> can itself call ompi_* routines? (as ROMIO triggers a call to
>> ompi_attr_free_keyval) ?
>> 
>> Here's where ROMIO sets up the keyval and the delete handler:
>> https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/romio/mpi-io/mpir-mpioinit.c#L39
>> 
>> that routine gets called upon any "MPI-IO entry point" (open, delete,
>> register-datarep).  The keyvals help ensure that ROMIO's internal
>> structures get initialized exactly once, and the delete hooks help us
>> be good citizens and clean up on exit. 
> 
> FWIW: his trace shows that OMPI incorrectly attempts to acquire a thread lock 
> that has already been locked. This occurs  in OMPI's attribute code, probably 
> surrounding the call to your code.
> 
> In other words, it looks to me like the problem is on our side, not yours. 
> Jeff is the one who generally handles the attribute code, though, so I'll 
> ping his eyeballs :-)
> 
> 
>> 
>> ==rob
>> 
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] data types and alignment to word boundary

2011-06-29 Thread Dave Goodell
On Jun 29, 2011, at 10:56 AM CDT, Jeff Squyres wrote:

> There's probably an alignment gap between the short and char array, and 
> possibly an alignment gap between the char array and the double array 
> (depending on the value of SHORT_INPUT and your architecture).
> 
> So for your displacements, you should probably actually measure what the 
> displacements are instead of using sizeof(short), for example.
> 
> tVStruct foo;
> aiDispsT5[0] = 0;
> aiDispsT5[1] = ((char*) &(foo.sCapacityFile) - (char*) &foo);

There's a C-standard "offsetof" macro for this calculation.  Using it instead 
of the pointer math above greatly improves readability: 
http://en.wikipedia.org/wiki/Offsetof

So the second line becomes:

8<
aiDispsT5[1] = offsetof(tVStruct, sCapacityFile);
8<
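
Putting it together for a hypothetical struct of the same general shape (the 
real tVStruct layout isn't shown in the thread, so the fields and counts 
below are made up):

8<
#include <stddef.h>    /* offsetof */
#include <mpi.h>

typedef struct {
    short  sCount;         /* hypothetical fields */
    char   acName[13];     /* odd length forces an alignment gap */
    double adValues[4];
} tExample;

static MPI_Datatype build_type(void)
{
    int          blocklens[3] = { 1, 13, 4 };
    MPI_Aint     disps[3]     = { offsetof(tExample, sCount),
                                  offsetof(tExample, acName),
                                  offsetof(tExample, adValues) };
    MPI_Datatype types[3]     = { MPI_SHORT, MPI_CHAR, MPI_DOUBLE };
    MPI_Datatype tmp, newtype;

    MPI_Type_create_struct(3, blocklens, disps, types, &tmp);
    /* Resize so arrays of tExample get the right extent, including any
     * trailing padding the compiler added. */
    MPI_Type_create_resized(tmp, 0, sizeof(tExample), &newtype);
    MPI_Type_free(&tmp);
    MPI_Type_commit(&newtype);
    return newtype;
}
8<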

-Dave




Re: [OMPI users] MPI defined macro

2011-08-23 Thread Dave Goodell
This has been discussed previously in the MPI Forum:

http://lists.mpi-forum.org/mpi-forum/2010/11/0838.php

I think it resulted in this proposal, but AFAIK it was never pushed forward by 
a regular attendee of the Forum: 
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ReqPPMacro

-Dave

On Aug 23, 2011, at 6:59 AM CDT, Jeff Squyres wrote:

> I unfortunately won't be at the next Forum meeting, but you might want to ask 
> someone to bring it up for you.
> 
> It might not give you exactly what you want, however, because not all 
> platforms have "mpicc" (or similar) wrapper compilers.  I.e., to compile an 
> MPI application on some platforms, you just "cc ... -lmpi".  Hence, there's 
> no way for the compiler to know whether to #define MPI or not.
> 
> Such a macro *could* be added to mpi.h (but not Fortran), but then you 
> wouldn't get at least one of the use cases that you (assumedly :-) ) want:
> 
> #if MPI
> #include <mpi.h>
> #endif
> 
> 
> On Aug 23, 2011, at 7:46 AM, Gabriele Fatigati wrote:
> 
>> Can I suggest to insert this macro in next MPI 3 standard?
>> 
>> I think It's very useful.
>> 
>> 2011/8/23 Jeff Squyres 
>> I'm afraid not.  Sorry!  :-(
>> 
>> We have the OPEN_MPI macro -- it'll be defined to 1 if you compile with Open 
>> MPI, but that doesn't really help your portability issue.  :-\
>> 
>> On Aug 23, 2011, at 5:19 AM, Gabriele Fatigati wrote:
>> 
>>> Dear OpenMPi users,
>>> 
>>> is there some portable MPI macro to check if a code is compiled with MPI 
>>> compiler? Something like _OPENMP for OpenMP codes:
>>> 
>>> #ifdef _OPENMP
>>> 
>>> 
>>> 
>>> #endif
>>> 
>>> 
>>> it exist?
>>> 
>>> #ifdef MPI
>>> 
>>> 
>>> 
>>> 
>>> #endif
>>> 
>>> Thanks
>>> 
>>> --
>>> Ing. Gabriele Fatigati
>>> 
>>> HPC specialist
>>> 
>>> SuperComputing Applications and Innovation Department
>>> 
>>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>> 
>>> www.cineca.it    Tel: +39 051 6171722
>>> 
>>> g.fatigati [AT] cineca.it
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> 
>> -- 
>> Ing. Gabriele Fatigati
>> 
>> HPC specialist
>> 
>> SuperComputing Applications and Innovation Department
>> 
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> 
>> www.cineca.it    Tel: +39 051 6171722
>> 
>> g.fatigati [AT] cineca.it   
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Dave Goodell
On May 24, 2012, at 10:22 AM CDT, Jeff Squyres wrote:

> I read it to be: reduce the data in the local group, scatter the results to 
> the remote group.
> 
> As such, the reduce COUNT is sum(recvcounts), and is used for the reduction 
> in the local group.  Then use recvcounts to scatter it to the remote group.
> 
> ...right?
> 

right.

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Dave Goodell
On May 24, 2012, at 10:57 AM CDT, Lisandro Dalcin wrote:

> On 24 May 2012 12:40, George Bosilca  wrote:
> 
>> I don't see much difference with the other collective. The generic behavior 
>> is that you apply the operation on the local group but the result is moved 
>> into the remote group.
> 
> Well, for me this one really IS different (for example, SCATTER is
> unidirectional for intercommunicators, but REDUCE_SCATTER is
> bidirectional). The "recvbuff" is a local buffer, but you interpret
> "recvcounts" as referring to the remote group.
> 
> Mmm, the standard is really confusing on this point...

Don't think of it like an intercommunicator-scatter, think of it more like an 
intercommunicator-allreduce.  The allreduce is also bidirectional.  The only 
difference is that instead of an allreduce (logically reduce+bcast), you 
instead have a reduce_scatter (logically reduce+scatterv).

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Dave Goodell
On May 24, 2012, at 8:13 PM CDT, Jeff Squyres wrote:

> On May 24, 2012, at 11:57 AM, Lisandro Dalcin wrote:
> 
>> The standard says this:
>> 
>> "Within each group, all processes provide the same recvcounts
>> argument, and provide input vectors of  sum_i^n recvcounts[i] elements
>> stored in the send buffers, where n is the size of the group"
>> 
>> So, I read " Within each group, ... where n is the size of the group"
>> as being the LOCAL group size.
> 
> Actually, that seems like a direct contradiction with the prior sentence: 
> 
> If comm is an intercommunicator, then the result of the reduction of the data 
> provided by processes in one group (group A) is scattered among processes in 
> the other group (group B), and vice versa.
> 
> It looks like the implementors of 2 implementations agree that recvcounts 
> should be the size of the remote group.  Sounds like this needs to be brought 
> up in front of the Forum...

So I take back my prior "right".  Upon further inspection of the text and the 
MPICH2 code I believe it to be true that the number of the elements in the 
recvcounts array must be equal to the size of the LOCAL group.

The text certainly could use a bit of clarification.  I'll bring it up at the 
meeting next week.

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Dave Goodell
On May 24, 2012, at 10:34 PM CDT, George Bosilca wrote:

> On May 24, 2012, at 23:18, Dave Goodell  wrote:
> 
>> So I take back my prior "right".  Upon further inspection of the text and 
>> the MPICH2 code I believe it to be true that the number of the elements in 
>> the recvcounts array must be equal to the size of the LOCAL group.
> 
> This is quite illogical, but it will not be the first time the standard is 
> lacking. So, if I understand you correctly, in the case of an 
> intercommunicator a process doesn't know how much data it has to reduce, at 
> least not until it receives the array of recvcounts from the remote group. 
> Weird!

No, it knows because of the restriction that $sum_i^n{recvcounts[i]}$ yields 
the same sum in each group.

The way it's implemented in MPICH2, and the way that makes this make a lot more 
sense to me, is that you first do intercommunicator reductions to temporary 
buffers on rank 0 in each group.  Then rank 0 scatters within the local group.  
The way I had been thinking about it was to do a local reduction followed by an 
intercomm scatter, but that isn't what the standard is saying, AFAICS.
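
A rough sketch of one direction of that scheme (our group receiving the 
reduction of the REMOTE group's data; the reverse direction runs symmetrically 
with the roles swapped).  Here "intercomm" is the intercommunicator and 
"localcomm" is assumed to be an intracommunicator over the local group, e.g. 
the one used to build the intercomm -- both names and the element type are 
illustrative:

8<
#include <stdlib.h>
#include <mpi.h>

static void reduce_scatter_one_way(double *recvbuf, int *recvcounts,
                                   MPI_Op op, MPI_Comm intercomm,
                                   MPI_Comm localcomm)
{
    int lrank, lsize, total = 0, i;
    MPI_Comm_rank(localcomm, &lrank);
    MPI_Comm_size(localcomm, &lsize);
    for (i = 0; i < lsize; i++)
        total += recvcounts[i];        /* same total in both groups */

    double *tmp = NULL;
    int *displs = NULL;
    if (lrank == 0) {
        tmp = malloc(total * sizeof(double));
        displs = malloc(lsize * sizeof(int));
        displs[0] = 0;
        for (i = 1; i < lsize; i++)
            displs[i] = displs[i - 1] + recvcounts[i - 1];
    }

    /* Step 1: intercommunicator reduce of the remote group's contributions
     * onto our local rank 0 (the remote processes pass their send buffers
     * and root = 0 in the matching call). */
    MPI_Reduce(NULL, tmp, total, MPI_DOUBLE, op,
               (lrank == 0) ? MPI_ROOT : MPI_PROC_NULL, intercomm);

    /* Step 2: rank 0 scatters the reduced result within the local group. */
    MPI_Scatterv(tmp, recvcounts, displs, MPI_DOUBLE,
                 recvbuf, recvcounts[lrank], MPI_DOUBLE, 0, localcomm);

    free(tmp);
    free(displs);
}
8<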

-Dave




Re: [OMPI users] MPI_IN_PLACE not working for Fortran-compiled code linked with mpicc on Mac OS X

2013-01-04 Thread Dave Goodell
On Jan 4, 2013, at 2:55 AM CST, Torbjörn Björkman wrote:

> It seems that a very old bug (svn.open-mpi.org/trac/ompi/ticket/1982) is 
> playing up when linking fortran code with mpicc on Mac OS X 10.6 and the 
> Macports distribution openmpi @1.6.3_0+gcc44. I got it working by reading up 
> on this discussion thread:
> http://www.open-mpi.org/community/lists/users/2011/11/17862.php
> and applying the fix given there: adding '-Wl,-commons,use_dylibs' to the C 
> compiler flags solves the problem. 

I'm not an Open MPI developer (or user, really), but in MPICH we also had to 
ensure that we passed both "-Wl,-commons,use_dylibs" *and* 
"-Wl,-flat_namespace" in the end.  For MPI users that do not use Fortran (and 
therefore don't need common blocks to work correctly between the app and the 
library), we provide a "--enable-two-level-namespace" configure option to allow 
users to generate two-level namespace dylibs instead.  Some combinations of 
third-party dylibs will require two-level namespaced MPI dylibs.

I don't know if Open MPI is using "-Wl,-flat_namespace" or not, but this is 
something else that any investigation should probably check.

For reference on the later MPICH discoveries about dynamically linking common 
symbols on Darwin: http://trac.mpich.org/projects/mpich/ticket/1590

-Dave




Re: [OMPI users] Progress in MPI_Win_unlock

2010-02-04 Thread Dave Goodell

On Feb 3, 2010, at 6:24 PM, Dorian Krause wrote:

Unless it is also specified that a process must eventually exit with  
a call to MPI_Finalize (I couldn't find such a requirement),  
progress for RMA access to a passive server which does not  
participate actively in any MPI communication is not guaranteed,  
right?

(Btw. mvapich2 has the same behavior in this regard)


For the finalize requirement, see MPI-2.2 page 291, lines 36-38:

--8<--
This routine cleans up all MPI state. Each process must call  
MPI_FINALIZE before it exits. Unless there has been a call to  
MPI_ABORT, each process must ensure that all pending nonblocking  
communications are (locally) complete before calling MPI_FINALIZE.

--8<--

MPI is intentionally vague on progress issues and leaves lots of room  
for implementation choices.


I'll let the Open MPI folks answer the questions about their  
implementation.


-Dave



Re: [OMPI users] MPI_Init() and MPI_Init_thread()

2010-03-03 Thread Dave Goodell

On Mar 3, 2010, at 11:35 AM, Richard Treumann wrote:
If the application will make MPI calls from multiple threads and  
MPI_INIT_THREAD has returned FUNNELED, the application must be  
willing to take the steps that ensure there will never be concurrent  
calls to MPI from the threads. The threads will take turns - without  
fail.


Minor nitpick: if the implementation returns FUNNELED, only the main  
thread (basically the thread that called MPI_INIT_THREAD, see MPI-2.2  
pg 386 for def'n) may make MPI calls.  Dick's paragraph above is  
correct if you replace FUNNELED with SERIALIZED.
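
In code form, the distinction is roughly this (a minimal sketch, nothing  
implementation-specific):

--8<--
#include <mpi.h>

static void init_and_check(int *argc, char ***argv)
{
    int provided;
    MPI_Init_thread(argc, argv, MPI_THREAD_SERIALIZED, &provided);

    if (provided == MPI_THREAD_FUNNELED) {
        /* Only the main thread -- the one that called MPI_Init_thread --
         * may make MPI calls; other threads must not call MPI at all. */
    } else if (provided >= MPI_THREAD_SERIALIZED) {
        /* Any thread may call MPI, provided the application guarantees the
         * calls never overlap in time: the threads take turns. */
    }
}
--8<--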


-Dave



Re: [OMPI users] MPI_Init() and MPI_Init_thread()

2010-03-04 Thread Dave Goodell

On Mar 4, 2010, at 7:36 AM, Richard Treumann wrote:
A call to MPI_Init allows the MPI library to return any level of  
thread support it chooses.


This is correct, insofar as the MPI implementation can always choose  
any level of thread support.
This MPI 1.1 call does not let the application say what it wants and  
does not let the implementation reply with what it can guarantee.



Well, sort of.  MPI-2.2, sec 12.4.3, page 385, lines 24-25:

--8<--
24|  A call to MPI_INIT has the same effect as a call to  
MPI_INIT_THREAD with a required

25|  = MPI_THREAD_SINGLE.
--8<--

So even though there is no explicit request and response for thread  
level support, it is implicitly asking for MPI_THREAD_SINGLE.  Since  
all implementations must be able to support at least SINGLE (0 threads  
running doesn't really make sense), SINGLE will be provided at a  
minimum.  Callers to plain-old "MPI_Init" should not expect any higher  
level of thread support if they wish to maintain portability.


[...snip...]

Consider a made up example:

Imagine some system supports Mutex lock/unlock but with terrible  
performance. As a work around, it offers a non-standard substitute  
for malloc called st_malloc (single thread malloc) that does not do  
locking.



[...snip...]

Dick's example is a great illustration of why FUNNELED might be  
necessary.  The moral of the story is "don't lie to the MPI  
implementation" :)


-Dave



Re: [OMPI users] MPI_Init() and MPI_Init_thread()

2010-03-04 Thread Dave Goodell

On Mar 4, 2010, at 10:52 AM, Anthony Chan wrote:


- "Yuanyuan ZHANG"  wrote:


For an OpenMP/MPI hybrid program, if I only want to make MPI calls
using the main thread, ie., only in between parallel sections, can  
I just

use SINGLE or MPI_Init?


If your MPI calls are NOT within OpenMP directives, MPI does not even
know you are using threads.  So calling MPI_Init is good enough.


This is *not true*.  Please read Dick's previous post for a good  
example of why this is not the case.


In practice, on most platforms, implementation support for SINGLE and  
FUNNELED are identical (true for stock MPICH2, for example).  However  
Dick's example of thread-safe versus non-thread-safe malloc options  
clearly shows why programs need to request (and check "provided" for)  
>=FUNNELED in this scenario if they wish to be truly portable.
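
A minimal sketch of the portable form of that hybrid pattern (illustrative  
only):

--8<--
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, i;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* threads + MPI not safe here */

    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);       /* threaded compute, no MPI calls */

    /* Only the main thread, between parallel regions, touches MPI. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
--8<--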


-Dave



Re: [OMPI users] Problem building OpenMPI 1.8 on RHEL6

2014-04-01 Thread Dave Goodell (dgoodell)
On Apr 1, 2014, at 10:26 AM, "Blosch, Edwin L"  wrote:

> I am getting some errors building 1.8 on RHEL6.  I tried autoreconf as 
> suggested, but it failed for the same reason.  Is there a minimum version of 
> m4 required that is newer than that provided by RHEL6?

Don't run "autoreconf" by hand, make sure to run the "./autogen.sh" script that 
is packaged with OMPI.  It will also check your versions and warn you if they 
are out of date.

Do you need to build OMPI from the SVN source?  Or would a (pre-autogen'ed) 
release tarball work for you?

-Dave




Re: [OMPI users] usNIC point-to-point messaging module

2014-04-01 Thread Dave Goodell (dgoodell)
On Apr 1, 2014, at 12:13 PM, Filippo Spiga  wrote:

> Dear Ralph, Dear Jeff,
> 
> I've just recompiled the latest Open MPI 1.8. I added 
> "--enable-mca-no-build=btl-usnic" to configure but the message still appear. 
> Here the output of "--mca btl_base_verbose 100" (trunked immediately after 
> the application starts)

Jeff's on vacation, so I'll see if I can help here.

Try deleting all the files in "$PREFIX/lib/openmpi/", where "$PREFIX" is the 
value you passed to configure with "--prefix=".  If you did not pass a value, 
then it is "/usr/local".  Then reinstall (with "make install" in the OMPI build 
tree).

What I think is happening is that you still have an "mca_btl_usnic.so" file 
leftover from the last time you installed OMPI (before passing 
"--enable-mca-no-build=btl-usnic").  So OMPI is using this shared library and 
you get exactly the same problem.

-Dave



Re: [OMPI users] usNIC point-to-point messaging module

2014-04-02 Thread Dave Goodell (dgoodell)
On Apr 2, 2014, at 12:57 PM, Filippo Spiga  wrote:

> I still do not understand why this keeps appearing...
> 
> srun: cluster configuration lacks support for cpu binding
> 
> Any clue?

I don't know what causes that message.  Ralph, any thoughts here?

-Dave



Re: [OMPI users] mpirun runs in serial even I set np to several processors

2014-04-14 Thread Dave Goodell (dgoodell)
On Apr 14, 2014, at 12:15 PM, Djordje Romanic  wrote:

> When I start wrf with mpirun -np 4 ./wrf.exe, I get this:
> -
>  starting wrf task0  of1
>  starting wrf task0  of1
>  starting wrf task0  of1
>  starting wrf task0  of1
> -
> This indicates that it is not using 4 processors, but 1. 
> 
> Any idea what might be the problem? 

It could be that you compiled WRF with a different MPI implementation than you 
are using to run it (e.g., MPICH vs. Open MPI).

-Dave



Re: [OMPI users] OMPI 1.8.1 Deadlock in mpi_finalize with mpi_init_thread

2014-04-29 Thread Dave Goodell (dgoodell)
I don't know of any workaround.  I've created a ticket to track this, but it 
probably won't be very high priority in the short term:

https://svn.open-mpi.org/trac/ompi/ticket/4575

-Dave

On Apr 25, 2014, at 3:27 PM, Jamil Appa  wrote:

> 
>   Hi 
> 
> The following program deadlocks in mpi_finalize with OMPI 1.8.1 but works 
> correctly with OMPI 1.6.5
> 
> Is there a work around?
> 
>   Thanks
> 
>  Jamil
> 
> program mpiio
> use mpi
> implicit none
> integer(kind=4) :: iprov, fh, ierr
> call mpi_init_thread(MPI_THREAD_SERIALIZED, iprov, ierr)
> if (iprov < MPI_THREAD_SERIALIZED) stop 'mpi_init_thread'
> call mpi_file_open(MPI_COMM_WORLD, 'test.dat', &
> MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
> call mpi_file_close(fh, ierr)
> call mpi_finalize(ierr)
> end program mpiio
> 
> (gdb) bt
> #0  0x003155a0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x003155a09388 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x003155a09257 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x77819f3c in ompi_attr_free_keyval () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1
> #4  0x77857be1 in PMPI_Keyval_free () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1
> #5  0x715b21f2 in ADIOI_End_call () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/openmpi/mca_io_romio.so
> #6  0x7781a325 in ompi_attr_delete_impl () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1
> #7  0x7781a4ec in ompi_attr_delete_all () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1
> #8  0x77832ad5 in ompi_mpi_finalize () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi.so.1
> #9  0x77b12e59 in pmpi_finalize__ () from 
> /gpfs/thirdparty/zenotech/home/jappa/apps6.4/lib/libmpi_mpifh.so.2
> #10 0x00400b64 in mpiio () at t.f90:10
> #11 0x00400b9a in main ()
> #12 0x00315561ecdd in __libc_start_main () from /lib64/libc.so.6
> #13 0x00400a19 in _start ()
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] importing to MPI data already in memory from another process

2014-06-27 Thread Dave Goodell (dgoodell)
On Jun 27, 2014, at 8:53 AM, Brock Palen  wrote:

> Is there a way to import/map memory from a process (data acquisition) such 
> that an MPI program could 'take' or see that memory?
> 
> We have a need to do data acquisition at the rate of .7TB/s and need to do 
> some shuffles/computation on these data; some of the nodes are directly 
> connected to the device, and some will do processing. 
> 
> Here is the proposed flow:
> 
> * Data collector nodes runs process collecting data from device
> * Those nodes somehow pass the data to an MPI job running on these nodes and 
> a number of other nodes (cpu need for filtering is greater than what the 16 
> data nodes can provide).

For a non-MPI solution for intranode data transfer in this case, take a look at 
vmsplice(2):

http://man7.org/linux/man-pages/man2/vmsplice.2.html

Pay particular attention to the SPLICE_F_GIFT flag, which will allow you to 
simply give memory pages away to the MPI process, avoiding unnecessary data 
copies.  You would just need a pipe shared between the data collector process 
and the MPI process (and to be a bit careful with your memory 
allocation/management, since any page you gift away should probably come from 
mmap(2) directly).
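
A heavily simplified sketch of the collector side (assuming "pipe_fd" is the 
write end of a pipe already shared with the MPI process; names are made up and 
error handling is omitted):

8<
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice, SPLICE_F_GIFT */
#include <sys/uio.h>    /* struct iovec */
#include <sys/mman.h>   /* mmap */

#define CHUNK_BYTES (1UL << 20)

static void gift_chunk(int pipe_fd)
{
    /* Page-aligned memory straight from mmap, so whole pages can be
     * donated to the reader without copying. */
    void *buf = mmap(NULL, CHUNK_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* ... fill buf with freshly acquired data here ... */

    struct iovec iov = { .iov_base = buf, .iov_len = CHUNK_BYTES };

    /* SPLICE_F_GIFT donates the pages to the pipe; the collector must not
     * touch buf afterwards. */
    vmsplice(pipe_fd, &iov, 1, SPLICE_F_GIFT);
}
8<

The MPI process then reads (or splices) the data from the other end of that 
pipe.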


Otherwise, as George mentioned, I would investigate converting your current 
data collector processes to also be MPI processes so that they can simply 
communicate the data to the rest of the cluster.

-Dave




Re: [OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?

2014-09-29 Thread Dave Goodell (dgoodell)
Looks like boost::mpi and/or your python "mpi" module might be creating a bogus 
argv array and passing it to OMPI's MPI_Init routine.  Note that argv is 
required by C99 to be terminated with a NULL pointer (that is, 
(argv[argc]==NULL) must hold).  See http://stackoverflow.com/a/3772826/158513.
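
For reference, a well-formed argv for MPI_Init looks like this (a trivial  
sketch; the names are made up):

8<
#include <mpi.h>

int main(void)
{
    /* Hand-built argv: argv[argc] must be NULL, just as for main(). */
    char *fake_argv[] = { "myprog", "foo", "bar", NULL };
    int   fake_argc   = 3;
    char **argv_ptr   = fake_argv;

    MPI_Init(&fake_argc, &argv_ptr);
    MPI_Finalize();
    return 0;
}
8<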

-Dave

On Sep 29, 2014, at 1:34 PM, Ralph Castain  wrote:

> Afraid I cannot replicate a problem with singleton behavior in the 1.8 series:
> 
> 11:31:52  /home/common/openmpi/v1.8/orte/test/mpi$ ./hello foo bar
> Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0-23
> OMPI_MCA_orte_default_hostfile=/home/common/hosts
> OMPI_COMMAND=./hello
> OMPI_ARGV=foo bar
> OMPI_NUM_APP_CTX=1
> OMPI_FIRST_RANKS=0
> OMPI_APP_CTX_NUM_PROCS=1
> OMPI_MCA_orte_ess_num_procs=1
> 
> You can see that the OMPI_ARGV envar (which is the spot you flagged) is 
> correctly being set and there is no segfault. Not sure what your program may 
> be doing, though, so I'm not sure I've really tested your scenario.
> 
> 
> On Sep 29, 2014, at 10:55 AM, Ralph Castain  wrote:
> 
>> Okay, so regression-test.py is calling MPI_Init as a singleton, correct? 
>> Just trying to fully understand the scenario
>> 
>> Singletons are certainly allowed, if that's the scenario
>> 
>> On Sep 29, 2014, at 10:51 AM, Amos Anderson  
>> wrote:
>> 
>>> I'm not calling mpirun in this case because this particular calculation 
>>> doesn't use more than one processor. What I'm doing on my command line is 
>>> this:
>>> 
>>> /home/user/myapp/tools/python/bin/python test/regression/regression-test.py 
>>> test/regression/regression-jobs
>>> 
>>> and internally I check for rank/size. This command is executed in the 
>>> context of a souped up LD_LIBRARY_PATH. You can see the variable argv in 
>>> opal_argv_join is ending up with the last argument on my command line.
>>> 
>>> I suppose your question implies that mpirun is mandatory for executing 
>>> anything compiled with OpenMPI > 1.6 ?
>>> 
>>> 
>>> 
>>> On Sep 29, 2014, at 10:28 AM, Ralph Castain  wrote:
>>> 
 Can you pass us the actual mpirun command line being executed? Especially 
 need to see the argv being passed to your application.
 
 
 On Sep 27, 2014, at 7:09 PM, Amos Anderson  
 wrote:
 
> FWIW, I've confirmed that the segfault also happens with OpenMPI 1.7.5. 
> Also, I have some gdb output (from 1.7.5) for your perusal, including a 
> printout of some of the variables' values.
> 
> 
> 
> Starting program: /home/user/myapp/tools/python/bin/python 
> test/regression/regression-test.py test/regression/regression-jobs
> [Thread debugging using libthread_db enabled]
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x2bc8df1e in opal_argv_join (argv=0xa39398, delimiter=32) at 
> argv.c:299
> 299   str_len += strlen(*p) + 1;
> (gdb) where
> #0  0x2bc8df1e in opal_argv_join (argv=0xa39398, delimiter=32) at 
> argv.c:299
> #1  0x2ab2ce4e in ompi_mpi_init (argc=2, argv=0xa39390, 
> requested=0, provided=0x7fffba98) at runtime/ompi_mpi_init.c:450
> #2  0x2ab63e39 in PMPI_Init (argc=0x7fffbb8c, 
> argv=0x7fffbb80) at pinit.c:84
> #3  0x2aaab7b965d6 in boost::mpi::environment::environment 
> (this=0xa3a1d0, argc=@0x7fffbb8c, argv=@0x7fffbb80, 
> abort_on_exception=true)
>at ../tools/boost/libs/mpi/src/environment.cpp:98
> #4  0x2aaabc7b311d in boost::mpi::python::mpi_init (python_argv=..., 
> abort_on_exception=true) at 
> ../tools/boost/libs/mpi/src/python/py_environment.cpp:60
> #5  0x2aaabc7b33fb in boost::mpi::python::export_environment () at 
> ../tools/boost/libs/mpi/src/python/py_environment.cpp:94
> #6  0x2aaabc7d5ab5 in boost::mpi::python::init_module_mpi () at 
> ../tools/boost/libs/mpi/src/python/module.cpp:44
> #7  0x2aaab792a2f2 in 
> boost::detail::function::void_function_ref_invoker0 void>::invoke (function_obj_ptr=...)
>at ../tools/boost/boost/function/function_template.hpp:188
> #8  0x2aaab7929e6b in boost::function0::operator() 
> (this=0x7fffc110) at 
> ../tools/boost/boost/function/function_template.hpp:767
> #9  0x2aaab7928f11 in boost::python::handle_exception_impl (f=...) at 
> ../tools/boost/libs/python/src/errors.cpp:25
> #10 0x2aaab792a54f in boost::python::handle_exception 
> (f=0x2aaabc7d5746 ) at 
> ../tools/boost/boost/python/errors.hpp:29
> #11 0x2aaab792a1d9 in boost::python::detail::(anonymous 
> namespace)::init_module_in_scope (m=0x2aaabc617f68, 
>init_function=0x2aaabc7d5746 ) 
> at ../tools/boost/libs/python/src/module.cpp:24
> #12 0x2aaab792a26c in boost::python::detail::init_module 
> (name=0x2aaabc7f7f4d "mpi", init_function=0x2aaabc7d5746 
> )
>at ../tools/boost/libs/python/src/module.cpp:59
> #13 0x0

Re: [OMPI users] mpi_wtime implementation

2014-11-24 Thread Dave Goodell (dgoodell)
On Nov 24, 2014, at 12:06 AM, George Bosilca  wrote:

> https://github.com/open-mpi/ompi/pull/285 is a potential answer. I would like 
> to hear Dave Goodell comment on this before pushing it upstream.
> 
>   George.

I'll take a look at it today.  My notification settings were messed up when you 
originally CCed me on the PR, so I didn't see this until now.

-Dave



Re: [OMPI users] send and receive vectors + variable length

2015-01-09 Thread Dave Goodell (dgoodell)
On Jan 9, 2015, at 7:46 AM, Jeff Squyres (jsquyres)  wrote:

> Yes, I know examples 3.8/3.9 are blocking examples.
> 
> But it's morally the same as:
> 
> MPI_WAITALL(send_requests...)
> MPI_WAITALL(recv_requests...)
> 
> Strictly speaking, that can deadlock, too.  
> 
> In reality, it has far less chance of deadlocking than examples 3.8 and 3.9 
> (because you're likely within the general progression engine, and the 
> implementation will progress both the send and receive requests while in the 
> first WAITALL).  
> 
> But still, it would be valid for an implementation to *only* progress the 
> send requests -- and NOT the receive requests -- while in the first WAITALL.  
> Which makes it functionally equivalent to examples 3.8/3.9.

That's not true.  The implementation is required to make progress on all 
outstanding requests (assuming they can be progressed).  The following should 
not deadlock:

✂
for (...)  MPI_Isend(...)
for (...)  MPI_Irecv(...)
MPI_Waitall(send_requests...)
MPI_Waitall(recv_requests...)
✂
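
Spelled out as a runnable example (capped at 64 ranks just to keep it short):

✂
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, sendval, recvvals[64];
    MPI_Request sreq[64], rreq[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sendval = rank;

    for (i = 0; i < size; i++)
        MPI_Isend(&sendval, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &sreq[i]);
    for (i = 0; i < size; i++)
        MPI_Irecv(&recvvals[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &rreq[i]);

    /* The first WAITALL must progress the receives too, so this cannot
     * deadlock on a conforming implementation. */
    MPI_Waitall(size, sreq, MPI_STATUSES_IGNORE);
    MPI_Waitall(size, rreq, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
✂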

-Dave



Re: [OMPI users] New to (Open)MPI

2016-09-02 Thread Dave Goodell (dgoodell)
Lachlan mentioned that he has "M Series" hardware, which, to the best of my 
knowledge, does not officially support usNIC.  It may not be possible to even 
configure the relevant usNIC adapter policy in UCSM for M Series 
modules/chassis.

Using the TCP BTL may be the only realistic option here.

-Dave

> On Sep 2, 2016, at 5:35 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Greetings Lachlan.
> 
> Yes, Gilles and John are correct: on Cisco hardware, our usNIC transport is 
> the lowest latency / best HPC-performance transport.  I'm not aware of any 
> MPI implementation (including Open MPI) that has support for FC types of 
> transports (including FCoE).
> 
> I'll ping you off-list with some usNIC details.
> 
> 
>> On Sep 1, 2016, at 10:06 PM, Lachlan Musicman  wrote:
>> 
>> Hola,
>> 
>> I'm new to MPI and OpenMPI. Relatively new to HPC as well.
>> 
>> I've just installed a SLURM cluster and added OpenMPI for the users to take 
>> advantage of.
>> 
>> I'm just discovering that I have missed a vital part - the networking.
>> 
>> I'm looking over the networking options and from what I can tell we only 
>> have (at the moment) Fibre Channel over Ethernet (FCoE).
>> 
>> Is this a network technology that's supported by OpenMPI?
>> 
>> (system is running Centos 7, on Cisco M Series hardware)
>> 
>> Please excuse me if I have terms wrong or am missing knowledge. Am new to 
>> this.
>> 
>> cheers
>> Lachlan
>> 
>> 
>> --
>> The most dangerous phrase in the language is, "We've always done it this 
>> way."
>> 
>> - Grace Hopper
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Dave Goodell (dgoodell)
Perhaps there's an RPATH issue here?  I don't fully understand the structure of 
Rmpi, but is there both an app and a library (or two separate libraries) that 
are linking against MPI?

I.e., what we want is:

app ------> ~ross/OMPI
  \            /
   --> library

But what we're getting is:

app ------> /usr/OMPI
  \
   --> library ----> ~ross/OMPI


If one of them was first linked against the /usr/OMPI and managed to get an 
RPATH then it could override your LD_LIBRARY_PATH.

-Dave

On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres)  wrote:

> Generally, all you need to ensure that your personal copy of OMPI is used is 
> to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI 
> installation.  I do this all the time on my development cluster (where I have 
> something like 6 billion different installations of OMPI available... mmm... 
> should probably clean that up...)
> 
> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> export PATH=path-to-my-ompi/bin:$PATH
> 
> It should be noted that:
> 
> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> 2. you need to set these values in a way that will be picked up on all 
> servers that you use in your job.  The safest way to do this is in your shell 
> startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell).
> 
> See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> 
> Note the --prefix option that is described in the 3rd FAQ item I cited -- 
> that can be a bit easier, too.
> 
> 
> 
> On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:
> 
>> I took the advice here and built a personal copy of the current openmpi,
>> to see if the problems I was having with Rmpi were a result of the old
>> version on the system.
>> 
>> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
>> by R) everything looks fine; path references that should be local are.
>> But when I run the program and do lsof it shows that both the system and
>> personal versions of key libraries are opened.
>> 
>> First, does anyone know which library will actually be used, or how to
>> tell which library is actually used, in this situation?  I'm running on
>> Linux (Debian squeeze).
>> 
>> Second, is there some way to prevent the wrong/old/system libraries from
>> being loaded?
>> 
>> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
>> said, I'm really not sure which libraries are being used.  Since Rmpi
>> was built against the new/local ones, I think the fact that it doesn't
>> crash means I really am using the new ones.
>> 
>> Here are highlights of lsof on the process running R:
>> COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
>> R   17634 ross  cwdDIR  254,212288 150773764 
>> /home/ross/KHC/sunbelt
>> R   17634 ross  rtdDIR8,1 4096 2 /
>> R   17634 ross  txtREG8,1 5648   3058294 
>> /usr/lib/R/bin/exec/R
>> R   17634 ross  DELREG8,12416718 
>> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
>> R   17634 ross  memREG8,1   335240   3105336 
>> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
>> R   17634 ross  memREG8,1   304576   3105337 
>> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
>> R   17634 ross  memREG8,1   679992   3105332 
>> /usr/lib/openmpi/lib/libmpi.so.0.0.2
>> R   17634 ross  memREG8,193936   2967826 
>> /usr/lib/libz.so.1.2.3.4
>> R   17634 ross  memREG8,110648   3187256 
>> /lib/libutil-2.11.3.so
>> R   17634 ross  memREG8,132320   2359631 
>> /usr/lib/libpciaccess.so.0.10.8
>> R   17634 ross  memREG8,133368   2359338 
>> /usr/lib/libnuma.so.1
>> R   17634 ross  memREG  254,2   979113 152045740 
>> /home/ross/install/lib/libopen-pal.so.6.1.0
>> R   17634 ross  memREG8,1   183456   2359592 
>> /usr/lib/libtorque.so.2.0.0
>> R   17634 ross  memREG  254,2  1058125 152045781 
>> /home/ross/install/lib/libopen-rte.so.7.0.0
>> R   17634 ross  memREG8,149936   2359341 
>> /usr/lib/libibverbs.so.1.0.0
>> R   17634 ross  memREG  254,2  2802579 152045867 
>> /home/ross/install/lib/libmpi.so.1.3.0
>> R   17634 ross  memREG  254,2   106626 152046481 
>> /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
>> 
>> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and 
>> two locations.
>> 
>> Thanks.
>> Ross Boylan
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman

Re: [OMPI users] Bug: Disabled mpi_leave_pinned for GPUDirect and InfiniBand during run-time caused by GCC optimizations

2015-06-08 Thread Dave Goodell (dgoodell)
On Jun 5, 2015, at 8:47 PM, Gilles Gouaillardet  
wrote:

> i did not use the term "pure" properly.
> 
> please read instead "posix_memalign is a function that does not modify any 
> user variable"
> that assumption is correct when there is no wrapper, and incorrect in our 
> case.

My suggestion is to try to create a small reproducer program that we can send 
to the GCC folks with the claim that we believe it to be a buggy optimization.  
Then we can see whether they agree and if not, how they defend that behavior.

We probably still need a workaround for now though, and the "volatile" approach 
seems fine to me.
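
Something along these lines might do as a starting point for such a reproducer 
(entirely illustrative -- it is not the actual Open MPI code, and the 
interposed wrapper that sets the flag is only described in the comments):

8<
#include <stdio.h>
#include <stdlib.h>

/* In the real case this flag lives in Open MPI's memory hooks and is set
 * from an interposed posix_memalign() wrapper.  Declaring it volatile is
 * the workaround discussed above. */
int hook_ran = 0;

int main(void)
{
    void *p;
    int before = hook_ran;

    /* An LD_PRELOAD-style wrapper around posix_memalign() would set
     * hook_ran = 1 here; the claim is that GCC may assume the libc call
     * cannot touch hook_ran and therefore not reload it below. */
    if (posix_memalign(&p, 64, 1024) != 0)
        return 1;

    printf("before=%d after=%d\n", before, hook_ran);
    free(p);
    return 0;
}
8<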

-Dave




Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-28 Thread Dave Goodell (dgoodell)
On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski  
wrote:
> 
> Hello, everyone
> 
> I am struggling a bit with IB performance when sending data from a POSIX 
> shared memory region (/dev/shm). The memory is shared among many MPI 
> processes within the same compute node. Essentially, I see a bit hectic 
> performance, but it seems that my code it is roughly twice slower than when 
> using a usual, malloced send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your 
shared memory vs. your private (malloced) memory.  If you have a 
multi-NUMA-domain system (i.e., any 2+ socket server, and even some 
single-socket servers) then you are likely to run into this sort of issue.  The 
PCI bus on which your IB HCA communicates is almost certainly closer to one 
NUMA domain than the others, and performance will usually be worse if you are 
sending/receiving from/to a "remote" NUMA domain.

"lstopo" and other tools can sometimes help you get a handle on the situation, 
though I don't know if it knows how to show memory affinity.  I think you can 
find memory affinity for a process via "/proc/<pid>/numa_maps".  There's lots 
of info about NUMA affinity here: https://queue.acm.org/detail.cfm?id=2513149

-Dave