[OMPI users] Ubuntu and MPI

2015-11-19 Thread dave
Hello -  I have an Ubuntu 12.04 distro, running on a 32-bit platform. I 
installed http://www.open-mpi.org/software/ompi/v1.10/downloads/openm . 
I have hello_c.c in the examples subdirectory. I installed a C compiler.


When I run mpicc hello_c.c, the screen dump shows:

dave@ubuntu-desk:~/Desktop/openmpi-1.10.1$ mpicc hello_c.c
The program 'mpicc' can be found in the following packages:
 * lam4-dev
 * libmpich-mpd1.0-dev
 * libmpich-shmem1.0-dev
 * libmpich1.0-dev
 * libmpich2-dev
 * libopenmpi-dev
 * libopenmpi1.5-dev
Try: sudo apt-get install 
dave@ubuntu-desk:~/Desktop/openmpi-1.10.1$

This code helloworld.c works:

/* Hello World C Program */

#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");

    return 0;
}



I am at a stop point and was hoping for some assistance from the group. What 
info/log file can I send that will help?


Newbie here


[O-MPI users] libtool error

2006-01-27 Thread Dave Hudak

Hello,

I am having a problem with the configure stage of building Open MPI.  I
downloaded the 1.0.1 release tarball, unpacked it, and ran


./configure --prefix=/opt/openmpi |& tee config-output.txt

...it ran for a couple minutes and then said:

config.status: config.h is unchanged
config.status: executing depfiles commands
configure: /bin/sh './configure' succeeded for opal/libltdl
checking for libtool-supplied linker flags... libtool error!
configure: error: Cannot continue

I have attached the config.log and the output of configure.  I have a  
PowerMac quad G5, OS X 10.4.4, XCode 2.2, plus assorted utilities  
installed from darwinports and fink.


Regards,
Dave Hudak




---
David E. Hudak, Ph.D.
dhu...@osc.edu




Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Dave Love
Edgar Gabriel  writes:

> I am still looking into the PVFS2 with ROMIO problem with the 1.6
> series, where (as I mentioned yesterday) the problem I am having right
> now is that the data is wrong. Not sure what causes it, but since I have
> to teach this afternoon again, it might be Friday until I can dig into that.

Was there any progress with this?  Otherwise, what version of PVFS2 is
known to work with OMPI 1.6?  Thanks.


Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Dave Love
Edgar Gabriel  writes:

> yes, the patch has been submitted to the 1.6 branch for review, not sure
> what the precise status of it is. The problems found are more or less
> independent of the PVFS2 version.

Thanks; I should have looked in the tracker.


Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-27 Thread Dave Love
Edgar Gabriel  writes:

> not sure honestly. Basically, as suggested in this email chain earlier,
> I had to disable the PVFS2_IreadContig and PVFS2_IwriteContig routines
> in ad_pvfs2.c to make the tests pass. Otherwise the tests worked but
> produced wrong data. I did not, however, have the time to figure out what
> actually goes wrong under the hood.

[I can't get into trac to comment on the issue (hangs on login), so I'm
following up here.]

In case it's not clear, the changes for 1.6 and 1.7 are different, and
probably shouldn't be.  The patch I took from 1.7 looked similar to
what's in mpich, but hard-wired rather than autoconfiscated, whereas the
patch for 1.6 on the tracker sets the entries to NULL instead.

> Edgar
>
> On 3/25/2014 9:21 AM, Rob Latham wrote:
>> 
>> 
>> On 03/25/2014 07:32 AM, Dave Love wrote:
>>> Edgar Gabriel  writes:
>>>
>>>> I am still looking into the PVFS2 with ROMIO problem with the 1.6
>>>> series, where (as I mentioned yesterday) the problem I am having right
>>>> now is that the data is wrong. Not sure what causes it, but since I have
>>>> to teach this afternoon again, it might be Friday until I can dig into
>>>> that.
>>>
>>> Was there any progress with this?  Otherwise, what version of PVFS2 is
>>> known to work with OMPI 1.6?  Thanks.
>> 
>> Edgar, should I pick this up for MPICH, or was this fix specific to
>> OpenMPI ?
>> 
>> ==rob
>> 


Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Dave Love
Gus Correa  writes:

> Torque+Maui, SGE/OGE, and Slurm are free.

[OGE certainly wasn't free, but it apparently no longer exists --
another thing Oracle screwed up and eventually dumped.]

> If you build the queue system with cpuset control, a node can be
> shared among several jobs, but the cpus/cores will be assigned
> specifically
> to each job's processes, so that nobody steps on each other toes.

Actually there's no need for cpusets unless jobs are badly-behaved and
escape their bindings.  Core binding by the resource manager, inherited
by OMPI, is typically enough.  (Note that, as far as I know, cpusets are
Linux-specific now that Irix is dead, along with its better support for
resource management.)

Anyhow, yes you should use a resource manager even with only trivial
scheduling.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/



Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Dave Love
Gus Correa  writes:

> On 03/27/2014 05:05 AM, Andreas Schäfer wrote:
>>> >Queue systems won't allow resources to be oversubscribed.

[Maybe that meant that resource managers can, and typically do, prevent
resources being oversubscribed.]

>> I'm fairly confident that you can configure Slurm to oversubscribe
>> nodes: just specify more cores for a node than are actually present.
>>
>
> That is true.
> If you lie to the queue system about your resources,
> it will believe you and oversubscribe.

For what it's worth, oversubscription might be overall or limited.  We
just had a user running some crazy Java program he refuses to explain,
submitted as a serial job running ~150 threads.  The over-subscription
was confined to the core it used, and the effect on the 127 others was
mostly due to the small overhead of the node daemon reading the crazy
/proc smaps file to track the memory usage.  The other cores were
normally subscribed.

Ob-OMPI:  the other jobs may have been OMPI ones!

> Torque has this same feature.
> I don't know about SGE.
> You may choose to set some or all nodes with more cores than they
> actually have, if that is a good choice for the codes you run.
> However, for our applications oversubscribing is bad, hence my mindset.

Right.  I don't think there's any question that it's a bad idea on a
general purpose cluster running some OMPI jobs, for instance.




Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Dave Love
Reuti  writes:

> Do all of them have an internal bookkeeping of granted cores to slots
> - i.e. not only the number of scheduled slots per job per node, but
> also which core was granted to which job? Does Open MPI read this
> information would be the next question then.

OMPI works with the bindings it's handed via orted (if the processes are
started that way).

>> My understanding is that Torque delegates to OpenMPI the process placement 
>> and binding (beyond the list of nodes/cpus available for
>> the job).

Can't/doesn't torque start the MPI processes itself?  Otherwise, yes,
since orted gets the binding.

>> My guess is that OpenPBS behaves the same as Torque.
>> 
>> SLURM and SGE/OGE *probably* have pretty much the same behavior.
>
> SGE/OGE: no, any binding request is only a soft request.

I don't understand that.  Does it mean the system-specific "strict" and
"non-strict" binding in hwloc, in which case I don't see how UGE can do
anything different?

> UGE: here you can request a hard binding. But I have no clue whether this 
> information is read by Open MPI too.
>
> If in doubt: use only complete nodes for each job (which is often done
> for massively parallel jobs anyway).

There's no need with a recent SGE.  All our jobs get core bindings --
unless they use all the cores, since binding them all is equivalent to
binding none -- and OMPI inherits them.  See
 for the
SGE+OMPI configuration.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/



Re: [OMPI users] change in behaviour 1.6 -> 1.8 under sge

2014-11-03 Thread Dave Love
Mark Dixon  writes:

> Hi there,
>
> We've started looking at moving to the openmpi 1.8 branch from 1.6 on
> our CentOS6/Son of Grid Engine cluster and noticed an unexpected
> difference when binding multiple cores to each rank.
>
> Has openmpi's definition 'slot' changed between 1.6 and 1.8?

You wouldn't expect it to be documented if so, of course :-(, but it
doesn't look so.

> It used to mean ranks, but now it appears to mean processing elements
> (see Details, below).

I'm fairly confused by this.  Bizarrely, it happens I was going to ask
whether anyone had a patch or workaround for the problem we see with
1.6.  [I notice there was a previous thread about mpi+openmp I didn't
catch at the time which looked pretty confused.  I suppose I should
follow it up for archives.]

> Thanks,
>
> Mark
>
> PS Also, the man page for 1.8.3 reports that '--bysocket' is
> deprecated, but it doesn't seem to exist when we try to use it:
>
>   mpirun: Error: unknown option "-bysocket"
>   Type 'mpirun --help' for usage.

[Yes, per mpirun --help.]

> == Details ==
>
> On 1.6.5, we launch with the following core binding options:
>
>   mpirun --bind-to-core --cpus-per-proc  

That just doesn't work here on multiple nodes (and you forgot the
--np to override $NSLOTS).  It tries to over-allocate the first host.
The workaround is to use --loadbalance in this case, but it fails in the
normal case if you try to make it the default, sigh.  So the
recommendation for MPI+OpenMP jobs, until I fix it, is a script like

  #$ -l exclusive
  export OMP_NUM_THREADS=2
  exec mpirun --loadbalance --cpus-per-proc $OMP_NUM_THREADS \
      --np $(($NSLOTS/$OMP_NUM_THREADS)) ...

assuming OMP_NUM_THREADS divides cores/socket on the relevant nodes
sensibly, and eliding issues with per-rank OMP affinity.

>   mpirun --bind-to-core --bysocket --cpus-per-proc  

Similarly in that case.  (I assume that trying to keep consecutive ranks
adjacent is a good default.)

>   where  is calculated to maximise the number of cores available to
>   use - I guess effectively
>   max(1, int(number of cores per node / slots per node requested)).
>
>   openmpi reads the file $PE_HOSTFILE and launches a rank for each slot
>   defined in it, binding  cores per rank.

That's why you need the --np, or is this with a fiddled host file?

> On 1.8.3, we've tried launching with the following core binding
> options (which we hoped were equivalent):
>
>   mpirun -map-by node:PE= 
>   mpirun -map-by socket:PE= 

With 1.8.3 here, replacing "--loadbalance --cpus-per-proc" with
"--map-by slot:PE=2" works.

I assume you use --report-bindings to check what's going on (which gave
me the hint about --loadbalance).  I've never seen it lie about the
binding the processes actually get.

>   openmpi reads the file $PE_HOSTFILE and launches a factor of  fewer
>   ranks than under 1.6.5. We also notice that, where we wanted a single
>   rank on the box and  is the number of cores available, openmpi
>   refuses to launch and we get the message:
>
>   "There are not enough slots available in the system to satisfy the 1
>   slots that were requested by the application"
>
>   I think that error message needs a little work :)


Re: [OMPI users] change in behaviour 1.6 -> 1.8 under sge

2014-11-04 Thread Dave Love
I wrote:

>   #$ -l exclusive
>   export OMP_NUM_THREADS=2
>   exec mpirun --loadbalance --cpus-per-proc $OMP_NUM_THREADS \
>       --np $(($NSLOTS/$OMP_NUM_THREADS)) ...

I should have said core binding is the default here
 [so
Intel MPI doesn't look faster!].  Otherwise, you'd need to specify it
above.



Re: [OMPI users] change in behaviour 1.6 -> 1.8 under sge

2014-11-04 Thread Dave Love
Ralph Castain  writes:

> If you only have one allocated PE on a node, then mpirun will
> correctly tell you that it can’t launch with PE>1 as there aren’t
> enough resources to meet your request. IIRC, we may have been ignoring
> this under SGE and running as many procs as we wanted on an allocated
> node - the SGE folks provided a patch to fix that hole.

I don't know what that refers to, but for what it's worth, there's no
problem I know of with OMPI failing to fail if you try to over-subscribe
under SGE (at least since v1.3).  (Without the --np in my example, it
will fail.)

-- SGE folk



Re: [OMPI users] change in behaviour 1.6 -> 1.8 under sge

2014-11-05 Thread Dave Love
Ralph Castain  writes:

> I confirmed that things are working as intended.

I could have been more explicit saying so before.

> If you have 12 cores on a machine, and you do
>
> mpirun -map-by socket:PE=2 
>
> we will execute 6 copies of foo on the node because 12 cores/2pe/core = 6 
> procs.

For what it's worth, you need to be a bit careful testing.  1.6 works on
a single node without --loadbalance.

I'm fairly sure what Mark sees will be the result of messing with the
SGE internals, possibly combined with SGE core binding/cpuset
restrictions.  I've always found that confusing, and preferred to let
mpirun do the work, but there's no accounting for things in Yorkshire.



Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-05 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> Yes, this is a correct report.
>
> In short, the MPI_SIZEOF situation before the upcoming 1.8.4 was a bit
> of a mess; it actually triggered a bunch of discussion up in the MPI
> Forum Fortran working group (because the design of MPI_SIZEOF actually
> has some unintended consequences that came to light when another OMPI
> user noted the same thing you did a few months ago).
>
> Can you download a 1.8.4 nightly tarball (or the rc) and see if
> MPI_SIZEOF is working for you there?

Is the issue documented publicly?  I'm puzzled, because it certainly
works in a simple case:

  $ cat x.f90
  use mpi
  integer size, ierror
  double precision d
  call mpi_sizeof (size, size, ierror)
  print *, size
  call mpi_sizeof (d, size, ierror)
  print *, size
  end
  $ mpif90 --showme:version
  mpif90: Open MPI 1.8.3 (Language: Fortran)
  $ mpif90 --showme:command
  gfortran
  $ mpif90 x.f90 && ./a.out
 4
 8

The missing routine is in the library for me:

  $ nm -D $MPI_LIB/libmpi_usempi.so | grep mpi_sizeof0di4_
  1cf0 T mpi_sizeof0di4_

I don't understand how it can work generally with mpif.h (f77?), as
implied by the man page, rather than the module.


Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-10 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> There were several commits; this was the first one:
>
> https://github.com/open-mpi/ompi/commit/d7eaca83fac0d9783d40cac17e71c2b090437a8c

I don't have time to follow this properly, but am I reading right that
that says mpi_sizeof will now _not_ work with gcc < 4.9, i.e. the system
compiler of the vast majority of HPC GNU/Linux systems, whereas it did
before (at least in simple cases)?

> IIRC, it only affected certain configure situations (e.g., only
> certain fortran compilers).  I'm failing to remember the exact
> scenario offhand that was problematic right now, but it led to the
> larger question of: "hey, wait, don't we have to support MPI_SIZEOF in
> mpif.h, too?"

I'd have said the answer was a clear "no", without knowing what the
standard says about mpif.h, but I'd expect that to be deprecated anyhow.
(The man pages generally don't mention USE, only INCLUDE, which seems
wrong.)

>
>> I don't understand how it can work generally with mpif.h (f77?), as
>> implied by the man page, rather than the module.
>
> According to discussion in the Forum Fortran working group, it is
> required that MPI_SIZEOF must be supported in *all* MPI Fortran
> interfaces, including mpif.h.

Well that's generally impossible if it's meant to include Fortran77
compilers (which I must say doesn't seem worth it at this stage).

> Hence, if certain conditions are met by your Fortran compiler (i.e.,
> it's modern enough), OMPI 1.8.4 will have MPI_SIZEOF prototypes in
> mpif.h.  If not, then you get the same old mpif.h you've always had
> (i.e., no MPI_SIZEOF prototypes, and MPI_SIZEOF won't work properly if
> you use the mpif.h interfaces).

If it's any consolation, it doesn't work in the other MPIs here
(mp(va)pich and intel), as I'd expect.

> Keep in mind that MPI does not prohibit having prototypes in mpif.h --
> it's just that most (all?) MPI implementations don't tend to provide
> them.  However, in the case of MPI_SIZEOF, it is *required* that
> prototypes are available because the implementation needs the type
> information to return the size properly (in mpif.h., mpi module, and
> mpi_f08 module).
>
> Make sense?

Fortran has interfaces, not prototypes!

I understand the technicalities -- I hacked on g77 intrinsics -- but I'm
not sure how much sense it's making if things have effectively gone
backwards with gfortran.



Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-11 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> There are several reasons why MPI implementations have not added explicit 
> interfaces to their mpif.h files, mostly boiling down to: they may/will break 
> real world MPI programs.
>
> 1. All modern compilers have ignore-TKR syntax,

Hang on!  (An equivalent of) ignore_tkr only appeared in gfortran 4.9
(the latest release) as far as I know.  The system compiler of the bulk
of GNU/Linux HPC systems currently is distinctly older (and the RHEL
devtoolset packaging of gcc-4.9 is still beta).  RHEL 6 has gcc 4.4 as
the system compiler and Debian stable has 4.7 and older.

I'm just pointing that out in case decisions are being made assuming
everyone has this.  No worries if not.

> so it's at least not a problem for subroutines like MPI_SEND (with a choice 
> buffer).  However: a) this was not true at the time when MPI-3 was written, 
> and b) it's not standard fortran.



Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-11 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> On Nov 10, 2014, at 8:27 AM, Dave Love  wrote:
>
>>> https://github.com/open-mpi/ompi/commit/d7eaca83fac0d9783d40cac17e71c2b090437a8c
>> 
>> I don't have time to follow this properly, but am I reading right that
>> that says mpi_sizeof will now _not_ work with gcc < 4.9, i.e. the system
>> compiler of the vast majority of HPC GNU/Linux systems, whereas it did
>> before (at least in simple cases)?
>
> You raise a very good point, which raises another unfortunately good related 
> point.
>
> 1. No, the goal is to enable MPI_SIZEOF in *more* cases, and still preserve 
> all the old cases.  Your mail made me go back and test all the old cases this 
> morning, and I discovered a bug which I need to fix before 1.8.4 is released 
> (details unimportant, unless someone wants to gory details).

I haven't checked the source, but the commit message above says

  If the Fortran compiler supports both INTERFACE and ISO_FORTRAN_ENV,
  then we'll build the MPI_SIZEOF interfaces.  If not, we'll skip
  MPI_SIZEOF in mpif.h and the mpi module.

which implies it's been removed for gcc < 4.9, whereas it worked before.

> The answer actually turned out to be "yes".  :-\
>
> Specifically: the spec just says it's available in the Fortran interfaces.  
> It doesn't say "the Fortran interfaces, except MPI_SIZEOF."
>
> Indeed, the spec doesn't prohibit explicit interfaces in mpif.h (it never 
> has).  It's just that most (all?) MPI implementations have not provided 
> explicit interfaces in mpif.h.
>
> But for MPI_SIZEOF to work, explicit interfaces are *required*.

[Yes, I understand -- sorry if that wasn't clear and you wasted time
explaining.]

>> but I'd expect that to be deprecated anyhow.
>> (The man pages generally don't mention USE, only INCLUDE, which seems
>> wrong.)
>
> Mmm.  Yes, true.
>
> Any chance I could convince you to submit a patch?  :-)

Maybe, but I don't really know what it should involve or whether it can
be done mechanically; I definitely don't have time to dissect the spec.
Actually, I'd have expected the API man pages to be reference versions,
shared across implementations, but MPICH's are different.

> Fortran 77 compilers haven't existed for *many, many years*.

[I think f2c still gets some use, and g77 was only obsoleted with gcc 4
-- I'm not _that old_!  I'm not actually advocating f77, of course.]

> And I'll say it again: MPI has *never* supported Fortran 77 (it's a
> common misconception that it ever did).

Well, having "Fortran77 interface" in the standard could confuse a
stupid person!  (As a former language lawyer for it, I'd allow laxity in
"Fortran77", like the latest MPI isn't completely compatible with the
latest Fortran either.)

>> Fortran has interfaces, not prototypes!
>
> Yes, sorry -- I'm a C programmer and I dabble in Fortran

That was mainly as in it's better ☺.

> (read: I'm the guy who keeps the Fortran stuff maintained in OMPI), so
> I sometimes use the wrong terminology.  Mea culpa!

Sure, and thanks.  I dare say you can get some community help if you
need it, especially if people think Fortran isn't being properly
supported, though I'm not complaining.



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Dave Love
"SLIM H.A."  writes:

> We switched on hyper threading on our cluster with two eight core
> sockets per node (32 threads per node).

Assuming that's Xeon-ish hyperthreading, the best advice is not to.  It
will typically hurt performance of HPC applications, not least if it
defeats core binding, and it is likely to cause confusion with resource
managers.  If there are specific applications which benefit from it,
under Linux you can switch it on on the relevant cores for the duration
of jobs which ask for it.

> We configured  gridengine with 16 slots per node to allow the 16 extra
> threads for kernel process use

Have you actually measured that?  We did, and we switch off HT at boot
time.  We've never had cause to turn it on, though there might be a few
jobs which could use it.

> but this apparently does not work. Printout of the gridengine hostfile
> shows that for a 32 slots job, 16 slots are placed on each of two
> nodes as expected. Including the openmpi --display-map option shows
> that all 32 processes are incorrectly placed on the head node. Here is
> part of the output

If OMPI is scheduling by thread, then that's what you'd expect.  (As far
as I know, SGE will DTRT, binding a core per slot in that case, but
I'll look at bug reports if not.)

> I found some related mailings about a new warning in 1.8.2 about 
> oversubscription and  I tried a few options to avoid the use of the extra 
> threads for MPI tasks by openmpi without success, e.g. variants of
>
> --cpus-per-proc 1 
> --bind-to-core 
>
> and some others. Gridengine treats hw threads as cores==slots (?)

What a slot is is up to you, but if you want to do core binding at all
sensibly, it needs to correspond to a core.  You can fiddle things in
the job itself (see the recent thread that Mark started for OMPI --np !=
SGE NSLOTS).

> but the content of $PE_HOSTFILE suggests it distributes the slots
> sensibly  so it seems there is an option for openmpi required to get
> 16 cores per node?

I'm not sure precisely what you want, but with OMPI 1.8, you should be
able to lay out the job by core if that's what you want.  That may
require exclusive node access, which makes SGE core binding a null
operation.

> I tried both 1.8.2, 1.8.3 and also 1.6.5.
>
> Thanks for some clarification that anyone can give.

The above is for the current SGE with a recent hwloc.  If Durham are
still using an ancient version, it may not apply, but that should be
irrelevant with -l exclusive or a fixed-count PE.



Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-12 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> Yeah, we don't actually share man pages.

I suppose it wouldn't save much anyhow at this stage of the game.

> I think the main issue would be just to edit the *.3in pages here:
>
> https://github.com/open-mpi/ompi/tree/master/ompi/mpi/man/man3
>
> They're all native nroff format (they're .3in instead of .3 because we 
> pre-process them during "make" to substitute things like the release date and 
> version in).

Sure.

> I'm guessing it would be a pretty mechanical kind of patch -- just adding 
> Fortran interfaces at the top of each page.

I'll try to take a look sometime and see if it actually is trivial.



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-12 Thread Dave Love
Ralph Castain  writes:

> You might also add the --display-allocation flag to mpirun so we can
> see what it thinks the allocation looks like. If there are only 16
> slots on the node, it seems odd that OMPI would assign 32 procs to it
> unless it thinks there is only 1 node in the job, and oversubscription
> is allowed (which it won’t be by default if it read the GE allocation)

I think there's a problem with documentation at least not being
explicit, and it would really help to have it clarified unless I'm
missing some.

Although there's probably more to it in this case, the behaviour seemed
consistent with what I deduced (without reading the code) from the doc,
ompi_info, and experiment, which at least wasn't inconsistent:  the node
has 32 processing units, and the default allocation is by socket,
apparently round-robin within nodes.  I can't check the actual behaviour
in that case just now.



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-12 Thread Dave Love
Reuti  writes:

>> If so, I’m wondering if that NULL he shows in there is the source of the 
>> trouble. The parser doesn’t look like it would handle that very well, though 
>> I’d need to test it. Is that NULL expected? Or is the NULL not really in the 
>> file?
>
> I must admit here: for me the fourth column is either literally
> UNDEFINED or the tuple cpu,core in case of turned on binding like 0,0
> But it's never , neither literally nor the byte 0x00. Maybe the
> OP can tell us which GE version he uses.

See the source of sge_exec_job.  I see I should fix sge_pe(5).



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-12 Thread Dave Love
"SLIM H.A."  writes:

> Dear Reuti and Ralph
>
> Below is the output of the run for openmpi 1.8.3 with this line
>
> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 $exe

-np is redundant with tight integration unless you're using fewer than
NSLOTS from SGE.

> ompi_info | grep psm
> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
> because the interconnect is TrueScale/QLogic
>
> and
>
> setenv OMPI_MCA_mtl "psm"
>
> is set in the script.

It should select that anyhow, though it's worth defaulting it in
openmpi-mca-params.conf in case something goes awry and you end up with
openib, or even tcp, instead of psm.  (I've known inconsistent library
versions cause psm not to load.)

> This is the PE
>
> pe_name   orte
> slots 4000
> user_listsNONE
> xuser_lists   NONE
> start_proc_args   /bin/true
> stop_proc_args/bin/true

"none" is a better choice, now the default.

> allocation_rule   $fill_up

fill_up is potentially problematic with PSM -- at least the old stuff we
have.  You tend to run out of contexts(?) with multiple jobs on the node
and I couldn't get it to behave by setting environment variables.

> control_slavesTRUE
> job_is_first_task FALSE
> urgency_slots min



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-13 Thread Dave Love
Ralph Castain  writes:

 cn6050 16 par6.q@cn6050 
 cn6045 16 par6.q@cn6045 
>> 
>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>
> Hey Reuti
>
> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
> module, and it looks like it is expecting a different format. I suspect that 
> is the problem

I should have said that the parsing code is OK, and it specifically
works with the above.  (It should probably be made more robust by
ensuring it reads to end-of-line, and preferably should interpret a
binding string as the fourth field.)


Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-13 Thread Dave Love
Ralph Castain  writes:

>> I think there's a problem with documentation at least not being
>> explicit, and it would really help to have it clarified unless I'm
>> missing some.
>
> Not quite sure I understand this comment - the problem is that we
> aren’t correctly reading the allocation, as evidenced by when the user
> ran with --display-allocation. From what we can see, it looks like the
> PE_HOSTFILE may be containing some unexpected characters that make us
> think we hit EOF at the end of the first line, thus ignoring the
> second node.

I suspect that the environment variables Reuti listed are trashed, but
without printing the environment and the contents of $PE_HOSTFILE, it's
only a guess.

But on the face of it (ignoring the diagnostics) the observed
"oversubscription" still seems consistent with what documentation there
is.  I can't see where it says what is the correct behaviour for the
mapping without the mpirun command specifying it.

>> 
>> Although there's probably more to it in this case, the behaviour seemed
>> consistent with what I deduced (without reading the code) from the doc,
>> ompi_info, and experiment that at least wasn't inconsistent:  the node
>> has 32 processing units, and the default allocation is by socket,
>> apparently round-robin within nodes.  I can't check the actual behaviour
>> in that case just now.



[OMPI users] mpi_wtime implementation

2014-11-17 Thread Dave Love
I discovered from looking at the mpiP profiler that OMPI always uses
gettimeofday rather than clock_gettime to implement mpi_wtime on
GNU/Linux, and that looks sub-optimal.  I don't remember what the
resolution of gettimeofday is in practice, but I did need to write a
drop-in replacement for benchmarks.  [mpiP expects mpi_wtime to be high
resolution, and you have to configure it for clock_gettime explicitly.]

Before I raise an issue:  is there some good reason not to use
clock_gettime, especially as gettimeofday is obsolete in POSIX?  I guess
not, especially as the VT component uses clock_gettime.
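
For illustration, a minimal sketch of such a drop-in (assuming
CLOCK_MONOTONIC is available and that the definition is linked ahead of
libmpi; not necessarily what was actually used, and older glibc needs
-lrt):

  #include <time.h>

  /* Override the library's MPI_Wtime with a clock_gettime-based one. */
  double MPI_Wtime(void)
  {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);     /* monotonic, ns resolution */
      return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
  }

  /* Report a matching tick so MPI_Wtick stays consistent. */
  double MPI_Wtick(void)
  {
      struct timespec res;
      clock_getres(CLOCK_MONOTONIC, &res);
      return res.tv_sec + 1.0e-9 * res.tv_nsec;
  }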



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-17 Thread Dave Love
Ralph Castain  writes:

>> On Nov 13, 2014, at 3:36 PM, Dave Love  wrote:
>> 
>> Ralph Castain  writes:
>> 
>>>>>> cn6050 16 par6.q@cn6050 
>>>>>> cn6045 16 par6.q@cn6045 
>>>> 
>>>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>>> 
>>> Hey Reuti
>>> 
>>> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
>>> module, and it looks like it is expecting a different format. I suspect 
>>> that is the problem
>> 
>> I should have said that the parsing code is OK, and it specifically
>> works with the above.  (It should probably be made more robust by
>> ensuring it reads to end-of-line, and preferably should interpret a
>> binding string as the fourth field.)
>
> Afraid I am confused - if we look at the user’s output from mpirun
> --display-allocation, you can see that we only got the first line in
> the above. We didn’t see the second node at all. So the parsing code
> is clearly not reading that file correctly, or they have some envar
> set that is telling us to ignore the second node somehow.
>
> What am I missing?

Well, you don't know what mpirun's environment looked like, other than
NSLOTS apparently being intact.  The output from mpirun was consistent
with clobbering the other variables Reuti listed (breaking the SGE
"tight integration"):

  $ cat STDIN.o$(qsub -pe mpi 32 -l p=16,h_rt=9 -terse -sync y | head -1)
  unset PE_HOSTFILE
  mpirun --np $NSLOTS --display-allocation true

  ==   ALLOCATED NODES   ==
comp162: slots=16 max_slots=0 slots_inuse=0 state=UP
  =
  $ 



Re: [OMPI users] mpi_wtime implementation

2014-11-19 Thread Dave Love
"Daniels, Marcus G"  writes:

> On Mon, 2014-11-17 at 17:31 +, Dave Love wrote:
>> I discovered from looking at the mpiP profiler that OMPI always uses
>> gettimeofday rather than clock_gettime to implement mpi_wtime on
>> GNU/Linux, and that looks sub-optimal. 
>
> It can be very expensive in practice, especially for codes that have
> fine-grained instrumentation. 

OK, but I assumed VT would take that sort of thing into account for
platforms I don't have.  clock_gettime(CLOCK_MONOTONIC,) is as fast as
gettimeofday on our mainstream sort of system (RHEL6, sandybridge);
CLOCK_MONOTONIC_COARSE is about three times faster.  [I can't find that
sort of information in Linux doc.]

Perhaps there should be a choice via an MCA parameter, but it looks as
though it should default to clock_gettime on x86_64 Linux.  I suppose
one can argue what "high resolution" means in the mpi_wtime doc, but I'd
rather not.
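
For reference, a rough sketch of one way to measure the per-call cost
(not the program behind the figures above; CLOCK_MONOTONIC_COARSE is
Linux-specific, and older glibc needs -lrt):

  #define _GNU_SOURCE              /* CLOCK_MONOTONIC_COARSE on older glibc */
  #include <stdio.h>
  #include <sys/time.h>
  #include <time.h>

  #define N 10000000

  /* Time N calls of the given function; return elapsed seconds. */
  static double time_loop(void (*call)(void))
  {
      struct timespec t0, t1;
      int i;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (i = 0; i < N; i++)
          call();
      clock_gettime(CLOCK_MONOTONIC, &t1);
      return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  }

  static void gtod(void)   { struct timeval tv;  gettimeofday(&tv, NULL); }
  static void mono(void)   { struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); }
  static void coarse(void) { struct timespec ts; clock_gettime(CLOCK_MONOTONIC_COARSE, &ts); }

  int main(void)
  {
      printf("gettimeofday:           %.1f ns/call\n", 1e9 * time_loop(gtod) / N);
      printf("CLOCK_MONOTONIC:        %.1f ns/call\n", 1e9 * time_loop(mono) / N);
      printf("CLOCK_MONOTONIC_COARSE: %.1f ns/call\n", 1e9 * time_loop(coarse) / N);
      return 0;
  }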


[OMPI users] "default-only MCA variable"?

2014-11-27 Thread Dave Love
Why can't I set parameters like this (not the only one) with 1.8.3?

  WARNING: A user-supplied value attempted to override the default-only MCA
  variable named "btl_sm_use_knem".



Re: [OMPI users] "default-only MCA variable"?

2014-11-28 Thread Dave Love
Gilles Gouaillardet  writes:

> It could be because configure did not find the knem headers and hence knem is 
> not supported and hence this mca parameter is read-only

Yes, in that case (though knem was meant to be used and it's annoying
that configure doesn't abort if it doesn't find something you've
explicitly asked for, and I didn't immediately need it).  However, I got
the same for at least mpi_abort_print_stack with that parameter set.

This didn't happen with OMPI 1.6 and there's no obvious way to turn it
off.



Re: [OMPI users] "default-only MCA variable"?

2014-11-28 Thread Dave Love
Gustavo Correa  writes:

> Hi Dave, Gilles, list
>
> There is a problem with knem in OMPI 1.8.3.
> A fix is supposed to come on OMPI 1.8.4.
> Please, see this long thread:
> http://www.open-mpi.org/community/lists/users/2014/10/25511.php
>
> Note also, as documented in the thread, 
> that in the OMPI 1.8 series "vader" replaces "sm" as the default intranode 
> btl.

Thanks.  I share the frustration (though my real ire currently is
directed at Red Hat for the MPI damage in RHEL 6.6).



[OMPI users] using multiple IB connections between hosts

2015-01-28 Thread Dave Turner
 I ran some aggregate bandwidth tests between 2 hosts connected by
both QDR InfiniBand and RoCE-enabled 10 Gbps Mellanox cards.  The tests
measured the aggregate performance for 16 cores on one host communicating
with 16 on the second host.  I saw the same performance as with the QDR
InfiniBand alone, so it appears that the addition of the 10 Gbps RoCE
cards is not helping.

 Should OpenMPI be using both in this case by default, or is there
something I need to configure to allow for this?  I suppose this is the
same question as how to make use of 2 identical IB connections on each
node, or is the system simply ignoring the 10 Gbps cards because they are
the slower option.

 Any clarification on this would be helpful.  The only posts I've found
are very old and discuss mostly channel bonding of 1 Gbps cards.

 Dave Turner

-- 
Work: davetur...@ksu.edu (785) 532-7791
 118 Nichols Hall, Manhattan KS  66502
Home:drdavetur...@gmail.com
  cell: (785) 770-5929


Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-05-06 Thread Dave Love
Gus Correa  writes:

> Hi Giacomo
>
> Some programs fail with segmentation fault
> because the stack size is too small.

Yes, the default for Intel Fortran is to allocate large-ish amounts on
the stack, which may matter when the compiled program runs.

However, look at the backtrace.  It's apparently coming from the loader,
so something is pretty screwed up, though I can't guess what.  It would
help to have debugging symbols; always use at least -g and have
GNU/Linux distribution debuginfo packages to hand.

[Probably not relevant in this case, but I try to solve problems with
the Intel compiler and MPI (sorry Jeff et al) by persuading users to
avoid them.  GCC is more reliable in my experience, and the story about
its supposedly poor code generation isn't supported by experiment (if
that counts for anything these days).]

> [But others because of bugs in memory allocation/management, etc.]
>
> Have you tried
>
> ulimit -s unlimited
>
> before you run the program?
>
> Are you using a single machine or a cluster?
> If you're using infiniband you may need also to make the locked memory
> unlimited:
>
> ulimit -l unlimited
>
> I hope this helps,
> Gus Correa
>
> On 05/05/2016 05:15 AM, Giacomo Rossi wrote:
>>   gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
>> GNU gdb (GDB) 7.11
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> 
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-pc-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> .
>> Find the GDB manual and other documentation resources online at:
>> .
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no
>> debugging symbols found)...done.
>> (gdb) r -v
>> Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x76858f38 in ?? ()
>> (gdb) bt
>> #0  0x76858f38 in ?? ()
>> #1  0x77de5828 in _dl_relocate_object () from
>> /lib64/ld-linux-x86-64.so.2
>> #2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
>> #3  0x77df029c in _dl_sysdep_start () from
>> /lib64/ld-linux-x86-64.so.2
>> #4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
>> #5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
>> #6  0x0002 in ?? ()
>> #7  0x7fffaa8a in ?? ()
>> #8  0x7fffaab6 in ?? ()
>> #9  0x in ?? ()
>>
>> Giacomo Rossi Ph.D., Space Engineer
>>
>> Research Fellow at Dept. of Mechanical and Aerospace Engineering,
>> "Sapienza" University of Rome
>> *p: *(+39) 0692927207 | *m**: *(+39) 3408816643 | *e:
>> *giacom...@gmail.com 
>> 
>> Member of Fortran-FOSS-programmers
>> 


Re: [OMPI users] Building vs packaging

2016-05-16 Thread Dave Love
"Rob Malpass"  writes:

> Almost in desperation, I cheated:

Why is that cheating?  Unless you specifically want a different version,
it seems sensible to me, especially as you then have access to packaged
versions of at least some MPI programs.  Likewise with rpm-based
systems, which I'm afraid I know more about.

Also the package system ensures that things don't break by inadvertently
removing their dependencies; the hwloc libraries might be an example.

> sudo  apt-get install openmpi-bin
>
>  
>
> and hey presto.   I can now do (from head node)
>
>  
>
> mpirun -H node2,node3,node4 -n 10 foo
>
>  
>
> and it works fine.   So clearly apt-get install has set something that I'd
> not done (and it's seemingly not LD_LIBRARY_PATH) as ssh node2 'echo
> $LD_LIBRARY_PATH' still returns a blank line.

No.  As I said recently, Debian installs a default MPI (via the
alternatives system) with libraries in the system search path.  Check
the library contents.


Re: [OMPI users] No core dump in some cases

2016-05-16 Thread Dave Love
Gilles Gouaillardet  writes:

> Are you sure ulimit -c unlimited is *really* applied on all hosts
>
>
> can you please run the simple program below and confirm that ?

Nothing specifically wrong with that, but it's worth installing
procenv(1) as a general solution to checking the (generalized)
environment of a job.  It's packaged for Debian/Ubuntu and Fedora/EPEL,
at least.
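
For what it's worth, a sketch of that kind of check (the program referred
to above wasn't included in the quote; this is just an illustration)
printing the core-file limit each rank actually gets:

  #include <stdio.h>
  #include <sys/resource.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      struct rlimit rl;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      getrlimit(RLIMIT_CORE, &rl);        /* the limit applied to this process */
      if (rl.rlim_cur == RLIM_INFINITY)
          printf("rank %d: core limit unlimited\n", rank);
      else
          printf("rank %d: core limit %llu bytes\n", rank,
                 (unsigned long long) rl.rlim_cur);
      MPI_Finalize();
      return 0;
  }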


Re: [OMPI users] Question about mpirun mca_oob_tcp_recv_handler error.

2016-05-16 Thread Dave Love
Ralph Castain  writes:

> This usually indicates that the remote process is using a different OMPI
> version. You might check to ensure that the paths on the remote nodes are
> correct.

That seems quite a common problem with non-obvious failure modes.

Is it not possible to have a mechanism that checks the consistency of
the components and aborts in a clear way?  I've never thought it out,
but it seems that some combination of OOB messages, library versioning
(at least with ELF) and environment variables might do it.
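
As a crude illustration of the application-level end of that (only a
sketch, and it obviously can't help when things fall over before MPI_Init
completes), MPI-3's MPI_Get_library_version can at least expose mismatched
libraries across ranks:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      char ver[MPI_MAX_LIBRARY_VERSION_STRING];
      char *all = NULL;
      int len, rank, size, i;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_library_version(ver, &len);

      if (rank == 0)
          all = malloc((size_t) size * MPI_MAX_LIBRARY_VERSION_STRING);
      MPI_Gather(ver, MPI_MAX_LIBRARY_VERSION_STRING, MPI_CHAR,
                 all, MPI_MAX_LIBRARY_VERSION_STRING, MPI_CHAR,
                 0, MPI_COMM_WORLD);
      if (rank == 0) {
          for (i = 1; i < size; i++)    /* compare everyone against rank 0 */
              if (strcmp(all, all + (size_t) i * MPI_MAX_LIBRARY_VERSION_STRING))
                  fprintf(stderr, "rank %d has a different MPI: %s\n", i,
                          all + (size_t) i * MPI_MAX_LIBRARY_VERSION_STRING);
          free(all);
      }
      MPI_Finalize();
      return 0;
  }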


Re: [OMPI users] Building vs packaging

2016-05-20 Thread Dave Love
dani  writes:

> I don't know about .deb packages, but at least in the rpms there is a
> post install scriptlet that re-runs ldconfig to ensure the new libs
> are in the ldconfig cache.

MPI packages following the Fedora guidelines don't do that (and rpmlint
complains bitterly as a consequence).  They rely on LD_LIBRARY_PATH via
environment modules, for better or worse:

  $ mock --shell 'rpm -q openmpi; rpm -q --scripts openmpi' 2>/dev/null
  openmpi-1.8.1-1.el6.x86_64
  $ 

[Using mock for a vanilla environment.]


Re: [OMPI users] OpenMPI 1.6.5 on CentOS 7.1, silence ib-locked-pages?

2016-05-20 Thread Dave Love
Ryan Novosielski  writes:

> I’m pretty sure this is no longer relevant (having read Roland’s
> messages about it from a couple of years ago now). Can you please
> confirm that for me, and then let me know if there is any way that I
> can silence this old copy of OpenMPI that I need to use with some
> software that depends on it for some reason? It is causing my users to
> report it as an issue pretty regularly.

Does following the FAQ not have any effect?  I don't see it would do
much harm anyway.

[For what it's worth, the warning still occurs here on a very large
memory system with the recommended settings.]


[OMPI users] wtime implementation in 1.10

2016-05-23 Thread Dave Love
I thought the 1.10 branch had been fixed to use clock_gettime for
MPI_Wtime where it's available, a la
https://www.open-mpi.org/community/lists/users/2016/04/28899.php -- and
have been telling people so!  However, I realize it hasn't, and it looks
as if 1.10 is still being maintained.

Is there a good reason for that, or could it be fixed?


Re: [OMPI users] wtime implementation in 1.10

2016-05-24 Thread Dave Love
Ralph Castain  writes:

> Nobody ever filed a PR to update the branch with the patch - looks
> like you never responded to confirm that George’s proposed patch was
> acceptable.

I've never seen anything asking me about it, but I'm not an OMPI
developer in a position to review backports or even put things in a bug
tracker.

1.10 isn't used here, and I just subvert gettimeofday whenever I'm
running something that might use it for timing short intervals.

> I’ll create the PR and copy you for review
>
>
>> On May 23, 2016, at 9:17 AM, Dave Love  wrote:
>> 
>> I thought the 1.10 branch had been fixed to use clock_gettime for
>> MPI_Wtime where it's available, a la
>> https://www.open-mpi.org/community/lists/users/2016/04/28899.php -- and
>> have been telling people so!  However, I realize it hasn't, and it looks
>> as if 1.10 is still being maintained.
>> 
>> Is there a good reason for that, or could it be fixed?


Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Dave Love
Megdich Islem  writes:

> Yes, Empire does the fluid structure coupling. It couples OpenFoam (fluid 
> analysis) and Abaqus (structural analysis).
> Does all the software need to have the same MPI architecture in order to 
> communicate ?

I doubt it's doing that, and presumably you have no control over abaqus,
which is a major source of pain here.

You could wrap one (set of) program(s) in a script to set the
appropriate environment before invoking the real program.  That might be
a bit painful if you need many of the OF components, but it should be
straightforward to put scripts somewhere on PATH ahead of the real
versions.

On the other hand, it never ceases to amaze how difficult proprietary
engineering applications make life on HPC systems; I could believe
there's a catch.  Also you (or systems people) normally want programs to
use the system MPI, assuming that's been set up appropriately.


Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-25 Thread Dave Love
I wrote: 

> You could wrap one (set of) program(s) in a script to set the
> appropriate environment before invoking the real program.  

I realize I should have said something like "program invocations",
i.e. if you have no control over something invoking mpirun for programs
using different MPIs, then an mpirun wrapper needs to check what it's
being asked to run.


[OMPI users] 2.0 documentation

2016-06-22 Thread Dave Love
I know it's not traditional, but is there any chance of complete
documentation of the important changes in v2.0?  Currently NEWS mentions
things like minor build issues, but there's nothing, for instance, on
the addition and removal of whole frameworks, one of which I've been
trying to understand.


Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-22 Thread Dave Love
"Llolsten Kaonga"  writes:

> Hello Grigory,
>
> I am not sure what Redhat does exactly but when you install the OS, there is
> always an InfiniBand Support module during the installation process. We
> never check/install that module when we do OS installations because it is
> usually several versions of OFED behind (almost obsolete).

In addition to what Peter Kjellström said:  Do you have evidence of
actual significant problems with RH's IB support?  It was an improvement
to throw out our vendor's OFED offering.  Also why run RHEL if you're
going to use things which will presumably prevent you getting support in
important areas?  (At least two OFED components were maintained by a Red
Hat employee last I knew.)


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-22 Thread Dave Love
Rob Nagler  writes:

> Thanks, John. I sometimes wonder if I'm the only one out there with this
> particular problem.
>
> Ralph, thanks for sticking with me. :) Using a pool of uids doesn't really
> work due to the way cgroups/containers works. It also would require
> changing the permissions of all of the user's files, which would create
> issues for Jupyter/Hub's access to the files, which is used for in situ
> monitoring.

Skimming back at this, like Ralph I really don't understand it as a
maintainer of a resource manager (at a level above Ralph's) and as
someone who formerly had the "pleasure" of HEP requirements which
attempted to defeat essentially any reasonable management policy.  (It
seems off-topic here.)

Amongst reasons for not running Docker, a major one that I didn't notice
raised is that containers are not started by the resource manager, but
by a privileged daemon, so the resource manager can't directly control
or monitor them.

From a brief look at Jupyter when it came up a while ago, I wouldn't
want to run it, and I wasn't alone.  (I've been lectured about the lack
of problems with such things by people on whose clusters I could
trivially run jobs as any normal user and sometimes as root.)

+1 for what Ralph said about singularity in particular.  While there's
work to be done, you could even convert docker images on the fly in a
resource manager prolog.  I'm awaiting enlightenment on the on-topic
issue of running MPI jobs with it, though.


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-18 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> On Aug 16, 2016, at 3:07 PM, Reuti  wrote:
>> 
>> Thx a bunch - that was it. Despite searching for a solution I found
>> only hints that didn't solve the issue.
>
> FWIW, we talk about this in the HACKING file, but I admit that's not
> necessarily the easiest place to find:
>
> https://github.com/open-mpi/ompi/blob/master/HACKING#L126-L129

autogen.pl tries to check versions of the tools, as one might hope, so
the question is why it fails.  The check works for me on RHEL6 if I
reverse the order of the autoconf and libtool checks.

A related question I should have asked long ago:  I don't suppose it
would have helped to catch this, but why is it necessary to configure
gridengine support specifically?  It doesn't need library support and
seems harmless at run time if you're not using gridengine, as it just
needs environment variables which are unlikely to be wrongly set --
other things rely on them to distinguish resource managers.  Similarly
for any other resource manager that works like that.


Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-18 Thread Dave Love
"Audet, Martin"  writes:

> Hi Josh,
>
> Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my 
> MPI processes
> and it did improve performance but the performance I obtain isn't completely 
> satisfying.

I raised the issue of MXM hurting p2p latency here a while ago, but
don't have a solution.  Mellanox were here last week and promised to
address that, but I haven't heard back.  I get the impression this stuff
isn't widely used, and since it's proprietary, unlike PSM, we can't
really investigate.


Re: [OMPI users] Certain files for mpi missing when building mpi4py

2016-08-31 Thread Dave Love
"Mahdi, Sam"  writes:

> HI everyone,
>
> I am using a linux fedora. I downloaded/installed
> openmpi-1.7.3-1.fc20(64-bit) and openmpi-devel-1.7.3-1.fc20(64-bit). As
> well as pypar-openmpi-2.1.5_108-3.fc20(64-bit) and
> python3-mpi4py-openmpi-1.3.1-1.fc20(64-bit). The problem I am having is
> building mpi4py using the mpicc wrapper.

Why build it when you have the package?

If you do need to rebuild it for some reason, get the source rpm and
look at the recipe in the .spec file, or edit the .spec and just use
rpmbuild.

[I assume there's a good reason for F20, but it's three versions
obsolete.]



[OMPI users] mpi4py/fc20 (was: users Digest, Vol 3592, Issue 1)

2016-09-01 Thread Dave Love
"Mahdi, Sam"  writes:

> To Dave, from the installation guide I found, it seemed I couldn't just
> directly download it from the package list, but rather I'd need to use the
> mpicc wrapper to compile and install.

That makes no sense to a maintainer of some openmpi Fedora packages, and
I actually have mpi4py-openmpi installed and working from EPEL6.

> I also wanted to see if I could build
> it from the installation guide, sorta learn how the whole process worked.

Well, the spec file tells you how to build on the relevant version of
Fedora, including the dependencies.

> To guilles, do I need to download open mpi directly from the site to obtain
> the mpicc and to get the current version?

You said you already have the openmpi-devel package, which is what
provides it.

I really wouldn't run f20 on a typical HPC system, though.


Re: [OMPI users] MPI libraries

2016-09-12 Thread Dave Love
Gilles Gouaillardet  writes:

> Mahmood,
>
> mpi_siesta is a siesta library, not an Open MPI library.
>
> fwiw, you might want to try again from scratch with
> MPI_INTERFACE=libmpi_f90.a
> DEFS_MPI=-DMPI
> in your arch.make
>
> i do not think libmpi_f90.a is related to an OpenMPI library.

libmpi_f90 is the Fortran 90 library in OMPI 1.6, but presumably you
want the shared, system version.

> if you need some more support, please refer to the siesta doc and/or ask on
> a siesta mailing list

I used the system MPI (which is OMPI 1.6 for historical reasons) and it
seems siesta 4.0 just built on RHEL6 with the rpm spec fragment below,
but I'm sure it would also work with 1.8.  (However, it needs cleaning
up significantly for the intended Fedora packaging.)

  %global _configure ../Src/configure
  cd Obj
  ../Src/obj_setup.sh
  %_openmpi_load
  %configure --enable-mpi
  make # not smp-safe

(%_openmpi_load just does "module load openmpi_x86_64" in this case.)


Re: [OMPI users] MPI libraries

2016-09-13 Thread Dave Love
I wrote: 

> Gilles Gouaillardet  writes:
>
>> Mahmood,
>>
>> mpi_siesta is a siesta library, not an Open MPI library.
>>
>> fwiw, you might want to try again from scratch with
>> MPI_INTERFACE=libmpi_f90.a
>> DEFS_MPI=-DMPI
>> in your arch.make
>>
>> i do not think libmpi_f90.a is related to an OpenMPI library.

For completeness/accuracy:  Apologies -- it turns out that's right (as
well as the below, of course).  The link step has

  ... libmpi_f90.a ... -libmpi_f90 ...

so there's at least some excuse for confusion if you haven't looked
twice at the end of the build output.

> libmpi_f90 is the Fortran 90 library in OMPI 1.6, but presumably you
> want the shared, system version.


Re: [OMPI users] Compilation without NVML support

2016-09-20 Thread Dave Love
Brice Goglin  writes:

> Hello
> Assuming this NVML detection is actually done by hwloc, I guess there's
> nothing in OMPI to disable it. It's not the first time we get such an
> issue with OMPI not having all hwloc's --disable-foo options, but I
> don't think we actually want to propagate all of them.

I'd build against the system libhwloc, if only for consistency with
other things using hwloc.

However, don't --disable-... options get passed to sub-configures
(possibly with a warning at higher levels)?  I guess that's what Gilles
meant.

> Maybe we should just force several enable_foo=no when OMPI invokes
> hwloc's configury. At least nvml, gl, opencl, libudev are likely useless
> for OMPI.
> Brice

For what it's worth, I've found the nvidia bits harmful in another
situation.  After much head scratching, I found that mysterious bus
errors crashing SGE daemons seemed connected to them, and the crashes
went away when I rebuilt the hwloc library without the stuff.  [I didn't
think I could make a useful bug report.]


[OMPI users] specifying memory affinity

2016-09-20 Thread Dave Love
I don't think it's possible, but just to check:  can you specify memory
affinity distinct from core binding somehow with OMPI (i.e. not with
hwloc-bind as a shim under mpirun)?

It seems to be relevant in Knights Landing "hybrid" mode with separate
MCDRAM NUMA nodes, as I assume you still want core binding -- or is that
not so?
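
[In case it clarifies what I mean: inside the application you can do
something like the following with libnuma -- a sketch only, with a
made-up MCDRAM node number, and it leaves the core binding from mpirun
alone -- but I'm asking whether mpirun itself can express it.]

  #include <stdio.h>
  #include <numa.h>                /* link with -lnuma */

  int main(void)
  {
      size_t bytes = 1 << 26;
      int mcdram_node = 1;         /* hypothetical; the node number varies by system */
      double *buf;

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support\n");
          return 1;
      }
      buf = numa_alloc_onnode(bytes, mcdram_node);  /* pages bound to that node */
      if (buf == NULL) {
          fprintf(stderr, "allocation failed\n");
          return 1;
      }
      /* ... use buf; core binding from the launcher is unaffected ... */
      numa_free(buf, bytes);
      return 0;
  }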


Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-11 Thread Dave Love
Gilles Gouaillardet  writes:

> Bennet,
>
>
> my guess is mapping/binding to sockets was deemed the best compromise
> from an
>
> "out of the box" performance point of view.
>
>
> iirc, we did fix some bugs that occured when running under asymmetric
> cpusets/cgroups.
>
> if you still have some issues with the latest Open MPI version (2.0.1)
> and the default policy,
>
> could you please describe them ?

I also don't understand why binding to sockets is the right thing to do.
Binding to cores seems the right default to me, and I set that locally,
with instructions about running OpenMP.  (Isn't that what other
implementations do, which makes them look better?)

I think at least numa should be used, rather than socket.  Knights
Landing, for instance, is single-socket, so one gets no actual binding by
default.


Re: [OMPI users] Launching hybrid MPI/OpenMP jobs on a cluster: correct OpenMPI flags?

2016-10-11 Thread Dave Love
Wirawan Purwanto  writes:

> Instead of the scenario above, I was trying to get the MPI processes
> side-by-side (more like "fill_up" policy in SGE scheduler), i.e. fill
> node 0 first, then fill node 1, and so on. How do I do this properly?
>
> I tried a few attempts that fail:
>
> $ export OMP_NUM_THREADS=2
> $ mpirun -np 16 -map-by core:PE=2 ./EXECUTABLE

...

> Clearly I am not understanding how this map-by works. Could somebody
> help me? There was a wiki article partially written:
>
> https://github.com/open-mpi/ompi/wiki/ProcessPlacement
>
> but unfortunately it is also not clear to me.

Me neither; this stuff has traditionally been quite unclear and really
needs documenting/explaining properly.

This sort of thing from my local instructions for OMPI 1.8 probably does
what you want for OMP_NUM_THREADS=2 (where the qrsh options just get me
a couple of small nodes):

  $ qrsh -pe mpi 24 -l num_proc=12 \
 mpirun -n 12 --map-by slot:PE=2 --bind-to core --report-bindings true |&
 sort -k 4 -n
  [comp544:03093] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B/./././.][./././././.]
  [comp544:03093] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 
3[hwt 0]]: [././B/B/./.][./././././.]
  [comp544:03093] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 
5[hwt 0]]: [././././B/B][./././././.]
  [comp544:03093] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]]: [./././././.][B/B/./././.]
  [comp544:03093] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 
9[hwt 0]]: [./././././.][././B/B/./.]
  [comp544:03093] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 
11[hwt 0]]: [./././././.][././././B/B]
  [comp527:03056] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B/./././.][./././././.]
  [comp527:03056] MCW rank 7 bound to socket 0[core 2[hwt 0]], socket 0[core 
3[hwt 0]]: [././B/B/./.][./././././.]
  [comp527:03056] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 
5[hwt 0]]: [././././B/B][./././././.]
  [comp527:03056] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]]: [./././././.][B/B/./././.]
  [comp527:03056] MCW rank 10 bound to socket 1[core 8[hwt 0]], socket 1[core 
9[hwt 0]]: [./././././.][././B/B/./.]
  [comp527:03056] MCW rank 11 bound to socket 1[core 10[hwt 0]], socket 1[core 
11[hwt 0]]: [./././././.][././././B/B]

I don't remember how I found that out.


Re: [OMPI users] Using Open MPI with multiple versions of GCC and G++

2016-10-11 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> Especially with C++, the Open MPI team strongly recommends you
> building Open MPI with the target versions of the compilers that you
> want to use.  Unexpected things can happen when you start mixing
> versions of compilers (particularly across major versions of a
> compiler).  To be clear: compilers are *supposed* to be compatible
> across multiple versions (i.e., compile a library with one version of
> the compiler, and then use that library with an application compiled
> by a different version of the compiler), but a) there's other issues,
> such as C++ ABI issues and other run-time bootstrapping that can
> complicate things, and b) bugs in forward and backward compatibility
> happen.

Is that actually observed in GNU/Linux systems?  I'd expect it either to
work or just fail to link.  For instance, the RHEL 6 devtoolset-4 (gcc
5) uses the system libstdc++, and the system compiler is gcc 4.4.

> The short answer is in this FAQ item:
> https://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0.
> Substituting the gcc 5 compiler may work just fine.

For what it's worth, not for GNU Fortran, which unfortunately changes
the module format incompatibly with each release, or at least most
releases.
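
For the C case, the override from that FAQ item is just an environment
variable for the wrapper compiler, e.g. (untested here, compiler name
assumed):

  OMPI_CC=gcc-5 mpicc hello_c.c -o hello

The Fortran analogue (OMPI_FC) is exactly where the module-format
problem above bites.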


Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-11-07 Thread Dave Love
[Some time ago]
Jeff Hammond  writes:

> If you want to keep long-waiting MPI processes from clogging your CPU
> pipeline and heating up your machines, you can turn blocking MPI
> collectives into nicer ones by implementing them in terms of MPI-3
> nonblocking collectives using something like the following.

I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel).  It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that?  (sched_yield takes <1μs.)

> I typed this code straight into this email, so you should validate it
> carefully.

...

> #elif USE_CPU_RELAX
> cpu_relax(); /*
> http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
> */

Is cpu_relax available to userland?  (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)
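
For reference, here's the shape of the wrapper being discussed -- my own
minimal sketch (hypothetical name, untested), assuming MPI-3 nonblocking
collectives and picking sched_yield as the back-off:

  #include <mpi.h>
  #include <sched.h>

  /* Blocking broadcast that yields the CPU while waiting instead of
     spinning hard inside MPI_Bcast. */
  static int nice_bcast(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
  {
      MPI_Request req;
      int flag = 0;

      MPI_Ibcast(buf, count, type, root, comm, &req);
      while (!flag) {
          MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
          if (!flag)
              sched_yield();  /* or pause/nanosleep, with the latency
                                 caveats above */
      }
      return MPI_SUCCESS;
  }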

Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-11-07 Thread Dave Love
"r...@open-mpi.org"  writes:

> Yes, I’ve been hearing a growing number of complaints about cgroups for that 
> reason. Our mapping/ranking/binding options will work with the cgroup 
> envelope, but it generally winds up with a result that isn’t what the user 
> wanted or expected.

How?  I don't understand as an implementor why there's a difference from
just resource manager core binding, assuming the programs don't try to
escape the binding.  (I'm not saying there's nothing wrong with cgroups
in general...)

> We always post the OMPI BoF slides on our web site, and we’ll do the same 
> this year. I may try to record webcast on it and post that as well since I 
> know it can be confusing given all the flexibility we expose.
>
> In case you haven’t read it yet, here is the relevant section from “man 
> mpirun”:

I'm afraid I read that, and various versions of the code at different
times, and I've worked on resource manager core binding.  I still had to
experiment to find a way to run mpi+openmp jobs correctly, in multiple
ompi versions.  NEWS usually doesn't help, nor conference talks for
people who aren't there and don't know they should search beyond the
documentation.  We don't even seem to be able to make reliable bug
reports as they may or may not get picked up here.

Regardless, I can't see how binding to socket can be a good default.

Re: [OMPI users] Redusing libmpi.so size....

2016-11-07 Thread Dave Love
Mahesh Nanavalla  writes:

> Hi all,
>
> I am using openmpi-1.10.3.
>
> openmpi-1.10.3 compiled for  arm(cross compiled on X86_64 for openWRT
> linux)  libmpi.so.12.0.3 size is 2.4MB,but if i compiled on X86_64 (linux)
> libmpi.so.12.0.3 size is 990.2KB.
>
> can anyone tell how to reduce the size of libmpi.so.12.0.3 compiled for
>  arm.

Do what Debian does for armel?

  du -h lib/openmpi/lib/libmpi.so.20.0.1
  804K  lib/openmpi/lib/libmpi.so.20.0.1

[What's ompi useful for on an openWRT system?]


Re: [OMPI users] mpi4py+OpenMPI: Qs about submitting bugs and examples

2016-11-07 Thread Dave Love
"r...@open-mpi.org"  writes:

>> Is this mailing list a good spot to submit bugs for OpenMPI? Or do I
>> use github?
>
> You can use either - I would encourage the use of github “issues” when
> you have a specific bug, and the mailing list for general questions

I was told not to do that, and to send here instead; README was even
changed to say so.  It doesn't seem a good way of getting issues
addressed.

Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-11-09 Thread Dave Love
Jeff Hammond  writes:

>> I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
>> RHEL 6 or 7, without specific tuning, on recent Intel).  It doesn't look
>> like something you want in paths that should be low latency, but maybe
>> there's something you can do to improve that?  (sched_yield takes <1μs.)
>
> I demonstrated a bunch of different implementations with the instruction to
> "pick one of these...", where establishing the relationship between
> implementation and performance was left as an exercise for the reader :-)

The point was that only the one seemed available on RHEL6 to this
exercised reader.  No complaints about the useful list of possibilities.

> Note that MPI implementations may be interested in taking advantage of
> https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.

Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?

>> Is cpu_relax available to userland?  (GCC has an x86-specific intrinsic
>> __builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
>> gcc-4.4.)
>
> The pause instruction is available in ring3.  Just use that if cpu_relax
> wrapper is not implemented.

[OK; I meant in a userland library.]

Are there published measurements of the typical effects of spinning and
ameliorations on some sort of "representative" system?

Re: [OMPI users] An old code compatibility

2016-11-15 Thread Dave Love
Mahmood Naderan  writes:

> Hi,
> The following mpifort command fails with a syntax error. It seems that the
> code is compatible with old gfortran, but I am not aware of that. Any idea
> about that?
>
> mpifort -ffree-form -ffree-line-length-0 -ff2c -fno-second-underscore
> -I/opt/fftw-3.3.5/include  -O3  -c xml.f90
> xml.F:641.46:
>
>CALL XML_TAG("set", comment="spin "
>   1
> Error: Syntax error in argument list at (1)
>
>
>
>
> In the source code, that line is
>
> CALL XML_TAG("set", comment="spin "//TRIM(ADJUSTL(strcounter)))

Apparently that mpifort is running cpp on .f90 files, and without
--traditional.  I've no idea how it could be set up to do that; gfortran
itself won't do it.
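
You can see the effect directly: without --traditional, cpp strips
everything from // onwards as a C++-style comment, which reproduces the
truncated call in the error (illustrative shell session, not run here):

  $ echo 'CALL XML_TAG("set", comment="spin "//TRIM(ADJUSTL(strcounter)))' | cpp -P
  CALL XML_TAG("set", comment="spin "
  $ echo 'CALL XML_TAG("set", comment="spin "//TRIM(ADJUSTL(strcounter)))' | cpp -P -traditional
  CALL XML_TAG("set", comment="spin "//TRIM(ADJUSTL(strcounter)))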


Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-12-08 Thread Dave Love
Jeff Hammond  writes:

>>
>>
>> > Note that MPI implementations may be interested in taking advantage of
>> > https://software.intel.com/en-us/blogs/2016/10/06/intel-
>> xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
>>
>> Is that really useful if it's KNL-specific and MSR-based, with a setup
>> that implementations couldn't assume?
>>
>>
> Why wouldn't it be useful in the context of a parallel runtime system like
> MPI?  MPI implementations take advantage of all sorts of stuff that needs
> to be queried with configuration, during compilation or at runtime.

I probably should have said "useful in practice".  The difference from
other things I can think of is that access to MSRs is privileged, and
it's not clear to me what the implications are of changing it or to what
extent you can assume people will.

> TSX requires that one check the CPUID bits for it, and plenty of folks are
> happily using MSRs (e.g.
> http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).

Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.


[OMPI users] MPI+OpenMP core binding redux

2016-12-08 Thread Dave Love
I think there was a suggestion that the SC16 material would explain how
to get appropriate core binding for MPI+OpenMP (i.e. OMP_NUM_THREADS
cores/process), but it doesn't as far as I can see.

Could someone please say how you're supposed to do that in recent
versions (without relying on bound DRM slots), and provide a working
example in the documentation?  It seems a fairly important case that
should be clear.  Thanks.
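
To be concrete, what I mean is something like the recipe I eventually
arrived at for 1.8 (see my earlier message about --map-by slot:PE=N),
with the process count times OMP_NUM_THREADS matching the allocated
cores:

  export OMP_NUM_THREADS=2
  mpirun -n 12 --map-by slot:PE=$OMP_NUM_THREADS --bind-to core \
         --report-bindings ./app

Is that still the sanctioned way in current versions?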


Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-12-12 Thread Dave Love
Andreas Schäfer  writes:

>> Yes, as root, and there are N different systems to at least provide
>> unprivileged read access on HPC systems, but that's a bit different, I
>> think.
>
> LIKWID[1] uses a daemon to provide limited RW access to MSRs for
> applications. I wouldn't wonder if support for this was added to
> LIKWID by RRZE.

Yes, that's one of the N I had in mind; others provide Linux modules.

From a system manager's point of view it's not clear what the
implications of the unprivileged access are, or even how much it really
helps.  I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage.  If
it's clearly safe and helpful then great, but I couldn't assess that.


Re: [OMPI users] rdmacm and udcm failure in 2.0.1 on RoCE

2016-12-15 Thread Dave Turner
Nathan:  Thanks for providing the debug flags.  I've attached the
output (NetPIPE.debug1) which basically shows that for RoCE the
udcm_component_query() will always fail.  Can someone verify that
udcm is indeed not supported for RoCE?  When I change
the test to force usage it does not work (NetPIPE.debug2).

[hero35][[38845,1],0][connect/btl_openib_connect_udcm.c:452:udcm_component_query]
UD CPC only supported on InfiniBand; skipped on mlx4_0:1
[hero35][[38845,1],0][connect/btl_openib_connect_udcm.c:501:udcm_component_query]
unavailable for use on mlx4_0:1; skipped

from btl_openib_connect_udcm.c

 438 static int udcm_component_query(mca_btl_openib_module_t *btl,
 439                                 opal_btl_openib_connect_base_module_t **cpc)
 440 {
 441     udcm_module_t *m = NULL;
 442     int rc = OPAL_ERR_NOT_SUPPORTED;
 443
 444     do {
 445         /* If we do not have struct ibv_device.transport_device, then
 446            we're in an old version of OFED that is IB only (i.e., no
 447            iWarp), so we can safely assume that we can use this CPC. */
 448 #if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && HAVE_DECL_IBV_LINK_LAYER_ETHERNET
 449         if (BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)) {
 450             BTL_VERBOSE(("UD CPC only supported on InfiniBand; skipped on %s:%d",
 451                          ibv_get_device_name(btl->device->ib_dev),
 452                          btl->port_num));
 453             break;
 454         }
 455 #endif

from base.h

#ifdef OPAL_HAVE_RDMAOE
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)   \
(((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) || \
(IBV_LINK_LAYER_ETHERNET == ((btl)->ib_port_attr.link_layer))) ?   \
true : false)
#else
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)   \
((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) ?   \
true : false)
#endif

So clearly for RoCE the transport is InfiniBand and the link layer is
Ethernet
so this will show that NOT_IB() is true, meaning that udcm is evidently
not supported for RoCE.  udcm definitely fails under 1.10.4 for RoCE in
our tests.  That means we need rdmacm to work which it evidently does
not at the moment for 2.0.1.  Could someone please verify that rdmacm
is not currently working in 2.0.1?  And therefore I'm assuming that
2.0.1 has not been successfully tested on RoCE???

   Dave



> --
>
> Message: 1
> Date: Wed, 14 Dec 2016 21:12:16 -0700
> From: Nathan Hjelm 
> To: drdavetur...@gmail.com, Open MPI Users 
> Subject: Re: [OMPI users] rdmacm and udcm failure in 2.0.1 on RoCE
> Message-ID: <32528c5d-14bc-42ce-b19a-684b81801...@me.com>
> Content-Type: text/plain; charset=utf-8
>
> Can you configure with --enable-debug and run with --mca btl_base_verbose
> 100 and provide the output? It may indicate why neither udcm nor rdmacm are
> available.
>
> -Nathan
>
>
> > On Dec 14, 2016, at 2:47 PM, Dave Turner  wrote:
> >
> > 
> --
> > No OpenFabrics connection schemes reported that they were able to be
> > used on a specific port.  As such, the openib BTL (OpenFabrics
> > support) will be disabled for this port.
> >
> >   Local host:   elf22
> >   Local device: mlx4_2
> >   Local port:   1
> >   CPCs attempted:   rdmacm, udcm
> > 
> --
> >
> > We have had no problems using 1.10.4 on RoCE but 2.0.1 fails to
> > find either connection manager.  I've read that rdmacm may have
> > issues under 2.0.1 so udcm may be the only one working.  Are there
> > any known issues with that on RoCE?  Or does this just mean we
> > don't have RoCE configured correctly?
> >
> >   Dave Turner
> >
> > --
> > Work: davetur...@ksu.edu (785) 532-7791
> >  2219 Engineering Hall, Manhattan KS  66506
> > Home:drdavetur...@gmail.com
> >   cell: (785) 770-5929
>
> --
Work: davetur...@ksu.edu (785) 532-7791
 2219 Engineering Hall, Manhattan KS  66506
Home:drdavetur...@gmail.com
  cell: (785) 770-5929


NetPIPE.debug1
Description: Binary data


NetPIPE.debug2
Description: Binary data

[OMPI users] epoll add error with OpenMPI 2.0.1 and SGE

2016-12-17 Thread Dave Turner
   I've solved this problem by omitting --with-libevent=/usr from
the configuration to force it to use the internal version.  I thought
I had tried this before posting but evidently did something wrong.

  Dave

On Tue, Dec 13, 2016 at 9:57 PM,  wrote:

> Message: 1
> Date: Tue, 13 Dec 2016 21:57:40 -0600
> From: Dave Turner 
> To: users@lists.open-mpi.org
> Subject: [OMPI users] epoll add error with OpenMPI 2.0.1 and SGE
> Message-ID:
>  mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> [warn] Epoll ADD(4) on fd 1 failed.  Old events were 0; read change was 0
> (none); write change was 1 (add): Operation not permitted
>
> Gentoo with compiled OpenMPI 2.0.1 and SGE
> ompi_info --all  file attached
>
> We recently did a maintenance upgrade to our cluster including
> moving to OpenMPI 2.0.1.  Fortran programs now give the
> epoll add error above at the start of a run and the stdout file
> freezes until the end of the run when all info is dumped.
>
> I've read about this problem and it seems to be a file lock
> issue where OpenMPI and SGE are both trying to lock the
> same output file.  We have not seen this problem with
> previous versions of OpenMPI.
>
> We've tried compiling OpenMPI with and without
> specifying  --with-libevent=/usr, and I've tried compiling
> with --disable-event-epoll and using -mca opal_event_include poll.
> Both of these were suggestions from a few years back but
> neither affects the problem.  I've also tried redirecting the output
> manually as:
>
> mpirun -np 4 ./app > file.out
>
> This just locks file.out instead with all the output again being
> dumped at the end of the run.
>
> We also do not have this issue with 1.10.4 installed.
>
>  Any suggestions?  Has anyone else run into this problem?
>
> Dave Turner
> --
> Work: davetur...@ksu.edu (785) 532-7791
>  2219 Engineering Hall, Manhattan KS  66506
> Home:drdavetur...@gmail.com
>   cell: (785) 770-5929



-- 
Work: davetur...@ksu.edu (785) 532-7791
 2219 Engineering Hall, Manhattan KS  66506
Home:drdavetur...@gmail.com
  cell: (785) 770-5929

[OMPI users] openib/mpi_alloc_mem pathology

2017-03-06 Thread Dave Love
I've been looking at a new version of an application (cp2k, for what
it's worth) which is calling mpi_alloc_mem/mpi_free_mem, and I don't
think it did so the previous version I looked at.  I found on an
IB-based system it's spending about half its time in those allocation
routines (according to its own profiling) -- a tad surprising.

It turns out that's due to some pathological interaction with openib,
and just having openib loaded.  It shows up on a single-node run iff I
don't suppress the openib btl, and doesn't for multi-node PSM runs iff I
suppress openib (on a mixed Mellanox/Infinipath system).

Can anyone say why, and whether there's a workaround?  (I can't easily
diagnose what it's up to as ptrace is turned off on the system
concerned, and I can't find anything relevant in archives.)

I had the idea to try libfabric instead for multi-node jobs, and that
doesn't show the pathological behaviour iff openib is suppressed.
However, it requires ompi 1.10, not 1.8, which I was trying to use.
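
In case anyone wants to poke at this without building cp2k, something as
trivial as the following sketch (hypothetical, untested here) should show
the effect when the openib btl is loaded:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      const int iters = 10000;
      void *buf;
      double t0, t1;
      int i;

      MPI_Init(&argc, &argv);
      t0 = MPI_Wtime();
      for (i = 0; i < iters; i++) {
          /* 1 MiB per allocation; MPI_Alloc_mem/MPI_Free_mem are what
             cp2k reaches via mpi_alloc_mem/mpi_free_mem */
          MPI_Alloc_mem((MPI_Aint)(1 << 20), MPI_INFO_NULL, &buf);
          MPI_Free_mem(buf);
      }
      t1 = MPI_Wtime();
      printf("%.2f us per alloc/free pair\n", 1.0e6 * (t1 - t0) / iters);
      MPI_Finalize();
      return 0;
  }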


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-09 Thread Dave Love
Paul Kapinos  writes:

> Hi Dave,
>
>
> On 03/06/17 18:09, Dave Love wrote:
>> I've been looking at a new version of an application (cp2k, for for what
>> it's worth) which is calling mpi_alloc_mem/mpi_free_mem, and I don't
>
> Welcome to the club! :o)
> In our measures we see some 70% of time in 'mpi_free_mem'... and 15x
> performance loss if using Open MPI vs. Intel MPI. So it goes.
>
> https://www.mail-archive.com/users@lists.open-mpi.org//msg30593.html

Ah, that didn't match my search terms.

Did cp2k's own profile not show the site of the slowdown (MP_Mem, if I
recall correctly)?  Maybe it's a different issue, especially if IMPI
surprisingly wins so much over IB -- even if it isn't subject to the
same pathology and is using a better collective algorithms.  For a
previous version of cp2k, my all-free software build was reported faster
than an all-Intel build on a similar system with faster processors.

OPA performance would be interesting if you could report it, say, for a
reasonably large cp2k quickstep run, especially if IB+libfabric results
were available on the same system.  (The two people I know who were
measuring OPA were NDA'd when I last knew.)

>> think it did so the previous version I looked at.  I found on an
>> IB-based system it's spending about half its time in those allocation
>> routines (according to its own profiling) -- a tad surprising.
>>
>> It turns out that's due to some pathological interaction with openib,
>> and just having openib loaded.  It shows up on a single-node run iff I
>> don't suppress the openib btl, and doesn't for multi-node PSM runs iff I
>> suppress openib (on a mixed Mellanox/Infinipath system).
>
> we're lucky - our issue is on Intel OmniPath (OPA) network (and we
> will junk IB hardware in near future, I think) - so we disabled the IB
> transport failback,
> --mca btl ^tcp,openib

That's what I did, but could still run with IB under OMPI 1.10 using the
ofi mtl.

> For single-node jobs this will also help on plain IB nodes,
> likely. (you can disable IB if you do not use it)

Yes, I guess I wasn't clear.

I'd still like to know the basic reason for this, and whether it's
OMPI-specific, if someone can say.


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-09 Thread Dave Love
Nathan Hjelm  writes:

> If this is with 1.10.x or older run with --mca memory_linux_disable
> 1. There is a bad interaction between ptmalloc2 and psm2 support. This
> problem is not present in v2.0.x and newer.

Is that applicable to openib too?


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-15 Thread Dave Love
Paul Kapinos  writes:

> Nathan,
> unfortunately '--mca memory_linux_disable 1' does not help on this
> issue - it does not change the behaviour at all.
>  Note that the pathological behaviour is present in Open MPI 2.0.2 as
> well as in /1.10.x, and Intel OmniPath (OPA) network-capable nodes are
> affected only.

[I guess that should have been "too" rather than "only".  It's loading
the openib btl that is the problem.]

> The known workaround is to disable InfiniBand failback by '--mca btl
> ^tcp,openib' on nodes with OPA network. (On IB nodes, the same tweak
> lead to 5% performance improvement on single-node jobs;

It was a lot more than that in my cp2k test.

> but obviously
> disabling IB on nodes connected via IB is not a solution for
> multi-node jobs, huh).

But it works OK with libfabric (ofi mtl).  Is there a problem with
libfabric?

Has anyone reported this issue to the cp2k people?  I know it's not
their problem, but I assume they'd like to know for users' sake,
particularly if it's not going to be addressed.  I wonder what else
might be affected.


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-21 Thread Dave Love
I wrote: 

> But it works OK with libfabric (ofi mtl).  Is there a problem with
> libfabric?

Apparently there is, or at least with ompi 1.10.  I've now realized IMB
pingpong latency on a QDR IB system with ompi 1.10.6+libfabric is
~2.5μs, which it isn't with ompi 1.6 openib.

Re: [OMPI users] Questions about integration with resource distribution systems

2017-07-27 Thread Dave Love
"r...@open-mpi.org"  writes:

> Oh no, that's not right. Mpirun launches daemons using qrsh and those
> daemons spawn the app's procs. SGE has no visibility of the app at all

Oh no, that's not right.

The whole point of tight integration with remote startup using qrsh is
to report resource usage and provide control over the job.  I'm somewhat
familiar with this.


Re: [OMPI users] NUMA interaction with Open MPI

2017-07-27 Thread Dave Love
Gilles Gouaillardet  writes:

> Adam,
>
> keep in mind that by default, recent Open MPI bind MPI tasks
> - to cores if -np 2
> - to NUMA domain otherwise

Not according to ompi_info from the latest release; it says socket.

> (which is a socket in most cases, unless
> you are running on a Xeon Phi)

[There have been multiple NUMA nodes per socket on x86 since Magny-Cours, and
it's also relevant for POWER.  That's a reason things had to switch to
hwloc from whatever the predecessor was called.]

> so unless you specifically asked mpirun to do a binding consistent
> with your needs, you might simply try to ask no binding at all
> mpirun --bind-to none ...

Why would you want to turn off core binding?  The resource manager is
likely to supply a binding anyhow if incomplete nodes are allocated.

> i am not sure whether you can direclty ask Open MPI to do the memory
> binding you expect from the command line.

You can't control memory binding as far as I can tell.  That's
specifically important on KNL, which was brought up here some time ago.


[OMPI users] absolute paths printed by info programs

2017-08-01 Thread Dave Love
ompi_info et al print absolute compiler paths for some reason.  What
would they ever be used for, and are they intended to refer to the OMPI
build or application building?  They're an issue for packaging in Guix,
at least.  Similarly, what's io_romio_complete_configure_params intended
to be used for?


[OMPI users] --enable-builtin-atomics

2017-08-01 Thread Dave Love
What are the pros and cons of configuring with --enable-builtin-atomics?
I haven't spotted any discussion of the option.


Re: [OMPI users] Questions about integration with resource distribution systems

2017-08-01 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
>
> unless you are doing direct launch (for example, use 'srun' instead of
> 'mpirun' under SLURM),
>
> this is the way Open MPI is working : mpirun will use whatever the
> resource manager provides
>
> in order to spawn the remote orted (tm with PBS, qrsh with SGE, srun
> with SLURM, ...).
>
>
> then mpirun/orted will fork&exec the MPI tasks.

I know quite well how SGE works with openmpi, which isn't special --
I've done enough work on it.  SGE tracks the process tree under orted
just like under bash, even if things daemonize.  The OP was correct.

I should qualify that by noting that ENABLE_ADDGRP_KILL has apparently
never propagated through remote startup, so killing those orphans after
VASP crashes may fail, though resource reporting works.  (I never
installed a fix for want of a test system, but it's not needed with
Linux cpusets.)


Re: [OMPI users] --enable-builtin-atomics

2017-08-02 Thread Dave Love
Nathan Hjelm  writes:

> So far only cons. The gcc and sync builtin atomic provide slower
> performance on x86-64 (and possible other platforms). I plan to
> investigate this as part of the investigation into requiring C11
> atomics from the C compiler.

Thanks.  Is that a gcc deficiency, or do the intrinsics just do
something different (more extensive)?


Re: [OMPI users] --enable-builtin-atomics

2017-08-02 Thread Dave Love
"Barrett, Brian via users"  writes:

> Well, if you’re trying to get Open MPI running on a platform for which
> we don’t have atomics support, built-in atomics solves a problem for
> you…

That's not an issue in this case, I think.  (I'd expect it to default to
intrinsic if extrinsic support is missing.)

Re: [OMPI users] Questions about integration with resource distribution systems

2017-08-02 Thread Dave Love
Reuti  writes:

>> I should qualify that by noting that ENABLE_ADDGRP_KILL has apparently
>> never propagated through remote startup,
>
> Isn't it a setting inside SGE which the sge_execd is aware of? I never
> exported any environment variable for this purpose.

Yes, but this is surely off-topic, even though
 mentions openmpi.


[OMPI users] built-in memchecker support

2017-08-24 Thread Dave Love
Apropos configuration parameters for packaging:

Is there a significant benefit to configuring built-in memchecker
support, rather than using the valgrind preload library?  I doubt being
able to use another PMPI tool directly at the same time counts.

Also, are there measurements of the performance impact of configuring,
but not using, it with recent hardware and software?  I don't know how
relevant the results in https://www.open-mpi.org/papers/parco-2007/
would be now, especially on a low-latency network.
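
For reference, what I'd be comparing is roughly the following (hedged --
paths and the exact configure flags per the FAQ, not verified here):

  # built-in memchecker: needs a debug build and Valgrind headers
  ./configure --enable-debug --enable-memchecker --with-valgrind=/usr ...

  # versus just running unmodified builds under Valgrind (with or
  # without its MPI wrapper/preload library)
  mpirun -np 2 valgrind ./app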


Re: [OMPI users] built-in memchecker support

2017-08-24 Thread Dave Love
Christoph Niethammer  writes:

> Hi Dave,
>
> The memchecker interface is an addition which allows other tools to be
> used as well.

Do you mean it allows other things to be hooked in other than through
PMPI?

> A more recent one is memPin [1].

Thanks, but Pin is proprietary, so it's no use as an alternative in this
case.


Re: [OMPI users] built-in memchecker support

2017-08-24 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
> the builtin memchecker can detect MPI usage errors such as modifying
> the buffer passed to MPI_Isend() before the request completes

OK, thanks.  The implementation looks rather different, and it's not
clear without checking the code in detail how it differs from the
preload library (which does claim to check at least some correctness) or
why that sort of check has to be built in.

> all the extra work is protected
> if ( running_under_valgrind() ) {
>extra_checks();
> }
>
> so if you are not running under valgrind, the overhead should be unnoticeable

Thanks.  Is there a good reason not to enable it by default, then?
(Apologies that I've just found and checked the FAQ entry, and it does
actually say that, in contradiction to the paper it references.  I
assume the implementation has changed since then.)

A deficiency of the preload library, I just realized, is that it says
it's only MPI-2.


Re: [OMPI users] Do MPI calls ever sleep?

2010-07-21 Thread Dave Goodell
On Jul 21, 2010, at 2:54 PM CDT, Jed Brown wrote:

> On Wed, 21 Jul 2010 15:20:24 -0400, David Ronis  wrote:
>> Hi Jed,
>> 
>> Thanks for the reply and suggestion.  I tried adding -mca
>> yield_when_idle 1 (and later mpi_yield_when_idle 1 which is what
>> ompi_info reports the variable as) but it seems to have had 0 effect.
>> My master goes into fftw planning routines for a minute or so (I see the
>> threads being created), but the overall usage of the slaves remains
>> close to 100% during this time.  Just to be sure, I put the slaves into
>> a MPI_Barrier(MPI_COMM_WORLD) while they were waiting for the fftw
>> planner to finish.   It also didn't help.
> 
> They still spin (instead of using e.g. select()), but call sched_yield()
> so should only be actively spinning when nothing else is trying to run.
> Are you sure that the planner is always running in parallel?  What OS
> and OMPI version are you using?

sched_yield doesn't work as expected in late 2.6 Linux kernels: 
http://kerneltrap.org/Linux/CFS_and_sched_yield

If this scheduling behavior change is affecting you, you might be able to fix 
it with:

echo "1" >/proc/sys/kernel/sched_compat_yield

-Dave




Re: [OMPI users] OpenMPI on the ARM processor architecture?

2010-09-22 Thread Dave Love
Jeff Squyres  writes:

> I believe that the first step would be to get some assembly for the
> ARM platform for some of OMPI's key routines (locks, atomics, etc.).
> Beyond that, it *might* "just work"...?

Is http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=579505
relevant/useful?



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-12 Thread Dave Love
Chris Jewell  writes:

> I've scrapped this system now in favour of the new SGE core binding feature.

How does that work, exactly?  I thought the OMPI SGE integration didn't
support core binding, but good if it does.



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-14 Thread Dave Love
Reuti  writes:

> With the default binding_instance set to "set" (the default) the
> shepherd should bind the processes to cores already. With other types
> of binding_instance these selected cores must be forward to the
> application via an environment variable or in the hostfile.

My question was specifically about SGE/OMPI tight integration; are you
actually doing binding successfully with that?  I think I read here that
the integration doesn't (yet?) deal with SGE core binding, and when we
turned on the SGE feature we got the OMPI tasks piled onto a single
core.  We quickly turned it off for MPI jobs when we realized what was
happening, and I didn't try to investigate further.

> As this is only a hint to SGE and not a hard request, the user must
> plan a little bit the allocation beforehand. Especially if you
> oversubscribe a machine it won't work. 

[It is documented that the binding isn't applied if the selected cores
are occupied.]



Re: [OMPI users] Hair depleting issue with Ompi143 and one program

2011-01-20 Thread Dave Goodell
I can't speak to what OMPI might be doing to your program, but I have a few 
suggestions for looking into the Valgrind issues.

Valgrind's "--track-origins=yes" option is usually helpful for figuring out 
where the uninitialized values came from.  However, if I understand you 
correctly and if you are correct in your assumption that _mm_setzero_ps is not 
actually zeroing your xEv variable for some reason, then this option will 
unhelpfully tell you that it was caused by a stack allocation at the entrance 
to the function where the variable is declared.  But it's worth turning on 
because it's easy to do and it might show you something obvious that you are 
missing.

The next thing you can do is disable optimization when building your code in 
case GCC is taking a shortcut that is either incorrect or just doesn't play 
nicely with Valgrind.  Valgrind might run pretty slow though, because -O0 code 
can be really verbose and slow to check.

After that, if you really want to dig in, you can try reading the assembly code 
that is generated for that _mm_setzero_ps line.  The easiest way is to pass 
"-save-temps" to gcc and it will keep a copy of "sourcefile.s" corresponding to 
"sourcefile.c".  Sometimes "-fverbose-asm" helps, sometimes it makes things 
harder to follow.

And the last semi-desperate step is to dig into what Valgrind thinks is going 
on.  You'll want to read up on how memcheck really works [1] before doing this. 
 Then read up on client requests [2,3].  You can then use the 
VALGRIND_GET_VBITS client request on your xEv variable in order to see which 
parts of the variable Valgrind thinks are undefined.  If the vbits don't match 
with what you expect, there's a chance that you might have found a bug in 
Valgrind itself.  It doesn't happen often, but the SSE code can be complicated 
and isn't exercised as often as the non-vector portions of Valgrind.
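
For what it's worth, a hypothetical sketch of that last step (the
function name is mine; check memcheck.h and the manual for the exact
semantics of the returned bits):

  #include <stdio.h>
  #include <xmmintrin.h>
  #include <valgrind/memcheck.h>

  static void dump_vbits_m128(__m128 xEv)
  {
      unsigned char vbits[sizeof xEv];
      unsigned i;

      /* Returns 1 on success when running under memcheck; vbits then
         records which parts of xEv memcheck considers undefined (see
         the memcheck manual for the bit encoding). */
      if (VALGRIND_GET_VBITS(&xEv, vbits, sizeof xEv) == 1) {
          for (i = 0; i < sizeof xEv; i++)
              printf("%02x ", vbits[i]);
          printf("\n");
      }
  }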

Good luck,
-Dave

[1] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.machine
[2] 
http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
[3] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs

On Jan 20, 2011, at 5:07 PM CST, David Mathog wrote:

> I have been working on slightly modifying a software package by Sean
> Eddy called Hmmer 3.  The hardware acceleration was originally SSE2 but
> since most of our compute nodes only have SSE1 and MMX I rewrote a few
> small sections to just use those instructions.  (And yes, as far as I
> can tell it invokes emms before any floating point operations are run
> after each MMX usage.)   On top of that each binary has 3 options for
> running the programs: single threaded, threaded, or MPI (using 
> Ompi143).  For all other programs in this package everything works
> everywhere.  For one called "jackhmmer" this table results (+=runs
> correctly, - = problems), where the exact same problem is run in each
> test (theoretically exercising exactly the same routines, just under
> different threading control):
> 
>   SSE2   SSE1 
> Single  +  +
> Threaded+  +
> Ompi143 +  -
> 
> The negative result for the SSE/Ompi143 combination happens whether the
> worker nodes are Athlon MP (SSE1 only) or Athlon64.  The test machine
> for the single and threaded runs is a two CPU Opteron 280 (4 cores
> total).  Ompi143 is 32 bit everywhere (local copies though).  There have
> been no modifications whatsoever made to the main jackhmmer.c file,
> which is where the various run methods are implemented.
> 
> Now if there was some intrinsic problem with my SSE1 code it should
> presumably manifest in both the Single and Threaded versions as well
> (the thread control is different, but they all feed through the same
> underlying functions), or in one of the other programs, which isn't
> seen.  Running under valgrind using Single or Threaded produces no
> warnings.  Using mpirun with valgrind on the SSE2 produces 3: two
> related to OMPI itself which are seen in every OMPI program run in
> valgrind, and one caused by an MPIsend operation where the buffer
> contains some uninitialized data (this is nothing toxic, just bytes in
> fixed length fields which which were never set because a shorter string
> is stored there). 
> 
> ==19802== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==19802==at 0x4C77AC1: writev (in /lib/libc-2.10.1.so)
> ==19802==by 0x8A069B5: mca_btl_tcp_frag_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802==by 0x8A0626E: mca_btl_tcp_endpoint_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802==by 0x8A01ADC: mca_btl_tcp_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802==by 0x7FA24A9: mca_pml_ob1_sen

[OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Dave Love
I'm trying to test some new nodes with ConnectX adaptors, and failing to
get (so far just) IMB to run on them.

The binary runs on the same cluster using TCP, or using PSM on some
other IB nodes.  A rebuilt PMB and various existing binaries work with
openib on the ConnectX nodes running it exactly the same way as IMB.
I.e. this seems to be something specific to IMB and openib.

It seems rather bizarre, and I have no idea how to debug it in the
absence of hints from a web search, i.e. why has it failed to attempt
the openib BTL in this case.  I can't get any openib-related information
using obvious MCA verbosity flags.  Can anyone make suggestions?

I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
nodes).  I'm not sure what else might be relevant.  The output from
trying to run IMB follows, for what it's worth.

  --
  At least one pair of MPI processes are unable to reach each other for
  MPI communications.  This means that no Open MPI device has indicated
  that it can be used to communicate between these processes.  This is
  an error; Open MPI requires that all MPI processes be able to reach
  each other.  This error can sometimes be the result of forgetting to
  specify the "self" BTL.

Process 1 ([[25307,1],2]) is on host: lvgig116
Process 2 ([[25307,1],12]) is on host: lvgig117
BTLs attempted: self sm

  Your MPI job is now going to abort; sorry.
  --
  --
  It looks like MPI_INIT failed for some reason; your parallel process is
  likely to abort.  There are many reasons that a parallel process can
  fail during MPI_INIT; some of which are due to configuration or environment
  problems.  This failure appears to be an internal failure; here's some
  additional information (which may only be relevant to an Open MPI
  developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
  --
  *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.
  [lvgig116:8052] Abort before MPI_INIT completed successfully; not able to 
guarantee that all other processes were killed!
  *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.

  ...

  [lvgig116:07931] 19 more processes have sent help message help-mca-bml-r2.txt 
/ unreachable proc
  [lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
  [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime / 
mpi_init:startup:internal-failure



[OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Dave Love
I've just tried 1.5.3 under SGE with tight integration, which seems to
be broken.  I built and ran in the same way as for 1.4.{1,3}, which
works, and ompi_info reports the same gridengine parameters for 1.5 as
for 1.4.

The symptoms are that it reports a failure to communicate using ssh,
whereas it should be using the SGE builtin method via qrsh.

There doesn't seem to be a relevant bug report, but before I
investigate, has anyone else succeeded/failed with it, or have any
hints?



Re: [OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Dave Love
Peter Kjellström  writes:

> Are you sure you launched it correctly and that you have (re)built OpenMPI 
> against your Redhat-5 ib stack?

Yes.  I had to rebuild because I'd omitted openib when we only needed
psm.  As I said, I did exactly the same thing successfully with PMB
(initially because I wanted to try an old binary, and PMB was lying
around).

>>   Your MPI job is now going to abort; sorry.
> ...
>>   [lvgig116:07931] 19 more processes have sent help message
>> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter
>
> Seems to me that OpenMPI gave up because it didn't succeed in initializing 
> any 
> inter-node btl/mtl.

Sure, but why won't it load the btl under IMB when it will under PMB
(and other codes like XHPL), and how do I get any diagnostics?

My boss has just stumbled upon a reference while looking for something
else.  It looks as if it's an OFED bug entry, but I can't find a
working OFED tracker or any reference to the bug other than
(the equivalent of)
http://lists.openfabrics.org/pipermail/ewg/2010-March/014983.html :

  1976  maj jsquyres at cisco.com   errors running IMB over 
openmpi-1.4.1

I guess Jeff will enlighten me if/when he spots this.  (Thanks in
advance, obviously.)



Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Dave Love
Terry Dontje  writes:

> Dave what version of Grid Engine are you using?

6.2u5, plus irrelevant patches.  It's fine with ompi 1.4.  (All I did to
switch was to load the 1.5.3 modules environment.)

> The plm checks for the following env-var's to determine if you are
> running Grid Engine.
> SGE_ROOT
> ARC
> PE_HOSTFILE
> JOB_ID
>
> If these are not there during the session that mpirun is executed then
> it will resort to ssh.

Sure.  What ras_gridengine_debug reported looked correct.  I'll try to
debug it.  At least I stand a reasonable chance with grid engine issues.



Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Dave Love
Ralph Castain  writes:

> Just looking at this for another question. Yes, SGE integration is broken in 
> 1.5. Looking at how to fix now.
>
> Meantime, you can get it work by adding "-mca plm ^rshd" to your mpirun cmd 
> line.

Thanks.  I'd forgotten about plm when checking, though I guess that
wouldn't have helped me.

Should rshd be mentioned in the release notes?



Re: [OMPI users] bizarre failure with IMB/openib

2011-03-22 Thread Dave Love
Dave Love  writes:

> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.

I suspect this is https://svn.open-mpi.org/trac/ompi/ticket/1919.  I'm
rather surprised it isn't an FAQ (actually frequently asked, not meaning
someone should have written it up).



Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-22 Thread Dave Love
Ralph Castain  writes:

>> Should rshd be mentioned in the release notes?
>
> Just starting the discussion on the best solution going forward. I'd
> rather not have to tell SGE users to add this to their cmd line. :-(

Sure.  I just thought a new component would normally be mentioned in the
notes.



Re: [OMPI users] Deadlock with mpi_init_thread + mpi_file_set_view

2011-04-04 Thread Dave Goodell
FWIW, we solved this problem with ROMIO in MPICH2 by making the "big global 
lock" a recursive mutex.  In the past it was implicitly so because of the way 
that recursive MPI calls were handled.  In current MPICH2 it's explicitly 
initialized with type PTHREAD_MUTEX_RECURSIVE instead.
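
(In pthreads terms that's just the following -- a minimal sketch with a
made-up name, not the actual MPICH2 code:

  #include <pthread.h>

  static pthread_mutex_t big_global_lock;

  static void init_big_global_lock(void)
  {
      pthread_mutexattr_t attr;

      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&big_global_lock, &attr);
      pthread_mutexattr_destroy(&attr);
  }

so re-entrant acquisitions, e.g. from ROMIO's delete hooks, don't
self-deadlock.)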

-Dave

On Apr 4, 2011, at 9:28 AM CDT, Ralph Castain wrote:

> 
> On Apr 4, 2011, at 8:18 AM, Rob Latham wrote:
> 
>> On Sat, Apr 02, 2011 at 04:59:34PM -0400, fa...@email.com wrote:
>>> 
>>> opal_mutex_lock(): Resource deadlock avoided
>>> #0  0x0012e416 in __kernel_vsyscall ()
>>> #1  0x01035941 in raise (sig=6) at 
>>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #2  0x01038e42 in abort () at abort.c:92
>>> #3  0x00d9da68 in ompi_attr_free_keyval (type=COMM_ATTR, key=0xbffda0e4, 
>>> predefined=0 '\000') at attribute/attribute.c:656
>>> #4  0x00dd8aa2 in PMPI_Keyval_free (keyval=0xbffda0e4) at pkeyval_free.c:52
>>> #5  0x01bf3e6a in ADIOI_End_call (comm=0xf1c0c0, keyval=10, 
>>> attribute_val=0x0, extra_state=0x0) at ad_end.c:82
>>> #6  0x00da01bb in ompi_attr_delete. (type=UNUSED_ATTR, object=0x6, 
>>> attr_hash=0x2c64, key=14285602, predefined=232 '\350', need_lock=128 
>>> '\200') at attribute/attribute.c:726
>>> #7  0x00d9fb22 in ompi_attr_delete_all (type=COMM_ATTR, object=0xf1c0c0, 
>>> attr_hash=0x8d0fee8) at attribute/attribute.c:1043
>>> #8  0x00dbda65 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:133
>>> #9  0x00dd12c2 in PMPI_Finalize () at pfinalize.c:46
>>> #10 0x00d6b515 in mpi_finalize_f (ierr=0xbffda2b8) at pfinalize_f.c:62
>> 
>> I guess I need some OpenMPI eyeballs on this...
>> 
>> ROMIO hooks into the attribute keyval deletion mechanism to clean up
>> the internal data structures it has allocated.  I suppose since this
>> is MPI_Finalize, we could just leave those internal data structures
>> alone and let the OS deal with it. 
>> 
>> What I see happening here is the OpenMPI finalize routine is deleting
>> attributes.   one of those attributes is ROMIO's, which in turn tries
>> to free keyvals.  Is the deadlock that noting "under" ompi_attr_delete
>> can itself call ompi_* routines? (as ROMIO triggers a call to
>> ompi_attr_free_keyval) ?
>> 
>> Here's where ROMIO sets up the keyval and the delete handler:
>> https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/romio/mpi-io/mpir-mpioinit.c#L39
>> 
>> that routine gets called upon any "MPI-IO entry point" (open, delete,
>> register-datarep).  The keyvals help ensure that ROMIO's internal
>> structures get initialized exactly once, and the delete hooks help us
>> be good citizens and clean up on exit. 
> 
> FWIW: his trace shows that OMPI incorrectly attempts to acquire a thread lock 
> that has already been locked. This occurs  in OMPI's attribute code, probably 
> surrounding the call to your code.
> 
> In other words, it looks to me like the problem is on our side, not yours. 
> Jeff is the one who generally handles the attribute code, though, so I'll 
> ping his eyeballs :-)
> 
> 
>> 
>> ==rob
>> 
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA




[OMPI users] using openib and psm together

2011-04-21 Thread Dave Love
We have an installation with both Mellanox and Qlogic IB adaptors (in
distinct islands), so I built open-mpi 1.4.3 with openib and psm
support.

Now I've just read this in the OFED source, but I can't see any relevant
issue in the open-mpi tracker:

  OpenMPI support
  ---
  It is recommended to use the OpenMPI v1.5 development branch. Prior versions
  of OpenMPI have an issue with support PSM network transports mixed with 
standard
  Verbs transport (BTL openib). This prevents an OpenMPI installation with
  network modules available for PSM and Verbs to work correctly on nodes with
  no QLogic IB hardware. This has been fixed in the latest development branch
  allowing a single OpenMPI installation to target IB hardware via PSM or Verbs
  as well as alternate transports seamlessly.

Do I definitely need 1.5 (and is 1.5.3 good enough?) to have openib and
psm working correctly?  Also what are the symptoms of it not working
correctly?



Re: [OMPI users] using openib and psm together

2011-04-26 Thread Dave Love
Jeff Squyres  writes:

> I believe it was mainly a startup issue -- there's a complicated
> sequence of events that happens during MPI_INIT.  IIRC, the issue was
> that if OMPI had software support for PSM, it assumed that the lack of
> PSM hardware was effectively an error.

Thanks.  For what it's worth, I'm not seeing the problem with 1.4.3 for
some reason.



Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-03 Thread Dave Love
Brock Palen  writes:

> We managed to have another user hit the bug that causes collectives (this 
> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>
> btl_openib_cpc_include rdmacm

Could someone explain this?  We also have problems with collective hangs
with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
see any relevant issues filed.  However, rdmacm isn't an available value
for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
that I understand what these things are...).



Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Dave Love
Jeff Squyres  writes:

> We had a user-reported issue of some hangs that the IB vendors have
> been unable to replicate in their respective labs.  We *suspect* that
> it may be an issue with the oob openib CPC, but that code is pretty
> old and pretty mature, so all of us would be at least somewhat
> surprised if that were the case.  If anyone can reliably reproduce
> this error, please let us know and/or give us access to your machines

We can reproduce it with IMB.  We could provide access, but we'd have to
negotiate with the owners of the relevant nodes to give you interactive
access to them.  Maybe Brock's would be more accessible?  (If you
contact me, I may not be able to respond for a few days.)

> -- we have not closed this issue,

Which issue?   I couldn't find a relevant-looking one.

> but are unable to move forward
> because the customers who reported this issue switched to rdmacm and
> moved on (i.e., we don't have access to their machines to test any
> more).

For what it's worth, I figured out why I couldn't see rdmacm, but adding
ipoib would be a bit of a pain.

-- 
Excuse the typping -- I have a broken wrist


Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Dave Love
Ralph Castain  writes:

> I'll go back to my earlier comments. Users always claim that their
> code doesn't have the sync issue, but it has proved to help more often
> than not, and costs nothing to try,

Could you point to that post, or tell us what to try exactly, given
we're running IMB?  Thanks.

(As far as I know, this isn't happening with real codes, just IMB, but
only a few have been in use.)

-- 
Excuse the typping -- I have a broken wrist


Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Dave Love
Jeff Squyres  writes:

> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>
>> We can reproduce it with IMB.  We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them.  Maybe Brock's would be more accessible?  (If you
>> contact me, I may not be able to respond for a few days.)
>
> Brock has replied off-list that he, too, is able to reliably reproduce the 
> issue with IMB, and is working to get access for us.  Many thanks for your 
> offer; let's see where Brock's access takes us.

Good.  Let me know if we could be useful

>>> -- we have not closed this issue,
>> 
>> Which issue?   I couldn't find a relevant-looking one.
>
> https://svn.open-mpi.org/trac/ompi/ticket/2714

Thanks.  In case it's useful info: it hangs for me with 1.5.3 & np=32 on
ConnectX with more than one collective, though I can't recall which.

-- 
Excuse the typping -- I have a broken wrist


