[OMPI users] change in behaviour 1.6 -> 1.8 under sge

2014-11-03 Thread Mark Dixon

Hi there,

We've started looking at moving to the openmpi 1.8 branch from 1.6 on our 
CentOS6/Son of Grid Engine cluster and noticed an unexpected difference 
when binding multiple cores to each rank.


Has openmpi's definition of 'slot' changed between 1.6 and 1.8? It used to 
mean ranks, but now it appears to mean processing elements (see Details, 
below).


Thanks,

Mark

PS Also, the man page for 1.8.3 reports that '--bysocket' is deprecated, 
but it doesn't seem to exist when we try to use it:


  mpirun: Error: unknown option "-bysocket"
  Type 'mpirun --help' for usage.

== Details ==

On 1.6.5, we launch with the following core binding options:

  mpirun --bind-to-core --cpus-per-proc <n>
  mpirun --bind-to-core --bysocket --cpus-per-proc <n>

  where <n> is calculated to maximise the number of cores available to
  use - I guess effectively
  max(1, int(number of cores per node / slots per node requested)).

  openmpi reads the file $PE_HOSTFILE and launches a rank for each slot
  defined in it, binding <n> cores per rank.
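
  For reference, a rough sketch of how <n> could be computed in a job script
  (this assumes a homogeneous allocation, that SGE has set $PE_HOSTFILE, and
  that ./my_prog stands in for the real binary):

    cores_per_node=$(nproc)
    slots_per_node=$(awk 'NR==1 {print $2}' "$PE_HOSTFILE")
    n=$(( cores_per_node / slots_per_node ))
    [ "$n" -lt 1 ] && n=1
    mpirun --bind-to-core --cpus-per-proc "$n" ./my_prog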

On 1.8.3, we've tried launching with the following core binding options 
(which we hoped were equivalent):


  mpirun -map-by node:PE=<n>
  mpirun -map-by socket:PE=<n>

  openmpi reads the file $PE_HOSTFILE and launches a factor of <n> fewer
  ranks than under 1.6.5. We also notice that, where we wanted a single
  rank on the box and <n> is the number of cores available, openmpi
  refuses to launch and we get the message:

  "There are not enough slots available in the system to satisfy the 1
  slots that were requested by the application"

  I think that error message needs a little work :)


[OMPI users] Startup limited to 128 remote hosts in some situations?

2017-01-17 Thread Mark Dixon

Hi,

While commissioning a new cluster, I wanted to run HPL across the whole 
thing using openmpi 2.0.1.


I couldn't get it to start on more than 129 hosts under Son of Gridengine 
(128 remote plus the localhost running the mpirun command). openmpi would 
sit there, waiting for all the orted's to check in; however, there were 
"only" a maximum of 128 qrsh processes, therefore a maximum of 128 
orted's, therefore waiting a long time.


Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job 
to launch.
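
For what it's worth, this is roughly the workaround (the 256 is just an
example value and ./xhpl stands in for the real binary):

  mpirun --mca plm_rsh_num_concurrent 256 ./xhpl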


Is this intentional, please?

Doesn't openmpi use a tree-like startup sometimes - any particular reason 
it's not using it here?


Cheers,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Startup limited to 128 remote hosts in some situations?

2017-01-24 Thread Mark Dixon

Hi,

It works for me :)

Thanks!

Mark

On Fri, 20 Jan 2017, r...@open-mpi.org wrote:


Well, it appears we are already forwarding all envars, which should include 
PATH. Here is the qrsh command line we use:

“qrsh --inherit --nostdin -V"

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
         }
         mca_plm_rsh_component.agent = tmp;
         mca_plm_rsh_component.using_qrsh = true;
-        /* no tree spawn allowed under qrsh */
-        mca_plm_rsh_component.no_tree_spawn = true;
         goto success;
     } else if (!mca_plm_rsh_component.disable_llspawn &&
                NULL != getenv("LOADL_STEP_ID")) {



On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:

I’ll create a patch that you can try - if it works okay, we can commit it


On Jan 18, 2017, at 3:29 AM, William Hay  wrote:

On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:

As I recall, the problem was that qrsh isn't available on the backend compute 
nodes, and so we can't use a tree for launch. If that isn't true, then we 
can certainly adjust it.

qrsh should be available on all nodes of a SoGE cluster but, depending on how things are set up, may not be 
findable (i.e. not in the PATH) when you qrsh -inherit into a node.  A workaround would be to start backend 
processes with qrsh -inherit -v PATH, which will copy the PATH from the master node to the slave node 
process, or otherwise pass the location of qrsh from one node to another.  That of course assumes that 
qrsh is in the same location on all nodes.


I've tested that it is possible to qrsh from the head node of a job to a slave 
node and then on to another slave node by this method.
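
Concretely, the workaround amounts to something like the following (where
<slavehost> and <command> are placeholders):

  qrsh -inherit -v PATH <slavehost> <command>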

William



On Jan 17, 2017, at 9:37 AM, Mark Dixon  wrote:

Hi,

While commissioning a new cluster, I wanted to run HPL across the whole thing 
using openmpi 2.0.1.

I couldn't get it to start on more than 129 hosts under Son of Gridengine (128 remote 
plus the localhost running the mpirun command). openmpi would sit there, waiting for all 
the orted's to check in; however, there were "only" a maximum of 128 qrsh 
processes, therefore a maximum of 128 orted's, therefore waiting a long time.

Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to 
launch.

Is this intentional, please?

Doesn't openmpi use a tree-like startup sometimes - any particular reason it's 
not using it here?



--
---
Mark Dixon Email: m.c.di...@leeds.ac.uk
Advanced Research Computing (ARC)  Tel (int): 35429
IT Services building   Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
---
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Mark Dixon

Hi,

Just tried upgrading from 2.0.1 to 2.0.2 and I'm getting error messages 
that look like openmpi is using ssh to login to remote nodes instead of 
qrsh (see below). Has anyone else noticed gridengine integration being 
broken, or am I being dumb?


I built with "./configure 
--prefix=/apps/developers/libraries/openmpi/2.0.2/1/intel-17.0.1 
--with-sge --with-io-romio-flags=--with-file-system=lustre+ufs 
--enable-mpi-cxx --with-cma"


Can see the gridengine component via:

$ ompi_info -a | grep gridengine
 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
  MCA ras gridengine: ---
  MCA ras gridengine: parameter "ras_gridengine_priority" (current value: 
"100", data source: default, level: 9 dev/all, type: int)
  Priority of the gridengine ras component
  MCA ras gridengine: parameter "ras_gridengine_verbose" (current value: 
"0", data source: default, level: 9 dev/all, type: int)
  Enable verbose output for the gridengine ras component
  MCA ras gridengine: parameter "ras_gridengine_show_jobid" (current value: 
"false", data source: default, level: 9 dev/all, type: bool)

Cheers,

Mark

ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Mark Dixon

On Fri, 3 Feb 2017, Reuti wrote:
...
SGE on its own is not configured to use SSH? (I mean the entries in 
`qconf -sconf` for rsh_command resp. daemon).

...

Nope, everything left as the default:

$ qconf -sconf | grep _command
qlogin_command   builtin
rlogin_command   builtin
rsh_command  builtin

I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2 
isn't.


I'll start digging, but I'd appreciate hearing from any other SGE users who 
have tried 2.0.2 and can tell me whether it worked for them, please? :)


Cheers,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-06 Thread Mark Dixon

On Fri, 3 Feb 2017, r...@open-mpi.org wrote:

I do see a diff between 2.0.1 and 2.0.2 that might have a related 
impact. The way we handled the MCA param that specifies the launch agent 
(ssh, rsh, or whatever) was modified, and I don’t think the change is 
correct. It basically says that we don’t look for qrsh unless the MCA 
param has been changed from the coded default, which means we are not 
detecting SGE by default.


Try setting "-mca plm_rsh_agent foo" on your cmd line - that will get 
past the test, and then we should auto-detect SGE again

...

Ah-ha! "-mca plm_rsh_agent foo" fixes it!

Thanks very much - presumably I can stick that in the system-wide 
openmpi-mca-params.conf for now.
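
i.e. a line like the following, assuming the file lives under $prefix/etc as
usual:

  plm_rsh_agent = foo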


Cheers,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-06 Thread Mark Dixon

On Mon, 6 Feb 2017, Mark Dixon wrote:
...

Ah-ha! "-mca plm_rsh_agent foo" fixes it!

Thanks very much - presumably I can stick that in the system-wide 
openmpi-mca-params.conf for now.

...

Except if I do that, it means running ompi outside of the SGE environment 
no longer works :(


Should I just revert the following commit?

Cheers,

Mark

commit d51c2af76b0c011177aca8e08a5a5fcf9f5e67db
Author: Jeff Squyres 
Date:   Tue Aug 16 06:58:20 2016 -0500

rsh: robustify the check for plm_rsh_agent default value

Don't strcmp against the default value -- the default value may change
over time.  Instead, check to see if the MCA var source is not
DEFAULT.

Signed-off-by: Jeff Squyres 

(cherry picked from commit 
open-mpi/ompi@71ec5cfb436977ea9ad409ba634d27e6addf6fae)
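
If it comes to that, I guess it would just be something like the following in
a git checkout - untested:

  git revert d51c2af76b0c011177aca8e08a5a5fcf9f5e67db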

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] "-map-by socket:PE=1" doesn't do what I expect

2017-02-15 Thread Mark Dixon

Hi,

When combining OpenMPI 2.0.2 with OpenMP, I'm interested in launching a 
number of ranks and allocating a number of cores to each rank. Using 
"-map-by socket:PE=", switching to "-map-by node:PE=" if I want 
to allocate more than a single socket to a rank, seems to do what I want.


Except for "-map-by socket:PE=1". That seems to allocate an entire socket 
to each rank instead of a single core. Here's the output of a test program 
on a dual socket non-hyperthreading system that reports rank core bindings 
(odd cores on one socket, even on the other):


   $ mpirun -np 2 -map-by socket:PE=1 ./report_binding
   Rank 0 bound somehost.somewhere:  0 2 4 6 8 10 12 14 16 18 20 22
   Rank 1 bound somehost.somewhere:  1 3 5 7 9 11 13 15 17 19 21 23

   $ mpirun -np 2 -map-by socket:PE=2 ./report_binding
   Rank 0 bound somehost.somewhere:  0 2
   Rank 1 bound somehost.somewhere:  1 3

   $ mpirun -np 2 -map-by socket:PE=3 ./report_binding
   Rank 0 bound somehost.somewhere:  0 2 4
   Rank 1 bound somehost.somewhere:  1 3 5

   $ mpirun -np 2 -map-by socket:PE=4 ./report_binding
   Rank 0 bound somehost.somewhere:  0 2 4 6
   Rank 1 bound somehost.somewhere:  1 3 5 7

I get the same result if I change "socket" to "numa". Changing "socket" to 
either "core", "node" or "slot" binds each rank to a single core (good), 
but doesn't round-robin ranks across sockets like "socket" does (bad).


Is "-map-by socket:PE=1" doing the right thing, please? I tried reading 
the man page but I couldn't work out what the expected behaviour is :o


Cheers,

Mark

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] "-map-by socket:PE=1" doesn't do what I expect

2017-02-15 Thread Mark Dixon

On Wed, 15 Feb 2017, r...@open-mpi.org wrote:

Ah, yes - I know what the problem is. We weren’t expecting a PE value of 
1 - the logic is looking expressly for values > 1 as we hadn’t 
anticipated this use-case.


Is it a sensible use-case, or am I crazy?

I can make that change. I’m off to a workshop for the next day or so, 
but can probably do this on the plane.


You're a star - thanks :)

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Is building with "--enable-mpi-thread-multiple" recommended?

2017-02-17 Thread Mark Dixon

Hi,

We have some users who would like to try out openmpi MPI_THREAD_MULTIPLE 
support on our InfiniBand cluster. I am wondering if we should enable it 
on our production cluster-wide version, or install it as a separate "here 
be dragons" copy.


I seem to recall openmpi folk cautioning that MPI_THREAD_MULTIPLE support 
was pretty crazy and that enabling it could have problems for 
non-MPI_THREAD_MULTIPLE codes (never mind codes that explicitly used it), 
so such an install shouldn't be used unless for codes that actually need 
it.


Is that still the case, please?

Thanks,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] "-map-by socket:PE=1" doesn't do what I expect

2017-02-17 Thread Mark Dixon

On Fri, 17 Feb 2017, r...@open-mpi.org wrote:

Mark - this is now available in master. Will look at what might be 
required to bring it to 2.0


Thanks Ralph,

To be honest, since you've given me an alternative, there's no rush from 
my point of view.


The logic's embedded in a script and it's been taught "--map-by socket 
--bind-to core" for the special case of 1. It'd be nice to get rid of it 
at some point, but there's no problem waiting for the next stable branch 
:)
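
For the record, the special-casing in the script amounts to something like
this simplified sketch ($ppn is the cores wanted per rank, $nranks the number
of ranks, and ./my_prog stands in for the real binary):

  if [ "$ppn" -eq 1 ]; then
    bind_opts="--map-by socket --bind-to core"
  else
    bind_opts="--map-by socket:PE=$ppn --bind-to core"
  fi
  mpirun $bind_opts -np "$nranks" ./my_prog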


Cheers,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Is building with "--enable-mpi-thread-multiple" recommended?

2017-02-18 Thread Mark Dixon

On Fri, 17 Feb 2017, r...@open-mpi.org wrote:

Depends on the version, but if you are using something in the v2.x 
range, you should be okay with just one installed version


Thanks Ralph.

How good is MPI_THREAD_MULTIPLE support these days and how far up the 
wishlist is it, please?


We don't get many openmpi-specific queries from users but, other than core 
binding, it seems to be the thing we get asked about the most (I normally 
point those people at mvapich2 or intelmpi instead).


Cheers,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] More confusion about --map-by!

2017-02-23 Thread Mark Dixon

Hi,

I'm still trying to figure out how to express the core binding I want to 
openmpi 2.x via the --map-by option. Can anyone help, please?


I bet I'm being dumb, but it's proving tricky to achieve the following 
aims (most important first):


1) Maximise memory bandwidth usage (e.g. load balance ranks across
   processor sockets)
2) Optimise for nearest-neighbour comms (in MPI_COMM_WORLD) (e.g. put
   neighbouring ranks on the same socket)
3) Have an incantation that's simple to change based on number of ranks
   and cores per rank I want.

Example:

Considering a 2 socket, 12 cores/socket box and a program with 2 threads 
per rank...


... this is great if I fully-populate the node:

$ mpirun -np 12 -map-by slot:PE=2 --bind-to core --report-bindings ./prog
[somehost:101235] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B/./././././././././.][./././././././././././.]
[somehost:101235] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 
3[hwt 0]]: [././B/B/./././././././.][./././././././././././.]
[somehost:101235] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 
5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
[somehost:101235] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 
7[hwt 0]]: [././././././B/B/./././.][./././././././././././.]
[somehost:101235] MCW rank 4 bound to socket 0[core 8[hwt 0]], socket 0[core 
9[hwt 0]]: [././././././././B/B/./.][./././././././././././.]
[somehost:101235] MCW rank 5 bound to socket 0[core 10[hwt 0]], socket 0[core 
11[hwt 0]]: [././././././././././B/B][./././././././././././.]
[somehost:101235] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 
13[hwt 0]]: [./././././././././././.][B/B/./././././././././.]
[somehost:101235] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 
15[hwt 0]]: [./././././././././././.][././B/B/./././././././.]
[somehost:101235] MCW rank 8 bound to socket 1[core 16[hwt 0]], socket 1[core 
17[hwt 0]]: [./././././././././././.][././././B/B/./././././.]
[somehost:101235] MCW rank 9 bound to socket 1[core 18[hwt 0]], socket 1[core 
19[hwt 0]]: [./././././././././././.][././././././B/B/./././.]
[somehost:101235] MCW rank 10 bound to socket 1[core 20[hwt 0]], socket 1[core 
21[hwt 0]]: [./././././././././././.][././././././././B/B/./.]
[somehost:101235] MCW rank 11 bound to socket 1[core 22[hwt 0]], socket 1[core 
23[hwt 0]]: [./././././././././././.][././././././././././B/B]


... but not if I don't [fails aim (1)]:

$ mpirun -np 6 -map-by slot:PE=2 --bind-to core --report-bindings ./prog
[somehost:102035] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B/./././././././././.][./././././././././././.]
[somehost:102035] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 
3[hwt 0]]: [././B/B/./././././././.][./././././././././././.]
[somehost:102035] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 
5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
[somehost:102035] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 
7[hwt 0]]: [././././././B/B/./././.][./././././././././././.]
[somehost:102035] MCW rank 4 bound to socket 0[core 8[hwt 0]], socket 0[core 
9[hwt 0]]: [././././././././B/B/./.][./././././././././././.]
[somehost:102035] MCW rank 5 bound to socket 0[core 10[hwt 0]], socket 0[core 
11[hwt 0]]: [././././././././././B/B][./././././././././././.]


... whereas if I map by socket instead of slot, I achieve aim (1) but 
fail on aim (2):


$ mpirun -np 6 -map-by socket:PE=2 --bind-to core --report-bindings ./prog
[somehost:105601] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B/./././././././././.][./././././././././././.]
[somehost:105601] MCW rank 1 bound to socket 1[core 12[hwt 0]], socket 1[core 
13[hwt 0]]: [./././././././././././.][B/B/./././././././././.]
[somehost:105601] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 
3[hwt 0]]: [././B/B/./././././././.][./././././././././././.]
[somehost:105601] MCW rank 3 bound to socket 1[core 14[hwt 0]], socket 1[core 
15[hwt 0]]: [./././././././././././.][././B/B/./././././././.]
[somehost:105601] MCW rank 4 bound to socket 0[core 4[hwt 0]], socket 0[core 
5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
[somehost:105601] MCW rank 5 bound to socket 1[core 16[hwt 0]], socket 1[core 
17[hwt 0]]: [./././././././././././.][././././B/B/./././././.]


Any ideas, please?

Thanks,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] More confusion about --map-by!

2017-02-23 Thread Mark Dixon

--
-------
Mark Dixon Email: m.c.di...@leeds.ac.uk
Advanced Research Computing (ARC)  Tel (int): 35429
IT Services building   Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
---
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Is building with "--enable-mpi-thread-multiple" recommended?

2017-03-03 Thread Mark Dixon

On Fri, 3 Mar 2017, Paul Kapinos wrote:
...
Note that on the 1.10.x series (even on 1.10.6), enabling 
MPI_THREAD_MULTIPLE leads to a (silent) shutdown of the InfiniBand 
fabric for that application => SLOW!


2.x versions (tested: 2.0.1) handle MPI_THREAD_MULTIPLE on InfiniBand 
the right way; however, due to the absence of memory hooks (= not aligned 
memory allocation) we get 20% less bandwidth on IB with 2.x versions 
compared to 1.10.x versions of Open MPI (regardless of whether 
MPI_THREAD_MULTIPLE support is enabled).


On the Intel OmniPath network both of the above issues seem to be absent, but 
due to a performance bug in MPI_Free_mem your application can be 
horribly slow (seen: CP2K) if the InfiniBand fallback of OPA is not 
disabled manually, see 
https://www.mail-archive.com/users@lists.open-mpi.org//msg30593.html

...

Hi Paul,

All very useful - thanks :)

Our (limited) testing seems to show no difference on 2.x with 
MPI_THREAD_MULTIPLE enabled vs. disabled as well, which is good news. Glad 
to hear another opinion.


Your 20% memory bandwidth performance hit on 2.x and the OPA problem are 
concerning - will look at that. Are there tickets open for them?


Cheers,

Mark
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] MPI-IO Lustre driver update?

2010-11-29 Thread Mark Dixon

Hi,

I notice that there's been quite a bit of work recently on ROMIO's Lustre 
driver. As far as I can see from openmpi's SVN, this doesn't seem to have 
landed there yet (README indicates V04, yet V05 is in MPICH2 and 
MVAPICH2).


Is there a timescale for when this will be make it into a release, please?

Thanks,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


Re: [OMPI users] MPI-IO Lustre driver update?

2010-11-29 Thread Mark Dixon

On Mon, 29 Nov 2010, Jeff Squyres wrote:

There's work going on right now to update the ROMIO in the OMPI v1.5 
series.  We hope to include it in v1.5.2.


Cheers Jeff :)

Mark
--
-
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


[OMPI users] configure: mpi-threads disabled by default

2011-05-04 Thread Mark Dixon
I've been asked about mixed-mode MPI/OpenMP programming with OpenMPI, so 
have been digging through the past list messages on MPI_THREAD_*, etc. 
Interesting stuff :)


Before I go ahead and add "--enable-mpi-threads" to our standard configure 
flags, is there any reason it's disabled by default, please?


I'm a bit puzzled, as this default seems in conflict with the whole "Law of 
Least Astonishment" thing. Have I missed some disaster that's going to 
happen?


Thanks,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


Re: [OMPI users] configure: mpi-threads disabled by default

2011-05-05 Thread Mark Dixon

On Wed, 4 May 2011, Eugene Loh wrote:


Depending on what version you use, the option has been renamed
--enable-mpi-thread-multiple.

Anyhow, there is widespread concern whether the support is robust.  The
support is known to be limited and the performance poor.


Thanks :)

I absolutely see why support for MPI_THREAD_MULTIPLE is a configure option 
(not at all related to the fact I'm on a platform where my best 
interconnect gets disabled if you ask for it).


However, do the same concerns apply to MPI_THREAD_FUNNELED and 
MPI_THREAD_SERIALIZED? They are disabled by default too and they look 
difficult to enable without enabling some other functionality.



Details:

Jeff said (on this list, Tue, 14 Dec 2010 22:52:40) that, in OpenMPI, 
there's no difference between MPI_THREAD_SINGLE and MPI_THREAD_FUNNELED. 
Yet, with the default configure options, MPI_Init_thread will always 
return MPI_THREAD_SINGLE.


* Release 1.4.3 - MPI_THREAD_(FUNNELED|SERIALIZED) are only available if 
you specify "--mpi-threads". Codes that sensibly negotiate their thread 
level automatically start using MPI_THREAD_MULTIPLE and my interconnect 
(openib) is disabled.


* Release 1.5.3 - MPI_THREAD_(FUNNELED|SERIALIZED) are only available if 
you specify "--mpi-threads" (same problems as with 1.4.3), or enable 
asynchronous communication progress (whatever that is - but it sounds 
scary) with "--enable-progress-threads"


Things do look different again in trunk, but seem to require you to at 
least ask for "--enable-opal-multi-threads".



Are we supposed to be able to use MPI_THREAD_FUNNELED by default or not?

Best wishes,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


[OMPI users] Mellanox MLX4_EVENT_TYPE_SRQ_LIMIT kernel messages

2012-09-28 Thread Mark Dixon

Hi,

We've been putting a new Mellanox QDR Intel Sandy Bridge cluster, based on 
CentOS 6.3, through its paces and we're getting repeated kernel messages 
we never used to get on CentOS 5. An example on one node:


Sep 28 09:58:20 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:27 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:27 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:29 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:29 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:31 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:31 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:32 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:45 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:45 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 10:08:23 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: 
MLX4_EVENT_TYPE_SRQ_LIMIT

These messages appeared when running IMB compiled with openmpi 1.6.1 
across 256 cores (16 nodes, 16 cores per node). The job ran from 
09:56:54 to 10:08:46 and failed with no obvious error messages.


Now, I'm used to IMB running into trouble at larger core counts, but I'm 
wondering if anyone has seen these messages before and know if they 
indicate a problem?


We're running with an increased log_num_mtt mlx4_core option as 
recommended by the openmpi FAQ and increased log_num_srq to its maximum 
value in a failed attempt to get rid of the messages:


$ cat /etc/modprobe.d/libmlx4_local.conf
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3 log_num_srq=20

Any thoughts?

Thanks,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


[OMPI users] knem/openmpi performance?

2013-07-12 Thread Mark Dixon

Hi,

I'm taking a look at knem, to see if it improves the performance of any 
applications on our QDR InfiniBand cluster, so I'm eager to hear about 
other people's experiences. This doesn't appear to have been discussed on 
this list before.


I appreciate that any affect that knem will have is entirely dependent on 
the application, scale and input data, but:


* Does anyone know of any examples of popular software packages that 
benefit particularly from the knem support in openmpi?


* Has anyone noticed any downsides to using knem?

Thanks,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


Re: [OMPI users] knem/openmpi performance?

2013-07-12 Thread Mark Dixon

On Fri, 12 Jul 2013, Jeff Squyres (jsquyres) wrote:
...
In short: doing 1 memcopy consumes half the memory bandwidth of 2 mem 
copies.  So when you have lots of MPI processes competing for memory 
bandwidth, it turns out that having each MPI process use half the 
bandwidth is a Really Good Idea.  :-)  This allows more MPI processes to 
do shared memory communications before you hit the memory bandwidth 
bottleneck.


Hi Jeff,

Lots of useful detail in there - thanks. We have plenty of memory-bound 
applications in use, so hopefully there's some good news in this.


I was hoping that someone might have some examples of real application 
behaviour rather than micro benchmarks. It can be crazy hard to get that 
information from users.


Unusually for us, we're putting in a second cluster with the same 
architecture, CPUs, memory and OS as the last one. I might be able to use 
this as a bigger stick to get some better feedback. If so, I'll pass it 
on.


Darius Buntinas, Brice Goglin, et al. wrote an excellent paper about 
exactly this set of issues; see http://runtime.bordeaux.inria.fr/knem/. 

...

I'll definitely take a look - thanks again.

All the best,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


Re: [OMPI users] knem/openmpi performance?

2013-07-17 Thread Mark Dixon

On Mon, 15 Jul 2013, Elken, Tom wrote:
...

Hope these anecdotes are relevant to Open MPI users considering knem.

...

Brilliantly useful, thanks! It certainly looks like it may be greatly 
significant for some applications. Worth looking into.


All the best,

Mark
--
-
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


Re: [OMPI users] knem/openmpi performance?

2013-07-29 Thread Mark Dixon

On Thu, 18 Jul 2013, Iliev, Hristo wrote:
...
Detailed results are coming in the near future, but the benchmarks done 

...

Hi Hristo,

Very interesting, thanks for sharing! Will be very interested to read your 
official results when you publish :)


All the best,

Mark
--
-
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


[OMPI users] Open MPI unable to find threading support for PGI or Sun Studio

2008-07-28 Thread Mark Dixon

Hi,

I've been attempting to build Open MPI 1.2.6 using a variety of compilers 
including, but not limited to, PGI 7.1-6 and Sun Studio 12 (200709) on a 
CentOS 5.2 32-bit Intel box.


Building against either of the above compilers results in the following 
message produced by configure:




Open MPI was unable to find threading support on your system.  In the
near future, the OMPI development team is considering requiring
threading support for proper OMPI execution.  This is in part because
we are not aware of any users that do not have thread support - so we
need you to e-mail us at o...@ompi-mpi.org and let us know about this
problem.



I don't see this when building against the Intel 10.1.015 or GNU GCC 4.1.2 
compilers. I cannot see any answer to this in the FAQ or list archives.


I've attached files showing the output of configure and my environment to 
this message.


Is this expected?

Thanks,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-

sunstudio-build.txt.bz2
Description: Binary data


pgi-build.txt.bz2
Description: Binary data


Re: [OMPI users] Open MPI unable to find threading support for PGI or Sun Studio

2008-07-29 Thread Mark Dixon

On Mon, 28 Jul 2008, Jeff Squyres wrote:

FWIW: I compile with PGI 7.1.4 regularly on RHEL4U4 and don't see this 
problem.  It would be interesting to see the config.log's from these builds 
to see the actual details of what went wrong.


Thanks Jeff: it's good to know it's just me ;)

Following your message, I've tried building with PGI on a few systems:

Compiler       OS                   Result
=============  ===================  ========================================
32-bit 7.1.6   CentOS 5.2 (32-bit)  no threading
32-bit 7.1.4   CentOS 5.2 (32-bit)  no threading  **config.log attached**
32-bit 7.1.4   RHEL4u6 (64-bit)     works!
32-bit 7.1.4   CentOS 5.1 (64-bit)  no threading

Each time it fails, it's because of "__builtin_expect" being undefined for 
pgcc and pgf77 (works for pgcpp) - or any of the Sun Studio compilers. 
Could this be a glibc 2.3 (RHEL4) vs. 2.5 (CentOS5) issue?


I've attached just the PGI config.log for now (I don't want to blow the 
100Kb posting limit), but the relevant sections from each appear to be:


PGI:

  configure:49065: checking if C compiler and POSIX threads work with -lpthread
  configure:49121: pgcc -o conftest -O -DNDEBUG   -D_REENTRANT  conftest.c -lnsl 
-lutil  -lpthread >&5
  conftest.c:
  conftest.o: In function `main':
  conftest.c:(.text+0x98): undefined reference to `__builtin_expect'

  configure:49272: checking if C++ compiler and POSIX threads work with 
-lpthread
  configure:49328: pgcpp -o conftest -O -DNDEBUGconftest.cpp -lnsl -lutil  
-lpthread >&5
  conftest.cpp:
  (skipped some non-fatal warning messages here)

  configure:49572: checking if F77 compiler and POSIX threads work with 
-lpthread
  configure:49654: pgcc -O -DNDEBUG  -I. -c conftest.c
  configure:49661: $? = 0
  configure:49671: pgf77  conftestf.f conftest.o -o conftest  -lnsl -lutil  
-lpthread
  conftestf.f:
  conftest.o: In function `pthreadtest_':
  conftest.c:(.text+0x92): undefined reference to `__builtin_expect'

Sun:

  configure:49065: checking if C compiler and POSIX threads work with -lpthread
  configure:49121: cc -o conftest -O -DNDEBUG   -D_REENTRANT  conftest.c -lnsl -lutil  
-lm -lpthread >&5
  "conftest.c", line 305: warning: can not set non-default alignment for 
automatic variable
  "conftest.c", line 305: warning: implicit function declaration: 
__builtin_expect
  conftest.o: In function `main':
  conftest.c:(.text+0x35): undefined reference to `__builtin_expect'

  configure:49272: checking if C++ compiler and POSIX threads work with 
-lpthread
  configure:49328: CC -o conftest -O -DNDEBUGconftest.cpp -lnsl -lutil  -lm 
-lpthread >&5
  "conftest.cpp", line 305: Error: The function "__builtin_expect" must have a 
prototype.
  1 Error(s) detected.

  configure:49572: checking if F77 compiler and POSIX threads work with 
-lpthread
  configure:49654: cc -O -DNDEBUG  -I. -c conftest.c
  "conftest.c", line 15: warning: can not set non-default alignment for 
automatic variable
  "conftest.c", line 15: warning: implicit function declaration: 
__builtin_expect
  configure:49661: $? = 0
  configure:49671: f77  conftestf.f conftest.o -o conftest  -lnsl -lutil  -lm 
-lpthread
  NOTICE: Invoking /apps/compilers/sunstudio/12_200709/1/sunstudio12/bin/f90 
-f77 -ftrap=%none conftestf.f conftest.o -o conftest -lnsl -lutil -lm -lpthread
  conftestf.f:
   MAIN fpthread:
  conftest.o: In function `pthreadtest_':
  conftest.c:(.text+0x41): undefined reference to `__builtin_expect'

Any ideas?

Cheers,

Mark
--
-
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-

config.log.bz2
Description: Binary data


Re: [OMPI users] Open MPI unable to find threading support for PGI or Sun Studio

2008-07-29 Thread Mark Dixon

On Tue, 29 Jul 2008, Jeff Squyres wrote:
...
I suggest that you bring this issue up with PGI support; they're fairly 
responsive on their web forums.

...

Will do: thanks for giving this a look, you've been really helpful.

Cheers,

Mark
--
-
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


Re: [OMPI users] Open MPI unable to find threading support for PGI or Sun Studio

2008-08-01 Thread Mark Dixon

On Tue, 29 Jul 2008, Jeff Squyres wrote:


On Jul 29, 2008, at 6:52 AM, Mark Dixon wrote:

FWIW: I compile with PGI 7.1.4 regularly on RHEL4U4 and don't see this 
problem.  It would be interesting to see the config.log's from these 
builds to see the actual details of what went wrong.

...

Compiler  OS   Result
  ===  
32-bit 7.1.6  CentOS 5.2 (32-bit)  no threading
32-bit 7.1.4  CentOS 5.2 (32-bit)  no threading  **config.log attached**
32-bit 7.1.4  RHEL4u6 (64-bit) works!
32-bit 7.1.4  CentOS 5.1 (64-bit)  no threading

...
I'm afraid this one is out of my bailiwick -- I don't know.  Looking through 
your config.log file, it does look like this lack of __builtin_expect is the 
killer.  FWIW, here's my configure output when I run with pgcc v7.1.4:

...
I suggest that you bring this issue up with PGI support; they're fairly 
responsive on their web forums.

...

In case anyone's interested, the fix is to upgrade to at least PGI 7.2-2.

It seems that there was a change to glibc between RHEL4 and RHEL5 (2.3 vs. 
2.5) which requires __builtin_expect to be defined when using certain 
pthread library functions.


This also appears to be a problem for the Sun Studio 12 compiler (bug id 
6603861), but it would seem that Sun's not in a hurry to fix it.


Thanks for your time,

Mark
--
-----
Mark Dixon   Email: m.c.di...@leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-


[OMPI users] Failed to register memory (openmpi 2.0.2)

2017-10-18 Thread Mark Dixon

Hi,

We're intermittently seeing messages (below) about failing to register 
memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the 
vanilla IB stack as shipped by centos.


We're not using any mlx4_core module tweaks at the moment. On earlier 
machines we used to set registered memory as per the FAQ, but neither 
log_num_mtt nor num_mtt seem to exist these days (according to 
/sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to 
follow the FAQ.
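
For reference, this is roughly how we checked which parameters exist:

  grep -H . /sys/module/mlx4_*/parameters/* 2>/dev/null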


The output of 'ulimit -l' shows as unlimited for every rank.

Does anyone have any advice, please?

Thanks,

Mark

-
Failed to register memory region (MR):

Hostname: dc1s0b1c
Address:  ec5000
Length:   20480
Error:Cannot allocate memory
--
--
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Failed to register memory (openmpi 2.0.2)

2017-10-19 Thread Mark Dixon

Thanks Ralph, will do.

Cheers,

Mark

On Wed, 18 Oct 2017, r...@open-mpi.org wrote:


Put “oob=tcp” in your default MCA param file
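
For the record, that amounts to a line like this in
$PREFIX/etc/openmpi-mca-params.conf (the exact path depends on the install):

  oob = tcp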


On Oct 18, 2017, at 9:00 AM, Mark Dixon  wrote:

Hi,

We're intermittently seeing messages (below) about failing to register memory 
with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the vanilla IB 
stack as shipped by centos.

We're not using any mlx4_core module tweaks at the moment. On earlier machines 
we used to set registered memory as per the FAQ, but neither log_num_mtt nor 
num_mtt seem to exist these days (according to 
/sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to follow 
the FAQ.

The output of 'ulimit -l' shows as unlimited for every rank.

Does anyone have any advice, please?

Thanks,

Mark

-
Failed to register memory region (MR):

Hostname: dc1s0b1c
Address:  ec5000
Length:   20480
Error:Cannot allocate memory
--
--
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Failed to register memory (openmpi 2.0.2)

2017-11-13 Thread Mark Dixon

Hi there,

We're intermittently seeing messages (below) about failing to register 
memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 / 24 core 
126G RAM Broadwell nodes and the vanilla IB stack as shipped by centos.


(We previously saw similar messages for the "ud" oob component but, as 
recommended in this thread, we stopped oob from using openib via an MCA 
parameter.)


I've checked to see what the registered memory limit is (by setting 
mlx4_core's debug_level, rebooting and examining kernel messages) and it's 
double the system RAM - which I understand is the recommended setting.
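
For reference, the check was roughly the following (a sketch - the exact
debug_level value is an assumption):

  # /etc/modprobe.d/mlx4_debug.conf, then reboot:
  options mlx4_core debug_level=1

  # afterwards, look for the reported registered memory / MTT size:
  dmesg | grep -i mlx4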


Any ideas about what might be going on, please?

Thanks,

Mark


--
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    dc1s0b1a
  OMPI source:   btl_openib.c:752
  Function:  opal_free_list_init()
  Device:    mlx4_0
  Memlock limit: unlimited

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--
[dc1s0b1a][[59067,1],0][btl_openib.c:1035:mca_btl_openib_add_procs] could not 
prepare openib device for use
[dc1s0b1a][[59067,1],0][btl_openib.c:1186:mca_btl_openib_get_ep] could not 
prepare openib device for use
[dc1s0b1a][[59067,1],0][connect/btl_openib_connect_udcm.c:1522:udcm_find_endpoint]
 could not find endpoint with port: 1, lid: 69, msg_type: 100


On Thu, 19 Oct 2017, Mark Dixon wrote:


Thanks Ralph, will do.

Cheers,

Mark

On Wed, 18 Oct 2017, r...@open-mpi.org wrote:


 Put “oob=tcp” in your default MCA param file


 On Oct 18, 2017, at 9:00 AM, Mark Dixon  wrote:

 Hi,

 We're intermittently seeing messages (below) about failing to register
 memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the
 vanilla IB stack as shipped by centos.

 We're not using any mlx4_core module tweaks at the moment. On earlier
 machines we used to set registered memory as per the FAQ, but neither
 log_num_mtt nor num_mtt seem to exist these days (according to
 /sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to
 follow the FAQ.

 The output of 'ulimit -l' shows as unlimited for every rank.

 Does anyone have any advice, please?

 Thanks,

 Mark

 -
 Failed to register memory region (MR):

 Hostname: dc1s0b1c
 Address:  ec5000
 Length:   20480
 Error:Cannot allocate memory
 --
 --
 Open MPI has detected that there are UD-capable Verbs devices on your
 system, but none of them were able to be setup properly.  This may
 indicate a problem on this system.

 You job will continue, but Open MPI will ignore the "ud" oob component
 in this run.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] OMPI 4.0.1 + PHDF5 1.8.21 tests fail on Lustre

2019-08-05 Thread Mark Dixon via users
Hi,

I’ve built parallel HDF5 1.8.21 against OpenMPI 4.0.1 on CentOS 7 and a 
Lustre 2.12 filesystem using the OS-provided GCC 4.8.5 and am trying to 
run the testsuite. I’m failing the testphdf5 test: could anyone help, 
please?

I’ve successfully used the same method to pass tests when building HDF5 
1.8.21 against different MPIs - MVAPICH2 2.3.1 and IntelMPI 2019.4.243.

I’ve built openmpi 4.0.1 with configure options:

   ./configure --prefix=$prefix
     --with-sge
     --with-io-romio-flags=--with-file-system=lustre+ufs
     --enable-mpi-cxx
     --with-cma
     --enable-mpi1-compatibility
     --with-ucx=$prefix --without-verbs
     --enable-mca-no-build=btl-uct

I’ve set the following MCA param to try and force ROMIO:

   export OMPI_MCA_io=romio321

For OpenMPI 4.0.1, I’m getting this failure - any ideas, please?

Thanks,

Mark

$ cat testphdf5.chklog

  testphdf5  Test Log


===
PHDF5 TESTS START
===
MPI-process 1. hostname=login2.arc4.leeds.ac.uk
MPI-process 3. hostname=login2.arc4.leeds.ac.uk
MPI-process 4. hostname=login2.arc4.leeds.ac.uk
MPI-process 5. hostname=login2.arc4.leeds.ac.uk

For help use: 
/nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5
 -help
Linked with hdf5 version 1.8 release 21

For help use: 
/nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5
 -help
Linked with hdf5 version 1.8 release 21

For help use: 
/nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5
 -help
Linked with hdf5 version 1.8 release 21

For help use: 
/nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5
 -help
Linked with hdf5 version 1.8 release 21
MPI-process 2. hostname=login2.arc4.leeds.ac.uk

For help use: 
/nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5
 -help
Linked with hdf5 version 1.8 release 21
MPI-process 0. hostname=login2.arc4.leeds.ac.uk

For help use: 
/nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5
 -help
Linked with hdf5 version 1.8 release 21
Test filenames are:
 ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Test filenames are:
 ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Test filenames are:
 ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Test filenames are:
 ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Test filenames are:
 ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test files in a
different directory or to add file type prefix. E.g.,
HDF5_PARAPREFIX=pfs:/PFS/user/me
export HDF5_PARAPREFIX
*** End of Hint ***
Test filenames are:
 ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset independent read (eidsetr) 
Tes

[OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-16 Thread Mark Dixon via users

Hi all,

I'm confused about how openmpi supports mpi-io on Lustre these days, and 
am hoping that someone can help.


Back in the openmpi 2.0.0 release notes, it said that OMPIO is the default 
MPI-IO implementation on everything apart from Lustre, where ROMIO is 
used. Those release notes are pretty old, but it still appears to be true.


However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell 
openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print warning 
messages (UCX_LOG_LEVEL=ERROR).


Can I just check: are we still supposed to be using ROMIO?

Thanks,

Mark


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-16 Thread Mark Dixon via users

Hi Edgar,

Thanks for this - good to know that ompio is an option, despite the 
reference to potential performance issues.


I'm using openmpi 4.0.5 with ucx 1.9.0 and see the hdf5 1.10.7 test 
"testphdf5" timeout (with the timeout set to an hour) using romio. Is it a 
known issue there, please?


When it times out, the last few lines to be printed are these:

Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)

The other thing I note is that openmpi doesn't configure romio's lustre 
driver, even when given "--with-lustre". Regardless, I see the same result 
whether or not I add "--with-io-romio-flags=--with-file-system=lustre+ufs"


Cheers,

Mark

On Mon, 16 Nov 2020, Gabriel, Edgar via users wrote:

this is in theory still correct, the default MPI I/O library used by 
Open MPI on Lustre file systems is ROMIO in all release versions. That 
being said, ompio does have support for Lustre as well starting from the 
2.1 series, so you can use that as well. The main reason that we did not 
switch to ompio for Lustre as the default MPI I/O library is a 
performance issue that can arise under certain circumstances.


Which version of Open MPI are you using? There was a bug fix in the Open 
MPI to ROMIO integration layer sometime in the 4.0 series that fixed a 
datatype problem, which caused some problems in the HDF5 tests. You 
might be hitting that problem.


Thanks
Edgar

-Original Message-
From: users  On Behalf Of Mark Dixon via users
Sent: Monday, November 16, 2020 4:32 AM
To: users@lists.open-mpi.org
Cc: Mark Dixon 
Subject: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

Hi all,

I'm confused about how openmpi supports mpi-io on Lustre these days, and 
am hoping that someone can help.


Back in the openmpi 2.0.0 release notes, it said that OMPIO is the 
default MPI-IO implementation on everything apart from Lustre, where 
ROMIO is used. Those release notes are pretty old, but it still appears 
to be true.


However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell 
openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print 
warning messages (UCX_LOG_LEVEL=ERROR).


Can I just check: are we still supposed to be using ROMIO?

Thanks,

Mark



Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-17 Thread Mark Dixon via users

Hi Edgar,

Pity, that would have been nice! But thanks for looking.

Checking through the ompi github issues, I now realise I logged exactly 
the same issue over a year ago (completely forgot - I've moved jobs since 
then), including a script to reproduce the issue on a Lustre system. 
Unfortunately there's been no movement:


https://github.com/open-mpi/ompi/issues/6871

If it helps anyone, I can confirm that hdf5 parallel tests pass with 
openmpi 3.1.6, but not in 4.0.5.


Surely I cannot be the only one who cares about using a recent openmpi 
with hdf5 on lustre?


Mark

On Mon, 16 Nov 2020, Gabriel, Edgar wrote:

hm, I think this sounds like a different issue, somebody who is more 
invested in the ROMIO Open MPI work should probably have a look.


Regarding compiling Open MPI with Lustre support for ROMIO, I cannot 
test it right now for various reasons, but if I recall correctly the 
trick was to provide the --with-lustre option twice, once inside of the 
"--with-io-romio-flags=" (along with the option that you provided), and 
once outside (for ompio).


Thanks
Edgar
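
If I've understood correctly, that would look something like this (an
untested sketch, with $prefix as before):

  ./configure --prefix=$prefix \
      --with-lustre \
      --with-io-romio-flags="--with-file-system=lustre+ufs --with-lustre"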




Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-26 Thread Mark Dixon via users

On Wed, 25 Nov 2020, Dave Love via users wrote:


The perf test says romio performs a bit better.  Also -- from overall
time -- it's faster on IMB-IO (which I haven't looked at in detail, and
ran with suboptimal striping).


I take that back.  I can't reproduce a significant difference for total
IMB-IO runtime, with both run in parallel on 16 ranks, using either the
system default of a single 1MB stripe or using eight stripes.  I haven't
teased out figures for different operations yet.  That must have been
done elsewhere, but I've never seen figures.


But remember that IMB-IO doesn't cover everything. For example, hdf5's 
t_bigio parallel test appears to be a pathological case and OMPIO is 2 
orders of magnitude slower on a Lustre filesystem:


- OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
- OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

End users seem to have the choice of:

- use openmpi 4.x and have some things broken (romio)
- use openmpi 4.x and have some things slow (ompio)
- use openmpi 3.x and everything works

My concern is that openmpi 3.x is near, or at, end of life.

Mark


t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7, 
Lustre 2.12.5:

[login testpar]$ time mpirun -np 6 ./t_bigio

Testing Dataset1 write by ROW

Testing Dataset2 write by COL

Testing Dataset3 write select ALL proc 0, NONE others

Testing Dataset4 write point selection

Read Testing Dataset1 by COL

Read Testing Dataset2 by ROW

Read Testing Dataset3 read select ALL proc 0, NONE others

Read Testing Dataset4 with Point selection
***Express test mode on.  Several tests are skipped

real0m21.141s
user2m0.318s
sys 0m3.289s


[login testpar]$ export OMPI_MCA_io=ompio
[login testpar]$ time mpirun -np 6 ./t_bigio

Testing Dataset1 write by ROW

Testing Dataset2 write by COL

Testing Dataset3 write select ALL proc 0, NONE others

Testing Dataset4 write point selection

Read Testing Dataset1 by COL

Read Testing Dataset2 by ROW

Read Testing Dataset3 read select ALL proc 0, NONE others

Read Testing Dataset4 with Point selection
***Express test mode on.  Several tests are skipped

real42m34.103s
user213m22.925s
sys 8m6.742s



Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-26 Thread Mark Dixon via users

Hi Edgar,

Thank you so much for your reply. Having run a number of Lustre systems 
over the years, I fully sympathise with your characterisation of Lustre as 
being very unforgiving!


Best wishes,

Mark

On Thu, 26 Nov 2020, Gabriel, Edgar wrote:

I will have a look at the t_bigio tests on Lustre with ompio.  We had 
from collaborators some reports about the performance problems similar 
to the one that you mentioned here (which was the reason we were 
hesitant to make ompio the default on Lustre), but part of the problem 
is that we were not able to reproduce it reliably on the systems that we 
had access to, which we makes debugging and fixing the issue very 
difficult. Lustre is a very unforgiving file system, if you get 
something wrong with the settings, the performance is not just a bit 
off, but often orders of magnitude (as in your measurements).


Thanks!
Edgar

-Original Message-
From: users  On Behalf Of Mark Dixon via users
Sent: Thursday, November 26, 2020 9:38 AM
To: Dave Love via users 
Cc: Mark Dixon ; Dave Love 

Subject: Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

On Wed, 25 Nov 2020, Dave Love via users wrote:


The perf test says romio performs a bit better.  Also -- from overall
time -- it's faster on IMB-IO (which I haven't looked at in detail,
and ran with suboptimal striping).


I take that back.  I can't reproduce a significant difference for
total IMB-IO runtime, with both run in parallel on 16 ranks, using
either the system default of a single 1MB stripe or using eight
stripes.  I haven't teased out figures for different operations yet.
That must have been done elsewhere, but I've never seen figures.


But remember that IMB-IO doesn't cover everything. For example, hdf5's t_bigio 
parallel test appears to be a pathological case and OMPIO is 2 orders of 
magnitude slower on a Lustre filesystem:

- OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
- OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

End users seem to have the choice of:

- use openmpi 4.x and have some things broken (romio)
- use openmpi 4.x and have some things slow (ompio)
- use openmpi 3.x and everything works

My concern is that openmpi 3.x is near, or at, end of life.

Mark


t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7, 
Lustre 2.12.5:

[login testpar]$ time mpirun -np 6 ./t_bigio

Testing Dataset1 write by ROW

Testing Dataset2 write by COL

Testing Dataset3 write select ALL proc 0, NONE others

Testing Dataset4 write point selection

Read Testing Dataset1 by COL

Read Testing Dataset2 by ROW

Read Testing Dataset3 read select ALL proc 0, NONE others

Read Testing Dataset4 with Point selection ***Express test mode on.  Several 
tests are skipped

real0m21.141s
user2m0.318s
sys 0m3.289s


[login testpar]$ export OMPI_MCA_io=ompio [login testpar]$ time mpirun -np 6 
./t_bigio

Testing Dataset1 write by ROW

Testing Dataset2 write by COL

Testing Dataset3 write select ALL proc 0, NONE others

Testing Dataset4 write point selection

Read Testing Dataset1 by COL

Read Testing Dataset2 by ROW

Read Testing Dataset3 read select ALL proc 0, NONE others

Read Testing Dataset4 with Point selection ***Express test mode on.  Several 
tests are skipped

real42m34.103s
user213m22.925s
sys 8m6.742s




Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-30 Thread Mark Dixon via users

On Fri, 27 Nov 2020, Dave Love wrote:
...
It's less dramatic in the case I ran, but there's clearly something 
badly wrong which needs profiling.  It's probably useful to know how 
many ranks that's with, and whether it's the default striping.  (I 
assume with default ompio fs parameters.)


Hi Dave,

It was run the way hdf5's "make check" runs it - that's 6 ranks. I didn't 
do anything interesting with striping so, unless t_bigio changed it, it'd 
have a width of 1.


...

I can have a look with the current or older romio, unless someone else
is going to; we should sort this.


If you were willing, that would be brilliant, thanks :)


My concern is that openmpi 3.x is near, or at, end of life.


'Twas ever thus, but if it works?


Evidently it wouldn't fit the definition of "works" for some users, 
otherwise there wouldn't have been a version 4!


I just didn't want Lustre MPI-IO support to be forgotten about, 
considering the 4.x series is 2 years old now.


All the best,

Mark


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-02 Thread Mark Dixon via users

Hi Mark,

Thanks so much for this - yes, applying that pull request against ompi 
4.0.5 allows hdf5 1.10.7's parallel tests to pass on our Lustre 
filesystem.


I'll certainly be applying it on our local clusters!
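
For anyone wanting to do the same, one way is to apply the PR as a patch on
top of the 4.0.5 tree (a sketch; it may need minor adjustment since the PR
targets master):

  cd openmpi-4.0.5
  curl -L https://github.com/open-mpi/ompi/pull/3975.patch | patch -p1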

Best wishes,

Mark

On Tue, 1 Dec 2020, Mark Allen via users wrote:

At least for the topic of why romio fails with HDF5, I believe this is 
the fix we need (it has to do with how romio processes the MPI datatypes in 
its flatten routine).  I made a different fix a long time ago in SMPI 
for that, then somewhat more recently it was re-broken and I had to 
re-fix it.  So the fix below takes a little more aggressive approach, not 
totally redesigning the flatten function, but taking over how the array 
size counter is handled. https://github.com/open-mpi/ompi/pull/3975


Mark Allen