[OMPI users] change in behaviour 1.6 -> 1.8 under sge
Hi there, We've started looking at moving to the openmpi 1.8 branch from 1.6 on our CentOS6/Son of Grid Engine cluster and noticed an unexpected difference when binding multiple cores to each rank. Has openmpi's definition of 'slot' changed between 1.6 and 1.8? It used to mean ranks, but now it appears to mean processing elements (see Details, below). Thanks, Mark PS Also, the man page for 1.8.3 reports that '--bysocket' is deprecated, but it doesn't seem to exist when we try to use it: mpirun: Error: unknown option "-bysocket" Type 'mpirun --help' for usage. == Details == On 1.6.5, we launch with the following core binding options: mpirun --bind-to-core --cpus-per-proc <n> mpirun --bind-to-core --bysocket --cpus-per-proc <n> where <n> is calculated to maximise the number of cores available to use - I guess effectively max(1, int(number of cores per node / slots per node requested)). openmpi reads the file $PE_HOSTFILE and launches a rank for each slot defined in it, binding <n> cores per rank. On 1.8.3, we've tried launching with the following core binding options (which we hoped were equivalent): mpirun -map-by node:PE=<n> mpirun -map-by socket:PE=<n> openmpi reads the file $PE_HOSTFILE and launches a factor of <n> fewer ranks than under 1.6.5. We also notice that, where we wanted a single rank on the box and <n> is the number of cores available, openmpi refuses to launch and we get the message: "There are not enough slots available in the system to satisfy the 1 slots that were requested by the application" I think that error message needs a little work :)
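A rough sketch of the launch logic described above, with <n> computed as max(1, cores per node / slots per node); the helper variables (cores_per_node, slots_per_node) are illustrative and not from the original submission script:

  # Pick how many cores to give each rank on this node
  cores_per_node=$(nproc)              # cores visible on the node
  slots_per_node=${NSLOTS:-1}          # assumes a single node; site scripts usually divide NSLOTS by the host count
  n=$(( cores_per_node / slots_per_node ))
  [ "$n" -lt 1 ] && n=1

  # Open MPI 1.6.x invocation from the message above
  mpirun --bind-to-core --bysocket --cpus-per-proc "$n" ./prog

  # Open MPI 1.8.x invocation that was hoped to be equivalent
  # mpirun --map-by socket:PE="$n" ./prog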
[OMPI users] Startup limited to 128 remote hosts in some situations?
Hi, While commissioning a new cluster, I wanted to run HPL across the whole thing using openmpi 2.0.1. I couldn't get it to start on more than 129 hosts under Son of Gridengine (128 remote plus the localhost running the mpirun command). openmpi would sit there, waiting for all the orted's to check in; however, there were "only" a maximum of 128 qrsh processes, therefore a maximum of 128 orted's, therefore waiting a long time. Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to launch. Is this intentional, please? Doesn't openmpi use a tree-like startup sometimes - any particular reason it's not using it here? Cheers, Mark
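In case it is useful to anyone else hitting the same limit, the MCA parameter mentioned above can be raised per job or in the default parameter file (256 below is just an example value):

  # Allow more simultaneous qrsh launches than the default of 128
  mpirun --mca plm_rsh_num_concurrent 256 ./xhpl

  # or system-wide, in <prefix>/etc/openmpi-mca-params.conf:
  #   plm_rsh_num_concurrent = 256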
Re: [OMPI users] Startup limited to 128 remote hosts in some situations?
Hi, It works for me :) Thanks! Mark On Fri, 20 Jan 2017, r...@open-mpi.org wrote: Well, it appears we are already forwarding all envars, which should include PATH. Here is the qrsh command line we use: "qrsh --inherit --nostdin -V" So would you please try the following patch: diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c index 0183bcc..1cc5aa4 100644 --- a/orte/mca/plm/rsh/plm_rsh_component.c +++ b/orte/mca/plm/rsh/plm_rsh_component.c @@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority) } mca_plm_rsh_component.agent = tmp; mca_plm_rsh_component.using_qrsh = true; -/* no tree spawn allowed under qrsh */ -mca_plm_rsh_component.no_tree_spawn = true; goto success; } else if (!mca_plm_rsh_component.disable_llspawn && NULL != getenv("LOADL_STEP_ID")) { On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote: I'll create a patch that you can try - if it works okay, we can commit it On Jan 18, 2017, at 3:29 AM, William Hay wrote: On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote: As I recall, the problem was that qrsh isn't available on the backend compute nodes, and so we can't use a tree for launch. If that isn't true, then we can certainly adjust it. qrsh should be available on all nodes of a SoGE cluster but, depending on how things are set up, may not be findable (ie not in the PATH) when you qrsh -inherit into a node. A workaround would be to start backend processes with qrsh -inherit -v PATH which will copy the PATH from the master node to the slave node process, or otherwise pass the location of qrsh from one node to another. That of course assumes that qrsh is in the same location on all nodes. I've tested that it is possible to qrsh from the head node of a job to a slave node and then on to another slave node by this method. William On Jan 17, 2017, at 9:37 AM, Mark Dixon wrote: Hi, While commissioning a new cluster, I wanted to run HPL across the whole thing using openmpi 2.0.1. I couldn't get it to start on more than 129 hosts under Son of Gridengine (128 remote plus the localhost running the mpirun command). openmpi would sit there, waiting for all the orted's to check in; however, there were "only" a maximum of 128 qrsh processes, therefore a maximum of 128 orted's, therefore waiting a long time. Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to launch. Is this intentional, please? Doesn't openmpi use a tree-like startup sometimes - any particular reason it's not using it here? -- --- Mark Dixon Email: m.c.di...@leeds.ac.uk Advanced Research Computing (ARC) Tel (int): 35429 IT Services building Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK ---
[OMPI users] Is gridengine integration broken in openmpi 2.0.2?
Hi, Just tried upgrading from 2.0.1 to 2.0.2 and I'm getting error messages that look like openmpi is using ssh to login to remote nodes instead of qrsh (see below). Has anyone else noticed gridengine integration being broken, or am I being dumb? I built with "./configure --prefix=/apps/developers/libraries/openmpi/2.0.2/1/intel-17.0.1 --with-sge --with-io-romio-flags=--with-file-system=lustre+ufs --enable-mpi-cxx --with-cma" Can see the gridengine component via: $ ompi_info -a | grep gridengine MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2) MCA ras gridengine: --- MCA ras gridengine: parameter "ras_gridengine_priority" (current value: "100", data source: default, level: 9 dev/all, type: int) Priority of the gridengine ras component MCA ras gridengine: parameter "ras_gridengine_verbose" (current value: "0", data source: default, level: 9 dev/all, type: int) Enable verbose output for the gridengine ras component MCA ras gridengine: parameter "ras_gridengine_show_jobid" (current value: "false", data source: default, level: 9 dev/all, type: bool) Cheers, Mark ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory Permission denied, please try again. ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory Permission denied, please try again. ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased). -- ORTE was unable to reliably start one or more daemons. This usually is caused by: * not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default * lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities. * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use. * compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type. * an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements). -- ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?
On Fri, 3 Feb 2017, Reuti wrote: ... SGE on its own is not configured to use SSH? (I mean the entries in `qconf -sconf` for rsh_command resp. daemon). ... Nope, everything left as the default: $ qconf -sconf | grep _command qlogin_command builtin rlogin_command builtin rsh_command builtin I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2 isn't. I'll start digging, but I'd appreciate hearing from any other SGE user who has tried 2.0.2 and can tell me whether it worked for them, please? :) Cheers, Mark
Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?
On Fri, 3 Feb 2017, r...@open-mpi.org wrote: I do see a diff between 2.0.1 and 2.0.2 that might have a related impact. The way we handled the MCA param that specifies the launch agent (ssh, rsh, or whatever) was modified, and I don’t think the change is correct. It basically says that we don’t look for qrsh unless the MCA param has been changed from the coded default, which means we are not detecting SGE by default. Try setting "-mca plm_rsh_agent foo" on your cmd line - that will get past the test, and then we should auto-detect SGE again ... Ah-ha! "-mca plm_rsh_agent foo" fixes it! Thanks very much - presumably I can stick that in the system-wide openmpi-mca-params.conf for now. Cheers, Mark
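For reference, the workaround from this exchange can go on the command line or, as suggested, in the system-wide parameter file:

  # Force the rsh PLM past the changed-from-default check so SGE is auto-detected again
  mpirun --mca plm_rsh_agent foo ./prog

  # or in <prefix>/etc/openmpi-mca-params.conf:
  #   plm_rsh_agent = foo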
Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?
On Mon, 6 Feb 2017, Mark Dixon wrote: ... Ah-ha! "-mca plm_rsh_agent foo" fixes it! Thanks very much - presumably I can stick that in the system-wide openmpi-mca-params.conf for now. ... Except if I do that, it means running ompi outside of the SGE environment no longer works :( Should I just revert the following commit? Cheers, Mark commit d51c2af76b0c011177aca8e08a5a5fcf9f5e67db Author: Jeff Squyres Date: Tue Aug 16 06:58:20 2016 -0500 rsh: robustify the check for plm_rsh_agent default value Don't strcmp against the default value -- the default value may change over time. Instead, check to see if the MCA var source is not DEFAULT. Signed-off-by: Jeff Squyres (cherry picked from commit open-mpi/ompi@71ec5cfb436977ea9ad409ba634d27e6addf6fae)
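If anyone wants to test that idea locally, the mechanics would look roughly like this - a sketch only, assuming a git checkout of the v2.0.x tree rather than a release tarball:

  # Undo the plm_rsh_agent default-value check and rebuild
  cd ompi                                           # path to the Open MPI git checkout (assumed)
  git revert d51c2af76b0c011177aca8e08a5a5fcf9f5e67db
  ./configure --prefix=$PREFIX --with-sge && make -j && make install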
[OMPI users] "-map-by socket:PE=1" doesn't do what I expect
Hi, When combining OpenMPI 2.0.2 with OpenMP, I'm interested in launching a number of ranks and allocating a number of cores to each rank. Using "-map-by socket:PE=", switching to "-map-by node:PE=" if I want to allocate more than a single socket to a rank, seems to do what I want. Except for "-map-by socket:PE=1". That seems to allocate an entire socket to each rank instead of a single core. Here's the output of a test program on a dual socket non-hyperthreading system that reports rank core bindings (odd cores on one socket, even on the other): $ mpirun -np 2 -map-by socket:PE=1 ./report_binding Rank 0 bound somehost.somewhere: 0 2 4 6 8 10 12 14 16 18 20 22 Rank 1 bound somehost.somewhere: 1 3 5 7 9 11 13 15 17 19 21 23 $ mpirun -np 2 -map-by socket:PE=2 ./report_binding Rank 0 bound somehost.somewhere: 0 2 Rank 1 bound somehost.somewhere: 1 3 $ mpirun -np 2 -map-by socket:PE=3 ./report_binding Rank 0 bound somehost.somewhere: 0 2 4 Rank 1 bound somehost.somewhere: 1 3 5 $ mpirun -np 2 -map-by socket:PE=4 ./report_binding Rank 0 bound somehost.somewhere: 0 2 4 6 Rank 1 bound somehost.somewhere: 1 3 5 7 I get the same result if I change "socket" to "numa". Changing "socket" to either "core", "node" or "slot" binds each rank to a single core (good), but doesn't round-robin ranks across sockets like "socket" does (bad). Is "-map-by socket:PE=1" doing the right thing, please? I tried reading the man page but I couldn't work out what the expected behaviour is :o Cheers, Mark ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
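For anyone wanting to reproduce this without the report_binding program (which isn't included in the message), a rough stand-in is to print each process's Linux affinity mask directly; OMPI_COMM_WORLD_RANK is set by mpirun in each rank's environment:

  # Report the CPUs each rank is allowed to run on
  mpirun -np 2 -map-by socket:PE=1 \
      bash -c 'echo "rank $OMPI_COMM_WORLD_RANK $(hostname): $(grep Cpus_allowed_list /proc/self/status)"'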
Re: [OMPI users] "-map-by socket:PE=1" doesn't do what I expect
On Wed, 15 Feb 2017, r...@open-mpi.org wrote: Ah, yes - I know what the problem is. We weren’t expecting a PE value of 1 - the logic is looking expressly for values > 1 as we hadn’t anticipated this use-case. Is it a sensible use-case, or am I crazy? I can make that change. I’m off to a workshop for the next day or so, but can probably do this on the plane. You're a star - thanks :) Mark
[OMPI users] Is building with "--enable-mpi-thread-multiple" recommended?
Hi, We have some users who would like to try out openmpi MPI_THREAD_MULTIPLE support on our InfiniBand cluster. I am wondering if we should enable it on our production cluster-wide version, or install it as a separate "here be dragons" copy. I seem to recall openmpi folk cautioning that MPI_THREAD_MULTIPLE support was pretty crazy and that enabling it could have problems for non-MPI_THREAD_MULTIPLE codes (never mind codes that explicitly used it), so such an install shouldn't be used unless for codes that actually need it. Is that still the case, please? Thanks, Mark ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] "-map-by socket:PE=1" doesn't do what I expect
On Fri, 17 Feb 2017, r...@open-mpi.org wrote: Mark - this is now available in master. Will look at what might be required to bring it to 2.0 Thanks Ralph, To be honest, since you've given me an alternative, there's no rush from my point of view. The logic's embedded in a script and it's been taught "--map-by socket --bind-to core" for the special case of 1. It'd be nice to get rid of it at some point, but there's no problem waiting for the next stable branch :) Cheers, Mark ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
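A minimal sketch of the script logic described above (the variable name cores_per_rank is illustrative), special-casing one core per rank until the fix reaches a stable release:

  # Work around "-map-by socket:PE=1" allocating a whole socket per rank
  if [ "$cores_per_rank" -eq 1 ]; then
      map_opts="--map-by socket --bind-to core"
  else
      map_opts="--map-by socket:PE=$cores_per_rank"
  fi
  mpirun $map_opts ./prog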
Re: [OMPI users] Is building with "--enable-mpi-thread-multiple" recommended?
On Fri, 17 Feb 2017, r...@open-mpi.org wrote: Depends on the version, but if you are using something in the v2.x range, you should be okay with just one installed version Thanks Ralph. How good is MPI_THREAD_MULTIPLE support these days and how far up the wishlist is it, please? We don't get many openmpi-specific queries from users but, other than core binding, it seems to be the thing we get asked about the most (I normally point those people at mvapich2 or intelmpi instead). Cheers, Mark ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
[OMPI users] More confusion about --map-by!
Hi, I'm still trying to figure out how to express the core binding I want to openmpi 2.x via the --map-by option. Can anyone help, please? I bet I'm being dumb, but it's proving tricky to achieve the following aims (most important first): 1) Maximise memory bandwidth usage (e.g. load balance ranks across processor sockets) 2) Optimise for nearest-neighbour comms (in MPI_COMM_WORLD) (e.g. put neighbouring ranks on the same socket) 3) Have an incantation that's simple to change based on number of ranks and processes per rank I want. Example: Considering a 2 socket, 12 cores/socket box and a program with 2 threads per rank... ... this is great if I fully-populate the node: $ mpirun -np 12 -map-by slot:PE=2 --bind-to core --report-bindings ./prog [somehost:101235] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././.][./././././././././././.] [somehost:101235] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././.][./././././././././././.] [somehost:101235] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.] [somehost:101235] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B/./././.][./././././././././././.] [somehost:101235] MCW rank 4 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]]: [././././././././B/B/./.][./././././././././././.] [somehost:101235] MCW rank 5 bound to socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././././B/B][./././././././././././.] [somehost:101235] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././././././.][B/B/./././././././././.] [somehost:101235] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././././././.][././B/B/./././././././.] [somehost:101235] MCW rank 8 bound to socket 1[core 16[hwt 0]], socket 1[core 17[hwt 0]]: [./././././././././././.][././././B/B/./././././.] [somehost:101235] MCW rank 9 bound to socket 1[core 18[hwt 0]], socket 1[core 19[hwt 0]]: [./././././././././././.][././././././B/B/./././.] [somehost:101235] MCW rank 10 bound to socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: [./././././././././././.][././././././././B/B/./.] [somehost:101235] MCW rank 11 bound to socket 1[core 22[hwt 0]], socket 1[core 23[hwt 0]]: [./././././././././././.][././././././././././B/B] ... but not if I don't [fails aim (1)]: $ mpirun -np 6 -map-by slot:PE=2 --bind-to core --report-bindings ./prog [somehost:102035] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././.][./././././././././././.] [somehost:102035] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././.][./././././././././././.] [somehost:102035] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.] [somehost:102035] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B/./././.][./././././././././././.] [somehost:102035] MCW rank 4 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]]: [././././././././B/B/./.][./././././././././././.] [somehost:102035] MCW rank 5 bound to socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././././B/B][./././././././././././.] ... 
whereas if I map by socket instead of slot, I achieve aim (1) but fail on aim (2): $ mpirun -np 6 -map-by socket:PE=2 --bind-to core --report-bindings ./prog [somehost:105601] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././.][./././././././././././.] [somehost:105601] MCW rank 1 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././././././.][B/B/./././././././././.] [somehost:105601] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././.][./././././././././././.] [somehost:105601] MCW rank 3 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././././././.][././B/B/./././././././.] [somehost:105601] MCW rank 4 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.] [somehost:105601] MCW rank 5 bound to socket 1[core 16[hwt 0]], socket 1[core 17[hwt 0]]: [./././././././././././.][././././B/B/./././././.] Any ideas, please? Thanks, Mark ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Is building with "--enable-mpi-thread-multiple" recommended?
On Fri, 3 Mar 2017, Paul Kapinos wrote: ... Note that on the 1.10.x series (even on 1.10.6), enabling MPI_THREAD_MULTIPLE leads to a (silent) shutdown of the InfiniBand fabric for that application => SLOW! 2.x versions (tested: 2.0.1) handle MPI_THREAD_MULTIPLE on InfiniBand the right way; however, due to the absence of memory hooks (= non-aligned memory allocation) we get 20% less bandwidth on IB with 2.x versions compared to 1.10.x versions of Open MPI (with or without support of MPI_THREAD_MULTIPLE). On the Intel OmniPath network both of the above issues seem to be not present, but due to a performance bug in MPI_Free_mem your application can be horribly slow (seen: CP2K) if the InfiniBand fallback of OPA is not disabled manually, see https://www.mail-archive.com/users@lists.open-mpi.org//msg30593.html ... Hi Paul, All very useful - thanks :) Our (limited) testing seems to show no difference on 2.x with MPI_THREAD_MULTIPLE enabled vs. disabled as well, which is good news. Glad to hear another opinion. Your 20% IB bandwidth hit on 2.x and the OPA problem are concerning - will look at that. Are there tickets open for them? Cheers, Mark
[OMPI users] MPI-IO Lustre driver update?
Hi, I notice that there's been quite a bit of work recently on ROMIO's Lustre driver. As far as I can see from openmpi's SVN, this doesn't seem to have landed there yet (README indicates V04, yet V05 is in MPICH2 and MVAPICH2). Is there a timescale for when this will make it into a release, please? Thanks, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
Re: [OMPI users] MPI-IO Lustre driver update?
On Mon, 29 Nov 2010, Jeff Squyres wrote: There's work going on right now to update the ROMIO in the OMPI v1.5 series. We hope to include it in v1.5.2. Cheers Jeff :) Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
[OMPI users] configure: mpi-threads disabled by default
I've been asked about mixed-mode MPI/OpenMP programming with OpenMPI, so have been digging through the past list messages on MPI_THREAD_*, etc. Interesting stuff :) Before I go ahead and add "--enable-mpi-threads" to our standard configure flags, is there any reason it's disabled by default, please? I'm a bit puzzled, as this default seems in conflict with the whole "Law of Least Astonishment" thing. Have I missed some disaster that's going to happen? Thanks, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
Re: [OMPI users] configure: mpi-threads disabled by default
On Wed, 4 May 2011, Eugene Loh wrote: Depending on what version you use, the option has been renamed --enable-mpi-thread-multiple. Anyhow, there is widespread concern whether the support is robust. The support is known to be limited and the performance poor. Thanks :) I absolutely see why support for MPI_THREAD_MULTIPLE is a configure option (not at all related to the fact I'm on a platform where my best interconnect gets disabled if you ask for it). However, do the same concerns apply to MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED? They are disabled by default too and they look difficult to enable without enabling some other functionality. Details: Jeff said (on this list, Tue, 14 Dec 2010 22:52:40) that, in OpenMPI, there's no difference between MPI_THREAD_SINGLE and MPI_THREAD_FUNNELED. Yet, with the default configure options, MPI_Init_thread will always return MPI_THREAD_SINGLE. * Release 1.4.3 - MPI_THREAD_(FUNNELED|SERIALIZED) are only available if you specify "--mpi-threads". Codes that sensibly negotiate their thread level automatically start using MPI_THREAD_MULTIPLE and my interconnect (openib) is disabled. * Release 1.5.3 - MPI_THREAD_(FUNNELED|SERIALIZED) are only available if you specify "--mpi-threads" (same problems as with 1.4.3), or enable asynchronous communication progress (whatever that is - but it sounds scary) with "--enable-progress-threads" Things do look different again in trunk, but seem to require you to at least ask for "--enable-opal-multi-threads". Are we supposed to be able to use MPI_THREAD_FUNNELED by default or not? Best wishes, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
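One quick way to check what thread support a particular build was configured with, without writing a test program, is to ask ompi_info (the exact wording of the output differs between releases):

  # Show how this installation was built with respect to threading
  ompi_info | grep -i -E 'thread|progress'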
[OMPI users] Mellanox MLX4_EVENT_TYPE_SRQ_LIMIT kernel messages
Hi, We've been putting a new Mellanox QDR Intel Sandy Bridge cluster, based on CentOS 6.3, through its paces and we're getting repeated kernel messages we never used to get on CentOS 5. An example on one node: Sep 28 09:58:20 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:27 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:27 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:29 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:29 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:31 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:31 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:32 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:45 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 09:58:45 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT Sep 28 10:08:23 g8s1n2 kernel: mlx4_core :01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT These messages appeared when running IMB compiled with openmpi 1.6.1 across 256 cores (16 nodes, 16 cores per node). The job ran from 09:56:54 to 10:08:46 and failed with no obvious error messages. Now, I'm used to IMB running into trouble at larger core counts, but I'm wondering if anyone has seen these messages before and know if they indicate a problem? We're running with an increased log_num_mtt mlx4_core option as recommended by the openmpi FAQ and increased log_num_srq to its maximum value in a failed attempt to get rid of the messages: $ cat /etc/modprobe.d/libmlx4_local.conf options mlx4_core log_num_mtt=24 log_mtts_per_seg=3 log_num_srq=20 Any thoughts? Thanks, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
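In case it helps anyone comparing systems, the option values the running mlx4_core module actually picked up can be read back from sysfs (which parameters exist varies with the driver version):

  # Print the mlx4_core parameters currently in effect
  for p in /sys/module/mlx4_core/parameters/*; do
      printf '%s = %s\n' "$(basename "$p")" "$(cat "$p")"
  done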
[OMPI users] knem/openmpi performance?
Hi, I'm taking a look at knem, to see if it improves the performance of any applications on our QDR InfiniBand cluster, so I'm eager to hear about other people's experiences. This doesn't appear to have been discussed on this list before. I appreciate that any affect that knem will have is entirely dependent on the application, scale and input data, but: * Does anyone know of any examples of popular software packages that benefit particularly from the knem support in openmpi? * Has anyone noticed any downsides to using knem? Thanks, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
Re: [OMPI users] knem/openmpi performance?
On Fri, 12 Jul 2013, Jeff Squyres (jsquyres) wrote: ... In short: doing 1 memcopy consumes half the memory bandwidth of 2 mem copies. So when you have lots of MPI processes competing for memory bandwidth, it turns out that having each MPI process use half the bandwidth is a Really Good Idea. :-) This allows more MPI processes to do shared memory communications before you hit the memory bandwidth bottleneck. Hi Jeff, Lots of useful detail in there - thanks. We have plenty of memory-bound applications in use, so hopefully there's some good news in this. I was hoping that someone might have some examples of real application behaviour rather than micro benchmarks. It can be crazy hard to get that information from users. Unusually for us, we're putting in a second cluster with the same architecture, CPUs, memory and OS as the last one. I might be able to use this as a bigger stick to get some better feedback. If so, I'll pass it on. Darius Buntinas, Brice Goglin, et al. wrote an excellent paper about exactly this set of issues; see http://runtime.bordeaux.inria.fr/knem/. ... I'll definitely take a look - thanks again. All the best, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
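A quick pre-benchmark sanity check that knem is actually available to Open MPI - nothing more than confirming the kernel module, its device node and the build-time support are all present:

  # Is the knem module loaded and its device accessible to users?
  lsmod | grep knem && ls -l /dev/knem
  # Was this Open MPI built with knem support?
  ompi_info | grep -i knem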
Re: [OMPI users] knem/openmpi performance?
On Mon, 15 Jul 2013, Elken, Tom wrote: ... Hope these anecdotes are relevant to Open MPI users considering knem. ... Brilliantly useful, thanks! It certainly looks like it may be greatly significant for some applications. Worth looking into. All the best, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
Re: [OMPI users] knem/openmpi performance?
On Thu, 18 Jul 2013, Iliev, Hristo wrote: ... Detailed results are coming in the near future, but the benchmarks done ... Hi Hristo, Very interesting, thanks for sharing! Will be very interested to read your official results when you publish :) All the best, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
[OMPI users] Open MPI unable to find threading support for PGI or Sun Studio
Hi, I've been attempting to build Open MPI 1.2.6 using a variety of compilers including, but not limited to, PGI 7.1-6 and Sun Studio 12 (200709) on a CentOS 5.2 32-bit Intel box. Building against either of the above compilers results in the following message produced by configure: Open MPI was unable to find threading support on your system. In the near future, the OMPI development team is considering requiring threading support for proper OMPI execution. This is in part because we are not aware of any users that do not have thread support - so we need you to e-mail us at o...@ompi-mpi.org and let us know about this problem. I don't see this when building against the Intel 10.1.015 or GNU GCC 4.1.2 compilers. I cannot see any answer to this in the FAQ or list archives. I've attached files showing the output of configure and my environment to this message. Is this expected? Thanks, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK - sunstudio-build.txt.bz2 Description: Binary data pgi-build.txt.bz2 Description: Binary data
Re: [OMPI users] Open MPI unable to find threading support for PGI or Sun Studio
On Mon, 28 Jul 2008, Jeff Squyres wrote: FWIW: I compile with PGI 7.1.4 regularly on RHEL4U4 and don't see this problem. It would be interesting to see the config.log's from these builds to see the actual details of what went wrong. Thanks Jeff: it's good to know it's just me ;) Following your message, I've tried building with PGI on a few systems: Compiler OS Result === 32-bit 7.1.6 CentOS 5.2 (32-bit) no threading 32-bit 7.1.4 CentOS 5.2 (32-bit) no threading **config.log attached** 32-bit 7.1.4 RHEL4u6 (64-bit) works! 32-bit 7.1.4 CentOS 5.1 (64-bit) no threading Each time it fails, it's because of "__builtin_expect" being undefined for pgcc and pgf77 (works for pgcpp) - or any of the Sun Studio compilers. Could this be a glibc 2.3 (RHEL4) vs. 2.5 (CentOS5) issue? I've attached just the PGI config.log for now (I don't want to blow the 100Kb posting limit), but the relevant sections from each appear to be: PGI: configure:49065: checking if C compiler and POSIX threads work with -lpthread configure:49121: pgcc -o conftest -O -DNDEBUG -D_REENTRANT conftest.c -lnsl -lutil -lpthread >&5 conftest.c: conftest.o: In function `main': conftest.c:(.text+0x98): undefined reference to `__builtin_expect' configure:49272: checking if C++ compiler and POSIX threads work with -lpthread configure:49328: pgcpp -o conftest -O -DNDEBUGconftest.cpp -lnsl -lutil -lpthread >&5 conftest.cpp: (skipped some non-fatal warning messages here) configure:49572: checking if F77 compiler and POSIX threads work with -lpthread configure:49654: pgcc -O -DNDEBUG -I. -c conftest.c configure:49661: $? = 0 configure:49671: pgf77 conftestf.f conftest.o -o conftest -lnsl -lutil -lpthread conftestf.f: conftest.o: In function `pthreadtest_': conftest.c:(.text+0x92): undefined reference to `__builtin_expect' Sun: configure:49065: checking if C compiler and POSIX threads work with -lpthread configure:49121: cc -o conftest -O -DNDEBUG -D_REENTRANT conftest.c -lnsl -lutil -lm -lpthread >&5 "conftest.c", line 305: warning: can not set non-default alignment for automatic variable "conftest.c", line 305: warning: implicit function declaration: __builtin_expect conftest.o: In function `main': conftest.c:(.text+0x35): undefined reference to `__builtin_expect' configure:49272: checking if C++ compiler and POSIX threads work with -lpthread configure:49328: CC -o conftest -O -DNDEBUGconftest.cpp -lnsl -lutil -lm -lpthread >&5 "conftest.cpp", line 305: Error: The function "__builtin_expect" must have a prototype. 1 Error(s) detected. configure:49572: checking if F77 compiler and POSIX threads work with -lpthread configure:49654: cc -O -DNDEBUG -I. -c conftest.c "conftest.c", line 15: warning: can not set non-default alignment for automatic variable "conftest.c", line 15: warning: implicit function declaration: __builtin_expect configure:49661: $? = 0 configure:49671: f77 conftestf.f conftest.o -o conftest -lnsl -lutil -lm -lpthread NOTICE: Invoking /apps/compilers/sunstudio/12_200709/1/sunstudio12/bin/f90 -f77 -ftrap=%none conftestf.f conftest.o -o conftest -lnsl -lutil -lm -lpthread conftestf.f: MAIN fpthread: conftest.o: In function `pthreadtest_': conftest.c:(.text+0x41): undefined reference to `__builtin_expect' Any ideas? Cheers, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK - config.log.bz2 Description: Binary data
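A minimal reproducer outside of Open MPI's configure, assuming the problem really is glibc's pthread headers expanding to __builtin_expect with these compilers; this is only a sketch and may need the same macros configure uses in order to trigger the failure:

  # Compile a trivial pthread program with the compiler under test
  printf '#include <pthread.h>\nstatic void *f(void *a){return a;}\nint main(void){pthread_t t; pthread_create(&t,0,f,0); pthread_join(t,0); return 0;}\n' > conftest.c
  pgcc -O -D_REENTRANT conftest.c -o conftest -lpthread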
Re: [OMPI users] Open MPI unable to find threading support for PGI or Sun Studio
On Tue, 29 Jul 2008, Jeff Squyres wrote: ... I suggest that you bring this issue up with PGI support; they're fairly responsive on their web forums. ... Will do: thanks for giving this a look, you've been really helpful. Cheers, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
Re: [OMPI users] Open MPI unable to find threading support for PGI or Sun Studio
On Tue, 29 Jul 2008, Jeff Squyres wrote: On Jul 29, 2008, at 6:52 AM, Mark Dixon wrote: FWIW: I compile with PGI 7.1.4 regularly on RHEL4U4 and don't see this problem. It would be interesting to see the config.log's from these builds to see the actual details of what went wrong. ... Compiler OS Result === 32-bit 7.1.6 CentOS 5.2 (32-bit) no threading 32-bit 7.1.4 CentOS 5.2 (32-bit) no threading **config.log attached** 32-bit 7.1.4 RHEL4u6 (64-bit) works! 32-bit 7.1.4 CentOS 5.1 (64-bit) no threading ... I'm afraid this one is out of my bailiwick -- I don't know. Looking through your config.log file, it does look like this lack of __builtin_expect is the killer. FWIW, here's my configure output when I run with pgcc v7.1.4: ... I suggest that you bring this issue up with PGI support; they're fairly responsive on their web forums. ... In case anyone's interested, the fix is to upgrade to at least PGI 7.2-2. It seems that there was a change to glibc between RHEL4 and RHEL5 (2.3 vs. 2.5) which requires __builtin_expect to be defined when using certain pthread library functions. This also appears to be a problem for the Sun Studio 12 compiler (bug id 6603861), but it would seem that Sun's not in a hurry to fix it. Thanks for your time, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK -
[OMPI users] Failed to register memory (openmpi 2.0.2)
Hi, We're intermittently seeing messages (below) about failing to register memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the vanilla IB stack as shipped by centos. We're not using any mlx4_core module tweaks at the moment. On earlier machines we used to set registered memory as per the FAQ, but neither log_num_mtt nor num_mtt seem to exist these days (according to /sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to follow the FAQ. The output of 'ulimit -l' shows as unlimited for every rank. Does anyone have any advice, please? Thanks, Mark - Failed to register memory region (MR): Hostname: dc1s0b1c Address: ec5000 Length: 20480 Error:Cannot allocate memory -- -- Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system. You job will continue, but Open MPI will ignore the "ud" oob component in this run. ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
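For completeness, the locked-memory limit check mentioned above can be run under mpirun itself, since limits on the login node do not always match what the daemons inherit on the compute nodes:

  # Confirm the memlock limit really is unlimited for every rank
  mpirun bash -c 'echo "$(hostname): $(ulimit -l)"' | sort -u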
Re: [OMPI users] Failed to register memory (openmpi 2.0.2)
Thanks Ralph, will do. Cheers, Mark On Wed, 18 Oct 2017, r...@open-mpi.org wrote: Put “oob=tcp” in your default MCA param file On Oct 18, 2017, at 9:00 AM, Mark Dixon wrote: Hi, We're intermittently seeing messages (below) about failing to register memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the vanilla IB stack as shipped by centos. We're not using any mlx4_core module tweaks at the moment. On earlier machines we used to set registered memory as per the FAQ, but neither log_num_mtt nor num_mtt seem to exist these days (according to /sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to follow the FAQ. The output of 'ulimit -l' shows as unlimited for every rank. Does anyone have any advice, please? Thanks, Mark - Failed to register memory region (MR): Hostname: dc1s0b1c Address: ec5000 Length: 20480 Error:Cannot allocate memory -- -- Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system. You job will continue, but Open MPI will ignore the "ud" oob component in this run. ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
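Spelled out, Ralph's suggestion is a one-line addition to the default parameter file (the path depends on the installation prefix):

  # Keep the out-of-band (oob) messaging off the openib transport
  echo 'oob = tcp' >> "$PREFIX"/etc/openmpi-mca-params.conf    # $PREFIX = Open MPI installation prefix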
Re: [OMPI users] Failed to register memory (openmpi 2.0.2)
Hi there, We're intermittently seeing messages (below) about failing to register memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 / 24 core 126G RAM Broadwell nodes and the vanilla IB stack as shipped by centos. (We previously seen similar messages for the "ud" oob component but, as recommended in this thread, we stopped oob from using openib via an MCA parameter.) I've checked to see what the registered memory limit is (by setting mlx4_core's debug_level, rebooting and examining kernel messages) and it's double the system RAM - which I understand is the recommended setting. Any ideas about what might be going on, please? Thanks, Mark -- The OpenFabrics (openib) BTL failed to initialize while trying to allocate some locked memory. This typically can indicate that the memlock limits are set too low. For most HPC installations, the memlock limits should be set to "unlimited". The failure occured here: Local host: dc1s0b1a OMPI source: btl_openib.c:752 Function: opal_free_list_init() Device: mlx4_0 Memlock limit: unlimited You may need to consult with your system administrator to get this problem fixed. This FAQ entry on the Open MPI web site may also be helpful: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages -- [dc1s0b1a][[59067,1],0][btl_openib.c:1035:mca_btl_openib_add_procs] could not prepare openib device for use [dc1s0b1a][[59067,1],0][btl_openib.c:1186:mca_btl_openib_get_ep] could not prepare openib device for use [dc1s0b1a][[59067,1],0][connect/btl_openib_connect_udcm.c:1522:udcm_find_endpoint] could not find endpoint with port: 1, lid: 69, msg_type: 100 On Thu, 19 Oct 2017, Mark Dixon wrote: Thanks Ralph, will do. Cheers, Mark On Wed, 18 Oct 2017, r...@open-mpi.org wrote: Put “oob=tcp” in your default MCA param file On Oct 18, 2017, at 9:00 AM, Mark Dixon wrote: Hi, We're intermittently seeing messages (below) about failing to register memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the vanilla IB stack as shipped by centos. We're not using any mlx4_core module tweaks at the moment. On earlier machines we used to set registered memory as per the FAQ, but neither log_num_mtt nor num_mtt seem to exist these days (according to /sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to follow the FAQ. The output of 'ulimit -l' shows as unlimited for every rank. Does anyone have any advice, please? Thanks, Mark - Failed to register memory region (MR): Hostname: dc1s0b1c Address: ec5000 Length: 20480 Error:Cannot allocate memory -- -- Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system. You job will continue, but Open MPI will ignore the "ud" oob component in this run. ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
[OMPI users] OMPI 4.0.1 + PHDF5 1.8.21 tests fail on Lustre
Hi, I’ve built parallel HDF5 1.8.21 against OpenMPI 4.0.1 on CentOS 7 and a Lustre 2.12 filesystem using the OS-provided GCC 4.8.5 and am trying to run the testsuite. I’m failing the testphdf5 test: could anyone help, please? I’ve successfully used the same method to pass tests when building HDF5 1.8.21 against different MPIs - MVAPICH2 2.3.1 and IntelMPI 2019.4.243. I’ve built openmpi 4.0.1 with configure options: ./configure --prefix=$prefix --with-sge --with-io-romio-flags=--with-file-system=lustre+ufs --enable-mpi-cxx --with-cma --enable-mpi1-compatibility --with-ucx=$prefix --without-verbs --enable-mca-no-build=btl-uct I’ve set the following MCA param to try and force ROMIO: export OMPI_MCA_io=romio321 For OpenMPI 4.0.1, I’m getting this failure - any ideas, please? Thanks, Mark $ cat testphdf5.chklog testphdf5 Test Log === PHDF5 TESTS START === MPI-process 1. hostname=login2.arc4.leeds.ac.uk MPI-process 3. hostname=login2.arc4.leeds.ac.uk MPI-process 4. hostname=login2.arc4.leeds.ac.uk MPI-process 5. hostname=login2.arc4.leeds.ac.uk For help use: /nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5 -help Linked with hdf5 version 1.8 release 21 For help use: /nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5 -help Linked with hdf5 version 1.8 release 21 For help use: /nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5 -help Linked with hdf5 version 1.8 release 21 For help use: /nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5 -help Linked with hdf5 version 1.8 release 21 MPI-process 2. hostname=login2.arc4.leeds.ac.uk For help use: /nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5 -help Linked with hdf5 version 1.8 release 21 MPI-process 0. hostname=login2.arc4.leeds.ac.uk For help use: /nobackup/issmcd/login2.arc4.leeds.ac.uk.u4q9A9ALkN/hdf5-1.8.21/testpar/.libs/testphdf5 -help Linked with hdf5 version 1.8 release 21 Test filenames are: ParaTest.h5 Testing -- fapl_mpio duplicate (mpiodup) Test filenames are: ParaTest.h5 Testing -- fapl_mpio duplicate (mpiodup) Test filenames are: ParaTest.h5 Testing -- fapl_mpio duplicate (mpiodup) Test filenames are: ParaTest.h5 Testing -- fapl_mpio duplicate (mpiodup) Test filenames are: ParaTest.h5 Testing -- fapl_mpio duplicate (mpiodup) *** Hint *** You can use environment variable HDF5_PARAPREFIX to run parallel test files in a different directory or to add file type prefix. 
E.g., HDF5_PARAPREFIX=pfs:/PFS/user/me export HDF5_PARAPREFIX *** End of Hint *** Test filenames are: ParaTest.h5 Testing -- fapl_mpio duplicate (mpiodup) Testing -- dataset using split communicators (split) Testing -- dataset using split communicators (split) Testing -- dataset using split communicators (split) Testing -- dataset using split communicators (split) Testing -- dataset using split communicators (split) Testing -- dataset using split communicators (split) Testing -- dataset independent write (idsetw) Testing -- dataset independent write (idsetw) Testing -- dataset independent write (idsetw) Testing -- dataset independent write (idsetw) Testing -- dataset independent write (idsetw) Testing -- dataset independent write (idsetw) Testing -- dataset independent read (idsetr) Testing -- dataset independent read (idsetr) Testing -- dataset independent read (idsetr) Testing -- dataset independent read (idsetr) Testing -- dataset independent read (idsetr) Testing -- dataset independent read (idsetr) Testing -- dataset collective write (cdsetw) Testing -- dataset collective write (cdsetw) Testing -- dataset collective write (cdsetw) Testing -- dataset collective write (cdsetw) Testing -- dataset collective write (cdsetw) Testing -- dataset collective write (cdsetw) Testing -- dataset collective read (cdsetr) Testing -- dataset collective read (cdsetr) Testing -- dataset collective read (cdsetr) Testing -- dataset collective read (cdsetr) Testing -- dataset collective read (cdsetr) Testing -- dataset collective read (cdsetr) Testing -- extendible dataset independent write (eidsetw) Testing -- extendible dataset independent write (eidsetw) Testing -- extendible dataset independent write (eidsetw) Testing -- extendible dataset independent write (eidsetw) Testing -- extendible dataset independent write (eidsetw) Testing -- extendible dataset independent write (eidsetw) Testing -- extendible dataset independent read (eidsetr) Testing -- extendible dataset independent read (eidsetr) Testing -- extendible dataset independent read (eidsetr) Testing -- extendible dataset independent read (eidsetr) Tes
[OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi all, I'm confused about how openmpi supports mpi-io on Lustre these days, and am hoping that someone can help. Back in the openmpi 2.0.0 release notes, it said that OMPIO is the default MPI-IO implementation on everything apart from Lustre, where ROMIO is used. Those release notes are pretty old, but it still appears to be true. However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print warning messages (UCX_LOG_LEVEL=ERROR). Can I just check: are we still supposed to be using ROMIO? Thanks, Mark
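For anyone trying to reproduce the test runs described above, these are the two settings referred to, expressed as environment variables:

  # Force the OMPIO MPI-IO component and quieten UCX warnings for the HDF5 test suite
  export OMPI_MCA_io=ompio
  export UCX_LOG_LEVEL=ERROR
  make check    # run from the HDF5 testpar directory, or however the tests are driven locally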
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Edgar, Thanks for this - good to know that ompio is an option, despite the reference to potential performance issues. I'm using openmpi 4.0.5 with ucx 1.9.0 and see the hdf5 1.10.7 test "testphdf5" timeout (with the timeout set to an hour) using romio. Is it a known issue there, please? When it times out, the last few lines to be printed are these: Testing -- multi-chunk collective chunk io (cchunk3) Testing -- multi-chunk collective chunk io (cchunk3) Testing -- multi-chunk collective chunk io (cchunk3) Testing -- multi-chunk collective chunk io (cchunk3) Testing -- multi-chunk collective chunk io (cchunk3) Testing -- multi-chunk collective chunk io (cchunk3) The other thing I note is that openmpi doesn't configure romio's lustre driver, even when given "--with-lustre". Regardless, I see the same result whether or not I add "--with-io-romio-flags=--with-file-system=lustre+ufs" Cheers, Mark On Mon, 16 Nov 2020, Gabriel, Edgar via users wrote: this is in theory still correct, the default MPI I/O library used by Open MPI on Lustre file systems is ROMIO in all release versions. That being said, ompio does have support for Lustre as well starting from the 2.1 series, so you can use that as well. The main reason that we did not switch to ompio for Lustre as the default MPI I/O library is a performance issue that can arise under certain circumstances. Which version of Open MPI are you using? There was a bug fix in the Open MPI to ROMIO integration layer sometime in the 4.0 series that fixed a datatype problem, which caused some problems in the HDF5 tests. You might be hitting that problem. Thanks Edgar -Original Message- From: users On Behalf Of Mark Dixon via users Sent: Monday, November 16, 2020 4:32 AM To: users@lists.open-mpi.org Cc: Mark Dixon Subject: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO? Hi all, I'm confused about how openmpi supports mpi-io on Lustre these days, and am hoping that someone can help. Back in the openmpi 2.0.0 release notes, it said that OMPIO is the default MPI-IO implementation on everything apart from Lustre, where ROMIO is used. Those release notes are pretty old, but it still appears to be true. However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print warning messages (UCX_LOG_LEVEL=ERROR). Can I just check: are we still supposed to be using ROMIO? Thanks, Mark
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Edgar, Pity, that would have been nice! But thanks for looking. Checking through the ompi github issues, I now realise I logged exactly the same issue over a year ago (completely forgot - I've moved jobs since then), including a script to reproduce the issue on a Lustre system. Unfortunately there's been no movement: https://github.com/open-mpi/ompi/issues/6871 If it helps anyone, I can confirm that hdf5 parallel tests pass with openmpi 3.1.6, but not in 4.0.5. Surely I cannot be the only one who cares about using a recent openmpi with hdf5 on lustre? Mark On Mon, 16 Nov 2020, Gabriel, Edgar wrote: hm, I think this sounds like a different issue, somebody who is more invested in the ROMIO Open MPI work should probably have a look. Regarding compiling Open MPI with Lustre support for ROMIO, I cannot test it right now for various reasons, but if I recall correctly the trick was to provide the --with-lustre option twice, once inside of the "--with-io-romio-flags=" (along with the option that you provided), and once outside (for ompio). Thanks Edgar
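Writing out Edgar's recollection as a configure line (untested, as he notes; the extra --with-lustre inside the ROMIO flags is the part being suggested):

  ./configure --prefix=$PREFIX \
      --with-lustre \
      --with-io-romio-flags="--with-file-system=lustre+ufs --with-lustre"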
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
On Wed, 25 Nov 2020, Dave Love via users wrote: The perf test says romio performs a bit better. Also -- from overall time -- it's faster on IMB-IO (which I haven't looked at in detail, and ran with suboptimal striping). I take that back. I can't reproduce a significant difference for total IMB-IO runtime, with both run in parallel on 16 ranks, using either the system default of a single 1MB stripe or using eight stripes. I haven't teased out figures for different operations yet. That must have been done elsewhere, but I've never seen figures. But remember that IMB-IO doesn't cover everything. For example, hdf5's t_bigio parallel test appears to be a pathological case and OMPIO is 2 orders of magnitude slower on a Lustre filesystem: - OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds - OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds End users seem to have the choice of: - use openmpi 4.x and have some things broken (romio) - use openmpi 4.x and have some things slow (ompio) - use openmpi 3.x and everything works My concern is that openmpi 3.x is near, or at, end of life. Mark t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7, Lustre 2.12.5: [login testpar]$ time mpirun -np 6 ./t_bigio Testing Dataset1 write by ROW Testing Dataset2 write by COL Testing Dataset3 write select ALL proc 0, NONE others Testing Dataset4 write point selection Read Testing Dataset1 by COL Read Testing Dataset2 by ROW Read Testing Dataset3 read select ALL proc 0, NONE others Read Testing Dataset4 with Point selection ***Express test mode on. Several tests are skipped real0m21.141s user2m0.318s sys 0m3.289s [login testpar]$ export OMPI_MCA_io=ompio [login testpar]$ time mpirun -np 6 ./t_bigio Testing Dataset1 write by ROW Testing Dataset2 write by COL Testing Dataset3 write select ALL proc 0, NONE others Testing Dataset4 write point selection Read Testing Dataset1 by COL Read Testing Dataset2 by ROW Read Testing Dataset3 read select ALL proc 0, NONE others Read Testing Dataset4 with Point selection ***Express test mode on. Several tests are skipped real42m34.103s user213m22.925s sys 8m6.742s
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Edgar, Thank you so much for your reply. Having run a number of Lustre systems over the years, I fully sympathise with your characterisation of Lustre as being very unforgiving! Best wishes, Mark On Thu, 26 Nov 2020, Gabriel, Edgar wrote: I will have a look at the t_bigio tests on Lustre with ompio. We had from collaborators some reports about the performance problems similar to the one that you mentioned here (which was the reason we were hesitant to make ompio the default on Lustre), but part of the problem is that we were not able to reproduce it reliably on the systems that we had access to, which we makes debugging and fixing the issue very difficult. Lustre is a very unforgiving file system, if you get something wrong with the settings, the performance is not just a bit off, but often orders of magnitude (as in your measurements). Thanks! Edgar -Original Message- From: users On Behalf Of Mark Dixon via users Sent: Thursday, November 26, 2020 9:38 AM To: Dave Love via users Cc: Mark Dixon ; Dave Love Subject: Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO? On Wed, 25 Nov 2020, Dave Love via users wrote: The perf test says romio performs a bit better. Also -- from overall time -- it's faster on IMB-IO (which I haven't looked at in detail, and ran with suboptimal striping). I take that back. I can't reproduce a significant difference for total IMB-IO runtime, with both run in parallel on 16 ranks, using either the system default of a single 1MB stripe or using eight stripes. I haven't teased out figures for different operations yet. That must have been done elsewhere, but I've never seen figures. But remember that IMB-IO doesn't cover everything. For example, hdf5's t_bigio parallel test appears to be a pathological case and OMPIO is 2 orders of magnitude slower on a Lustre filesystem: - OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds - OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds End users seem to have the choice of: - use openmpi 4.x and have some things broken (romio) - use openmpi 4.x and have some things slow (ompio) - use openmpi 3.x and everything works My concern is that openmpi 3.x is near, or at, end of life. Mark t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7, Lustre 2.12.5: [login testpar]$ time mpirun -np 6 ./t_bigio Testing Dataset1 write by ROW Testing Dataset2 write by COL Testing Dataset3 write select ALL proc 0, NONE others Testing Dataset4 write point selection Read Testing Dataset1 by COL Read Testing Dataset2 by ROW Read Testing Dataset3 read select ALL proc 0, NONE others Read Testing Dataset4 with Point selection ***Express test mode on. Several tests are skipped real0m21.141s user2m0.318s sys 0m3.289s [login testpar]$ export OMPI_MCA_io=ompio [login testpar]$ time mpirun -np 6 ./t_bigio Testing Dataset1 write by ROW Testing Dataset2 write by COL Testing Dataset3 write select ALL proc 0, NONE others Testing Dataset4 write point selection Read Testing Dataset1 by COL Read Testing Dataset2 by ROW Read Testing Dataset3 read select ALL proc 0, NONE others Read Testing Dataset4 with Point selection ***Express test mode on. Several tests are skipped real42m34.103s user213m22.925s sys 8m6.742s
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
On Fri, 27 Nov 2020, Dave Love wrote: ... It's less dramatic in the case I ran, but there's clearly something badly wrong which needs profiling. It's probably useful to know how many ranks that's with, and whether it's the default striping. (I assume with default ompio fs parameters.) Hi Dave, It was run the way hdf5's "make check" runs it - that's 6 ranks. I didn't do anything interesting with striping so, unless t_bigio changed it, it'd have a width of 1. ... I can have a look with the current or older romio, unless someone else is going to; we should sort this. If you were willing, that would be brilliant, thanks :) My concern is that openmpi 3.x is near, or at, end of life. 'Twas ever thus, but if it works? Evidently it wouldn't fit the definition of "works" for some users, otherwise there wouldn't have been a version 4! I just didn't want Lustre MPI-IO support to be forgotten about, considering the 4.x series is 2 years old now. All the best, Mark
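For the striping comparisons being discussed, the Lustre client tools show and set the layout per directory, so runs with different stripe counts can be compared directly (the counts here are just the ones mentioned in the thread):

  # What layout will new files in the current directory get?
  lfs getstripe -d .
  # Re-run the benchmark in a directory striped across eight OSTs instead of one
  mkdir -p stripe8 && lfs setstripe -c 8 stripe8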
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Mark, Thanks so much for this - yes, applying that pull request against ompi 4.0.5 allows hdf5 1.10.7's parallel tests to pass on our Lustre filesystem. I'll certainly be applying it on our local clusters! Best wishes, Mark On Tue, 1 Dec 2020, Mark Allen via users wrote: At least for the topic of why romio fails with HDF5, I believe this is the fix we need (has to do with how romio processes the MPI datatypes in its flatten routine). I made a different fix a long time ago in SMPI for that, then somewhat more recently it was re-broke it and I had to re-fix it. So the below takes a little more aggressive approach, not totally redesigning the flatten function, but taking over how the array size counter is handled. https://github.com/open-mpi/ompi/pull/3975 Mark Allen
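For the record, one way to try that pull request against a 4.0.5 tree; GitHub's .patch URL for a pull request is a generic mechanism, but whether it still applies cleanly depends on how far the branches have diverged:

  # Apply open-mpi/ompi pull request 3975 to an unpacked 4.0.5 source tree, then rebuild
  cd openmpi-4.0.5
  curl -L https://github.com/open-mpi/ompi/pull/3975.patch | patch -p1
  ./configure --prefix=$PREFIX && make -j && make install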