Re: [OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread r...@open-mpi.org
We have had reports of applications running faster when executing under OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear, but are likely related to differences in mapping/binding options (OMPI provides a very large range compared to srun) and optimization flags provided
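
For illustration only (not part of the original post; ./mpi_app is a hypothetical binary), the kind of mapping/binding control mpirun exposes versus srun looks like:

    mpirun --map-by ppr:2:socket --bind-to core ./mpi_app
    srun --ntasks-per-socket=2 --cpu_bind=cores ./mpi_app

mpirun's --map-by/--bind-to pair accepts many object types (hwthread, core, l2cache, socket, numa, node, ppr:N:obj), far more than srun's binding options, which is one plausible source of the reported differences.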

Re: [OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-19 Thread r...@open-mpi.org
experiences are welcome. Also, if > anyone is interested in the tmpdir spank plugin, you can contact me. We are > happy to share. > > Best and Merry Christmas to all, > > Charlie Taylor > UF Research Computing > > > >> On Dec 18, 2017, at 8:12 PM, r...@open-m

Re: [OMPI users] openmpi-v3.0.0: error for --map-by

2017-12-20 Thread r...@open-mpi.org
I just checked the head of both the master and 3.0.x branches, and they both work fine: $ mpirun --map-by ppr:1:socket:pe=1 date [rhc001:139231] SETTING BINDING TO CORE [rhc002.cluster:203672] SETTING BINDING TO CORE Wed Dec 20 00:20:55 PST 2017 Wed Dec 20 00:20:55 PST 2017 Tue Dec 19 18:37:03 P

Re: [OMPI users] openmpi-v3.0.0: error for --map-by

2017-12-20 Thread r...@open-mpi.org
> an internal error - the locale of the following process was > not set by the mapper code: > ... > > > Kind regards > > Siegmar > > > On 12/20/17 09:22, r...@open-mpi.org wrote: >> I just checked the head of both the master and 3.0.x bran

Re: [OMPI users] Q: Binding to cores on AWS?

2017-12-22 Thread r...@open-mpi.org
Actually, that message is telling you that binding to core is available, but that we cannot bind memory to be local to that core. You can verify the binding pattern by adding --report-bindings to your cmd line. > On Dec 22, 2017, at 11:58 AM, Brian Dobbins wrote: > > > Hi all, > > We're t
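
For illustration (not from the original thread; ./mpi_app is hypothetical):

    mpirun -np 4 --bind-to core --report-bindings ./mpi_app

With --report-bindings, each daemon prints the binding assigned to every local rank on stderr before the application starts, so you can confirm the pattern directly.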

Re: [OMPI users] Q: Binding to cores on AWS?

2017-12-22 Thread r...@open-mpi.org
> has a big impact. > > Thanks again, and merry Christmas! > - Brian > > > On Fri, Dec 22, 2017 at 1:53 PM, r...@open-mpi.org wrote: > Actually, that message is telling you that binding to core is available,

Re: [OMPI users] latest Intel CPU bug

2018-01-03 Thread r...@open-mpi.org
Well, it appears from that article that the primary impact comes from accessing kernel services. With an OS-bypass network, that shouldn’t happen all that frequently, and so I would naively expect the impact to be at the lower end of the reported scale for those environments. TCP-based systems,

Re: [OMPI users] latest Intel CPU bug

2018-01-04 Thread r...@open-mpi.org
“problem”. * containers and VMs don’t fully resolve the problem - the only solution other than the patches is to limit allocations to single users on a node HTH Ralph > On Jan 3, 2018, at 10:47 AM, r...@open-mpi.org wrote: > > Well, it appears from that article that the primary imp

Re: [OMPI users] latest Intel CPU bug

2018-01-04 Thread r...@open-mpi.org
embarrassing policies until forced to disclose by a > governmental agency being exploited by a foreign power is another example > that shines a harsh light on their ‘best practices’ line. There are many more > like this. Intel isn’t to be trusted for security practices or disclosures > becau

Re: [OMPI users] latest Intel CPU bug

2018-01-05 Thread r...@open-mpi.org
s > > > On 1/5/2018 3:54 PM, John Chludzinski wrote: > That article gives the best technical assessment I've seen of Intel's > architecture bug. I noted the discussion's subject and thought I'd add some > clarity. Nothing more. > > For the TL;DR c

Re: [OMPI users] Setting mpirun default parameters in a file

2018-01-10 Thread r...@open-mpi.org
Set the MCA param “rmaps_base_oversubscribe=1” in your default MCA param file, or in your environment > On Jan 10, 2018, at 4:42 AM, Florian Lindner wrote: > > Hello, > > a recent openmpi update on my Arch machine seems to have enabled > --nooversubscribe, as described in the manpage. Since I
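
A sketch of both options (the file path is the standard per-user location; verify it on your installation):

    # in $HOME/.openmpi/mca-params.conf
    rmaps_base_oversubscribe = 1

    # or, equivalently, in the environment
    export OMPI_MCA_rmaps_base_oversubscribe=1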

[OMPI users] Fabric manager interactions: request for comments

2018-02-05 Thread r...@open-mpi.org
Hello all The PMIx community is starting work on the next phase of defining support for network interactions, looking specifically at things we might want to obtain and/or control via the fabric manager. A very preliminary draft is shown here: https://pmix.org/home/pmix-standard/fabric-manager-

Re: [OMPI users] Using OMPI Standalone in a Windows/Cygwin Environment

2018-02-26 Thread r...@open-mpi.org
There are a couple of problems here. First the “^tcp,self,sm” is telling OMPI to turn off all three of those transports, which probably leaves you with nothing. What you really want is to restrict to shared memory, so your param should be “-mca btl self,sm”. This will disable all transports othe
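
Illustrative invocation (not from the original message; note that in the OMPI 3.x series the shared-memory BTL is named "vader", so "-mca btl self,vader" may be needed there):

    mpirun -np 2 -mca btl self,sm ./mpi_app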

Re: [OMPI users] Cannot compile 1.10.2 under CentOS 7 rdma-core-devel-13-7.el7.x86_64

2018-02-28 Thread r...@open-mpi.org
Add --without-usnic > On Feb 28, 2018, at 7:50 AM, William T Jones wrote: > > I realize that OpenMPI 1.10.2 is quite old, however, for compatibility I > am attempting to compile it after a system upgrade to CentOS 7. > > This system does include infiniband and I have configured as follows > us
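
A sketch of the suggested configure line (prefix and other flags are hypothetical):

    ./configure --prefix=/opt/openmpi-1.10.2 --without-usnic ...
    make -j4 && make install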

Re: [OMPI users] Cannot compile 1.10.2 under CentOS 7 rdma-core-devel-13-7.el7.x86_64

2018-02-28 Thread r...@open-mpi.org
Not unless you have a USNIC card in your machine! > On Feb 28, 2018, at 8:08 AM, William T Jones wrote: > > Thank you! > > Will that have any adverse side effects? > Performance penalties? > > On 02/28/2018 10:57 AM, r...@open-mpi.org wrote: >> Add --without-u

Re: [OMPI users] libopen-pal not found

2018-03-02 Thread r...@open-mpi.org
Not that I’ve heard - you need to put it in your LD_LIBRARY_PATH > On Mar 2, 2018, at 10:15 AM, Mahmood Naderan wrote: > > Hi, > After a successful installation of opmi v3 with cuda enabled, I see that ldd > can not find a right lib file although it exists. /usr/local/lib is one of > the defa
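
For example, assuming the libraries landed under the default /usr/local prefix:

    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Alternatively, adding /usr/local/lib to a file under /etc/ld.so.conf.d/ and running ldconfig makes the library visible system-wide.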

Re: [OMPI users] OpenMPI 3.0.0 on RHEL-7

2018-03-07 Thread r...@open-mpi.org
So long as it is libevent-devel-2.0.22, you should be okay. You might want to up PMIx to v1.2.5 as Slurm 16.05 should handle that okay. OMPI v3.0.0 has PMIx 2.0 in it, but should be okay with 1.2.5 last I checked (but it has been awhile and I can’t swear to it). > On Mar 7, 2018, at 2:03 PM, C

Re: [OMPI users] ARM/Allinea DDT

2018-04-11 Thread r...@open-mpi.org
You probably should provide a little more info here. I know the MPIR attach was broken in the v2.x series, but we fixed that - could be something remains broken in OMPI 3.x. FWIW: I doubt it's an Allinea problem. > On Apr 11, 2018, at 11:54 AM, Charles A Taylor wrote: > > > Contacting ARM se

Re: [OMPI users] ARM/Allinea DDT

2018-04-11 Thread r...@open-mpi.org
This inadvertently was sent directly to me, and Ryan asked if I could please post it on his behalf - he needs to get fully approved on the mailing list. Ralph > On Apr 11, 2018, at 1:19 PM, Ryan Hulguin wrote: > > Hello Charlie Taylor, > > I have replied to your support ticket indicating th

Re: [OMPI users] Error in hello_cxx.cc

2018-04-23 Thread r...@open-mpi.org
Also, I note from the screenshot that you appear to be running on Windows with a Windows binary. Correct? > On Apr 23, 2018, at 7:08 AM, Jeff Squyres (jsquyres) > wrote: > > Can you send all the information listed here: > >https://www.open-mpi.org/community/help/ > > > >> On Apr 22, 2

Re: [OMPI users] openmpi/slurm/pmix

2018-04-23 Thread r...@open-mpi.org
Hi Michael Looks like the problem is that you didn’t wind up with the external PMIx. The component listed in your error is the internal PMIx one which shouldn’t have built given that configure line. Check your config.out and see what happened. Also, ensure that your LD_LIBRARY_PATH is properly
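
For reference, building against an external PMIx typically looks like the sketch below (install path hypothetical); the matching runtime path must then be visible on every node:

    ./configure --with-pmix=/opt/pmix/2.1.1 ...
    export LD_LIBRARY_PATH=/opt/pmix/2.1.1/lib:$LD_LIBRARY_PATH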

Re: [OMPI users] Error in hello_cxx.cc

2018-04-23 Thread r...@open-mpi.org
and 3) we no longer support Windows. You could try using the cygwin port instead. > On Apr 23, 2018, at 7:52 PM, Nathan Hjelm wrote: > > Two things. 1) 1.4 is extremely old and you will not likely get much help > with it, and 2) the c++ bindings were deprecated in MPI-2.2 (2009) and > removed

Re: [OMPI users] openmpi/slurm/pmix

2018-04-25 Thread r...@open-mpi.org
> On Apr 25, 2018, at 8:16 AM, Michael Di Domenico > wrote: > > On Mon, Apr 23, 2018 at 6:07 PM, r...@open-mpi.org wrote: >> Looks like the problem is that you didn’t wind up with the external PMIx. >> The component listed in your error is the internal PMIx one

Re: [OMPI users] Memory leak with pmix_finalize not being called

2018-05-04 Thread r...@open-mpi.org
Ouch - trivial fix. I’m inserting it into PMIx and will provide it to the OMPI team > On May 4, 2018, at 5:20 AM, Saurabh T wrote: > > This is with valgrind 3.0.1 on a CentOS 6 system. It appears pmix_finalize > isn't called and this reports leaks from valgrind despite the provided > suppressi

Re: [OMPI users] Openmpi-3.1.0 + slurm (fixed)

2018-05-08 Thread r...@open-mpi.org
Good news - thanks! > On May 8, 2018, at 7:02 PM, Bill Broadley wrote: > > > Sorry all, > > Chris S over on the slurm list spotted it right away. I didn't have the > MpiDefault set to pmix_v2. > > I can confirm that Ubuntu 18.04, gcc-7.3, openmpi-3.1.0, pmix-2.1.1, and > slurm-17.11.5 seem t

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread r...@open-mpi.org
You got that error because the orted is looking for its rank on the cmd line and not finding it. > On May 14, 2018, at 12:37 PM, Max Mellette wrote: > > Hi Gus, > > Thanks for the suggestions. The correct version of openmpi seems to be > getting picked up; I also prepended .bashrc with the i

Re: [OMPI users] slurm configuration override mpirun command line process mapping

2018-05-16 Thread r...@open-mpi.org
The problem here is that you have made an incorrect assumption. In the older OMPI versions, the -H option simply indicated that the specified hosts were available for use - it did not imply the number of slots on that host. Since you have specified 2 slots on each host, and you told mpirun to la

Re: [OMPI users] slurm configuration override mpirun command line process mapping

2018-05-17 Thread r...@open-mpi.org
mpirun takes the #slots for each node from the slurm allocation. Your hostfile (at least, what you provided) retained that information and shows 2 slots on each node. So the original allocation _and_ your constructed hostfile are both telling mpirun to assign 2 slots on each node. Like I s
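
For concreteness, a hostfile of the kind described (node names hypothetical) would read:

    # hostfile
    node01 slots=2
    node02 slots=2

With this file, "mpirun -np 4 --hostfile hostfile ./mpi_app" places two processes on each node, which matches what the Slurm allocation already dictated.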

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread r...@open-mpi.org
Try running the attached example dynamic code - if that works, then it likely is something to do with how R operates. [attachment: simple_spawn.c] > On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski > wrote: > > Hi, > > I have some problems running R + Rmpi with OpenMPI 3.1.0 + P

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread r...@open-mpi.org
client disconnect >> [localhost.localdomain:11646] ext2x:client disconnect >> >> In your example it's only called once per process. >> >> Do you have any suspicion where the second call comes from? Might this be >> the reason for the hang? >> >>

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread r...@open-mpi.org
absolutely sure* that your application will successfully > and correctly survive a call to fork(), you may disable this warning > by setting the mpi_warn_on_fork MCA parameter to 0. > -- > And the process hangs as wel

Re: [OMPI users] --oversubscribe option

2018-06-06 Thread r...@open-mpi.org
I’m not entirely sure what you are asking here. If you use oversubscribe, we do not bind your processes and you suffer some performance penalty for it. If you want to run one process/thread and retain binding, then do not use --oversubscribe and instead use --use-hwthread-cpus > On Jun 6, 2018
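
An illustrative contrast (hypothetical binary; an 8-core, 16-hwthread node assumed):

    mpirun --oversubscribe -np 16 ./mpi_app       # runs, but processes are left unbound
    mpirun --use-hwthread-cpus -np 16 ./mpi_app   # hwthreads count as slots; binding retained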

Re: [OMPI users] Fwd: OpenMPI 3.1.0 on aarch64

2018-06-07 Thread r...@open-mpi.org
You didn’t show your srun direct launch cmd line or what version of Slurm is being used (and how it was configured), so I can only provide some advice. If you want to use PMIx, then you have to do two things: 1. Slurm must be configured to use PMIx - depending on the version, that might be ther
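
As a sketch, the usual checks are (plugin names vary with the Slurm version):

    srun --mpi=list                    # should list pmix or pmix_v2
    srun --mpi=pmix_v2 -n 4 ./mpi_app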

Re: [OMPI users] Fwd: OpenMPI 3.1.0 on aarch64

2018-06-07 Thread r...@open-mpi.org
root 16 Jun 1 08:32 > /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so > -rwxr-xr-x 1 root root 828232 May 30 15:20 > /opt/slurm/lib64/slurm/mpi_pmix_v2.so > > > Let me know if anything else would be helpful. > > Thanks,-- bennet > > On Thu,

Re: [OMPI users] Fwd: OpenMPI 3.1.0 on aarch64

2018-06-07 Thread r...@open-mpi.org
-utils/lib > > > I don't have a saved build log, but I can rebuild this and save the > build logs, in case any information in those logs would help. > > I will also mention that we have, in the past, used the > --disable-dlopen and --enable-shared flags, which we did not use

Re: [OMPI users] Fwd: OpenMPI 3.1.0 on aarch64

2018-06-08 Thread r...@open-mpi.org
solved. Thanks very much Ralph and Artem for your > help! > > -- bennet > > > On Thu, Jun 7, 2018 at 11:06 AM r...@open-mpi.org wrote: > Odd - Artem, do you have any suggestions? > > >

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread r...@open-mpi.org
Add --enable-debug to your OMPI configure cmd line, and then add --mca plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon isn’t starting - this will give you some info as to why. > On Jun 17, 2018, at 9:07 AM, Bennet Fauber wrote: > > I have a compiled binary that
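
Concretely (paths and host names hypothetical), the two steps are:

    ./configure --enable-debug --prefix=/opt/ompi-debug ...
    mpirun --mca plm_base_verbose 10 -np 2 -H node1,node2 hostname

The verbose output includes the command used to start the remote orted and whatever error it sends back, which usually pinpoints the launch failure.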

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread r...@open-mpi.org
TIME NODES > NODELIST(REASON) > 158 standard bash bennet R 14:30 1 cav01 > [bennet@cavium-hpc ~]$ srun hostname > cav01.arc-ts.umich.edu > [ repeated 23 more times ] > > As always, your help is much appreciated, > > -- bennet > > On S

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread r...@open-mpi.org
instead of ORTE_SUCCESS > > > At one point, I am almost certain that OMPI mpirun did work, and I am > at a loss to explain why it no longer does. > > I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0, > and we'll try downgrading SLURM to a prior version.

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread r...@open-mpi.org
>> the daemon itself. We cannot recover from this failure, and >> therefore will terminate the job. >> -- >> >> That makes sense, I guess. >> >> I'll keep you posted as to what hap

Re: [OMPI users] Enforcing specific interface and subnet usage

2018-06-18 Thread r...@open-mpi.org
I’m not entirely sure I understand what you are trying to do. The PMIX_SERVER_URI2 envar tells local clients how to connect to their local PMIx server (i.e., the OMPI daemon on that node). This is always done over the loopback device since it is a purely local connection that is never used for

Re: [OMPI users] Enforcing specific interface and subnet usage

2018-06-19 Thread r...@open-mpi.org
environment > On Jun 19, 2018, at 2:08 AM, Maksym Planeta > wrote: > > But what about remote connections parameter? Why is it not set? > > On 19/06/18 00:58, r...@open-mpi.org wrote: >> I’m not entirely sure I understand what you are trying to do. The >> PMIX_SERVER_UR

Re: [OMPI users] new core binding issues?

2018-06-22 Thread r...@open-mpi.org
I suspect it is okay. Keep in mind that OMPI itself is starting multiple progress threads, so that is likely what you are seeing. The binding pattern in the mpirun output looks correct, as the default would be to map-by socket and you asked that we bind-to core. > On Jun 22, 2018, at 9:33 AM, No

Re: [OMPI users] new core binding issues?

2018-06-22 Thread r...@open-mpi.org
Afraid I’m not familiar with that option, so I really don’t know :-( > On Jun 22, 2018, at 10:13 AM, Noam Bernstein > wrote: >> On Jun 22, 2018, at 1:00 PM, r...@open-mpi.org >> wrote: >> >> I suspect it is okay. Keep in min

Re: [OMPI users] Disable network interface selection

2018-06-22 Thread r...@open-mpi.org
> On Jun 22, 2018, at 7:31 PM, Gilles Gouaillardet > wrote: > > Carlos, > > By any chance, could > > mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 ... > > work for you ? > > Which Open MPI version are you running ? > > > IIRC, subnets are internally translated
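
For reference, a sketch pairing the management and MPI-traffic layers, which are selected independently (note that overriding btl_tcp_if_exclude replaces its default, so the loopback exclusion should be restated):

    mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 \
           --mca btl_tcp_if_exclude 127.0.0.1/8,192.168.100.0/24 ./mpi_app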

Re: [OMPI users] Disable network interface selection

2018-06-22 Thread r...@open-mpi.org
> On Jun 22, 2018, at 8:25 PM, r...@open-mpi.org wrote: > > >> On Jun 22, 2018, at 7:31 PM, Gilles Gouaillardet >> wrote: >> >> Carlos, >> >> By any chance, could >> >> mpiru

Re: [OMPI users] Verbose output for MPI

2018-07-04 Thread r...@open-mpi.org
Try adding PMIX_MCA_ptl_base_verbose=10 to your environment > On Jul 4, 2018, at 8:51 AM, Maksym Planeta > wrote: > > Thanks for quick response, > > I tried this out and I do get more output: https://pastebin.com/JkXAYdM4. But > the line I need does not appear in the output. > > On 04/07/18
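
That is, in the shell that launches the job (a sketch; the PMIX_MCA_ prefix is read by the PMIx library itself rather than by Open MPI):

    export PMIX_MCA_ptl_base_verbose=10
    mpirun -np 2 ./mpi_app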

Re: [OMPI users] Locking down TCP ports used

2018-07-07 Thread r...@open-mpi.org
I suspect the OOB is working just fine and you are seeing the TCP/btl opening the other ports. There are two TCP elements at work here: the OOB (which sends management messages between daemons) and the BTL (which handles the MPI traffic). In addition to what you provided, you also need to provid
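
A hedged sketch of pinning down both layers (parameter names from the 3.x series; port values hypothetical, and the exact list/range syntax accepted may vary by release):

    mpirun --mca oob_tcp_static_ipv4_ports 46000-46099 \
           --mca btl_tcp_port_min_v4 46100 \
           --mca btl_tcp_port_range_v4 100 ./mpi_app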

Re: [OMPI users] Error in file base/plm_base_launch_support.c: OPAL_HWLOC_TOPO

2018-07-21 Thread r...@open-mpi.org
More than likely the problem is the difference in hwloc versions - sounds like the topology to/from xml is different between the two versions, and the older one doesn’t understand the new one. > On Jul 21, 2018, at 12:04 PM, Brian Smith > wrote: > > Greetings, > > I'm having trouble getting
