Re: [OMPI users] EXTERNAL: Re: Problem with shell when launching jobs with OpenMPI 1.6.5 rsh

2014-04-07 Thread Noam Bernstein
On Apr 7, 2014, at 4:36 PM, Blosch, Edwin L wrote: > I guess this is not OpenMPI related anymore. I can repeat the essential > problem interactively: > > % echo $SHELL > /bin/csh > > % echo $SHLVL > 1 > > % cat hello > echo Hello > > % /bin/bash hello > Hello > > % /bin/csh hello > Hello

Re: [OMPI users] Question about scheduler support

2014-05-15 Thread Noam Bernstein
I’m not sure how this would apply to other options, but for the scheduler, what about no scheduler-related options defaults to everything enabled (like before), but having any explicit scheduler enable option disables by default all the other schedulers? Multiple explicit enable options would en

[OMPI users] malloc related crash inside openmpi

2016-11-17 Thread Noam Bernstein
Hi - we’ve started seeing over the last few days crashes and hangs in openmpi, in a code that hasn’t been touched in months, and an openmpi installation (v. 1.8.5) that also hasn’t been touched in months. The symptoms are either a hang, with a stack trace (from attaching to the one running proc

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 17, 2016, at 3:22 PM, Noam Bernstein > wrote: > > Hi - we’ve started seeing over the last few days crashes and hangs in > openmpi, in a code that hasn’t been touched in months, and an openmpi > installation (v. 1.8.5) that also hasn’t been touched in months. T

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
e on the merits of going to 1.10 vs. 2.0 (from 1.8)? thanks, Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Ph

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 23, 2016, at 3:08 PM, Noam Bernstein > wrote: > >> On Nov 23, 2016, at 3:02 PM, George Bosilca > <mailto:bosi...@icl.utk.edu>> wrote: >> >> Noam, >> >> I do not recall exactly which version of Open MPI was affected, but we had &

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
\ --with-verbs=/usr \ --with-verbs-libdir=/usr/lib64 Followed by “make install” Any suggestions for getting 2.0.1 working? thanks, Noam || |U.S. NAVAL| |_RESEARCH_| LABORATO

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
on may have not fully worked and I didn’t notice. What’s the name of the library it’s looking for? Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials P

Re: [OMPI users] malloc related crash inside openmpi

2016-11-25 Thread Noam Bernstein
> On Nov 24, 2016, at 10:52 AM, r...@open-mpi.org wrote: > > Just to be clear: are you saying that mpirun exits with that message? Or is > your application process exiting with it? > > There is no reason for mpirun to be looking for that library. > > The library in question is in the /lib/openm

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
y some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case. Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Tec

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
is not being called, it’s just a process dying). Maybe the patch in Ralph’s e-mail fixes it. Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Noam Bernstein
ect ) at fock.F:1413 #8 0x02976478 in vamp () at main.F:2093 #9 0x00412f9e in main () #10 0x00383a41ed1d in __libc_start_main () from /lib64/libc.so.6 #11 0x00412ea9 in _start () || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Mate

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Noam Bernstein
of the tasks). Yours is similar, but not actually the same, since it’s actually trying to stop the task, and one would at least hope that OpenMPI could detect it and exit. Noam ____ || |U.S.

Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Noam Bernstein
man mpirun . . . -output-filename, --output-filename Redirect the stdout, stderr, and stddiag of all ranks to a rank-unique version of the specified filename. Any directories in the filename will automatically be created. Each output file will consist of fi
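
A minimal usage sketch of the option quoted above (the executable name and output path are placeholders; the exact per-rank file naming differs between OMPI versions):

  % mpirun -np 4 --output-filename logs/run ./my_app
  # each rank's stdout/stderr goes to its own file under logs/,
  # e.g. run.1.0 for rank 0, or a rank-numbered subdirectory in newer releases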

Re: [OMPI users] MPI/FORTRAN on a cluster system

2012-08-20 Thread Noam Bernstein
On Aug 20, 2012, at 11:12 AM, David Warren wrote: > The biggest issue you may have is that gnu fortran does not support all the > fortran constructs that all the others do. Most fortrans have supported the > standard plus the DEC extentions. Gnu fortran does not quite get all the > standards.I

[OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-02 Thread Noam Bernstein
Hi - I've been trying to run VASP 5.2.12 with ScaLAPACK and openmpi 1.6.x on a single 32 core (4 x 8 core) Opteron node, purely shared memory. We've always had occasional hangs with older OpenMPI versions (1.4.3 and 1.5.5) on these machines, but infrequent enough to be usable and not worth my tim

Re: [OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-03 Thread Noam Bernstein
Thanks to everyone who answered, in particular Ake Sandgren, it appears to be a weird problem with acml that somehow triggers a seg fault in libmpi, but only when running on Opterons. I'd still be interested in figuring out how to get a more complete backtrace, but at least the immediate problem i

[OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
/var/spool/torque/aux//4600.tin). thanks, Noam Noam Bernstein Center for Computational Materials Science NRL Code 6390 noam.bernst...@nrl.navy.mil

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 9:55 AM, Noam Bernstein wrote: > > This is completely unrepeatable - resubmitting the same job almost > always works the second time around. The line appears to be > associated with looking for the torque/maui generated node file, > and when I do somethi

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote: > > Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE > was there, but now it seems like every time the job fails it's because this > file really is missing. Time to check why torque isn't

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:22 AM, Reuti wrote: > > Is the location for the spool directory local or shared by NFS? Disk full? No - locally mounted, and far from full on all the nodes. Noam

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:36 AM, Noam Bernstein wrote: > > On Sep 20, 2013, at 10:22 AM, Reuti wrote: > >> >> Is the location for the spool directory local or shared by NFS? Disk full? > > No - locally mounted, and far from full on all the nodes. Another new obser

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote: > Hi Noam > > Could it be that Torque, or probably more likely NFS, > is too slow to create/make available the PBS_NODEFILE? > > What if you insert a "sleep 2", > or whatever number of seconds you want, > before the mpiexec command line? > Or may

[OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-16 Thread Noam Bernstein
Has anyone tried to use openmpi 1.7.3 with the latest CentOS kernel (well, nearly latest: 2.6.32-431.el6.x86_64), and especially with infiniband? I'm seeing lots of weird slowdowns, especially when using infiniband, but even when running with "--mca btl self,sm" (it's much worse with IB, though

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 16, 2013, at 5:40 PM, Noam Bernstein wrote: > > Once I have some more detailed information I'll follow up. OK - I've tried to characterize the behavior with vasp, which accounts for most of our cluster usage, and it's quite odd. I ran my favorite benchmarking job

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote: > Are you binding the procs? We don't bind by default (this will change in > 1.7.4), and binding can play a significant role when comparing across kernels. > > add "--bind-to-core" to your cmd line I've previously always used mpi_paffinity_alo

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote: > Are you binding the procs? We don't bind by default (this will change in > 1.7.4), and binding can play a significant role when comparing across kernels. > > add "--bind-to-core" to your cmd line Yeay - it works. Thank you very much for the

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote: > Are you binding the procs? We don't bind by default (this will change in > 1.7.4), and binding can play a significant role when comparing across kernels. > > add "--bind-to-core" to your cmd line Now that it works, is there a way to set it v

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Noam Bernstein
Thanks to all who answered my question. The culprit was an interaction between 1.7.3 not supporting mpi_paffinity_alone (which we were using previously) and the new kernel. Switching to --bind-to core (actually the environment variable OMPI_MCA_hwloc_base_binding_policy=core) fixed the problem
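
For reference, a sketch of the two equivalent ways to request core binding that this resolution refers to (rank count and executable are placeholders; the older flag spelling was --bind-to-core before the 1.7.4 option rework):

  % mpirun --bind-to core -np 16 ./my_app
  % export OMPI_MCA_hwloc_base_binding_policy=core   # set once, then every plain mpirun picks it up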

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Noam Bernstein
On Dec 18, 2013, at 10:32 AM, Dave Love wrote: > Noam Bernstein writes: > >> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some >> collective communication), but now I'm wondering whether I should just test >> 1.6.5. > > What bug,

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Noam Bernstein
On Dec 18, 2013, at 5:19 PM, Martin Siegert wrote: > > Thanks for figuring this out. Does this work for 1.6.x as well? > The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity > covers versions 1.2.x to 1.5.x. > Does 1.6.x support mpi_paffinity_alone = 1 ? > I set this in openmpi-m

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Noam Bernstein
On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote: > Bernd Dammann wrote: >> Using the workaround '--bind-to-core' does only make sense for those jobs, >> that allocate full nodes, but the majority of our jobs don't do that. > Why ? > We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenF

Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Noam Bernstein
On Mar 20, 2014, at 2:13 PM, Ralph Castain wrote: > > On Mar 20, 2014, at 9:48 AM, Beichuan Yan wrote: > >> Hi, >> >> Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm! >> >> I found discussions related to this issue: >> >> 1. http://www.open-mpi.org/community/lists/user

[OMPI users] mpi_paffinity_alone and Nehalem SMT

2009-10-23 Thread Noam Bernstein
Hi all - we have a new Nehalem cluster (dual quad core), and SMT is enabled in the BIOS (for now). I do want to do benchmarking on our applications, obviously, but I was also wondering what happens if I just set the number of slots to 8 in SGE, and just let things run. It particular, how will

Re: [OMPI users] Problem with mpi_comm_spawn_multiple

2010-05-07 Thread Noam Bernstein
I haven't been following this whole discussion, but I do know something about how Fortran allocates and passes string argument (the joys of Fortran/C/python inter-language calls), for what it's worth. By definition in the Fortran language all strings have a predefined length, which Fortran magic

[OMPI users] new hwloc error

2015-04-28 Thread Noam Bernstein
Noam --- Noam Bernstein Center for Computational Materials Science Naval Research Laboratory Code 6390 noam.bernst...@nrl.navy.mil phone: 202 404 8628 smime.p7s Description: S/MIME cryptographic signature

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 28, 2015, at 4:54 PM, Brice Goglin wrote: > > Hello, > Can you build hwloc and run lstopo on these nodes to check that everything > looks similar? > You have hyperthreading enabled on all nodes, and you're trying to bind > processes to entire cores, right? > Does 0,16 correspond to two

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 29, 2015, at 12:47 PM, Brice Goglin wrote: > > Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since > 16 doesn't exist at all. > Can you run "lstopo foo.xml" on one node where it failed, and send the > foo.xml that got generated? Just want to make sure we don't have

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 29, 2015, at 4:09 PM, Brice Goglin wrote: > > Nothing wrong in that XML. I don't see what could be happening besides a > node rebooting with hyper-threading enabled for random reasons. > Please run "lstopo foo.xml" again on the node next time you get the OMPI > failure (assuming you get

Re: [OMPI users] new hwloc error

2015-04-30 Thread Noam Bernstein
> On Apr 29, 2015, at 5:59 PM, Ralph Castain wrote: > > Try adding —hetero-nodes to the cmd line and see if that helps resolve the > problem. Of course, if all the machines are identical, then it won’t They are identical, and the problem is new. That’s what’s most mysterious about it. Can

Re: [OMPI users] new hwloc error

2015-06-01 Thread Noam Bernstein
> On Apr 30, 2015, at 1:16 PM, Noam Bernstein > wrote: > >> On Apr 30, 2015, at 12:03 PM, Ralph Castain wrote: >> >> The planning is pretty simple: at startup, mpirun launches a daemon on each >> node. If —hetero-nodes is provided, each daemon returns the

Re: [OMPI users] new hwloc error

2015-06-01 Thread Noam Bernstein
> On Jun 1, 2015, at 5:09 PM, Ralph Castain wrote: > > This probably isn’t very helpful, but fwiw: we added an automatic > “fingerprint” capability in the later OMPI versions just to detect things > like this. If the fingerprint of a backend node doesn’t match the head node, > we automatically

Re: [OMPI users] OpenMPI (1.8.3) and environment variable export

2015-06-12 Thread Noam Bernstein
> On Jun 12, 2015, at 11:08 AM, borno_bo...@gmx.de wrote: > > Hey there, > > I know that variable export in general can be done with the -x option of > mpirun, but I guess that won't help me. > More precisely I have a heterogeneous cluster (number of cores per cpu) and > one process for each n

Re: [OMPI users] single CPU vs four CPU result differences, is it normal?

2015-10-28 Thread Noam Bernstein
recommended. Noam ---- Noam Bernstein Center for Materials Physics and Technology NRL Code 6390 noam.bernst...@nrl.navy.mil phone: 703 683 2783

[OMPI users] --map-by

2017-11-16 Thread Noam Bernstein
Hi all - I’m trying to run mixed MPI/OpenMP, so I ideally want binding of each MPI process to a small set of cores (to allow for the OpenMP threads). From the mpirun docs at https://www.open-mpi.org//doc/current/man1/mpirun.1.php I got
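
A hedged sketch of the kind of command line this question is aiming for (the rank count, PE width, and thread count are illustrative only, and the modifier syntax varies somewhat between OMPI versions):

  % mpirun -np 8 --map-by socket:PE=4 --bind-to core -x OMP_NUM_THREADS=4 ./my_app
  # each MPI rank is bound to 4 cores within a socket, leaving room for 4 OpenMP threads per rank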

Re: [OMPI users] --map-by

2017-11-16 Thread Noam Bernstein
|_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.navy.mil <https://www.nrl.navy.mil/> ___ users mailing list users@lists.

Re: [OMPI users] --map-by

2017-11-21 Thread Noam Bernstein
Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.navy.mil <https://www

Re: [OMPI users] --map-by

2017-11-27 Thread Noam Bernstein
> On Nov 21, 2017, at 8:53 AM, r...@open-mpi.org wrote: > >> On Nov 21, 2017, at 5:34 AM, Noam Bernstein > <mailto:noam.bernst...@nrl.navy.mil>> wrote: >> >>> >>> On Nov 20, 2017, at 7:02 PM, r...@open-mpi.org <mailto:r...@open-mpi.org>

Re: [OMPI users] IMB-MPI1 hangs after 30 minutes with Open MPI 3.0.0 (was: Openmpi 1.10.4 crashes with 1024 processes)

2017-12-01 Thread Noam Bernstein
> On Dec 1, 2017, at 8:10 AM, Götz Waschk wrote: > > On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk wrote: >> I have attached my slurm job script, it will simply do an mpirun >> IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for >> instance, vader is enabled. > I have tested a

[OMPI users] latest Intel CPU bug

2018-01-03 Thread Noam Bernstein
Out of curiosity, have any of the OpenMPI developers tested (or care to speculate) how strongly affected OpenMPI based codes (just the MPI part, obviously) will be by the proposed Intel CPU memory-mapping-related kernel patches that are all the rage? https://arstechnica.com/gadgets/201

[OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across eac

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 11:03 AM, Reuti wrote: > > Hi, > >> Am 05.04.2018 um 16:16 schrieb Noam Bernstein : >> >> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange >> way. Basically, there’s a Cartesian communicator, 4x16 (64 proces

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
of hanging. Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.nav

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 3:55 PM, George Bosilca wrote: > > Noam, > > The OB1 provide a mechanism to dump all pending communications in a > particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), > with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
guess I need to recompile ompi in debug mode? Is that just a flag to configure? thanks, Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Be

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread Noam Bernstein
> On Apr 5, 2018, at 4:11 PM, George Bosilca wrote: > > I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". > This allows the debugger to make a call our function, and output internal > information about the library status. OK - after a number of missteps, I recompile
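
The sequence George describes, roughly (the pid and the communicator variable are placeholders; the communicator argument must be the C object visible in the frame of the attached, hung process, and a debug build makes the symbol easier to resolve):

  % gdb -p <pid-of-hung-rank>
  (gdb) call mca_pml_ob1_dump(comm, 1)
  # prints the pending sends/receives on that communicator, per peer rank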

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread Noam Bernstein
pected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174 [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561 [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385 ____ || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Mate

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread Noam Bernstein
Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.navy.mil <https://www.nrl.navy.mil/> ___

Re: [OMPI users] mpi send/recv pair hangin

2018-04-09 Thread Noam Bernstein
n with OMP_NUM_THREADS explicitly 1 if you’d like to exclude that as a possibility.  opal_config.h is attached, from ./opal/include/opal_config.h in the build directory. Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D.Center for Materials P

Re: [OMPI users] mpi send/recv pair hangin

2018-04-10 Thread Noam Bernstein
> On Apr 10, 2018, at 4:20 AM, Reuti wrote: > >> >> Am 10.04.2018 um 01:04 schrieb Noam Bernstein > <mailto:noam.bernst...@nrl.navy.mil>>: >> >>> On Apr 9, 2018, at 6:36 PM, George Bosilca >> <mailto:bosi...@icl.utk.edu>> wrote: >

[OMPI users] new core binding issues?

2018-06-22 Thread Noam Bernstein
Hi - for the last couple of weeks, more or less since we did some kernel updates, certain compute intensive MPI jobs have been behaving oddly as far as their speed - bits that should be quite fast sometimes (but not consistently) take a long time, and re-running sometimes fixes the issue, someti

Re: [OMPI users] new core binding issues?

2018-06-22 Thread Noam Bernstein
t mpirun asked for? Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.navy.mil <https://www.nrl.navy.mil/> _

Re: [OMPI users] new core binding issues?

2018-06-22 Thread Noam Bernstein
> On Jun 22, 2018, at 2:14 PM, Brice Goglin wrote: > > If psr is the processor where the task is actually running, I guess we'd need > your lstopo output to find out where those processors are in the machine. > Excellent, that’s exactly the sort of thing I was hoping someone on the list would

[OMPI users] Seg fault in opal_progress

2018-07-10 Thread Noam Bernstein
Hi OpenMPI users - I’m trying to debug a non-deterministic crash, apparently in opal_progress, with OpenMPI 3.1.0. All of them seem to involve mpi_allreduce, although it’s different particular calls from this code (VASP), and they seem more frequent for larger core/mpi task counts (128 happens

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 10, 2018, at 5:15 PM, Noam Bernstein > wrote: > > > > What are useful steps I can do to debug? Recompile with —enable-debug? Are > there any other versions that are worth trying? I don’t recall this error > happening bef

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 9:58 AM, Noam Bernstein > wrote: > >> On Jul 10, 2018, at 5:15 PM, Noam Bernstein > <mailto:noam.bernst...@nrl.navy.mil>> wrote: >> >> >> >> What are useful steps I can do to debug? Recompile with —enable-debug? Are

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
--enable-debug Is there any way I can confirm that the version of the openmpi library I think I’m using really was compiled with debugging? Noam || |U.S. NAVAL| |_RESEARCH_| LABORATOR
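
One quick way to answer that question, as a sketch (run it against the same installation prefix the job actually loads):

  % ompi_info | grep -i debug
  # a build configured with --enable-debug reports "Internal debug support: yes"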

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users > wrote: >>> >> >> After more extensive testing it’s clear that it still happens with 2.1.3, >> but much less frequently. I’m going to try to get more detailed info with >> version 3.1.1, where it’s easier to reproduce. objdu

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
I’ve recompiled 3.1.1 with —enable-debug —enable-mem-debug, and I still get no detailed information from the mpi libraries, only VASP (as before): ldd (at runtime, so I’m fairly sure it’s referring to the right executable and LD_LIBRARY_PATH) info: vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamm

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 8:37 AM, Noam Bernstein > wrote: > > I’m going to try the 3.1.x 20180710 nightly snapshot next. Same behavior, exactly - segfault, no debugging info beyond the vasp routine that calls m

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
at (or the lack of line info in the stack trace). Could be an intel compiler issue? Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Labor

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 11:02 AM, Jeff Squyres (jsquyres) > wrote: > > On Jul 12, 2018, at 10:59 AM, Noam Bernstein > wrote: >> >>> Do you get core files? >>> >>> Loading up the core file in a debugger might give us more information. >>

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 11:58 AM, Jeff Squyres (jsquyres) > wrote: > > > > (You may have already done this; I just want to make sure we're on the same > sheet of music here…) I’m not talking about the job script or shell startup files. The actual “executable” passed to mpirun on the command

Re: [OMPI users] Seg fault in opal_progress

2018-07-13 Thread Noam Bernstein
Just to summarize for the list. With Jeff’s prodding I got it generating core files with the debug (and mem-debug) version of openmpi, and below is the kind of stack trace I’m getting from gdb. It looks slightly different when I use a slightly different implementation that doesn’t use MPI_INPL

Re: [OMPI users] Seg fault in opal_progress

2018-07-16 Thread Noam Bernstein
> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users > wrote: > > Please give master a try. This looks like another signature of running out of > space for shared memory buffers. Sorry, I wasn’t explicit on this point - I’m already using master, specifically openmpi-master-201807120327-34bc77

Re: [OMPI users] Seg fault in opal_progress

2018-07-16 Thread Noam Bernstein
> On Jul 16, 2018, at 8:34 AM, Noam Bernstein <mailto:noam.bernst...@nrl.navy.mil>> wrote: > >> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users >> mailto:users@lists.open-mpi.org>> wrote: >> >> Please give master a try. This looks like anot

[OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-09 Thread Noam Bernstein
Hi - I’m trying to get OpenMPI working on a newly configured CentOS 7 system, and I’m not even sure what information would be useful to provide. I’m using the CentOS built in libibverbs and/or libfabric, and I configure openmpi with just —with-verbs —with-ofi —prefix=$DEST also tried —w

Re: [OMPI users] --mca btl params

2018-10-09 Thread Noam Bernstein
> On Oct 9, 2018, at 6:01 PM, Jeffrey A Cummings > wrote: > > What are the allowable values for the –mca btl parameter on the mpirun > command line? That's basically what the output of ompi_info -a says. So it appears, for the moment at least, like things are magically better. In the proce
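
A sketch of the quickest way to see what the ompi_info answer above refers to (output trimmed; the component list depends on how the installation was built):

  % ompi_info | grep "MCA btl"
  # lists the BTL components available in this build, e.g. self, vader/sm, tcp, openib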

Re: [OMPI users] --mca btl params

2018-10-09 Thread Noam Bernstein
> On Oct 9, 2018, at 7:02 PM, Noam Bernstein > wrote: > >> On Oct 9, 2018, at 6:01 PM, Jeffrey A Cummings > <mailto:jeffrey.a.cummi...@aero.org>> wrote: >> >> What are the allowable values for the –mca btl parameter on the mpirun >> command l

Re: [OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-10 Thread Noam Bernstein
> On Oct 10, 2018, at 4:51 AM, Dave Love wrote: > > RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in > case that's the problem. (Fixed in 3.10.0-862.14.4.) I strongly suspect that this is it. In the process of getting everything organized to collect the info various people

[OMPI users] hang (in Bcast?) with OpenMPI 3.1.3

2018-11-27 Thread Noam Bernstein
Hi all - I've been trying to debug a segfault in OpenMPI 3.1.2, and in the process I noticed that 3.1.3 is out, so I thought I'd test it. However, with 3.1.3 the code (LAMMPS) hangs very early, in dealing with input. I'm running 16 tasks on a single 16 core node, with Infiniband (which it may

Re: [OMPI users] scaling problem with openmpi

2009-05-18 Thread Noam Bernstein
On May 18, 2009, at 12:50 PM, Pavel Shamis (Pasha) wrote: Roman, Can you please share with us Mvapich numbers that you get . Also what is mvapich version that you use. Default mvapich and openmpi IB tuning is very similar, so it is strange to see so big difference. Do you know what kind of

[OMPI users] CP2K mpi hang

2009-05-18 Thread Noam Bernstein
Hi all - I have a bizarre OpenMPI hanging problem. I'm running an MPI code called CP2K (related to, but not the same as cpmd). The complications of the software aside, here are the observations: At the base is a serial code that uses system() calls to repeatedly invoke mpirun cp2k.pop

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 8:29 AM, Jeff Squyres wrote: fork() support in OpenFabrics has always been dicey -- it can lead to random behavior like this. Supposedly it works in a specific set of circumstances, but I don't have a recent enough kernel on my machines to test. It's best not to use

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 9:32 AM, Ashley Pittman wrote: Can you confirm that *all* processes are in PMPI_Allreduce at some point, the collectives commonly get blamed for a lot of hangs and it's not always the correct place to look. For the openmpi run, every single process showed one of those two

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 12:13 PM, Ashley Pittman wrote: On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote: I'd suspect the filesystem too, except that it's hung up in an MPI call. As I said before, the whole thing is bizarre. It doesn't matter where the executable is, j

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 12:13 PM, Ashley Pittman wrote: That is indeed odd but it shouldn't be too hard to track down, how often does the failure occur? Presumably when you say you have three invocations of the program they communicate via files, is the location of these files changing? Yeay.

[OMPI users] mpirun delay

2009-06-09 Thread Noam Bernstein
I have a serial code that repeatedly calls OpenMPI mpirun on a parallel code. Each run takes either 10 or 100 seconds, and the whole process repeated thousands of times. Each invocation of mpirun is gradually slower (adding maybe 15-20 seconds per run after about 1000 runs). This is with Op

Re: [OMPI users] 50% performance reduction due to OpenMPI v 1.3.2 forcing all MPI traffic over Ethernet instead of using Infiniband

2009-06-24 Thread Noam Bernstein
On Jun 23, 2009, at 6:19 PM, Gus Correa wrote: Hi Jim, list On my OpenMPI 1.3.2 ompi_info -config gives: Wrapper extra LIBS: -lrdmacm -libverbs -ltorque -lnuma -ldl -Wl,-- export-dynamic -lnsl -lutil -lm -ldl Yours doesn't seem to have the IB libraries: -lrdmacm -libverbs So, I would gues

Re: [OMPI users] 50% performance reduction due to OpenMPI v 1.3.2forcing all MPI traffic over Ethernet instead of using Infiniband

2009-06-24 Thread Noam Bernstein
On Jun 24, 2009, at 11:05 AM, Jim Kress wrote: Noam, Gus and List, Did you statically link your openmpi when you built it? If you did (the default is NOT to do this) then that could explain the discrepancy. Not explicitly: env CC=gcc CXX=g++ F77=ifort FC=ifort ./configure --prefix=/shar

[OMPI users] OpenMPI 4 and pmi2 support

2019-03-22 Thread Noam Bernstein via users
Hi - I'm trying to compile openmpi 4.0.0 with srun support, so I'm trying to tell openmpi's configure where to find the relevant files by doing $ ./configure --with-verbs --with-ofi --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib64 --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu verbs an
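
For context, a hedged sketch of how such a build is normally exercised once configure finds PMI (task count and binary are placeholders; the option name assumes a PMI2-enabled Slurm):

  % srun --mpi=pmi2 -n 32 ./my_app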

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-14 Thread Noam Bernstein via users
thanks, Noam > On Mar 23, 2019, at 10:07 AM, Noam Bernstein > wrote: > > Sadly, doesn't seem to be helping. From config.log: > It was created by Open MPI configure 4.0.1rc3, which was > generated by GNU Autoconf 2.69. Invocati

[OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Hi - we’re having a weird problem with OpenMPI on our newish infiniband EDR (mlx5) nodes. We're running CentOS 7.6, with all the infiniband and ucx libraries as provided by CentOS, i.e. ucx-1.4.0-1.el7.x86_64 libibverbs-utils-17.2-3.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libibumad-17.2-3.el7.x8

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.navy.mil <https://www.nrl.navy.mil/> __

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
I tried to disable ucx (successfully, I think - I replaced the “—mca btl ucx —mca btl ^vader,tcp,openib” with “—mca btl_openib_allow_ib 1”, and attaching gdb to a running process shows no ucx-related routines active). It still has the same fast growing (1 GB/s) memory usage problem.
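
For readers puzzling over the option names quoted above: components are excluded by prefixing the selection value with '^', and in builds where UCX is supplied as a pml component the exclusion would look roughly like the sketch below (whether that matches the build in this thread is not confirmed):

  % mpirun --mca pml ^ucx --mca btl_openib_allow_ib 1 -np 16 ./my_app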

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.navy.mil <https://www.nrl.navy.mil/> _

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
the other is down to about 1 GB. Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory T +1 202 404 8628 F +1 202 404 7546 https://www.nrl.na

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
esn’t that suggest that it’s something lower level, like maybe a kernel issue? Noam || |U.S. NAVAL| |_RESEARCH_| LABORATORY Noam Bernstein, Ph.D. Center for Materials Physics and Technology U.S. Naval Research Laboratory

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 9:40 AM, Jeff Squyres (jsquyres) > wrote: > > On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users > wrote: >> >> One thing that I’m wondering if anyone familiar with the internals can >> explain is how you get a memory leak that isn’t

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres) > wrote: > > On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users > wrote: >> >> Hi Jeff - do you remember this issue from a couple of months ago? > > Noam: I'm sorry, I totally misse
