Re: [OMPI users] growing memory use from MPI application

2019-06-21 Thread Noam Bernstein via users
ware. That's a good idea. I'll try that. If only SuperMicro didn't make it so difficult to find the correct firmware. Noam

Re: [OMPI users] growing memory use from MPI application

2019-06-21 Thread Noam Bernstein via users
Perhaps I spoke too soon. Now, with the Mellanox OFED stack, we occasionally get the following failure on exit: [compute-4-20:68008:0:68008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10) 0 0x0002a3c5 opal_free_list_destruct() opal_free_list.c:0 1 0x

Re: [OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 5:02 PM, Ralph Castain wrote: > > > > Too many emails to track :-( > > Should just be “--map-by core --rank-by core” - nothing fancy required. > Sounds like you are getting --map-by node, or at least --rank-by node, which > means somebody has set an MCA param either in
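A minimal sketch of the suggested invocation (process count and executable name are illustrative); --report-bindings makes mpirun print the resulting placement so the mapping can be verified:

    mpirun -np 4 --map-by core --rank-by core --report-bindings ./a.out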

Re: [OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
pu 0 and 1,3,5… on cpu 1. Noam

Re: [OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 4:04 PM, Ralph Castain via users > wrote: > > I’m unaware of any “map-to cartofile” option, nor do I find it in mpirun’s > help or man page. Are you seeing it somewhere? From "mpirun --help": tin 1431 : mpirun --help mapping mpirun (Open MPI) 4.0.1 Usage: mpirun [OPTION]

[OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
Hi - are there any examples of the cartofile format? Or is there some combo of --map, --rank, or --bind to achieve this mapping? [BB/..][../..] [../BB][../..] [../..][BB/..] [../..][../BB] I tried everything I could think of for --bind-to, --map-by, and --rank-by, and I can’t get it to happen. I can

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 1:38 PM, Nathan Hjelm via users > wrote: > > THAT is a good idea. When using Omnipath we see an issue with stale files in > /dev/shm if the application exits abnormally. I don't know if UCX uses that > space as well. No stale shm files. echo 3 > /proc/sys/vm/drop_caches
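For reference, a sketch of the two checks being discussed here (the cache drop must be run as root):

    ls -l /dev/shm                      # look for stale shared-memory segments after an abnormal exit
    echo 3 > /proc/sys/vm/drop_caches   # drop page/dentry/inode caches to rule out cache growth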

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 10:42 AM, Noam Bernstein via users > wrote: > > I haven’t yet tried the latest OFED or Mellanox low level stuff. That’s next > on my list, but slightly more involved to do, so I’ve been avoiding it. > Aha - using Mellanox’s OFED packaging seems to e

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
suggestions. Noam

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S > wrote: > > As of recent you needed to use --with-slurm and --with-pmi2 > > While the configure line indicates it picks up pmi2 as part of slurm that is > not in fact true and you need to specifically tell it about pmi2 When I do “./configu
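A sketch of a configure line following that advice; the flags are the ones named above, the prefix is the one from the original report, and exact paths are site-specific:

    ./configure --with-slurm --with-pmi2 \
        --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu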

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres) > wrote: > > On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users > wrote: >> >> Hi Jeff - do you remember this issue from a couple of months ago? > > Noam: I'm sorry, I totally misse

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 9:40 AM, Jeff Squyres (jsquyres) > wrote: > > On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users > wrote: >> >> One thing that I’m wondering if anyone familiar with the internals can >> explain is how you get a memory leak that isn’t

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
esn’t that suggest that it’s something lower level, like maybe a kernel issue? Noam

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
the other is down to about 1 GB. Noam

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Noam

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
I tried to disable ucx (successfully, I think - I replaced the “--mca btl ucx --mca btl ^vader,tcp,openib” with “--mca btl_openib_allow_ib 1”, and attaching gdb to a running process shows no ucx-related routines active). It still has the same fast growing (1 GB/s) memory usage problem.

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Noam

[OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Hi - we’re having a weird problem with OpenMPI on our newish infiniband EDR (mlx5) nodes. We're running CentOS 7.6, with all the infiniband and ucx libraries as provided by CentOS, i.e. ucx-1.4.0-1.el7.x86_64 libibverbs-utils-17.2-3.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libibumad-17.2-3.el7.x8

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-14 Thread Noam Bernstein via users
thanks, Noam > On Mar 23, 2019, at 10:07 AM, Noam Bernstein > wrote: > > Sadly, doesn't seem to be helping. From config.log: > It was created by Open MPI configure 4.0.1rc3, which was > generated by GNU Autoconf 2.69. Invocati

[OMPI users] OpenMPI 4 and pmi2 support

2019-03-22 Thread Noam Bernstein via users
Hi - I'm trying to compile openmpi 4.0.0 with srun support, so I'm trying to tell openmpi's configure where to find the relevant files by doing $ ./configure --with-verbs --with-ofi --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib64 --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu verbs an

[OMPI users] hang (in Bcast?) with OpenMPI 3.1.3

2018-11-27 Thread Noam Bernstein
Hi all - I've been trying to debug a segfault in OpenMPI 3.1.2, and in the process I noticed that 3.1.3 is out, so I thought I'd test it. However, with 3.1.3 the code (LAMMPS) hangs very early, in dealing with input. I'm running 16 tasks on a single 16 core node, with Infiniband (which it may

Re: [OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-10 Thread Noam Bernstein
> On Oct 10, 2018, at 4:51 AM, Dave Love wrote: > > RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in > case that's the problem. (Fixed in 3.10.0-862.14.4.) I strongly suspect that this is it. In the process of getting everything organized to collect the info various people

Re: [OMPI users] --mca btl params

2018-10-09 Thread Noam Bernstein
> On Oct 9, 2018, at 7:02 PM, Noam Bernstein > wrote: > >> On Oct 9, 2018, at 6:01 PM, Jeffrey A Cummings > <mailto:jeffrey.a.cummi...@aero.org>> wrote: >> >> What are the allowable values for the --mca btl parameter on the mpirun >> command l

Re: [OMPI users] --mca btl params

2018-10-09 Thread Noam Bernstein
> On Oct 9, 2018, at 6:01 PM, Jeffrey A Cummings > wrote: > > What are the allowable values for the --mca btl parameter on the mpirun > command line? That's basically what the output of ompi_info -a says. So it appears, for the moment at least, like things are magically better. In the proce
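A sketch of how to list the available btl components and their parameters with ompi_info (the --level option applies to OMPI 1.7 and later; exact output varies by version):

    ompi_info | grep "MCA btl"            # btl components compiled into this build
    ompi_info --param btl all --level 9   # all btl parameters and their current values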

[OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-09 Thread Noam Bernstein
Hi - I’m trying to get OpenMPI working on a newly configured CentOS 7 system, and I’m not even sure what information would be useful to provide. I’m using the CentOS built in libibverbs and/or libfabric, and I configure openmpi with just --with-verbs --with-ofi --prefix=$DEST also tried --w

Re: [OMPI users] Seg fault in opal_progress

2018-07-16 Thread Noam Bernstein
> On Jul 16, 2018, at 8:34 AM, Noam Bernstein <mailto:noam.bernst...@nrl.navy.mil>> wrote: > >> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users >> mailto:users@lists.open-mpi.org>> wrote: >> >> Please give master a try. This looks like anot

Re: [OMPI users] Seg fault in opal_progress

2018-07-16 Thread Noam Bernstein
> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users > wrote: > > Please give master a try. This looks like another signature of running out of > space for shared memory buffers. Sorry, I wasn’t explicit on this point - I’m already using master, specifically openmpi-master-201807120327-34bc77

Re: [OMPI users] Seg fault in opal_progress

2018-07-13 Thread Noam Bernstein
Just to summarize for the list. With Jeff’s prodding I got it generating core files with the debug (and mem-debug) version of openmpi, and below is the kind of stack trace I’m getting from gdb. It looks slightly different when I use a slightly different implementation that doesn’t use MPI_INPL

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 11:58 AM, Jeff Squyres (jsquyres) > wrote: > > > > (You may have already done this; I just want to make sure we're on the same > sheet of music here…) I’m not talking about the job script or shell startup files. The actual “executable” passed to mpirun on the command

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 11:02 AM, Jeff Squyres (jsquyres) > wrote: > > On Jul 12, 2018, at 10:59 AM, Noam Bernstein > wrote: >> >>> Do you get core files? >>> >>> Loading up the core file in a debugger might give us more information. >>

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
at (or the lack of line info in the stack trace). Could be an intel compiler issue? Noam

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 8:37 AM, Noam Bernstein > wrote: > > I’m going to try the 3.1.x 20180710 nightly snapshot next. Same behavior, exactly - segfault, no debugging info beyond the vasp routine that calls m

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
I’ve recompiled 3.1.1 with --enable-debug --enable-mem-debug, and I still get no detailed information from the mpi libraries, only VASP (as before): ldd (at runtime, so I’m fairly sure it’s referring to the right executable and LD_LIBRARY_PATH) info: vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamm

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users > wrote: >>> >> >> After more extensive testing it’s clear that it still happens with 2.1.3, >> but much less frequently. I’m going to try to get more detailed info with >> version 3.1.1, where it’s easier to reproduce. objdu

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
--enable-debug Is there any way I can confirm that the version of the openmpi library I think I’m using really was compiled with debugging? Noam
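One way to check is ompi_info, which reports whether the library was configured with debugging (a sketch; the exact wording of the label may vary by version):

    ompi_info | grep -i debug     # look for "Internal debug support: yes"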

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 9:58 AM, Noam Bernstein > wrote: > >> On Jul 10, 2018, at 5:15 PM, Noam Bernstein > <mailto:noam.bernst...@nrl.navy.mil>> wrote: >> >> >> >> What are useful steps I can do to debug? Recompile with --enable-debug? Are

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 10, 2018, at 5:15 PM, Noam Bernstein > wrote: > > > > What are useful steps I can do to debug? Recompile with --enable-debug? Are > there any other versions that are worth trying? I don’t recall this error > happening bef

[OMPI users] Seg fault in opal_progress

2018-07-10 Thread Noam Bernstein
Hi OpenMPI users - I’m trying to debug a non-deterministic crash, apparently in opal_progress, with OpenMPI 3.1.0. All of them seem to involve mpi_allreduce, although it’s different particular calls from this code (VASP), and they seem more frequent for larger core/mpi task counts (128 happens

Re: [OMPI users] new core binding issues?

2018-06-22 Thread Noam Bernstein
> On Jun 22, 2018, at 2:14 PM, Brice Goglin wrote: > > If psr is the processor where the task is actually running, I guess we'd need > your lstopo output to find out where those processors are in the machine. > Excellent, that’s exactly the sort of thing I was hoping someone on the list would

Re: [OMPI users] new core binding issues?

2018-06-22 Thread Noam Bernstein
t mpirun asked for? Noam

[OMPI users] new core binding issues?

2018-06-22 Thread Noam Bernstein
Hi - for the last couple of weeks, more or less since we did some kernel updates, certain compute intensive MPI jobs have been behaving oddly as far as their speed - bits that should be quite fast sometimes (but not consistently) take a long time, and re-running sometimes fixes the issue, someti

Re: [OMPI users] mpi send/recv pair hangin

2018-04-10 Thread Noam Bernstein
> On Apr 10, 2018, at 4:20 AM, Reuti wrote: > >> >> Am 10.04.2018 um 01:04 schrieb Noam Bernstein > <mailto:noam.bernst...@nrl.navy.mil>>: >> >>> On Apr 9, 2018, at 6:36 PM, George Bosilca >> <mailto:bosi...@icl.utk.edu>> wrote: >

Re: [OMPI users] mpi send/recv pair hangin

2018-04-09 Thread Noam Bernstein
n with OMP_NUM_THREADS explicitly 1 if you’d like to exclude that as a possibility. opal_config.h is attached, from ./opal/include/opal_config.h in the build directory. Noam

Re: [OMPI users] mpi send/recv pair hangin

2018-04-08 Thread Noam Bernstein
Noam

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread Noam Bernstein
pected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174 [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561 [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385

Re: [OMPI users] mpi send/recv pair hangin

2018-04-06 Thread Noam Bernstein
> On Apr 5, 2018, at 4:11 PM, George Bosilca wrote: > > I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". > This allows the debugger to make a call our function, and output internal > information about the library status. OK - after a number of missteps, I recompile
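A sketch of that procedure; the PID and frame number are illustrative, and 'comm' must be a communicator variable visible in the selected frame:

    gdb -p 15673                          # attach to one hung rank
    (gdb) frame 3                         # pick a frame where the MPI_Comm is in scope
    (gdb) call mca_pml_ob1_dump(comm, 1)  # dump pending communications, verbose mode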

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
guess I need to recompile ompi in debug mode? Is that just a flag to configure? thanks, Noam

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 3:55 PM, George Bosilca wrote: > > Noam, > > The OB1 provide a mechanism to dump all pending communications in a > particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), > with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
of hanging. Noam

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 11:03 AM, Reuti wrote: > > Hi, > >> Am 05.04.2018 um 16:16 schrieb Noam Bernstein : >> >> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange >> way. Basically, there’s a Cartesian communicator, 4x16 (64 proces

[OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across eac

[OMPI users] latest Intel CPU bug

2018-01-03 Thread Noam Bernstein
Out of curiosity, have any of the OpenMPI developers tested (or care to speculate) how strongly affected OpenMPI based codes (just the MPI part, obviously) will be by the proposed Intel CPU memory-mapping-related kernel patches that are all the rage? https://arstechnica.com/gadgets/201

Re: [OMPI users] IMB-MPI1 hangs after 30 minutes with Open MPI 3.0.0 (was: Openmpi 1.10.4 crashes with 1024 processes)

2017-12-01 Thread Noam Bernstein
> On Dec 1, 2017, at 8:10 AM, Götz Waschk wrote: > > On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk wrote: >> I have attached my slurm job script, it will simply do an mpirun >> IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for >> instance, vader is enabled. > I have tested a

Re: [OMPI users] --map-by

2017-11-27 Thread Noam Bernstein
> On Nov 21, 2017, at 8:53 AM, r...@open-mpi.org wrote: > >> On Nov 21, 2017, at 5:34 AM, Noam Bernstein > <mailto:noam.bernst...@nrl.navy.mil>> wrote: >> >>> >>> On Nov 20, 2017, at 7:02 PM, r...@open-mpi.org <mailto:r...@open-mpi.org>

Re: [OMPI users] --map-by

2017-11-21 Thread Noam Bernstein
Noam

Re: [OMPI users] --map-by

2017-11-16 Thread Noam Bernstein
Noam

[OMPI users] --map-by

2017-11-16 Thread Noam Bernstein
Hi all - I’m trying to run mixed MPI/OpenMP, so I ideally want binding of each MPI process to a small set of cores (to allow for the OpenMP threads). From the mpirun docs at https://www.open-mpi.org//doc/current/man1/mpirun.1.php I got
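A sketch of one way to get that kind of binding with the options from that man page (counts are illustrative: 2 ranks per socket, 4 cores each, one OpenMP thread per core; the ppr/pe syntax varies somewhat across releases):

    export OMP_NUM_THREADS=4
    mpirun -np 8 --map-by ppr:2:socket:pe=4 --report-bindings ./a.out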

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Noam Bernstein
of the tasks). Yours is similar, but not actually the same, since it’s actually trying to stop the task, and one would at least hope that OpenMPI could detect it and exit. Noam

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Noam Bernstein
ect ) at fock.F:1413 #8 0x02976478 in vamp () at main.F:2093 #9 0x00412f9e in main () #10 0x00383a41ed1d in __libc_start_main () from /lib64/libc.so.6 #11 0x00412ea9 in _start ()

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
is not being called, it’s just a process dying). Maybe the patch in Ralph’s e-mail fixes it. Noam

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
y some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case. Noam

Re: [OMPI users] malloc related crash inside openmpi

2016-11-25 Thread Noam Bernstein
> On Nov 24, 2016, at 10:52 AM, r...@open-mpi.org wrote: > > Just to be clear: are you saying that mpirun exits with that message? Or is > your application process exiting with it? > > There is no reason for mpirun to be looking for that library. > > The library in question is in the /lib/openm

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
on may have not fully worked and I didn’t notice. What’s the name of the library it’s looking for? Noam

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
\ --with-verbs=/usr \ --with-verbs-libdir=/usr/lib64 Followed by “make install” Any suggestions for getting 2.0.1 working? thanks, Noam

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 23, 2016, at 3:08 PM, Noam Bernstein > wrote: > >> On Nov 23, 2016, at 3:02 PM, George Bosilca > <mailto:bosi...@icl.utk.edu>> wrote: >> >> Noam, >> >> I do not recall exactly which version of Open MPI was affected, but we had &

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
e on the merits of going to 1.10 vs. 2.0 (from 1.8)? thanks, Noam

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 17, 2016, at 3:22 PM, Noam Bernstein > wrote: > > Hi - we’ve started seeing over the last few days crashes and hangs in > openmpi, in a code that hasn’t been touched in months, and an openmpi > installation (v. 1.8.5) that also hasn’t been touched in months. T

[OMPI users] malloc related crash inside openmpi

2016-11-17 Thread Noam Bernstein
Hi - we’ve started seeing over the last few days crashes and hangs in openmpi, in a code that hasn’t been touched in months, and an openmpi installation (v. 1.8.5) that also hasn’t been touched in months. The symptoms are either a hang, with a stack trace (from attaching to the one running proc

Re: [OMPI users] single CPU vs four CPU result differences, is it normal?

2015-10-28 Thread Noam Bernstein
recommended. Noam

Re: [OMPI users] OpenMPI (1.8.3) and environment variable export

2015-06-12 Thread Noam Bernstein
> On Jun 12, 2015, at 11:08 AM, borno_bo...@gmx.de wrote: > > Hey there, > > I know that variable export in general can be done with the -x option of > mpirun, but I guess that won't help me. > More precisely I have a heterogeneous cluster (number of cores per cpu) and > one process for each n
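For reference, the -x usage being referred to (variable name and value are illustrative):

    mpirun -x OMP_NUM_THREADS=4 -np 16 ./a.out   # export/set a variable in every rank's environment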

Re: [OMPI users] new hwloc error

2015-06-01 Thread Noam Bernstein
> On Jun 1, 2015, at 5:09 PM, Ralph Castain wrote: > > This probably isn’t very helpful, but fwiw: we added an automatic > “fingerprint” capability in the later OMPI versions just to detect things > like this. If the fingerprint of a backend node doesn’t match the head node, > we automatically

Re: [OMPI users] new hwloc error

2015-06-01 Thread Noam Bernstein
> On Apr 30, 2015, at 1:16 PM, Noam Bernstein > wrote: > >> On Apr 30, 2015, at 12:03 PM, Ralph Castain wrote: >> >> The planning is pretty simple: at startup, mpirun launches a daemon on each >> node. If —hetero-nodes is provided, each daemon returns the

Re: [OMPI users] new hwloc error

2015-04-30 Thread Noam Bernstein
> On Apr 29, 2015, at 5:59 PM, Ralph Castain wrote: > > Try adding --hetero-nodes to the cmd line and see if that helps resolve the > problem. Of course, if all the machines are identical, then it won’t They are identical, and the problem is new. That’s what’s most mysterious about it. Can

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 29, 2015, at 4:09 PM, Brice Goglin wrote: > > Nothing wrong in that XML. I don't see what could be happening besides a > node rebooting with hyper-threading enabled for random reasons. > Please run "lstopo foo.xml" again on the node next time you get the OMPI > failure (assuming you get

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 29, 2015, at 12:47 PM, Brice Goglin wrote: > > Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since > 16 doesn't exist at all. > Can you run "lstopo foo.xml" on one node where it failed, and send the > foo.xml that got generated? Just want to make sure we don't have

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 28, 2015, at 4:54 PM, Brice Goglin wrote: > > Hello, > Can you build hwloc and run lstopo on these nodes to check that everything > looks similar? > You have hyperthreading enabled on all nodes, and you're trying to bind > processes to entire cores, right? > Does 0,16 correspond to two
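A sketch of such a comparison (hostnames are illustrative): export each node's topology to XML and diff the two files; a node that silently rebooted with hyperthreading enabled shows up immediately:

    lstopo good-node.xml    # on a node that behaves
    lstopo bad-node.xml     # on a node that fails
    diff good-node.xml bad-node.xml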

[OMPI users] new hwloc error

2015-04-28 Thread Noam Bernstein
Noam

Re: [OMPI users] Question about scheduler support

2014-05-15 Thread Noam Bernstein
I’m not sure how this would apply to other options, but for the scheduler, what about no scheduler-related options defaults to everything enabled (like before), but having any explicit scheduler enable option disables by default all the other schedulers? Multiple explicit enable options would en

Re: [OMPI users] EXTERNAL: Re: Problem with shell when launching jobs with OpenMPI 1.6.5 rsh

2014-04-07 Thread Noam Bernstein
On Apr 7, 2014, at 4:36 PM, Blosch, Edwin L wrote: > I guess this is not OpenMPI related anymore. I can repeat the essential > problem interactively: > > % echo $SHELL > /bin/csh > > % echo $SHLVL > 1 > > % cat hello > echo Hello > > % /bin/bash hello > Hello > > % /bin/csh hello > Hello

Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Noam Bernstein
On Mar 20, 2014, at 2:13 PM, Ralph Castain wrote: > > On Mar 20, 2014, at 9:48 AM, Beichuan Yan wrote: > >> Hi, >> >> Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm! >> >> I found discussions related to this issue: >> >> 1. http://www.open-mpi.org/community/lists/user

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Noam Bernstein
On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote: > Bernd Dammann wrote: >> Using the workaround '--bind-to-core' does only make sense for those jobs, >> that allocate full nodes, but the majority of our jobs don't do that. > Why ? > We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenF

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Noam Bernstein
On Dec 18, 2013, at 5:19 PM, Martin Siegert wrote: > > Thanks for figuring this out. Does this work for 1.6.x as well? > The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity > covers versions 1.2.x to 1.5.x. > Does 1.6.x support mpi_paffinity_alone = 1 ? > I set this in openmpi-m

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Noam Bernstein
On Dec 18, 2013, at 10:32 AM, Dave Love wrote: > Noam Bernstein writes: > >> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some >> collective communication), but now I'm wondering whether I should just test >> 1.6.5. > > What bug,

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-18 Thread Noam Bernstein
Thanks to all who answered my question. The culprit was an interaction between 1.7.3 not supporting mpi_paffinity_alone (which we were using previously) and the new kernel. Switching to --bind-to core (actually the environment variable OMPI_MCA_hwloc_base_binding_policy=core) fixed the problem
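The two equivalent ways of requesting core binding mentioned here, as used with OMPI 1.7.x and later:

    export OMPI_MCA_hwloc_base_binding_policy=core   # via the environment
    mpirun --bind-to core ...                        # via the command line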

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote: > Are you binding the procs? We don't bind by default (this will change in > 1.7.4), and binding can play a significant role when comparing across kernels. > > add "--bind-to-core" to your cmd line Now that it works, is there a way to set it v

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote: > Are you binding the procs? We don't bind by default (this will change in > 1.7.4), and binding can play a significant role when comparing across kernels. > > add "--bind-to-core" to your cmd line Yeay - it works. Thank you very much for the

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 17, 2013, at 11:04 AM, Ralph Castain wrote: > Are you binding the procs? We don't bind by default (this will change in > 1.7.4), and binding can play a significant role when comparing across kernels. > > add "--bind-to-core" to your cmd line I've previously always used mpi_paffinity_alo

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Noam Bernstein
On Dec 16, 2013, at 5:40 PM, Noam Bernstein wrote: > > Once I have some more detailed information I'll follow up. OK - I've tried to characterize the behavior with vasp, which accounts for most of our cluster usage, and it's quite odd. I ran my favorite benchmarking job

[OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-16 Thread Noam Bernstein
Has anyone tried to use openmpi 1.7.3 with the latest CentOS kernel (well, nearly latest: 2.6.32-431.el6.x86_64), and especially with infiniband? I'm seeing lots of weird slowdowns, especially when using infiniband, but even when running with "--mca btl self,sm" (it's much worse with IB, though

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote: > Hi Noam > > Could it be that Torque, or probably more likely NFS, > is too slow to create/make available the PBS_NODEFILE? > > What if you insert a "sleep 2", > or whatever number of seconds you want, > before the mpiexec command line? > Or may

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:36 AM, Noam Bernstein wrote: > > On Sep 20, 2013, at 10:22 AM, Reuti wrote: > >> >> Is the location for the spool directory local or shared by NFS? Disk full? > > No - locally mounted, and far from full on all the nodes. Another new obser

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:22 AM, Reuti wrote: > > Is the location for the spool directory local or shared by NFS? Disk full? No - locally mounted, and far from full on all the nodes. Noam

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote: > > Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE > was there, but now it seems like every time the job fails it's because this > file really is missing. Time to check why torque isn't

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 9:55 AM, Noam Bernstein wrote: > > This is completely unrepeatable - resubmitting the same job almost > always works the second time around. The line appears to be > associated with looking for the torque/maui generated node file, > and when I do somethi

[OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
/var/spool/torque/aux//4600.tin). thanks, Noam

Re: [OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-03 Thread Noam Bernstein
Thanks to everyone who answered, in particular Ake Sandgren, it appears to be a weird problem with acml that somehow triggers a seg fault in libmpi, but only when running on Opterons. I'd still be interested in figuring out how to get a more complete backtrace, but at least the immediate problem i

[OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-02 Thread Noam Bernstein
Hi - I've been trying to run VASP 5.2.12 with ScaLAPACK and openmpi 1.6.x on a single 32 core (4 x 8 core) Opteron node, purely shared memory. We've always had occasional hangs with older OpenMPI versions (1.4.3 and 1.5.5) on these machines, but infrequent enough to be usable and not worth my tim

Re: [OMPI users] MPI/FORTRAN on a cluster system

2012-08-20 Thread Noam Bernstein
On Aug 20, 2012, at 11:12 AM, David Warren wrote: > The biggest issue you may have is that gnu fortran does not support all the > fortran constructs that all the others do. Most fortrans have supported the > standard plus the DEC extensions. Gnu fortran does not quite get all the > standards. I

Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Noam Bernstein
man mpirun . . . -output-filename, --output-filename Redirect the stdout, stderr, and stddiag of all ranks to a rank-unique version of the specified filename. Any directories in the filename will automatically be created. Each output file will consist of fi
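A sketch of how that isolates per-rank output (the exact file naming differs across Open MPI releases):

    mpirun -np 4 --output-filename out ./a.out
    # each rank's stdout lands in its own file, e.g. out.0 ... out.3,
    # or in a per-rank subdirectory, depending on the version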

Re: [OMPI users] Problem with mpi_comm_spawn_multiple

2010-05-07 Thread Noam Bernstein
I haven't been following this whole discussion, but I do know something about how Fortran allocates and passes string argument (the joys of Fortran/C/python inter-language calls), for what it's worth. By definition in the Fortran language all strings have a predefined length, which Fortran magic
