"Gabriel, Edgar" writes:
> Hm, thanks for the report, I will look into this. I did not run the
> romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
> should not have any problems on a regular unix fs. How many processes
> did you use, and which tests did you run specifically? Th
"Gabriel, Edgar" writes:
> Ok, thanks. I usually run these tests with 4 or 8, but the major item
> is that atomicity is one of the areas that are not well supported in
> ompio (along with data representations), so a failure in those tests
> is not entirely surprising.
If it's not expected to wo
RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in
case that's the problem. (Fixed in 3.10.0-862.14.4.)
For what it's worth, I found the following from running ROMIO's tests
with OMPIO on Lustre mounted without flock (or localflock). I used 48
processes on two nodes with Lustre for tests which don't require a
specific number.
OMPIO fails the atomicity, misc, and error tests on ext4; it additionally
fai
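For concreteness, running one of those tests against a chosen io
component looks something like this (test name, path, and process count
are illustrative; the ROMIO test programs take a -fname argument):
  $ mpirun -np 48 --mca io ompio ./atomicity -fname /lustre/scratch/$USER/atomtest
  $ mpirun -np 48 --mca io romio321 ./atomicity -fname /lustre/scratch/$USER/atomtest
(The ROMIO component name varies by Open MPI release.)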
"Latham, Robert J." writes:
> it's hard to implement fcntl-lock-free versions of Atomic mode and
> Shared file pointer so file systems like PVFS don't support those modes
> (and return an error indicating such at open time).
Ah. For some reason I thought PVFS had the support to pass the tests
s
"Gabriel, Edgar" writes:
> a) if we detect a Lustre file system without flock support, we can
> print out an error message. Completely disabling MPI I/O is not
> possible at the moment in the ompio architecture, since the Lustre
> component can disqualify itself, but the generic Unix FS component
If you try to build somewhere out of tree, not in a subdir of the
source, the Fortran build is likely to fail because mpi-ext-module.F90
does
include '/openmpi-4.0.0/ompi/mpiext/pcollreq/mpif-h/mpiext_pcollreq_mpifh.h'
and can exceed the fixed line length. It either needs to add (the
com
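(For builds hit by this, an untested workaround on my part for gfortran
might be to lift the Fortran line-length limit at configure time, e.g.
  $ ./configure FCFLAGS=-ffree-line-length-none ...
though the cleaner fix is presumably a shorter or relative include path.)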
"Jeff Squyres (jsquyres) via users" writes:
> Hi Dave; thanks for reporting.
>
> Yes, we've fixed this -- it should be included in 4.0.1.
>
> https://github.com/open-mpi/ompi/pull/6121
Good, but I'm confused; I checked the repo before reporting it.
[I w
Jeff Hammond writes:
> Preprocessor is fine in Fortran compilers. We’ve used it in NWChem for many
> years, and NWChem supports “all the compilers”.
>
> Caveats:
> - Cray dislikes recursive preprocessing logic that other compilers handle.
> You won’t use this so please ignore.
> - IBM XLF requires -
Is it possible to use the environment or mpirun flags to run an OMPI
that's been relocated from where it was configured/installed? (Say
you've unpacked a system package that expects to be under /usr and want
to run it from home without containers etc.) I thought that was
possible, but I haven't f
Reuti writes:
> export OPAL_PREFIX=
>
> to point it to the new location of installation before you start `mpiexec`.
Thanks; that's now familiar, and I don't know how I missed it with
strings.
It should be documented. I'd have expected --prefix to have the same
effect, and for there to be an MC
Reuti writes:
>> It should be documented.
>
> There is this FAQ entry:
>
> https://www.open-mpi.org/faq/?category=building#installdirs
For what it's worth, I looked under "running" in the FAQ, as I was after
a runtime switch. I expect FAQs to point to the actual documentation,
though, and an en
"Jeff Squyres (jsquyres) via users" writes:
> Reuti's right.
>
> Sorry about the potentially misleading use of "--prefix" -- we
> basically inherited that CLI option from a different MPI
> implementation (i.e., people asked for it). So we were locked into
> that meaning for the "--prefix" CLI op
In fact, setting OPAL_PREFIX by itself doesn't work for a relocated tree
(with OMPI 1.10 or 3.0). You also need $OPAL_PREFIX/lib and
$OPAL_PREFIX/lib/openmpi on LD_LIBRARY_PATH (assuming $MPI_LIB=$MPI_HOME/lib):
$ OPAL_PREFIX=$(pwd)/usr/lib64/openmpi3 ./usr/lib64/openmpi3/bin/mpirun true
./usr/
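Spelled out, the combination implied above looks something like this (a
sketch; paths are illustrative and assume a bash-like shell):
  $ export OPAL_PREFIX=$HOME/relocated/openmpi3
  $ export PATH=$OPAL_PREFIX/bin:$PATH
  $ export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$OPAL_PREFIX/lib/openmpi:$LD_LIBRARY_PATH
  $ mpirun -np 2 ./a.out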
Jeff Squyres writes:
> Could the nodes be running out of shared memory and/or temp filesystem
> space?
I'm also seeing this non-reproducibly (on OpenSuSE 10.3, with Sun's
Clustertools 8.1 prerelease on dual Barcelona nodes during PMB runs
under SGE). I haven't had time to build the final 1.3 re
Prentice Bisbal writes:
> I just installed OpenMPI 1.3 with tight integration for SGE. Version
> 1.2.8 was working just fine for several months in the same arrangement.
>
> Now that I've upgraded to 1.3, I get the following errors in my standard
> error file:
>
> mca_common_sm_mmap_init: open /tm
M C writes:
> --- MCA component crs:blcr (m4 configuration macro)
> checking for MCA component crs:blcr compile mode... dso
> checking --with-blcr value... sanity check ok (/opt/blcr)
> checking --with-blcr-libdir value... sanity check ok (/opt/blcr/lib)
> configure: WARNING: BLCR support request
Rolf Vandevaart writes:
>> However, I found that if I explicitly specify the "-machinefile
>> $TMPDIR/machines", all 8 mpi processes were spawned within a single
>> node, i.e. node0002.
I had that sort of behaviour recently when the tight integration was
broken on the installation we'd been given
Josh Hursey writes:
> The configure flag that you are looking for is:
> --with-ft=cr
Is there a good reason why --with-blcr doesn't imply it?
> You may also want to consider using the thread options too for
> improved C/R response:
> --enable-mpi-threads --enable-ft-thread
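(Putting the quoted flags together with the BLCR paths from the earlier
configure output, the configure line under discussion would look
something like this -- a sketch, not a tested recommendation:)
  $ ./configure --with-blcr=/opt/blcr --with-blcr-libdir=/opt/blcr/lib \
        --with-ft=cr --enable-mpi-threads --enable-ft-thread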
Incidentally, the
Rolf Vandevaart writes:
> No, orte_leave_session_attached is needed to avoid the errno=2 errors
> from the sm btl. (It is fixed in 1.3.2 and trunk)
[It does cause other trouble, but I forget what the exact behaviour was
when I lost it as a default.]
>> Yes, but there's a problem with the recomm
I wrote:
> E.g. on
> 8-core nodes, if you submit a 16-process job, there are four cores left
> over on the relevant nodes which might get something else scheduled on
> them.
Of course, that doesn't make much sense because I thought `12' and typed
`16' for some reason... Thanks to Rolf for off-li
Josh Hursey writes:
> Thanks. I'll fix this and post a new draft soon (I have a few other
> items to put in there anyway).
One thing to note in the meantime is that building with BLCR failed for
me with the PGI compiler with a link-time message about a bad file
format. I assume it's a libtool
It's not reproducible, but I sometimes see messages like
[node01:29645] MX BTL delete procs
when running 1.3.1 with Open-MX and the MX BTL. Looking at the code, it's a
dummy routine, but I didn't get as far as figuring out why it's
(sometimes) called and what its significance is. Can someone expl
Scott Atchley writes:
> I believe the answer is yes as long as all NICs are in the same fabric
> (they usually are).
Thanks. Do you mean it won't if, in this case, the two NICs are on
separate switches?
George Bosilca writes:
> It is not the BTL who open the second endpoint, it is the MTL. It's a
> very long story, but unfortunately right now the two components (MTL
> and BTL) each open an endpoint. Once the upper level completes the
> selection of the component for the run, one of the endpoints
Scott Atchley writes:
> George's answer supersedes mine. You must be using the MX bonding
> driver to use more than one NIC per host.
Will that be relevant for Open-MX, which I'm using rather than normal
MX? (I'm afraid I don't know anything about how MX systems work
generally.) For what it's
nce the default integer size used by g95 is 8 bytes
but the openmpi fortran interface was compiled with f77 which uses 4
byte integers.
Any suggestions on what to look for?
Thanks for the help,
Dave
program parallel_sum_mmnts
real(kind=8):: zmmnts(0:360,28,0:8)
c Use reduct
useful. This is a major issue since my parallel code heavily depends on
having the ability to open X windows on the remote machine. Any and all
help would be appreciated!
Thanks!
Dave
issue with the X server (xorg) or with the version of linux,
so I am also seeking help from the person who maintains caos linux. If
it matters, the machine uses myrinet for the interconnects.
Thanks!
Dave
Galen Shipman wrote:
what does your command line look like?
- Galen
On Nov 29,
Re: [OMPI users] x11 forwarding
I don't think that that is the problem. As far as I can tell, the
DISPLAY environment variable is being set properly on the slave (it
will sometimes have a different value than in the shell where mpirun
was executed).
Dave
Ralph H Castain
my problem.
Dave
Galen Shipman wrote:
I think this might be as simple as adding "-d" to the mpirun command
line
If I run:
mpirun -np 2 -d -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping
All is well, I get the
eing picky.
Thanks!
Dave
Galen Shipman wrote:
-d leaves the ssh session open
Try using:
mpirun -d -host boxtop2 -mca pls_rsh_agent "ssh -X -n" xterm -e cat
Note the "ssh -X -n", this will tell ssh not to open stdin..
You should then be
Is there a place where I can hack the openmpi code to force it to keep
the ssh sessions open without the -d option? I looked through some of
the code, including orterun.c and a few other places, but don't have
the familiarity with the code to find the place.
Thanks!
Dave
Galen Sh
ew command line flag to keep the ssh sessions running
without turning on the debugging output. I know that others have the
same XForwarding problem and this would offer a general solution.
Thanks for all of your help!!
Dave
Ralph Castain wrote:
I’m afraid that would be a rather signi
run "autoreconf" by hand, make sure to run the "./autogen.sh" script that
is packaged with OMPI. It will also check your versions and warn you if they
are out of date.
Do you need to build OMPI from the SVN source? Or would a (pre-autogen'ed)
release tarball work for you?
-Dave
not pass a value,
then it is "/usr/local". Then reinstall (with "make install" in the OMPI build
tree).
What I think is happening is that you still have an "mca_btl_usnic.so" file
leftover from the last time you installed OMPI (before passing
"--enable-mca-no-build=btl-usnic"). So OMPI is using this shared library and
you get exactly the same problem.
-Dave
On Apr 2, 2014, at 12:57 PM, Filippo Spiga wrote:
> I still do not understand why this keeps appearing...
>
> srun: cluster configuration lacks support for cpu binding
>
> Any clue?
I don't know what causes that message. Ralph, any thoughts here?
-Dave
a different MPI implementation than you
are using to run it (e.g., MPICH vs. Open MPI).
-Dave
I don't know of any workaround. I've created a ticket to track this, but it
probably won't be very high priority in the short term:
https://svn.open-mpi.org/trac/ompi/ticket/4575
-Dave
On Apr 25, 2014, at 3:27 PM, Jamil Appa wrote:
>
> Hi
>
> The fol
ent, since any page you gift away should probably come from
mmap(2) directly).
Otherwise, as George mentioned, I would investigate converting your current
data collector processes to also be MPI processes so that they can simply
communicate the data to the rest of the cluster.
-Dave
/3772826/158513.
-Dave
On Sep 29, 2014, at 1:34 PM, Ralph Castain wrote:
> Afraid I cannot replicate a problem with singleton behavior in the 1.8 series:
>
> 11:31:52 /home/common/openmpi/v1.8/orte/test/mpi$ ./hello foo bar
> Hello, World, I am 0 of 1 [0 local peers]:
On Nov 24, 2014, at 12:06 AM, George Bosilca wrote:
> https://github.com/open-mpi/ompi/pull/285 is a potential answer. I would like
> to hear Dave Goodell comment on this before pushing it upstream.
>
> George.
I'll take a look at it today. My notification settings were m
requests (assuming they can be progressed). The following should
not deadlock:
✂
for (...) MPI_Isend(...)
for (...) MPI_Irecv(...)
MPI_Waitall(send_requests...)
MPI_Waitall(recv_requests...)
✂
-Dave
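Fleshed out into a self-contained (if artificial) C example of that
pattern -- every rank sends one integer to every rank, and all requests
are posted before any wait:

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int *sbuf = calloc(size, sizeof *sbuf);
      int *rbuf = calloc(size, sizeof *rbuf);
      MPI_Request *sreq = malloc(size * sizeof *sreq);
      MPI_Request *rreq = malloc(size * sizeof *rreq);

      /* Post every nonblocking send and receive up front... */
      for (int p = 0; p < size; p++)
          MPI_Isend(&sbuf[p], 1, MPI_INT, p, 0, MPI_COMM_WORLD, &sreq[p]);
      for (int p = 0; p < size; p++)
          MPI_Irecv(&rbuf[p], 1, MPI_INT, p, 0, MPI_COMM_WORLD, &rreq[p]);

      /* ...then wait; completion order cannot deadlock. */
      MPI_Waitall(size, sreq, MPI_STATUSES_IGNORE);
      MPI_Waitall(size, rreq, MPI_STATUSES_IGNORE);

      free(sbuf); free(rbuf); free(sreq); free(rreq);
      MPI_Finalize();
      return 0;
  }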
ption here.
-Dave
> On Sep 2, 2016, at 5:35 AM, Jeff Squyres (jsquyres)
> wrote:
>
> Greetings Lachlan.
>
> Yes, Gilles and John are correct: on Cisco hardware, our usNIC transport is
> the lowest latency / best HPC-performance transport. I'm not aware of any
>
ary --
But what we're getting is:
app ---> /usr/OMPI
   \
    --> library ---> ~ross/OMPI
If one of them was first linked against the /usr/OMPI and managed to get an
RPATH then it could override your LD_LIBRARY_PATH.
-Dave
On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres)
ound for now though, and the "volatile" approach
seems fine to me.
-Dave
numa_maps". There's lots
of info about NUMA affinity here: https://queue.acm.org/detail.cfm?id=2513149
-Dave
Can anyone report experience with recent OMPI on POWER (ppc64le)
hardware, e.g. Summit? When I tried on similar nodes to Summit's (but
fewer!), the IMB-RMA benchmark SEGVs early on. Before I try to debug
it, I'd be interested to know if anyone else has investigated that or
had better luck and, if
Mark Dixon via users writes:
> Surely I cannot be the only one who cares about using a recent openmpi
> with hdf5 on lustre?
I generally have similar concerns. I dug out the romio tests, assuming
something more basic is useful. I ran them with ompi 4.0.5+ucx on
Mark's lustre system (similar to
I wrote:
> The perf test says romio performs a bit better. Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).
I take that back. I can't reproduce a significant difference for total
IMB-IO runtime, with both run in par
Mark Dixon via users writes:
> But remember that IMB-IO doesn't cover everything.
I don't know what useful operations it omits, but it was the obvious
thing to run, that should show up pathology, with simple things first.
It does at least run, which was the first concern.
> For example, hdf5's
As a check of mpiP, I ran HDF5 testpar/t_bigio under it. This was on
one node with four ranks (interactively) on lustre with its default of
one 1MB stripe, ompi-4.0.5 + ucx-1.9, hdf5-1.10.7, MCA defaults.
I don't know how useful it is, but here's the summary:
romio:
@--- Aggregate Time (top t
Mark Allen via users writes:
> At least for the topic of why romio fails with HDF5, I believe this is the
> fix we need (has to do with how romio processes the MPI datatypes in its
> flatten routine). I made a different fix a long time ago in SMPI for that,
> then somewhat more recently it was r
After seeing several failures with RMA with the change needed to get
4.0.5 through IMB, I looked for simple tests. So, I built the mpich
3.4b1 tests -- or the ones that would build, and I haven't checked why
some fail -- and ran the rma set.
Three out of 180 passed. Many (most?) aborted in ucx,
Ralph Castain via users writes:
> Just a point to consider. OMPI does _not_ want to get in the mode of
> modifying imported software packages. That is a blackhole of effort we
> simply cannot afford.
It's already done that, even in flatten.c. Otherwise updating to the
current version would be t
"Pritchard Jr., Howard" writes:
> Hello Dave,
>
> There's an issue opened about this -
>
> https://github.com/open-mpi/ompi/issues/8252
Thanks. I don't know why I didn't find that, unless I searched before
it appeared. Obviously I was wrong to think it
I tried mpi-io tests from mpich 4.3 with openmpi 4.1 on the ac922 system
that I understand was used to fix ompio problems on lustre. I'm puzzled
that I still see failures.
I don't know why there are disjoint sets in mpich's test/mpi/io and
src/mpi/romio/test, but I ran all the non-Fortran ones wi
Why does 4.1 still not use the right defaults with UCX?
Without specifying osc=ucx, IMB-RMA crashes as with 4.0.5. I haven't
checked what else UCX says you must set for Open MPI to avoid memory
corruption, at least, but I guess that won't be right either.
Users surely shouldn't have to explore
"Jeff Squyres (jsquyres)" writes:
> Good question. I've filed
> https://github.com/open-mpi/ompi/issues/8379 so that we can track
> this.
For the benefit of the list: I mis-remembered that osc=ucx was general
advice. The UCX docs just say you need to avoid the uct btl, which can
cause memory
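For the record, the two variants discussed in this thread amount to
command lines of this shape (benchmark name illustrative):
  $ mpirun --mca btl ^uct ./IMB-RMA     # what the UCX docs ask for
  $ mpirun --mca osc ucx ./IMB-RMA      # the workaround mentioned above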
"Gabriel, Edgar via users" writes:
> I will have a look at those tests. The recent fixes were not
> correctness fixes, but performance fixes.
> Nevertheless, we used to pass the mpich tests, but I admit that it is
> not a testsuite that we run regularly; I will have a look at them. The
> atomicity test
"Gabriel, Edgar via users" writes:
>> How should we know that's expected to fail? It at least shouldn't fail like
>> that; set_atomicity doesn't return an error (which the test is prepared for
>> on a filesystem like pvfs2).
>> I assume doing nothing, but appearing to, can lead to corrupt da
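A minimal sketch of the check being described there (file name
illustrative; the default file error handler is MPI_ERRORS_RETURN, so
the return code is meaningful):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      int err, flag = 0;
      MPI_Init(&argc, &argv);
      MPI_File_open(MPI_COMM_WORLD, "testfile",
                    MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
      /* Request atomic mode, then ask whether it actually took effect
         rather than trusting a silent success. */
      err = MPI_File_set_atomicity(fh, 1);
      MPI_File_get_atomicity(fh, &flag);
      if (err != MPI_SUCCESS || !flag)
          printf("atomic mode unavailable on this file system\n");
      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }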
I meant to ask a while ago about vectorized reductions after I saw a
paper that I can't now find. I didn't understand what was behind it.
Can someone explain why you need to hand-code the AVX implementations of
the reduction operations now used on x86_64? As far as I remember, the
paper didn't j
Gilles Gouaillardet via users writes:
> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).
I take dispatch on micro-architecture for granted, but it doesn't
require
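For context, the operations in question are elementwise loops of this
shape (a sketch, not Open MPI's actual code); the question is whether
hand-written AVX intrinsics beat what a compiler already generates for
such a loop when it is allowed to target the right instruction set:

  #include <stdio.h>
  #include <stddef.h>

  /* Elementwise MPI_SUM-style kernel.  With -O3 and a suitable
     -march=..., compilers typically vectorize this loop themselves. */
  static void sum_double(const double *in, double *inout, size_t count)
  {
      for (size_t i = 0; i < count; i++)
          inout[i] += in[i];
  }

  int main(void)
  {
      double a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40};
      sum_double(a, b, 4);
      printf("%g %g %g %g\n", b[0], b[1], b[2], b[3]);
      return 0;
  }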
work. It doesn't run global tests, but does point-to-point
unidirectional, bi-directional, and aggregate, and may give you some
information about the performance change at 16 KB and whether it is
coming from OpenMPI or IB.
https://netpipe.cs.ksu.edu
Dave Turner
On Tue,
I see assorted problems with OMPI 4.1 on IB, including failing many of
the mpich tests (non-mpich-specific ones) particularly with RMA. Now I
wonder if UCX build options could have anything to do with it, but I
haven't found any relevant information.
What configure options would be recommended wi
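To be concrete about the kind of options I mean (paths illustrative;
UCX ships a contrib/configure-release wrapper, and Open MPI is pointed
at the resulting install with --with-ucx):
  $ ./contrib/configure-release --prefix=$HOME/sw/ucx   # in the UCX tree
  $ make -j install
  $ ./configure --with-ucx=$HOME/sw/ucx ...             # in the Open MPI tree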
Gilles Gouaillardet via users writes:
> Dave,
>
> If there is a bug you would like to report, please open an issue at
> https://github.com/open-mpi/ompi/issues and provide all the required
> information
> (in this case, it should also include the UCX library you are usin
to use -np 2 will not suffice.
Thank you,
Dave Martin