Posting this on UCX list.
On Thu, Oct 4, 2018 at 4:42 PM Charles A Taylor wrote:
>
> We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2,
> for that matter) built with UCX support. The leak shows up
> whether the “ucx” PML is specified for the run or not. The applications
I would suggest posting the error in the UCX issue tracker -
https://github.com/openucx/ucx/issues
It is a typical IB error complaining about access to unregistered memory.
Usually it is caused by some pointer corruption in OMPI/UCX or application
code.
Best,
Pasha
On Thu, Sep 20, 2018 at 11:22 PM Ben Menad
You just have to switch the PML to UCX.
There are some examples of the command line here:
https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
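For example, a minimal run looks like this (process count and binary name are placeholders):
mpirun -np 4 --mca pml ucx ./my_app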
Best,
P.
On Thu, Jun 14, 2018 at 3:25 PM Charles A Taylor wrote:
> Hmmm. ompi_info only shows the ucx pml. I don’t see any “trans
Yossi, can you please comment?
Thanks,
P.
On Tue, Apr 17, 2018 at 07:24 marcin.krotkiewski <
marcin.krotkiew...@gmail.com> wrote:
> Hi, all,
>
> I'm reading in the changelog 3.0.0 that
>
> - Use UCX multi-threaded API in the UCX PML. Requires UCX 1.0 or later.
>
> Also, the changelog for 3.1.0 i
Alberto,
Have you tried the toolchain from Linaro?
https://releases.linaro.org/components/toolchain/binaries/latest/aarch64-linux-gnu/
Best,
Pasha
On Tue, Nov 14, 2017 at 4:07 AM, Alberto Ortiz
wrote:
> Hi,
> I am trying to run in this type of environment:
>
> 1- A linux PC in which I intend
I'm building on ARMv8 (64bit kernel, ompi master) and so far no problems.
On Wed, Sep 27, 2017 at 7:34 AM, Jeff Layton wrote:
> I could never get OpenMPI < 2.x to build on a Pi 2. I ended up using the
> binary from the repos. Pi 3 is a different matter - I got that to build
> after a little expe
ll.
I just downloaded the OSU benchmarks and tried osu_latency. It reports ~40
microseconds for OpenMPI and ~3 microseconds for MVAPICH. Still puzzled...
Steve
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Pavel Shamis (Pasha)
Hey,
I may only add that XRC and RC have the same latency.
What is the command line that you use to run this benchmark?
What is the system configuration (one HCA, one active port)?
Any additional information about the system configuration, MPI command line,
etc. will help to analyze your issue.
Very strange. MPI tries to access the CQ context and it gets an immediate error.
Please make sure that your limits configuration is OK; take a look at
this FAQ - http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
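For reference, the usual fix from that FAQ is to raise the locked-memory limits, e.g. in /etc/security/limits.conf (a sketch; adjust to your site's policy and re-login or restart the daemons afterwards):
* soft memlock unlimited
* hard memlock unlimited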
Pasha.
Charles Wright wrote:
Hello,
I just got some new cluster hardwa
You will not need the trick if you configure Open MPI with the
following flag:
--enable-mpirun-prefix-by-default
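For example (the install prefix below is just a placeholder):
./configure --prefix=/opt/openmpi --enable-mpirun-prefix-by-default
make all install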
Pasha.
Hodgess, Erin wrote:
the LD_LIBRARY_PATH did the trick;
thanks so much!
Sincerely,
Erin
Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematica
Sangamesh,
The IB tunings that you added to your command line only delay the
problem but do not resolve it.
node-0-2.local gets the asynchronous event "IBV_EVENT_PORT_ERROR"; as a
result the processes fail to deliver packets to some remote hosts and
you see a bunch of IB errors.
T
You may try to use the ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED (http://www.openfabrics.org/)
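For a quick fabric scan a plain run is enough (assuming OFED is installed on the node; the reports are typically written under /tmp):
ibdiagnet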
Pasha.
Prentice Bisbal wrote:
Several jobs on my cluster just died with the error below.
Are there any IB/Open MPI diagnostics I should use to diagnose, should I
just
However, setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement, so OpenMPI is now down to 326 seconds (from the
previous 376 seconds). Still a lot more than ScaliMPI with 214 seconds.
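For reference, such a setting is typically passed on the mpirun command line, e.g. (process count and binary name are placeholders):
mpirun -np 64 --mca btl_openib_eager_limit 65536 ./my_app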
Can you please run ibv_devinfo on one of the compute nodes? It would be
interesting to know what kind of IB hardware you have.
If the above doesn't improve anything, the next question is: do you know
what the sizes of the messages are? For very small messages I believe
Scali shows 2x better performance than Intel and OMPI (I think this
is due to a fast-path optimization).
I remember that mvapich was faster than sca
the following MPI functions being used:
MPI_Init
MPI_wtime
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_BUFFER_ATTACH
MPI_BSEND
MPI_PACK
MPI_UNPACK
MPI_PROBE
MPI_GET_COUNT
MPI_RECV
MPI_IPROBE
MPI_FINALIZE
where MPI_IPROBE is the clear winner in terms of number of calls.
/Torgny
Pavel Shamis (Pasha) wrote
Do you know if the application uses some collective operations?
Thanks
Pasha
Torgny Faxen wrote:
Hello,
we are seeing a large difference in performance for some applications
depending on what MPI is being used.
Attached are performance numbers and oprofile output (first 30 lines)
from one
Lenny,
You can find some details here:
http://icl.cs.utk.edu/news_pub/submissions/Flex-collective-euro-pvmmpi-2006.pdf
Pasha
Lenny Verkhovsky wrote:
Hi,
I am also looking for an example rules file for dynamic collectives.
Has anybody tried it? Where can I find the proper syntax for it?
tha
We have a computational cluster consisting of 8 HP ProLiant
ML370 G5 nodes with 32 GB RAM.
Each node has a Mellanox single-port InfiniBand DDR HCA (20 Gbit/s),
and they are connected to each other through
a Voltaire ISR9024D-M DDR InfiniBand switch.
Now we want to increase the bandwidth to 40 Gbit/s
Hi,
You can select the IB device used by the openib BTL with the following parameter:
MCA btl: parameter "btl_openib_if_include" (current value: , data
source: default value)
Comma-delimited list of devices/ports to be
used (e.g. "mthca0,mthca1:2"; empty value means to
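A typical usage, restricting Open MPI to port 1 of mthca0 (the device name is just an example; check ibv_devinfo for yours, and the process count and binary are placeholders):
mpirun -np 16 --mca btl openib,sm,self --mca btl_openib_if_include mthca0:1 ./my_app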
Jim,
Can you please share with us your MCA conf file?
Pasha.
Jim Kress ORG wrote:
For the app I am using, ORCA (a Quantum Chemistry program), when it was
compiled using openMPI 1.2.8 and run under 1.2.8 with the following in
the openmpi-mca-params.conf file:
btl=self,openib
the app ran fine wit
I tried to run with the first dynamic rules file that Pavel proposed
and it works; the time per one MD step on 48 cores decreased from 2.8
s to 1.8 s as expected.
Good news :-)
Pasha.
Thanks
Roman
On Wed, May 20, 2009 at 7:18 PM, Pavel Shamis (Pasha) wrote:
Tomorrow I will add
Tomorrow I will add some printf to collective code and check what really
happens there...
Pasha
Peter Kjellstrom wrote:
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
Disabling basic_linear seems like a good idea but your config file sets
the cut-off at 128 Bytes for 64-ranks (the
Disabling basic_linear seems like a good idea but your config file sets the
cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to result
in a message size of that value divided by the number of ranks).
In my testing bruck seems to win clearly (at least for 64 ranks on my IB) u
The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules
Ohh... it was my mistake.
You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight
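Putting it together, a run would look roughly like this (process count and binary name are placeholders):
mpirun -np 48 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_dynamic_rules_filename ./dyn_rules ./my_app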
The default algorithm thresholds in MVAPICH are different from OMPI's.
Using tuned collectives in Open MPI, you may configure the Open MPI
Alltoall thresholds to match the MVAPICH defaults.
The following MCA parameters configure Open MPI to use custom rules that
are defined in a configuration (text) file.
"--mca use_dynam
.
Pasha.
Roman Martonak wrote:
I've been using --mca mpi_paffinity_alone 1 in all simulations. Concerning "-mca
mpi_leave_pinned 1", I tried it with openmpi 1.2.X versions and it
makes no difference.
Best regards
Roman
On Mon, May 18, 2009 at 4:57 PM, Pavel Shamis (Pasha) wrot
1) I was told to add "-mca mpi_leave_pinned 0" to avoid problems with
Infinband. This was with OpenMPI 1.3.1. Not
Actually, for the 1.2.X versions I would recommend enabling leave-pinned:
"-mca mpi_leave_pinned 1"
sure if the problems were fixed on 1.3.2, but I am hanging on to that
setting j
RNR (receiver not ready) means that on the receive side MPI doesn't have
buffers to receive the data.
It may point to a broken configuration in MPI/ofud or a credit leak in
the OFUD code.
Åke Sandgren wrote:
Hi!
I'm having problem with getting the "error polling LP CQ with status
RNR..." on an otherw
The (low level verbs) latency has AFAIR changed only a few times:
1) started at 5-6us with PCI-X Infinihost3
2) dropped to 3-4us with PCI-express Infinihost3
3) dropped to ~1us with PCI-express ConnectX
I would like to add that on PCI-E Gen2 platforms the latency is
sub-microsecond (~0.8-0.95 us).
I can't find a similar data set for Infiniband. I would appreciate any
comment/links.
Here is the IB roadmap: http://www.infinibandta.org/itinfo/IB_roadmap
...but I do not see SDR there.
Pasha
Jan,
I guess that you have the OFED driver installed on your machines. You can do
basic network verification with the ibdiagnet utility
(http://linux.die.net/man/1/ibdiagnet), which is part of the OFED installation.
Regards,
Pasha
Jeff Squyres wrote:
On May 4, 2009, at 9:50 AM, jan wrote:
Thank you Jef
You may try to use XRC; it should decrease the openib BTL memory footprint,
especially on a multi-core system like yours. The following option will
switch the default OMPI config to XRC:
"--mca btl_openib_receive_queues
X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"
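As a complete command line it would look roughly like this (process count and binary name are placeholders):
mpirun -np 64 --mca btl openib,sm,self --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ./my_app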
Regard
The firmware version 2.3.0 is too old. I recommend you upgrade to the
latest version (2.6.0) from the
Mellanox website
http://www.mellanox.com/content/pages.php?pg=firmware_table_ConnectXIB
Thanks,
Pasha
Jeff Layton wrote:
Oops. I ran it on the head node and not the compute node. Here is the
output
Do you have the same HCA adapter type on all of your machines?
In the error log I see an mlx4 error message, and mlx4 is the ConnectX driver,
but ibv_devinfo shows some older HCA.
Pasha
Jeff Layton wrote:
Pasha,
Here you go... :) Thanks for looking at this.
Jeff
hca_id: mthca0
fw_ver:
Thanks Pasha!
ibdiagnet reports the following:
-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=
Jeff,
Can you please provide more information about your HCA type (ibv_devinfo -v)?
Do you see this error immediately during startup, or do you get it during
the run?
Thanks,
Pasha
Jeff Layton wrote:
Evening everyone,
I'm running a CFD code on IB and I've encountered an error I'm not
sure about
Time to dig up diagnostics tools and look at port statistics.
You may use the ibdiagnet tool for the network debug -
http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
Pasha.
Usually "retry exceeded error" points to some network issues, like bad
cable or some bad connector. You may use ibdiagnet tool for the network
debug - *http://linux.die.net/man/1/ibdiagnet. *This tool is part of OFED.
Pasha
Brett Pemberton wrote:
Hey,
I've had a couple of errors recently, of
You may specify:
--mca btl openib,sm,self
The application sometimes runs fast, sometimes runs slow.
When you specify the parameter above, Open MPI will use only three BTLs:
openib - for InfiniBand
sm - for shared memory communication
self - for "self" communication
No other BTL will be used.
And Op
Another thing to try is a change that we made late in the Open MPI
v1.2 series with regards to IB:
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
Thanks, this is something worth investigating. What would be the exact
syntax to use to turn off pml_ob1_use_
If the basic test runs, the installation is OK. So what happens when you
try to run your application? What is the command line? What is the error
message? Do you run the application on the same set of machines with
the same command line as IMB?
Pasha
yes to both questions: the OMPI version is
Teige, Scott W wrote:
Greetings,
I have observed strange behavior with an application running with
OpenMPI 1.2.8, OFED 1.2. The application runs in two "modes", fast
and slow. The execution time is either within one second of 108 sec.
or within one second of 67 sec. My cluster has 1 Gig etherne
Biagio Lucini wrote:
Hello,
I am new to this list, where I hope to find a solution for a problem
that I have been having for quite a long time.
I run various versions of openmpi (from 1.1.2 to 1.2.8) on a cluster
with Infiniband interconnects that I use and administer at the same
time. The o
recommend you upgrade your Open MPI installation. v1.2.8 has
a lot of bugfixes relative to v1.2.2. Also, Open MPI 1.3 should be
available "next month"... so watch for an announcement on that front.
BTW, OMPI 1.2.8 will also be available as part of OFED 1.4, which will be
released at the end of th
--
Pavel Shamis (Pasha)
Mellanox Technologies LTD.
Hi,
Can you please provide more information about your setup:
- OpenMPI version
- Runtime tuning
- Platform
- IB vendor and driver version
Thanks,
Pasha
Åke Sandgren wrote:
Hi!
We have a code that (at least sometimes) gets the following error
message:
[p-bc2909][0,1,98][btl_openib_endpoint.h:2
Usually OFED installs only the 64-bit version of libibverbs. If you want to
install both the 32-bit and 64-bit versions, you need to pass the "--build32"
flag to the OFED install. After reinstalling OFED with 32-bit support, you can
rebuild OMPI for 32-bit support.
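A rough sketch of the OMPI rebuild afterwards (the -m32 flags and prefix are assumptions; the exact flags depend on your compilers):
./configure --prefix=/opt/openmpi-32 CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 FCFLAGS=-m32
make all install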
Regards,
Pasha
Mohd Radzi Nurul Azri wrote:
Hi
Amir Saad wrote:
I'll be starting some parallel programs in Open MPI and I would like
to find a guide or any docs of Open MPI, any suggestions please? I
couldn't find any docs on the website, how do I know about the APIs or
the functions that I should use?
Here are videos about OpenMPI/MPI -
with Intel and PGI compilers:
http://www.mellanox.com/products/ofed.php
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Jul 3, 2008, at 8:38 AM, Jeff Squyres wrote:
On Jul 2, 2008, at 11:51 PM, Pavel Shamis (Pasha) wrote:
In trying to build
there a way to shut off early completion in 1.2.3?
Sure, just add "--mca pml_ob1_use_early_completion 0" to your command
line.
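A full command line would be, for example (process count and binary name are placeholders):
mpirun -np 16 --mca pml_ob1_use_early_completion 0 ./my_app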
Or is the above a known issue, and should I use 1.2.7-pre or grab a
1.3 snapshot?
1.2.6 should be ok.
Regards,
Pasha
On Jul 2, 2008, at 10:42 AM, Pa
Maybe this FAQ will help:
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
Brock Palen wrote:
We have a code (arts) that locks up only when running on IB. It works
fine on tcp and sm.
When we ran it in a debugger, it locked up on an MPI_Comm_split().
That as far a
Thu, 6/19/08, Pavel Shamis (Pasha) wrote:
From: Pavel Shamis (Pasha)
Subject: Re: [OMPI users] Open MPI timeout problems.
To: pj...@cornell.edu, "Open MPI Users"
Date: Thursday, June 19, 2008, 5:20 AM
Usually the retry exceed point to some network issue on y
Usually a retry-exceeded error points to some network issue on your cluster. I
see from the logs that you still
use MVAPI. If I remember correctly, MVAPI includes an IBADM application that
should be able to check and debug the network.
BTW, I recommend you update your MVAPI driver to the latest OpenFabric dri
Scott Shaw wrote:
Hi, I hope this is the right forum for my questions. I am running into
a problem when scaling >512 cores on an InfiniBand cluster which has
14,336 cores. I am new to openmpi and trying to figure out the right
-mca options to pass to avoid the "mca_oob_tcp_peer_complete_connect:
Pavel Shamis (Pasha) wrote:
SLIM H.A. wrote:
Is it possible to get information about the usage of hca ports similar
to the result of the mx_endpoint_info command for Myrinet boards?
The ibstat command gives information like this:
Port 1:
State: Active
Physical state: LinkUp
using an InfiniBand port or
communicates through plain ethernet.
I would be grateful for any advice
You have access to some counters in
/sys/class/infiniband/mlx4_0/ports/1/counters/ (counters for HCA
mlx4_0, port 1)
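For example (mlx4_0/port 1 as in the path above; counter names can vary slightly between drivers):
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data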
--
Pavel Shamis (Pasha)
Mellanox Technologies
d second one will be reserved for backup.
On network failure on the first port,
all connections will migrate to the second port. APM works only at the
HCA level - I mean that you cannot migrate between
different HCAs; you can migrate only between the 2 ports of the same HCA.
--
Pavel Shamis (Pasha)
Mellanox Technologies
--
Pavel Shamis (Pasha)
Mellanox Technologies
Ok, I will do.
Jeff Squyres wrote:
Sure, that would be fine.
Can you write it up in a little more FAQ-ish style? I can add it to
the web page. See this wiki item:
https://svn.open-mpi.org/trac/ompi/wiki/OMPIFAQEntries
On Mar 12, 2008, at 5:33 AM, Pavel Shamis (Pasha) wrote:
Run
above, but I cannot seem to find anywhere that will tell me how to
change the GID to something else.
Thanks,
Jon
--
Pavel Shamis (Pasha)
Mellanox Technologies
Mpirun opens a separate shell on each machine/node, so the "ulimit" setting will
not be available in the new shell. I think if you add "ulimit -c
unlimited" to your default shell configuration file (~/.bashrc in the BASH
case and ~/.tcshrc in the TCSH/CSH case) you will find your core files
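For bash, for example, the line can be appended like this (a minimal sketch):
echo 'ulimit -c unlimited' >> ~/.bashrc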