[OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-06-17 Thread Tony Ladd
I am getting the following error with openmpi-1.1b1

mca_btl_tcp_frag_send: writev failed with errno=110

1) This never happens with other MPIs I have tried, such as MPICH and LAM.
2) It only seems to happen with larger numbers of cpus (32 and occasionally
16) and with larger message sizes. In this case it was 128K.
3) It only seems to happen with dual cpus on each node.
4) My configuration is default with (in openmpi-mca-params.conf): 
pls_rsh_agent = rsh 
btl = tcp,self 
btl_tcp_if_include = eth1 
I also set --mca btl_tcp_eager_limit 131072 when running the program, though
leaving this out does not eliminate the problem.

My program is a communication test; it sends bidirectional point-to-point
messages among N cpus. In one test it exchanges messages between pairs of
cpus, in another it reads from the node on its left and sends to the node on
its right (a kind of ring), and in a third it uses MPI_Allreduce.
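
In outline, the ring test does something like the following (a bare-bones
sketch, not the actual test code; the 128K message size and the repeat count
are placeholders):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const int n = 128 * 1024;            /* 128K message, the size that triggered errno=110 */
    char *sbuf = malloc(n), *rbuf = malloc(n);
    memset(sbuf, 1, n);                  /* contents don't matter for a bandwidth test */
    int right = (rank + 1) % np;         /* send to the node on the right ... */
    int left  = (rank - 1 + np) % np;    /* ... and receive from the node on the left */

    for (int i = 0; i < 100; i++)        /* placeholder repeat count */
        MPI_Sendrecv(sbuf, n, MPI_BYTE, right, 0,
                     rbuf, n, MPI_BYTE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}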

Finally: the tcp driver in openmpi seems not nearly as good as the one in
LAM. I got higher throughput with far fewer dropouts with LAM.

Tony


---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-06-30 Thread Tony Ladd
Jeff

Thanks for the reply; I realize you guys must be really busy with the recent
release of openmpi. I tried 1.1 and I don't get error messages any more. But
the code now hangs; no error or exit. So I am not sure if this is the same
issue or something else. I am enclosing my source code. I compiled with icc
and linked against an icc compiled version of openmpi-1.1.

My program is a set of network benchmarks (a crude kind of netpipe) that
checks typical message passing patterns in my application codes. 
Typical output is:

 32 CPU's: sync call time = 1003.0

                            time                                rate (Mbytes/s)                       bandwidth (MBits/s)
 loop  buffers   size     XC       XE       GS       MS        XC       XE       GS       MS        XC       XE       GS       MS
    1       64  16384  2.48e-02 1.99e-02 1.21e+00 3.88e-02  4.23e+01 5.28e+01 8.65e-01 2.70e+01  1.08e+04 1.35e+04 4.43e+02 1.38e+04
    2       64  16384  2.17e-02 2.09e-02 1.21e+00 4.10e-02  4.82e+01 5.02e+01 8.65e-01 2.56e+01  1.23e+04 1.29e+04 4.43e+02 1.31e+04
    3       64  16384  2.20e-02 1.99e-02 1.01e+00 3.95e-02  4.77e+01 5.27e+01 1.04e+00 2.65e+01  1.22e+04 1.35e+04 5.33e+02 1.36e+04
    4       64  16384  2.16e-02 1.96e-02 1.25e+00 4.00e-02  4.85e+01 5.36e+01 8.37e-01 2.62e+01  1.24e+04 1.37e+04 4.28e+02 1.34e+04
    5       64  16384  2.25e-02 2.00e-02 1.25e+00 4.07e-02  4.66e+01 5.24e+01 8.39e-01 2.57e+01  1.19e+04 1.34e+04 4.30e+02 1.32e+04
    6       64  16384  2.19e-02 1.99e-02 1.29e+00 4.05e-02  4.79e+01 5.28e+01 8.14e-01 2.59e+01  1.23e+04 1.35e+04 4.17e+02 1.33e+04
    7       64  16384  2.19e-02 2.06e-02 1.25e+00 4.03e-02  4.79e+01 5.09e+01 8.38e-01 2.60e+01  1.23e+04 1.30e+04 4.29e+02 1.33e+04
    8       64  16384  2.24e-02 2.06e-02 1.25e+00 4.01e-02  4.69e+01 5.09e+01 8.39e-01 2.62e+01  1.20e+04 1.30e+04 4.30e+02 1.34e+04
    9       64  16384  4.29e-01 2.01e-02 6.35e-01 3.98e-02  2.45e+00 5.22e+01 1.65e+00 2.64e+01  6.26e+02 1.34e+04 8.46e+02 1.35e+04
   10       64  16384  2.16e-02 2.06e-02 8.87e-01 4.00e-02  4.85e+01 5.09e+01 1.18e+00 2.62e+01  1.24e+04 1.30e+04 6.05e+02 1.34e+04

Time is total for all 64 buffers. Rate is one way across one link (# of
bytes/time).
1) XC is a bidirectional ring exchange. Each processor sends to the right
and receives from the left.
2) XE is an edge exchange. Pairs of nodes exchange data, with each one
sending and receiving.
3) GS is MPI_Allreduce.
4) MS is my version of MPI_Allreduce. It splits the vector into Np blocks
(Np is the # of processors); each processor then acts as a head node for one
block. This uses the full bandwidth all the time, unlike Allreduce, which
thins out as it gets to the top of the binary tree. On a 64-node Infiniband
system MS is about 5X faster than GS - in theory it would be 6X, i.e. log_2(64).
Here it is 25X - not sure why so much. But MS seems to be the cause of the
hangups with messages > 64K. I can run the other benchmarks OK, but this one
seems to hang for large messages. I think the problem is at least partly due
to the switch. All MS is doing is point-to-point communication, but
unfortunately it sometimes requires high bandwidth between ASICs. At
first it exchanges data between near neighbors in MPI_COMM_WORLD, but it
must progressively span wider gaps between nodes as it goes up the various
binary trees. After a while this requires extensive traffic between ASICs.
This seems to be a problem on both my HP 2724 and the Extreme Networks
Summit400t-48. I am currently working with Extreme to try to resolve the
switch issue. As I say, the code ran great on Infiniband, but I think those
switches have hardware flow control. Finally, I checked the code again under
LAM and it ran OK. Slow, but no hangs.

To run the code compile and type:
mpirun -np 32 -machinefile hosts src/netbench 8
The 8 means 2^8 Kbytes (i.e. 256K). This was enough to hang every time on my
boxes.

You can also edit the header file (header.h). MAX_LOOPS is how many times it
runs each test (currently 10); NUM_BUF is the number of buffers in each test
(must be more than the number of processors); SYNC defines the global sync
frequency (every SYNC buffers); NUM_SYNC is the number of sequential barrier
calls it uses to determine the mean barrier call time. You can also switch
the various tests on and off, which can be useful for debugging.

Tony

---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 


src.tgz
Description: application/compressed


[OMPI users] dual Gigabit ethernet support

2006-10-23 Thread Tony Ladd
A couple of comments regarding issues raised by this thread.

1) In my opinion Netpipe is not such a great network benchmarking tool for
HPC applications. It measures timings based on the completion of the send
call on the transmitter, not the completion of the receive. Thus, if there is
a delay in copying the send buffer across the net, it will report a
misleading timing compared with the wall-clock time. This is particularly
problematic with multiple pairs of edge exchanges, which can oversubscribe
most GigE switches. Here the netpipe timings can be off by orders of
magnitude compared with the wall clock. The good thing about writing your
own code is that you know what it has done (of course no one else knows,
which can be a problem). But it seems many people are unaware of the timing
issue in Netpipe.

2) It's worth distinguishing between ethernet and TCP/IP. With MPIGAMMA, the
Intel Pro 1000 NIC has a latency of 12 microsecs including the switch and a
duplex bandwidth of 220 MBytes/sec. With the Extreme Networks X450a-48t
switch we can sustain 220MBytes/sec over 48 ports at once. This is not IB
performance but it seems sufficient to scale a number of applications to the
100 cpu level, and perhaps beyond.

Tony


---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




[OMPI users] Dual Gigabit ethernet support

2006-10-24 Thread Tony Ladd
Lisandro

I use my own network testing program; I wrote it some time ago because
Netpipe only tested 1-way rates at that point. I haven't tried IMB but I
looked at the source and it's very similar to what I do: 1) set up buffers
with data; 2) start the clock; 3) call MPI_xxx N times; 4) stop the clock;
5) calculate the rate. IMB tests more things than I do; I just focused on the calls I use
(send recv allreduce). I have done a lot of testing of hardware and
software. I will have some web pages posted soon. I will put a note here
when I do. But a couple of things.
A) I have found the switch is the biggest discriminant if you want to run
HPC under Gigabit ethernet. Most GigE switches choke when all the ports are
being used at once. This is the usual HPC pattern, but not that of a typical
network, which is what these switches are geared towards. The one exception
I have found is the Extreme Networks x450a-48t. In some test patterns I
found it to be 500 times faster (not a typo) than the s400-48t, which is its
predecessor. I have tested several GigE switches (Extreme, Force10, HP,
Asante) and the x450 is the only one that copes with high traffic loads in
all port configurations. It's expensive for a GigE switch (~$6500) but worth
it in my opinion if you want to do HPC. It's still much cheaper than
Infiniband.
B) You have to test the switch in different port configurations - a random
ring of SendRecv is good for this. I don't think IMB has it in its test
suite but it's easy to program. Or you can change the order of nodes in the
machinefile to force unfavorable port assignments. A step of 12 is a good
test since many GigE switches use 12-port ASICs and this forces all the
traffic onto the backplane. On the Summit 400 this causes it to more or less
stop working - rates drop to a few Kbytes/sec along each wire, but the x450
has no problem with the same test. You need to know how your nodes are wired
to the switch to do this test.
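As an illustration (hypothetical host names, and assuming the nodes are
cabled to the switch ports in numerical order), a machinefile that starts

node01
node13
node25
node37
node02
node14

and so on puts neighbouring MPI ranks on different 12-port ASICs, so even
nearest-neighbour exchanges have to cross the backplane.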
C) GAMMA is an extraordinary accomplishment in my view; in a number of tests
with codes like DLPOLY, GROMACS, VASP it can be 2-3 times the speed of TCP
based programs with 64 cpus. In many instances I get comparable (and
occasionally better) scaling than with the university HPC system which has
an Infiniband interconnect. Note I am not saying GigE is comparable to IB;
but that a typical HPC setup with nodes scattered all over a fat tree
topology (including oversubscription of the links and switches) is enough of
a minus that an optimized GigE setup can compete; at least up to 48 nodes
(96 cpus in our case). I have worked with Giuseppe Ciaccio for the past 9
months eradicating some obscure bugs in GAMMA. I find them; he fixes them.
We have GAMMA running on 48 nodes quite reliably but there are still many
issues to address. GAMMA is very much a research tool-there are a number of
features(?) which would hinder it being used in an HPC environment.
Basically Giuseppe needs help with development. Any volunteers? 

Tony
---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




[OMPI users] Dual Gigabit Ethernet Support

2006-10-24 Thread Tony Ladd
Durga

I guess we have strayed a bit from the original post. My personal opinion is
that a number of codes can run in HPC-like mode over Gigabit ethernet, not
just the trivially parallelizable. The hardware components are one key:
PCI-X, a low-latency NIC (the Intel PRO 1000 is 6.6 microsecs vs about 14
for the Broadcom 5721), and a non-blocking (that's the key word) switch. Then
you need a good driver and a good MPI software layer. At present MPICH is
ahead of LAM/OpenMPI/MVAPICH in its implementation of optimized collectives.
At least that's how it seems to me (let me say that quickly, before I get
flamed). MPICH got a bad rap performance-wise because its TCP driver was
mediocre (compared with LAM and OpenMPI). But MPICH + GAMMA is very fast.
MPIGAMMA even beats out our Infiniband cluster running OpenMPI on the
MPI_Allreduce; the test was with 64 cpus-32 nodes on the GAMMA cluster (Dual
core P4) and 16 nodes on the Infiniband (Dual Dual-core Opterons). The IB
cluster worked out at 24MBytes/sec (vector size/time) and the GigE +
MPIGAMMA was 39MBytes/sec. On the other hand, if I use my own optimized
AllReduce (a simplified version of the one in MPICH) on the IB cluster it
gets 108MByte/sec. So the tricky thing is all the components need to be in
place to get good application performance.

GAMMA is not so easy to set up-I had considerable help from Giuseppe. It has
libraries to compile and the kernel needs to be recompiled. Once I got that
automated I can build and install a new version of GAMMA in about 5 mins.
The MPIGAMMA build is just like MPICH and MPIGAMMA works almost exactly the
same. So any application that will compile under MPICH should compile under
MPIGAMMA, just by changing the path. I have run half a dozen apps with
GAMMA. Netpipe, Netbench (my network tester-a simplified version of IMB),
Susp3D (my own code-a CFD like application), DLPOLY all compile out of the
box. Gromacs compiles but has a couple of "bugs" that crash on execution.
One is an archaic test for MPICH that prevents a clean exit-must have been a
bugfix for an earlier version of MPICH. The other seems to be an fclose of
an unassigned file pointer. It works OK in LAM but my guess is it's illegal,
strictly speaking. A student was supposed to check on that. VASP also
compiles out of the box if you can compile it with MPICH. But there is a
problem with the MPIGAMMA and the MPI_Alltoall function right now. It works
but it suffers from hangups and long delays. So GAMMA is not good for VASP
at this moment. You see the substantial performance improvements sometimes,
but other times it's dreadfully slow. I can reproduce the problem with an
AlltoAll test code and Giuseppe is going to try to debug the problem.

So GAMMA is not a panacea. In most circumstances it is stable and
predictable; much more reproducible than MPI over TCP. But there may still
be one or two bugs and several issues.
1) Since GAMMA is tightly entwined in the kernel a crash frequently brings
the whole system down, which is a bit annoying; also it can crash other
nodes in the same GAMMA Virtual Machine.
2) NICs are very buggy hardware - if you look at a TCP driver there are a
large number of hardware bugfixes in them. A number of GAMMA problems can be
traced to this. It's a lot of work to reprogram all the workarounds.
3) GAMMA nodes have to be preconfigured at boot. You can run more than one
job on a GAMMA virtual machine, but it's a little iffy; there can be
interactions between nodes on the same VM even if they are running different
jobs. Different GAMMA VMs need a different VLAN. So a multiuser environment
is still problematic.
4) Giuseppe said MPIGAMMA was a very difficult code to write-so I would
guess a port to OpenMPI would not be trivial. Also I would want to see
optimized collectives in OpenMPI before I switched from MPICH.

As far as I know GAMMA is the most advanced non-TCP protocol. At its core it
really works well, but it still needs a lot more testing and development.
Giuseppe is great to work with if anyone out there is interested. Go to the
MPIGAMMA website for more info
http://www.disi.unige.it/project/gamma/mpigamma/index.html.

Tony




[OMPI users] OMPI collectives

2006-10-26 Thread Tony Ladd

1) I think OpenMPI does not use optimal algorithms for collectives. But
neither does LAM. For example the MPI_Allreduce scales as log_2 N where N is
the number of processors. MPICH uses optimized collectives and the
MPI_Allreduce is essentially independent of N. Unfortunately MPICH has never
had a good TCP interface so it's typically slower overall than LAM or
OpenMPI. Are there plans to develop optimized collectives for OpenMPI; if
so, is there a timeline?

2) I have found an additional problem in OpenMPI over TCP. MPI_AllReduce can
run extremely slowly on large numbers of processors. Measuring throughput
(message size / time) for 48 nodes with 16KByte messages (for example) I get
only 0.12MBytes/sec. The same code with LAM gets 5.3MBytes/sec which is more
reasonable. The problem seems to arise for a) more than 16 nodes and b)
message sizes in the range 16-32KBytes. Normally this is the optimum size so
it's odd. Other message sizes are closer to LAM (though typically a little
slower). I have run these tests with my own network test, but I can run IMB
if necessary.
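
For reference, my test measures throughput in the obvious way - roughly the
sketch below (placeholder vector size and repeat count; summing doubles is
an assumption, just to make the sketch concrete):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int bytes = 16384;                  /* 16 KByte message (placeholder) */
    const int n = bytes / sizeof(double);
    const int reps = 100;                     /* placeholder repeat count */
    double *in  = calloc(n, sizeof(double));
    double *out = calloc(n, sizeof(double));

    MPI_Barrier(MPI_COMM_WORLD);              /* line everyone up before starting the clock */
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    if (rank == 0)                            /* message size divided by time per call */
        printf("throughput = %g MBytes/s\n", (double)bytes * reps / t / 1e6);

    free(in); free(out);
    MPI_Finalize();
    return 0;
}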

Tony




[OMPI users] OMPI Collectives

2006-10-27 Thread Tony Ladd
George

Thanks for the info. When you say "highly optimized" do you mean
algorithmically, via tuning, or both? In particular I wonder if OMPI's optimized collectives use
the divide and conquer strategy to maximize network bandwidth.

Sorry to be dense but I could not find documentation on how to access the
optimized collectives. I would be willing to fiddle with the parameters a
bit by hand if I had some guidance as to how to set things and what I might
vary.

The optimization I was talking about (divide and conquer) would work better
than the basic strategy regardless of network; only message size might have
some effect.

Thanks

Tony


---
Tony Ladd
Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




[OMPI users] OMPI Collectives

2006-10-28 Thread Tony Ladd
George

Thanks for the references. However, I was not able to figure out whether what
I am asking is so trivial it is simply passed over or so subtle that it's
been overlooked (I suspect the former). The binary tree algorithm in
MPI_Allreduce takes a time proportional to 2*N*log_2(M), where N is the vector
length and M is the number of processes. There is a divide and conquer
strategy
(http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html) that
mpich uses to do a MPI_Reduce in a time proportional to N. Is this algorithm
or something equivalent in OpenMPI at present? If so how do I turn it on?

I also found that OpenMPI is sometimes very slow on MPI_Allreduce using TCP.
Things are OK up to 16 processes but at 24 the rates (Message length divided
by time) are as follows:

Message size (Kbytes)    Throughput (Mbytes/sec)
                         M=24    M=32    M=48
   1                     1.38    1.30    1.09
   2                     2.28    1.94    1.50
   4                     2.92    2.35    1.73
   8                     3.56    2.81    1.99
  16                     3.97    1.94    0.12
  32                     0.34    0.24    0.13
  64                     3.07    2.33    1.57
 128                     3.70    2.80    1.89
 256                     4.10    3.10    2.08
 512                     4.19    3.28    2.08
1024                     4.36    3.36    2.17

Around 16-32 KBytes there is a pronounced slowdown - roughly a factor of 10,
which seems too much. Any idea what's going on?

Tony

---
Tony Ladd
Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




[OMPI users] OMPI collectives

2006-11-02 Thread Tony Ladd
George

I found the info I think you were referring to. Thanks. I then experimented
essentially randomly with different algorithms for allreduce. But the issue
with really bad performance for certain message sizes persisted with v1.1.
The good news is that the upgrade to 1.2 fixed my worst problem. Now the
performance is reasonable for all message sizes. I will test the tuned
algorithms again asap.

I had a couple of questions

1) ompi_info lists only 3 or 4 algorithms for allreduce and reduce, and about
5 for bcast. But you can use higher numbers as well. Are these additional
undocumented algorithms (you mentioned a number like 15) or is it ignoring
out-of-range parameters?
2) It seems for allreduce you can select a tuned reduce and tuned bcast
instead of the binary tree. But there is a faster allreduce which is order
2N rather than 4N for Reduce + Bcast (N is msg size). It segments the vector
and distributes the root among the nodes; in an allreduce there is no need
to gather the root vector to one processor and then scatter it again. I
wrote a simple version for powers of 2 (MPI_SUM) - any chance of it being
implemented in OMPI?
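
To make the idea concrete, the sketch below does the two phases with library
collectives (MPI_Reduce_scatter followed by an in-place MPI_Allgatherv). It
is only an illustration of the approach, not the code I wrote, which does
the reduce-scatter phase pairwise for power-of-2 process counts:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Allreduce(SUM) as reduce-scatter + allgather: each process owns one block
   of the vector, so the "root" work is spread over all the nodes. */
void allreduce_sum(double *sendbuf, double *recvbuf, int n, MPI_Comm comm)
{
    int rank, np;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);

    int *counts = malloc(np * sizeof(int));
    int *displs = malloc(np * sizeof(int));
    for (int i = 0; i < np; i++) {
        counts[i] = n / np + (i < n % np ? 1 : 0);        /* block owned by process i */
        displs[i] = (i == 0) ? 0 : displs[i-1] + counts[i-1];
    }

    /* Phase 1: every process ends up with the fully reduced values for its own block */
    MPI_Reduce_scatter(sendbuf, recvbuf + displs[rank], counts,
                       MPI_DOUBLE, MPI_SUM, comm);

    /* Phase 2: gather the reduced blocks onto every process (no single root needed) */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   recvbuf, counts, displs, MPI_DOUBLE, comm);

    free(counts); free(displs);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    enum { N = 1 << 16 };
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) x[i] = rank;              /* each element sums to np*(np-1)/2 */

    allreduce_sum(x, y, N, MPI_COMM_WORLD);

    if (rank == 0)
        printf("y[0] = %g (expected %g)\n", y[0], 0.5 * np * (np - 1));

    free(x); free(y);
    MPI_Finalize();
    return 0;
}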

Tony




[OMPI users] Parallel application performance tests

2006-11-28 Thread Tony Ladd
I have recently completed a number of performance tests on a Beowulf
cluster, using up to 48 dual-core P4D nodes, connected by an Extreme
Networks Gigabit edge switch. The tests consist of single and multi-node
application benchmarks, including DLPOLY, GROMACS, and VASP, as well as
specific tests of network cards and switches. I used TCP sockets with
OpenMPI v1.2 and MPI/GAMMA over Gigabit ethernet. MPI/GAMMA leads to
significantly better scaling than OpenMPI/TCP in both network tests and in
application benchmarks. The overall performance of the MPI/GAMMA cluster on
a per cpu basis was found to be comparable to a dual-core Opteron cluster
with an Infiniband interconnect. The DLPOLY benchmark showed similar scaling
to that reported for an IBM p690. The performance using TCP was typically a
factor of 2 less in these same tests. A detailed write up can be found at:
http://ladd.che.ufl.edu/research/beoclus/beoclus.htm

Tony Ladd
Chemical Engineering
University of Florida




[OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-17 Thread Tony Ladd via users
I would very much appreciate some advice in how to debug this problem. I 
am trying to get OpenMPI to work on my reconfigured cluster - upgrading 
from Centos 5 to Ubuntu 18. The problem is that a simple job using 
Intel's IMB message passing test code will not run on any of the new 
clients (4 so far).


mpirun -np 2 IMB-MPI1

just hangs - no printout, no messages in syslog. I left it for 1 hr and 
it remained in the same state.


On the other hand the same code runs fine on the server (see outfoam). 
Comparing the two it seems the client version hangs while trying to load 
the openib module (it works with tcp,self or vader,self).
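
For reference, forcing the working transports means launching with something
like

mpirun --mca btl tcp,self -np 2 IMB-MPI1

(and similarly vader,self for the shared-memory case).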


Digging a bit more I found the --mca btl_base_verbose option. Now I can 
see a difference in the two cases:


On the server: ibv_obj->logical_index=1, my_obj->logical_index=0

On the client: ibv_obj->type set to NULL. I don't believe this is a good 
sign, but I don't understand what it means. My guess is that openib is 
not being initialized in this case.


The server (foam) is a SuperMicro server with an X10DAi motherboard and 
2 x E5-2630 (10 core).


The client (f34) is a Dell R410 server with 2 x E5620 (4 core). The 
outputs from ompi_info are attached.


They are both running Ubuntu 18.04 with the latest updates. I installed 
openmpi-bin 2.1.1-8. Both boxes have Mellanox Connect X2 cards with the 
latest firmware (2.9.1000). I have checked that the cards send and 
receive packets using the IB protocols and pass the Mellanox diagnostics.


I did notice that the Mellanox card has the PCI address 81:00.0 on the 
server but 03:00.0 on the client. Not sure of the significance of this.


Any help anyone can offer would be much appreciated. I am stuck.

Thanks

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

 Package: Open MPI buildd@lcy01-amd64-009 Distribution
Open MPI: 2.1.1
  Open MPI repo revision: v2.1.0-100-ga2fdb5b
   Open MPI release date: May 10, 2017
Open RTE: 2.1.1
  Open RTE repo revision: v2.1.0-100-ga2fdb5b
   Open RTE release date: May 10, 2017
OPAL: 2.1.1
  OPAL repo revision: v2.1.0-100-ga2fdb5b
   OPAL release date: May 10, 2017
 MPI API: 3.1.0
Ident string: 2.1.1
  Prefix: /usr
 Configured architecture: x86_64-pc-linux-gnu
  Configure host: lcy01-amd64-009
   Configured by: buildd
   Configured on: Mon Feb  5 19:59:59 UTC 2018
  Configure host: lcy01-amd64-009
Built by: buildd
Built on: Mon Feb  5 20:05:56 UTC 2018
  Built host: lcy01-amd64-009
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to 
limitations in the gfortran compiler, does not support the following: array 
subsections, direct passthru (where possible) to underlying Open MPI's C 
functionality
  Fort mpi_f08 subarrays: no
   Java bindings: yes
  Wrapper compiler rpath: disabled
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
  C compiler version: 7.3.0
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
   Fort compiler: gfortran
   Fort compiler abs: /usr/bin/gfortran
 Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
  Fort optional args: yes
  Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
   Fort STORAGE_SIZE: yes
  Fort BIND(C) (all): yes
  Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
   Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
  Fort PROTECTED: yes
   Fort ABSTRACT: yes
   Fort ASYNCHRONOUS: yes
  Fort PROCEDURE: yes
 Fort USE...ONLY: yes
   Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
 Fort MPI_SIZEOF: yes
 C profiling: yes
   C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
  C++ exceptions: no
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
   Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
  dl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
 MPI I/O support: yes
   MPI_WTIME support: native
 Symbol vis. support: yes
   Host to

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-17 Thread Tony Ladd via users
My apologies - I did not read the FAQs carefully enough - with regard 
to #14:


1. openib

2. Ubuntu supplied drivers etc.

3. Ubuntu 18.04  4.15.0-112-generic

4. opensm-3.3.5_mlnx-0.1.g6b18e73

5. Attached

6. Attached

7. unlimited on foam and 16384 on f34

I changed the ulimit to unlimited on f34 but it did not help.

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

foam:root(ib)> ibv_devinfo
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.9.1000
node_guid:  0002:c903:000f:666e
sys_image_guid: 0002:c903:000f:6671
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   26
port_lmc:   0x00
link_layer: InfiniBand

root@f34:/home/tladd# ibv_devinfo 
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.9.1000
node_guid:  0002:c903:000a:af92
sys_image_guid: 0002:c903:000a:af95
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   32
port_lmc:   0x00
link_layer: InfiniBand

root@f34:/home/tladd# ifconfig
eno1: flags=4163  mtu 1500
inet 10.1.2.34  netmask 255.255.255.0  broadcast 10.1.2.255
inet6 fe80::862b:2bff:fe18:3729  prefixlen 64  scopeid 0x20
ether 84:2b:2b:18:37:29  txqueuelen 1000  (Ethernet)
RX packets 1015244  bytes 146716710 (146.7 MB)
RX errors 0  dropped 234903  overruns 0  frame 0
TX packets 176298  bytes 17106041 (17.1 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163  mtu 2044
inet 10.2.2.34  netmask 255.255.255.0  broadcast 10.2.2.255
inet6 fe80::202:c903:a:af93  prefixlen 64  scopeid 0x20
unspec 80-00-02-08-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  
(UNSPEC)
RX packets 289257  bytes 333876570 (333.8 MB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 140385  bytes 324882131 (324.8 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73  mtu 65536
inet 127.0.0.1  netmask 255.0.0.0
inet6 ::1  prefixlen 128  scopeid 0x10
loop  txqueuelen 1000  (Local Loopback)
RX packets 317853  bytes 21490738 (21.4 MB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 317853  bytes 21490738 (21.4 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

foam:root(ib)> ifconfig
enp4s0: flags=4163  mtu 1500
inet 10.1.2.251  netmask 255.255.255.0  broadcast 10.1.2.255
inet6 fe80::ae1f:6bff:feb1:7f02  prefixlen 64  scopeid 0x20
ether ac:1f:6b:b1:7f:02  txqueuelen 1000  (Ethernet)
RX packets 1092343  bytes 98282221 (98.2 MB)
RX errors 0  dropped 176607  overruns 0  frame 0
TX packets 248746  bytes 206951391 (206.9 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device memory 0xf040-f047  

enp5s0: flags=4163  mtu 1500
inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
inet6 fe80::ae1f:6bff:feb1:7f03  prefixlen 64  scopeid 0x20
ether ac:1f:6b:b1:7f:03  txqueuelen 1000  (Ethernet)
RX packets 1039387  bytes 87199457 (87.1 MB)
RX errors 0  dropped 187625  overruns 0  frame 0
TX packets 5884980  bytes 8649612519 (8.6 GB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device memory 0xf030-f037  

enp6s0: flags=4163  mtu 1500
inet 10.227.121.95  netmask 255.255.255.0  broadcast 10.227.121.255
inet6 fe80::6a05:caff:febd:397c  prefixlen 64  scopeid 0x20
ether 68:05:ca:bd:3

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-19 Thread Tony Ladd via users
One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but 
there is no mention of ibv_obj this time.


Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

f34:tladd(~)> mpirun -d --report-bindings --mca btl_openib_allow_ib 1 --mca btl 
openib,self --mca btl_base_verbose 30 -np 2 
mpi-benchmarks-IMB-v2019.3/src_c/IMB-MPI1 SendRecv
[f34:24079] procdir: /tmp/ompi.f34.501/pid.24079/0/0
[f34:24079] jobdir: /tmp/ompi.f34.501/pid.24079/0
[f34:24079] top: /tmp/ompi.f34.501/pid.24079
[f34:24079] top: /tmp/ompi.f34.501
[f34:24079] tmp: /tmp
[f34:24079] sess_dir_cleanup: job session dir does not exist
[f34:24079] sess_dir_cleanup: top session dir not empty - leaving
[f34:24079] procdir: /tmp/ompi.f34.501/pid.24079/0/0
[f34:24079] jobdir: /tmp/ompi.f34.501/pid.24079/0
[f34:24079] top: /tmp/ompi.f34.501/pid.24079
[f34:24079] top: /tmp/ompi.f34.501
[f34:24079] tmp: /tmp
[f34:24079] [[62672,0],0] Releasing job data for [INVALID]
[f34:24079] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[f34:24079] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
(i, host, exe, pid) = (0, f34, 
/home/tladd/mpi-benchmarks-IMB-v2019.3/src_c/IMB-MPI1, 24083)
(i, host, exe, pid) = (1, f34, 
/home/tladd/mpi-benchmarks-IMB-v2019.3/src_c/IMB-MPI1, 24084)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[f34:24084] procdir: /tmp/ompi.f34.501/pid.24079/1/1
[f34:24084] jobdir: /tmp/ompi.f34.501/pid.24079/1
[f34:24084] top: /tmp/ompi.f34.501/pid.24079
[f34:24084] top: /tmp/ompi.f34.501
[f34:24084] tmp: /tmp
[f34:24083] procdir: /tmp/ompi.f34.501/pid.24079/1/0
[f34:24083] jobdir: /tmp/ompi.f34.501/pid.24079/1
[f34:24083] top: /tmp/ompi.f34.501/pid.24079
[f34:24083] top: /tmp/ompi.f34.501
[f34:24083] tmp: /tmp
[f34:24084] mca: base: components_register: registering framework btl components
[f34:24084] mca: base: components_register: found loaded component self
[f34:24084] mca: base: components_register: component self register function 
successful
[f34:24084] mca: base: components_register: found loaded component openib
[f34:24084] mca: base: components_register: component openib register function 
successful
[f34:24084] mca: base: components_open: opening btl components
[f34:24084] mca: base: components_open: found loaded component self
[f34:24084] mca: base: components_open: component self open function successful
[f34:24084] mca: base: components_open: found loaded component openib
[f34:24084] mca: base: components_open: component openib open function 
successful
[f34:24084] select: initializing btl component self
[f34:24084] select: init of component self returned success
[f34:24084] select: initializing btl component openib
[f34:24083] mca: base: components_register: registering framework btl components
[f34:24083] mca: base: components_register: found loaded component self
[f34:24083] mca: base: components_register: component self register function 
successful
[f34:24083] mca: base: components_register: found loaded component openib
[f34:24083] mca: base: components_register: component openib register function 
successful
[f34:24083] mca: base: components_open: opening btl components
[f34:24083] mca: base: components_open: found loaded component self
[f34:24083] mca: base: components_open: component self open function successful
[f34:24083] mca: base: components_open: found loaded component openib
[f34:24083] mca: base: components_open: component openib open function 
successful
[f34:24083] select: initializing btl component self
[f34:24083] select: init of component self returned success
[f34:24083] select: initializing btl component openib
[f34:24084] Checking distance from this process to device=mlx4_0
[f34:24084] Process is not bound: distance to device is 0.00
[f34:24083] Checking distance from this process to device=mlx4_0
[f34:24083] Process is not bound: distance to device is 0.00
[f34:24083] [rank=0] openib: using port mlx4_0:1
[f34:24083] select: init of component openib returned success
[f34:24084] [rank=1] openib: using port mlx4_0:1
[f34:24084] select: init of component openib returned success
[f34:24083] mca: bml: Using self btl for send to [[62672,1],0] on node f34
[f34:24084] mca: bml: Using self btl for send to [[62672,1],1] on node f34
^C
[f34:24079] sess_dir_finalize: proc session dir does not exist
[f34:24079] sess_dir_finalize: job session dir does not exist
[f34:24079] sess_dir_finalize: jobfam session dir not empty - leaving
[f34:24079] sess_dir_finalize: jobfam session dir not empty - leaving
[f34:24079] sess_dir_finalize: top session dir not empty - leaving
[f34:24079] sess_dir_finalize: proc session

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-22 Thread Tony Ladd via users

Hi Jeff

I installed ucx as you suggested. But I can't get even the simplest code 
(ucp_client_server) to work across the network. I can compile openMPI 
with UCX but it has the same problem - mpi codes will not execute and 
there are no messages. Really, UCX is not helping. It is adding another 
(not so well documented) software layer, which does not offer better 
diagnostics as far as I can see. It's also unclear to me how to control 
what drivers are being loaded - UCX wants to make that decision for you. 
With openMPI I can see that (for instance) the tcp module works both 
locally and over the network - it must be using the Mellanox NIC for the 
bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But if I 
try to use openib (or allow ucx or openmpi to choose the transport 
layer) it just hangs. Annoyingly I have this server where everything 
works just fine - I can run locally over openib and it's fine. All the 
other nodes cannot seem to load openib so even local jobs fail.


The only good (as best I can tell) diagnostic is from openMPI. ibv_obj 
(from v2.x) complains  that openib returns a NULL object, whereas on my 
server it returns logical_index=1. Can we not try to diagnose the 
problem with openib not loading (see my original post for details). I am 
pretty sure if we can that would fix the problem.


Thanks

Tony

PS I tried configuring two nodes back to back to see if it was a switch 
issue, but the result was the same.



On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:

[External Email]

Tony --

Have you tried compiling Open MPI with UCX support?  This is Mellanox 
(NVIDIA's) preferred mechanism for InfiniBand support these days -- the openib 
BTL is legacy.

You can run: mpirun --mca pml ucx ...



On Aug 19, 2020, at 12:46 PM, Tony Ladd via users  
wrote:

One other update. I compiled OpenMPI-4.0.4 The outcome was the same but there 
is no mention of ibv_obj this time.

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514




--
Jeff Squyres
jsquy...@cisco.com


--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-23 Thread Tony Ladd via users

Hi John

Thanks for the response. I have run all those diagnostics, and as best I 
can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + 
server) and the fabric passes all the tests. There is 1 warning:


I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between 
different pairs of nodes and it reports 3.22 GB/sec which is reasonable 
(it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes back 
to back to check that the switch is not the problem - it makes no 
difference.


I have been playing with the btl params with openMPI (v. 2.1.1 which is 
what is released in Ubuntu 18.04). So with tcp as the transport layer 
everything works fine - 1 node or 2 node communication - I have tested 
up to 16 processes (8+8) and it seems fine. Of course the latency is 
much higher on the tcp interface, so I would still like to access the 
RDMA layer. But unless I exclude the openib module, it always hangs. 
Same with OpenMPI v4 compiled from source.
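
(Excluding it means launching with something like "mpirun --mca btl ^openib 
...", where the caret tells Open MPI to use every BTL except the ones 
listed.)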


I think an important component is that Mellanox has not supported the 
Connect X2 for some time. This is really infuriating; a $500 network 
card with no supported drivers, but that is business for you I suppose. 
I have 50 NICs and I can't afford to replace them all. The other 
component is that MLNX-OFED is tied to specific software versions, so I 
can't just run an older set of drivers. I have not seen source files for 
the Mellanox drivers - I would take a crack at compiling them if I did. 
In the past I have used the OFED drivers (on Centos 5) with no problem, 
but I don't think this is an option now.


Ubuntu claims to support Connect X2 with their drivers (Mellanox 
confirms this), but of course this is community support and the number 
of cases is obviously small. I use the Ubuntu drivers right now because 
the OFED install seems broken and there is no help with it. It's not 
supported! Neat huh?


The only handle I have is with openmpi v. 2 when there is a message (see 
my original post) that ibv_obj returns a NULL result. But I don't 
understand the significance of the message (if any).


I am not enthused about UCX - the documentation has several obvious 
typos in it, which is not encouraging when you are floundering. I know it's 
a newish project but I have used openib for 10+ years and it's never had 
a problem until now. I think this is not so much openib as the software 
below. One other thing I should say is that if I run any recent version 
of mstflint it always complains:


Failed to identify the device - Can not create SignatureManager!

Going back to my original OFED 1.5 this did not happen, but they are at 
v5 now.


Everything else works as far as I can see. But I could not burn new 
firmware except by going back to the 1.5 OS. Perhaps this is connected 
with the ibv_obj = NULL result.


Thanks for helping out. As you can see I am rather stuck.

Best

Tony

On 8/23/20 3:01 AM, John Hearns via users wrote:

*[External Email]*

Tony, start at a low level. Is the Infiniband fabric healthy?
Run
ibstatus   on every node
sminfo on one node
ibdiagnet on one node

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users <users@lists.open-mpi.org> wrote:


Hi Jeff

I installed ucx as you suggested. But I can't get even the
simplest code
(ucp_client_server) to work across the network. I can compile openMPI
with UCX but it has the same problem - mpi codes will not execute and
there are no messages. Really, UCX is not helping. It is adding
another
(not so well documented) software layer, which does not offer better
diagnostics as far as I can see. Its also unclear to me how to
control
what drivers are being loaded - UCX wants to make that decision
for you.
With openMPI I can see that (for instance) the tcp module works both
locally and over the network - it must be using the Mellanox NIC
for the
bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But
if I
try to use openib (or allow ucx or openmpi to choose the transport
layer) it just hangs. Annoyingly I have this server where everything
works just fine - I can run locally over openib and its fine. All the
other nodes cannot seem to load openib so even local jobs fail.

The only good (as best I can tell) diagnostic is from openMPI.
ibv_obj
(from v2.x) complains  that openib returns a NULL object, whereas
on my
server it returns logical_index=1. Can we not try to diagnose the
problem with openib not loading (see my original post for
details). I am
pretty sure if we can that would fix the problem.

Thanks

Tony

PS I tried configuring two nodes back to back to see if it was a
switch
issue, 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-24 Thread Tony Ladd via users

Hi Jeff

I appreciate your help (and John's as well). At this point I don't think 
it is an OMPI problem - my mistake. I think the communication with RDMA is 
somehow disabled (perhaps it's the verbs layer - I am not very 
knowledgeable about this). It used to work like a dream but Mellanox has 
apparently disabled some of the Connect X2 components, because neither 
ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some 
of the infiniband functions are also not working on the X2 (mstflint, 
mstconfig).


In fact ompi always tries to access the openib module. I have to 
explicitly disable it even to run on 1 node. So I think it is in 
initialization, not communication, that the problem lies. This is why (I 
think) ibv_obj returns NULL. The better news is that with the tcp stack 
everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is 
similar to rdma so for large messages it's semi-OK. It's a partial 
solution - not all I wanted of course. The direct rdma functions 
ib_read_lat etc. also work fine with expected results. I am suspicious 
this disabling of the driver is a commercial more than a technical decision.


I am going to try going back to Ubuntu 16.04 - there is a version of 
OFED that still supports the X2. But I think it may still get messed up 
by kernel upgrades (it does for 18.04 I found). So its not an easy path.


Thanks again.

Tony

On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:

[External Email]

I'm afraid I don't have many better answers for you.

I can't quite tell from your machines, but are you running IMB-MPI1 Sendrecv 
*on a single node* with `--mca btl openib,self`?

I don't remember offhand, but I didn't think that openib was supposed to do loopback 
communication.  E.g., if both MPI processes are on the same node, `--mca btl 
openib,vader,self` should do the trick (where "vader" = shared memory support).

More specifically: are you running into a problem running openib (and/or UCX) 
across multiple nodes?

I can't speak to Nvidia support on various models of [older] hardware 
(including UCX support on that hardware).  But be aware that openib is 
definitely going away; it is wholly being replaced by UCX.  It may be that your 
only option is to stick with older software stacks in these hardware 
environments.



On Aug 23, 2020, at 9:46 PM, Tony Ladd via users  
wrote:

Hi John

Thanks for the response. I have run all those diagnostics, and as best I can 
tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server) 
and the fabric passes all the tests. There is 1 warning:

I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between different 
pairs of nodes and it reports 3.22 GB/sec which is reasonable (its PCIe 2 x8 
interface ie 4 GB/s). I have also configured 2 nodes back to back to check that 
the switch is not the problem - it makes no difference.

I have been playing with the btl params with openMPI (v. 2.1.1 which is what is 
relelased in Ubuntu 18.04). So with tcp as the transport layer everything works 
fine - 1 node or 2 node communication - I have tested up to 16 processes (8+8) 
and it seems fine. Of course the latency is much higher on the tcp interface, 
so I would still like to access the RDMA layer. But unless I exclude the openib 
module, it always hangs. Same with OpenMPI v4 compiled from source.

I think an important component is that Mellanox is not supporting Connect X2 
for some time. This is really infuriating; a $500 network card with no 
supported drivers, but that is business for you I suppose. I have 50 NICS and I 
can't afford to replace them all. The other component is the MLNX-OFED is tied 
to specific software versions, so I can't just run an older set of drivers. I 
have not seen source files for the Mellanox drivers - I would take a crack at 
compiling them if I did. In the past I have used the OFED drivers (on Centos 5) 
with no problem, but I don't think this is an option now.

Ubuntu claims to support Connect X2 with their drivers (Mellanox confirms 
this), but of course this is community support and the number of cases is 
obviously small. I use the Ubuntu drivers right now because the OFED install 
seems broken and there is no help with it. Its not supported! Neat huh?

The only handle I have is with openmpi v. 2 when there is a message (see my 
original post) that ibv_obj returns a NULL result. But I don't understand the 
significance of the message (if any).

I am not enthused about UCX - the documentation has several obvious typos in 
it, which is not encouraging when you a floundering. I know its a newish 
project but I have used openib for 10+ years and its never had a problem until 
now. I think this

Re: [OMPI users] Problem in starting openmpi job - no output just hangs - SOLVED

2020-09-01 Thread Tony Ladd via users

Jeff

I found the solution - RDMA needs to lock significant amounts of memory, so 
the memlock limits on the shell have to be increased. I needed to add the lines


* soft memlock unlimited
* hard memlock unlimited

to the end of the file /etc/security/limits.conf. After that the openib 
driver loads and everything is fine - proper IB latency again.


I see that # 16 of the tuning FAQ discusses the same issue, but in my 
case there was no error or warning message. I am posting this in case 
anyone else runs into this issue.
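
A quick way to check is to run "ulimit -l" in a freshly logged-in shell on 
each node (the limits.conf change only applies to new login sessions); it 
should now report "unlimited".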


The Mellanox OFED install adds those lines automatically, so I had not 
run into this before.


Tony


On 8/25/20 10:42 AM, Jeff Squyres (jsquyres) wrote:

[External Email]

On Aug 24, 2020, at 9:44 PM, Tony Ladd  wrote:

I appreciate your help (and John's as well). At this point I don't think is an 
OMPI problem - my mistake. I think the communication with RDMA is somehow 
disabled (perhaps its the verbs layer - I am not very knowledgeable with this). 
It used to work like a dream but Mellanox has apparently disabled some of the 
Connect X2 components, because neither ompi or ucx (with/without ompi) could 
connect with the RDMA layer. Some of the infiniband functions are also not 
working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI won't 
work, either (with openib or UCX).

You can try to keep poking with the low-layer diagnostic tools like ibv_devinfo 
and ibv_rc_pingpong.  If those don't work, Open MPI won't work over IB, either.


In fact ompi always tries to access the openib module. I have to explicitly 
disable it even to run on 1 node.

Yes, that makes sense: Open MPI will aggressively try to use every possible 
mechanism.


So I think it is in initialization not communication that the problem lies.

I'm not sure that's correct.

 From your initial emails, it looks like openib thinks it initialized properly.


This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not.  That section of output is where Open 
MPI is measuring the distance from the current process to the PCI bus where the 
device lives.  I don't remember offhand if returning NULL in that area is 
actually a problem or just an indication of some kind of non-error condition.

Specifically: if returning NULL there was a problem, we *probably* would have 
aborted at that point.  I have not looked at the code to verify that, though.


The better news is that with the tcp stack everything works fine (ompi, ucx, 1 
node, many nodes) - the bandwidth is similar to rdma so for large messages its 
semi OK. Its a partial solution - not all I wanted of course. The direct rdma 
functions ib_read_lat etc also work fine with expected results. I am suspicious 
this disabling of the driver is a commercial more than a technical decision.
I am going to try going back to Ubuntu 16.04 - there is a version of OFED that 
still supports the X2. But I think it may still get messed up by kernel 
upgrades (it does for 18.04 I found). So its not an easy path.


I can't speak for Nvidia here, sorry.

--
Jeff Squyres
jsquy...@cisco.com


--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



Re: [OMPI users] MPI Exit Code:1 on an OpenFoam application

2021-01-05 Thread Tony Ladd via users
ated. The first process to do so was:

  Process name: [[32896,1],0]
  Exit code:    1
-- 




I looked into the system monitor, but I didn't have a process with
this name or number.

If I execute
mpirun --version
the console replies with this message:
mpirun (Open MPI) 4.0.3

Report bugs to 
http://www.open-mpi.org/community/help/


How can I solve this problem ?

Best regards
Kai











--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



Re: [OMPI users] MPI Exit Code:1 on an OpenFoam application

2021-01-10 Thread Tony Ladd via users

Kai

That means your case directory is mostly OK. Exactly what command did 
you use to run the executable? By serial mode I actually meant a single 
processor. For example


simpleFoam

But then it's surprising if it uses multiple cores. But it may be using 
multithreading by default.


On a multicore node something like

simpleFoam -parallel

might use a default number of cores (probably all of them).

For a proper parallel job you need to decompose the problem first with 
decomposePar. One possible source of error is the decomposeParDict file. 
If you don't get a proper decomposition that can be a problem. There are 
examples online.
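
A minimal system/decomposeParDict looks something like this (a sketch only - 
the standard FoamFile header is omitted, 4 subdomains is just an example, 
and scotch is one of several possible methods):

numberOfSubdomains 4;
method             scotch;

The numberOfSubdomains entry should match the -np you give mpirun.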


My typical run script would be something like

decomposePar

mpirun -np 4 simpleFoam -parallel 2>&1 | tee log

reconstructPar

You can check the decomposition with paraview (in the individual 
processor dirs)


Tony


On 1/10/21 5:29 AM, Kahnbein Kai via users wrote:

[External Email]

Hey Tony,
it works without the -parallel flag, all four cpus are at 100% and
running fine.

Best regards
Kai

On 05.01.21 at 20:36, Tony Ladd via users wrote:

Just run the executable without mpirun and the -parallel flag.

On 1/2/21 11:39 PM, Kahnbein Kai via users wrote:

*[External Email]*

OK, sorry, what do you mean by the "serial version"?

Best regards
Kai

On 31.12.20 at 16:25, tladd via users wrote:


I did not see the whole email chain before. The problem is not that
it cannot find the MPI directories. I think this INIT error comes
when the program cannot start for some reason. For example a missing
input file. Does the serial version work?


On 12/31/20 6:33 AM, Kahnbein Kai via users wrote:

*[External Email]*

I compared the /etc/bashrc files of both versions of OF (v7 and v8)
and I didn't find any difference.
Here are the lines (which I thought were related to openmpi) from both files:

OpenFOAM v7:
Line 86 till 89:
#- MPI implementation:
#    WM_MPLIB = SYSTEMOPENMPI | OPENMPI | SYSTEMMPI | MPICH |
MPICH-GM | HPMPI
#   | MPI | FJMPI | QSMPI | SGIMPI | INTELMPI
export WM_MPLIB=SYSTEMOPENMPI

Line 169 till 174:
# Source user setup files for optional packages
# ~
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/mpi`
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/paraview`
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/ensight`
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/gperftools`

OpenFOAM v8:
Line 86 till 89:
#- MPI implementation:
#    WM_MPLIB = SYSTEMOPENMPI | OPENMPI | SYSTEMMPI | MPICH |
MPICH-GM | HPMPI
#   | MPI | FJMPI | QSMPI | SGIMPI | INTELMPI
export WM_MPLIB=SYSTEMOPENMPI

Line 169 till 174:
# Source user setup files for optional packages
# ~
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/mpi`
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/paraview`
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/ensight`
_foamSource `$WM_PROJECT_DIR/bin/foamEtcFile config.sh/gperftools`


Do you think these are the right lines?

I wish you a healthy start into the new year,
Kai

On 30.12.20 at 15:25, tladd via users wrote:

Probably because OF cannot find your mpi installation. Once you
set your OF environment, where is it looking for mpicc? Note the
OF environment overrides your .bashrc once you source the OF
bashrc. That takes its settings from the src/etc directory in the
OF source code.


On 12/29/20 10:23 AM, Kahnbein Kai via users wrote:

[External Email]

Thank you for this hint. I installed OpenFOAM v8 (the newest) on my
computer and it works ...

With version v7 I still get this MPI error ...

I don't know why ...

I wish you a healthy start into the new year :)


On 28.12.20 at 19:16, Benson Muite via users wrote:

Have you tried reinstalling OpenFOAM? If you are mostly working on a
desktop, there are pre-compiled versions available:
https://openfoam.com/download/

If you are using a pre-compiled version, do also consider reporting
the error to the packager. It seems unlikely to be an MPI error, more
likely something with OpenFOAM and/or the setup.

On 12/28/20 6:25 PM, Kahnbein Kai via users wrote:

Good morning,
I'm trying to fix this error by myself and I have a little update.
The ompi version i use is the:
Code:

kai@Kai-Desktop:~/Dokumente$ mpirun --version
mpirun (Open MPI) 4.0.3

If i create a *.c file, with the following content:
Code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
 // Initialize the MPI environment
 MPI_Init(NULL, NULL);

 // Get the number of processes
 int world_size;
 MPI_Comm_size(MPI_COMM_WORLD, &world_size);

 // Get the rank of the process
 int world_rank;
 M