[O-MPI users] does btl_openib work ?

2006-01-18 Thread Jean-Christophe Hugly

Hi,

I have been trying for the past few days to get an MPI application (the
Pallas benchmark, PMB) to run with ompi and openib.

My environment:
===
. two quad-CPU hosts with one Mellanox HCA each.
. the hosts run SUSE 10 (kernel 2.6.13) with the latest (or close to it)
OpenIB stack (rev 4904, specifically).
. opensm runs on a third machine with the same OS.
. openmpi is built from openmpi-1.1a1r8727.tar.bz2

Behaviour:
==
. openib itself seems to behave OK (ipoib works, rdma_bw and rdma_lat
work, osm works)
. I can mpirun any non-MPI program like ls, hostname, or ompi_info just
fine.
. I can mpirun the Pallas bm on any single host (the local one or the
other).
. I can mpirun the Pallas bm on the two nodes provided that I disable
the openib btl.
. If I try to use the openib btl, the bm does not start (at best I get
the initial banner, sometimes not even that). On both hosts, the PMB
processes (the correct number on each host) spin at 99% CPU.

I obtained the exact same behaviour with the following src packages:
 openmpi-1.0.1.tar.bz2
 openmpi-1.0.2a3r8706.tar.bz2
 openmpi-1.1a1r8727.tar.bz2

Earlier on, I also ran the same experiment with openmpi-1.0.1 and the
stock gen2 stack of the SUSE kernel; same result.

Configuration:
==
For building, I tried the following variants:

./configure --prefix=/opt/ompi --enable-mpi-threads --enable-progress-thread
./configure --prefix=/opt/ompi
./configure --prefix=/opt/ompi --disable-smp-locks

I also tried many variations of mca-params.conf. What I normally use when
trying openib is:
rmaps_base_schedule_policy = node
btl = ^tcp
mpi_paffinity_alone = 1

The mpirun cmd I normally use is:
mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 PMB-MPI1

My machine file being:
bench1 slots=4 max-slots=4
bench2 slots=4 max-slots=4
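
To be concrete, "disable the openib btl" above means overriding the btl
list on the mpirun command line. Roughly (the exact MCA selection syntax
here is from memory, so treat it as a sketch rather than gospel):

# everything but openib -- this runs to completion:
mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 -mca btl ^openib PMB-MPI1
# force openib (plus self) -- this is the case that hangs:
mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 -mca btl openib,self PMB-MPI1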

Am I doing something obviously wrong ?

Thanks for any help !

-- 
Jean-Christophe Hugly 
PANTA



[O-MPI users] does btl_openib work ?

2006-01-18 Thread Jean-Christophe Hugly
More info.

Environment of remotely exec'ed progs (by running mpirun ... env):
==

OMPI_MCA_rds_hostfile_path=/root/machines
SHELL=/bin/bash
SSH_CLIENT=10.1.40.61 42076 22
USER=root
LD_LIBRARY_PATH=/opt/ompi/lib:
LS_COLORS=
MAIL=/var/mail/root
PATH=/opt/ompi/bin:/usr/bin:/bin:/usr/sbin:/sbin
PWD=/root
LANG=en_US.UTF-8
SHLVL=1
HOME=/root
LS_OPTIONS=-a -N --color=none -T 0
LOGNAME=root
SSH_CONNECTION=10.1.40.61 42076 10.1.40.63 22
_=/opt/ompi/bin/orted
OMPI_MCA_universe=root@bench1:default-universe-10053
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_ns_replica_uri=0.0.0;tcp://10.1.40.61:33050
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://10.1.40.61:33050
OMPI_MCA_orte_base_nodename=bench2
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0

ulimits (by running mpirun ... sh -c "ulimit -a"):
==

core file size(blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) 1073741824
max memory size   (kbytes, -m) unlimited
open files(-n) 1024
pipe size  (512 bytes, -p) 8
stack size(kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes(-u) 81920
virtual memory(kbytes, -v) unlimited

Did I forget anything ?

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] does btl_openib work ?

2006-01-30 Thread Jean-Christophe Hugly
On Fri, 2006-01-20 at 09:37 -0700, Galen M. Shipman wrote:
> Jean,
> 
> I am not able to reproduce this problem on a non-threaded build, can  
> you try taking a fresh src package and configuring without thread  
> support. I am wondering if this is simply a threading issue. I did  
> note that you said you configured both with and without threads but  
> try the configure on a fresh source, not on one that had previously  
> been configured with thread support.

I rebuilt everything from a fresh src tree (took the opportunity to refresh).
Same behaviour...
Am I the only one ?

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] does btl_openib work ?

2006-02-02 Thread Jean-Christophe Hugly
On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:
> By using slots=4 you are telling Open MPI to put the first 4  
> processes on the "bench1" host.
> Open MPI will therefore use shared memory to communicate between the  
> processes not Infiniband.

Well, actually no, unless I'm mistaken about that. In my
mca-params.conf I have:

rmaps_base_schedule_policy = node

That would spread processes over nodes, right ?
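
As a sanity check (this is just how I would verify it on my setup, not
something from the docs), running a trivial non-MPI command with -np 2
should land one process on each host if the by-node policy is honored:

mpirun -prefix /opt/ompi -machinefile /root/machines -np 2 hostname
# should print bench1 and bench2, once each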

> You might try:
> 
> 
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np  
> 2 -d xterm -e gdb PMB-MPI1

Thanks for the tip. The last time I tried this it took quite a few
attempts before getting it right. As I did not remember the magic trick,
I was somewhat reluctant to go in that direction. Since you just handed
me the recipe on a silver platter, I'll do it.

J-C




Re: [O-MPI users] does btl_openib work ?

2006-02-02 Thread Jean-Christophe Hugly
On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:

> Is it possible for you to get a stack trace where this is hanging?
> 
> You might try:
> 
> 
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np  
> 2 -d xterm -e gdb PMB-MPI1
> 
> 

I did that, and when it was hanging I control-C'd in each gdb and asked
for a bt.
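
Concretely, the gdb part was nothing fancy; in each xterm I did:

(gdb) run
  <wait for the hang, then hit Ctrl-C>
(gdb) bt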

Here's the debug output from the mpirun command:
==

mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2
-d xterm -e gdb PMB-MPI1
[bench1:16017] procdir: (null)
[bench1:16017] jobdir: (null)
[bench1:16017]
unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] connect_uni: contact info read
[bench1:16017] connect_uni: connection not allowed
[bench1:16017] [0,0,0] setting up session dir with
[bench1:16017]  tmpdir /tmp
[bench1:16017]  universe default-universe-16017
[bench1:16017]  user root
[bench1:16017]  host bench1
[bench1:16017]  jobid 0
[bench1:16017]  procid 0
[bench1:16017]
procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/0
[bench1:16017]
jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16017]
unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] [0,0,0]
contact_file 
/tmp/openmpi-sessions-root@bench1_0/default-universe-16017/universe-setup.txt
[bench1:16017] [0,0,0] wrote setup file
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x1)
[bench1:16017] pls:rsh: local csh: 0, local bash: 1
[bench1:16017] pls:rsh: assuming same remote shell as local shell
[bench1:16017] pls:rsh: remote csh: 0, remote bash: 1
[bench1:16017] pls:rsh: final template argv:
[bench1:16017] pls:rsh: /usr/bin/ssh -X  orted --debug
--bootproxy 1 --name  --num_procs 3 --vpid_start 0 --nodename
 --universe root@bench1:default-universe-16017 --nsreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16017] pls:rsh: launching on node bench2
[bench1:16017] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench2 is a REMOTE node
[bench1:16017] pls:rsh: executing: /usr/bin/ssh -X bench2
PATH=/opt/ompi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ompi/lib:
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/ompi/bin/orted --debug
--bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename
bench2 --universe root@bench1:default-universe-16017 --nsreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench2:16980] [0,0,1] setting up session dir with
[bench2:16980]  universe default-universe-16017
[bench2:16980]  user root
[bench2:16980]  host bench2
[bench2:16980]  jobid 0
[bench2:16980]  procid 1
[bench2:16980]
procdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0/1
[bench2:16980]
jobdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0
[bench2:16980]
unidir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017
[bench2:16980] top: openmpi-sessions-root@bench2_0
[bench2:16980] tmp: /tmp
[bench1:16017] pls:rsh: launching on node bench1
[bench1:16017] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench1 is a LOCAL node
[bench1:16017] pls:rsh: reset
PATH: 
/opt/ompi/bin:/sbin:/usr/sbin:/usr/local/sbin:/opt/gnome/sbin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/ompi/bin
[bench1:16017] pls:rsh: reset LD_LIBRARY_PATH: /opt/ompi/lib
[bench1:16017] pls:rsh: executing: orted --debug --bootproxy 1 --name
0.0.2 --num_procs 3 --vpid_start 0 --nodename bench1 --universe
root@bench1:default-universe-16017 --nsreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16021] [0,0,2] setting up session dir with
[bench1:16021]  universe default-universe-16017
[bench1:16021]  user root
[bench1:16021]  host bench1
[bench1:16021]  jobid 0
[bench1:16021]  procid 2
[bench1:16021]
procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/2
[bench1:16021]
jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16021]
unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16021] top: openmpi-sessions-root@bench1_0
[bench1:16021] tmp: /tmp
Warning: translation table syntax error: Unknown keysym name:  DRemove
Warning: ... found while parsing 'DRemove: ignore()'
Warning: String to TranslationTable conversion encountered errors
Warning: translation table syntax error: Unknown keysym name:  DRemove
Warning: ... found while parsi

Re: [O-MPI users] does btl_openib work with multiple ports ?

2006-02-07 Thread Jean-Christophe Hugly
[...] configuring ompi for it to work properly with multiple ports ?


-- 
Jean-Christophe Hugly 
PANTA



[O-MPI users] direct openib btl and latency

2006-02-08 Thread Jean-Christophe Hugly

Hi guys,

Does anyone know what the framework costs in terms of latency ?

Right now the latency I get with the openib btl is not great: 5.35 us. I
was looking at what I could do to get it down. I tried to make openib
the only btl built in, but the build process refused.

On the other hand I am not sure it could even work at all, as whenever I
tried, at run time, to limit the list to just one transport (be it tcp or
openib, btw), MPI apps would not start.

Either way, I'm curious whether it's even worth trying, and whether there
are other cuts that can be made to shave off a microsecond or two (ok,
I'll settle for 1.5 :-) ).

Any advice ?

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-08 Thread Jean-Christophe Hugly


> you need to specify both the transport and self, such as:
> mpirun -mca btl self,tcp

I found that the reason I was no longer able to run without openib was
that I had some openib-specific tunables on the command line. I expected
those params would simply get ignored, but instead the job just sat there.

The other funny thing I found is that on the command line I do not need
to specify self; just openib or just tcp will do. But in the param file I
must specify tcp,openib or self,openib; self,tcp would not work. Oh well,
I do not care about tcp anyway :-).
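
To spell out what I observed (again, just observations on this particular
setup, not a statement of how it is supposed to work):

# on the mpirun command line, either of these starts fine:
mpirun ... -mca btl openib ... PMB-MPI1
mpirun ... -mca btl tcp ... PMB-MPI1

# in mca-params.conf:
btl = tcp,openib     # works
btl = self,openib    # works
btl = self,tcp       # apps do not start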

But should I understand from all this that the "direct" mode will never
actually work ? It seems that if you need at least two transports, then
none of them can be the hardwired unique one, right ? Unless there's a
built-in switch between a built-in self and the built-in other
transport.

> For Heroic latencies on IB we would need to use small message RDMA and 
> poll each peers dedicated memory region for completion.

Well, I tried to play around with the eager_limit, min_rdma_size, etc. I
did not see the latency of messages of a given size go down when I changed
the thresholds so that those messages would be RDMA'd; rather the
opposite (which was my initial expectation, actually). Maybe I just
misunderstood the whole set of tunables. My understanding was that
messages under the eager limit would never be RDMA'd, by definition, and
that the others would or would not be, depending on min_rdma_size.
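
For what it's worth, the kind of knobs I was turning looked like this
(the exact parameter names are from memory, so double-check them against
ompi_info before trusting them):

mpirun -np 2 -mca btl openib,self \
       -mca btl_openib_eager_limit 1024 \
       -mca btl_openib_min_rdma_size 8192 \
       PMB-MPI1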

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-08 Thread Jean-Christophe Hugly
> As we plan to drop all support for the old
> > generation of PML/PTL, I don't think it is a wise idea to spend time on
> > the openib PTL to make it work with uniq ...
> >
> >Thanks,
> >  george.
> >
> 
> With the change to ob1/BTLs, there was also a refactoring of data
> structures that reduced the overall latency through the stack. As
> Galen indicated, if you do a direct comparison w/ send/recv semantics,
> I think you will find the overall latency through the stack is lower
> than other implementations (on the order of 0.5 us).

Thanks, guys. I'll stop worrying about that then !


-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Jean-Christophe Hugly

> 
> > So far, the best latency I got from ompi is 5.24 us, and the best I  
> > got from mvapich is 3.15.
> > I am perfectly ready to accept that ompi scales better and that may be
> > more important (except to the marketing dept :-) ), but I do not
> > understand your explanation based on small-message RDMA. Either I
> > missunderstood something badly (my best guess), or the 2 us are lost to
> > something else than an RDMA-size tradeoff.
> >
> Again this is small message RDMA with polling versus send/receive  
> semantics, we will be adding small message RDMA and should have  
> performance equal to that of mvapich for small messages, but it is only  
> relevant for a small working set of peers / micro benchmarks.

Thanks a lot. I was being fooled by the various size thresholds in the
mvapich code; it was indeed doing RDMA for small messages. After turning
that off, I get numbers comparable to yours. Well, mvapich still beats
ompi by a hair on my configuration, 5.11 us vs. 5.25 us, but that is in
the near-irrelevant range compared to other benefits.
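
For these numbers I am just looking at the 0-byte PingPong latency, run
with something like the line below (I believe PMB accepts the benchmark
name as an argument, but check your copy):

mpirun -prefix /opt/ompi -machinefile /root/machines -np 2 -mca btl openib,self PMB-MPI1 PingPong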

From an adoption perspective, though, the ability to shine in
micro-benchmarks is important, even if it means using an ad-hoc tuning.
There is some justification for it after all. There are small clusters
out there (many more than big ones, in fact) so taking maximum advantage
of a small scale is relevant.

When do you plan on having the small-msg rdma option available ?

J-C


-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Jean-Christophe Hugly
On Thu, 2006-02-09 at 14:05 -0700, Ron Brightwell wrote:
> > [...]
> > 
> > From an adoption perspective, though, the ability to shine in
> > micro-benchmarks is important, even if it means using an ad-hoc tuning.
> > There is some justification for it after all. There are small clusters
> > out there (many more than big ones, in fact) so taking maximum advantage
> > of a small scale is relevant.
> 
> I'm obliged to point out that you jumped to a conclusion -- possibly true
> in some cases, but not always.
> 
> You assumed that a performance increase for a two-node micro-benchmark
> would result in an application performance increase for a small cluster.
> Using RDMA for short messages is the default on small clusters *because*
> of the two-node micro-benchmark, not because the cluster is small.

No, I based that assumption on comparisons between doing and not doing
small-msg rdma at various scales, from a paper Galen pointed out to me:
http://www.cs.unm.edu/~treport/tr/05-10/Infiniband.pdf

Benchmarks are what they are. In the above paper, the tests place the
cross-over at around 64 nodes, which confirms a number of anecdotal
reports I got. It may well be that in some situations small-msg rdma is
better only for 2 nodes, but that is not such a likely scenario; reality
is sometimes linear (at least at our scale :-) ) after all.

The scale threshold could be tunable, couldn't it ?

-- 
Jean-Christophe Hugly 
PANTA



Re: [O-MPI users] direct openib btl and latency

2006-02-09 Thread Jean-Christophe Hugly
On Thu, 2006-02-09 at 16:37 -0700, Brightwell, Ronald wrote:

> I apologize if it seems like I'm picking on you.
No offense taken.

>   I'm hypersensitive to
> people trying to make judgements based on micro-benchmark performance.
> I've been trying to make an argument that two-node ping-pong latency
> comparisons really only have meaning in the context of a whole system.

It's very clear to me that micro-benchmarks do not tell you very much
about real application behaviour; that's not the question. They are
nevertheless relevant to me because, right or wrong, people who buy
stuff look at them. And I work for a commercial outfit.

I may sound silly saying that, but they might be right to look at it;
they just need to look at the rest too. A micro-benchmark tells you how
much you have of a given currency, which you can trade for another. It
tells you something about the implementation: how efficient the code is,
how well the hardware is utilized, etc. Not in every respect, but in some.

It also tells you how far you can emphasize a given feature at the
expense of all others, if it happens that at some point in time it is
what you most need.

By making the argument that a particular characteristic is irrelevant,
you are essentially making a hard-coded tradeoff yourself, rather than
letting the users make it.

Back to the specific issue of latency vs. scale: okay, for CG and FT the
cross-over may be <32, but that is not the case for all the benchmarks,
and the difference visible at 32 is pretty small. So it is application
dependent, no question about it, but small-msg rdma is beneficial below a
given (application-dependent) cluster size.

-- 
Jean-Christophe Hugly 
PANTA



Re: [OMPI users] Open MPI and MultiRail InfiniBand

2006-03-13 Thread Jean-Christophe Hugly
On Mon, 2006-03-13 at 10:57 -0700, Galen Shipman wrote:
> >> This was my oversight, I am getting to it now, should have something
> >> in just a bit.
> >>
> >> - Galen
> >
> > I can live with that, certainly.  Fortunately, there's a couple months
> > until I have a real /need/ for this.
> > -- 
> 
> Hi Troy,
> 
> I have added max_btls to the openib component on the trunk, try:
> 
> mpirun --mca btl_openib_max_btls 1 ...etc
> 
> I don't have a dual nic machine handy to test on, if this checks out we 
> can patch the release branch.

Actually, you do... :-)

Please let me know if you ever intend to use that system. I am now
letting someone else use it, but it can be shared.

-- 
Jean-Christophe Hugly 
PANTA