[O-MPI users] does btl_openib work ?
Hi,

I have been trying for the past few days to get an MPI application (the Pallas benchmark) to run with ompi and openib.

My environment:
===
. two quad-cpu hosts with one mlx hca each.
. the hosts are running suse10 (kernel 2.6.13) with the latest (close to it) from openib (rev 4904, specifically).
. opensm runs on a third machine with the same os.
. openmpi is built from openmpi-1.1a1r8727.tar.bz2

Behaviour:
==
. openib seems to behave ok (ipoib works, rdma_bw and rdma_lat work, osm works).
. I can mpirun any non-mpi program like ls, hostname, or ompi_info all right.
. I can mpirun the pallas bm on any single host (the local one or the other).
. I can mpirun the pallas bm on the two nodes, provided that I disable the openib btl.
. If I try to use the openib btl, the bm does not start (at best I get the initial banner, sometimes not). On both hosts, I see that the PMB processes (the correct number for each host) use 99% cpu.

I obtained the exact same behaviour with the following src packages:
openmpi-1.0.1.tar.bz2
openmpi-1.0.2a3r8706.tar.bz2
openmpi-1.1a1r8727.tar.bz2

Earlier on, I also did the same experiment with openmpi-1.0.1 and the stock gen2 of the suse kernel; same thing.

Configuration:
==
For building, I tried the following variants:
./configure --prefix=/opt/ompi --enable-mpi-threads --enable-progress-thread
./configure --prefix=/opt/ompi
./configure --prefix=/opt/ompi --disable-smp-locks

I also tried many variations of mca-params.conf. What I normally use for trying openib is:
rmaps_base_schedule_policy = node
btl = ^tcp
mpi_paffinity_alone = 1

The mpirun cmd I normally use is:
mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 PMB-MPI1

My machine file being:
bench1 slots=4 max-slots=4
bench2 slots=4 max-slots=4

Am I doing something obviously wrong? Thanks for any help!

--
Jean-Christophe Hugly
PANTA
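For reference, the same selection can also be expressed by listing the wanted transports explicitly (including the "self" loopback component) rather than excluding tcp; a minimal sketch, assuming the standard MCA syntax of the Open MPI 1.x series:

  # on the command line
  mpirun --mca btl self,openib -prefix /opt/ompi -wdir `pwd` \
         -machinefile /root/machines -np 2 PMB-MPI1

  # or, equivalently, in mca-params.conf
  btl = self,openib

With "btl = ^tcp", Open MPI is still free to pick the sm btl for on-node pairs; an explicit include-list makes it easier to confirm that openib is really the transport being exercised between the two hosts.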
[O-MPI users] does btl_openib work ?
More info.

Environment of remotely exec'ed progs (by running mpirun ... env):
==
OMPI_MCA_rds_hostfile_path=/root/machines
SHELL=/bin/bash
SSH_CLIENT=10.1.40.61 42076 22
USER=root
LD_LIBRARY_PATH=/opt/ompi/lib:
LS_COLORS=
MAIL=/var/mail/root
PATH=/opt/ompi/bin:/usr/bin:/bin:/usr/sbin:/sbin
PWD=/root
LANG=en_US.UTF-8
SHLVL=1
HOME=/root
LS_OPTIONS=-a -N --color=none -T 0
LOGNAME=root
SSH_CONNECTION=10.1.40.61 42076 10.1.40.63 22
_=/opt/ompi/bin/orted
OMPI_MCA_universe=root@bench1:default-universe-10053
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_ns_replica_uri=0.0.0;tcp://10.1.40.61:33050
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://10.1.40.61:33050
OMPI_MCA_orte_base_nodename=bench2
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0

ulimits (by running mpirun ... sh -c "ulimit -a"):
==
core file size        (blocks, -c)     0
data seg size         (kbytes, -d)     unlimited
file size             (blocks, -f)     unlimited
max locked memory     (kbytes, -l)     1073741824
max memory size       (kbytes, -m)     unlimited
open files                    (-n)     1024
pipe size          (512 bytes, -p)     8
stack size            (kbytes, -s)     8192
cpu time             (seconds, -t)     unlimited
max user processes            (-u)     81920
virtual memory        (kbytes, -v)     unlimited

Did I forget anything?

--
Jean-Christophe Hugly
PANTA
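For completeness: the openib btl registers (locks) memory for its buffers, so a low "max locked memory" limit on either node is a classic cause of openib start-up failures or hangs. The value above looks ample, but if it ever needs raising system-wide, a minimal sketch (assuming PAM applies /etc/security/limits.conf to the ssh sessions under which orted is started):

  # /etc/security/limits.conf
  *    soft    memlock    unlimited
  *    hard    memlock    unlimited

The effective limit on the remote side can then be re-checked through the launcher, as above, with: mpirun ... sh -c "ulimit -l"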
Re: [O-MPI users] does btl_openib work ?
On Fri, 2006-01-20 at 09:37 -0700, Galen M. Shipman wrote:
> Jean,
>
> I am not able to reproduce this problem on a non-threaded build, can
> you try taking a fresh src package and configuring without thread
> support. I am wondering if this is simply a threading issue. I did
> note that you said you configured both with and without threads but
> try the configure on a fresh source, not on one that had previously
> been configured with thread support.

I rebuilt everything from fresh src (took the opportunity to refresh). Same behaviour...

Am I the only one?

--
Jean-Christophe Hugly
PANTA
Re: [O-MPI users] does btl_openib work ?
On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:
> By using slots=4 you are telling Open MPI to put the first 4
> processes on the "bench1" host.
> Open MPI will therefore use shared memory to communicate between the
> processes, not Infiniband.

Well, actually not, unless I'm mistaken about that. In my mca-params.conf I have:

rmaps_base_schedule_policy = node

That should spread processes over nodes, right?

> You might try:
>
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np
> 2 -d xterm -e gdb PMB-MPI1

Thanks for the tip. The last time I tried this it took quite a few attempts before getting it right. As I did not remember the magic trick, I was somewhat reluctant to go in that direction. Since you just handed me the recipe on a silver platter, I'll do it.

J-C
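For reference, the by-node placement can also be requested per run rather than in the params file; a minimal sketch, assuming the option names of the Open MPI 1.x series:

  mpirun --mca rmaps_base_schedule_policy node \
         -machinefile /root/machines -np 2 PMB-MPI1
  # or, equivalently:
  mpirun -bynode -machinefile /root/machines -np 2 PMB-MPI1

With only 2 processes and slots=4 on bench1, by-slot scheduling would indeed place both ranks on bench1 (so only shared memory gets used); by-node scheduling puts one rank on each host, which is what forces traffic onto openib.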
Re: [O-MPI users] does btl_openib work ?
On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:
> Is it possible for you to get a stack trace where this is hanging?
>
> You might try:
>
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np
> 2 -d xterm -e gdb PMB-MPI1

I did that, and when it was hanging I control-C'd in each gdb and asked for a bt.

Here's the debug output from the mpirun command:
==
mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 -d xterm -e gdb PMB-MPI1
[bench1:16017] procdir: (null)
[bench1:16017] jobdir: (null)
[bench1:16017] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] connect_uni: contact info read
[bench1:16017] connect_uni: connection not allowed
[bench1:16017] [0,0,0] setting up session dir with
[bench1:16017] tmpdir /tmp
[bench1:16017] universe default-universe-16017
[bench1:16017] user root
[bench1:16017] host bench1
[bench1:16017] jobid 0
[bench1:16017] procid 0
[bench1:16017] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/0
[bench1:16017] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16017] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] [0,0,0] contact_file /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/universe-setup.txt
[bench1:16017] [0,0,0] wrote setup file
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x1)
[bench1:16017] pls:rsh: local csh: 0, local bash: 1
[bench1:16017] pls:rsh: assuming same remote shell as local shell
[bench1:16017] pls:rsh: remote csh: 0, remote bash: 1
[bench1:16017] pls:rsh: final template argv:
[bench1:16017] pls:rsh:     /usr/bin/ssh -X orted --debug --bootproxy 1 --name --num_procs 3 --vpid_start 0 --nodename --universe root@bench1:default-universe-16017 --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16017] pls:rsh: launching on node bench2
[bench1:16017] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench2 is a REMOTE node
[bench1:16017] pls:rsh: executing: /usr/bin/ssh -X bench2 PATH=/opt/ompi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ompi/lib: $LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/ompi/bin/orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename bench2 --universe root@bench1:default-universe-16017 --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench2:16980] [0,0,1] setting up session dir with
[bench2:16980] universe default-universe-16017
[bench2:16980] user root
[bench2:16980] host bench2
[bench2:16980] jobid 0
[bench2:16980] procid 1
[bench2:16980] procdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0/1
[bench2:16980] jobdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0
[bench2:16980] unidir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017
[bench2:16980] top: openmpi-sessions-root@bench2_0
[bench2:16980] tmp: /tmp
[bench1:16017] pls:rsh: launching on node bench1
[bench1:16017] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench1 is a LOCAL node
[bench1:16017] pls:rsh: reset PATH: /opt/ompi/bin:/sbin:/usr/sbin:/usr/local/sbin:/opt/gnome/sbin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/ompi/bin
[bench1:16017] pls:rsh: reset LD_LIBRARY_PATH: /opt/ompi/lib
[bench1:16017] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename bench1 --universe root@bench1:default-universe-16017 --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16021] [0,0,2] setting up session dir with
[bench1:16021] universe default-universe-16017
[bench1:16021] user root
[bench1:16021] host bench1
[bench1:16021] jobid 0
[bench1:16021] procid 2
[bench1:16021] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/2
[bench1:16021] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16021] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16021] top: openmpi-sessions-root@bench1_0
[bench1:16021] tmp: /tmp
Warning: translation table syntax error: Unknown keysym name: DRemove
Warning: ... found while parsing 'DRemove: ignore()'
Warning: String to TranslationTable conversion encountered errors
Warning: translation table syntax error: Unknown keysym name: DRemove
Warning: ... found while parsi
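For reference, when the ranks are spinning at 99% cpu, a stack trace can also be obtained by attaching to the running processes directly, instead of launching them under xterm/gdb; a minimal sketch (the process name is from the thread above, the pid is illustrative):

  ssh bench2
  ps -ef | grep PMB-MPI1        # find the pid of the spinning rank
  gdb -p <pid>
  (gdb) thread apply all bt     # backtrace of every thread
  (gdb) detach
  (gdb) quit

This avoids the X forwarding and keysym issues seen in the warnings above.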
Re: [O-MPI users] does btl_openib work with multiple ports ?
Is there anything special I need to do when configuring ompi for it to work properly with multiple ports?

--
Jean-Christophe Hugly
PANTA
[O-MPI users] direct openib btl and latency
Hi guys,

Does someone know what the framework costs in terms of latency? Right now the latency I get with the openib btl is not great: 5.35 us. I was looking at what I could do to get it down.

I tried to get openib to be the only btl, but the build process refused. On the other hand, I am not sure it could even work at all, as whenever I tried at run-time to limit the list to just one transport (be it tcp or openib, btw), mpi apps would not start.

Either way, I'm curious whether it's even worth trying, and whether there are other cuts that can be made to shave off one us or two (ok, I'll settle for 1.5 :-) ).

Any advice?

--
Jean-Christophe Hugly
PANTA
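For reference, the set of BTL components actually built into an installation, and the tunables each one exposes, can be inspected without launching a job; a minimal sketch, assuming the ompi_info options of the 1.x series:

  ompi_info | grep btl            # list the btl components that were built
  ompi_info --param btl openib    # show the openib btl parameters and their defaults
  ompi_info --param btl all       # same, for every btl component

This is a quick way to confirm that openib was compiled in at all, and to see the default eager/rdma thresholds before trying to tune them.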
Re: [O-MPI users] direct openib btl and latency
> you need to specify both the transport and self, such as:
> mpirun -mca btl self,tcp

I found that the reason why I was no longer able to run without openib was that I had some openib-specific tunables on the command line. I expected the params would get ignored, but instead it just sat there.

The other funny thing I found was that on the cmd line I do not need to specify self; just openib, or tcp, will do. But in the param file I must specify tcp,openib or self,openib; self,tcp would not work. Oh well, I do not care for tcp anyway :-).

But should I understand from all this that the "direct" mode will never actually work? It seems that if you need at least two transports, then none of them can be the hardwired unique one, right? Unless there's a built-in switch between a built-in self and the built-in other transport.

> For Heroic latencies on IB we would need to use small message RDMA and
> poll each peers dedicated memory region for completion.

Well, I tried to play around with the eager_limit, min_rdma, etc. I did not see the latency of messages of a given size be lowered by changing the thresholds to make them rdma'd. Rather the opposite (which was my initial expectation, actually). Maybe I just misunderstood the whole set of tunables. My understanding was that messages under the eager limit would never be rdma'd by definition, and that the others would or would not be, depending on the min_rdma_size.

--
Jean-Christophe Hugly
PANTA
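For reference, the thresholds being discussed can be overridden per run; a minimal sketch, assuming the openib btl parameter names of the 1.x series (the exact names, defaults, and units for a given build are reported by "ompi_info --param btl openib"; the values below are purely illustrative):

  # messages at or below the eager limit are sent eagerly (send/recv with a copy);
  # larger messages go through the rendezvous/RDMA path, governed by min_rdma_size
  mpirun --mca btl self,openib \
         --mca btl_openib_eager_limit 12288 \
         --mca btl_openib_min_rdma_size 1048576 \
         -machinefile /root/machines -np 2 PMB-MPI1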
Re: [O-MPI users] direct openib btl and latency
> > As we plan to drop all support for the old
> > generation of PML/PTL, I don't think is a wise idea to spend time on
> > the openib PTL to make it working with uniq ...
> >
> > Thanks,
> >   george.
>
> With the change to ob1/BTLs, there was also a refactoring of data
> structures that reduced the overall latency through the stack. As
> Galen indicated, if you do a direct comparison w/ send/recv semantics,
> I think you will find the overall latency through the stack is lower
> than other implementations (on the order of 0.5 us).

Thanks, guys. I'll stop worrying about that then!

--
Jean-Christophe Hugly
PANTA
Re: [O-MPI users] direct openib btl and latency
> > So far, the best latency I got from ompi is 5.24 us, and the best I
> > got from mvapich is 3.15.
> > I am perfectly ready to accept that ompi scales better and that may be
> > more important (except to the marketing dept :-) ), but I do not
> > understand your explanation based on small-message RDMA. Either I
> > misunderstood something badly (my best guess), or the 2 us are lost to
> > something else than an RDMA-size tradeoff.
>
> Again this is small message RDMA with polling versus send/receive
> semantics, we will be adding small message RDMA and should have
> performance equal to that of mvapich for small messages, but it is only
> relevant for a small working set of peers / micro benchmarks.

Thanks a lot. I was being fooled by various levels of size thresholds in the mvapich code. It was indeed doing rdma for small messages. After turning that off, I get numbers comparable to yours. Well, mvapich still beats ompi by a hair on my configuration, 5.11 vs. 5.25, but that's in the near-irrelevant range compared to other benefits.

From an adoption perspective, though, the ability to shine in micro-benchmarks is important, even if it means using ad-hoc tuning. There is some justification for it after all. There are small clusters out there (many more than big ones, in fact), so taking maximum advantage of a small scale is relevant.

When do you plan on having the small-msg rdma option available?

J-C

--
Jean-Christophe Hugly
PANTA
Re: [O-MPI users] direct openib btl and latency
On Thu, 2006-02-09 at 14:05 -0700, Ron Brightwell wrote:
> > [...]
> >
> > From an adoption perspective, though, the ability to shine in
> > micro-benchmarks is important, even if it means using an ad-hoc tuning.
> > There is some justification for it after all. There are small clusters
> > out there (many more than big ones, in fact) so taking maximum advantage
> > of a small scale is relevant.
>
> I'm obliged to point out that you jumped to a conclusion -- possibly true
> in some cases, but not always.
>
> You assumed that a performance increase for a two-node micro-benchmark
> would result in an application performance increase for a small cluster.
> Using RDMA for short messages is the default on small clusters *because*
> of the two-node micro-benchmark, not because the cluster is small.

No, I assumed it based on comparisons between doing and not doing small-msg rdma at various scales, from a paper Galen pointed out to me:

http://www.cs.unm.edu/~treport/tr/05-10/Infiniband.pdf

Benchmarks are what they are. In the above paper, the tests place the cross-over at around 64 nodes, and that confirms a number of anecdotal reports I got. It may well be that in some situations small-msg rdma is better only for 2 nodes, but that's not such a likely scenario; reality is sometimes linear (at least at our scale :-) ) after all.

The scale threshold could be tunable, couldn't it?

--
Jean-Christophe Hugly
PANTA
Re: [O-MPI users] direct openib btl and latency
On Thu, 2006-02-09 at 16:37 -0700, Brightwell, Ronald wrote:
> I apologize if it seems like I'm picking on you.

No offense taken.

> I'm hypersensitive to
> people trying to make judgements based on micro-benchmark performance.
> I've been trying to make an argument that two-node ping-pong latency
> comparisons really only have meaning in the context of a whole system.

It's very clear to me that micro-benchmarks do not tell you very much about real application behaviour; that's not the question. They are nevertheless relevant to me because, right or wrong, people who buy stuff look at them. And I work for a commercial outfit. I may sound silly saying that, but they might be right to look at it; they just need to look at the rest too.

A micro-benchmark tells you how much you have of a given currency, which you can trade for another. It tells you something about the implementation: how efficient the code is, how well the hardware is utilized, etc. Not in every respect, but some. It also tells you how far you can emphasize a given feature at the expense of all others, if it happens that at some point in time it is what you most need. By making the argument that a particular characteristic is irrelevant, you are essentially making a hard-coded tradeoff, rather than letting the users do it.

Back to the specific issue of latency vs. scale. Okay, for CG and FT the cross-over may be <32, but that is not the case everywhere, and the difference visible at 32 is pretty small. So it is application dependent, no question about it, but small-msg rdma is beneficial below a given (application-dependent) cluster size.

--
Jean-Christophe Hugly
PANTA
Re: [OMPI users] Open MPI and MultiRail InfiniBand
On Mon, 2006-03-13 at 10:57 -0700, Galen Shipman wrote:
> >> This was my oversight, I am getting to it now, should have something
> >> in just a bit.
> >>
> >> - Galen
> >
> > I can live with that, certainly. Fortunately, there's a couple months
> > until I have a real /need/ for this.
>
> Hi Troy,
>
> I have added max_btls to the openib component on the trunk, try:
>
> mpirun --mca btl_openib_max_btls 1 ...etc
>
> I don't have a dual nic machine handy to test on, if this checks out we
> can patch the release branch.

Actually you do... :-) Please let me know if you ever intend to use that system. I am now letting someone else use it, but it can be shared.

--
Jean-Christophe Hugly
PANTA