Hello,

On 28/09/12 10:00 AM, Jeff Squyres wrote:
> On Sep 28, 2012, at 9:50 AM, Sébastien Boisvert wrote:
>
>> I did not know about shared queues.
>>
>> It does not run out of memory. ;-)
>
> It runs out of *registered* memory, which could be far less than your actual
> RAM. Check this FAQ item in particular:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
>
I see.

$ cat /sys/module/mlx4_core/parameters/log_num_mtt
0
$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
$ getconf PAGE_SIZE
4096

With the formula:

  max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
              = (2^0) * (2^0) * 4096
              = 1 * 1 * 4096
              = 4096 bytes

Whoa! One page. That should help.

There are 32 GiB of memory, so I will ask someone to set log_num_mtt=23
and log_mtts_per_seg=1:

  => (2^23) * (2^1) * 4096 = 68719476736 bytes, i.e., 64 GiB

>> But the latency is not very good.
>>
>> ** Test 1
>>
>> --mca btl_openib_max_send_size 4096 \
>> --mca btl_openib_eager_limit 4096 \
>> --mca btl_openib_rndv_eager_limit 4096 \
>> --mca btl_openib_receive_queues S,4096,2048,1024,32 \
>>
>> I get 1.5 milliseconds.
>>
>> => https://gist.github.com/3799889
>>
>> ** Test 2
>>
>> --mca btl_openib_receive_queues S,65536,256,128,32 \
>>
>> I get around 1.5 milliseconds too.
>>
>> => https://gist.github.com/3799940
>
> Are you saying 1.5us is bad?

1.5 us would be very good. But what I get with shared queues is 1.5 ms
(see above). (There is a minimal sketch of what I mean by round-trip
latency at the end of this mail.)

> That's actually not bad at all. On the most modern hardware with a bunch of
> software tuning, you can probably get closer to 1us.
>
>> With my virtual router I am sure I can get something around 270 microseconds.
>
> OTOH, that's pretty bad. :-)

I know. All my Ray processes are doing busy waiting; if MPI were
event-driven, I would call my software sleepy Ray when the latency is high.

> I'm not sure why it would be so bad -- are you hammering the virtual router
> with small incoming messages?

There are 24 AMD Opteron(tm) Processor 6172 cores for one Mellanox
Technologies MT26428 HCA on each node. That may be part of the cause too.

> You might need to do a little profiling to see where the bottlenecks are.

Well, given the very valuable information you provided about log_num_mtt
and log_mtts_per_seg for the Linux kernel module mlx4_core, I think that
may be the root of our problem. We get 20-30 us with 4096 processes on a
Cray XE6, so it is unlikely that the bottleneck is in our software.

>> Just out of curiosity, does Open-MPI utilize heavily negative values
>> internally for user-provided MPI tags ?
>
> I know offhand we use them for collectives. Something is tickling my brain
> that we use them for other things, too (CID allocation, perhaps?), but I
> don't remember offhand.

The only collective I use is MPI_Barrier, and only a few of those.

> I'm just saying: YMMV. Buyer be warned. And all that. :-)

Yes, I agree: non-portable code is, well, not portable, and comes with
unexpected behaviors.

>> If the negative tags are internal to Open-MPI, my code will not touch
>> these private variables, right ?
>
> It's not a variable that's the issue. If you do a receive for tag -3 and
> OMPI sends an internal control message with tag -3, you might receive it
> instead of OMPI's core. And that would be Bad.

Ah, I see. By removing the checks in my silly patch, I can now dictate
things to OMPI. Hehe. (A small illustration of what those checks normally
catch is below.)
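
For the record, here is a minimal, hypothetical illustration of what those
parameter checks normally catch -- it is not my actual patch or my actual
test code. With MPI_ERRORS_RETURN, I would expect a stock build to refuse
the negative tag with an error instead of posting the receive, while a
build with the check removed would happily post it:

/* negative_tag.c -- hypothetical illustration, not my actual patch or test */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int buffer = 0;
    int return_code;
    MPI_Request request;

    MPI_Init(&argc, &argv);

    /* Ask for error codes instead of aborting on the first error. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Valid user tags are 0 .. MPI_TAG_UB; -3 is outside that range. */
    return_code = MPI_Irecv(&buffer, 1, MPI_INT, MPI_ANY_SOURCE, -3,
                            MPI_COMM_WORLD, &request);

    if (return_code != MPI_SUCCESS) {
        printf("negative tag rejected (error code %d)\n", return_code);
    } else {
        /* With the checks removed, the receive is posted and could match
           internal traffic on that tag, as you described. */
        printf("negative tag accepted\n");
        MPI_Cancel(&request);
        MPI_Request_free(&request);
    }

    MPI_Finalize();
    return 0;
}

Compiling that with mpicc and running it on a single process should be
enough to see which behavior a given build has.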
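
Also, to be clear about what I mean by latency in the numbers above: I am
talking about small-message round trips, along the lines of the minimal
ping-pong sketch below. It is not the actual Ray code and not necessarily
how the gists above were produced; it is just the kind of measurement I
have in mind:

/* pingpong.c -- minimal round-trip latency sketch, run with 2 processes */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iterations = 10000;
    const int tag = 42;
    char byte = 0;
    int rank;
    int i;
    double start;
    double elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();

    for (i = 0; i < iterations; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
        }
    }

    elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        /* One full round trip per iteration. */
        printf("average round-trip latency: %f microseconds\n",
               1e6 * elapsed / iterations);
    }

    MPI_Finalize();
    return 0;
}

Running it with and without the receive queue settings from Test 1, for
example

  mpiexec -n 2 --mca btl_openib_receive_queues S,4096,2048,1024,32 ./pingpong

is how I would compare the shared-queue configuration against the default.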