Hello,

On 28/09/12 10:00 AM, Jeff Squyres wrote:
> On Sep 28, 2012, at 9:50 AM, Sébastien Boisvert wrote:
> 
>> I did not know about shared queues.
>>
>> It does not run out of memory. ;-)
> 
> It runs out of *registered* memory, which could be far less than your actual 
> RAM.  Check this FAQ item in particular:
> 
>     http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
> 

I see.

$ cat /sys/module/mlx4_core/parameters/log_num_mtt
0

$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0

$ getconf PAGE_SIZE
4096

With the formula

max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE

            = (2^0) * (2^0) * 4096

            = 1     * 1     * 4096

            = 4096 bytes

Whoa! One page.

That should help.

There are 32 GiB of memory.

So I will ask someone to set log_num_mtt=23 and log_mtts_per_seg=1.

  => max_reg_mem = (2^23) * (2^1) * 4096 = 68719476736 bytes = 64 GiB, i.e. twice the physical RAM
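
If I read the FAQ right, the usual way to make this persistent is an options line for the mlx4_core module, something like the following (the file name is only an example; where it goes is the administrator's call):

  # example: /etc/modprobe.d/mlx4_core.conf
  options mlx4_core log_num_mtt=23 log_mtts_per_seg=1

followed by reloading mlx4_core (or rebooting) and re-checking the values under /sys/module/mlx4_core/parameters/.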

>> But the latency is not very good.
>>
>> ** Test 1
>>
>> --mca btl_openib_max_send_size 4096 \
>> --mca btl_openib_eager_limit 4096 \
>> --mca btl_openib_rndv_eager_limit 4096 \
>> --mca btl_openib_receive_queues S,4096,2048,1024,32 \
>>
>> I get 1.5 milliseconds.
>>
>>  => https://gist.github.com/3799889
>>
>> ** Test 2
>>
>> --mca btl_openib_receive_queues S,65536,256,128,32 \
>>
>> I get around 1.5 milliseconds too.
>>
>>  => https://gist.github.com/3799940
> 
> Are you saying 1.5us is bad?  

1.5 us would be very good, but what I am getting with shared receive queues is 1.5 ms, not 1.5 us (see the tests above).
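
For comparison, one thing I could try is the same benchmark with per-peer receive queues instead of shared ones, e.g. (assuming the P specifier takes size, buffer count, low watermark and credit window fields, as in the FAQ examples; I have not run this yet):

--mca btl_openib_receive_queues P,65536,256,192,128 \

If that brings the latency back down to the microsecond range, the shared receive queue configuration is the likely culprit.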

> That's actually not bad at all.  On the most modern hardware with a bunch of 
> software tuning, you can probably get closer to 1us.
> 
>> With my virtual router I am sure I can get something around 270 microseconds.
> 
> OTOH, that's pretty bad.  :-)

I know. All my Ray processes are busy waiting; if MPI were event-driven,
I would call my software sleepy Ray when latency is high.
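
(For context, the busy waiting is just a polling loop over MPI_Iprobe, roughly like the sketch below. This is only an illustration of the pattern, not the actual Ray code.)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int done = 0;
    while (!done) {
        int flag = 0;
        MPI_Status status;

        /* Poll for an incoming message; this spins the CPU while waiting,
         * so every microsecond of message latency is burned in this loop. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);

        if (flag) {
            char buffer[4096];
            MPI_Recv(buffer, sizeof(buffer), MPI_BYTE, status.MPI_SOURCE,
                     status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... handle the message; set done when the work is finished ... */
            done = 1; /* placeholder so the sketch terminates */
        }

        /* ... do some local work between polls ... */
    }

    MPI_Finalize();
    return 0;
}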

> 
> I'm not sure why it would be so bad -- are you hammering the virtual router 
> with small incoming messages?

Each node has 24 AMD Opteron(tm) Processor 6172 cores sharing a single
Mellanox Technologies MT26428 HCA. That may be part of the cause too.

>  You might need to do a little profiling to see where the bottlenecks are.
> 

Well, given the very valuable information you provided about log_num_mtt and
log_mtts_per_seg in the mlx4_core kernel module, I think the registered-memory
limit may be the root of our problem.

We get 20-30 us on 4096 processes on Cray XE6, so it is unlikely that the 
bottleneck is in 
our software.

>> Just out of curiosity, does Open-MPI utilize heavily negative values
>> internally for user-provided MPI tags ?
> 
> I know offhand we use them for collectives.  Something is tickling my brain 
> that we use them for other things, too (CID allocation, perhaps?), but I 
> don't remember offhand.
> 

The only collective I use is MPI_Barrier, and only a few calls to it.

> I'm just saying: YMMV.  Buyer be warned.  And all that. :-)
> 

Yes, I agree on this: non-portable code is not portable, unexpected behaviors
and all.

>> If the negative tags are internal to Open-MPI, my code will not touch
>> these private variables, right ?
> 
> It's not a variable that's the issue.  If you do a receive for tag -3 and 
> OMPI sends an internal control message with tag -3, you might receive it 
> instead of OMPI's core.  And that would be Bad.
> 

Ah, I see. By removing those checks in my silly patch, I can now dictate
things to OMPI's internals. Hehe.
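
To make the hazard concrete, here is a minimal sketch of the kind of receive my patch now lets through (illustrative only: the MPI standard only allows user tags in [0, MPI_TAG_UB], plus MPI_ANY_TAG on receives, so this is exactly the non-portable territory you are warning about):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char buffer[4096];
    MPI_Status status;

    /* With stock Open MPI (parameter checking on), a negative tag like this
     * is rejected as invalid.  With the tag checks removed, this receive
     * could match an internal control message that happens to use tag -3,
     * stealing it from Open MPI itself. */
    MPI_Recv(buffer, sizeof(buffer), MPI_BYTE, MPI_ANY_SOURCE,
             -3, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}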
