Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

Tariq Toukan Sun, 12 Feb 2017 09:25:28 -0800


On 12/02/2017 5:32 PM, Eric Dumazet wrote:

On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan <[email protected]> wrote:

We consistently see this behavior: the higher the BW, the sharper the
degradation.

This is because the page-cache is of a fixed size. Any fixed-size page-cache
will always hit one of the following:
1) Too small to keep pace when load is high.
2) Too big (in terms of memory footprint) when load is low.

So, we had the order-0 allocations for years at Google, then made the
horrible mistake to rebase mlx4 driver from the upstream one,
and we had all these issues under load.

I decided to redo the work I did years ago and upstream it.

Thanks for that. I really appreciate and like your refactoring.


I have warned Mellanox in the past (for cx-5 driver) that _any_ high
order allocation strategy was nice in benchmarks, but terrible in face
of real server workloads.
( And I am not even referring to malicious attacks )

In mlx5, we fully completed the transition to order-0 allocations in Striding RQ.

Think about what happens on real servers : In the order of 100,000 TCP
sockets opened.

Then some incast or outcast problem (Mapreduce jobs are fond of this)
makes thousands of TCP sockets accumulate _millions_ of TCP messages in
their out of order queues per second.

There is no way you can hold millions of pages in mlx4 driver.
A "dynamic" page pool is going to fail very badly.

I understand your point. Today I am fully aware of the advantages of
using order-0 pages. I am just trying to have the bread buttered on both
sides by reducing the allocation overhead.
Even though the iperf benchmarks are less realistic than the ones you
described, I think it would still be nice if we could find solutions for
the page allocator in order to keep the high rates we had before.
As a common bottleneck, we will always gain by improving the page
allocator, no matter what the page order is.


Just two points regarding the dynamic page-cache I implemented:

1) We define an upper limit for the size of the dynamic page-cache, so
the metadata does not grow too much.
2) When load is high, our dynamic page-cache _does not exclusively hold
too many pages_; it just keeps track of pages that are anyway being
processed in the stack. In memory footprint accounting, I would not count
such a page toward the "driver's footprint", as it is being used by the
stack.
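
To illustrate point 1 only, here is a minimal userspace sketch in C of a
cache that grows on demand but never past a fixed cap. It is not the code
from the patch series: the names (page_cache, cache_put, cache_get) and the
doubling growth policy are assumptions made for illustration, and it does
not model point 2 (tracking pages still in use by the stack).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bounded, growable cache of page pointers (LIFO). */
struct page_cache {
	void  **slots;     /* array of cached page pointers */
	size_t  count;     /* entries currently held */
	size_t  capacity;  /* current size of slots[] */
	size_t  max;       /* hard upper limit on capacity */
};

static int cache_init(struct page_cache *c, size_t initial, size_t max)
{
	assert(initial > 0 && initial <= max);
	c->slots = malloc(initial * sizeof(*c->slots));
	if (!c->slots)
		return -1;
	c->count = 0;
	c->capacity = initial;
	c->max = max;
	return 0;
}

/* Returns false when the cap is reached; the caller then frees the page. */
static bool cache_put(struct page_cache *c, void *page)
{
	if (c->count == c->capacity) {
		size_t new_cap = c->capacity * 2;
		void **grown;

		if (c->capacity == c->max)
			return false;           /* metadata may not grow further */
		if (new_cap > c->max)
			new_cap = c->max;
		grown = realloc(c->slots, new_cap * sizeof(*grown));
		if (!grown)
			return false;
		c->slots = grown;
		c->capacity = new_cap;
	}
	c->slots[c->count++] = page;
	return true;
}

/* Returns NULL when empty; the caller then falls back to the allocator. */
static void *cache_get(struct page_cache *c)
{
	return c->count ? c->slots[--c->count] : NULL;
}

static void cache_destroy(struct page_cache *c)
{
	free(c->slots);
	c->slots = NULL;
	c->count = c->capacity = c->max = 0;
}
```

With a cap in place, a failed cache_put() simply means the page goes back
to the page allocator, so the cost of the metadata stays bounded regardless
of load.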


Sure, your iperf bench will look great. But who cares ? Do you really
have customers dedicating hosts to run 1 iperf full time ?

Make sure you run tests with 100,000 TCP sockets, and add networking
small flaps, with 5% packet losses.
This is what we really care about here.

I definitely agree that benchmarks should improve to reflect more realistic use cases.


I will send the v3 of the patch series, I really hope that it will go
in, because we at Google very much need it ASAP, and I would rather
not have to keep it private in our tree.

Do not focus on your benchmarks, that is marketing only
Focus on ability of the servers to _survive_ and continue their work.

You did not answer to my questions by the way.

ethtool -g eth0
ethtool -l eth0

Yes, sorry for the delayed reply, it was sent separately.


Thanks.
