On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan <ttoukan.li...@gmail.com> wrote:
>
> We consistently see this behavior: the higher the BW, the sharper the
> degradation.
>
> This is because the page-cache is of a fixed-size. Any fixed-size page-cache
> will always meet one of the following:
> 1) Too small to keep the pace when load is high.
> 2) Too big (in terms of memory footprint) when load is low.
>

So, we had the order-0 allocations for years at Google, then made the
horrible mistake of rebasing the mlx4 driver onto the upstream one,
and we had all these issues under load.

I decided to redo the work I did years ago and upstream it.

I have warned Mellanox in the past (for the cx-5 driver) that _any_ high
order allocation strategy was nice in benchmarks, but terrible in the face
of real server workloads.
(And I am not even referring to malicious attacks.)

Think about what happens on real servers:

On the order of 100,000 TCP sockets are open.

Then some incast or outcast problem (Mapreduce jobs are fond of this)
makes thousands of TCP sockets accumulate _millions_ of TCP messages in
their out of order queues per second.

There is no way you can hold millions of pages in the mlx4 driver.
A "dynamic" page pool is going to fail very badly.

Sure, your iperf bench will look great. But who cares?
Do you really have customers dedicating hosts to run 1 iperf full time?

Make sure you run tests with 100,000 TCP sockets, and add small network
flaps, with 5% packet losses.

This is what we really care about here.

I will send the v3 of the patch series. I really hope that it will go
in, because we at Google very much need it ASAP, and I would rather
not have to keep it private in our tree.

Do not focus on your benchmarks, that is marketing only.
Focus on the ability of the servers to _survive_ and continue their work.

You did not answer my questions, by the way.

ethtool -g eth0
ethtool -l eth0

Thanks.