Hi Pieter,
On 1/11/2019 12:41 AM, Pieter Noordhuis wrote: > I'm looking into an issue with mlx5 on 4.11.3. It is triggered by high memory > pressure but continues for long after the memory pressure is gone. It starts > to continuously use pfmemalloc pages, some of which appear to be coming from > an RX queue's page cache. > > Attached is a log file showing a second-by-second diff of ethtool counters > for a single RX queue that was showing this behavior. This log doesn't > capture the start of these drops, because the ethtool monitoring is only > started until after the first drops are detected. Every increase of the > “cache_waive” counter means mlx5 refused to add a page to its page cache > because it was a pfmemalloc page. It also means the corresponding packet gets > dropped in sk_filter_trim_cap. > > Initially, the log shows the “cache_busy” counter increasing, meaning that > the first page in the page cache has >1 references, so can't be used. Right, it's a head-of-the-queue blocking. So pages are allocated instead. > Then after roughly a minute, it switches to increasing the “cache_reuse” and > “cache_waive” counters. This means that the pages are coming from the RX > queue's page cache *and** *are not put back because they are pfmemalloc pages. This means the head-of-queue is released, pages are popped from queue but fails to get re-pushed, due to the mlx5e_page_is_reserved() check. So the cache eventually gets empty. >This is highly suspicious, as they shouldn't end up in the page cache in the >first place. Then, after reusing 255 pages from the page cache, the >“cache_empty” counter starts to increase, in lock step with the “cache_waive” >counter. This means that the pages are allocated with dev_alloc_pages and not >placed in the page cache, because they are pfmemalloc pages. This is also >suspicious, because with the memory pressure gone, dev_alloc_pages shouldn't >be returning pfmemalloc pages. Notice that the mlx5e_page_is_reserved() combines two conditions: [1] page_is_pfmemalloc(page) [2] page_to_nid(page) != numa_mem_id(); In your case [2] could hold. Can you repro and check that? > By the time it stops incrementing “cache_waive”, a total of 3804 pages were > waived (and packets were dropped), over a duration of 1895 seconds. > > What I would expect to happen is the “cache_reuse” and “cache_waive” to never > be incremented in lock step, as pfmemalloc pages must never be added to the > RX queue page cache to begin with. Similarly, I would expect “cache_empty” > and “cache_waive” to never be incremented in lock step if there is no memory > pressure. > > Static analysis of mlx5 on 4.11.3 has so far not lead to any insights as to > why this is happening. Any help in this investigation is much appreciated. If > there is any additional information I can provide please me know. Please try to identify the specific reason of mlx5e_page_is_reserved(), you might need to hook/modify the driver. If we see that [2] holds, then it would explain the behavior. > > Pieter > Regards, Tariq