iova: Flush CPU rcache for when a depot fills

John Garry Mon, 18 Jan 2021 04:52:16 -0800

On 15/01/2021 19:21, Robin Murphy wrote:


It would be good to understand why the rcache doesn't stabilize. Could be
a bug, or just need some tuning

In strict mode, if a driver does Alloc-Free-Alloc and the first alloc
misses the rcache, the second allocation hits it. The same sequence in
non-strict mode misses the cache twice, because the IOVA is added to the
flush queue on Free.

So rather than AFAFAF.. we get AAA..FFF.., only once the fq_timer triggers
or the FQ is full.
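The strict vs. non-strict difference described above can be sketched with a toy counter model. This is not kernel code: it assumes one CPU, one IOVA size, an unbounded cache and no fq_timer, and the names prefixed with sketch_ are hypothetical. Only the value 256 for the flush queue size comes from this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_FQ_SIZE 256 /* mirrors IOVA_FQ_SIZE = 256 quoted in the thread */

/*
 * Toy model (hypothetical): run `pairs` alloc-then-free iterations and
 * return how many allocations missed the cache.
 */
unsigned int sketch_count_misses(unsigned int pairs, bool strict)
{
	unsigned int cached = 0; /* IOVAs available for reuse */
	unsigned int queued = 0; /* IOVAs parked in the flush queue */
	unsigned int misses = 0;

	assert(SKETCH_FQ_SIZE > 0);

	for (unsigned int i = 0; i < pairs; i++) {
		/* Alloc: hit if the cache holds anything, otherwise miss. */
		if (cached > 0)
			cached--;
		else
			misses++;

		/* Free: strict mode returns the IOVA to the cache at once. */
		if (strict) {
			cached++;
			continue;
		}

		/* Non-strict mode defers it until the queue fills up. */
		queued++;
		if (queued == SKETCH_FQ_SIZE) {
			cached += queued;
			queued = 0;
		}
	}
	return misses;
}
```

In this model, Alloc-Free-Alloc costs one miss in strict mode and two in non-strict mode, and a long non-strict run keeps missing until the queue has filled once.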


Sounds right

Interestingly the FQ size is 2x IOVA_MAG_SIZE, so we
could allocate 2 magazines worth of fresh IOVAs before alloc starts
hitting the cache. If a job allocates more than that, some magazines are
going to the depot, and with multi-CPU jobs those will get used on other
CPUs during the next alloc bursts, causing the progressive increase in
rcache consumption. I wonder if setting IOVA_MAG_SIZE > IOVA_FQ_SIZE helps
reuse of IOVAs?

Looking back through the lore history, I don't know where the IOVA_FQ_SIZE = 256 came from. I guess its size is 2x IOVA_MAG_SIZE (1x for loaded and 1x for prev) for the reason you mention.

Then again I haven't worked out the details, might be entirely wrong. I'll
have another look next week.


cheers

I did start digging into the data (thanks for that!) before Christmas, but between being generally frazzled and trying to remember how to write Perl to massage the numbers out of the log dump I never got round to responding, sorry.


As you may have seen:
https://raw.githubusercontent.com/hisilicon/kernel-dev/064c4dc8869b3f2ad07edffceafde0b129f276b0/lsi3008_dmesg

I had to change some block configs via sysfs to ever get IOVA locations for size > 0. And even then, I still got none bigger than IOVA_RANGE_CACHE_MAX_SIZE.


Note: For a log like:
[13175.361915] print_iova2 iova_allocs(=5000000 ... too_big=47036

47036 is the number of IOVAs of size > IOVA_RANGE_CACHE_MAX_SIZE, in case it was not clear.

And I never hit the critical point of a depot bin filling, but it may just take even longer.

However, with IOVA size = 0 always occurring, I noticed that the depot size = 0 bin fills up relatively quickly. As such, I am now slightly skeptical of the approach I have taken here, i.e. purge the whole rcache.

The partial thoughts that I can recall right now are firstly that the total numbers of IOVAs are actually pretty meaningless, it really needs to be broken down by size (that's where my Perl-hacking stalled...); secondly that the pattern is far more than just a steady increase - the CPU rcache count looks to be heading asymptotically towards ~65K IOVAs all the time, representing (IIRC) two sizes being in heavy rotation, while the depot is happily ticking along in a steady state as expected, until it suddenly explodes out of nowhere; thirdly, I'd really like to see instrumentation of the flush queues at the same time, since I think they're the real culprit.
My theory so far is that everyone is calling queue_iova() frequently enough to keep the timer at bay and their own queues drained. Then at the ~16H mark, *something* happens that pauses unmaps long enough for the timer to fire, and at that point all hell breaks loose.

So do you think that freeing the IOVA magazines when the depot fills is the cause of this? That was our analysis.

The depot is suddenly flooded with IOVAs of *all* sizes, indicative of all the queues being flushed at once (note that the two most common sizes have been hovering perilously close to "full" the whole time), but then, crucially, *that keeps happening*. My guess is that the load of fq_flush_timeout() slows things down enough that the timer then keeps getting the chance to expire and repeat the situation.


Not sure on that one.

The main conclusion I draw from this is the same one that was my initial gut feeling; that MAX_GLOBAL_MAGS = 32 is utter bollocks.

Yeah, I tend to agree with that. Or, more specifically, how things work today is broken, and MAX_GLOBAL_MAGS = 32 is very much involved with that.

The CPU rcache capacity scales with the number of CPUs; the flush queue capacity scales with the number of CPUs; it is nonsensical that the depot size does not correspondingly scale with the number of CPUs (I note that the testing on the original patchset cites a 16-CPU system, where that depot capacity is conveniently equal to the total rcache capacity).
Now yes, purging the rcaches when the depot is full does indeed help mitigate this scenario - I assume it provides enough of a buffer where the regular free_iova_fast() calls don't hit queue_iova() for a while (and gives fq_ring_free() some reprieve on the CPU handling the timeout), giving enough leeway for the flood to finish before anyone starts hitting queues/locks/etc. and stalling again, and thus break the self-perpetuating timeout cycle. But that's still only a damage limitation exercise! It's planning for failure to just lie down and assume that the depot is going to be full if fq_flush_timeout() ever fires because it's something like an order of magnitude smaller than the flush queue capacity (even for a uniform distribution of IOVA sizes) on super-large systems.
I'm honestly tempted to move my position further towards a hard NAK on this approach, because all the evidence so far points to it being a bodge around a clear and easily-fixed scalability oversight. At the very least I'd now want to hear a reasoned justification for why you want to keep the depot at an arbitrary fixed size while the whole rest of the system scales (I'm assuming that since my previous suggestion to try changes in that area seems to have been ignored).
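The scaling argument above can be put into numbers with a small sketch. The constants are taken from this thread (IOVA_FQ_SIZE = 256, a magazine being half of that, two magazines per CPU for loaded and prev, MAX_GLOBAL_MAGS = 32). The function names are hypothetical, and treating each CPU as owning one flush queue of IOVA_FQ_SIZE entries is an assumption of the sketch.

```c
#include <assert.h>

#define SKETCH_IOVA_FQ_SIZE    256 /* quoted in the thread */
#define SKETCH_IOVA_MAG_SIZE   128 /* FQ size is 2x the magazine size */
#define SKETCH_MAGS_PER_CPU      2 /* loaded + prev */
#define SKETCH_MAX_GLOBAL_MAGS  32 /* fixed depot size per IOVA size bin */

/* IOVAs of one size that all per-CPU rcaches can hold together. */
unsigned long sketch_cpu_rcache_capacity(unsigned int ncpus)
{
	return (unsigned long)ncpus * SKETCH_MAGS_PER_CPU * SKETCH_IOVA_MAG_SIZE;
}

/* IOVAs of one size that the depot can hold; note: no ncpus term. */
unsigned long sketch_depot_capacity(void)
{
	assert(SKETCH_IOVA_FQ_SIZE == 2 * SKETCH_IOVA_MAG_SIZE);
	return (unsigned long)SKETCH_MAX_GLOBAL_MAGS * SKETCH_IOVA_MAG_SIZE;
}

/*
 * Entries across all flush queues (all IOVA sizes mixed), assuming
 * one queue of SKETCH_IOVA_FQ_SIZE entries per CPU.
 */
unsigned long sketch_fq_capacity(unsigned int ncpus)
{
	return (unsigned long)ncpus * SKETCH_IOVA_FQ_SIZE;
}
```

With these numbers the depot and the combined CPU rcaches match at exactly 16 CPUs (4096 IOVAs per size). Beyond that, only the rcache and flush queue figures keep growing.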

So I said that it should fix the problem of the throughput going through the floor at this 16h mark.


But we see 2x tightly coupled problems:

a. leading up to the ~16H critical point, throughput is slowly degrading and becomes quite unstable (not shown in the log). For the LSI3008 card, we don't see that. But then no IOVA sizes > IOVA_RANGE_CACHE_MAX_SIZE occur there.


b. at the critical point, throughput goes through the floor

So b. should be fixed with the suggestion to have unlimited/higher depot max bin size, but I reckon that we would still see a. And I put that down to the fact that we have IOVA sizes > IOVA_RANGE_CACHE_MAX_SIZE at a certain rate always. As the rb tree grows over time, they become slower and slower to alloc+free - that's our theory. Allowing the depot to grow further isn't going to help that.
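For reference, a rough userspace sketch of which IOVA sizes can use the rcache at all. It assumes that the cache bin is chosen by the rounded-up log2 of the size in pages and that IOVA_RANGE_CACHE_MAX_SIZE is 6; neither detail is stated in this thread, and the sketch_ names are hypothetical stand-ins rather than the kernel's own helpers.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_IOVA_RANGE_CACHE_MAX_SIZE 6 /* assumed value, not from the thread */

/* ceil(log2(n)) for n >= 1; a userspace stand-in for order_base_2(). */
unsigned int sketch_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	assert(n >= 1);
	while ((1UL << order) < n)
		order++;
	return order;
}

/*
 * True if an IOVA of `size_pages` pages would land in one of the rcache
 * size bins. Anything larger goes to the rb tree on every alloc and
 * free, which is the "too_big" case counted in the log.
 */
bool sketch_iova_is_cacheable(unsigned long size_pages)
{
	return sketch_order_base_2(size_pages) < SKETCH_IOVA_RANGE_CACHE_MAX_SIZE;
}
```

Under those assumptions, anything up to 32 pages is cacheable and anything from 33 pages up bypasses the rcache, regardless of how large the depot is allowed to grow.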

Maybe Leizhen's idea to trim the rcache periodically is overall better, but I am concerned about the implementation.

If not, then if we allow depot bin size to scale/grow, I would like to see more efficient handling for IOVA size > IOVA_RANGE_CACHE_MAX_SIZE.


Thanks,
John

Re: [PATCH v4 3/3] iommu/iova: Flush CPU rcache for when a depot fills
