Well yeah, in your particular case you're allocating from a heavily
over-contended address space, so much of the time it is genuinely full.
Plus you're primarily churning one or two sizes of IOVA, so there's a
high chance that you will either allocate immediately from the cached
node (after a previous free), or search the whole space and fail. In
case it was missed, searching only some arbitrary subset of the space
before giving up is not a good behaviour for an allocator to have in
general.
So since the retry means that we search through the complete pfn
range most of the time (due to the poor success rate), we should be
able to do a better job of maintaining an accurate max alloc size, by
calculating it from the range search itself rather than relying on
the max-alloc-failed update or resetting it frequently. Hopefully
that would mean that we're smarter about not attempting allocations
which are doomed to fail.
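Roughly along these lines, as a sketch only (the gap bookkeeping is
the illustrative part; max32_alloc_size and the pfn fields follow the
existing iova code, and alignment is ignored for simplicity):

	/*
	 * Illustrative sketch, not the actual change: a failed
	 * allocation has walked every node anyway, so record the
	 * largest free gap seen, instead of resetting the hint on
	 * every free.
	 */
	static void iova_update_max_gap(struct iova_domain *iovad,
					unsigned long limit_pfn)
	{
		unsigned long high_pfn = limit_pfn, max_gap = 0;
		struct rb_node *curr;

		/* skip the anchor node terminating the tree at the top */
		for (curr = rb_prev(&iovad->anchor.node); curr;
		     curr = rb_prev(curr)) {
			struct iova *curr_iova = to_iova(curr);

			/* free gap between this node and the one above */
			if (high_pfn > curr_iova->pfn_hi + 1)
				max_gap = max(max_gap,
					      high_pfn - (curr_iova->pfn_hi + 1));
			high_pfn = min(high_pfn, curr_iova->pfn_lo);
		}
		/* remaining space down to the bottom of the range */
		if (high_pfn > iovad->start_pfn)
			max_gap = max(max_gap, high_pfn - iovad->start_pfn);

		/* matches the existing "size >= max32_alloc_size" bail-out */
		iovad->max32_alloc_size = max_gap + 1;
	}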
So I tried that out, and we seem to be able to claw back an
appreciable amount of performance. Maybe 80% of original, along with
another change, below.
TBH if you really want to make allocation more efficient I think there
are more radical changes that would be worth experimenting with, like
using some form of augmented rbtree to also encode the amount of free
space under each branch, or representing the free space in its own
parallel tree, or whether some other structure entirely might be a
better bet these days.
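For a flavour of the augmented-rbtree idea, something like the
kernel's rbtree_augmented.h machinery could propagate a per-subtree
maximum free gap, letting the search skip whole subtrees that cannot
fit a request. A loose sketch with invented names (gap_to_prev would
need maintaining on insert/delete, which is most of the work):

	#include <linux/rbtree_augmented.h>

	struct iova_gap_node {
		struct rb_node node;
		unsigned long pfn_lo, pfn_hi;
		unsigned long gap_to_prev;	/* free pfns just below pfn_lo */
		unsigned long max_gap;		/* largest gap_to_prev in subtree */
	};

	static unsigned long iova_gap_compute(struct iova_gap_node *n)
	{
		return n->gap_to_prev;
	}

	RB_DECLARE_CALLBACKS_MAX(static, iova_gap_cb, struct iova_gap_node,
				 node, unsigned long, max_gap, iova_gap_compute);

	/* O(log n) search for any gap that can hold @size, instead of a
	 * linear walk over the allocated nodes */
	static struct iova_gap_node *iova_find_gap(struct rb_root *root,
						   unsigned long size)
	{
		struct rb_node *rb = root->rb_node;

		while (rb) {
			struct iova_gap_node *n =
				rb_entry(rb, struct iova_gap_node, node);

			if (n->gap_to_prev >= size)
				return n;
			/* descend towards whichever subtree can still fit */
			if (rb->rb_left &&
			    rb_entry(rb->rb_left, struct iova_gap_node,
				     node)->max_gap >= size)
				rb = rb->rb_left;
			else
				rb = rb->rb_right;
		}
		return NULL;
	}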
And if you just want to make your thing acceptably fast, now I'm going
to say stick a quirk somewhere to force the "forcedac" option on your
platform ;)
Easier said than done :)
But still, I'd like to just be able to cache all IOVA sizes for my
DMA engine, so that we should rarely have to go near the RB tree.
I have put together a series to allow the upper limit of the rcache
range to be increased per domain. Naturally that gives better
performance than we originally had.
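Purely to illustrate the sort of shape such a knob might take, and
not the actual series (all of these names are invented here; today
the ceiling is the compile-time IOVA_RANGE_CACHE_MAX_SIZE):

	/* let a driver raise the rcache size ceiling for its domain so
	 * all of its IOVA sizes stay on the per-CPU cache fast path */
	int iova_domain_set_rcache_max(struct iova_domain *iovad,
				       unsigned int order)
	{
		if (order > IOVA_RANGE_CACHE_HARD_LIMIT)  /* invented cap */
			return -EINVAL;
		/* consulted by the rcache get/put paths */
		iovad->rcache_max_order = order;
		return 0;
	}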
I don't want to prejudice the solution by saying what I think of it now,
so will send it out...
[...]
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
 		high_pfn = limit_pfn;
 		low_pfn = retry_pfn;
-		curr = &iovad->anchor.node;
+		curr = iova_find_limit(iovad, limit_pfn);
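Conceptually the new starting-point lookup is a lower-bound search
on the tree; a simplified sketch of what it achieves (not the merged
iova_find_limit() itself, which has further refinements):

	/* locate the lowest node whose range starts at or above
	 * @limit_pfn, falling back to the anchor, so the retry starts
	 * right at the limit instead of walking in from the top */
	static struct rb_node *iova_find_limit_sketch(struct iova_domain *iovad,
						      unsigned long limit_pfn)
	{
		struct rb_node *node = iovad->rbroot.rb_node;
		struct rb_node *best = &iovad->anchor.node;

		while (node) {
			struct iova *iova = to_iova(node);

			if (iova->pfn_lo >= limit_pfn) {
				/* candidate: remember it, try lower */
				best = node;
				node = node->rb_left;
			} else {
				node = node->rb_right;
			}
		}
		return best;
	}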
I see that it is now applied. However, could we alternatively just
add a zero-length 32-bit boundary marker node at the 32-bit pfn
restart point?
That would need special cases all over the place to prevent the marker
getting merged into reservations or hit by lookups, and at worst break
the ordering of the tree if a legitimate node straddles the boundary. I
did consider having the insert/delete routines keep track of yet another
cached node for whatever's currently the first thing above the 32-bit
boundary, but I was worried that might be a bit too invasive.
Yeah, I did think of that. I don't think that it would have too much
overhead.
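As a rough sketch of that bookkeeping (cached32_start is an invented
field; the invasive part would be wiring these into every insert and
delete path):

	/* on insert: a new entry above the boundary may become the
	 * new first-above-boundary node */
	static void iova_insert_update32(struct iova_domain *iovad,
					 struct iova *new)
	{
		if (new->pfn_lo > iovad->dma_32bit_pfn &&
		    new->pfn_lo < to_iova(iovad->cached32_start)->pfn_lo)
			iovad->cached32_start = &new->node;
	}

	/* on delete (called before rb_erase()): the successor, which is
	 * ultimately the anchor, is by construction the next
	 * first-above-boundary node, so no search is needed */
	static void iova_delete_update32(struct iova_domain *iovad,
					 struct iova *free)
	{
		if (&free->node == iovad->cached32_start)
			iovad->cached32_start = rb_next(&free->node);
	}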
FWIW I'm currently planning to come back to this again when I have a bit
more time, since the optimum thing to do (modulo replacing the entire
algorithm...) is actually to make the second part of the search
*upwards* from the cached node to the limit. Furthermore, to revive my
arch/arm conversion I think we're realistically going to need a
compatibility option for bottom-up allocation to avoid too many nasty
surprises, so I'd like to generalise things to tackle both concerns at
once.
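To illustrate the upwards idea (purely a sketch; alignment handling
is elided and iova_search_up and its exact shape are invented): after
failing below the cached node, walk up looking at the gap above each
node until the limit, which also naturally generalises to bottom-up
allocation from start_pfn.

	/* the anchor node (pfn_lo == ~0UL) always sorts last, so it
	 * terminates the walk and rb_next() below never returns NULL */
	static unsigned long iova_search_up(struct iova_domain *iovad,
					    unsigned long size,
					    unsigned long limit_pfn,
					    struct rb_node *curr)
	{
		while (curr != &iovad->anchor.node) {
			struct iova *below = to_iova(curr);
			struct iova *above = to_iova(rb_next(curr));
			unsigned long lo = below->pfn_hi + 1;
			unsigned long hi = min(above->pfn_lo, limit_pfn + 1);

			if (lo + size <= hi)
				return lo;	/* lowest fit above curr */
			if (above->pfn_lo > limit_pfn)
				break;
			curr = rb_next(curr);
		}
		return 0;	/* no space within the limit */
	}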
Thanks,
John