On Sat, 2007-04-28 at 01:55 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 10:32:56 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
>
> > On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> > > On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <[EMAIL PROTECTED]>
> > > wrote:
> > > >
> > > > > The other thing is that we can batch up pagecache page insertions
> > > > > for bulk writes as well (that is, write(2) with buffer size > page
> > > > > size). I should have a patch somewhere for that as well if anyone
> > > > > is interested.
> > > >
> > > > Together with the optimistic locking from my concurrent pagecache
> > > > that should bring most of the gains:
> > > >
> > > > sequential insert of 8388608 items:
> > > >
> > > > CONFIG_RADIX_TREE_CONCURRENT=n
> > > >
> > > > [ffff81007d7f60c0] insert 0 done in 15286 ms
> > > >
> > > > CONFIG_RADIX_TREE_OPTIMISTIC=y
> > > >
> > > > [ffff81006b36e040] insert 0 done in 3443 ms
> > > >
> > > > only 4.4 times faster, and more scalable, since we don't bounce the
> > > > upper level locks around.
> > >
> > > I'm not sure what we're looking at here. radix-tree changes? Locking
> > > changes? Both?
> >
> > Both. The radix tree is basically node locked, and all modifying
> > operations are taught to node-lock their way around the radix tree.
> > This, as Christoph pointed out, will suck on SMP because all the node
> > locks will bounce around like mad.
> >
> > So what I did was couple that with an 'optimistic' RCU lookup of the
> > highest node that _needs_ to be locked for the operation to succeed,
> > lock that node, verify it is indeed still the node we need, and proceed
> > as normal (node locked). This avoids taking most/all of the upper level
> > node locks; esp. for insert, which typically only needs one lock.
>
> hm. Maybe if we were taking the lock once per N pages rather than once per
> page, we could avoid such trickiness.

For insertion, maybe; but not all insertion is coupled with strong
readahead. It also works for the other modifying operations, like
radix_tree_tag_set(), which are typically called on single pages. The
whole idea is to entirely remove mapping->tree_lock and serialize the
pagecache at the page level.
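To make the optimistic part a bit more concrete, the locking pattern is
roughly the below. This is a much simplified sketch, not the actual
patch: radix_find_lock_target(), radix_node_still_valid() and
__radix_tree_insert_locked() are made-up helper names, and the per-node
spinlock only exists with the CONFIG_RADIX_TREE_CONCURRENT patches
applied (struct radix_tree_node is private to lib/radix-tree.c in
mainline, too).

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

int radix_tree_insert_optimistic(struct radix_tree_root *root,
				 unsigned long index, void *item)
{
	struct radix_tree_node *node;

	for (;;) {
		rcu_read_lock();
		/*
		 * Lockless RCU walk: find the highest node that _needs_
		 * to be locked for this insert to succeed; for a plain
		 * insert that is typically a single node.
		 */
		node = radix_find_lock_target(root, index);
		spin_lock(&node->lock);	/* per-node lock from the patch */
		rcu_read_unlock();

		/*
		 * A racer may have grown the tree or replaced this node
		 * while we walked without locks; verify it is still the
		 * node we need before modifying anything.
		 */
		if (radix_node_still_valid(root, node, index))
			break;

		spin_unlock(&node->lock);
		/* lost the race, redo the optimistic lookup */
	}

	/* from here on, proceed as the normal node-locked path would */
	return __radix_tree_insert_locked(root, node, index, item);
}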
> > > If we have a whole pile of pages to insert then there are obvious gains
> > > from not taking the lock once per page (gang insert). But I expect there
> > > will also be gains from not walking down the radix tree once per page
> > > too: walk all the way down and populate all the way to the end of the
> > > node.
> >
> > Yes, it will be even faster if we could batch insert a whole leaf node.
> > That would save 2^6 - 1 tree traversals and lock/unlock cycles.
> > Certainly not unwanted.
> >
> > > The implementation could get a bit tricky, handling pages which a racer
> > > instantiated when we dropped the lock, and suitably adjusting ->index.
> > > Not rocket science though.
> >
> > *nod*
> >
> > > The depth of the radix tree matters (i.e., the file size). 'twould be
> > > useful to always describe the tree's size when publishing microbenchmark
> > > results like this.
> >
> > Right, this was with two such sequences of 2^23, so 2^24 elements in
> > total, at offset 0, which with RADIX_TREE_MAP_SHIFT = 6 gives 24/6 = 4
> > levels.
>
> Where an element would be a page, so we're talking about a 64GB file with
> 4k pagesize?
>
> Gosh that tree has huge fanout. We fiddled around a bit with
> RADIX_TREE_MAP_SHIFT in the early days. Changing it by quite large amounts
> didn't appear to have much effect on anything, actually.

It might make sense to decrease it a bit with this work; that gives more
opportunity to parallelize. Currently mapping->tree_lock is pretty much
the most expensive operation in this area; it happily bounces around the
machine. The lockless work showed there is lots of room for improvement
here.

> btw, regarding __do_page_cache_readahead(): that function is presently
> optimised for file rereads - if all pages are present it'll only take the
> lock once. The first version of it was optimised the other way around - it
> took the lock once for uncached file reads (no pages present), but once
> per page for rereads. I made this change because Anton Blanchard had some
> workload which hurt, and we figured that rereads are the common case, and
> uncached reads are limited by disk speed anyway.
>
> Thinking about it, I guess that was a correct decision, but we did
> pessimise these read-a-gargantuan-file cases.
>
> But we could actually implement both flavours and pick the right one to
> call based upon the hit-rate metrics which the readahead code is
> maintaining (or add new ones).
>
> Of course, gang-populate would be better.

I could try to do a gang insert operation on top of my concurrent
pagecache tree.
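Interface-wise I'd imagine something like the below; hand-waving sketch
only, radix_tree_insert_gang() doesn't exist, the name is just made up
by analogy with radix_tree_gang_lookup():

#include <linux/radix-tree.h>

/*
 * Hypothetical gang insert: insert up to nr_items items at contiguous
 * indices starting at first_index, walking down and node-locking once
 * per leaf node instead of once per item.  Returns the number of items
 * actually inserted; stops early when it hits a slot a racer already
 * instantiated, so the caller can fall back to single-item inserts
 * (suitably adjusting ->index) from there.
 */
unsigned int radix_tree_insert_gang(struct radix_tree_root *root,
				    unsigned long first_index,
				    void **items, unsigned int nr_items);

A bulk write(2) path could then do something like:

	unsigned int nr, i;

	nr = radix_tree_insert_gang(&mapping->page_tree, index,
				    (void **)pages, nr_pages);
	for (i = nr; i < nr_pages; i++) {
		/* a racer got this slot first; insert pages[i] the
		 * old one-page-at-a-time way */
	}

With the current 2^6 fanout the fully batched case saves 2^6 - 1 = 63 of
every 64 traversals and lock/unlock cycles, as noted above.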