On (27/04/07 20:05), Nick Piggin didst pronounce:
> Christoph Hellwig wrote:
> >On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> >
> >>>Well maybe you could explain what you want. Preferably without
> >>>redefining the established terms?
> >>
> >>Support for larger buffers than page cache pages.
> >
> >
> >I don't think you really want this :) The whole non-pagecache I/O
> >path before 2.3 was a total pain just because it used buffers to drive
> >I/O. Add to that buffers bigger than a page and you add another
> >two magnitudes of complexity. If you want to see a mess like that
> >download one of the early XFS/Linux releases that had an I/O path
> >like that. I _really_ _really_ don't want to go there.
>
> I'm not actually suggesting to add anything like that. But I think
> larger blocks can be doable while retaining the "buffer" layer as a
> relatively simple pagecache to block translation.
>
> Anyway, I'm working on patches... they might crash and burn, but we
> might have something to talk about later.
>
>
> >Linux has a long tradition of trading a tiny bit of efficiency for
> >much cleaner code, and I'd for 100% go down Christoph's route here.
> >Then again I'd actually be rather surprised if > page buffers
> >were more efficient - you'd run into shitloads of overhead due to
> >them being non-contiguous like calling vmap all over the place,
> >reprogramming iommus to at least make them look virtually contiguous [1],
> >etc..
>
> I still think hardware should work reasonably well with 4K pages. The
> SGI io controllers and/or the Linux block layer that doesn't allow more
> than 128 sg entries is clearly suboptimal if the hardware runs twice as
> fast with 2MB submissions.
>
>
> >I also don't quite get what your problem with higher order allocations
> >are. order 1 allocations are generally just fine, and in fact
> >thread stacks are >= order 1 on most architectures. And if the pagecache
> >uses higher order allocations that means we'll finally fix our problems
> >with them, which we have to do anyway. Workloads continue to grow and
> >with them the kernel overhead to manage them, while the pagesize for
> >many architectures is fixed. So we'll have to deal with order 1
> >and order 2 allocations better just for backing kmalloc and co.
>
> The pagecache is much bigger and often a lot more activity than these
> other things though. Also, the more things you add to higher order
> allocations, the more pressure you have.
>
> I like PAGE_SIZE pagecache, because it is reliable and really fast, if
> you need to reclaim a page it should be almost O(1).
>
>
> >Or think jumboframes for that matter.
>
> They can actually run into problems if the hardware wants contiguous
> memory.
>
> I don't know why you think the fragmentation issues are just magically
> fixed. It is hard and inefficient to reclaim larger order blocks (even
> with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
> time I looked, they needed to keep at least 16MB of pages free to be
> reasonably effective (or do we just say that people with less than XMB
> of memory shouldn't be accessing these filesystems anyway?)

It'll work without adjusting min_free_kbytes at all. Keeping 16MB free
gave better results after the fragmentation stress tests, but the gain
was a few percent of memory when allocating it as huge pages, as opposed
to things falling apart without it. The success rates were still way, way
higher than with the vanilla kernel.

>, and I'm
> not sure if they have been tested for long term stability in the
> presence of a reasonable amount of higher order allocations.
>

I don't have a sample workload that has a reasonable amount of
higher-order allocations over a longer period of time. When the next -mm
comes out, SLUB will be able to use high-order pages, so I'll boot my
machine with less memory to pressure it more. Assuming the kernel boots
on my desktop machine, I should get some idea of what its long-term
behaviour looks like.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab