On (22/03/07 10:38), Christoph Hellwig didst pronounce:
> On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface. During the hugetlb
> > BOFS at OLS last year, we talked about a character device that would
> > behave like /dev/zero. Many of the people were talking about how they
> > just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> > about the hugetlbfs filesystem. /dev/zero is a familiar interface for
> > getting anonymous memory so bringing that model to huge pages would make
> > programming for anonymous huge pages easier.
>
> That is a very laudable goal, but an utterly wrong way to get there.
> Despite Linus' veto a while ago what we really want is support for transparent
> super pages.

A year ago, I may have agreed with you. However, Linus not only veto'd it
but stamped on it repeatedly at VM Summit. He couldn't have made it clearer
if he wore a t-shirt and a hat and held up a neon sign. The assertion at the
time was that variable page support of any sort had to be outside of the
core VM because automatic support will get it wrong in some cases and make
the core VM harder to understand (because it's super-clear at the moment).
Others attending agreed with the position. That position rules out drivers
or filesystems giving hints about superpage sizes in the foreseeable future.

What they did not have any problem with was providing better interfaces to
program against as long as they were on the side of the VM like hugetlbfs
and not in the core. The character device for private mappings is an example
of an interface that is easier to program against than hugetlbfs. It's far
easier for an application to mmap a file at a fixed location than trying to
discover if hugetlbfs is mounted or not. However, to support that sort of
interface, there needs to be a way of telling the VM to call an alternative
pagetable handler - hence Adam's patches.

Someone with sufficient energy could try implementing variable page support
entirely as a device using Adam's interface. If it turned out to be a good
idea, then another push could be made for transparent support later.

As it is, transparent superpage support is also a bit of a bitch for Power
and IA64. Power because in many cases (not all), pages of two different
sizes cannot be in the same virtual address range. IA64 has issues because
with the *current* pagetable implementation, hugepages are limited to fixed
address ranges. These sorts of issues alone make transparent support in the
kernel a non-trivial problem.

> Adding random pointer indirections where we had the direct
> hugetlb calls before isn't helpful for that at all.

They aren't random, they are pretty specific.
Also, even when paths like fault are entered, the cost of an indirect call
is insignificant in comparison to the page allocation, clearing the page and
updating page tables. In Adam's current patches, the indirect call only
happens when a driver is using the pagetable ops. In the tests I looked at,
the cost of the branch could only be detected on an instruction-level
profile and even the branch cost was pretty damn tiny. If it was the case
that indirect calls always took place, it *might* be a bit more noticeable
but still nothing in comparison to the cost of the remainder of the
operation.

> As a start you might
> want to make a clear destinction between core hugetlb code and the
> filesystem interface to it without all the useless indirections.

The indirect calls are about supporting interfaces to userspace. In
practice, the hugetlbfs interface, the shared memory interface and the
character device interface would share a large amount of core code.
Admittedly that code could do with restructuring because it's all mangled
together at the moment. The core hugetlb code as you call it is mainly
dealing with page cache and huge page pool management. The filesystem layer
is relatively thin on top of it. With Adam's pagetable abstraction, it would
make more sense to restructure the huge page code and separate out
core-support-for-superpages from hugetlbfs.

> That
> should get you as far as your char dev interface.

No, it wouldn't. Restructuring the current code would allow better sharing
between interfaces but that's it. At the end of the restructuring, we'd
still need a way of saying "this VMA should be using some but not all the
hugetlb code over there even though I'm not hugetlbfs". At that point, we'd
be back at the pagetable ops abstraction.

> But over the long
> term the core VM needs to deal with multiple (and probably not just two)
> page sizes.
> Given that the code to deal with different sized pages is
> essentially the same just on different units on most architectures cries
> for a better method to implement this than adding random function indirection
> that point to mostly identical code.
>

Internally, a semi-sane way of supporting multiple page sizes would be to
have one internal VFS mount per page size and use the hugetlbfs page cache
management code. Currently, HugetlbFS is basically a wrapper around an
internal VFS mount whose pages happen to be a specific size.

That said, variable page sizes is a different problem to the one Adam is
addressing here. In fact, someone with sufficient energy could implement a
variable page device behind Adam's abstraction just to see if it worked in
practice or not.

Restructuring to support something like variable page support and more than
one interface would look something like this:

HugetlbFS Interface      Shared Memory Interface         Char Device
        |                           |                         |
        |                           |                |-----------------
        |                           |                |                |
Internal hugetlbfs mount     Internal mount        mount            mount
size HugePageSize           size HugePageSize  HugePageSize    different size
        |                           |                |                |
        |--------------------------------------------------------------
                                    |
                      Hugepage Reservation tracking
                                    |
                        Hugepage pool management
                                    |
                             Page allocator

But if this nice arrangement existed today, Adam's patches would still be
needed to make it usable.

> And your driver is the best example of why we utterly don't want
> a page_table operations interface. The last thing we want is random
> driver taking over core VM functionality.

Who said anything about random? If a new driver of any sort shows up using
pagetable_ops, the developer will certainly be asked what they are doing
that for.

> The right way would be to a
> filesystem/driver to tell (or maybe just give hints) which page size
> to use for this mapping.
>

If a driver wants to "tell" what pagesize to use, they can override the ops
to call the appropriate hugetlb code. Hints will be damn near impossible to
get right in all cases.
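
To make the dispatch being argued about concrete, here is a minimal
userspace sketch in C. It is not taken from Adam's patches: the struct
layouts, the function names (default_fault, hugetlb_fault_stub,
handle_fault), the ops table name (hugetlb_chardev_ops) and the two page
sizes are all assumptions made up for illustration. It only shows the shape
of the idea, namely that the indirect call is taken solely when a VMA
carries its own ops, and that a driver "tells" the VM which page size to use
by pointing those ops at the shared hugetlb code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page sizes for illustration only (typical x86-64 values). */
#define BASE_PAGE_SIZE (4UL * 1024)
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

struct vma;

/* Hypothetical ops table; the real patches define their own set of hooks. */
struct pagetable_ops {
	/* Handle a fault at addr; returns the size of the page mapped. */
	size_t (*fault)(struct vma *vma, unsigned long addr);
};

/* Hypothetical, heavily reduced stand-in for a VMA. */
struct vma {
	unsigned long start;
	unsigned long end;
	const struct pagetable_ops *pt_ops;	/* NULL for ordinary mappings */
};

/* Default path: what every mapping gets when no ops are installed. */
size_t default_fault(struct vma *vma, unsigned long addr)
{
	(void)vma;
	(void)addr;
	return BASE_PAGE_SIZE;
}

/* Stand-in for the shared hugetlb code a driver would point its ops at. */
size_t hugetlb_fault_stub(struct vma *vma, unsigned long addr)
{
	(void)vma;
	(void)addr;
	return HUGE_PAGE_SIZE;
}

/* Ops a hugepage character device might install on the VMAs it maps. */
const struct pagetable_ops hugetlb_chardev_ops = {
	.fault = hugetlb_fault_stub,
};

/* The indirect call is only taken when the VMA carries its own ops. */
size_t handle_fault(struct vma *vma, unsigned long addr)
{
	assert(addr >= vma->start && addr < vma->end);

	if (vma->pt_ops && vma->pt_ops->fault)
		return vma->pt_ops->fault(vma, addr);

	return default_fault(vma, addr);
}
```

With a VMA whose pt_ops is NULL, handle_fault() stays on the direct path and
reports BASE_PAGE_SIZE; with pt_ops set to &hugetlb_chardev_ops it goes
through the indirect call and reports HUGE_PAGE_SIZE. Ordinary mappings pay
only for the NULL check.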

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab