On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
There's not just one hugepage size, and that sysconf doesn't exist yet;
adding it would also require changes to glibc. If it existed I could use
it, but I think this is better:

$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
2097152

Ok? If this file doesn't exist we won't align, so we also align on qemu,
not only on kvm, for the concern below about the first and last bytes.

> > Also khugepaged can later zero out the pte_none regions to create a
> > full segment all backed by hugepages, however if we do that khugepaged
> > will eat into the free memory space. At the moment I kept khugepaged a
> > zero-memory-footprint thing. But I'm currently adding an option called
> > collapse_unmapped to allow khugepaged to collapse unmapped pages too,
> > so if there are only 2/3 pages in the region before the memalign, they
> > also can be mapped by a large tlb to allow qemu to run faster.
>
> I don't really understand what you're getting at here. Surely a naturally
> aligned block is always going to be easier to defragment than a misaligned
> block.

Basically what I was saying is this: suppose only subpages 0 and 1 of a
hugepage-sized virtual range are touched. posix_memalign then extends the
vma, and nobody is ever going to touch pages 2-511 because those are the
virtual addresses wasted. Before, khugepaged couldn't allocate a hugepage
for only pages 0 and 1 because the vma stopped there, but after the vma is
extended it can. So previously I wasn't mapping this range with a
hugepage, but now I'm mapping it with a hugepage too. A sysfs control
will select the max number of unmapped subpages allowed for the collapse
to happen: with just 1 subpage mapped in the hugepage virtual range, it
doesn't make sense to use a large tlb and waste 511 pages of ram.

> If the allocation size is not a multiple of the preferred alignment, then
> you probably lose either way, and we shouldn't be requesting increased
> alignment.

That's probably a good idea. Also note, if we were to allocate the
0-640k and 1m-end regions separately, for NPT to work we'd need to start
the second block misaligned, at a 1m address. So maybe I should move the
alignment out of qemu_ram_alloc and have it in the caller?

> I wouldn't be surprised if putting the start of guest ram on a large TLB
> entry was a win. Your guest kernel often lives there!

Yep, that's easy to handle with the hpage_pmd_size ;).

> Assuming we're allocating in large chunks, I doubt an extra hugepage worth
> of VMA is a big issue.
>
> Either way I'd argue that this isn't something qemu should have to care
> about, and is actually a bug in posix_memalign.

Hmm, the last is a weird claim considering posix_memalign gets an explicit
alignment parameter and it surely can't choose what alignment to use. We
can argue about the kernel side having to align automatically, but again,
if it did that it'd generate unnecessary vma holes, which we don't want.

I think it's quite simple: just use my new sysfs control, and if it
exists, always use that alignment instead of the default. We've only to
decide whether to align inside or outside of qemu_ram_alloc.
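To make the idea concrete, here is a minimal sketch (not the actual
patch; the helper names qemu_thp_alignment and qemu_ram_alloc_aligned are
hypothetical) of reading the proposed hpage_pmd_size file and using it as
the posix_memalign alignment, falling back to plain page alignment when
the file isn't there:

  /* Sketch only: pick the allocation alignment from the proposed
   * transparent hugepage sysfs file, or fall back to the page size
   * when THP isn't available. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  static size_t qemu_thp_alignment(void)
  {
      FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
                      "r");
      unsigned long size;

      if (!f) {
          /* no THP: don't request any extra alignment */
          return sysconf(_SC_PAGESIZE);
      }
      if (fscanf(f, "%lu", &size) != 1) {
          size = sysconf(_SC_PAGESIZE);
      }
      fclose(f);
      return size;          /* 2097152 on x86-64 with 2M pmds */
  }

  static void *qemu_ram_alloc_aligned(size_t len)
  {
      void *ptr;

      if (posix_memalign(&ptr, qemu_thp_alignment(), len)) {
          return NULL;
      }
      return ptr;
  }

Whether something like this sits inside qemu_ram_alloc or in its caller
is exactly the open question above: for the 0-640k / 1m-end split, only
the caller knows the second block has to start at a 1m guest address, so
the caller may be the better place to decide the alignment.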