On Fri, Mar 12, 2010 at 11:36:33AM +0000, Paul Brook wrote:
> > On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> > > sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
> >
> > There's not just one hugepage size
>
> We only have one madvise flag...

Transparent hugepage support means _really_ transparent: it's not up to
userland to know what hugepage size the kernel uses, and there is no way for
userland to notice anything except that it runs faster. There is only one
madvise flag, and it exists for exactly one reason: embedded systems may want
to turn the transparency off to avoid the risk of using a little more memory
during anonymous-memory copy-on-writes after fork or similar. But for things
like kvm there is absolutely zero memory waste in enabling hugepages, so even
embedded definitely wants transparent hugepage enabled to run faster on its
underpowered CPU. If it weren't for embedded, the madvise flag would have to
be dropped as pointless. It's not about the page size at all.
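To make the point concrete, this is more or less all a program that always
benefits (like qemu/kvm) would ever need to do. Just a sketch, assuming the
MADV_HUGEPAGE define from the transparent hugepage patches is visible in the
headers; it's not the literal code from my patch:

#include <sys/mman.h>
#include <stddef.h>

/* Hypothetical helper, not from the actual patch.  Best-effort hint: ask
 * the kernel to back this range with transparent hugepages.  If the
 * kernel or the headers lack the feature, nothing changes and we keep
 * running on regular pages. */
static void hint_transparent_hugepages(void *addr, size_t len)
{
#ifdef MADV_HUGEPAGE
    madvise(addr, len, MADV_HUGEPAGE);
#endif
}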
> > and that thing doesn't exist yet
> > plus it'd require mangling over glibc too. If it existed I could use
> > it but I think this is better:
> >
> > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > 2097152
>
> Is "pmd" x86 specific?

It's Linux specific: this is common code, nothing x86 specific. In fact on x86
it's not even called pmd but Page Directory. I actually have no idea what pmd
stands for, but it's definitely not x86 specific; it refers to the Linux
common code shared by all archs. The reason this is called hpage_pmd_size is
that it's a #define HPAGE_PMD_SIZE in the kernel code, so it matches the
kernel-internal _common_code_ exactly.
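Picking the value up from a program is trivial too; a minimal sketch (the
fallback to the regular page size is my choice here, nothing the kernel
mandates):

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not from the actual patch.  Read the hugepage size
 * the kernel exports through sysfs.  If the file is missing (transparent
 * hugepage not available) fall back to the regular page size so the
 * caller can align unconditionally. */
static long transparent_hugepage_size(void)
{
    long size = sysconf(_SC_PAGESIZE);
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

    if (f) {
        long value;
        if (fscanf(f, "%ld", &value) == 1 && value > 0)
            size = value;
        fclose(f);
    }
    return size;
}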
> > > If the allocation size is not a multiple of the preferred alignment, then
> > > you probably lose either way, and we shouldn't be requesting increased
> > > alignment.
> >
> > That's probably a good idea. Also note, if we were to allocate the 0-640k
> > and 1m-end ranges separately, for NPT to work we'd need to start the
> > second block misaligned at a 1m address. So maybe I should move the
> > alignment out of qemu_ram_alloc and have it in the caller?
>
> I think the only viable solution if you care about EPT/NPT is to not do that.
> With your current code the 1m-end region will be misaligned - your code
> allocates it on a 2M boundary. I suspect you actually want (base % 2M) == 1M.
> Aligning on a 1M boundary will only DTRT half the time.

Well, with my current code on top of current qemu code there is no risk of
misalignment, because the whole 0-4G range is allocated in a single
qemu_ram_alloc. I'm sure it works right because /debugfs/kvm/largepages shows
all ram backed by largepages, and otherwise I wouldn't get a reproducible 6%
boost on kernel compiles in the guest even on a common $150 quad core
workstation (without even considering the boost on huge systems).

The 1m-end case is a hypothetical worry that came to mind as I was discussing
the issue with you. Basically my point is that if the pc.c code changes and
starts allocating the 0-640k and 1M-4G ranges with two separate qemu_ram_alloc
calls (which is _not_ what qemu does right now), the alignment in
qemu_ram_alloc that works today would stop working. This is why I thought it
might be more correct (and less virtual-ram-wasteful) to move the alignment
into the caller, even if the patch grows in size and becomes pc.c specific
(which it wouldn't need to be if other archs also supported transparent
hugepage). I think with what you're saying above you're basically agreeing
that I should move the alignment into the caller. Correct me if I
misunderstood.

> But that's only going to happen if you align the allocation.

Yep, this is why I agree with you: it's better to always align, even when
kvm_enabled() == 0.

> It can't choose what alignment to use, but it can (should?) choose how to
> achieve that alignment.

Ok, but I don't see a problem in how it achieves it; in fact I think it's more
efficient than a kernel-assisted alignment, which would then force the vma to
be split, generating a micro-slowdown.
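Just to spell out what I mean by userland achieving the alignment itself, here
is a sketch under my assumptions (it's not the code in the patch):
over-allocate the anonymous mapping and round the start up, leaving the slack
mapped so the vma never has to be split.

#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper, not from the actual patch.  Return a pointer
 * aligned to "align" (a power of two, e.g. 2M) inside a single anonymous
 * mapping.  The unused head and tail stay mapped but are never touched,
 * so they only cost virtual address space and the kernel never has to
 * split the vma. */
static void *ram_alloc_aligned(size_t size, size_t align)
{
    size_t len = size + align - 1;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    return (void *)(((uintptr_t)p + align - 1) & ~((uintptr_t)align - 1));
}

Whether that sits inside qemu_ram_alloc or in the pc.c caller doesn't change
the mechanism, only who decides the alignment.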