On Thu, Jun 13, 2013 at 09:40:14AM +0800, Wanlong Gao wrote:
> On 06/11/2013 09:40 PM, Eduardo Habkost wrote:
> > On Tue, Jun 11, 2013 at 03:22:13PM +0800, Wanlong Gao wrote:
> >> On 06/05/2013 09:46 PM, Eduardo Habkost wrote:
> >>> On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote:
> >>>> Add monitor command mem-nodes to show the huge mapped
> >>>> memory nodes locations.
> >>>>
> >>>
> >>> This is for machine consumption, so we need a QMP command.
> >>>
> >>>> (qemu) info mem-nodes
> >>>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
> >>>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
> >>>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
> >>>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
> >>>
> >>> Are node0/node1 _host_ nodes?
> >>>
> >>> How do I know what's the _guest_ address/node corresponding to each
> >>> file/range above?
> >>>
> >>> What I am really looking for is:
> >>>
> >>> * The correspondence between guest (virtual) NUMA nodes and guest
> >>>   physical address ranges (it could be provided by the QMP version
> >>>   of "info numa")
> >>
> >> AFAIK, the guest NUMA nodes and guest physical address ranges are set
> >> by seabios; we can't get this information from QEMU,
> >
> > QEMU _has_ to know about it, otherwise we would never be able to know
> > which virtual addresses inside the QEMU process (or offsets inside the
> > backing files) belong to which virtual NUMA node.
>
> Nope, if I'm right, it's actually linear except that there are holes in
> the physical address space. So we can know which node a guest virtual
> address belongs to just from each NUMA node's size.

You are just describing a way to accomplish the item I asked about above:
finding out the correspondence between guest physical addresses and NUMA
nodes. :)
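
(For concreteness, I read that as a lookup roughly like the sketch below.
It is purely illustrative: it assumes a strictly linear per-node layout,
ignores the holes you mention, and the node sizes and the sample address
are invented.)

  /* Purely illustrative: map a guest physical address to a guest NUMA
   * node by walking the per-node sizes in order, assuming the nodes are
   * laid out linearly and ignoring holes.  Sizes and the sample address
   * are invented. */
  #include <inttypes.h>
  #include <stdio.h>

  static int guest_addr_to_node(uint64_t addr, const uint64_t *node_size,
                                int nr_nodes)
  {
      uint64_t base = 0;
      int i;

      for (i = 0; i < nr_nodes; i++) {
          if (addr >= base && addr < base + node_size[i]) {
              return i;
          }
          base += node_size[i];
      }
      return -1;  /* address not covered by any node */
  }

  int main(void)
  {
      /* e.g. a 2-node guest with 4G + 4G of RAM (invented) */
      const uint64_t node_size[] = { 4ULL << 30, 4ULL << 30 };
      uint64_t addr = 5ULL << 30;

      printf("0x%" PRIx64 " -> node%d\n", addr,
             guest_addr_to_node(addr, node_size, 2));
      return 0;
  }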
(But I would prefer to have something more explicit in the QMP interface
instead of something implicit that assumes a predefined binding)

> It's enough for us if we
> can provide a QMP interface from QEMU to let external tools like libvirt
> set the host memory binding policies according to the QMP interface, and
> we can also provide a QEMU command-line option to set the host bindings
> before we start the QEMU process.

And how would you identify memory regions through this memory binding QMP
interface, if not by guest physical addresses?

>
> >
> > (After all, the NUMA wiring is a hardware feature, not something that
> > the BIOS can decide)
>
> But this is an ACPI table which is written by seabios now. AFAIK, there
> is no agreement yet on moving this part into QEMU (with the QEMU
> interfaces for seabios removed) or just leaving it there.

It doesn't matter who writes the ACPI table. QEMU must always know on
which virtual NUMA node each memory region is located.

> >
> >> and I think this
> >> information is useless for pinning memory ranges to the host.
> >
> > Well, we have to somehow identify each region of guest memory when
> > deciding how to pin it. How would you identify it without using guest
> > physical addresses? Guest physical addresses are more meaningful than
> > the QEMU virtual addresses your patch exposes (that are meaningless
> > outside QEMU).
>
> As I mentioned above, we can know this just by the guest node memory
> sizes, and can set the host bindings by treating these sizes as offsets.
> And I think we only need to set the host memory binding policies for
> each guest NUMA node. It's unnecessary to set policies for each region
> as you said.

I believe an interface based on guest physical memory addresses is more
flexible (and even simpler!) than one that only allows binding of whole
virtual NUMA nodes.

(And I still don't understand why you are exposing QEMU virtual memory
addresses in the new command, if they are useless).

> >
> >>> * The correspondence between guest physical address ranges and
> >>>   ranges inside the mapped files (so external tools could set the
> >>>   policy on those files instead of requiring QEMU to set it directly)
> >>>
> >>> I understand that your use case may require additional information
> >>> and additional interfaces. But if we provide the information above
> >>> we will allow external components to set the policy on the hugetlbfs
> >>> files before we add new interfaces required for your use case.
> >>
> >> But file-backed memory is not good for a host which runs many virtual
> >> machines; in this situation, we can't handle anon THP yet.
> >
> > I don't understand what you mean, here. What prevents someone from
> > using file-backed memory with multiple virtual machines?
>
> If we use hugetlbfs-backed memory, we have to know how many virtual
> machines there are and how much memory each VM will use, and then
> reserve these pages for them. We may even have to reserve more pages
> for external tools (numactl) to set memory policies. The memory
> reservation also has its own memory policies. It's very hard to control
> it to get what we want.

Well, it's hard because we don't even have tools to help on that, yet.
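
(To make the kind of tool I have in mind concrete, the rough sketch below
shows what an external component could do with one of the backing files.
It is untested; the path, offset, length and node are invented, and a real
tool would take the ranges from the QMP information discussed above.
numactl's --file option does roughly the same thing from the command
line.)

  /* Rough, untested sketch: pre-place one range of a hugetlbfs (or
   * tmpfs) backing file on host node 1 from outside QEMU.  The path,
   * offset, length and node are invented.  Build with -lnuma. */
  #include <fcntl.h>
  #include <numaif.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const char *path = "/hugepages/guest-node1"; /* invented path */
      off_t offset = 0;                  /* must be hugepage aligned */
      size_t length = 1UL << 30;         /* 1G range, made up */
      size_t step = 2UL << 20;           /* 2M huge pages assumed */
      unsigned long nodemask = 1UL << 1; /* host node 1 */
      size_t i;
      int fd;
      char *p;

      fd = open(path, O_RDWR);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
               offset);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      /* Bind this mapping of the file range to the chosen host node. */
      if (mbind(p, length, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8, 0) < 0) {
          perror("mbind");
          return 1;
      }

      /* Fault the pages in so they are allocated on that node now; only
       * safe while the guest has not touched this range yet. */
      for (i = 0; i < length; i += step) {
          p[i] = 0;
      }

      munmap(p, length);
      close(fd);
      return 0;
  }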
Anyway, I understand that you want to make it work with THP as well. But
if THP works with tmpfs (does it?), people could then use exactly the
same file-based mechanisms with tmpfs and keep THP working.

(Right now I am doing some experiments to understand how the system
behaves when using numactl on hugetlbfs and tmpfs, before and after
getting the files mapped).

> >
> >>
> >> And as I mentioned, the cross-NUMA-node access performance regression
> >> is caused by pci-passthrough. It's a very long-standing bug; we
> >> should backport the host memory pinning patch to older QEMU versions
> >> to resolve this performance problem, too.
> >
> > If it's a regression, what's the last version of QEMU where the bug
> > wasn't present?
> >
>
> As QEMU doesn't support host memory binding, I think this has been
> present since we started supporting guest NUMA, and pci-passthrough
> made it even worse.

If the problem was always present, it is not a regression, is it?

-- 
Eduardo