On 3/11/2022 4:42 AM, Igor Mammedov wrote: > On Thu, 10 Mar 2022 13:18:35 -0500 > Steven Sistare <steven.sist...@oracle.com> wrote: > >> On 3/10/2022 12:28 PM, Steven Sistare wrote: >>> On 3/10/2022 11:00 AM, Igor Mammedov wrote: >>>> On Thu, 10 Mar 2022 10:36:08 -0500 >>>> Steven Sistare <steven.sist...@oracle.com> wrote: >>>> >>>>> On 3/8/2022 2:20 AM, Igor Mammedov wrote: >>>>>> On Tue, 8 Mar 2022 01:50:11 -0500 >>>>>> "Michael S. Tsirkin" <m...@redhat.com> wrote: >>>>>> >>>>>>> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote: >>>>>>>> On 3/4/2022 5:41 AM, Igor Mammedov wrote: >>>>>>>>> On Thu, 3 Mar 2022 12:21:15 -0500 >>>>>>>>> "Michael S. Tsirkin" <m...@redhat.com> wrote: >>>>>>>>> >>>>>>>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote: >>>>>>>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc >>>>>>>>>>> machine >>>>>>>>>>> option is set. >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com> >>>>>>>>>>> --- >>>>>>>>>>> hw/core/machine.c | 19 +++++++++++++++++++ >>>>>>>>>>> include/hw/boards.h | 1 + >>>>>>>>>>> qemu-options.hx | 6 ++++++ >>>>>>>>>>> softmmu/physmem.c | 47 >>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++--------- >>>>>>>>>>> softmmu/vl.c | 1 + >>>>>>>>>>> trace-events | 1 + >>>>>>>>>>> util/qemu-config.c | 4 ++++ >>>>>>>>>>> 7 files changed, 70 insertions(+), 9 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c >>>>>>>>>>> index 53a99ab..7739d88 100644 >>>>>>>>>>> --- a/hw/core/machine.c >>>>>>>>>>> +++ b/hw/core/machine.c >>>>>>>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, >>>>>>>>>>> bool value, Error **errp) >>>>>>>>>>> ms->mem_merge = value; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp) >>>>>>>>>>> +{ >>>>>>>>>>> + MachineState *ms = MACHINE(obj); >>>>>>>>>>> + >>>>>>>>>>> + return ms->memfd_alloc; >>>>>>>>>>> +} >>>>>>>>>>> + >>>>>>>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error >>>>>>>>>>> **errp) >>>>>>>>>>> +{ >>>>>>>>>>> + MachineState *ms = MACHINE(obj); >>>>>>>>>>> + >>>>>>>>>>> + ms->memfd_alloc = value; >>>>>>>>>>> +} >>>>>>>>>>> + >>>>>>>>>>> static bool machine_get_usb(Object *obj, Error **errp) >>>>>>>>>>> { >>>>>>>>>>> MachineState *ms = MACHINE(obj); >>>>>>>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass >>>>>>>>>>> *oc, void *data) >>>>>>>>>>> object_class_property_set_description(oc, "mem-merge", >>>>>>>>>>> "Enable/disable memory merge support"); >>>>>>>>>>> >>>>>>>>>>> + object_class_property_add_bool(oc, "memfd-alloc", >>>>>>>>>>> + machine_get_memfd_alloc, machine_set_memfd_alloc); >>>>>>>>>>> + object_class_property_set_description(oc, "memfd-alloc", >>>>>>>>>>> + "Enable/disable allocating anonymous memory using >>>>>>>>>>> memfd_create"); >>>>>>>>>>> + >>>>>>>>>>> object_class_property_add_bool(oc, "usb", >>>>>>>>>>> machine_get_usb, machine_set_usb); >>>>>>>>>>> object_class_property_set_description(oc, "usb", >>>>>>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h >>>>>>>>>>> index 9c1c190..a57d7a0 100644 >>>>>>>>>>> --- a/include/hw/boards.h >>>>>>>>>>> +++ b/include/hw/boards.h >>>>>>>>>>> @@ -327,6 +327,7 @@ struct MachineState { >>>>>>>>>>> char *dt_compatible; >>>>>>>>>>> bool dump_guest_core; >>>>>>>>>>> bool mem_merge; >>>>>>>>>>> + bool memfd_alloc; >>>>>>>>>>> bool usb; >>>>>>>>>>> bool usb_disabled; >>>>>>>>>>> char *firmware; >>>>>>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx >>>>>>>>>>> index 7d47510..33c8173 100644 >>>>>>>>>>> --- a/qemu-options.hx >>>>>>>>>>> +++ b/qemu-options.hx >>>>>>>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ >>>>>>>>>>> " vmport=on|off|auto controls emulation of >>>>>>>>>>> vmport (default: auto)\n" >>>>>>>>>>> " dump-guest-core=on|off include guest memory >>>>>>>>>>> in a core dump (default=on)\n" >>>>>>>>>>> " mem-merge=on|off controls memory merge >>>>>>>>>>> support (default: on)\n" >>>>>>>>>>> + " memfd-alloc=on|off controls allocating >>>>>>>>>>> anonymous guest RAM using memfd_create (default: off)\n" >>>>>>>>>> >>>>>>>>>> Question: are there any disadvantages associated with using >>>>>>>>>> memfd_create? I guess we are using up an fd, but that seems minor. >>>>>>>>>> Any >>>>>>>>>> reason not to set to on by default? maybe with a fallback option to >>>>>>>>>> disable that? >>>>>>>> >>>>>>>> Old Linux host kernels, circa 4.1, do not support huge pages for >>>>>>>> shared memory. >>>>>>>> Also, the tunable to enable huge pages for share memory is different >>>>>>>> than for >>>>>>>> anon memory, so there could be performance loss if it is not set >>>>>>>> correctly. >>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled >>>>>>>> vs >>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled >>>>>>> >>>>>>> I guess we can test this when launching the VM, and select >>>>>>> a good default. >>>>>>> >>>>>>>> It might make sense to use memfd_create by default for the secondary >>>>>>>> segments. >>>>>>> >>>>>>> Well there's also KSM now you mention it. >>>>>> >>>>>> then another quest, is there downside to always using memfd_create >>>>>> without any knobs being involved? >>>>> >>>>> Lower performance if small pages are used (but Michael suggests qemu >>>>> could >>>>> automatically check the tunable and use anon memory instead) >>>>> >>>>> KSM (same page merging) is not supported for shared memory, so >>>>> ram_block_add -> >>>>> memory_try_enable_merging will not enable it. >>>>> >>>>> In both cases, I expect the degradation would be negligible if >>>>> memfd_create is >>>>> only automatically applied to the secondary segments, which are typically >>>>> small. >>>>> But, someone's secondary segment could be larger, and it is time >>>>> consuming to >>>>> prove innocence when someone claims your change caused their performance >>>>> regression. >>>> >>>> Adding David as memory subsystem maintainer, maybe he will a better >>>> idea instead of introducing global knob that would also magically alter >>>> backends' behavior despite of its their configured settings. >>> >>> OK, in ram_block_add I can set the RAM_SHARED flag based on the >>> memory-backend object's >>> shared flag. I already set the latter in create_default_memdev when >>> memfd-alloc is >>> specified. With that change, we do not override configured settings. >>> Users can no longer >>> use memory-backend-ram for CPR, and must change all memory-backend-ram to >>> memory-backend-memfd >>> in the command-line arguments. That is fine. >>> >>> With that change, are you OK with this patch? >> >> Sorry, I mis-read my own code in ram_block_add. The existing code is >> correct and does >> not alter any backend's behavior. It only sets the shared flag when the >> ram is *not* >> being allocated for a backend: >> >> if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) { >> new_block->flags |= RAM_SHARED; >> } >> > > ok, maybe instead of introducing a generic option, introduce the high level > feature one that turns this and other necessary quirks for it to work (i.e. > something like live-update=on|off). > That will not make QEMU internals any better but at least it will hide obscure > memfd-alloc from users.
That occurred to me, but during this review, a few folks have said it would be useful to expose memfd-alloc directly. And, I don't think that hiding memfd-alloc under another flag is helpful, as some users will still want to understand how enabling cpr affects the VM environment. I would still be documenting the memfd behavior for the hypothetical new flag. I currently document memfd-alloc in the following places. If you think this is confusing or incomplete, please let me know: qemu-options.hx @item memfd-alloc=on|off Enables or disables allocation of anonymous guest RAM using memfd_create. Any associated memory-backend objects are created with share=on. The memfd-alloc default is off. hmp-commands.hx @item cpr-save @var{filename} @var{mode} ... If @var{mode} is 'restart', the checkpoint remains valid after restarting qemu using a subsequent cpr-exec. All guest RAM objects must be shared. The share=on property is required for memory created with an explicit -object option, and the memfd-alloc machine property is required for memory that is implicitly created. And this error message in a few places for the only-cpr-capable command line option: "only-cpr-capable requires -machine memfd-alloc=on" > Is there a patch that makes QEMU error out if backend without > shared=on is used? No. I will add that check, thanks. > Also, can you answer question below, pls > or point to a patch in series that takes care of that invariant? > > [...] > >>>>>>>> There is currently no way to specify memory backends for the secondary >>>>>>>> memory >>>>>>>> segments (vram, roms, etc), and IMO it would be onerous to specify a >>>>>>>> backend for >>>>>>>> each of them. On x86_64, these include pc.bios, vga.vram, pc.rom, >>>>>>>> vga.rom, >>>>>>>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp. >>>> >>>> MemoryRegion is not the only place where state is stored. >>>> If we only talk about fwcfg entries state, it can also reference >>>> plain malloced memory allocated elsewhere or make a deep copy internally. >>>> Similarly devices also may store state outside of RamBlock framework. >>>> >>>> How are you dealing with that? Sorry, I missed this before. fwcfg defines vmstate handlers that save and restore all state to a file across the cpr operation, similar to live migration. In general, if it works for live migration, then it works for cpr. If you find a counter-example, please let me know. - Steve