Pasha Tatashin wrote:
> On Mon, Apr 21, 2025 at 7:21 PM Dan Williams <dan.j.willi...@intel.com> wrote:
> >
> > Michal Clapinski wrote:
> > > Currently, the user has to specify each memory region to be used with
> > > nvdimm via the memmap parameter. Due to the character limit of the
> > > command line, this makes it impossible to have a lot of pmem devices.
> > > This new parameter solves this issue by allowing users to divide
> > > one e820 entry into many nvdimm regions.
> > >
> > > This change is needed for the hypervisor live update. VMs' memory will
> > > be backed by those emulated pmem devices. To support various VM shapes
> > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> >
> > This looks fairly straightforward, but if this moves forward I would
> > explicitly call the parameter something like "split" instead of "pmem"
> > to align it better with its usage.
> >
> > However, while this is expedient I wonder if you would be better
> > served with ACPI table injection to get more control and configuration
> > options...
> >
> > > It's also possible to expand this parameter in the future,
> > > e.g. to specify the type of the device (fsdax/devdax).
> >
> > ...for example, if you injected or customized your BIOS to supply an
> > ACPI NFIT table you could get to deeper degrees of customization without
> > wrestling with command lines. Supply an ACPI NFIT that carves up a large
> > memory-type range into an arbitrary number of regions. In the NFIT there
> > is a natural place to specify whether the range gets sent to PMEM. See
> > call to nvdimm_pmem_region_create() near NFIT_SPA_PM in
> > acpi_nfit_register_region(), and "simply" pick a new guid to signify
> > direct routing to device-dax. I say simply, but that implies new ACPI
> > NFIT driver plumbing for the new mode.
> >
> > Another overlooked detail about NFIT is that there is an opportunity to
> > determine cases where the platform might have changed the physical
> > address map from one boot to the next. In other words, I cringe at the
> > fragility of memmap=, but I understand that it has the benefit of being
> > simple. See the "nd_set cookie" concept in
> > acpi_nfit_init_interleave_set().
>
> I also dislike the potential fragility of the memmap= parameter;
> however, in our environment, kernel parameters are specifically
> crafted for target machine configurations and supplied separately from
> the kernel binary, giving us good control.
>
> Regarding the ACPI NFIT suggestion: Our use case involves reusing the
> same physical machines (with unchanged firmware) for various
> configurations (similar to loaning them out). An advantage for us is
> that switching the machine's role only requires changing the kernel
> parameters. The ACPI approach, potentially requiring firmware changes,
> would break this dynamic reconfiguration.
>
> As I understand, using ACPI injection instead of firmware change
> doesn't eliminate fragility concerns either. We would still need to
> carefully reserve the specific physical range for a particular machine
> configuration, and it also adds a dependency on managing and packaging
> an external NFIT injection file and process. We have a process for
> kernel parameters but doing this externally would complicate things
> for us.

Let's unpack a few things.

My assumption is that ACPI table injection deployment is similar in
complexity to kernel parameters because it is data appended to an
initrd. So if a deployment flow can:

    echo $parameters >> $boot_config

...it can instead:

    cat $base_initrd $nfit > $amended_initrd

As for the fragility, I do agree that without platform firmware changes
(base system NFIT) it would be difficult to detect that the platform is
booting in an unexpected physical memory layout.

So memmap= would be used to mark the memory as Reserved and then the
injected NFIT carves it up and optionally routes it to pmem or devdax.

The aspect I have not tried though is injecting an ACPI0012 device if
the platform does not already have one...

I think it is solvable and avoids continuing to stress the kernel
command line interface where ACPI can take over. At a minimum, confirm
whether amending initrds is a non-starter in your environment.

> Also, I might be missing something, but I haven't found a standard way
> to automatically create devdax devices using NFIT injection. Our

Yes, this is not there today, but would fit cleanly as a new
Linux-specific "Address Range Type GUID".

> current plan is to expand the proposed kernel parameter. We are
> working on making it default to creating either fsdax or devdax type
> regions, without requiring explicit labels, and ensuring these regions
> remain stable across kexec as long as the kernel parameter itself
> doesn't change (in a way kernel parameters take the role of the
> labels).

Yes, this should all work without labels.
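
To make the initrd amendment step above a bit more concrete, here is a
rough, untested sketch of what such a deployment helper might look like.
The function names, file names, and paths are hypothetical. It assumes a
kernel built with ACPI table upgrade support (CONFIG_ACPI_TABLE_UPGRADE,
as far as I know) and an already compiled NFIT table file. Note that the
kernel's initrd table override documentation places the uncompressed cpio
archive holding the tables ahead of the regular initrd, so the sketch
prepends it; check the ordering against the documentation for the kernel
in use.

```shell
#!/bin/sh
# Sketch only: function names, file names, and paths are hypothetical.

# Pack one ACPI table file into an uncompressed cpio archive laid out as
# kernel/firmware/acpi/<name>, the directory the kernel's initrd table
# override code looks in.
build_acpi_cpio() {
    table=$1    # compiled table, e.g. a hypothetical nfit.aml
    out=$2      # resulting uncompressed cpio archive
    stage=$(mktemp -d) || return 1
    mkdir -p "$stage/kernel/firmware/acpi" &&
        cp "$table" "$stage/kernel/firmware/acpi/" &&
        (cd "$stage" && find kernel | cpio -H newc -o) > "$out"
    rc=$?
    rm -rf "$stage"
    return $rc
}

# Concatenate the table archive and the unmodified base initrd,
# with the table archive first.
amend_initrd() {
    tables_cpio=$1
    base_initrd=$2
    amended=$3
    cat "$tables_cpio" "$base_initrd" > "$amended"
}

# Example use (all file names are placeholders):
#   build_acpi_cpio nfit.aml acpi_tables.cpio &&
#       amend_initrd acpi_tables.cpio "$base_initrd" "$amended_initrd"
```

Whether the injected table is actually picked up, and whether an ACPI0012
device can be supplied this way, would still need to be confirmed on the
target platform.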