On Fri, 19 Jul 2024 at 00:03, Danny Canter <danny_can...@apple.com> wrote: > > This patch's main focus is to enable creating VMs with > 63GB > of RAM on Apple Silicon machines by using some new HVF APIs. In > pursuit of this a couple of things related to how we handle the > physical address range we expose to guests were altered: > > The default IPA size on all Apple Silicon machines for HVF is > currently 36 bits. This bars making a VM with > 63GB (as RAM > starts at 1GB in the memory map) of RAM. Currently, to get the > IPA size we were reading id_aa64mmfr0_el1's PARange field > from a newly made vcpu. Unfortunately HVF just returns the > hosts PARange directly for the initial value and not the IPA > size that will actually back the VM, so we believe we have much > more address space than we actually do today it seems.
So just to check my understanding, this means that with current QEMU, on all Apple hardware, attempting to create a VM with more than 63 GB of RAM will always fail in the same way, regardless of whether that CPU's hardware has a 36 bit IPA or a larger IPA? That is, we don't change the default IPA for the VM, so it's 36 bits, and then the hvf command to map in the RAM to the guest address space fails with HV_BAD_ARGUMENT, per https://gitlab.com/qemu-project/qemu/-/issues/1816 . > Starting in macOS 13.0 some APIs were introduced to be able to > query the maximum IPA size the kernel supports, and to set the IPA > size for a given VM. However, this still has a couple of issues > on < macOS 15. Up until macOS 15 (and if the hardware supported > it) the max IPA size was 39 bits which is not a valid PARange > value, so we can't clamp down what we advertise in the vcpu's > id_aa64mmfr0_el1 to our IPA size. Starting in macOS 15 however, > the maximum IPA size is 40 bits (if it's supported in the hardware > as well) which is also a valid PARange value, so we can set our IPA > size to the maximum as well as clamp down the PARange we advertise > to the guest. This allows VMs with 63+ GB of ram and should fix the > oddness of the PARange situation as well. So (again to clarify that I understand what's happening here) for macos 13-14 we'll effectively continue to use a 36-bit IPA range because we clamp the "39" value down to the next lowest actually-valid value of 36 ? And so if you want >63GB of memory you'll need all of: * a host CPU which supports at least a 40 bit IPA (is there a definition somewhere of which these are?) * macos 15 * a QEMU with these changes ? (That seems fine to me: I'm happy to say "get macos 15 if you want this" rather than trying to cope with the non-standard 39 bit IPA in QEMU. We should make sure the error message in the IPA-too-small case is comprehensible -- I think at the moment we somewhat unhelpfully assert()...) > For the implementation of this I've decided to only bump the IPA > size if the amount of RAM requested is encroaching on the default IPA > size of 36 bits, as at 40 bits of IPA space we have to have one extra > level of stage2 page tables. > > Signed-off-by: Danny Canter <danny_can...@apple.com> > Reviewed-by: Cameron Esfahani <di...@apple.com> > @@ -929,6 +977,66 @@ void hvf_arch_vcpu_destroy(CPUState *cpu) > { > } > > +hv_return_t hvf_arch_vm_create(MachineState *ms) > +{ > + uint32_t default_ipa_size = hvf_get_default_ipa_bit_size(); > + uint32_t max_ipa_size = hvf_get_max_ipa_bit_size(); > + hv_return_t ret; > + > + chosen_ipa_bit_size = default_ipa_size; > + > + /* > + * Set the IPA size for the VM: > + * > + * Starting from macOS 13 a new set of APIs were introduced that allow > you > + * to query for the maximum IPA size that is supported on your system. > macOS > + * 13 and 14's kernel both return a value less than 40 bits (typically > 39, > + * but depends on hardware), however starting in macOS 15 and up the IPA > + * size supported (in the kernel at least) is up to 40 bits. A common > scheme > + * for attempting to get the IPA size prior to the introduction of these > new > + * APIs was to read ID_AA64MMFR0.PARange from a vcpu in the hopes that > HVF > + * was returning the maximum IPA size in that. However, this was not the > + * case. HVF would return the host's PARange value directly which is > + * generally larger than 40 bits. > + * > + * Using that value we could set up our memory map with regions much > outside > + * the actually supported IPA size, and also advertise a much larger > + * physical address space to the guest. On the hardware+OS combos where > + * the IPA size is less than 40 bits, but greater than 36, we also don't > + * have a valid PARange value to round down to before 36 bits which is > + * already the default. > + * > + * With that in mind, before we make the VM lets grab the maximum > supported > + * IPA size and clamp it down to the first valid PARange value so we can > + * advertise the correct address size for the guest later on. Then if > it's > + * >= 40 set this as the IPA size for the VM using the new APIs. There's > a > + * small heuristic for actually altering the IPA size for the VM which is > + * if our requested RAM is encroaching on the top of our default IPA > size. > + * This is just an optimization, as at 40 bits we need to create one more > + * level of stage2 page tables. > + */ > +#if defined(MAC_OS_VERSION_13_0) && \ > + MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0 > + hv_vm_config_t config = hv_vm_config_create(); > + > + /* In our memory map RAM starts at 1GB. */ This is not board-specific code, so you can't assume that. The board gets to pick the memory map and where RAM starts in it. You probably need to do something similar to what we do in hw/arm/virt.c:virt_kvm_type() where we find out what the best IPA the hypervisor supports is, set the board memory map to respect that, diagnose an error if the user asked for more RAM than fits into that IPA range, and then arrange for the actual VM/vcpu creation to be done with the required IPA. This is unfortunately probably going to imply a bit of extra plumbing to be implemented for hvf -- that MachineClass::kvm_type method is (as the name suggests) KVM specific. (Multi-patch patchset for that, where we add the plumbing in as its own separate patch (and/or whatever other split of functionality into coherent chunks makes sense), rather than one-big-patch, please.) > + uint64_t threshold = (1ull << default_ipa_size) - (1 * GiB); > + if (ms->ram_size >= threshold && max_ipa_size >= FIRST_HIGHMEM_PARANGE) { > + ret = hv_vm_config_set_ipa_size(config, max_ipa_size); > + assert_hvf_ok(ret); > + > + chosen_ipa_bit_size = max_ipa_size; > + } > + > + ret = hv_vm_create(config); > + os_release(config); > +#else > + ret = hv_vm_create(NULL); > +#endif > + > + return ret; > +} > +uint8_t round_down_to_parange_index(uint8_t bit_size) > +{ > + for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) { > + if (pamax_map[i] <= bit_size) { > + return i; > + } > + } > + g_assert_not_reached(); > +} > + > +uint8_t round_down_to_parange_bit_size(uint8_t bit_size) > +{ > + for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) { > + if (pamax_map[i] <= bit_size) { > + return pamax_map[i]; > + } > + } > + g_assert_not_reached(); We could implement this as return pamax_map[round_down_to_parange_index(bit_size)]; and avoid having to code the loop twice, right? thanks -- PMM