On 6/3/25 06:47, Xiaoyao Li wrote:
> On 6/3/2025 3:41 PM, Xiaoyao Li wrote:
>> On 3/29/2025 4:30 AM, Tom Lendacky wrote:
>>> A page state change is typically followed by an access of the page(s) and
>>> results in another VMEXIT in order to map the page into the nested page
>>> table. Depending on the size of page state change request, this can
>>> generate a number of additional VMEXITs. For example, under SNP, when
>>> Linux is utilizing lazy memory acceptance, memory is typically accepted in
>>> 4M chunks. A page state change request is submitted to mark the pages as
>>> private, followed by validation of the memory. Since the guest_memfd
>>> currently only supports 4K pages, each page validation will result in
>>> VMEXIT to map the page, resulting in 1024 additional exits.
>>>
>>> When performing a page state change, invoke KVM_PRE_FAULT_MEMORY for the
>>> size of the page state change in order to pre-map the pages and avoid the
>>> additional VMEXITs. This helps speed up boot times.
>>
>> Unfortunately, it breaks TDX guest.
>>
>>   kvm_hc_map_gpa_range gpa 0x80000000 size 0x200000 attributes 0x0
>> flags 0x1
>>
>> For TDX guest, it uses MAPGPA to maps the range [0x8000 0000,
>> +0x0x200000] to shared. The call of KVM_PRE_FAULT_MEMORY on such range
>> leads to the TD being marked as bugged
>>
>> [353467.266761] WARNING: CPU: 109 PID: 295970 at arch/x86/kvm/mmu/
>> tdp_mmu.c:674 tdp_mmu_map_handle_target_level+0x301/0x460 [kvm]
>
> It turns out to be a KVM bug.
>
> The gpa passed in in KVM_PRE_FAULT_MEMORY, i.e., range->gpa has no
> indication for share vs. private. KVM directly passes range->gpa to
> kvm_tdp_map_page() in kvm_arch_vcpu_pre_fault_memory(), which is then
> assigned to fault.addr
>
> However, fault.addr is supposed to be a gpa of real access in TDX guest,
> which means it needs to have shared bit set if the map is for shared
> access, for TDX case. tdp_mmu_get_root_for_fault() will use it to
> determine which root to be used.
>
> For this case, the pre fault is on the shared memory, while the fault.addr
> leads to mirror_root which is for private memory. Thus it triggers
> KVM_BUG_ON().

Is this something that can be fixed in KVM (determine if the range is
private or shared) or does the call to KVM_PRE_FAULT_MEMORY require
modification in some way that works for both TDX and SNP?

Thanks,
Tom

>
>
>> [353472.621399] WARNING: CPU: 109 PID: 295970 at arch/x86/kvm/../../../
>> virt/kvm/kvm_main.c:4281 kvm_vcpu_pre_fault_memory+0x167/0x1a0 [kvm]
>>
>>
>> It seems the pre map on the non MR back'ed range has issue. But I'm
>> still debugging it to understand the root cause.
>>
>>
>
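
For reference, the root selection described in the quoted analysis can be
sketched in standalone C. This is only an illustration, not the KVM code:
every sketch_* name is hypothetical, the shared-bit position (GPA bit 51) is
an assumption, and sketch_fault_addr() shows just one possible shape of a
KVM-side fix, where the pre-fault path derives the shared bit from whether
the range is private.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: the shared bit is GPA bit 51. */
#define SKETCH_SHARED_BIT (1ULL << 51)

enum sketch_root { SKETCH_DIRECT_ROOT, SKETCH_MIRROR_ROOT };

/*
 * direct_bits == 0 models a VM type with no shared bit in the GPA
 * (every address is "direct"). Otherwise an address is direct only
 * if it carries the shared bit.
 */
static bool sketch_is_addr_direct(uint64_t direct_bits, uint64_t gpa)
{
	return !direct_bits || (gpa & direct_bits);
}

/* Addresses without the shared bit select the mirror (private) root. */
static enum sketch_root sketch_root_for_fault(uint64_t direct_bits,
					      uint64_t fault_addr)
{
	return sketch_is_addr_direct(direct_bits, fault_addr) ?
	       SKETCH_DIRECT_ROOT : SKETCH_MIRROR_ROOT;
}

/*
 * Hypothetical fix-up: build the fault address from the raw gpa and
 * the private/shared state of the range, setting the shared bit for
 * shared ranges.
 */
static uint64_t sketch_fault_addr(uint64_t direct_bits, uint64_t gpa,
				  bool is_private)
{
	/* The raw gpa is expected to arrive without the shared bit. */
	assert((gpa & direct_bits) == 0);
	return is_private ? gpa : (gpa | direct_bits);
}
```

With these definitions, passing the raw gpa 0x80000000 straight through
selects the mirror root even though the range is shared, which is the
mismatch described above. Setting the shared bit first selects the direct
root, and a VM with no shared bit (direct_bits == 0) is unaffected either
way.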