On Wed, Jul 16, 2025 at 03:47:37PM +0800, BillXiang wrote:
> Consider a system with 8 harts, where each hart supports 5
> Guest Interrupt Files (GIFs), yielding 40 total GIFs.
> If we launch a QEMU guest with over 5 vCPUs using
> "-M virt,aia='aplic-imsic' -accel kvm,riscv-aia=hwaccel" – which
> relies solely on VS-files (not SW-files) for higher performance – the
> guest requires more than 5 GIFs. However, the current Linux scheduler
> lacks GIF awareness, potentially scheduling >5 vCPUs to a single hart.
> This triggers VS-file allocation failure, and since no handler exists
> for this error, the QEMU guest becomes corrupted.
What do you mean by "become corrupted"? Shouldn't the VM just stop
after the vcpu dumps register state?

>
> To address this, we introduce this simple handler by rescheduling vCPU
> to alternative harts when VS-file allocation fails on the current hart.
>
> Signed-off-by: BillXiang <xiangwench...@lanxincomputing.com>
> ---
>  target/riscv/kvm/kvm-cpu.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/target/riscv/kvm/kvm-cpu.c b/target/riscv/kvm/kvm-cpu.c
> index 5c19062c19..7cf258604f 100644
> --- a/target/riscv/kvm/kvm-cpu.c
> +++ b/target/riscv/kvm/kvm-cpu.c
> @@ -1706,6 +1706,9 @@ static bool kvm_riscv_handle_debug(CPUState *cs)
>  int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
>  {
>      int ret = 0;
> +    uint64_t code;
> +    cpu_set_t set;
> +    long cpus;
>      switch (run->exit_reason) {
>      case KVM_EXIT_RISCV_SBI:
>          ret = kvm_riscv_handle_sbi(cs, run);
> @@ -1718,6 +1721,18 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
>              ret = EXCP_DEBUG;
>          }
>          break;
> +    case KVM_EXIT_FAIL_ENTRY:
> +        code = run->fail_entry.hardware_entry_failure_reason;
> +        if (code == CSR_HSTATUS) {
> +            // Schedule vcpu to next hart upon VS-file
> +            // allocation failure on current hart.
> +            cpus = sysconf(_SC_NPROCESSORS_ONLN);
> +            CPU_ZERO(&set);
> +            CPU_SET((run->fail_entry.cpu+1)%cpus, &set);
> +            ret = sched_setaffinity(0, sizeof(set), &set);

If other guests have already consumed all the VS-files on the selected
hart, then this will fail again and the next hart will be tried; and
if all the VS-files in the system are already consumed, we'll just go
around and around.

Other than that problem, this isn't the right approach because QEMU
should not be pinning vcpus - that's a higher-level virt management
layer's job, since it's policy.

A better solution is to teach KVM to track free VS-files and then
migrate (but not pin) vcpus to harts with free VS-files, rather than
immediately fail. But, if all guests are configured to only use
VS-files, then upper layers of the virt stack will still need to be
aware that they can never schedule more vcpus than the total number
of VS-files supports. And, if upper layers are already involved in
the scheduling, then pinning is also an option to avoid this problem.
Indeed, pinning is better for the over-scheduling failure case, since
over-scheduling with the KVM vcpu-migration approach can result in a
VM launched earlier being killed, whereas with the upper-layer pinning
approach the last guest launched will fail before it runs.

Thanks,
drew

> +            break;
> +        }
> +        /* FALLTHRU */
>      default:
>          qemu_log_mask(LOG_UNIMP, "%s: un-handled exit reason %d\n",
>                        __func__, run->exit_reason);
> --
> 2.46.2.windows.1
>
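
As an illustration of the wrap-around concern above, a minimal,
standalone sketch follows. It is not QEMU code and not part of the
patch: the helper name try_next_cpu and the caller-maintained
per-vcpu retry counter are assumptions made for this example. All it
shows is that bounding the "try the next hart" retry by the number of
online CPUs makes the affinity walk terminate instead of cycling
forever once every hart's VS-files are exhausted.

/*
 * Hypothetical standalone sketch (not QEMU code): bound the "next
 * hart" retry so that once every online CPU has been tried the
 * caller can fail the vcpu cleanly instead of looping forever.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/*
 * failed_cpu: hart on which VS-file allocation just failed
 * retries:    caller-maintained per-vcpu count of harts tried so far
 *
 * Returns 0 if the calling thread's affinity was moved to the next
 * CPU, or -1 once all online CPUs have been tried (or on error), so
 * the caller knows to stop retrying.
 */
static int try_next_cpu(int failed_cpu, unsigned int *retries)
{
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t set;

    if (cpus <= 0 || *retries >= (unsigned int)cpus) {
        return -1;  /* every hart already tried: give up, don't spin */
    }
    (*retries)++;

    CPU_ZERO(&set);
    CPU_SET((failed_cpu + 1) % cpus, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

Even with such a bound, the handler would only turn an endless cycle
into a clean failure; it does not address the larger objection above
that vcpu pinning is policy belonging to a higher-level management
layer, or the alternative of having KVM itself track free VS-files and
migrate vcpus to harts that still have one available.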