On Wed, Jul 16, 2025 at 03:47:37PM +0800, BillXiang wrote:
> Consider a system with 8 harts, where each hart supports 5
> Guest Interrupt Files (GIFs), yielding 40 total GIFs.
> If we launch a QEMU guest with over 5 vCPUs using
> "-M virt,aia='aplic-imsic' -accel kvm,riscv-aia=hwaccel" – which
> relies solely on VS-files (not SW-files) for higher performance – the
> guest requires more than 5 GIFs. However, the current Linux scheduler
> lacks GIF awareness, potentially scheduling >5 vCPUs to a single hart.
> This triggers VS-file allocation failure, and since no handler exists
> for this error, the QEMU guest becomes corrupted.

What do you mean by "become corrupted"? Shouldn't the VM just stop after
the vcpu dumps register state?

> 
> To address this, we introduce a simple handler that reschedules the vCPU
> to an alternative hart when VS-file allocation fails on the current hart.
> 
> Signed-off-by: BillXiang <xiangwench...@lanxincomputing.com>
> ---
>  target/riscv/kvm/kvm-cpu.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/target/riscv/kvm/kvm-cpu.c b/target/riscv/kvm/kvm-cpu.c
> index 5c19062c19..7cf258604f 100644
> --- a/target/riscv/kvm/kvm-cpu.c
> +++ b/target/riscv/kvm/kvm-cpu.c
> @@ -1706,6 +1706,9 @@ static bool kvm_riscv_handle_debug(CPUState *cs)
>  int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
>  {
>      int ret = 0;
> +    uint64_t code;
> +    cpu_set_t set;
> +    long cpus;
>      switch (run->exit_reason) {
>      case KVM_EXIT_RISCV_SBI:
>          ret = kvm_riscv_handle_sbi(cs, run);
> @@ -1718,6 +1721,18 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
>              ret = EXCP_DEBUG;
>          }
>          break;
> +    case KVM_EXIT_FAIL_ENTRY:
> +        code = run->fail_entry.hardware_entry_failure_reason;
> +        if (code == CSR_HSTATUS) {
> +            /* Schedule vcpu to next hart upon VS-file
> +             * allocation failure on current hart. */
> +            cpus = sysconf(_SC_NPROCESSORS_ONLN);
> +            CPU_ZERO(&set);
> +            CPU_SET((run->fail_entry.cpu+1)%cpus, &set);
> +            ret = sched_setaffinity(0, sizeof(set), &set);

If other guests have already consumed all the VS-files on the selected
hart, then this will fail again and the next hart will be tried; and if
all VS-files in the system are already consumed, then we'll just go
around and around.
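
At a minimum the retry would need to be bounded so it fails instead of
spinning forever. An untested sketch of what I mean (the counter would
really need to be per-vcpu state, not a helper argument):

  /* Untested sketch only: bound the retry so we fail instead of looping
   * forever.  'tried' would have to live in per-vcpu state in a real
   * patch rather than being passed in by the caller. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <unistd.h>

  static int move_vcpu_to_next_hart(int cur_hart, int *tried)
  {
      long cpus = sysconf(_SC_NPROCESSORS_ONLN);
      cpu_set_t set;

      if (++(*tried) >= cpus) {
          return -1;              /* every hart attempted once: give up */
      }
      CPU_ZERO(&set);
      CPU_SET((cur_hart + 1) % cpus, &set);
      return sched_setaffinity(0, sizeof(set), &set);
  }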

Other than that problem, this isn't the right approach because QEMU should
not be pinning vcpus - that's a higher-level virt management layer's job,
since it's a policy decision.

A better solution is to teach KVM to track free VS-files and then migrate
(but not pin) vcpus to harts with free VS-files, rather than failing
immediately.
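
To be clear, none of the names in the following exist in KVM today; it's
just an untested sketch of the "track free VS-files, migrate on demand"
idea:

  /* Purely illustrative -- nothing like this exists in KVM today.  Each
   * hart advertises how many IMSIC guest files (VS-files) it has and how
   * many are in use; when a vcpu can't get one on its current hart, look
   * for another hart with a spare file and migrate (not pin) the vcpu
   * there, failing only when the whole system is exhausted. */
  #define NR_HARTS 64             /* placeholder for the platform hart count */

  struct hart_vsfile_state {
      int total_vsfiles;          /* VS-files this hart's IMSIC provides */
      int used_vsfiles;           /* VS-files currently handed out to vcpus */
  };

  static struct hart_vsfile_state vsfiles[NR_HARTS];

  static int find_hart_with_free_vsfile(int cur_hart)
  {
      for (int i = 1; i < NR_HARTS; i++) {
          int hart = (cur_hart + i) % NR_HARTS;

          if (vsfiles[hart].used_vsfiles < vsfiles[hart].total_vsfiles) {
              return hart;        /* candidate target for vcpu migration */
          }
      }
      return -1;                  /* every VS-file in the system is taken */
  }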

But, if all guests are configured to use only VS-files, then upper layers
of the virt stack will still need to be aware that they can never schedule
more vcpus than the total number of VS-files allows. And, if upper layers
are already involved in the scheduling, then pinning is also an option to
avoid this problem. Indeed, pinning is better for the over-scheduling
failure case: over-scheduling with the KVM vcpu-migration approach can
result in a VM launched earlier being killed, whereas with the upper-layer
pinning approach, the last guest launched will fail before it runs.
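
For comparison, the upper-layer pinning policy isn't much code either.
Roughly (untested; the vcpu thread ids would come from e.g. QMP's
query-cpus-fast, and I'm conservatively assuming one vcpu per hart with a
free VS-file):

  /* Rough sketch of the management-layer policy, not production code:
   * refuse a guest up front if there aren't enough harts with a free
   * VS-file, otherwise pin each vcpu thread to its own dedicated hart. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <sys/types.h>

  static int pin_guest_vcpus(const pid_t *vcpu_tids, int nr_vcpus,
                             const int *free_harts, int nr_free_harts)
  {
      if (nr_vcpus > nr_free_harts) {
          fprintf(stderr, "not enough free VS-files, refusing guest\n");
          return -1;              /* the last guest launched fails before it runs */
      }

      for (int i = 0; i < nr_vcpus; i++) {
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(free_harts[i], &set);
          if (sched_setaffinity(vcpu_tids[i], sizeof(set), &set) < 0) {
              return -1;
          }
      }
      return 0;
  }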

Thanks,
drew

> +            break;
> +        }
> +        /* FALLTHRU */
>      default:
>          qemu_log_mask(LOG_UNIMP, "%s: un-handled exit reason %d\n",
>                        __func__, run->exit_reason);
> -- 
> 2.46.2.windows.1
> 
