Hi drew,

Thanks for your reply.

I agree this patch is not sufficient on its own. That said, Guest Interrupt
Files are a new and critical resource type that upper layers of the virt
stack are currently unaware of, and significant work is still needed to
integrate them properly.
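
To make that concrete, here is a rough sketch of the upper-layer pinning you
suggest below. It is purely illustrative and not part of this patch: it assumes
the management layer has already collected the guest's vCPU thread IDs (e.g.
via QMP "query-cpus-fast") and simply round-robins them over the online harts;
a real manager would also have to account for the per-hart VS-file count.

/*
 * Illustrative only: pin each vCPU thread of a running guest to its
 * own hart, round-robin over the online harts.  The thread IDs are
 * assumed to come from QMP "query-cpus-fast" and are passed on the
 * command line; this is management-layer policy, not QEMU code.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long harts = sysconf(_SC_NPROCESSORS_ONLN);

    for (int i = 1; i < argc; i++) {
        pid_t tid = (pid_t)atol(argv[i]);   /* vCPU thread ID */
        int hart = (i - 1) % harts;         /* naive round-robin placement */
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(hart, &set);
        if (sched_setaffinity(tid, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned vCPU thread %ld to hart %d\n", (long)tid, hart);
    }
    return 0;
}

As you note, this kind of static placement also gives the cleaner failure mode:
the guest that would exceed the VS-file budget fails at launch, instead of a
previously launched VM being affected later.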

> From: "Andrew Jones"<ajo...@ventanamicro.com>
> Date:  Thu, Jul 17, 2025, 16:02
> Subject:  Re: [PATCH] target/riscv/kvm: Introduce simple handler for VS-file allocation failure
> To: "BillXiang"<xiangwench...@lanxincomputing.com>
> Cc: <pal...@dabbelt.com>, <alistair.fran...@wdc.com>, <liwei1...@gmail.com>, 
> <dbarb...@ventanamicro.com>, <zhiwei_...@linux.alibaba.com>, 
> <qemu-ri...@nongnu.org>, <qemu-devel@nongnu.org>
> On Wed, Jul 16, 2025 at 03:47:37PM +0800, BillXiang wrote:
> > Consider a system with 8 harts, where each hart supports 5
> > Guest Interrupt Files (GIFs), yielding 40 total GIFs.
> > If we launch a QEMU guest with over 5 vCPUs using
> > "-M virt,aia='aplic-imsic' -accel kvm,riscv-aia=hwaccel" – which
> > relies solely on VS-files (not SW-files) for higher performance – the
> > guest requires more than 5 GIFs. However, the current Linux scheduler
> > lacks GIF awareness, potentially scheduling >5 vCPUs to a single hart.
> > This triggers VS-file allocation failure, and since no handler exists
> > for this error, the QEMU guest becomes corrupted.
> 
> What do you mean by "become corrupted"? Shouldn't the VM just stop after
> the vcpu dumps register state?
> 
> > 
> > To address this, we introduce this simple handler by rescheduling vCPU
> > to alternative harts when VS-file allocation fails on the current hart.
> > 
> > Signed-off-by: BillXiang <xiangwench...@lanxincomputing.com>
> > ---
> >  target/riscv/kvm/kvm-cpu.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/target/riscv/kvm/kvm-cpu.c b/target/riscv/kvm/kvm-cpu.c
> > index 5c19062c19..7cf258604f 100644
> > --- a/target/riscv/kvm/kvm-cpu.c
> > +++ b/target/riscv/kvm/kvm-cpu.c
> > @@ -1706,6 +1706,9 @@ static bool kvm_riscv_handle_debug(CPUState *cs)
> >  int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
> >  {
> >      int ret = 0;
> > +    uint64_t code;
> > +    cpu_set_t set;
> > +    long cpus;
> >      switch (run->exit_reason) {
> >      case KVM_EXIT_RISCV_SBI:
> >          ret = kvm_riscv_handle_sbi(cs, run);
> > @@ -1718,6 +1721,18 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
> >              ret = EXCP_DEBUG;
> >          }
> >          break;
> > +    case KVM_EXIT_FAIL_ENTRY:
> > +        code = run->fail_entry.hardware_entry_failure_reason;
> > +        if (code == CSR_HSTATUS) {
> > +            // Schedule vcpu to next hart upon VS-file 
> > +            // allocation failure on current hart.
> > +            cpus = sysconf(_SC_NPROCESSORS_ONLN);
> > +            CPU_ZERO(&set);
> > +            CPU_SET((run->fail_entry.cpu+1)%cpus, &set);
> > +            ret = sched_setaffinity(0, sizeof(set), &set);
> 
> If other guests have already consumed all the VS-files on the selected
> hart then this will fail again and the next hart will be tried and if all
> VS-files of the system are already consumed then we'll just go around and
> around.
> 
> Other than that problem, this isn't the right approach because QEMU should
> not be pinning vcpus - that's a higher level virt management layer's job
> since it's a policy.
> 
> A better solution to this is to teach KVM to track free VS-files and then
> migrate (but not pin) vcpus to harts with free VS-files, rather than
> immediately fail.
> 
> But, if all guests are configured to only use VS-files, then upper layers
> of the virt stack will still need to be aware that they can never schedule
> more vcpus than supported by the number of total VS-files. And, if upper
> layers are already involved in the scheduling, then pinning is also an
> option to avoid this problem. Indeed pinning is better for the failure
> case of over scheduling, since over scheduling with the KVM vcpu migration
> approach can result in a VM launched earlier to be killed, whereas with
> the upper layer pinning approach, the last guest launched will fail before
> it runs.
> 
> Thanks,
> drew
> 
> > +            break;
> > +        }
> > +        /* FALLTHRU */
> >      default:
> >          qemu_log_mask(LOG_UNIMP, "%s: un-handled exit reason %d\n",
> >                        __func__, run->exit_reason);
> > -- 
> > 2.46.2.windows.1
> >
> 
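
Regarding the KVM-side idea you outline above (track free VS-files and migrate
vCPUs to harts that still have one), below is a rough userspace model of that
placement policy. It is entirely hypothetical, not actual KVM code, and the
numbers are just the 8x5 example from the commit message.

/*
 * Toy model of the placement policy sketched above: keep a free
 * VS-file count per hart and pick a hart that still has one, instead
 * of failing on the preferred hart.  Hypothetical userspace sketch,
 * not KVM code; NR_HARTS and VSFILES_PER_HART are made-up inputs.
 */
#include <stdio.h>

#define NR_HARTS          8
#define VSFILES_PER_HART  5

static int free_vsfiles[NR_HARTS];

/* Returns the chosen hart, or -1 once every VS-file is in use. */
static int place_vcpu(int preferred_hart)
{
    for (int i = 0; i < NR_HARTS; i++) {
        int hart = (preferred_hart + i) % NR_HARTS;

        if (free_vsfiles[hart] > 0) {
            free_vsfiles[hart]--;
            return hart;
        }
    }
    return -1; /* over-committed: more vCPUs than VS-files system-wide */
}

int main(void)
{
    for (int i = 0; i < NR_HARTS; i++) {
        free_vsfiles[i] = VSFILES_PER_HART;
    }

    /* 42 vCPUs on an 8 * 5 = 40 VS-file system: the last two cannot be placed. */
    for (int vcpu = 0; vcpu < 42; vcpu++) {
        int hart = place_vcpu(vcpu % NR_HARTS);

        if (hart < 0) {
            printf("vCPU %d: no free VS-file on any hart\n", vcpu);
        } else {
            printf("vCPU %d -> hart %d\n", vcpu, hart);
        }
    }
    return 0;
}

Bounding the search to a single pass over the harts at least avoids the
go-around-and-around case you point out; once that pass fails, the error has
to be surfaced to the upper layer rather than retried.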
