at 1:51 PM, Klaus Jensen <i...@irrelevant.dk> wrote:

> On Jul 6 19:34, Jinhao Fan wrote:
>> at 2:43 AM, Keith Busch <kbu...@kernel.org> wrote:
>>
>>> On Tue, Jul 05, 2022 at 07:11:36PM +0200, Klaus Jensen wrote:
>>>> On Jul 5 22:24, Jinhao Fan wrote:
>>>>> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>>>>>
>>>>>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>>>>>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>>>>> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>>>> +
>>>>> +    if (req->sq->ioeventfd_enabled) {
>>>>> +        /* Post CQE directly since we are in main loop thread */
>>>>> +        nvme_post_cqes(cq);
>>>>> +    } else {
>>>>> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
>>>>> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>>>> +    }
>>>>
>>>> Actually, we are only in the vcpu thread if we come here from
>>>> nvme_process_db, which in very rare circumstances may enqueue the
>>>> completion of an AER due to an invalid doorbell write.
>>>>
>>>> In general, nvme_enqueue_req_completion is only ever called from the
>>>> main iothread. Which actually causes me to wonder why we defer this
>>>> work in the first place. It does have the benefit that we queue up
>>>> several completions before posting them in one go and raising the
>>>> interrupt. But I wonder if that could be handled better.
>>>
>>> I think the timer is used because of the cq_full condition. We need to
>>> restart completions when it becomes not full, which requires a doorbell
>>> write. Having everyone from the main iothread use the same timer as the
>>> doorbell handler just ensures single-threaded list access.
>>
>> Could we let nvme_process_aers register another timer/BH to trigger
>> nvme_enqueue_req_completion in the iothread? In this way we won't need
>> the timer_mod in nvme_enqueue_req_completion.
>
> Yes, we could have process_aers in a timer. Which would probably be
> preferable in order to limit the amount of work the mmio handler is
> doing in that rare case. However, it's such a rare case (only
> misbehaving drivers) that it's probably not worth optimizing for.
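To make the timer idea concrete, here is a rough sketch of deferring
nvme_process_aers to its own timer so that the AER completion is always
enqueued from the main loop. This is untested, and the aer_timer field
and callback are hypothetical names, just for illustration:

    /* Sketch only: aer_timer is an assumed new NvmeCtrl field. */
    static void nvme_aer_timer_cb(void *opaque)
    {
        NvmeCtrl *n = opaque;

        /* Runs in the main loop (under the BQL), like cq->timer callbacks */
        nvme_process_aers(n);
    }

    /* at controller init */
    n->aer_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_aer_timer_cb, n);

    /* in the invalid-doorbell-write path, instead of calling
     * nvme_process_aers(n) directly from the vcpu thread: */
    timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);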
I think putting nvme_process_aers in a timer can help make sure
nvme_enqueue_req_completion is only called in an iothread context.
Currently it can be called either in the iothread or in a vcpu thread,
so nvme_enqueue_req_completion has to infer its context; in my patch it
uses the cq->ioeventfd_enabled flag for that. This approach is probably
error-prone. Honestly, I am not sure whether cq->ioeventfd_enabled is
enough to guarantee that we are in the iothread.

>> We can also avoid some potential concurrency problems because the CQ
>> is only modified in the iothread.
>
> There are currently no concurrency problems because of the Big QEMU
> Lock. When the mmio handler is running, the vcpu holds the BQL (and
> whenever the main iothread is running, it is holding the BQL).

Will this still hold when we move on to iothreads?

>> BTW, is there any reason that we must use timers (not BHs) here? Also,
>> why do we choose to delay for 500ns?
>
> No particular reason. I do not see any reason why this could not be
> bottom halves. This will likely change into BHs when we add iothread
> support anyway.

I will try BH when I start the iothread work.
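For reference, a minimal sketch of what the BH variant might look like,
assuming a new cq->bh field that replaces cq->timer (untested; this
works because nvme_post_cqes already takes a void *opaque, so it fits
the BH callback signature as well as the timer one):

    /* at CQ creation, instead of
     * cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq); */
    cq->bh = qemu_bh_new(nvme_post_cqes, cq);

    /* wherever we currently do
     * timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
     * schedule the bottom half instead; it runs in the main loop at the
     * end of the current iteration, with no arbitrary 500ns delay: */
    qemu_bh_schedule(cq->bh);

    /* on CQ deletion, instead of timer_free(cq->timer): */
    qemu_bh_delete(cq->bh);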