On Thu, Aug 08, 2024 at 08:10:34AM -0400, Michael S. Tsirkin wrote:
> On Thu, Aug 08, 2024 at 10:51:41AM +0300, Kirill A. Shutemov wrote:
> > Hongyu reported a hang on kexec in a VM. QEMU reported invalid memory
> > accesses during the hang.
> > 
> >     Invalid read at addr 0x102877002, size 2, region '(null)', reason: 
> > rejected
> >     Invalid write at addr 0x102877A44, size 2, region '(null)', reason: 
> > rejected
> >     ...
> > 
> > It was traced down to virtio-console. Kexec works fine if virtio-console
> > is not in use.
> 
> virtio is not doing a lot of 16 bit reads.
> Are these the reads:
> 
>                 virtio_cread(vdev, struct virtio_console_config, cols, &cols);
>                 virtio_cread(vdev, struct virtio_console_config, rows, &rows);
> 
> ?
> 
> write is a bit puzzling too. This one?
> 
> bool vp_notify(struct virtqueue *vq)
> {       
>         /* we write the queue's selector into the notification register to
>          * signal the other end */
>         iowrite16(vq->index, (void __iomem *)vq->priv);
>         return true;
> }

Given that we are talking about console issue, any suggestion on how to
check?

> > 
> > Looks like virtio-console continues to write to the MMIO even after
> > underlying virtio-pci device is removed.
> 
> You mention both MMIO and pci, I am confused.

By MMIO, I mean accesses to PCI BARs. But it is only my *guess* on the
situation, I have limited knowledge of the area. I am not drivers guy.

> Removed by what? In what sense?

So device_shutdown() iterates over all device and we hit the problem when
we get to virtio-pci devices and call pci_device_shutdown on them.

I *think* PCI BAR (or something else?) becomes unavailable after that but
it is still accessed.

> > 
> > The problem can be mitigated by removing all virtio devices on virtio
> > bus shutdown.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > Reported-by: Hongyu Ning <hongyu.n...@linux.intel.com>
> 
> A bit worried about doing so much activity on shutdown,
> and for all devices, too. I'd like to understand what
> is going on a bit better - could be a symptom of
> a bigger problem (e.g. missing handling for suprise
> removal?).

I probably should have marked the patch as RFC. The patch was intended to
start conversation. I am not sure it is correct. This patch just happened
to work in our setup.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Reply via email to