Hi Nadav,
On 3/18/21 2:12 AM, Nadav Amit wrote:
On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) <longpe...@huawei.com> wrote:
Hi Nadav,
-----Original Message-----
From: Nadav Amit [mailto:nadav.a...@gmail.com]
reproduce the problem with high probability (~50%).
I saw Lu replied, and he is much more knowledgeable than I am (I was just intrigued by your email).
However, if I were you I would also try to remove some "optimizations" to look for the root cause (e.g., use domain-specific invalidations instead of page-specific ones).
Good suggestion! We already tried that in the past few days: we used domain-selective invalidations as follows:
    iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
but it did not resolve the problem.
The first thing that comes to my mind is the invalidation hint (ih) in
iommu_flush_iotlb_psi(). I would remove it to see whether you get the failure
without it.
We also noticed the IH, but the IH is always ZERO in our case, as the spec says:
'''
Paging-structure-cache entries caching second-level mappings associated with the specified domain-id and the second-level-input-address range are invalidated, if the Invalidation Hint (IH) field is Clear.
'''
It seems the software side is fine, so we have no choice but to suspect the hardware.
Ok, I am pretty much out of ideas. I have two more suggestions, but
they are much less likely to help. Yet, they can further help to rule
out software bugs:
1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE()
to prevent a split write, which might potentially cause an "invalid" (partially
cleared) PTE to be stored in the TLB. Having said that, the subsequent
IOTLB flush should have prevented the problem.
Agreed. The pte read/write should use READ/WRITE_ONCE() instead.
2. Consider ensuring that the problem is not somehow related to queued
invalidations. Try to use __iommu_flush_iotlb() instead of
qi_flush_iotlb().
Regards,
Nadav
Best regards,
baolu