Hi Catalin,

> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.mari...@arm.com]
> Sent: Wednesday, January 30, 2019 3:11 AM
> To: Zhang, Lei
> Cc: 'linux-kernel@vger.kernel.org'; 'Mark Rutland';
> 'linux-arm-ker...@lists.infradead.org'; 'will.dea...@arm.com';
> 'james.mo...@arm.com'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
>
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.

Thank you for your suggestion. I will send a single email with the whole
description.

> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?

With this erratum, any load/store instruction, whether Armv8 or SVE,
may take a spurious fault. The only exception is non-fault accesses.

> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

Although the erratum occurs extremely rarely, this path is required to
handle the issue pointed out by James and Mark in:
https://lkml.org/lkml/2019/1/22/533,
https://lkml.org/lkml/2019/1/22/642.
As James and Mark pointed out, if the erratum occurs at EL1/EL2 before
the system registers ELR and SPSR have been backed up, these registers
will be overwritten and we will lose that information.
So we keep TCR_ELx.NFD1=0 while running at EL1/EL2.
Please see the supplemental explanation at the end of this mail.

> The problem is that this bit may be cached in the TLB (I haven't checked
> the ARM ARM but that's usually the case with the TCR_ELx bits). If
> that's the case, you can't guarantee a change unless you also perform a
> TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
> NFD bits in the TLB, we could apply the workaround but I'd rather have
> the spurious trap if it's not too often.

On the A64FX microarchitecture, a TLBI VMALL is not necessary to
guarantee that a change to TCR_ELx.{NFD0,NFD1} takes effect.

> Could speculative loads also trigger this? Another option would be to
> toggle it during kernel_neon_begin/end (with the caveat of TLBI as
> mentioned above).

No, a speculative load does not trigger this erratum.

Here are the supplemental explanations:

Since this erratum occurs only when TCR_ELx.NFD1=1, we keep
TCR_ELx.NFD1=0 during EL1/EL2.
By doing so, the erratum can occur only at EL0, where the spurious
trap can be handled by the fault handler.

To keep TCR_ELx.NFD1=0 at EL1/EL2, there are two critical sections that
the implementation must get right. One is the transition from EL0 to
EL1/EL2 and the other is the transition from EL1/EL2 to EL0.

For the former case, I set TCR_ELx.NFD1=0 in tramp_map_kernel. There is
no load/store instruction at EL1/EL2 before TCR_ELx.NFD1 is set to 0,
so the undefined fault cannot occur.

For the latter case, I set TCR_ELx.NFD1=1 in tramp_unmap_kernel. There
is no load/store instruction at EL1/EL2 after TCR_ELx.NFD1 is set to 1,
so the undefined fault cannot occur.

To handle the spurious fault at EL0, I replace the fault handler for
Data abort DFSC=0b111111 with a new fault handler that ignores the
spurious fault caused by the erratum.

Thanks,
Zhang Lei
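
For reference, the scheme described above can be sketched as a small
user-space C model. This is only an illustration, not the code from the
patch: struct cpu_model and every function below are hypothetical names,
and the real change lives in the arm64 entry assembly and fault handling
code. The bit positions (TCR_EL1.NFD1 at bit 54, ESR_ELx.EC in bits
[31:26], DFSC in bits [5:0], EC 0x24 for a Data Abort from a lower EL)
are taken from the Arm architecture. The model assumes NFD1 is toggled
exactly at the tramp_map_kernel and tramp_unmap_kernel points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural field positions (Arm ARM). */
#define TCR_NFD1        (UINT64_C(1) << 54)  /* TCR_EL1.NFD1 */
#define ESR_EC_SHIFT    26                   /* ESR_ELx.EC, bits [31:26] */
#define ESR_EC_MASK     UINT32_C(0x3F)
#define ESR_EC_DABT_LOW UINT32_C(0x24)       /* Data Abort from a lower EL */
#define ESR_DFSC_MASK   UINT32_C(0x3F)       /* ESR_ELx.DFSC, bits [5:0] */
#define DFSC_ERRATUM    UINT32_C(0x3F)       /* DFSC=0b111111 */

/* Hypothetical model of the state that matters: TCR value and current EL. */
struct cpu_model {
	uint64_t tcr;
	int el;
};

/* Exception taken from EL0: EL changes first, NFD1 is still set. */
static void model_exception_entry(struct cpu_model *c) { c->el = 1; }

/* Stand-in for the NFD1 clear done at tramp_map_kernel. */
static void model_tramp_map_kernel(struct cpu_model *c) { c->tcr &= ~TCR_NFD1; }

/* Stand-in for the NFD1 set done at tramp_unmap_kernel. */
static void model_tramp_unmap_kernel(struct cpu_model *c) { c->tcr |= TCR_NFD1; }

/* Return to EL0. */
static void model_eret(struct cpu_model *c) { c->el = 0; }

/*
 * A load/store is acceptable either at EL0 (a spurious fault there is
 * ignored by the handler) or at EL1 while NFD1 is clear (no erratum).
 */
static bool model_load_store_allowed(const struct cpu_model *c)
{
	return c->el == 0 || (c->tcr & TCR_NFD1) == 0;
}

/* True for a Data Abort from EL0 reported with DFSC=0b111111. */
static bool is_erratum_fault(uint32_t esr)
{
	uint32_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
	uint32_t dfsc = esr & ESR_DFSC_MASK;

	return ec == ESR_EC_DABT_LOW && dfsc == DFSC_ERRATUM;
}

/* Walk one EL0 -> EL1 -> EL0 round trip and check both critical sections. */
static void model_round_trip(void)
{
	struct cpu_model c = { .tcr = TCR_NFD1, .el = 0 };

	model_exception_entry(&c);
	assert(!model_load_store_allowed(&c)); /* window: no memory access yet */
	model_tramp_map_kernel(&c);
	assert(model_load_store_allowed(&c));  /* rest of the kernel is safe */
	model_tramp_unmap_kernel(&c);
	assert(!model_load_store_allowed(&c)); /* window: no memory access until eret */
	model_eret(&c);
	assert(model_load_store_allowed(&c));  /* EL0: fault handler covers it */
}
```

Calling model_round_trip() exercises both windows: the only states in
which a load/store would be unsafe are the short stretches between
exception entry and the NFD1 clear, and between the NFD1 set and the
return to EL0, which is why those stretches must contain no load/store
instruction.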