RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Zhang, Lei Tue, 05 Feb 2019 04:49:43 -0800

Hi Catalin,

> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.mari...@arm.com]
> Sent: Wednesday, January 30, 2019 3:11 AM
> To: Zhang, Lei 
> Cc: 'linux-kernel@vger.kernel.org'; 'Mark Rutland';
> 'linux-arm-ker...@lists.infradead.org'; 'will.dea...@arm.com';
> 'james.mo...@arm.com'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
> 
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.


Thank you for your suggestion. I will send a single email containing the
whole description together with the patch.

> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?

With this erratum, any load/store instruction (both Armv8 and SVE),
except a non-fault access, might cause a spurious fault.

> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

Although the erratum occurs extremely rarely, this path is required
to handle the issue pointed out by James and Mark in:
  https://lkml.org/lkml/2019/1/22/533,
  https://lkml.org/lkml/2019/1/22/642.

As James and Mark pointed out, if the erratum occurs at EL1/EL2 before
the system registers ELR and SPSR are backed up, these registers will
be overwritten and we will lose that information.

So we keep TCR_ELx.NFD1=0 while running at EL1/EL2.
Please see the supplemental explanation at the end of this mail.

> The problem is that this bit may be cached in the TLB (I haven't checked
> the ARM ARM but that's usually the case with the TCR_ELx bits). If
> that's the case, you can't guarantee a change unless you also perform
> a TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
> NFD bits in the TLB, we could apply the workaround but I'd rather have
> the spurious trap if it's not too often.

On the A64FX microarchitecture, it is not necessary to perform a
TLBI VMALL to guarantee that a change of TCR_ELx.{NFD0,NFD1} takes effect.

> Could speculative loads also trigger this? Another option would be to
> toggle it during kernel_neon_begin/end (with the caveat of TLBI as
> mentioned above).

No, a speculative load does not trigger this erratum. 

Here is the supplemental explanation:

Since this erratum occurs only when TCR_ELx.NFD1=1,
we keep TCR_ELx.NFD1=0 while running at EL1/EL2.
By doing so, the erratum can occur only at EL0, where the
spurious trap can be handled by the fault handler.

To keep TCR_ELx.NFD1=0 at EL1/EL2, two critical sections must be
handled for the implementation to be complete.
One is the transition from EL0 to EL1/EL2 and the other
is the transition from EL1/EL2 to EL0.

For the former case, I set TCR_ELx.NFD1=0 in the tramp_map_kernel code.
No load/store instruction is executed at EL1/EL2 before
TCR_ELx.NFD1 is set to 0, so the undefined fault will not happen.

For the latter case, I set TCR_ELx.NFD1=1 in the tramp_unmap_kernel code.
No load/store instruction is executed at EL1/EL2 after
TCR_ELx.NFD1 is set to 1, so the undefined fault will not happen.
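
To illustrate the two transitions, here is a minimal user-space model in C.
It is only a sketch and is not the patch itself: the real change is done in
assembly inside tramp_map_kernel and tramp_unmap_kernel. The helper names
(tcr_for_kernel, tcr_for_user, erratum_possible) are hypothetical and exist
only for this illustration. The bit position of NFD1 (bit 54 of TCR_EL1) is
taken from the Arm architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCR_EL1.NFD1 is bit 54 in the Arm architecture. */
#define TCR_NFD1 (UINT64_C(1) << 54)

/*
 * Hypothetical model of the EL0 -> EL1/EL2 transition: NFD1 is cleared
 * before any load/store instruction runs at the higher exception level.
 */
static inline uint64_t tcr_for_kernel(uint64_t tcr)
{
	return tcr & ~TCR_NFD1;
}

/*
 * Hypothetical model of the EL1/EL2 -> EL0 transition: NFD1 is set again
 * as the last step, with no load/store instruction following it.
 */
static inline uint64_t tcr_for_user(uint64_t tcr)
{
	return tcr | TCR_NFD1;
}

/* As described above, the erratum can occur only while NFD1 is 1. */
static inline bool erratum_possible(uint64_t tcr)
{
	return (tcr & TCR_NFD1) != 0;
}
```

In this model, erratum_possible() is false for any value returned by
tcr_for_kernel(), which corresponds to the property we want at EL1/EL2.
All other TCR bits are left unchanged by both helpers.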

To handle the spurious fault at EL0,
I replace the fault handler for Data abort DFSC=0b111111 with
a new fault handler that ignores the spurious fault caused by the erratum.
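
The check made by such a handler can be sketched in C as below. This is an
illustration under assumptions, not the code in the patch: the function name
is_a64fx_spurious_fault and the macro names are hypothetical. The field
layout is from the Arm architecture: ESR_ELx.EC is bits [31:26], EC=0x24
means a Data Abort taken from a lower exception level, and DFSC is
bits [5:0].

```c
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      UINT64_C(0x3f)
#define ESR_EC_DABT_LOW  UINT64_C(0x24)  /* Data Abort from a lower EL */
#define ESR_DFSC_MASK    UINT64_C(0x3f)
#define DFSC_A64FX_SPUR  UINT64_C(0x3f)  /* DFSC=0b111111 */

/*
 * Hypothetical helper: return true when the syndrome describes a Data
 * Abort from EL0 with DFSC=0b111111, i.e. the fault that the new handler
 * ignores. Aborts taken from EL1/EL2 itself are never matched here.
 */
static inline bool is_a64fx_spurious_fault(uint64_t esr)
{
	uint64_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
	uint64_t dfsc = esr & ESR_DFSC_MASK;

	return ec == ESR_EC_DABT_LOW && dfsc == DFSC_A64FX_SPUR;
}
```

A handler built on this check would return without signalling the task, so
that the faulting instruction is executed again after the exception return.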

Thanks,
Zhang Lei
