Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-19 Thread James Morse
Hi Hawa, On 17/06/2019 14:00, Hawa, Hanna wrote: >> I don't think it can, on a second reading, it looks to be even more >> complicated than I >> thought! That bit is described as disabling forwarding of uncorrected data, >> but it looks >> like the uncorrected data never actually reaches the oth

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-17 Thread Hawa, Hanna
+static void al_a57_edac_l2merrsr(void *arg) +{ +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); How do we know this is corrected? If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is this what you are depending on here? No - not on this. Rep

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-14 Thread Borislav Petkov
Reply part 2. On Thu, Jun 13, 2019 at 09:54:18AM +1000, Benjamin Herrenschmidt wrote: > Why ? Because one or two historical drivers mix MC and PCI then "it > makes sense" to do that for everybody ? Because it was like that. And now all of a sudden ARM wants something different. So we must at leas

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-14 Thread James Morse
Hi Hawa, On 13/06/2019 18:05, James Morse wrote: > On 11/06/2019 20:56, Hawa, Hanna wrote: >> James Morse wrote: >>> Hawa, Hanna wrote: +    if (cluster != last_cluster) { +    smp_call_function_single(cpu, al_a57_edac_l2merrsr, + edac_dev, 0); >>

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread James Morse
Hi Hawa, On 11/06/2019 20:56, Hawa, Hanna wrote: > James Morse wrote: >> Hawa, Hanna wrote: >>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error"); >> >> How do we know this was corrected? >> >> 6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." >> in a >> paragraph

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 13:00 +0200, Borislav Petkov wrote: > On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote: > > That's said, from the admin PoV, it makes sense to have a single > > daemon that collect errors from all error sources and take the > > needed actions. > > Doing r

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 12:42 +0200, Borislav Petkov wrote: > On Wed, Jun 12, 2019 at 06:29:26PM +1000, Benjamin Herrenschmidt wrote: > > I tend to disagree here. We've been down that rabbit hole in the past > > and we (Linux in general) are trying to move away from that sort of > > "platform" overar

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 14:25 +0200, Borislav Petkov wrote: > > But for the main case that really needs to be in the kernel, which is > > DRAM, the recovery can usually be contained to the MC driver anyway. > > Right, if that is enough to handle the error properly. > > The memory-failure.c example

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Borislav Petkov
On Thu, Jun 13, 2019 at 09:54:18AM +1000, Benjamin Herrenschmidt wrote: > It tends to be a slippery slope. Also in the ARM world, most SoC tend > to re-use IP blocks, so you get a lot of code duplication, bug fixed in > one and not the other etc... Yes, I'd like to be able to reuse EDAC drivers if

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Borislav Petkov
On Wed, Jun 12, 2019 at 03:35:31PM +0300, Hawa, Hanna wrote: > We have daemon script that collects correctable/uncorrectable errors from > EDAC sysfs and reports to Amazon service that allow us to take action on > specific error thresholds. What does "take action" mean, more specifically? Is the s

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Hawa, Hanna
Hi Boris, Yap, I think we're in agreement here. I believe the important question is whether you need to get error information from multiple sources together in order to do proper recovery or doing it per error source suffices. And I think the actual use cases could/should dictate our drivers/o

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Borislav Petkov
On Wed, Jun 12, 2019 at 09:57:40PM +1000, Benjamin Herrenschmidt wrote: > On Wed, 2019-06-12 at 08:42 -0300, Mauro Carvalho Chehab wrote: > > > Yes, we do have different error reporting facilities but I still > > > think > > > that concentrating all the error information needed in order to do > > >

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 08:42 -0300, Mauro Carvalho Chehab wrote: > > Yes, we do have different error reporting facilities but I still > > think > > that concentrating all the error information needed in order to do > > proper recovery action is the better approach here. And make that > > part > > of

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Mauro Carvalho Chehab
Em Wed, 12 Jun 2019 13:00:39 +0200 Borislav Petkov escreveu: > On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote: > > That's said, from the admin PoV, it makes sense to have a single > > daemon that collect errors from all error sources and take the > > needed actions. > >

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Borislav Petkov
On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote: > That's said, from the admin PoV, it makes sense to have a single > daemon that collect errors from all error sources and take the > needed actions. Doing recovery actions in userspace is too flaky. Daemon can get killed at an

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Borislav Petkov
On Wed, Jun 12, 2019 at 06:29:26PM +1000, Benjamin Herrenschmidt wrote: > I tend to disagree here. We've been down that rabbit hole in the past > and we (Linux in general) are trying to move away from that sort of > "platform" overarching driver as much as possible. Why is a "platform" driver like

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Mauro Carvalho Chehab
Em Wed, 12 Jun 2019 18:29:26 +1000 Benjamin Herrenschmidt escreveu: > On Wed, 2019-06-12 at 05:48 +0200, Borislav Petkov wrote: > > On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote: > > > Yes, we would be in a world of pain already if tracepoints couldn't > > > handle conc

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 05:48 +0200, Borislav Petkov wrote: > On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote: > > Yes, we would be in a world of pain already if tracepoints couldn't > > handle concurrency :-) > > Right, lockless buffer and the whole shebang :) Yup. > > Sort

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Borislav Petkov
On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote: > Yes, we would be in a world of pain already if tracepoints couldn't > handle concurrency :-) Right, lockless buffer and the whole shebang :) > Sort-of... I still don't see a race in what we propose but I might be > missing

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 13:56 +0200, Borislav Petkov wrote: > On Tue, Jun 11, 2019 at 05:21:39PM +1000, Benjamin Herrenschmidt wrote: > > So looking again ... all the registration/removal of edac devices seem > > to already be protected by mutexes, so that's not a problem. > > > > Tell me more about

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Hawa, Hanna
Hi James, Allowing linux to access these implementation-defined registers has come up before: https://www.spinics.net/lists/kernel/msg2750349.html It looks like you've navigated most of the issues. Accessing implementation-defined registers is frowned on, but this stuff can't be done generic

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Borislav Petkov
On Tue, Jun 11, 2019 at 10:29:55AM +0300, Hawa, Hanna wrote: > In the near future we plan to push EDAC drivers for L1/L2 and memory > controller. There's no common resources/shared data between them. Ok, you should be safe then. If you need to do more involved interaction in the future, you know w

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Borislav Petkov
On Tue, Jun 11, 2019 at 05:21:39PM +1000, Benjamin Herrenschmidt wrote: > So looking again ... all the registration/removal of edac devices seem > to already be protected by mutexes, so that's not a problem. > > Tell me more about what specific races you think we might have here, > I'm not sure I

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Borislav Petkov
On Tue, Jun 11, 2019 at 03:50:40PM +1000, Benjamin Herrenschmidt wrote: > Should we fix that then instead ? Sure. > What are the big issues with adding some basic locking ? being called > from NMIs ? That is one possible issue. I know we don't call the error decoding routines in NMI context on x

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Hawa, Hanna
Hi Ben, Boris On 6/11/2019 8:50 AM, Benjamin Herrenschmidt wrote: Anyway, let's get back to the specific case of our Amazon platform here since it's a concrete example. Hanna, can you give us a reasonably exhaustive list of how many such "drivers" we'll want in the EDAC subsystem and whether y

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 15:50 +1000, Benjamin Herrenschmidt wrote: > On Sat, 2019-06-08 at 11:05 +0200, Borislav Petkov wrote: > > On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote: > > > Those IP blocks don't need any SW coordination at runtime. The drivers > > > don't share dat

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-10 Thread Benjamin Herrenschmidt
On Sat, 2019-06-08 at 11:05 +0200, Borislav Petkov wrote: > On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote: > > Those IP blocks don't need any SW coordination at runtime. The drivers > > don't share data nor communicate with each other. There is absolultely > > no reason to

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-08 Thread Borislav Petkov
On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote: > Those IP blocks don't need any SW coordination at runtime. The drivers > don't share data nor communicate with each other. There is absolultely > no reason to go down that path. Let me set one thing straight: the EDAC "subsy

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-07 Thread Benjamin Herrenschmidt
On Fri, 2019-06-07 at 16:11 +0100, James Morse wrote: > I'm coming at this from somewhere else. This stuff has to be considered all > the way > through the system. Just because each component supports error detection, > doesn't mean you > aren't going to get silent corruption. Likewise if another

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-07 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 11:33 +0100, James Morse wrote: > > Disagree. The various drivers don't depend on each other. > > I think we should keep the drivers separated as they are distinct and > > independent IP blocks. > > But they don't exist in isolation, they both depend on the > integration-c

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-07 Thread James Morse
Hi guys, On 06/06/2019 12:37, Shenhar, Talel wrote: >>> Disagree. The various drivers don't depend on each other. >>> I think we should keep the drivers separated as they are distinct and >>> independent IP >>> blocks. >> But they don't exist in isolation, they both depend on the >> integration-

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread Shenhar, Talel
Disagree. The various drivers don't depend on each other. I think we should keep the drivers separated as they are distinct and independent IP blocks. But they don't exist in isolation, they both depend on the integration-choices/firmware that makes up your platform. Other platforms may hav

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread Borislav Petkov
On Thu, Jun 06, 2019 at 11:33:30AM +0100, James Morse wrote: > All these are integration choices between the two IP blocks, done as separate > drivers we > don't have anywhere to store that information. Even if you don't care about > this, making > them separate drivers should only be done to mak

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread James Morse
Hi Hawa, On 06/06/2019 08:53, Hawa, Hanna wrote: > On 5/31/2019 8:14 AM, Borislav Petkov wrote: >> On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote: >>> This isn't terribly helpful, there's nothing telling anybody which of >>> those files corresponds to an ARM SoC :-) >> >>

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread Borislav Petkov
On Thu, Jun 06, 2019 at 10:53:42AM +0300, Hawa, Hanna wrote: > Disagree. The various drivers don't depend on each other. > I think we should keep the drivers separated as they are distinct and > independent IP blocks. This topic comes up each time someone submits a new ARM EDAC driver: EDAC can't

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread Hawa, Hanna
On 5/31/2019 8:14 AM, Borislav Petkov wrote: On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote: This isn't terribly helpful, there's nothing telling anybody which of those files corresponds to an ARM SoC :-) drivers/edac/altera_edac.c is one example. Also, James and I

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-05 Thread James Morse
Hi Hana, On 30/05/2019 11:15, Hanna Hawa wrote: > Add support for error detection and correction for Amazon's Annapurna > Labs SoCs for L1/L2 caches. > > Amazon's Annapurna Labs SoCs based on ARM CA57 and CA72, the driver > support both cortex based on compatible string. > diff --git a/drivers/e

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-05 Thread James Morse
Hi guys, On 31/05/2019 06:14, Borislav Petkov wrote: > On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote: >> This isn't terribly helpful, there's nothing telling anybody which of >> those files corresponds to an ARM SoC :-) > > drivers/edac/altera_edac.c is one example. > >

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-02 Thread Hawa, Hanna
On 5/31/2019 4:15 AM, Herrenschmidt, Benjamin wrote: On Thu, 2019-05-30 at 11:19 -0700, Boris Petkov wrote: On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote: Add support for error detection and correction for Amazon's Annapurna Labs SoCs for L1/L2 caches. So this should be a driver for t

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread Borislav Petkov
On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote: > This isn't terribly helpful, there's nothing telling anybody which of > those files corresponds to an ARM SoC :-) drivers/edac/altera_edac.c is one example. Also, James and I have a small writeup on how an arm driver shoul

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread Herrenschmidt, Benjamin
On Thu, 2019-05-30 at 11:19 -0700, Boris Petkov wrote: > On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote: > > Add support for error detection and correction for Amazon's > > Annapurna > > Labs SoCs for L1/L2 caches. > > > So this should be a driver for the whole annapurna platform and not > onl

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread Boris Petkov
On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote: >Add support for error detection and correction for Amazon's Annapurna >Labs SoCs for L1/L2 caches. So this should be a driver for the whole annapurna platform and not only about the RAS functionality in an IP like the caches. See other ARM EDAC

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread Joe Perches
On Thu, 2019-05-30 at 15:52 +0300, hhh...@amazon.com wrote: > On 5/30/19 2:57 PM, Greg KH wrote: > > On Thu, May 30, 2019 at 01:15:29PM +0300, Hanna Hawa wrote: > > > +static void al_a57_edac_cpumerrsr(void *arg) > > > +{ > > > + struct edac_device_ctl_info *edac_dev = > > > + (struct edac_

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread hhhawa
On 5/30/19 2:57 PM, Greg KH wrote: On Thu, May 30, 2019 at 01:15:29PM +0300, Hanna Hawa wrote: +static void al_a57_edac_cpumerrsr(void *arg) +{ + struct edac_device_ctl_info *edac_dev = + (struct edac_device_ctl_info *)arg; No need for casting anything here, just assign it.

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread Greg KH
On Thu, May 30, 2019 at 01:15:29PM +0300, Hanna Hawa wrote: > +static void al_a57_edac_cpumerrsr(void *arg) > +{ > + struct edac_device_ctl_info *edac_dev = > + (struct edac_device_ctl_info *)arg; No need for casting anything here, just assign it. Doesn't checkpatch catch this typ

[PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-05-30 Thread Hanna Hawa
Add support for error detection and correction for Amazon's Annapurna Labs SoCs for L1/L2 caches. Amazon's Annapurna Labs SoCs based on ARM CA57 and CA72, the driver support both cortex based on compatible string. Signed-off-by: Hanna Hawa --- MAINTAINERS| 7 + drivers