Hi Hawa,
On 17/06/2019 14:00, Hawa, Hanna wrote:
>> I don't think it can, on a second reading, it looks to be even more
>> complicated than I
>> thought! That bit is described as disabling forwarding of uncorrected data,
>> but it looks
>> like the uncorrected data never actually reaches the oth
+static void al_a57_edac_l2merrsr(void *arg)
+{
+ edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
How do we know this is corrected?
If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to
uncorrected/corrected. Is
this what you are depending on here?
No - not on this. Rep
Reply part 2.
On Thu, Jun 13, 2019 at 09:54:18AM +1000, Benjamin Herrenschmidt wrote:
> Why ? Because one or two historical drivers mix MC and PCI then "it
> makes sense" to do that for everybody ?
Because it was like that. And now all of a sudden ARM wants something
different. So we must at leas
Hi Hawa,
On 13/06/2019 18:05, James Morse wrote:
> On 11/06/2019 20:56, Hawa, Hanna wrote:
>> James Morse wrote:
>>> Hawa, Hanna wrote:
+ if (cluster != last_cluster) {
+ smp_call_function_single(cpu, al_a57_edac_l2merrsr,
+ edac_dev, 0);
>>
Hi Hawa,
On 11/06/2019 20:56, Hawa, Hanna wrote:
> James Morse wrote:
>> Hawa, Hanna wrote:
>>> + edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>
>> How do we know this was corrected?
>>
>> 6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit."
>> in a
>> paragraph
On Wed, 2019-06-12 at 13:00 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote:
> > That's said, from the admin PoV, it makes sense to have a single
> > daemon that collect errors from all error sources and take the
> > needed actions.
>
> Doing r
On Wed, 2019-06-12 at 12:42 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 06:29:26PM +1000, Benjamin Herrenschmidt wrote:
> > I tend to disagree here. We've been down that rabbit hole in the past
> > and we (Linux in general) are trying to move away from that sort of
> > "platform" overar
On Wed, 2019-06-12 at 14:25 +0200, Borislav Petkov wrote:
> > But for the main case that really needs to be in the kernel, which is
> > DRAM, the recovery can usually be contained to the MC driver anyway.
>
> Right, if that is enough to handle the error properly.
>
> The memory-failure.c example
On Thu, Jun 13, 2019 at 09:54:18AM +1000, Benjamin Herrenschmidt wrote:
> It tends to be a slippery slope. Also in the ARM world, most SoC tend
> to re-use IP blocks, so you get a lot of code duplication, bug fixed in
> one and not the other etc...
Yes, I'd like to be able to reuse EDAC drivers if
On Wed, Jun 12, 2019 at 03:35:31PM +0300, Hawa, Hanna wrote:
> We have daemon script that collects correctable/uncorrectable errors from
> EDAC sysfs and reports to Amazon service that allow us to take action on
> specific error thresholds.
What does "take action" mean, more specifically? Is the s
Hi Boris,
Yap, I think we're in agreement here. I believe the important question
is whether you need to get error information from multiple sources
together in order to do proper recovery or doing it per error source
suffices.
And I think the actual use cases could/should dictate our
drivers/o
On Wed, Jun 12, 2019 at 09:57:40PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2019-06-12 at 08:42 -0300, Mauro Carvalho Chehab wrote:
> > > Yes, we do have different error reporting facilities but I still
> > > think
> > > that concentrating all the error information needed in order to do
> > >
On Wed, 2019-06-12 at 08:42 -0300, Mauro Carvalho Chehab wrote:
> > Yes, we do have different error reporting facilities but I still
> > think
> > that concentrating all the error information needed in order to do
> > proper recovery action is the better approach here. And make that
> > part
> > of
Em Wed, 12 Jun 2019 13:00:39 +0200
Borislav Petkov escreveu:
> On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote:
> > That's said, from the admin PoV, it makes sense to have a single
> > daemon that collect errors from all error sources and take the
> > needed actions.
>
>
On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote:
> That's said, from the admin PoV, it makes sense to have a single
> daemon that collect errors from all error sources and take the
> needed actions.
Doing recovery actions in userspace is too flaky. Daemon can get killed
at an
On Wed, Jun 12, 2019 at 06:29:26PM +1000, Benjamin Herrenschmidt wrote:
> I tend to disagree here. We've been down that rabbit hole in the past
> and we (Linux in general) are trying to move away from that sort of
> "platform" overarching driver as much as possible.
Why is a "platform" driver like
Em Wed, 12 Jun 2019 18:29:26 +1000
Benjamin Herrenschmidt escreveu:
> On Wed, 2019-06-12 at 05:48 +0200, Borislav Petkov wrote:
> > On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote:
> > > Yes, we would be in a world of pain already if tracepoints couldn't
> > > handle conc
On Wed, 2019-06-12 at 05:48 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote:
> > Yes, we would be in a world of pain already if tracepoints couldn't
> > handle concurrency :-)
>
> Right, lockless buffer and the whole shebang :)
Yup.
> > Sort
On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote:
> Yes, we would be in a world of pain already if tracepoints couldn't
> handle concurrency :-)
Right, lockless buffer and the whole shebang :)
> Sort-of... I still don't see a race in what we propose but I might be
> missing
On Tue, 2019-06-11 at 13:56 +0200, Borislav Petkov wrote:
> On Tue, Jun 11, 2019 at 05:21:39PM +1000, Benjamin Herrenschmidt wrote:
> > So looking again ... all the registration/removal of edac devices seem
> > to already be protected by mutexes, so that's not a problem.
> >
> > Tell me more about
Hi James,
Allowing linux to access these implementation-defined registers has come up
before:
https://www.spinics.net/lists/kernel/msg2750349.html
It looks like you've navigated most of the issues. Accessing
implementation-defined
registers is frowned on, but this stuff can't be done generic
On Tue, Jun 11, 2019 at 10:29:55AM +0300, Hawa, Hanna wrote:
> In the near future we plan to push EDAC drivers for L1/L2 and memory
> controller. There's no common resources/shared data between them.
Ok, you should be safe then. If you need to do more involved interaction
in the future, you know w
On Tue, Jun 11, 2019 at 05:21:39PM +1000, Benjamin Herrenschmidt wrote:
> So looking again ... all the registration/removal of edac devices seem
> to already be protected by mutexes, so that's not a problem.
>
> Tell me more about what specific races you think we might have here,
> I'm not sure I
On Tue, Jun 11, 2019 at 03:50:40PM +1000, Benjamin Herrenschmidt wrote:
> Should we fix that then instead ?
Sure.
> What are the big issues with adding some basic locking ? being called
> from NMIs ?
That is one possible issue. I know we don't call the error decoding
routines in NMI context on x
Hi Ben, Boris
On 6/11/2019 8:50 AM, Benjamin Herrenschmidt wrote:
Anyway, let's get back to the specific case of our Amazon platform here
since it's a concrete example.
Hanna, can you give us a reasonably exhaustive list of how many such
"drivers" we'll want in the EDAC subsystem and whether y
On Tue, 2019-06-11 at 15:50 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2019-06-08 at 11:05 +0200, Borislav Petkov wrote:
> > On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote:
> > > Those IP blocks don't need any SW coordination at runtime. The drivers
> > > don't share dat
On Sat, 2019-06-08 at 11:05 +0200, Borislav Petkov wrote:
> On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote:
> > Those IP blocks don't need any SW coordination at runtime. The drivers
> > don't share data nor communicate with each other. There is absolultely
> > no reason to
On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote:
> Those IP blocks don't need any SW coordination at runtime. The drivers
> don't share data nor communicate with each other. There is absolultely
> no reason to go down that path.
Let me set one thing straight: the EDAC "subsy
On Fri, 2019-06-07 at 16:11 +0100, James Morse wrote:
> I'm coming at this from somewhere else. This stuff has to be considered all
> the way
> through the system. Just because each component supports error detection,
> doesn't mean you
> aren't going to get silent corruption. Likewise if another
On Thu, 2019-06-06 at 11:33 +0100, James Morse wrote:
> > Disagree. The various drivers don't depend on each other.
> > I think we should keep the drivers separated as they are distinct and
> > independent IP blocks.
>
> But they don't exist in isolation, they both depend on the
> integration-c
Hi guys,
On 06/06/2019 12:37, Shenhar, Talel wrote:
>>> Disagree. The various drivers don't depend on each other.
>>> I think we should keep the drivers separated as they are distinct and
>>> independent IP
>>> blocks.
>> But they don't exist in isolation, they both depend on the
>> integration-
Disagree. The various drivers don't depend on each other.
I think we should keep the drivers separated as they are distinct and
independent IP blocks.
But they don't exist in isolation, they both depend on the
integration-choices/firmware
that makes up your platform.
Other platforms may hav
On Thu, Jun 06, 2019 at 11:33:30AM +0100, James Morse wrote:
> All these are integration choices between the two IP blocks, done as separate
> drivers we
> don't have anywhere to store that information. Even if you don't care about
> this, making
> them separate drivers should only be done to mak
Hi Hawa,
On 06/06/2019 08:53, Hawa, Hanna wrote:
> On 5/31/2019 8:14 AM, Borislav Petkov wrote:
>> On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:
>>> This isn't terribly helpful, there's nothing telling anybody which of
>>> those files corresponds to an ARM SoC :-)
>>
>>
On Thu, Jun 06, 2019 at 10:53:42AM +0300, Hawa, Hanna wrote:
> Disagree. The various drivers don't depend on each other.
> I think we should keep the drivers separated as they are distinct and
> independent IP blocks.
This topic comes up each time someone submits a new ARM EDAC driver:
EDAC can't
On 5/31/2019 8:14 AM, Borislav Petkov wrote:
On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:
This isn't terribly helpful, there's nothing telling anybody which of
those files corresponds to an ARM SoC :-)
drivers/edac/altera_edac.c is one example.
Also, James and I
Hi Hana,
On 30/05/2019 11:15, Hanna Hawa wrote:
> Add support for error detection and correction for Amazon's Annapurna
> Labs SoCs for L1/L2 caches.
>
> Amazon's Annapurna Labs SoCs based on ARM CA57 and CA72, the driver
> support both cortex based on compatible string.
> diff --git a/drivers/e
Hi guys,
On 31/05/2019 06:14, Borislav Petkov wrote:
> On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:
>> This isn't terribly helpful, there's nothing telling anybody which of
>> those files corresponds to an ARM SoC :-)
>
> drivers/edac/altera_edac.c is one example.
>
>
On 5/31/2019 4:15 AM, Herrenschmidt, Benjamin wrote:
On Thu, 2019-05-30 at 11:19 -0700, Boris Petkov wrote:
On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote:
Add support for error detection and correction for Amazon's
Annapurna
Labs SoCs for L1/L2 caches.
So this should be a driver for t
On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:
> This isn't terribly helpful, there's nothing telling anybody which of
> those files corresponds to an ARM SoC :-)
drivers/edac/altera_edac.c is one example.
Also, James and I have a small writeup on how an arm driver shoul
On Thu, 2019-05-30 at 11:19 -0700, Boris Petkov wrote:
> On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote:
> > Add support for error detection and correction for Amazon's
> > Annapurna
> > Labs SoCs for L1/L2 caches.
>
>
> So this should be a driver for the whole annapurna platform and not
> onl
On May 30, 2019 3:15:29 AM PDT, Hanna Hawa wrote:
>Add support for error detection and correction for Amazon's Annapurna
>Labs SoCs for L1/L2 caches.
So this should be a driver for the whole annapurna platform and not only about
the RAS functionality in an IP like the caches. See other ARM EDAC
On Thu, 2019-05-30 at 15:52 +0300, hhh...@amazon.com wrote:
> On 5/30/19 2:57 PM, Greg KH wrote:
> > On Thu, May 30, 2019 at 01:15:29PM +0300, Hanna Hawa wrote:
> > > +static void al_a57_edac_cpumerrsr(void *arg)
> > > +{
> > > + struct edac_device_ctl_info *edac_dev =
> > > + (struct edac_
On 5/30/19 2:57 PM, Greg KH wrote:
On Thu, May 30, 2019 at 01:15:29PM +0300, Hanna Hawa wrote:
+static void al_a57_edac_cpumerrsr(void *arg)
+{
+ struct edac_device_ctl_info *edac_dev =
+ (struct edac_device_ctl_info *)arg;
No need for casting anything here, just assign it.
On Thu, May 30, 2019 at 01:15:29PM +0300, Hanna Hawa wrote:
> +static void al_a57_edac_cpumerrsr(void *arg)
> +{
> + struct edac_device_ctl_info *edac_dev =
> + (struct edac_device_ctl_info *)arg;
No need for casting anything here, just assign it. Doesn't checkpatch
catch this typ
Add support for error detection and correction for Amazon's Annapurna
Labs SoCs for L1/L2 caches.
Amazon's Annapurna Labs SoCs based on ARM CA57 and CA72, the driver
support both cortex based on compatible string.
Signed-off-by: Hanna Hawa
---
MAINTAINERS| 7 +
drivers
46 matches
Mail list logo