On Thu, Aug 15, 2013 at 07:14:34PM +, Luck, Tony wrote:
> Yes - but the serial port is too slow to log everything that you might
> conceivably need to debug your problem. Imagine trying to log every
> interrupt and every pagefault on every processor down a single 115200
> baud connection. Thus
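A back-of-the-envelope sketch of the bandwidth limit described above. It assumes 8N1 framing (10 bits on the wire per byte) and an 80-byte log line; both figures are illustrative assumptions, not numbers from the thread.

```python
# Rough estimate of how many log lines fit through a serial console per second.
# Assumptions (not from the thread): 8N1 framing, i.e. 1 start + 8 data + 1 stop
# bit = 10 bits on the wire per byte, and an average log line of 80 bytes.

def serial_bytes_per_second(baud, bits_per_byte=10):
    """Payload bytes per second for a given baud rate."""
    return baud // bits_per_byte


def max_log_lines_per_second(baud, line_bytes=80):
    """Upper bound on whole log lines per second, ignoring any flow control."""
    return serial_bytes_per_second(baud) // line_bytes


if __name__ == "__main__":
    baud = 115200
    print(serial_bytes_per_second(baud), "bytes/s")
    print(max_log_lines_per_second(baud), "lines/s")
```

Under these assumptions the link carries a few hundred lines per second at most, shared across all processors, which is why logging every interrupt and page fault over it is impractical.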
On Thu, Aug 15, 2013 at 07:20:29PM +, Luck, Tony wrote:
> In theory it could. The ACPI generic error structure used to report
> includes a 20-byte free format field which a BIOS could use to
> describe the location of the error. Haven't seen anyone do this yet -
> and our internal BIOS people l
> AFAICT, APEI doesn't provide the silkscreen label. Some code (or some
> datasheet) is needed to translate between what APEI provides into the
> silkscreen label.
In theory it could. The ACPI generic error structure used to report includes
a 20-byte free format field which a BIOS could use to des
> Well, if I have serial connected to the box, it will contain basically
> everything the machine said, no?
Yes - but the serial port is too slow to log everything that you might
conceivably need to debug your problem. Imagine trying to log every
interrupt and every pagefault on every processor d
On Thu, Aug 15, 2013 at 06:16:48PM +, Luck, Tony wrote:
> > * We parse some APEI table and disable those MCA banks which the BIOS
> > wants to handle first.
>
> We have no idea which errors the BIOS has chosen for itself. We just
> know which bank numbers ...
Well, those which BIOS hasn't chos
> * We parse some APEI table and disable those MCA banks which the BIOS
> wants to handle first.
We have no idea which errors the BIOS has chosen for itself. We just know which
bank numbers ... and Intel processors change mappings of which errors are logged
in which banks in every new processor t
On Thu, Aug 15, 2013 at 11:14:07AM -0300, Mauro Carvalho Chehab wrote:
> I don't see why we should have those two alternatives, as, in the worst
> case (e.g. if ghes_edac can't enrich the APEI data with labels),
> they'll basically provide the very same data to userspace, and the
> EDAC extra overhead
On Thu, 15 Aug 2013 15:44:54 +0200
Borislav Petkov wrote:
> On Thu, Aug 15, 2013 at 10:26:07AM -0300, Mauro Carvalho Chehab wrote:
> > I mean that the edac core needs to know that, on a given system, the
> > BIOS is accessing the hardware registers and sending the data via
> > ghes_edac.
>
>
On Thu, Aug 15, 2013 at 10:34:21AM -0300, Mauro Carvalho Chehab wrote:
> Yes, but the thing is that it is not safe to use the hardware driver
> if the BIOS is also reading the hardware error registers directly, as,
> on some hardware, a read causes the error data to be cleared from such
> a register.
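A toy model of the hazard described above, assuming a status register with clear-on-read semantics. The class and the 0x1 status value are hypothetical and stand in for no particular memory controller.

```python
# Toy model of a clear-on-read error status register. Everything here is
# hypothetical: it only illustrates why two independent readers (firmware
# and an OS hardware driver) cannot both rely on the same register.

class ClearOnReadRegister:
    def __init__(self):
        self._value = 0

    def latch_error(self, status_bits):
        """Hardware side: record an error by setting status bits."""
        self._value |= status_bits

    def read(self):
        """Return the current status and clear it, as such registers do."""
        value, self._value = self._value, 0
        return value


reg = ClearOnReadRegister()
reg.latch_error(0x1)        # a memory error is latched

bios_view = reg.read()      # firmware polls first and consumes the error
driver_view = reg.read()    # the OS driver then finds nothing to report

print(hex(bios_view), hex(driver_view))
```

Whichever reader gets there first sees the error; the other sees a clean register, so the two must not poll the same hardware independently.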
On Thu, Aug 15, 2013 at 10:26:07AM -0300, Mauro Carvalho Chehab wrote:
> I mean that the edac core needs to know that, on a given system, the
> BIOS is accessing the hardware registers and sending the data via
> ghes_edac.
Right, that's the firmware-first thing which Naveen did - see
mce_disable_b
On Thu, 15 Aug 2013 12:01:32 +0200
Borislav Petkov wrote:
> On Wed, Aug 14, 2013 at 09:15:04PM -0300, Mauro Carvalho Chehab wrote:
> > > - Two, if ghes_edac is enabled, it prevents other edac drivers
> > > from being loaded. It looks like the assumption here is that if
> > > ghes/firmware firs
On Thu, 15 Aug 2013 11:38:31 +0200
Borislav Petkov wrote:
> On Wed, Aug 14, 2013 at 09:22:11PM -0300, Mauro Carvalho Chehab wrote:
> > 1) EDAC core needs to know that it should reject "hardware first"
> > drivers.
>
> -ENOPARSE. What do you mean?
I mean that the edac core needs to know t
On Wed, Aug 14, 2013 at 06:38:09PM +, Luck, Tony wrote:
> We've wandered around different strategies here. We definitely
> want the panic log. Some people want all other "kernel exit" logs
> (shutdown, reboot, kexec). When there is enough space in the pstore
> backend we might also want the "oo
On Wed, Aug 14, 2013 at 08:56:38PM -0300, Mauro Carvalho Chehab wrote:
> Better to spend a little more time discussing than implementing a new trace
> that will be removed in the near future.
Right, "in the meantime" we established that this new TP doesn't bring
us anything new so we might just as w
On Wed, Aug 14, 2013 at 09:15:04PM -0300, Mauro Carvalho Chehab wrote:
> > - Two, if ghes_edac is enabled, it prevents other edac drivers
> > from being loaded. It looks like the assumption here is that if
> > ghes/firmware first is enabled, then *all* memory errors are
> > reported through ghes wh
On Wed, Aug 14, 2013 at 09:00:35PM -0300, Mauro Carvalho Chehab wrote:
> I agree: per-type of error events is better than a big generic one.
There are many types of hardware errors and having a single TP for each
is not a good design. Especially if the error formats are comparable
to a high degree
On Wed, Aug 14, 2013 at 09:22:11PM -0300, Mauro Carvalho Chehab wrote:
> 1) EDAC core needs to know that it should reject "hardware first"
> drivers.
-ENOPARSE. What do you mean?
> 3) If BIOS vendors add later some solution to enumerate the DIMMS
> per memory controller, channel, socket wi
On Wed, 14 Aug 2013 16:27:06 +0530
"Naveen N. Rao" wrote:
> On 08/13/2013 11:28 PM, Borislav Petkov wrote:
> > On Tue, Aug 13, 2013 at 11:02:08PM +0530, Naveen N. Rao wrote:
> >> If I'm not mistaken, even for systems that have EDAC drivers, it looks
> >> to me like EDAC can't really decode to
On Wed, 14 Aug 2013 16:17:26 +0530
"Naveen N. Rao" wrote:
> On 08/13/2013 11:09 PM, Luck, Tony wrote:
> >> In the meantime, like Boris suggests, I think we can have a different
> >> trace event for raw APEI reports - userspace can use it as it pleases.
> >>
> >> Once ghes_edac gets better, use
On Wed, 14 Aug 2013 07:43:22 +0200
Borislav Petkov wrote:
> On Tue, Aug 13, 2013 at 08:13:56PM +, Luck, Tony wrote:
> > Generic tracepoints are architected to be able to fire at very high
> > rates and log huge amounts of information. So we'd need something
> > special to say just log thes
On Tue, 13 Aug 2013 23:02:08 +0530
"Naveen N. Rao" wrote:
> On 08/13/2013 06:12 PM, Borislav Petkov wrote:
> > On Tue, Aug 13, 2013 at 04:51:33PM +0530, Naveen N. Rao wrote:
> >> You're right - my trace point makes all the data provided by APEI
> >> available as-is to userspace. However, ghes_edac seems
On Tue, 13 Aug 2013 22:47:36 +0530
"Naveen N. Rao" wrote:
> On 08/13/2013 06:11 PM, Mauro Carvalho Chehab wrote:
> > On Tue, 13 Aug 2013 17:11:18 +0530
> > "Naveen N. Rao" wrote:
> >
> >> On 08/12/2013 08:14 PM, Mauro Carvalho Chehab wrote:
> But, this only seems to expose the APEI da
On Tue, 13 Aug 2013 22:25:58 +0530
"Naveen N. Rao" wrote:
(sorry for the late answer, I had to take a short trip yesterday)
> On 08/13/2013 05:51 PM, Mauro Carvalho Chehab wrote:
> > On Tue, 13 Aug 2013 17:06:14 +0530
> > "Naveen N. Rao" wrote:
> >
> >> On 08/12/2013 11:26 PM, Borislav Petk
> Didn't we say at some point, "log only the panic message which kills
> the machine"?
We've wandered around different strategies here. We definitely want
the panic log. Some people want all other "kernel exit" logs (shutdown,
reboot, kexec). When there is enough space in the pstore backend we
On Wed, Aug 14, 2013 at 04:17:26PM +0530, Naveen N. Rao wrote:
> - One, the logging format for APEI data is a bit verbose and hard
> to parse. But, I suppose we could work with this if we make a few
> changes. Is it ok to change how the APEI data is made available
> through mc_event->driver_detail?
On 08/13/2013 11:28 PM, Borislav Petkov wrote:
On Tue, Aug 13, 2013 at 11:02:08PM +0530, Naveen N. Rao wrote:
If I'm not mistaken, even for systems that have EDAC drivers, it looks
to me like EDAC can't really decode to the DIMM given what is provided
by the bios in the APEI report currently. If
On 08/13/2013 11:09 PM, Luck, Tony wrote:
In the meantime, like Boris suggests, I think we can have a different
trace event for raw APEI reports - userspace can use it as it pleases.
Once ghes_edac gets better, users can decide whether they want raw APEI
reports or the EDAC-processed version and
On Tue, Aug 13, 2013 at 08:13:56PM +, Luck, Tony wrote:
> Generic tracepoints are architected to be able to fire at very high
> rates and log huge amounts of information. So we'd need something
> special to say just log these special tracepoints to network/serial.
>
> > Which reminds me, pstore
> What about sending tracepoint data over serial and/or network? I agree
> that dmesg over serial would be helpful but we need a similar sure-fire
> way for carrying error info out.
Generic tracepoints are architected to be able to fire at very high rates and
log huge amounts of information. So w
On Tue, Aug 13, 2013 at 06:05:02PM +, Luck, Tony wrote:
> If the errors are serious (or a precursor to something serious) that
> process may never get the chance to save the log.
What about sending tracepoint data over serial and/or network? I agree
that dmesg over serial would be helpful but
> Why would you need dmesg if you get your hw errors over the tracepoint?
Redundancy is a good thing when talking about mission critical systems. dmesg
may be feeding to a serial console to be logged and analysed on another system.
The tracepoint data goes to a process on the system experiencing
On Tue, Aug 13, 2013 at 11:02:08PM +0530, Naveen N. Rao wrote:
> If I'm not mistaken, even for systems that have EDAC drivers, it looks
> to me like EDAC can't really decode to the DIMM given what is provided
> by the bios in the APEI report currently. If and when ghes_edac gains
> this capability,
> In the meantime, like Boris suggests, I think we can have a different
> trace event for raw APEI reports - userspace can use it as it pleases.
>
> Once ghes_edac gets better, users can decide whether they want raw APEI
> reports or the EDAC-processed version and choose one or the other trace
>
On 08/13/2013 06:12 PM, Borislav Petkov wrote:
On Tue, Aug 13, 2013 at 04:51:33PM +0530, Naveen N. Rao wrote:
You're right - my trace point makes all the data provided by APEI
available as-is to userspace. However, ghes_edac seems to squash some of this
data into a string when reporting through mc_event.
On 08/13/2013 06:11 PM, Mauro Carvalho Chehab wrote:
On Tue, 13 Aug 2013 17:11:18 +0530
"Naveen N. Rao" wrote:
On 08/12/2013 08:14 PM, Mauro Carvalho Chehab wrote:
But, this only seems to expose the APEI data as a string
and doesn't look to really make all the fields available to user-spac
On 08/13/2013 05:51 PM, Mauro Carvalho Chehab wrote:
On Tue, 13 Aug 2013 17:06:14 +0530
"Naveen N. Rao" wrote:
On 08/12/2013 11:26 PM, Borislav Petkov wrote:
On Mon, Aug 12, 2013 at 02:25:57PM -0300, Mauro Carvalho Chehab wrote:
Userspace still needs the EDAC sysfs, in order to identify h
On Tue, Aug 13, 2013 at 04:51:33PM +0530, Naveen N. Rao wrote:
> You're right - my trace point makes all the data provided by APEI
> available as-is to userspace. However, ghes_edac seems to squash some of this
> data into a string when reporting through mc_event.
Right, for systems which don't need EDAC to
On Tue, 13 Aug 2013 17:11:18 +0530
"Naveen N. Rao" wrote:
> On 08/12/2013 08:14 PM, Mauro Carvalho Chehab wrote:
> >> But, this only seems to expose the APEI data as a string
> >> and doesn't look to really make all the fields available to user-space
> >> in a raw manner. Not sure how well thi
On Tue, Aug 13, 2013 at 09:21:54AM -0300, Mauro Carvalho Chehab wrote:
> > > More specifically, what are those gdata_fru_id and gdata_fru_text
> > > things?
> >
> > My understanding was that this provides the DIMM serial number, but I'm
> > double checking just to be sure.
Hm, ok, then.
If this
On Tue, 13 Aug 2013 17:06:14 +0530
"Naveen N. Rao" wrote:
> On 08/12/2013 11:26 PM, Borislav Petkov wrote:
> > On Mon, Aug 12, 2013 at 02:25:57PM -0300, Mauro Carvalho Chehab wrote:
> >> Userspace still needs the EDAC sysfs, in order to identify how the
> >> memory is organized, and do the pro
On 08/12/2013 08:14 PM, Mauro Carvalho Chehab wrote:
But, this only seems to expose the APEI data as a string
and doesn't look to really make all the fields available to user-space
in a raw manner. Not sure how well this can be utilised by a user-space
tool. Do you have suggestions on how we can
On 08/12/2013 11:26 PM, Borislav Petkov wrote:
On Mon, Aug 12, 2013 at 02:25:57PM -0300, Mauro Carvalho Chehab wrote:
Userspace still needs the EDAC sysfs, in order to identify how the
memory is organized, and do the proper memory labels association.
What edac_ghes does is to fill those sysfs n
On 08/12/2013 06:23 PM, Borislav Petkov wrote:
On Mon, Aug 12, 2013 at 06:11:49PM +0530, Naveen N. Rao wrote:
So, I looked at ghes_edac and it basically seems to boil down to
trace_mc_event. But, this only seems to expose the APEI data as a
string and doesn't look to really make all the fields a
On Mon, Aug 12, 2013 at 02:25:57PM -0300, Mauro Carvalho Chehab wrote:
> Userspace still needs the EDAC sysfs, in order to identify how the
> memory is organized, and do the proper memory labels association.
>
> What edac_ghes does is to fill those sysfs nodes, and to call the
> existing tracing to
>> We are, of course, going to have only one tracepoint which reports
>> memory errors, not two.
>
> Yes, that's my point.
Is life that simple?
We have systems that have no EDAC driver (in some cases because the
architecture precludes one from ever being written, in other because
we either don't
On Mon, 12 Aug 2013 17:04:24 +0200
Borislav Petkov wrote:
> On Mon, Aug 12, 2013 at 11:49:32AM -0300, Mauro Carvalho Chehab wrote:
> > Clear win from what PoV? Userspace will need to decode a different type
> > of tracing, and implement a different logic for APEI.
>
> There's no different typ
On Mon, Aug 12, 2013 at 11:49:32AM -0300, Mauro Carvalho Chehab wrote:
> Clear win from what PoV? Userspace will need to decode a different type
> of tracing, and implement a different logic for APEI.
There's no different type of tracing - it is the same info as in both
cases it comes from APEI. A
On Mon, 12 Aug 2013 14:38:13 +0200
Borislav Petkov wrote:
> On Mon, Aug 12, 2013 at 08:33:55AM -0300, Mauro Carvalho Chehab wrote:
> > APEI is just the mechanism that collects the data, not the mechanism
> > that reports to userspace.
>
> Both methods add a tracepoint - no difference.
>
> >
On Mon, 12 Aug 2013 18:11:49 +0530
"Naveen N. Rao" wrote:
> On 08/12/2013 05:03 PM, Mauro Carvalho Chehab wrote:
> > On Sat, 10 Aug 2013 20:03:22 +0200
> > Borislav Petkov wrote:
> >
> >> On Thu, Aug 08, 2013 at 04:38:22PM -0300, Mauro Carvalho Chehab wrote:
> >>> On Thu, 08 Aug 2013 23:57
On Mon, Aug 12, 2013 at 06:11:49PM +0530, Naveen N. Rao wrote:
> So, I looked at ghes_edac and it basically seems to boil down to
> trace_mc_event. But, this only seems to expose the APEI data as a
> string and doesn't look to really make all the fields available to
> user-space in a raw manner. No
On 08/12/2013 05:03 PM, Mauro Carvalho Chehab wrote:
On Sat, 10 Aug 2013 20:03:22 +0200
Borislav Petkov wrote:
On Thu, Aug 08, 2013 at 04:38:22PM -0300, Mauro Carvalho Chehab wrote:
On Thu, 08 Aug 2013 23:57:51 +0530
"Naveen N. Rao" wrote:
Enable memory error trace event in cper.c
On Mon, Aug 12, 2013 at 08:33:55AM -0300, Mauro Carvalho Chehab wrote:
> APEI is just the mechanism that collects the data, not the mechanism
> that reports to userspace.
Both methods add a tracepoint - no difference.
> I really don't see any sense in adding yet-another-way to report the
> very s
On Sat, 10 Aug 2013 20:03:22 +0200
Borislav Petkov wrote:
> On Thu, Aug 08, 2013 at 04:38:22PM -0300, Mauro Carvalho Chehab wrote:
> > On Thu, 08 Aug 2013 23:57:51 +0530
> > "Naveen N. Rao" wrote:
> >
> > > Enable memory error trace event in cper.c
> >
> > Why do we need to do that? Memo
On Thu, Aug 08, 2013 at 04:38:22PM -0300, Mauro Carvalho Chehab wrote:
> On Thu, 08 Aug 2013 23:57:51 +0530
> "Naveen N. Rao" wrote:
>
> > Enable memory error trace event in cper.c
>
> Why do we need to do that? Memory error events are already handled
> via edac_ghes module,
If APEI gives me
On Thu, 08 Aug 2013 23:57:51 +0530
"Naveen N. Rao" wrote:
> Enable memory error trace event in cper.c
Why do we need to do that? Memory error events are already handled
via edac_ghes module, in a standard way that allows the same
tracing to be used by either BIOS or by direct hardware error
r