> To: Joshi, Mukul
> > Cc: Borislav Petkov ; Alex Deucher
> > ; x86-ml ; Kasiviswanathan,
> > Harish ; lkml
> > ; amd-gfx@lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On
ect: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> ...
> > > Is that the same deferred interrupt which calls
> > > amd_deferred_error_interrupt() ?
> >
> > Sorry picking this up af
On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
...
> > Is that the same deferred interrupt which calls
> > amd_deferred_error_interrupt() ?
>
> Sorry picking this up after sometime. I thought I had replied to this email.
> Yes it is the same deferred interrupt which calls
> amd_def
: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Thu, May 13, 2021 at 11:14:30PM +, Joshi, Mukul wrote:
> > Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us
> > to use something else?
>
> I sti
On Fri, May 14, 2021 at 01:06:33PM +, Joshi, Mukul wrote:
> We have RAS functionality in other ASICs that is not dependent on
> CONFIG_X86_MCE_AMD. So, I don't think we would want to do that just
> for one ASIC.
Lemme try again: you said that those errors do get reported through a
deferred int
On Thu, May 13, 2021 at 11:10:34PM +, Joshi, Mukul wrote:
> That's probably not the best example to look at.
Oh, it is the *perfect* example but...
> smca_get_long_name() is used in drivers/edac/mce_amd.c and this file
> doesn't get compiled when CONFIG_X86_MCE_AMD is not defined.
>
> And amd
On Thu, May 13, 2021 at 11:14:30PM +, Joshi, Mukul wrote:
> Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us to use
> something else?
I still don't know why a separate priority is needed. Maybe this still
needs answering:
> It is a deferred interrupt that generates an MCE
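Whether a new MCE priority is warranted comes down to call order on the decode notifier chain. The sketch below is a minimal userspace model of the ordering rule kernel notifier chains follow: higher priority is called first, and equal priority goes after entries already registered. Callbacks can also stop the walk. It is a sketch only. The type and function names (mock_nb, chain_register, chain_call, record_cb) and the numeric priorities are hypothetical. They are not the real MCE_PRIO_* values or the kernel notifier API.

```c
#include <stddef.h>

#define MOCK_NOTIFY_DONE 0x0000 /* not interested, keep calling later entries */
#define MOCK_NOTIFY_STOP 0x8000 /* handled, skip the remaining entries */

struct mock_nb {
	int (*call)(struct mock_nb *nb, unsigned long val, void *data);
	int priority;           /* higher value = called earlier */
	struct mock_nb *next;
};

/* Keep the chain sorted by descending priority; a new entry with the same
 * priority as an existing one is placed after it. */
static void chain_register(struct mock_nb **head, struct mock_nb *n)
{
	struct mock_nb **pos = head;

	while (*pos && n->priority <= (*pos)->priority)
		pos = &(*pos)->next;
	n->next = *pos;
	*pos = n;
}

/* Invoke callbacks in chain order until one sets MOCK_NOTIFY_STOP.
 * Returns how many callbacks ran. */
static int chain_call(struct mock_nb *head, unsigned long val, void *data)
{
	int calls = 0;

	for (struct mock_nb *nb = head; nb; nb = nb->next) {
		calls++;
		if (nb->call(nb, val, data) & MOCK_NOTIFY_STOP)
			break;
	}
	return calls;
}

/* Demo callback: records the priority of each entry in the order it ran. */
#define MOCK_MAX_CALLS 8
static int call_order[MOCK_MAX_CALLS];
static int n_calls;

static int record_cb(struct mock_nb *nb, unsigned long val, void *data)
{
	(void)val;
	(void)data;
	if (n_calls < MOCK_MAX_CALLS)
		call_order[n_calls++] = nb->priority;
	return MOCK_NOTIFY_DONE;
}
```

Registering entries with priorities 10, 30 and 20, in that order, still produces the call order 30, 20, 10. A dedicated priority value only matters if a handler must run before or after a specific existing decoder.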
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Thu, May 13, 2021 at 11:10:34PM +, Joshi, Mukul wrote:
> > That's probably not the best example to look at.
>
> Oh, it is the *perfect* example but...
desktop.org
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> > Right. The sys admin can query the bad page count and decide when to
> > retir
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Thu, May 13, 2021 at 03:20:36AM +, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> > defined.
> > I would
On Thu, May 13, 2021 at 10:57 AM Borislav Petkov wrote:
>
> On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> > Right. The sys admin can query the bad page count and decide when to
> > retire the card.
>
> Yap, although the driver should actively "tell" the sysadmin when some
> crit
On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> Right. The sys admin can query the bad page count and decide when to
> retire the card.
Yap, although the driver should actively "tell" the sysadmin when some
critical counts of retired VRAM pages are reached because I doubt all
admi
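The suggestion above is that the driver should actively warn once the number of retired VRAM pages crosses a critical level, rather than relying on every admin to poll the count. It might look roughly like the userspace sketch below. The threshold value, the warn-once latch and all names are hypothetical. A real driver would emit a kernel log message rather than call fprintf().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical policy knob: warn once this many pages have been retired. */
#define MOCK_BAD_PAGE_WARN_THRESHOLD 8

struct mock_ras_state {
	unsigned int retired_pages;
	bool warned;            /* latch so the warning is emitted only once */
};

/* Account for n newly retired pages. Returns true only on the call that
 * first reaches the threshold, which is also the call that prints. */
static bool retire_pages(struct mock_ras_state *ras, unsigned int n)
{
	ras->retired_pages += n;
	if (!ras->warned && ras->retired_pages >= MOCK_BAD_PAGE_WARN_THRESHOLD) {
		ras->warned = true;
		fprintf(stderr,
			"warning: %u VRAM pages retired (threshold %d), consider replacing the card\n",
			ras->retired_pages, MOCK_BAD_PAGE_WARN_THRESHOLD);
		return true;
	}
	return false;
}
```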
On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> The bad pages are stored in an EEPROM on the board and the next time
> the driver loads it reads the EEPROM so that it can reserve the bad
> pages at init time so they don't get used again.
And that works automagically on the next boo
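The persistence scheme described above records bad pages in the board EEPROM and reserves them when the driver next loads. The userspace model below is a minimal sketch of that scheme under several assumptions. The "EEPROM" is an in-memory array of page addresses. VRAM pages are tracked in a bitmap. Pages are assumed to be 4 KiB. The names (mock_eeprom, reserve_bad_pages, is_reserved) are hypothetical and do not reflect amdgpu's actual RAS EEPROM record format.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12      /* assumed 4 KiB pages */
#define MOCK_VRAM_PAGES 64      /* tiny VRAM for illustration */

/* Hypothetical persisted record: one retired page address per entry. */
struct mock_eeprom {
	uint64_t bad_addr[16];
	size_t count;
};

/* Bitmap of VRAM pages the allocator must never hand out. */
static uint8_t reserved[MOCK_VRAM_PAGES / 8];

static void mark_reserved(uint64_t pfn)
{
	reserved[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

static int is_reserved(uint64_t pfn)
{
	return (reserved[pfn / 8] >> (pfn % 8)) & 1;
}

/* At driver init: replay the EEPROM so previously retired pages stay unused.
 * Returns the number of newly reserved pages; duplicate and out-of-range
 * entries are skipped. */
static size_t reserve_bad_pages(const struct mock_eeprom *ee)
{
	size_t n = 0;

	for (size_t i = 0; i < ee->count; i++) {
		uint64_t pfn = ee->bad_addr[i] >> MOCK_PAGE_SHIFT;

		if (pfn >= MOCK_VRAM_PAGES)
			continue;
		if (!is_reserved(pfn)) {
			mark_reserved(pfn);
			n++;
		}
	}
	return n;
}
```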
On Thu, May 13, 2021 at 10:30 AM Borislav Petkov wrote:
>
> On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> > The bad pages are stored in an EEPROM on the board and the next time
> > the driver loads it reads the EEPROM so that it can reserve the bad
> > pages at init time so they
On Thu, May 13, 2021 at 9:26 AM Borislav Petkov wrote:
>
> On Thu, May 13, 2021 at 03:20:36AM +, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> > defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the
> > amdgpu
>
On Thu, May 13, 2021 at 03:20:36AM +, Joshi, Mukul wrote:
> Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> driver when CONFIG_X86_MCE_AMD is not defined.
> I can avoid all that by u
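The usual kernel way to avoid sprinkling #ifdef CONFIG_X86_MCE_AMD through a driver is to let the header provide a static inline stub when the option is off. Callers then compile unchanged either way. Below is a minimal userspace sketch of that pattern. Several details are illustrative assumptions: the names mock_get_bank_type() and is_gpu_umc_error(), the (cpu, bank) parameters, the two-entry lookup table, and the use of MOCK_N_BANK_TYPES as an "unknown" result. None of them is the real smca_get_bank_type() prototype.

```c
/* Build with -DCONFIG_X86_MCE_AMD to get the table-backed path; without it
 * the inline stub is used and the caller below still compiles unchanged. */
enum mock_bank_type { MOCK_BANK_UMC, MOCK_BANK_UMC_V2, MOCK_N_BANK_TYPES };

#ifdef CONFIG_X86_MCE_AMD
/* Hypothetical stand-in for a per-CPU, per-bank type lookup. */
static const enum mock_bank_type bank_table[] = { MOCK_BANK_UMC, MOCK_BANK_UMC_V2 };

static inline enum mock_bank_type mock_get_bank_type(unsigned int cpu, unsigned int bank)
{
	(void)cpu;
	if (bank >= sizeof(bank_table) / sizeof(bank_table[0]))
		return MOCK_N_BANK_TYPES;
	return bank_table[bank];
}
#else
/* Option disabled: report "no known type" so callers simply ignore the error. */
static inline enum mock_bank_type mock_get_bank_type(unsigned int cpu, unsigned int bank)
{
	(void)cpu;
	(void)bank;
	return MOCK_N_BANK_TYPES;
}
#endif

/* Caller code: no #ifdef needed here. */
static int is_gpu_umc_error(unsigned int cpu, unsigned int bank)
{
	return mock_get_bank_type(cpu, bank) == MOCK_BANK_UMC_V2;
}
```

Compiled as-is, the stub branch is taken and every bank reads as "not GPU UMC". The design choice is that configuration handling lives in one header rather than in each caller.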
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Wed, May 12, 2021 at 07:00:58PM +, Joshi, Mukul wrote:
> > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> > only interested in e
On Wed, May 12, 2021 at 07:00:58PM +, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.
So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.
> We cannot know this without is_smca_umc_v2.
You don't need it
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> Hi,
>
> so this is a drive-by review using the lore.kernel.org mail because I wasn't
> CCed on this.
>
> On Tue, May 11, 2021 at 09:30:58PM -0400,
Hi,
so this is a drive-by review using the lore.kernel.org mail because I
wasn't CCed on this.
On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> +
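The patch's amdgpu_bad_page_notifier() is truncated in this archive view. Purely to illustrate the general shape such a callback can take, here is a userspace sketch. It assumes a simplified error record that already carries its bank type and error address. Several parts are assumptions and not the code from the patch: the struct layout, the MOCK_* names, the 4 KiB page shift, and the choice to return "done" for errors from other banks.

```c
#include <stddef.h>
#include <stdint.h>

/* Loosely modeled on the kernel's NOTIFY_DONE / NOTIFY_OK return codes. */
#define MOCK_NOTIFY_DONE 0x0000
#define MOCK_NOTIFY_OK   0x0001

enum mock_bank_type { MOCK_BANK_CPU_UMC, MOCK_BANK_GPU_UMC };

/* Hypothetical, heavily simplified error record. */
struct mock_mce {
	enum mock_bank_type bank_type;
	uint64_t addr;          /* error address reported by the bank */
};

static uint64_t last_bad_pfn;
static unsigned int bad_page_events;

/* Mirrors the (nb, val, data) shape of a notifier callback; nb and val are
 * unused in this sketch. */
static int mock_bad_page_notifier(void *nb, unsigned long val, void *data)
{
	const struct mock_mce *m = data;

	(void)nb;
	(void)val;

	/* Not a GPU memory-controller error: leave it to other handlers. */
	if (!m || m->bank_type != MOCK_BANK_GPU_UMC)
		return MOCK_NOTIFY_DONE;

	/* Record the page frame to retire (4 KiB pages assumed). */
	last_bad_pfn = m->addr >> 12;
	bad_page_events++;
	return MOCK_NOTIFY_OK;
}
```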