RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-09-12 Thread Joshi, Mukul
> To: Joshi, Mukul > > Cc: Borislav Petkov; Alex Deucher; x86-ml; Kasiviswanathan, Harish; lkml; amd-gfx@lists.freedesktop.org > > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for > > Aldebaran > > > > On

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-07-29 Thread Joshi, Mukul
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote: > ... > > > Is that the same deferred interrupt which calls > > > amd_deferred_error_interrupt() ? > > > > Sorry picking this up af

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-06-03 Thread Yazen Ghannam
On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote: ... > > Is that the same deferred interrupt which calls > > amd_deferred_error_interrupt() ? > > Sorry picking this up after sometime. I thought I had replied to this email. > Yes it is the same deferred interrupt which calls > amd_def

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-27 Thread Joshi, Mukul
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > [CAUTION: External Email] > > On Thu, May 13, 2021 at 11:14:30PM +, Joshi, Mukul wrote: > > Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us > > to use something else? > > I sti

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-14 Thread Borislav Petkov
On Fri, May 14, 2021 at 01:06:33PM +, Joshi, Mukul wrote: > We have RAS functionality in other ASICs that is not dependent on > CONFIG_X86_MCE_AMD. So, I don't think we would want to do that just > for one ASIC. Lemme try again: you said that those errors do get reported through a deferred int

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-14 Thread Borislav Petkov
On Thu, May 13, 2021 at 11:10:34PM +, Joshi, Mukul wrote: > That's probably not the best example to look at. Oh, it is the *perfect* example but... > smca_get_long_name() is used in drivers/edac/mce_amd.c and this file > doesn't get compiled when CONFIG_X86_MCE_AMD is not defined. > > And amd

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-14 Thread Borislav Petkov
On Thu, May 13, 2021 at 11:14:30PM +, Joshi, Mukul wrote: > Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us to use > something else? I still don't know why a separate priority is needed. Maybe this still needs answering: > It is a deferred interrupt that generates an MCE

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-14 Thread Joshi, Mukul
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > [CAUTION: External Email] > > On Thu, May 13, 2021 at 11:10:34PM +, Joshi, Mukul wrote: > > That's probably not the best example to look at. > > Oh, it is the *perfect* example but... >

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Joshi, Mukul
desktop.org > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > [CAUTION: External Email] > > On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote: > > Right. The sys admin can query the bad page count and decide when to > > retir

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Joshi, Mukul
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > [CAUTION: External Email] > > On Thu, May 13, 2021 at 03:20:36AM +, Joshi, Mukul wrote: > > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is > defined. > > I would

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Alex Deucher
On Thu, May 13, 2021 at 10:57 AM Borislav Petkov wrote: > > On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote: > > Right. The sys admin can query the bad page count and decide when to > > retire the card. > > Yap, although the driver should actively "tell" the sysadmin when some > crit

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Borislav Petkov
On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote: > Right. The sys admin can query the bad page count and decide when to > retire the card. Yap, although the driver should actively "tell" the sysadmin when some critical counts of retired VRAM pages are reached because I doubt all admi

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Borislav Petkov
On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote: > The bad pages are stored in an EEPROM on the board and the next time > the driver loads it reads the EEPROM so that it can reserve the bad > pages at init time so they don't get used again. And that works automagically on the next boo

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Alex Deucher
On Thu, May 13, 2021 at 10:30 AM Borislav Petkov wrote: > > On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote: > > The bad pages are stored in an EEPROM on the board and the next time > > the driver loads it reads the EEPROM so that it can reserve the bad > > pages at init time so they

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Alex Deucher
On Thu, May 13, 2021 at 9:26 AM Borislav Petkov wrote: > > On Thu, May 13, 2021 at 03:20:36AM +, Joshi, Mukul wrote: > > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is > > defined. > > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the > > amdgpu >

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 Thread Borislav Petkov
On Thu, May 13, 2021 at 03:20:36AM +, Joshi, Mukul wrote: > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined. > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu > driver when CONFIG_X86_MCE_AMD is not defined. > I can avoid all that by u

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-12 Thread Joshi, Mukul
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > [CAUTION: External Email] > > On Wed, May 12, 2021 at 07:00:58PM +, Joshi, Mukul wrote: > > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is > > only interested in e

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-12 Thread Borislav Petkov
On Wed, May 12, 2021 at 07:00:58PM +, Joshi, Mukul wrote: > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is > only interested in errors on GPU UMC. So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2. > We cannot know this without is_smca_umc_v2. You don't need it

RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-12 Thread Joshi, Mukul
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran > > [CAUTION: External Email] > > Hi, > > so this is a drive-by review using the lore.kernel.org mail because I wasn't > CCed > on this. > > On Tue, May 11, 2021 at 09:30:58PM -0400,

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-12 Thread Borislav Petkov
Hi, so this is a drive-by review using the lore.kernel.org mail because I wasn't CCed on this. On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote: > +static int amdgpu_bad_page_notifier(struct notifier_block *nb, > + unsigned long val, void *data) > +{ > +