Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)

2021-08-31 Thread Grodzovsky, Andrey
What about removing (kthread_should_park()) ? We decided it's useless as far as I remember. Andrey From: amd-gfx on behalf of Liu, Monk Sent: 31 August 2021 20:24 To: Liu, Monk ; amd-gfx@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org Subject: RE:

Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

2021-09-14 Thread Grodzovsky, Andrey
AFAIK this one is independent. Christian, can you confirm ? Andrey From: amd-gfx on behalf of Alex Deucher Sent: 14 September 2021 15:33 To: Christian König Cc: Liu, Monk ; amd-gfx list ; Maling list - DRI developers Subject: Re: [PATCH 1/2] drm/sched: fix t

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-28 Thread Grodzovsky, Andrey
Just a gentle ping. Andrey From: Grodzovsky, Andrey Sent: 26 January 2022 10:52 To: Christian König ; Koenig, Christian ; Lazar, Lijo ; dri-de...@lists.freedesktop.org ; amd-gfx@lists.freedesktop.org ; Chen, JingWen Cc: Chen, Horace ; Liu, Monk Subject: Re

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-02-06 Thread Grodzovsky, Andrey
21:41 To: Grodzovsky, Andrey ; Christian König ; Koenig, Christian ; Lazar, Lijo ; dri-de...@lists.freedesktop.org ; amd-gfx@lists.freedesktop.org ; Chen, JingWen Cc: Chen, Horace ; Liu, Monk Subject: Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Hi Andrey, I don

Re: [PATCH 0/7] libdrm tests for hot-unplug feature

2021-06-03 Thread Grodzovsky, Andrey
Is libdrm on gitlab ? I wasn't aware of this. I assumed code reviews still go through dri-devel. Andrey From: Alex Deucher Sent: 03 June 2021 17:20 To: Grodzovsky, Andrey Cc: Maling list - DRI developers ; amd-gfx list ; Deucher, Alexander ; Christian

Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure.

2020-08-27 Thread Grodzovsky, Andrey
Ping Andrey From: amd-gfx on behalf of Andrey Grodzovsky Sent: 27 August 2020 10:54 To: Alex Deucher Cc: Deucher, Alexander ; Das, Nirmoy ; amd-gfx list Subject: Re: [PATCH 5/7] drm/amdgpu: Fix consecutive DPC recoveries failure. On 8/26/20 11:20 AM, Alex D

Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12

2020-09-02 Thread Grodzovsky, Andrey
It's based on v5.9-rc2 but won't apply cleanly since there is a significant amount of amd-staging-drm-next patches which this was applied on top of. Andrey From: Bjorn Helgaas Sent: 02 September 2020 17:36 To: Grodzovsky, Andrey C

Re: [PATCH] drm/amdgpu: Remove warning for virtual_display

2020-10-07 Thread Grodzovsky, Andrey
Reviewed-by Andrey Grodzovsky Andrey From: amd-gfx on behalf of Emily.Deng Sent: 07 October 2020 21:35 To: amd-gfx@lists.freedesktop.org Cc: Deng, Emily Subject: [PATCH] drm/amdgpu: Remove warning for virtual_display Remove the virtual_display warning in

Re: [PATCH 2/2] drm/amd/display: Avoid MST manager resource leak.

2020-10-15 Thread Grodzovsky, Andrey
Ping for both patches. Andrey From: Andrey Grodzovsky Sent: 14 October 2020 13:24 To: amd-gfx@lists.freedesktop.org Cc: Kazlauskas, Nicholas ; Wentland, Harry ; Pan, Xinhui ; Grodzovsky, Andrey Subject: [PATCH 2/2] drm/amd/display: Avoid MST manager resource

Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Grodzovsky, Andrey
with 1 and MAX_INT. Andrey From: Zhang, Jack (Jian) Sent: 10 March 2021 22:05 To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org ; Koenig, Christian ; Liu, Monk ; Deng, Emily Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode [AMD Official

Re: [PATCH v3 05/12] drm/ttm: Expose ttm_tt_unpopulate for driver use

2020-11-27 Thread Grodzovsky, Andrey
Hey Daniel, just a ping on a bunch of questions I posted below. Andrey From: Grodzovsky, Andrey Sent: 25 November 2020 14:34 To: Daniel Vetter ; Koenig, Christian Cc: r...@kernel.org ; daniel.vet...@ffwll.ch ; dri-de...@lists.freedesktop.org ; e

Re: [PATCH v3 10/12] drm/amdgpu: Avoid sysfs dirs removal post device unplug

2020-11-27 Thread Grodzovsky, Andrey
Hey, just a ping on my comments/question below. Andrey From: Grodzovsky, Andrey Sent: 25 November 2020 12:39 To: Daniel Vetter Cc: amd-gfx list ; dri-devel ; Christian König ; Rob Herring ; Lucas Stach ; Qiang Yu ; Anholt, Eric ; Pekka Paalanen ; Deucher

Re: [PATCH v3 01/12] drm: Add dummy page per device or GEM object

2021-01-08 Thread Grodzovsky, Andrey
Ok then, I guess I will proceed with the dummy pages list implementation then. Andrey From: Koenig, Christian Sent: 08 January 2021 09:52 To: Grodzovsky, Andrey ; Daniel Vetter Cc: amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; daniel.vet

Re: [PATCH] drm/amd/powerplay: add lock protection for swSMU APIs

2019-10-17 Thread Grodzovsky, Andrey
On 10/16/19 11:55 PM, Quan, Evan wrote: > This is a quick and low risk fix. Those APIs which > are exposed to other IPs or to support sysfs/hwmon > interfaces or DAL will have lock protection. Meanwhile > no lock protection is enforced for swSMU internal used > APIs. Future optimization is needed.

Stack out of bounds in KFD on Arcturus

2019-10-17 Thread Grodzovsky, Andrey
Hey Felix - I see this on boot when working with Arcturus. Andrey [  103.602092] kfd kfd: Allocated 3969056 bytes on gart [  103.610769] == [  103.611469] BUG: KASAN: stack-out-of-bounds in kfd_create_vcrat_image_gpu+0x5db/0xb80 [a

Re: Stack out of bounds in KFD on Arcturus

2019-10-17 Thread Grodzovsky, Andrey
ed > here hasn't changed recently. > > Are you using some weird kernel config with a smaller stack? Is it > specific to a compiler version or some optimization flags? I've > sometimes seen function inlining cause excessive stack usage. > > Regards, >   Felix > >

Re: [PATCH] drm/amd/powerplay: add lock protection for swSMU APIs

2019-10-18 Thread Grodzovsky, Andrey
On 10/18/19 1:00 AM, Quan, Evan wrote: > > -Original Message- > From: Grodzovsky, Andrey > Sent: Thursday, October 17, 2019 10:22 PM > To: Quan, Evan ; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH] drm/amd/powerplay: add lock protection for swSMU APIs > >

Re: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU ready.

2019-10-22 Thread Grodzovsky, Andrey
s.freedesktop.org > Cc: Chen, Guchun ; Zhou1, Tao ; > Deucher, Alexander ; noreply-conflue...@amd.com; > Quan, Evan ; Grodzovsky, Andrey > Subject: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU > ready. > > For Arcturus the I2C traffic is done through

Re: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to Arcturus.

2019-10-22 Thread Grodzovsky, Andrey
Deucher, Alexander ; noreply-conflue...@amd.com; > Quan, Evan ; Grodzovsky, Andrey > Subject: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to > Arcturus. > > The communication is done through SMU table and hence the code is in > powerplay. > > Sig

Re: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Grodzovsky, Andrey
Friday, October 18, 2019 4:55 PM > To: Grodzovsky, Andrey > Cc: amd-gfx@lists.freedesktop.org > Subject: Re: Stack out of bounds in KFD on Arcturus > > On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote: >> Not that I aware of, is there a special Kconfig flag to determine >>

Re: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Grodzovsky, Andrey
I don't know - what Kconfig flag should I look at ? Andrey On 10/22/19 1:17 PM, Zeng, Oak wrote: > Sorry I meant is the kernel stack size 16KB in your kconfig? > > Oak > > -Original Message- > From: Grodzovsky, Andrey > Sent: Tuesday, October 22, 2019 12:49 PM

Re: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Grodzovsky, Andrey
t to know whether > this is mi100 specific issue. > > Oak > > -----Original Message- > From: Grodzovsky, Andrey > Sent: Tuesday, October 22, 2019 1:28 PM > To: Zeng, Oak ; Kuehling, Felix > Cc: amd-gfx@lists.freedesktop.org > Subject: Re: Stack out of bounds in KFD o

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey
On 10/22/19 2:28 PM, Yang, Philip wrote: > If device reset/suspend/resume failed for some reason, dqm lock is > hold forever and this causes deadlock. Below is a kernel backtrace when > application open kfd after suspend/resume failed. > > Instead of holding dqm lock in pre_reset and releasing dqm

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey
On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: > On 10/22/19 2:28 PM, Yang, Philip wrote: >> If device reset/suspend/resume failed for some reason, dqm lock is >> hold forever and this causes deadlock. Below is a kernel backtrace when >> application open kfd after

Re: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to Arcturus.

2019-10-22 Thread Grodzovsky, Andrey
019 4:48 AM >> To: amd-gfx@lists.freedesktop.org >> Cc: Chen, Guchun ; Zhou1, Tao >> ; Deucher, Alexander ; >> noreply-conflue...@amd.com; Quan, Evan ; >> Grodzovsky, Andrey >> Subject: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write >> support

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey
On 10/22/19 3:19 PM, Yang, Philip wrote: > > On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: >> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: >>> On 10/22/19 2:28 PM, Yang, Philip wrote: >>>> If device reset/suspend/resume failed for some reason, dqm lock is

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey
On 10/22/19 4:04 PM, Yang, Philip wrote: > > On 2019-10-22 3:36 p.m., Grodzovsky, Andrey wrote: >> On 10/22/19 3:19 PM, Yang, Philip wrote: >>> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: >>>> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: >>>

Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

2019-10-24 Thread Grodzovsky, Andrey
On 10/24/19 7:01 AM, Christian König wrote: Am 24.10.19 um 12:58 schrieb S, Shirish: [Why] Upon GPU reset, kernel cleans up already submitted jobs via drm_sched_cleanup_jobs. This schedules ib's via drm_sched_main()->run_job, leading to race condition of rings being ready or not, since during rese

Re: [PATCH 1/2] drm/sched: Set error to s_fence if HW job submission failed.

2019-10-25 Thread Grodzovsky, Andrey
On 10/25/19 4:44 AM, Christian König wrote: > Am 24.10.19 um 21:57 schrieb Andrey Grodzovsky: >> Problem: >> When run_job fails and HW fence returned is NULL we still signal >> the s_fence to avoid hangs but the user has no way of knowing if >> the actual HW job was ran and finished. >> >> Fix: >>

Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

2019-10-25 Thread Grodzovsky, Andrey
if (!ring->sched.ready) { + dump_stack(); dev_err(adev->dev, "couldn't schedule ib on ring <%s>\n", ring->name); return -EINVAL; On 10/24/2019 10:00 PM, Christian König wrote: Am 24.10.19 um 17:06 schrieb Grodzovsky, Andrey: On 10/24/19 7:01

Re: [PATCH 1/2] drm/sched: Set error to s_fence if HW job submission failed.

2019-10-25 Thread Grodzovsky, Andrey
On 10/25/19 11:55 AM, Koenig, Christian wrote: > Am 25.10.19 um 16:57 schrieb Grodzovsky, Andrey: >> On 10/25/19 4:44 AM, Christian König wrote: >>> Am 24.10.19 um 21:57 schrieb Andrey Grodzovsky: >>>> Problem: >>>> When run_job fails and HW fence returne

Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

2019-10-25 Thread Grodzovsky, Andrey
On 10/25/19 11:57 AM, Koenig, Christian wrote: Am 25.10.19 um 17:35 schrieb Grodzovsky, Andrey: On 10/25/19 5:26 AM, Koenig, Christian wrote: Am 25.10.19 um 11:22 schrieb S, Shirish: On 10/25/2019 2:23 PM, Koenig, Christian wrote: amdgpu_do_asic_reset starting to resume blocks ... amdgpu

Re: [PATCH] drm/sched: Fix passing zero to 'PTR_ERR' warning

2019-10-29 Thread Grodzovsky, Andrey
On 10/29/19 2:03 PM, Dan Carpenter wrote: > On Tue, Oct 29, 2019 at 11:04:44AM -0400, Andrey Grodzovsky wrote: >> Fix a static code checker warning. >> >> Signed-off-by: Andrey Grodzovsky >> --- >> drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions

Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

2019-10-30 Thread Grodzovsky, Andrey
That's good as proof of RCA, but I still think we should grab a dedicated lock inside the scheduler: the race is internal to scheduler code, so it is better to handle it there and make the fix apply to all drivers using it. Andrey On 10/30/19 4:44 AM, S, Shirish wrote: >>

Re: [PATCH] drm/amdgpu: dont schedule jobs while in reset

2019-10-30 Thread Grodzovsky, Andrey
On 10/30/19 6:22 AM, S, Shirish wrote: > On 10/30/2019 3:50 PM, Koenig, Christian wrote: >> Am 30.10.19 um 10:13 schrieb S, Shirish: >>> [Why] >>> >>> doing kthread_park()/unpark() from drm_sched_entity_fini >>> while GPU reset is in progress defeats all the purpose of >>> drm_sched_stop->kthread_

Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

2019-10-30 Thread Grodzovsky, Andrey
) hack in > drm_sched_entity_fini(). > > We could do this with a struct completion or convert the scheduler > from a thread to a work item. > > Regards, > Christian. > > Am 30.10.19 um 15:44 schrieb Grodzovsky, Andrey: >> That good  as proof of RCA but I still think we should

Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

2019-10-30 Thread Grodzovsky, Andrey
taking all those locks > in the right order. > > Christian. > > Am 30.10.19 um 15:56 schrieb Grodzovsky, Andrey: >> Can you elaborate on what is the tricky part with the lock ? I assumed >> we just use per scheduler lock. >> >> Andrey >> >> On 10/

Re: [PATCH] drm/amdgpu: dont schedule jobs while in reset

2019-10-30 Thread Grodzovsky, Andrey
Reviewed-by: Andrey Grodzovsky Andrey On 10/30/19 6:20 AM, Koenig, Christian wrote: > Am 30.10.19 um 10:13 schrieb S, Shirish: >> [Why] >> >> doing kthread_park()/unpark() from drm_sched_entity_fini >> while GPU reset is in progress defeats all the purpose of >

Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

2019-11-08 Thread Grodzovsky, Andrey
On 11/8/19 5:35 AM, Koenig, Christian wrote: > Hi Emily, > > exactly that can't happen. See here: > >>     /* Don't destroy jobs while the timeout worker is running */ >>     if (sched->timeout != MAX_SCHEDULE_TIMEOUT && >>     !cancel_delayed_work(&sched->work_tdr)) >>    

Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

2019-11-08 Thread Grodzovsky, Andrey
On 11/8/19 5:54 AM, Deng, Emily wrote: > Hi Christian, > Sorry, seems I understand wrong. And from the print, the free job's > thread is the same as job timeout thread. So seems have some issue in > function amdgpu_device_gpu_recover. I don't think it's correct, seems your prints just don

Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

2019-11-11 Thread Grodzovsky, Andrey
Thinking more about this claim - we assume here that if cancel_delayed_work returned true it guarantees that timeout work is not running, but it merely means there was a pending timeout work which was removed from the workqueue before its timer elapsed and so it didn't have a chance to be deque

Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

2019-11-25 Thread Grodzovsky, Andrey
o the issue Emily reported can be avoided. Andrey From: Deng, Emily Sent: 25 November 2019 16:44:36 To: Grodzovsky, Andrey Cc: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; steven.pr...@arm.com; Grodzovsky, Andrey Subjec

Re: [PATCH 2/5] drm: Add Reusable task barrier.

2019-12-12 Thread Grodzovsky, Andrey
[AMD Official Use Only - Internal Distribution Only] __ From: Christian König Sent: 12 December 2019 03:31 To: Alex Deucher; Grodzovsky, Andrey Cc: Deucher, Alexander; Ma, Le; Quan, Evan; amd-gfx list; Zhang, Hawking Subject: Re: [PATCH 2/5] drm: Add Reusable

Re: [PATCH 1/4] drm/scheduler: make sure timer is restarted

2018-10-16 Thread Grodzovsky, Andrey
Patches 1-3 Reviewed-by: Andrey Grodzovsky Patch 4 Acked-by: Andrey Grodzovsky Andrey On 10/16/2018 07:55 AM, Christian König wrote: > Make sure we always restart the timer after a timeout and remove the > device specific workarounds. > > Signed-off-by: Christian König > ---

Re: [PATCH 1/3] drm/sched: Add callback to mark if sched is ready to work.

2018-10-19 Thread Grodzovsky, Andrey
Eclipse Andrey On 10/19/2018 03:13 AM, Michel Dänzer wrote: > ... and the new line here. > > Which editor are you using? ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset

2018-10-19 Thread Grodzovsky, Andrey
On 10/19/2018 03:08 AM, Koenig, Christian wrote: > Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky: >> A ring might become unusable after reset, if that the case >> drm_sched_entity_select_rq will choose another, working rq >> to run the job if there is one. >> Also, skip recovery of ring which is

Re: [PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset

2018-10-19 Thread Grodzovsky, Andrey
That's my next step. Andrey On 10/19/2018 12:28 PM, Christian König wrote: From my testing looks like we can, compute ring 0 is dead but IB tests pass on other compute rings. Interesting, but I would rather investigate why compute ring 0 is dead while other still work.

Re: [PATCH v3 2/2] drm/amdgpu: Retire amdgpu_ring.ready flag v3

2018-10-23 Thread Grodzovsky, Andrey
On 10/23/2018 05:23 AM, Christian König wrote: > Am 22.10.18 um 22:46 schrieb Andrey Grodzovsky: >> Start using drm_gpu_scheduler.ready instead. >> >> v3: >> Add helper function to run ring test and set >> sched.ready flag status accordingly, clean explicit >> sched.ready sets from the IP specifi

Re: [PATCH v2 1/2] drm/sched: Add boolean to mark if sched is ready to work v2

2018-10-23 Thread Grodzovsky, Andrey
On 10/22/2018 05:33 AM, Koenig, Christian wrote: > Am 19.10.18 um 22:52 schrieb Andrey Grodzovsky: >> Problem: >> A particular scheduler may become unusable (underlying HW) after >> some event (e.g. GPU reset). If it's later chosen by >> the get free sched. policy a command will fail to be >> sub

Re: [PATCH] drm/amdgpu: Fix compute ring 1.0.0 failure after reset

2018-10-26 Thread Grodzovsky, Andrey
On 10/26/2018 04:05 AM, Christian König wrote: > Am 25.10.18 um 22:16 schrieb Andrey Grodzovsky: >> Problem: After GPU reset on dGPUs with gfx8 compute ring >> 1.0.0 fails to pass the ring test. Ring registers inspection >> shows that it's active and no hang is observed (rptr == wptr) >> No signi

Re: [PATCH 4/4] drm/amdgpu: remove messages from IB tests

2018-10-29 Thread Grodzovsky, Andrey
Reviewed-by: Andrey Grodzovsky Andrey On 10/29/2018 11:28 AM, Christian König wrote: > We already print an error message that an IB test failed in the common > code. > > Signed-off-by: Christian König > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 18 +++ >

Re: [PATCH 4/4] drm/amdgpu: remove messages from IB tests

2018-10-29 Thread Grodzovsky, Andrey
Typo, series is Reviewed-by: Andrey Grodzovsky Andrey On 10/29/2018 12:18 PM, Grodzovsky, Andrey wrote: > Reviewed-by: Andrey Grodzovsky > > Andrey > > > On 10/29/2018 11:28 AM, Christian König wrote: >> We already print an error message that an IB test fail

Re: [PATCH libdrm] amdgpu/test: Add illegal register and memory access test.

2018-10-31 Thread Grodzovsky, Andrey
On 10/31/2018 03:49 PM, Alex Deucher wrote: > On Wed, Oct 31, 2018 at 2:33 PM Andrey Grodzovsky > wrote: >> Illegal access will cause CP hang followed by job timeout and >> recovery kicking in. >> Also, disable the suite for all APU ASICs until GPU >> reset issues for them will be resolved and G

Re: [PATCH libdrm] amdgpu/test: Add illegal register and memory access test.

2018-10-31 Thread Grodzovsky, Andrey
On 10/31/2018 03:49 PM, Alex Deucher wrote: > On Wed, Oct 31, 2018 at 2:33 PM Andrey Grodzovsky > wrote: >> Illegal access will cause CP hang followed by job timeout and >> recovery kicking in. >> Also, disable the suite for all APU ASICs until GPU >> reset issues for them will be resolved and G

Re: [PATCH libdrm] amdgpu/test: Add illegal register and memory access test.

2018-11-02 Thread Grodzovsky, Andrey
On 11/02/2018 10:24 AM, Michel Dänzer wrote: > On 2018-10-31 7:33 p.m., Andrey Grodzovsky wrote: >> Illegal access will cause CP hang followed by job timeout and >> recovery kicking in. >> Also, disable the suite for all APU ASICs until GPU >> reset issues for them will be resolved and GPU reset

Re: [PATCH libdrm] amdgpu/test: Add illegal register and memory access test.

2018-11-02 Thread Grodzovsky, Andrey
On 11/02/2018 02:12 PM, Alex Deucher wrote: > On Fri, Nov 2, 2018 at 11:59 AM Grodzovsky, Andrey > wrote: >> >> >> On 11/02/2018 10:24 AM, Michel Dänzer wrote: >>> On 2018-10-31 7:33 p.m., Andrey Grodzovsky wrote: >>>> Illegal access will cause CP h

Re: [PATCH libdrm] amdgpu/test: Add illegal register and memory access test.

2018-11-02 Thread Grodzovsky, Andrey
There is a pplib messaging related failure currently during GPU reset. I will put this issue on my TODO list for later time after handling more prioritized stuff and will disable the deadlock test suite for all non dGPU gfx8/9 ASICs until then. Andrey On 11/02/2018 02:14 PM, Grodzovsky

Re: [PATCH] drm/amdgpu: Each PSP need to get latest topology info on XGMI configuration

2018-11-09 Thread Grodzovsky, Andrey
Reviewed-by: Andrey Grodzovsky Question - shouldn't we also set psp_xgmi_node_info.is_sharing_enabled to 1 to enable FB sharing ? Andrey On 11/08/2018 11:14 AM, Liu, Shaoyun wrote: > From: shaoyunl > > Driver need to call each psp instance to get topology info before set topology > > Change-

Re: [PATCH 3/5] drm/amdgpu: Refactor amdgpu_xgmi_add_device

2018-11-21 Thread Grodzovsky, Andrey
On 11/21/2018 02:29 PM, Alex Deucher wrote: > On Wed, Nov 21, 2018 at 1:11 PM Andrey Grodzovsky > wrote: >> This is prep work for updating each PSP FW in hive after >> GPU reset. >> Split into build topology SW state and update each PSP FW in the hive. >> Save topology and count of XGMI devices

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-21 Thread Grodzovsky, Andrey
Depends what was the reason for triggering the reset for that node how do we know ? If the reason was RAS error that probably not hard to check all errors are cleared, but if the reason was job timeout on that specific node I will need to recheck that no jobs are left in incomplete state.

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-22 Thread Grodzovsky, Andrey
verything together again and start the > scheduler to go on with job submission. > > Christian. > > Am 21.11.18 um 23:02 schrieb Grodzovsky, Andrey: >> Depends what was the reason for triggering the reset for that node how >> do we know ? >> If the reason was RAS erro

Re: [PATCH 2/2] drm/amd/display: Remove wait for hw/flip done in atomic check

2018-11-22 Thread Grodzovsky, Andrey
On 11/22/2018 12:34 PM, Nicholas Kazlauskas wrote: > [Why] > Atomic check can't truly be non-blocking if amdgpu_dm is waiting for > hw_done and flip_done in atomic check. This introduces waits when > any previous non-blocking commits queued work on a worker thread and > a new atomic commit attemp

Re: [PATCH 2/2] drm/amd/display: Remove wait for hw/flip done in atomic check

2018-11-22 Thread Grodzovsky, Andrey
On 11/22/2018 02:43 PM, Kazlauskas, Nicholas wrote: > On 11/22/18 2:39 PM, Grodzovsky, Andrey wrote: >> >> On 11/22/2018 12:34 PM, Nicholas Kazlauskas wrote: >>> [Why] >>> Atomic check can't truly be non-blocking if amdgpu_dm is waiting for >>

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-22 Thread Grodzovsky, Andrey
On 11/22/2018 02:03 PM, Christian König wrote: > Am 22.11.18 um 16:44 schrieb Grodzovsky, Andrey: >> >> On 11/22/2018 06:16 AM, Christian König wrote: >>> How about using a lock per hive and then acquiring that with trylock() >>> instead? >>> >>

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-26 Thread Grodzovsky, Andrey
t and do it for each driver in between scheduler deactivation and activation back ? Andrey On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote: Additional to that I would try improve the pre, middle, post handling towards checking if we made some progress in between. In other words we stop all sc

Re: [PATCH v3 0/3] Add support for XGMI hive reset

2018-11-28 Thread Grodzovsky, Andrey
Ping... Andrey On 11/27/2018 01:37 PM, Andrey Grodzovsky wrote: > This set of patches adds support to reset entire XGMI hive > when reset is required. > > Patches 1-2 refactoring a bit the XGMI infrastructure as > preparation for the actual hive reset change. > > Patch 5 is GPU reset/recovery ref

Re: [PATCH 1/2] drm/amdgpu: Handle xgmi device removal and add reset wq.

2018-11-30 Thread Grodzovsky, Andrey
On 11/30/2018 04:03 AM, Christian König wrote: > Am 29.11.18 um 21:36 schrieb Andrey Grodzovsky: >> XGMI hive has some resources allocated on device init which >> needs to be deallocated when the device is unregistered. >> >> Add per hive wq to allow all the nodes in hive to run resets >> concuren

Re: [PATCH 1/2] drm/amdgpu: Handle xgmi device removal and add reset wq.

2018-11-30 Thread Grodzovsky, Andrey
On 11/30/2018 10:53 AM, Koenig, Christian wrote: > Am 30.11.18 um 16:14 schrieb Grodzovsky, Andrey: >> On 11/30/2018 04:03 AM, Christian König wrote: >>> Am 29.11.18 um 21:36 schrieb Andrey Grodzovsky: >>>> XGMI hive has some resources allocated on device init whic

Re: [PATCH v2 2/3] drm/amdgpu: Handle xgmi device removal.

2018-11-30 Thread Grodzovsky, Andrey
On 11/30/2018 02:49 PM, Alex Deucher wrote: > On Fri, Nov 30, 2018 at 1:17 PM Andrey Grodzovsky > wrote: >> XGMI hive has some resources allocated on device init which >> needs to be deallocated when the device is unregistered. >> >> v2: Remove creation of dedicated wq for XGMI hive reset. >> >>

Re: [PATCH v2 2/3] drm/amdgpu: Handle xgmi device removal.

2018-11-30 Thread Grodzovsky, Andrey
On 11/30/2018 03:08 PM, Alex Deucher wrote: > On Fri, Nov 30, 2018 at 3:06 PM Grodzovsky, Andrey > wrote: >> >> >> On 11/30/2018 02:49 PM, Alex Deucher wrote: >>> On Fri, Nov 30, 2018 at 1:17 PM Andrey Grodzovsky >>> wrote: >>>> XGMI

Re: [PATCH] drm/amdgpu: add a xgmi supported flag

2018-11-30 Thread Grodzovsky, Andrey
On 11/30/2018 03:30 PM, Alex Deucher wrote: > Use this to track whether an asic supports xgmi rather than > checking the asic type everywhere. > > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 + > drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 4 ++-- > driver

Re: [PATCH] drm/amdgpu: add a xgmi supported flag

2018-11-30 Thread Grodzovsky, Andrey
Reviewed-by: Andrey Grodzovsky Andrey On 11/30/2018 03:36 PM, Alex Deucher wrote: > On Fri, Nov 30, 2018 at 3:34 PM Grodzovsky, Andrey > wrote: >> >> >> On 11/30/2018 03:30 PM, Alex Deucher wrote: >>> Use this to track whether an asic supports xgmi rathe

Re: [PATCH] drm/amd/display: Add fast path for cursor plane updates

2018-12-05 Thread Grodzovsky, Andrey
On 12/05/2018 02:59 PM, Nicholas Kazlauskas wrote: > [Why] > Legacy cursor plane updates from drm helpers go through the full > atomic codepath. A high volume of cursor updates through this slow > code path can cause subsequent page-flips to skip vblank intervals > since each individual update is

Re: [PATCH] drm/amd/display: Add fast path for cursor plane updates

2018-12-05 Thread Grodzovsky, Andrey
On 12/05/2018 03:42 PM, Kazlauskas, Nicholas wrote: > On 2018-12-05 3:26 p.m., Grodzovsky, Andrey wrote: >> >> On 12/05/2018 02:59 PM, Nicholas Kazlauskas wrote: >>> [Why] >>> Legacy cursor plane updates from drm helpers go through the full >>> atomic

Re: [PATCH] drm/amd/display: Add fast path for cursor plane updates

2018-12-06 Thread Grodzovsky, Andrey
Not an expert on Freesync so maybe stupid question but from the comment looks like this pipe locking is only for the sake of Freesync mode there - why is it then called unconditionally w/o checking if you even run in Freesync mode ? Andrey On 12/06/2018 08:42 AM, Kazlauskas, Nicholas wrote: >

Re: [PATCH] drm/amd/display: Add fast path for cursor plane updates

2018-12-06 Thread Grodzovsky, Andrey
Ok - the change is Acked-by: Andrey Grodzovsky Andrey On 12/06/2018 10:59 AM, Nicholas Kazlauskas wrote: > On 2018-12-06 10:36 a.m., Grodzovsky, Andrey wrote: >> Not an expert on Freesync so maybe stupid question but from the comment >> looks like this pipe locking is only

Re: [PATCH 2/2] drm/sched: Rework HW fence processing.

2018-12-06 Thread Grodzovsky, Andrey
On 12/06/2018 12:41 PM, Andrey Grodzovsky wrote: > Expedite job deletion from ring mirror list to the HW fence signal > callback instead from finish_work, together with waiting for all > such fences to signal in drm_sched_stop we guarantee that > already signaled job will not be processed twice. >

Re: [PATCH 1/2] drm/sched: Refactor ring mirror list handling.

2018-12-06 Thread Grodzovsky, Andrey
On 12/06/2018 01:33 PM, Christian König wrote: > Am 06.12.18 um 18:41 schrieb Andrey Grodzovsky: >> Decouple sched threads stop and start and ring mirror >> list handling from the policy of what to do about the >> guilty jobs. >> When stopping the sched thread and detaching sched fences >> from

Re: [PATCH 2/2] drm/sched: Rework HW fence processing.

2018-12-07 Thread Grodzovsky, Andrey
On 12/07/2018 03:19 AM, Christian König wrote: > Am 07.12.18 um 04:18 schrieb Zhou, David(ChunMing): >> >>> -Original Message- >>> From: dri-devel On Behalf Of >>> Andrey Grodzovsky >>> Sent: Friday, December 07, 2018 1:41 AM >>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedes

Re: [PATCH 1/1] drm/amdgpu: Fix stub function name

2018-12-10 Thread Grodzovsky, Andrey
Acked-by: Andrey Grodzovsky Andrey On 12/10/2018 04:29 PM, Kuehling, Felix wrote: > This function was renamed in a previous commit. Update the stub > function name for builds with CONFIG_HSA_AMD disabled. > > Fixes: 62f65d3cb34a ("drm/amdgpu: Add KFD VRAM limit checking

Re: [PATCH v3 2/2] drm/sched: Rework HW fence processing.

2018-12-11 Thread Grodzovsky, Andrey
dzovsky >> Sent: Tuesday, December 11, 2018 5:44 AM >> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; >> ckoenig.leichtzumer...@gmail.com; e...@anholt.net; >> etna...@lists.freedesktop.org >> Cc: Zhou, David(ChunMing) ; Liu, Monk >> ; Grodzovsky, Andrey

Re: [PATCH libdrm] amdgpu/test: Enable deadlock test for CI family (gfx7)

2018-12-11 Thread Grodzovsky, Andrey
np Andrey On 12/11/2018 03:18 PM, Alex Deucher wrote: > On Tue, Dec 11, 2018 at 3:13 PM Andrey Grodzovsky > wrote: >> I retested GPU recovery with Bonaire ASIC and it works. >> >> Signed-off-by: Andrey Grodzovsky > Reviewed-by: Alex Deucher > > Care to enable it in the kernel as well? > > Al

Re: [PATCH v3 2/2] drm/sched: Rework HW fence processing.

2018-12-12 Thread Grodzovsky, Andrey
ote: > Yeah, completely correct explained. > > I was unfortunately really busy today, but going to give that a look > as soon as I have time. > > Christian. > > Am 11.12.18 um 17:01 schrieb Grodzovsky, Andrey: >> As I understand you say that by the time the fence callback r

Re: [PATCH v3 2/2] drm/sched: Rework HW fence processing.

2018-12-14 Thread Grodzovsky, Andrey
Just a reminder. Any new comments in light of all the discussion ? Andrey On 12/12/2018 08:08 AM, Grodzovsky, Andrey wrote: > BTW, the problem I pointed out with drm_sched_entity_kill_jobs_cb is not > an issue with this patch set since it removes the cb from > s_fence->finished in g

Re: [PATCH] drm/amd/display: Skip fast cursor updates for fb changes

2018-12-14 Thread Grodzovsky, Andrey
On 12/14/2018 12:26 PM, Nicholas Kazlauskas wrote: > [Why] > The behavior of drm_atomic_helper_cleanup_planes differs depending on > whether the commit was asynchronous or not. When it's called from > amdgpu_dm_atomic_commit_tail during a typical atomic commit the > plane state has been swapped s

Re: [PATCH] drm/amd/display: Skip fast cursor updates for fb changes

2018-12-14 Thread Grodzovsky, Andrey
On 12/14/2018 12:41 PM, Kazlauskas, Nicholas wrote: > On 12/14/18 12:34 PM, Grodzovsky, Andrey wrote: >> >> On 12/14/2018 12:26 PM, Nicholas Kazlauskas wrote: >>> [Why] >>> The behavior of drm_atomic_helper_cleanup_planes differs depending on >>> whethe

Re: [PATCH] drm/amd/display: Skip fast cursor updates for fb changes

2018-12-14 Thread Grodzovsky, Andrey
In general I agree with Michel that a DRM solution is required to properly address this, but since it's not really obvious right now what the proper solution is, it seems OK to me to go with this fix until one is found. Reviewed-by: Andrey Grodzovsky Andrey On 12/14/2018 12:51 PM, Kazlauskas

Re: [PATCH] drm/amdgpu: unify Vega20 PSP SOS firmwares for A0 and A1

2018-12-14 Thread Grodzovsky, Andrey
With this change in the latest drm-next and the related commit in the latest FW I get [ 148.887374] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed [ 148.887535] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block failed -22 Had to revert to be able to boot. Andrey On

Re: [PATCH] drm/amd/display: Skip fast cursor updates for fb changes

2018-12-14 Thread Grodzovsky, Andrey
On 12/14/2018 02:17 PM, Kazlauskas, Nicholas wrote: > On 12/14/18 2:06 PM, Grodzovsky, Andrey wrote: >> In general I agree with Michel that  DRM solution is required to >> properly address this but since now it's not really obvious what is the >> proper solution it seem

Re: [PATCH] drm/amd/display: Skip fast cursor updates for fb changes

2018-12-17 Thread Grodzovsky, Andrey
On 12/17/2018 04:53 AM, Michel Dänzer wrote: > On 2018-12-15 6:25 a.m., Grodzovsky, Andrey wrote: >> On 12/14/2018 02:17 PM, Kazlauskas, Nicholas wrote: >>> On 12/14/18 2:06 PM, Grodzovsky, Andrey wrote: >>>> In general I agree with Michel that  DRM solution is req

Re: [PATCH v3 1/2] drm/sched: Refactor ring mirror list handling.

2018-12-17 Thread Grodzovsky, Andrey
On 12/17/2018 10:27 AM, Christian König wrote: > Am 10.12.18 um 22:43 schrieb Andrey Grodzovsky: >> Decouple sched threads stop and start and ring mirror >> list handling from the policy of what to do about the >> guilty jobs. >> When stopping the sched thread and detaching sched fences >> from

Re: After Vega 56/64 GPU hang I unable reboot system

2018-12-17 Thread Grodzovsky, Andrey
On 12/17/2018 01:51 PM, Wentland, Harry wrote: > On 2018-12-15 4:42 a.m., Mikhail Gavrilov wrote: >> On Sat, 15 Dec 2018 at 00:36, Wentland, Harry wrote: >>> Looks like there's an error before this happens that might get us into this >>> mess: >>> >>> [ 229.741741] [drm:amdgpu_job_timedout [am

Re: [PATCH 5/5] drm/amd/display: Move the dm update dance to crtc->atomic_check

2018-12-18 Thread Grodzovsky, Andrey
On 12/18/2018 10:26 AM, sunpeng...@amd.com wrote: > From: Leo Li > > drm_atomic_helper_check_planes() calls the crtc atomic check helpers. In > an attempt to better align with the DRM framework, we can move the > entire dm_update dance to the crtc check helper (since it essentially > checks that

Re: [PATCH 5/5] drm/amd/display: Move the dm update dance to crtc->atomic_check

2018-12-18 Thread Grodzovsky, Andrey
On 12/18/2018 12:09 PM, Kazlauskas, Nicholas wrote: > On 12/18/18 10:26 AM, sunpeng...@amd.com wrote: >> From: Leo Li >> >> drm_atomic_helper_check_planes() calls the crtc atomic check helpers. In >> an attempt to better align with the DRM framework, we can move the >> entire dm_update dance to

Re: [PATCH 5/5] drm/amd/display: Move the dm update dance to crtc->atomic_check

2018-12-19 Thread Grodzovsky, Andrey
On 12/19/2018 08:54 AM, Kazlauskas, Nicholas wrote: > On 12/18/18 3:12 PM, Grodzovsky, Andrey wrote: >> >> On 12/18/2018 10:26 AM, sunpeng...@amd.com wrote: >>> From: Leo Li >>> >>> drm_atomic_helper_check_planes() calls the crtc atomic check helpers.

Re: [PATCH v4 1/2] drm/sched: Refactor ring mirror list handling.

2018-12-19 Thread Grodzovsky, Andrey
On 12/19/2018 11:21 AM, Christian König wrote: > Am 17.12.18 um 20:51 schrieb Andrey Grodzovsky: >> Decouple sched threads stop and start and ring mirror >> list handling from the policy of what to do about the >> guilty jobs. >> When stopping the sched thread and detaching sched fences >> from

Re: After Vega 56/64 GPU hang I unable reboot system

2018-12-19 Thread Grodzovsky, Andrey
+Tom Andrey On 12/19/2018 01:35 PM, Mikhail Gavrilov wrote: > On Tue, 18 Dec 2018 at 00:08, Grodzovsky, Andrey > wrote: >> Please install UMR and dump gfx ring content and waves after the hang is >> happening. >> >> UMR at - https://cgit.freedesktop.org/amd/umr

Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-21 Thread Grodzovsky, Andrey
I believe this issue would be resolved by my pending, in-review patch set, specifically 'drm/sched: Refactor ring mirror list handling.' since already on the first TO handler it will go over all the rings including the second timed out ring and will remove all callbacks including the bad job c

Re: [PATCH v5 1/2] drm/sched: Refactor ring mirror list handling.

2018-12-21 Thread Grodzovsky, Andrey
On 12/21/2018 01:37 PM, Christian König wrote: > Am 20.12.18 um 20:23 schrieb Andrey Grodzovsky: >> Decouple sched threads stop and start and ring mirror >> list handling from the policy of what to do about the >> guilty jobs. >> When stopping the sched thread and detaching sched fences >> from

Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-27 Thread Grodzovsky, Andrey
HW fence processing'. > Now there was still much Call-Trace in new osdb triggered in > dma_fence_set_error. Do you have link for these patches? > Thanks. > > BR, > Wentao > > > -Original Message- > From: Grodzovsky, Andrey > Sent: Saturday, December 22
