Re: [PATCH AUTOSEL 4.19 29/53] drm/amdkfd: Add picasso pci id

2019-04-27 Thread Deucher, Alexander
NACK.  4.19 did not contain support for picasso. Please drop this patch for 
4.19.

Alex

From: amd-gfx on behalf of Sasha Levin
Sent: Friday, April 26, 2019 9:40 PM
To: linux-ker...@vger.kernel.org; sta...@vger.kernel.org
Cc: Deucher, Alexander; Sasha Levin; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org
Subject: [PATCH AUTOSEL 4.19 29/53] drm/amdkfd: Add picasso pci id

From: Alex Deucher 

[ Upstream commit e7ad88553aa1d48e950ca9a4934d246c1bee4be4 ]

Picasso is a new raven variant.

Reviewed-by: Kent Russell 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5aba50f63ac6..06d19bb89cfc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -275,6 +275,7 @@ static const struct kfd_deviceid supported_devices[] = {
 { 0x9876, &carrizo_device_info },   /* Carrizo */
 { 0x9877, &carrizo_device_info },   /* Carrizo */
 { 0x15DD, &raven_device_info }, /* Raven */
+   { 0x15D8, &raven_device_info }, /* Raven */
 #endif
 { 0x67A0, &hawaii_device_info },/* Hawaii */
 { 0x67A1, &hawaii_device_info },/* Hawaii */
--
2.19.1
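
(For context: amdkfd decides whether to bind to a GPU by scanning this supported_devices table for the PCI device ID, which is why Picasso support is a one-line table addition. Below is a minimal userspace sketch of that lookup; the struct fields and the lookup_device_info helper are illustrative stand-ins, not the actual kfd code.)

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct kfd_device_info { const char *asic_name; };

static const struct kfd_device_info raven_device_info = { "raven" };

struct kfd_deviceid {
	uint16_t did;                        /* PCI device ID */
	const struct kfd_device_info *info;
};

static const struct kfd_deviceid supported_devices[] = {
	{ 0x15DD, &raven_device_info },      /* Raven */
	{ 0x15D8, &raven_device_info },      /* Picasso, a Raven variant */
};

/* Linear scan, as the driver does at probe time; NULL means the
 * device is not supported by KFD. */
static const struct kfd_device_info *lookup_device_info(uint16_t did)
{
	for (size_t i = 0; i < sizeof(supported_devices) / sizeof(supported_devices[0]); i++)
		if (supported_devices[i].did == did)
			return supported_devices[i].info;
	return NULL;
}

int main(void)
{
	const struct kfd_device_info *info = lookup_device_info(0x15D8);

	printf("0x15D8 -> %s\n", info ? info->asic_name : "unsupported");
	return 0;
}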

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already signaled.

2019-04-27 Thread Zhou, David(ChunMing)
Sorry, I can only put my Acked-by: Chunming Zhou on patch #3.

I cannot fully judge patches #4, #5 and #6.

-David

From: amd-gfx On Behalf Of Grodzovsky, Andrey
Sent: Friday, April 26, 2019 10:09 PM
To: Koenig, Christian; Zhou, David(ChunMing); dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; e...@anholt.net; etna...@lists.freedesktop.org
Cc: Kazlauskas, Nicholas; Liu, Monk
Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already 
signaled.


Ping (mostly David and Monk).

Andrey
On 4/24/19 3:09 AM, Christian König wrote:
On 24.04.19 at 05:02, Zhou, David(ChunMing) wrote:
>> -drm_sched_stop(&ring->sched, &job->base);
>> -
>>   /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>   amdgpu_fence_driver_force_completion(ring);
>>   }

The HW fences are already force-completed, so I think we can just disable IRQ
fence processing and ignore HW fence signals while we are trying to do the GPU
reset. Otherwise the logic will become much more complex.
If this situation happens because of long execution times, we can increase the
timeout used for reset detection.

You are not thinking widely enough: forcing the HW fence to complete can
trigger other parts of the system to start new activity.

We first need to stop everything and make sure that we don't do any processing
any more, and only then start with our reset procedure, including forcing all
HW fences to complete.

Christian.
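
(A toy, compile-and-run sketch of the ordering Christian describes, with made-up ring bookkeeping: quiesce every scheduler first, and only then force-complete the HW fences, so nothing can react to the forced signals.)

#include <stdbool.h>
#include <stdio.h>

#define NUM_RINGS 3

struct ring {
	bool sched_stopped;
	bool fence_force_completed;
};

static struct ring rings[NUM_RINGS];

/* Phase 1: stop every scheduler so no new jobs are picked up and no
 * completion processing runs while the reset is in progress. */
static void stop_all_schedulers(void)
{
	for (int i = 0; i < NUM_RINGS; i++)
		rings[i].sched_stopped = true;
}

/* Phase 2: only once everything is quiescent, force-complete the HW
 * fences as part of the actual reset procedure. */
static void force_complete_all_fences(void)
{
	for (int i = 0; i < NUM_RINGS; i++)
		rings[i].fence_force_completed = true;
}

int main(void)
{
	stop_all_schedulers();          /* must come first */
	force_complete_all_fences();    /* then the reset work */
	printf("stopped, then force-completed %d rings\n", NUM_RINGS);
	return 0;
}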



-David

From: amd-gfx On Behalf Of Grodzovsky, Andrey
Sent: Wednesday, April 24, 2019 12:00 AM
To: Zhou, David(ChunMing); dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; e...@anholt.net; etna...@lists.freedesktop.org; ckoenig.leichtzumer...@gmail.com
Cc: Kazlauskas, Nicholas; Liu, Monk

Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already 
signaled.


No, I mean the actual HW fence which signals when the job finishes execution on
the HW.

Andrey
On 4/23/19 11:19 AM, Zhou, David(ChunMing) wrote:
Do you mean the fence timer? Why not stop it as well when stopping the
scheduler for the HW reset?

 Original Message 
Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already 
signaled.
From: "Grodzovsky, Andrey"
To: "Zhou, David(ChunMing)" 
,dri-de...@lists.freedesktop.org,amd-gfx@lists.freedesktop.org,e...@anholt.net,etna...@lists.freedesktop.org,ckoenig.leichtzumer...@gmail.com
CC: "Kazlauskas, Nicholas" ,"Liu, Monk"

On 4/22/19 9:09 AM, Zhou, David(ChunMing) wrote:
> +Monk.
>
> GPU reset is used widely in SRIOV, so a virtualization guy needs to take a look.
>
> But out of curiosity: why can the guilty job still signal if it is already
> marked guilty? Was it marked wrongly?
>
>
> -David


It's possible that the job completes at a later time than when its
timeout handler started processing, so in this patch we try to protect
against this by rechecking the HW fence after stopping all SW
schedulers. We do it BEFORE marking the job's sched_entity as guilty,
so at the point we check, the guilty flag is not set yet.

Andrey
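
(A small compile-and-run sketch of the recheck Andrey describes; needs_hw_reset, stop_all_schedulers and the job struct are hypothetical stand-ins for the amdgpu code, not its actual API.)

#include <stdbool.h>
#include <stdio.h>

struct job {
	bool hw_fence_signaled;
	bool guilty;
};

/* Placeholder: in the real driver this stops the drm_sched instances
 * across the device (and the whole XGMI hive). */
static void stop_all_schedulers(void) { }

/* The job may have completed between the timeout firing and this
 * handler running, so recheck the HW fence after the schedulers are
 * stopped and before marking the job guilty. */
static bool needs_hw_reset(struct job *job)
{
	stop_all_schedulers();

	if (job->hw_fence_signaled)
		return false;           /* late completion: skip the HW reset */

	job->guilty = true;             /* really hung: go ahead with the reset */
	return true;
}

int main(void)
{
	struct job j = { .hw_fence_signaled = true, .guilty = false };

	printf("HW reset needed: %s\n", needs_hw_reset(&j) ? "yes" : "no");
	return 0;
}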


>
> On 2019/4/18 23:00, Andrey Grodzovsky wrote:
>> Also reject TDRs if another one is already running.
>>
>> v2:
>> Stop all schedulers across the device and the entire XGMI hive before
>> force signaling HW fences.
>> Avoid passing job_signaled to helper functions to keep all the decision
>> making about skipping HW reset in one place.
>>
>> v3:
>> Fix SW sched. hang after non-HW reset. sched.hw_rq_count has to be balanced
>> against its decrement in drm_sched_stop in the non-HW reset case.
>> v4: rebase
>> v5: Revert v3 as we do it now in scheduler code.
>>
>> Signed-off-by: Andrey Grodzovsky 
>> 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 143 
>> +++--
>>1 file changed, 95 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index a0e165c..85f8792 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3334,8 +3334,6 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>   if (!ring || !ring->sched.thread)
>>   continue;
>>
>> -drm_sched_stop(&ring->sched, &job->base);
>> -
>>   /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>   amdgpu_fence_driver_force_completion(ring);

RE: [PATCH] drm/amdgpu: support gpu recovery tests on compute rings

2019-04-27 Thread Quan, Evan
How about amdgpu.lockup_timeout=non-compute-jobs[,gfx,sdma,decode,encode][:compute-jobs]?
This will not break backward compatibility.

And I’m not sure how to map “decode” and “encode” to the uvd/vce/vcn rings,
since there are many rings related to these IPs (uvd, uvd_enc, vce, vcn_dec,
vcn_enc, vcn_jpeg).
Maybe we should use the IP name (uvd, vce or vcn) instead of “decode/encode”?

Regards,
Evan
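
(A minimal sketch of the IP-name bucketing Evan suggests: map every ring to its parent IP so all of an IP's rings share one timeout value. The ring list mirrors the names above, but the table and the ip_for_ring helper are illustrative only.)

#include <stdio.h>
#include <string.h>

/* One timeout bucket per IP block rather than per decode/encode role. */
static const struct { const char *ring; const char *ip; } ring_to_ip[] = {
	{ "uvd",      "uvd" },
	{ "uvd_enc",  "uvd" },
	{ "vce",      "vce" },
	{ "vcn_dec",  "vcn" },
	{ "vcn_enc",  "vcn" },
	{ "vcn_jpeg", "vcn" },
};

static const char *ip_for_ring(const char *ring)
{
	for (size_t i = 0; i < sizeof(ring_to_ip) / sizeof(ring_to_ip[0]); i++)
		if (!strcmp(ring_to_ip[i].ring, ring))
			return ring_to_ip[i].ip;
	return NULL;
}

int main(void)
{
	const char *ip = ip_for_ring("vcn_jpeg");

	printf("vcn_jpeg -> %s\n", ip ? ip : "unknown");
	return 0;
}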
From: amd-gfx On Behalf Of Deucher, Alexander
Sent: 2019年4月26日 22:24
To: Michel Dänzer; Quan, Evan; Koenig, Christian
Cc: Xu, Feifei; Cui, Flora; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: support gpu recovery tests on compute rings

How about an interface to change the timeout on a per engine (gfx, compute, 
dma, etc.) basis?
amdgpu.lockup_timeout=<global>[,<gfx>,<compute>,<dma>]
If only one parameter is given, we change it globally.  If more are given, we
override the global one.  Could also do a sysfs interface to change it on the
fly.

Alex
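
(A small userspace sketch of the parsing Alex describes, assuming three engine classes for illustration: a single value keeps today's global meaning, and any extra values override it per engine.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_ENGINES 3                   /* gfx, compute, dma: assumed here */

/* Parse "global[,gfx,compute,dma]". The first value keeps today's
 * global meaning; later values override it per engine. */
static void parse_lockup_timeout(const char *arg, long timeouts[NUM_ENGINES])
{
	char buf[128];
	long vals[NUM_ENGINES];
	int n = 0;

	strncpy(buf, arg, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (char *tok = strtok(buf, ","); tok && n < NUM_ENGINES;
	     tok = strtok(NULL, ","))
		vals[n++] = strtol(tok, NULL, 10);

	if (n == 0)
		vals[n++] = 10000;      /* fall back to the current default */

	for (int i = 0; i < NUM_ENGINES; i++)
		timeouts[i] = i < n ? vals[i] : vals[0];
}

int main(void)
{
	long t[NUM_ENGINES];

	parse_lockup_timeout("10000,0", t);
	printf("gfx=%ld compute=%ld dma=%ld\n", t[0], t[1], t[2]);
	return 0;
}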

From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of Michel Dänzer <mic...@daenzer.net>
Sent: Friday, April 26, 2019 4:35 AM
To: Quan, Evan; Koenig, Christian
Cc: Xu, Feifei; Cui, Flora; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: support gpu recovery tests on compute rings

On 2019-04-26 10:20 a.m., Quan, Evan wrote:
> My concern is there is already one module parameter "lockup_timeout".
> parm:   lockup_timeout:GPU lockup timeout in ms > 0 (default 10000) (int)
>
> Adding one more "timeout" seems redundant.
> And that will make the description of "lockup_timeout" (which seems to apply
> to all jobs) not match its real effect (it affects only non-compute jobs).
>
> A better way would be to rename "lockup_timeout" to "non-compute
> lockup_timeout". But I do not think we can change an existing module
> parameter. Right?

Right. Also, there are already too many amdgpu module parameters; we
should try to remove some rather than adding new ones for every little
thing that could be tweaked. :)

One possibility might be to optionally allow passing multiple values to
lockup_timeout, e.g.

 amdgpu.lockup_timeout=10000,0

The first value would need to have the same meaning as now for backwards
compatibility.


--
Earthling Michel Dänzer   |  https://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx