Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

2021-08-31 Thread Christian König
On 01.09.21 at 02:46, Monk Liu wrote: issue: in cleanup_job, cancel_delayed_work will cancel a TO timer even if its corresponding job is still running. fix: do not cancel the timer in cleanup_job; instead do the cancelling only when the heading job is signaled, and if there is a "next" job

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky
On 2021-09-01 12:40 a.m., Jingwen Chen wrote: On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote: On 2021-09-01 12:25 a.m., Jingwen Chen wrote: On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: I will answer everything here - On 2021-08-31 9:58 p.m., Liu, Monk

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Jingwen Chen
On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote: > > On 2021-09-01 12:25 a.m., Jingwen Chen wrote: > > On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: > > > I will answer everything here - > > > > > > On 2021-08-31 9:58 p.m., Liu, Monk wrote:

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky
On 2021-09-01 12:25 a.m., Jingwen Chen wrote: On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: I will answer everything here - On 2021-08-31 9:58 p.m., Liu, Monk wrote: [AMD Official Use Only] In the previous discussion, you guys stated that we should dr

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Jingwen Chen
On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: > I will answer everything here - > > On 2021-08-31 9:58 p.m., Liu, Monk wrote: > > > [AMD Official Use Only] > > > > In the previous discussion, you guys stated that we should drop the > “kthread_should_park”

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky
I will answer everything here - On 2021-08-31 9:58 p.m., Liu, Monk wrote: [AMD Official Use Only] In the previous discussion, you guys stated that we should drop the “kthread_should_park” in cleanup_job. @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) {  

[PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-08-31 Thread Alex Sierra
During the svm restore pages interrupt handler, the kfd_process ref count was never dropped when xnack was disabled. Therefore, the object was never released. Signed-off-by: Alex Sierra --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers
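For context, a hedged sketch of the fix this summary describes: route the xnack-disabled early exit through a common label that drops the reference taken by the PASID lookup. Function and label names are illustrative, not the patch body:

/* Hedged sketch, not the literal diff. */
int svm_range_restore_pages_sketch(unsigned int pasid)
{
	struct kfd_process *p;
	int r = 0;

	p = kfd_lookup_process_by_pasid(pasid);	/* takes a reference */
	if (!p)
		return -ESRCH;

	if (!p->xnack_enabled) {
		r = -EFAULT;
		goto out;	/* was a bare return, leaking the ref */
	}

	/* ... restore-pages fault handling ... */
out:
	kfd_unref_process(p);	/* release the lookup reference */
	return r;
}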

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] In the previous discussion, you guys stated that we should drop the "kthread_should_park" in cleanup_job. @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) { struct drm_sched_job *job, *next; - /* -* Don't destroy jobs

RE: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] That really matters in practice: when two jobs from different processes are scheduled to the ring close to each other, if we don't discriminate A from B then B will be considered a bad job due to A's timeout, which will force B's process to exit (e.g.: X server) Thanks

RE: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] Okay, I will re-prepare this patch Thanks -- Monk Liu | Cloud-GPU Core team -- -Original Message- From: Daniel Vetter Sent: Tuesday, August 31, 2021 9:02 PM To: Liu, Monk Cc: amd-g

[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] Hi Daniel/Christian/Andrey It looks to me like the voices from you three are spread over those email floods; the feature we are working on (diagnostic TDR scheme) has been pending there for more than 6 months (we started it in Feb 2021). Honestly speaking, the email way that we ar

RE: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] >> Also why don't we reuse the function drivers already have to stop a >> scheduler thread? We seem to have two kthread_park now, that's probably one >> too much. Are you referring to drm_sched_stop? That's different; we don't need the logic from it, see that it goes thr

RE: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] >> This is a __ function, i.e. considered internal, and it's lockless atomic, >> i.e. unordered. And you're not explaining why this works. From what I can see it's not a traditional habit to put the explanation in code, but we can do that in mails. We want to park the schedul
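For readers of this sub-thread, a hedged sketch of the parking idea Monk is defending, assumed to mirror patch 2/2 under review; the exact guard is the point of contention:

/* Hedged sketch: the timeout handler parks the scheduler thread so
 * drm_sched_get_cleanup_job() cannot free the job under it. */
static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);

	if (!__kthread_should_park(sched->thread))	/* the lockless check at issue */
		kthread_park(sched->thread);		/* blocks until the loop pauses */

	/* ... inspect/abort the hanging job without racing cleanup ... */

	kthread_unpark(sched->thread);
}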

[PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Monk Liu
tested-by: jingwen chen Signed-off-by: Monk Liu Signed-off-by: jingwen chen --- drivers/gpu/drm/scheduler/sched_main.c | 24 1 file changed, 4 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c i

[PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

2021-08-31 Thread Monk Liu
issue: in cleanup_job, cancel_delayed_work will cancel a TO timer even if its corresponding job is still running. fix: do not cancel the timer in cleanup_job; instead do the cancelling only when the heading job is signaled, and if there is a "next" job we start_timeout again. v2: further clea
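A minimal sketch of the control flow the commit message describes (illustrative, not the v4 diff itself; names follow drm/scheduler):

/* Cancel the TO timer only once the head job has signalled, and
 * re-arm it if another job is queued behind it. */
static struct drm_sched_job *
sketch_get_cleanup_job(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job, *next;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
		list_del_init(&job->list);		/* head job is done */
		cancel_delayed_work(&sched->work_tdr);	/* its timer may go */
		next = list_first_entry_or_null(&sched->pending_list,
						struct drm_sched_job, list);
		if (next)
			drm_sched_start_timeout(sched);	/* time the next job */
	} else {
		job = NULL;				/* head still running */
	}
	spin_unlock(&sched->job_list_lock);
	return job;
}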

Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)

2021-08-31 Thread Grodzovsky, Andrey
What about removing (kthread_should_park()) ? We decided it's useless as far as I remember. Andrey From: amd-gfx on behalf of Liu, Monk Sent: 31 August 2021 20:24 To: Liu, Monk ; amd-gfx@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org Subject: RE:

RE: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)

2021-08-31 Thread Liu, Monk
[AMD Official Use Only] Ping Christian, Andrey Can we merge this patch first? This is a standalone patch for the timer. Thanks -- Monk Liu | Cloud-GPU Core team -- -Original Message- From: Monk Liu Sent

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-08-31 Thread Felix Kuehling
On 2021-08-31 6:09 p.m., Zeng, Oak wrote: A nit-pick inline. Otherwise this patch is Reviewed-by: Oak Zeng Regards, Oak On 2021-08-31, 5:57 PM, "amd-gfx on behalf of Felix Kuehling" wrote: On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version.

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-08-31 Thread Zeng, Oak
A nit-pick inline. Otherwise this patch is Reviewed-by: Oak Zeng Regards, Oak On 2021-08-31, 5:57 PM, "amd-gfx on behalf of Felix Kuehling" wrote: On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version. Add a firmware version check for this. The mi

[PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-08-31 Thread Felix Kuehling
On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version. Add a firmware version check for this. The minimum firmware version that works without atomics can be updated in the device_info structure for each GPU type. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/am
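A sketch of the check described; the field name no_atomic_fw_version follows the commit text, while the surrounding probe logic is an assumption:

/* Per-GPU device_info records the minimum MEC firmware version that
 * works without PCIe atomics; 0 means atomics are always required. */
struct kfd_device_info_sketch {
	bool needs_pci_atomics;
	uint32_t no_atomic_fw_version;
};

/* probe-time decision (illustrative) */
if (info->needs_pci_atomics &&
    (!info->no_atomic_fw_version ||
     mec_fw_version < info->no_atomic_fw_version)) {
	if (!pci_atomics_supported)
		return NULL;	/* KFD cannot run on this setup */
}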

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Luben Tuikov
On 2021-08-31 16:56, Andrey Grodzovsky wrote: > On 2021-08-31 12:01 p.m., Luben Tuikov wrote: >> On 2021-08-31 11:23, Andrey Grodzovsky wrote: >>> On 2021-08-31 10:38 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: > On 2021-08-31 10:03 a.m., D

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 12:01 p.m., Luben Tuikov wrote: On 2021-08-31 11:23, Andrey Grodzovsky wrote: On 2021-08-31 10:38 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM -04

[PATCH v7 8/8] nouveau: fold multiple DRM_DEBUG_DRIVERs together

2021-08-31 Thread Jim Cromie
With DRM_USE_DYNAMIC_DEBUG, each callsite record requires 56 bytes. We can combine 12 into one here and save ~620 bytes. Signed-off-by: Jim Cromie --- drivers/gpu/drm/nouveau/nouveau_drm.c | 36 +-- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/drivers/g
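The folding pattern, sketched with illustrative arguments (each dyndbg callsite record costs ~56 bytes, so collapsing 12 calls into one is where the ~620 bytes come from):

/* before: one callsite record per line */
DRM_DEBUG_DRIVER("chipset: %d\n", chipset);
DRM_DEBUG_DRIVER("family : %d\n", family);
DRM_DEBUG_DRIVER("device : %d\n", device);

/* after: one multi-line message, one callsite record */
DRM_DEBUG_DRIVER("chipset: %d\n"
		 "family : %d\n"
		 "device : %d\n",
		 chipset, family, device);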

[PATCH v7 5/8] drm_print: add choice to use dynamic debug in drm-debug

2021-08-31 Thread Jim Cromie
drm's debug system writes 10 distinct categories of messages to syslog using a small API[1]: drm_dbg*(10 names), DRM_DEV_DEBUG*(3 names), DRM_DEBUG*(8 names). There are thousands of these callsites, each categorized in this systematized way. These callsites can be enabled at runtime by their cate

[PATCH v7 7/8] amdgpu_ucode: reduce number of pr_debug calls

2021-08-31 Thread Jim Cromie
There are blocks of DRM_DEBUG calls, consolidate their args into single calls. With dynamic-debug in use, each callsite consumes 56 bytes of callsite data, and this patch removes about 65 calls, so it saves ~3.5kb. no functional changes. RFC: this creates multi-line log messages, does that break

[PATCH v7 6/8] drm_print: instrument drm_debug_enabled

2021-08-31 Thread Jim Cromie
Duplicate drm_debug_enabled() code into both "basic" and "dyndbg" ifdef branches. Then add a pr_debug("todo: ...") into the "dyndbg" branch. Then convert the "dyndbg" branch's code to a macro, so that its pr_debug() get its callsite info from the invoking function, instead of from drm_debug_enabl
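Sketched from the summary (the CONFIG symbol and todo text are assumptions): the dyndbg branch becomes a macro so its embedded pr_debug() inherits the caller's callsite info:

#if defined(CONFIG_DRM_USE_DYNAMIC_DEBUG)
#define drm_debug_enabled(category)					\
	({								\
		pr_debug("todo: is this frequent enough to optimize?\n");\
		unlikely(__drm_debug & (category));			\
	})
#else
static inline bool drm_debug_enabled(enum drm_debug_category category)
{
	return unlikely(__drm_debug & category);
}
#endif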

[PATCH v7 3/8] i915/gvt: use DEFINE_DYNAMIC_DEBUG_CATEGORIES to create "gvt:core:" etc categories

2021-08-31 Thread Jim Cromie
The gvt component of this driver has ~120 pr_debugs, in 9 categories quite similar to those in DRM. Following the interface model of drm.debug, add a parameter to map bits to these categorizations. DEFINE_DYNAMIC_DEBUG_CATEGORIES(debug_gvt, __gvt_debug, "dyndbg bitmap desc", { "gv

[PATCH v7 4/8] amdgpu: use DEFINE_DYNAMIC_DEBUG_CATEGORIES

2021-08-31 Thread Jim Cromie
logger_types.h defines many DC_LOG_*() categorized debug wrappers. Most of these use DRM debug API, so are controllable using drm.debug, but others use bare pr_debug("$prefix: .."), each with a different class-prefix matching "^\[\w+\]:" Use DEFINE_DYNAMIC_DEBUG_CATEGORIES to create a /sys debug_d

[PATCH v7 2/8] dyndbg: remove spaces in pr_debug "gvt: core:" etc prefixes

2021-08-31 Thread Jim Cromie
Taking embedded spaces out of existing prefixes makes them better class-prefixes; simplifying the extra quoting needed otherwise: $> echo format "^gvt: core:" +p >control Dropping the internal spaces means any trailing space in a query will more clearly terminate the prefix being searched for.

[PATCH v7 1/8] dyndbg: add DEFINE_DYNAMIC_DEBUG_CATEGORIES and callbacks

2021-08-31 Thread Jim Cromie
DEFINE_DYNAMIC_DEBUG_CATEGORIES(name, var, bitmap_desc, @bit_descs) allows users to define a drm.debug style (bitmap) sysfs interface, and to specify the desired mapping from bits[0-N] to the format-prefix'd pr_debug()s to be controlled. DEFINE_DYNAMIC_DEBUG_CATEGORIES(debug_gvt, __gvt_debug,
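Usage as the truncated preview shows it; the descriptor entries past the visible "gv..." are assumptions patterned on patch 3/8:

/* Declare a drm.debug-style bitmap parameter whose bits enable
 * pr_debugs by their format prefix (sketch, not mainline API). */
DEFINE_DYNAMIC_DEBUG_CATEGORIES(debug_gvt, __gvt_debug,
	"dyndbg bitmap desc",
	{ "gvt:cmd: " },	/* bit 0: command processing */
	{ "gvt:core: " },	/* bit 1: core help */
	{ "gvt:dpy: " });	/* bit 2: display help */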

[PATCH v7 0/8] use DYNAMIC_DEBUG to implement DRM.debug

2021-08-31 Thread Jim Cromie
Hi Jason, DRM folks, In DRM-debug currently, drm_debug_enabled() is called a lot to decide whether or not to write debug messages. Each test is cheap, but costs continue with uptime. DYNAMIC_DEBUG "dyndbg", when built with JUMP_LABEL, replaces each of those tests with a patchable NOOP, for "zero

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 9:11 a.m., Daniel Vetter wrote: On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote: On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote: On 2021-08-19 5:30 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote: On

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Luben Tuikov
On 2021-08-31 11:23, Andrey Grodzovsky wrote: > On 2021-08-31 10:38 a.m., Daniel Vetter wrote: >> On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: >>> On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: > It's

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 10:38 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: It says patch [2/2] but I can't find patch 1 On 2021-08-31 6:

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Luben Tuikov
On 2021-08-31 08:59, Daniel Vetter wrote: > Can we please have some actual commit message here, with detailed > explanation of the race/bug/whatever, how you fix it and why this is the > best option? I agree with Daniel--a narrative form of a commit message is so much easier for humans to digest.

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Daniel Vetter
On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: > > On 2021-08-31 10:03 a.m., Daniel Vetter wrote: > > On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: > > > It says patch [2/2] but I can't find patch 1 > > > > > > On 2021-08-31 6:35 a.m., Monk Liu wrote: >

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: It says patch [2/2] but I can't find patch 1 On 2021-08-31 6:35 a.m., Monk Liu wrote: tested-by: jingwen chen Signed-off-by: Monk Liu Signed-off-by: jingwen chen --- driv

Re: [PATCH] drm/amdgpu: stop scheduler when calling hw_fini (v2)

2021-08-31 Thread Alex Deucher
On Mon, Aug 30, 2021 at 2:24 AM Guchun Chen wrote: > > This guarantees no more work on the ring can be submitted > to hardware in the suspend/resume case, otherwise a potential > race will occur and the ring will get no chance to stay > empty before suspend. > > v2: Call drm_sched_resubmit_job before d
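The sequence the commit message implies, sketched per ring (details are assumptions; v2 reportedly resubmits jobs before restarting):

/* suspend path: quiesce the scheduler so nothing new reaches HW */
drm_sched_stop(&ring->sched, NULL);
/* ... hw_fini / suspend ... */

/* resume path (v2 ordering) */
drm_sched_resubmit_jobs(&ring->sched);
drm_sched_start(&ring->sched, true);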

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Daniel Vetter
On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: > It says patch [2/2] but I can't find patch 1 > > On 2021-08-31 6:35 a.m., Monk Liu wrote: > > tested-by: jingwen chen > > Signed-off-by: Monk Liu > > Signed-off-by: jingwen chen > > --- > > drivers/gpu/drm/scheduler/sched_

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
It says patch [2/2] but I can't find patch 1 On 2021-08-31 6:35 a.m., Monk Liu wrote: tested-by: jingwen chen Signed-off-by: Monk Liu Signed-off-by: jingwen chen --- drivers/gpu/drm/scheduler/sched_main.c | 24 1 file changed, 4 insertions(+), 20 deletions(-) di

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-31 Thread Daniel Vetter
On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote: > On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote: > > > > On 2021-08-19 5:30 a.m., Daniel Vetter wrote: > > > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote: > > > > On 2021-08-18 10:42 a.m., Danie

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-31 Thread Daniel Vetter
On Fri, Aug 27, 2021 at 08:30:32PM +0200, Christian König wrote: > Yeah, that's what I meant: the start of processing a job is a bit > swampily defined. > > Jobs overload, but we simply don't have another good indicator that a job > started except that the previous one completed. > > It's

Re: [drm/amdgpu] Driver crashes on 5.13.9 kernel

2021-08-31 Thread Kari Argillander
On Mon, Aug 30, 2021 at 07:15:29PM +0300, Skyler Mäntysaari wrote: > I have tried kernel 5.13.13, without any difference, and I haven't > tried with an older kernel, as this hardware is so new that I have > very little faith that a pre-5.x kernel would even have support for > the needed GPU. Ye

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Daniel Vetter
On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote: > Can we please have some actual commit message here, with detailed > explanation of the race/bug/whatever, how you fix it and why this is the > best option? > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote: > > tested-by:

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Daniel Vetter
Can we please have some actual commit message here, with detailed explanation of the race/bug/whatever, how you fix it and why this is the best option? On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote: > tested-by: jingwen chen > Signed-off-by: Monk Liu > Signed-off-by: jingwen chen > -

[PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Monk Liu
tested-by: jingwen chen Signed-off-by: Monk Liu Signed-off-by: jingwen chen --- drivers/gpu/drm/scheduler/sched_main.c | 24 1 file changed, 4 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c i

[PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)

2021-08-31 Thread Monk Liu
issue: in cleanup_job, cancel_delayed_work will cancel a TO timer even if its corresponding job is still running. fix: do not cancel the timer in cleanup_job; instead do the cancelling only when the heading job is signaled, and if there is a "next" job we start_timeout again. v2: further clea

[PATCH] drm/amdgpu: Clear RAS interrupt status on aldebaran

2021-08-31 Thread Clements, John
[AMD Official Use Only] Submitting patch to resolve incorrect register addresses on Aldebaran affecting RAS interrupt handling. Attachment: 0001-drm-amdgpu-Clear-RAS-interrupt-status-on-aldebaran.patch

Re: [PATCH v3] drm/amdgpu: Fix a deadlock if previous GEM object allocation fails

2021-08-31 Thread Christian König
On 31.08.21 at 09:08, Pan, Xinhui wrote: Fall through to handle the error instead of return. Fixes: f8aab60422c37 ("drm/amdgpu: Initialise drm_gem_object_funcs for imported BOs") Cc: sta...@vger.kernel.org Signed-off-by: xinhui pan Reviewed-by: Christian König --- drivers/gpu/drm/amd/am

[PATCH v3] drm/amdgpu: Fix a deadlock if previous GEM object allocation fails

2021-08-31 Thread Pan, Xinhui
Fall through to handle the error instead of return. Fixes: f8aab60422c37 ("drm/amdgpu: Initialise drm_gem_object_funcs for imported BOs") Cc: sta...@vger.kernel.org Signed-off-by: xinhui pan --- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 23 ++- 1 file changed, 10 insertions(+
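The shape of the fix as described ("fall through to handle the error instead of return"); names and the cleanup body are placeholders, not the patch:

r = create_gem_object(adev, &bp, &gobj);	/* hypothetical helper */
if (r)
	goto error;	/* was: return r; which skipped the unwind
			 * below and left state held -> deadlock on
			 * the next allocation attempt */
/* ... */
error:
	/* shared unwind releases whatever the early return leaked */
	return r;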