Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Daniel Vetter
Hi Monk, On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk wrote: > > [AMD Official Use Only] > > > Hi Daniel/Christian/Andrey > > > > It looks the voice from you three are spread over those email floods to me, > the feature we are working on (diagnostic TDR scheme) is pending there for > more than 6 mo

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only] Daniel From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ? They looks to me somehow conflicting wi

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only] >> For me your project exists since a few weeks at most, because that is when >> your team showed up on dri-devel. That you already spent 6 months on this >> within amd, on a code area that very much affects shared code, without >> kicking of any thread on dri-devel isn'

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Lazar, Lijo
On 9/1/2021 3:26 AM, Felix Kuehling wrote: On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version. Add a firmware version check for this. The minimum firmware version that works without atomics can be updated in the device_info structure for each GPU type. Signed

Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread philip yang
On 2021-08-31 10:41 p.m., Alex Sierra wrote: During svm restore pages interrupt handler, kfd_process ref count was never dropped when xnack was disabled. Therefore, the object was never released. Good catch, but if xnack is off, we should not get he

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Christoph Hellwig
On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote: > >> driver code is not really involved in updating the CPU mappings. Maybe > >> it's something we need to do in the migration helpers. > > It looks like I'm totally misunderstanding what you are adding here > > then. Why do we need a

RE: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Kim, Jonathan
[AMD Official Use Only] We were seeing process leaks on a couple of machines running certain tests that triggered vm faults on purpose. I think svm_range_restore_pages gets called unconditionally on vm fault handling (unless the retry interrupt payload bit is supposed to be clear with xnack off

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Felix Kuehling
On 2021-09-01 at 7:04 a.m., Lazar, Lijo wrote: > > > On 9/1/2021 3:26 AM, Felix Kuehling wrote: >> On some GPUs the PCIe atomic requirement for KFD depends on the MEC >> firmware version. Add a firmware version check for this. The minimum >> firmware version that works without atomics can be updat

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk wrote: > > [AMD Official Use Only] > > Daniel > > From the link you share it looks you(or someone else) have quite a bunch > patches that changes DRM_SCHED or even amdgpu, by that case before they are > merged to kernel tree I'm wondering if any AMD devel

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Felix Kuehling
On 2021-09-01 at 4:29 a.m., Christoph Hellwig wrote: > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote: driver code is not really involved in updating the CPU mappings. Maybe it's something we need to do in the migration helpers. >>> It looks like I'm totally misundersta

Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread philip yang
On 2021-09-01 9:45 a.m., Kim, Jonathan wrote: [AMD Official Use Only] We were seeing process leaks on a couple of machines running certain tests that triggered vm faults

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Lazar, Lijo
[Public] What I wanted to ask was - Whether user mode application relies only on link properties alone to assume atomic ops are supported? If they check only link properties and if the firmware doesn't work fine, should it be still marked as supported? Basically, what is the purpose of exposin

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 12:30 PM Lazar, Lijo wrote: > > [Public] > > > What I wanted to ask was - > > Whether user mode application relies only on link properties alone to assume > atomic ops are supported? If they check only link properties and if the > firmware doesn't work fine, should it be s

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Felix Kuehling
On 2021-09-01 at 12:30 p.m., Lazar, Lijo wrote: > > [Public] > > > What I wanted to ask was - > > Whether user mode application relies only on link properties alone to > assume atomic ops are supported? If they check only link properties > and if the firmware doesn't work fine, should it be still

RE: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Kim, Jonathan
[Public] I wouldn’t know if it was another bug elsewhere. From what I was seeing, the leak was coming from !p->xnack_enable on the svm_range_restore_pages call. If it helps, I saw this on Aldebaran where a shader does some bad memory access on purpose on a debugged ptraced child process. The vm

Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Felix Kuehling
On 2021-09-01 at 12:59 p.m., Kim, Jonathan wrote: > > [Public] > > > [Public] > > > I wouldn’t know if it was another bug elsewhere. > > From what I was seeing, the leak was coming from !p->xnack_enable on > the svm_range_restore_pages call. > > If it helps, I saw this on Aldebaran where a shader

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Dave Airlie
On Thu, 2 Sept 2021 at 01:20, Alex Deucher wrote: > > On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk wrote: > > > > [AMD Official Use Only] > > > > Daniel > > > > From the link you share it looks you(or someone else) have quite a bunch > > patches that changes DRM_SCHED or even amdgpu, by that case be

Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

2021-09-01 Thread Felix Kuehling
If it's not too late, please add Fixes: 2383f56bbe4a ("drm/amdkfd: page table restore through svm API") Thanks, Felix On 2021-09-01 at 1:54 p.m., Felix Kuehling wrote: > On 2021-09-01 at 12:59 p.m., Kim, Jonathan wrote: >> [Public] >> >> >> [Public] >> >> >> I wouldn’t know if it was anothe

Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

2021-09-01 Thread Alex Deucher
On Wed, Sep 1, 2021 at 2:50 AM Christian König wrote: > > On 01.09.21 at 02:46, Monk Liu wrote: > > issue: > > in cleanup_job the cancel_delayed_work will cancel a TO timer > > even if its corresponding job is still running. > > > > fix: > > do not cancel the timer in cleanup_job, instead do the

[pull] amdgpu, amdkfd drm-next-5.15

2021-09-01 Thread Alex Deucher
Hi Dave, Daniel, Fixes for 5.15. The following changes since commit 8f0284f190e6a0aa09015090568c03f18288231a: Merge tag 'amd-drm-next-5.15-2021-08-27' of https://gitlab.freedesktop.org/agd5f/linux into drm-next (2021-08-30 09:06:03 +1000) are available in the Git repository at: https://g

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Felix Kuehling
On 2021-09-01 6:03 p.m., Dave Chinner wrote: On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote: On 2021-09-01 at 4:29 a.m., Christoph Hellwig wrote: On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote: driver code is not really involved in updating the CPU mappings. Ma

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Dave Chinner
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote: > > On 2021-09-01 at 4:29 a.m., Christoph Hellwig wrote: > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote: > driver code is not really involved in updating the CPU mappings. Maybe > it's something we need

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Dave Chinner
On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote: > On 2021-09-01 6:03 p.m., Dave Chinner wrote: > > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote: > > > On 2021-09-01 at 4:29 a.m., Christoph Hellwig wrote: > > > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kueh

Re: [PATCH 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-01 Thread Lazar, Lijo
Thanks Felix for the detailed explanation. Thanks, Lijo On 9/1/2021 10:17 PM, Felix Kuehling wrote: On 2021-09-01 at 12:30 p.m., Lazar, Lijo wrote: [Public] What I wanted to ask was - Whether user mode application relies only on link properties alone to assume atomic ops are supported? If

RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-01 Thread Liu, Monk
[AMD Official Use Only] >>> I'm not sure I can add much to help this along, I'm sure Alex has some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand. Usually