On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe
>
> Remove the interval tree in the driver and rely on the tree maintained by
> the mmu_notifier for delivering mmu_notifier invalidation callbacks.
>
> For some reason amdgpu has a very complicated arrangement where it tries
I haven't had enough time to fully understand the deferred logic in this
change. I spotted one problem, see comments inline.
On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immedia
On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe
>
> find_vma() must be called under the mmap_sem, reorganize this code to
> do the vma check after entering the lock.
>
> Further, fix the unlocked use of struct task_struct's mm, instead use
> the mm from hmm_mirror which has
On 2019-10-09 11:34, Daniel Vetter wrote:
> On Wed, Oct 09, 2019 at 03:25:22PM +0000, Kuehling, Felix wrote:
>> On 2019-10-09 6:31, Daniel Vetter wrote:
>>> On Tue, Oct 08, 2019 at 06:53:18PM +0000, Kuehling, Felix wrote:
>>>> The description sounds reasonable to me a
On 2019-10-09 6:31, Daniel Vetter wrote:
> On Tue, Oct 08, 2019 at 06:53:18PM +0000, Kuehling, Felix wrote:
>>
>> The description sounds reasonable to me and maps well to the CU masking
>> feature in our GPUs.
>>
>> It would also allow us to do more coar
On 2019-08-29 2:05 a.m., Kenny Ho wrote:
> The number of logical GPUs (lgpu) is defined to be the number of compute
> units (CUs) for a device. The lgpu allocation limit only applies to
> compute workloads for the moment (enforced via kfd queue creation). Any
> cu_mask update is validated against the
On 2019-08-29 2:05 a.m., Kenny Ho wrote:
> drm.lgpu
> A read-write nested-keyed file which exists on all cgroups.
> Each entry is keyed by the DRM device's major:minor.
>
> lgpu stands for logical GPU; it is an abstraction used to
> subdivide a physical DRM devic
On 2019-10-07 12:08 p.m., Alex Deucher wrote:
> On Sat, Oct 5, 2019 at 1:58 PM Colin King wrote:
>> From: Colin Ian King
>>
>> Function kgd2kfd_init is missing a void argument, add it
>> to clean up the non-ANSI function declaration.
>>
>> Signed-off-by: Colin Ian King
> Applied. thanks!
Thank
On 2019-09-18 12:30 p.m., Allen Pais wrote:
> alloc_workqueue is not checked for errors and as a result,
> a potential NULL dereference could occur.
>
> Signed-off-by: Allen Pais
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 5 +
> 1 file changed, 5 insertions(+)
>
> diff --git a/dri
On 2019-08-20 8:36 a.m., Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 11:45:54AM +1000, Stephen Rothwell wrote:
>> Hi all,
>>
>> On Mon, 19 Aug 2019 18:34:41 -0700 Randy Dunlap
>> wrote:
>>> On 8/19/19 2:18 AM, Stephen Rothwell wrote:
Hi all,
Changes since 20190816:
>
On 2019-08-06 19:15, Jason Gunthorpe wrote:
> From: Jason Gunthorpe
>
> The sequence of mmu_notifier_unregister_no_release(),
> mmu_notifier_call_srcu() is identical to mmu_notifier_put() with the
> free_notifier callback.
>
> As this is the last user of those APIs, converting it means we can drop
On 2019-08-06 13:44, Jason Gunthorpe wrote:
> On Tue, Aug 06, 2019 at 07:05:53PM +0300, Christoph Hellwig wrote:
>> The option is just used to select HMM mirror support and has a very
>> confusing help text. Just pull in the HMM mirror code by default
>> instead.
>>
>> Signed-off-by: Christoph Hel
On 2019-08-02 16:07, Jason Gunthorpe wrote:
> When using mmu_notifier_unregister_no_release() the caller must ensure
> there is a SRCU synchronize before the mn memory is freed, otherwise use
> after free races are possible, for instance:
>
> CPU0 CPU1
>
On 2019-07-31 11:52 a.m., Alex Deucher wrote:
> Unused.
>
> Signed-off-by: Alex Deucher
The series is
Reviewed-by: Felix Kuehling
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arct
On 2019-07-30 1:51 a.m., Christoph Hellwig wrote:
> All users pass PAGE_SIZE here, and if we wanted to support single
> entries for huge pages we should really just add a HMM_FAULT_HUGEPAGE
> flag instead that uses the huge page size instead of having the
> caller calculate that size once, just for
On 2019-07-30 1:51 a.m., Christoph Hellwig wrote:
> The start, end and page_shift values are all saved in the range
> structure, so we might as well use that for argument passing.
>
> Signed-off-by: Christoph Hellwig
Reviewed-by: Felix Kuehling
> ---
> Documentation/vm/hmm.rst
On 2019-07-30 1:51 a.m., Christoph Hellwig wrote:
> The list is used to add the range to another list as an entry in the
> core hmm code, so there is no need to initialize it in a driver.
I've seen code that uses list_empty to check whether a list head has
been added to a list or not. For that to
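To illustrate the list_empty point, here is a minimal userspace sketch of a kernel-style circular list head. It is a re-creation for illustration only, not the kernel's <linux/list.h>; the helper names init_list_head, list_is_empty and list_add_entry are hypothetical. The check only gives a meaningful answer if the head was initialized to point at itself first.

```c
#include <stdbool.h>

/* Userspace re-creation of a kernel-style circular list head (illustrative). */
struct list_head {
    struct list_head *next, *prev;
};

/* An initialized head points at itself; that is what "empty" means. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* True only while the head still points at itself. On an uninitialized
 * head this reads garbage, so the answer would be meaningless. */
static bool list_is_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Link entry directly after head. */
static void list_add_entry(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}
```

With this layout, a freshly initialized node reports empty, and the same node reports non-empty once it has been linked into another list.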
On 2019-07-30 1:51 a.m., Christoph Hellwig wrote:
> hmm_range_fault can only return -EAGAIN if called with the block
> argument set to false, so remove the special handling for it.
The block argument no longer exists. You replaced that with the
HMM_FAULT_ALLOW_RETRY with opposite logic. So this s
This memory allocation flag will be used to indicate BOs containing
sensitive data that should not be leaked to other processes.
Signed-off-by: Felix Kuehling
---
include/uapi/drm/amdgpu_drm.h | 4
1 file changed, 4 insertions(+)
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/dr
This notifies the driver that a BO is about to be released.
Releasing a BO also invokes the move_notify callback from
ttm_bo_cleanup_memtype_use, but that happens too late for anything
that would add fences to the BO and require a delayed delete.
Signed-off-by: Felix Kuehling
---
drivers/gpu/dr
Wipe VRAM memory containing sensitive data when moving or releasing
BOs. Clearing the memory is pipelined to minimize any impact on
subsequent memory allocation latency. Use of a poison value should
help debug future use-after-free bugs.
When moving BOs, the existing ttm_bo_pipelined_move ensures
Memory used by KFD applications can contain sensitive information that
should not be leaked to other processes. The current approach to prevent
leaks is to clear VRAM at allocation time. This is not effective because
memory can be reused in other ways without being cleared. Synchronously
clearing m
[adding dri-devel]
On 2019-07-09 11:59 p.m., Kuehling, Felix wrote:
> This notifies the driver that a BO is about to be released.
>
> Releasing a BO also invokes the move_notify callback from
> ttm_bo_cleanup_memtype_use, but that happens too late for anything
> that would add f
On 2019-07-07 7:30 p.m., Stephen Rothwell wrote:
> Hi all,
>
> On Wed, 3 Jul 2019 17:09:16 -0400 Alex Deucher wrote:
>> On Wed, Jul 3, 2019 at 5:03 PM Kuehling, Felix
>> wrote:
>>> On 2019-07-03 10:10 a.m., Jason Gunthorpe wrote:
>>>> On Wed, Jul 03,
On 2019-07-04 2:32 a.m., Oded Gabbay wrote:
> I'm leaving the role of amdkfd maintainer. Therefore, update the relevant
> entry in the MAINTAINERS file with the name of the new maintainer.
>
> Good Luck!
Thank you Oded! Thanks for being the maintainer even after leaving AMD
and helping me transit
On 2019-07-03 10:10 a.m., Jason Gunthorpe wrote:
> On Wed, Jul 03, 2019 at 01:55:08AM +0000, Kuehling, Felix wrote:
>> From: Philip Yang
>>
>> In order to pass mirror instead of mm to hmm_range_register, we need
>> to pass bo instead of ttm to amdgpu_ttm_tt_get_user_page
On 2019-07-02 6:59 p.m., Jason Gunthorpe wrote:
> On Wed, Jul 03, 2019 at 12:49:12AM +0200, Christoph Hellwig wrote:
>> On Tue, Jul 02, 2019 at 07:53:23PM +0000, Jason Gunthorpe wrote:
I'm sending this out now since we are updating many of the HMM APIs
and I think it will be useful.
>>> T
From: Philip Yang
In order to pass mirror instead of mm to hmm_range_register, we need
to pass bo instead of ttm to amdgpu_ttm_tt_get_user_pages because mirror
is part of amdgpu_mn structure, which is accessible from bo.
Signed-off-by: Philip Yang
Reviewed-by: Felix Kuehling
Signed-off-by: Felix
On 2019-07-01 2:20 a.m., Christoph Hellwig wrote:
> We should not have two different error codes for the same condition. In
> addition this really complicates the code due to the special handling of
> EAGAIN that drops the mmap_sem due to the FAULT_FLAG_ALLOW_RETRY logic
> in the core vm.
I think
I think this could happen if KFD initialization fails for a device.
Currently we'd add the device, and then remove it again. That may leave
a gap in the proximity domains. Oak just had a fix recently to clean
that up by only adding KFD devices to the topology after successful
initialization.
R
On 2019-06-26 2:54 a.m., Koenig, Christian wrote:
> Am 26.06.19 um 08:40 schrieb Kuehling, Felix:
>> Returning -EAGAIN prevents ttm_bo_mem_space from trying alternate
>> placements and can lead to live-locks in amdgpu_cs, retrying
>> indefinitely and never succeeding.
>
Returning -EAGAIN prevents ttm_bo_mem_space from trying alternate
placements and can lead to live-locks in amdgpu_cs, retrying
indefinitely and never succeeding.
Fixes: cfcc52e477e4 ("drm/ttm: fix busy memory to fail other user v10")
CC: Christian Koenig
Signed-off-by: Felix Kuehling
---
driver
I believe I found a live-lock due to this patch when running our KFD
eviction test in a loop. It pretty reliably hangs on the second loop
iteration. If I revert this patch, the problem disappears.
With some added instrumentation, I see that amdgpu_cs_list_validate in
amdgpu_cs_parser_bos returns
On 2019-06-18 1:37, Christoph Hellwig wrote:
> On Mon, Jun 17, 2019 at 09:45:09PM -0300, Jason Gunthorpe wrote:
>> Am I looking at the wrong thing? Looks like it calls it through a work
>> queue, which should be OK.
> Yes, it calls it through a work queue. I guess that is fine because
> it needs
[+Philip]
Hi Jason,
I'm out of the office this week.
Hi Philip, can you give this a go? Not sure how much you've been
following this patch series review. Message or call me on Skype to
discuss any questions.
Thanks,
Felix
On 2019-06-11 12:48, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019
[resent with correct address for Alex]
On 2019-06-06 11:11 a.m., Jason Gunthorpe wrote:
> On Fri, May 10, 2019 at 07:53:21PM +0000, Kuehling, Felix wrote:
>> These problems were found in AMD-internal testing as we're working on
>> adopting HMM. They are rebased against gli
On 2019-06-06 11:11 a.m., Jason Gunthorpe wrote:
> On Fri, May 10, 2019 at 07:53:21PM +0000, Kuehling, Felix wrote:
>> These problems were found in AMD-internal testing as we're working on
>> adopting HMM. They are rebased against glisse/hmm-5.2-v3. We'd like to get
>
On 2019-06-05 9:56, Michel Dänzer wrote:
> On 2019-06-05 1:24 p.m., Christian König wrote:
>> Am 04.06.19 um 21:03 schrieb Zeng, Oak:
>>> From: amd-gfx On Behalf Of
>>> Kuehling, Felix
>>> On 2019-06-04 11:23, Christian König wrote:
[snip]
>>> --
On 2019-06-04 11:23, Christian König wrote:
> Since we now keep BOs on the LRU we need to make sure
> that they are removed when they are pinned.
>
> Signed-off-by: Christian König
> ---
> include/drm/ttm/ttm_bo_driver.h | 14 ++
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
On 2019-05-29 11:07 a.m., Colin King wrote:
> From: Colin Ian King
>
> The pointer dev is set to null yet it is being dereferenced when
> checking dev->dqm->sched_policy. Fix this by performing the check
> on dev->dqm->sched_policy after dev has been assigned and null
> checked. Also remove the
BOs when there is nothing easier to evict.
ROCm applications like to use lots of memory. So it probably makes sense
for us to stop removing our BOs from the LRU as well while we
mass-validate our BOs in amdgpu_amdkfd_gpuvm_restore_process_bos.
Regards,
Felix
>
> Christian.
>
> Am 22.05
Can you explain how this avoids OOM situations? When is it safe to leave
a reserved BO on the LRU list? Could we do the same thing in
amdgpu_amdkfd_gpuvm.c? And if we did, what would be the expected side
effects or consequences?
Thanks,
Felix
On 2019-05-22 8:59 a.m., Christian König wrote:
On 2019-05-13 5:27 p.m., Andrew Morton wrote:
> [CAUTION: External Email]
>
> On Fri, 10 May 2019 19:53:23 +0000 "Kuehling, Felix"
> wrote:
>
>> From: Philip Yang
>>
>> While a page is being migrated by NUMA balancing, HMM fails to detect this
0ea534db
> which did not have a clear line of sight for 5.2 either.
When was that? I saw "Use HMM for userptr" in Dave's 5.2-rc1 pull
request to Linus.
Regards,
Felix
>
> Alex
> ----
> *From:* amd
[Fixed Alex's email address, sorry for getting it wrong first]
On 2019-05-13 3:49 p.m., Jerome Glisse wrote:
> [CAUTION: External Email]
>
> Andrew, can we get these 2 fixes lined up for 5.2?
>
> On Mon, May 13, 2019 at 07:36:44PM +0000, Kuehling, Felix wrote:
>> Hi Jerom
y 10, 2019 at 07:53:24PM +0000, Kuehling, Felix wrote:
>> Don't set this flag by default in hmm_vma_do_fault. It is set
>> conditionally just a few lines below. Setting it unconditionally
>> can lead to handle_mm_fault doing a non-blocking fault, returning
>> -EBUSY and
Don't set this flag by default in hmm_vma_do_fault. It is set
conditionally just a few lines below. Setting it unconditionally
can lead to handle_mm_fault doing a non-blocking fault, returning
-EBUSY and unlocking mmap_sem unexpectedly.
Signed-off-by: Felix Kuehling
---
mm/hmm.c | 2 +-
1 file c
These problems were found in AMD-internal testing as we're working on
adopting HMM. They are rebased against glisse/hmm-5.2-v3. We'd like to get
them applied to a mainline Linux kernel as well as drm-next and
amd-staging-drm-next sooner rather than later.
Currently the HMM in amd-staging-drm-next
From: Philip Yang
While a page is being migrated by NUMA balancing, HMM fails to detect this
condition and still returns the old page. The application will use the newly
migrated page, but the driver passes the old page's physical address to the
GPU, which crashes the application later.
Use pte_protnone(pte) to return
s patch untag user pointers in
> amdgpu_gem_userptr_ioctl() for the GEM case and in amdgpu_amdkfd_gpuvm_
> alloc_memory_of_gpu() for the KFD case. This also makes sure that an
> untagged pointer is passed to amdgpu_ttm_tt_get_user_pages(), which uses
> it for vma lookups.
>
> Suggested-by: Kuehling, Fel
On 2019-05-06 12:30 p.m., Andrey Konovalov wrote:
> [CAUTION: External Email]
>
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> In radeon_gem_userptr_ioctl(
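As a rough illustration of what "untagging" means in the description quoted above, here is a minimal userspace sketch. It is a simplification: it assumes a user-space address (bit 55 clear) and simply clears the top byte, and the function name untag_user_addr is hypothetical. It is not the kernel's own untagging helper.

```c
#include <stdint.h>

/* Hypothetical helper: drop the tag stored in the top byte (bits 63..56)
 * of a 64-bit user pointer value. Assumes a user-space address, i.e. the
 * untagged top byte is 0x00. */
static uint64_t untag_user_addr(uint64_t addr)
{
    return addr & ~(UINT64_C(0xff) << 56);
}
```

An address that carries no tag passes through unchanged, so the helper can be applied unconditionally before the address is used for comparisons or lookups.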
> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu() for the KFD case.
>
> Suggested-by: Kuehling, Felix
> Signed-off-by: Andrey Konovalov
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
> drivers/gpu/drm/a
On 2019-04-30 9:25 a.m., Andrey Konovalov wrote:
> [CAUTION: External Email]
>
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> radeon_ttm_tt_pin_userptr() u
Adding dri-devel
On 2019-04-17 6:15 p.m., Yang, Philip wrote:
> After patch "drm: Use the same mmap-range offset and size for GEM and
> TTM", applications failed to create BOs in system memory because the drm
> mmap_range size decreased to 64GB from the original 1TB. This is not big
> enough for applications
[dropping the robot]
I think Philip fixed those issues on amd-staging-drm-next. Either some
fixes are missing on drm-next-5.2-wip, or they are there but should be
squashed to avoid hitting these errors on intermediate builds.
Regards,
Felix
On 2019-04-03 2:26 p.m., kbuild test robot wrote:
On 2019-04-02 10:37 a.m., Andrey Konovalov wrote:
> On Mon, Mar 25, 2019 at 11:21 PM Kuehling, Felix
> wrote:
>> On 2019-03-20 10:51 a.m., Andrey Konovalov wrote:
>>> This patch is a part of a series that extends arm64 kernel ABI to allow to
>>> pass tagged user p
On 2019-04-02 10:29 a.m., Paul E. McKenney wrote:
> Having DEFINE_SRCU() or DEFINE_STATIC_SRCU() in a loadable module
> requires that the size of the reserved region be increased, which is
> not something we really want to be doing. This commit therefore removes
> the DEFINE_STATIC_SRCU() from dri
On 2019-03-20 10:51 a.m., Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> amdgpu_ttm_tt_get_user_pages() uses provided user pointers
Alex already applied an equivalent patch by Colin King (attached for
reference).
Regards,
Felix
On 3/18/2019 2:05 PM, Gustavo A. R. Silva wrote:
> Assign return value of function amdgpu_bo_sync_wait() to variable ret
> for its further check.
>
> Addresses-Coverity-ID: 1443914 ("Logically dead
On 2/25/2019 2:58 PM, Thomas Hellstrom wrote:
> On Mon, 2019-02-25 at 14:20 +0000, Koenig, Christian wrote:
>> Am 23.02.19 um 00:19 schrieb Kuehling, Felix:
>>> Don't account for them in other zones such as dma32. The kernel
>>> page
>>> allocator has its o
Don't account for them in other zones such as dma32. The kernel page
allocator has its own heuristics to avoid exhausting special zones
for regular kernel allocations.
Signed-off-by: Felix Kuehling
CC: thellst...@vmware.com
CC: christian.koe...@amd.com
---
drivers/gpu/drm/ttm/ttm_memory.c | 6 ++
On 2019-02-22 8:45 a.m., Thomas Hellstrom wrote:
> On Fri, 2019-02-22 at 07:10 +0000, Koenig, Christian wrote:
>> Am 21.02.19 um 22:02 schrieb Thomas Hellstrom:
>>> Hi,
>>>
>>> On Thu, 2019-02-21 at 20:24 +0000, Kuehling, Felix wrote:
>>>> On 2019-02
On 2019-02-21 12:34 p.m., Thomas Hellstrom wrote:
> On Thu, 2019-02-21 at 16:57 +0000, Kuehling, Felix wrote:
>> On 2019-02-21 2:59 a.m., Koenig, Christian wrote:
>>> On x86 with HIGHMEM there is no dma32 zone. Why do we need one on
>>>>> x86_64? Can we make
On 2019-02-21 2:59 a.m., Koenig, Christian wrote:
> On x86 with HIGHMEM there is no dma32 zone. Why do we need one on
>>> x86_64? Can we make x86_64 more like HIGHMEM instead?
>>>
>>> Regards,
>>> Felix
>>>
>> IIRC with x86, the kernel zone is always smaller than any dma32 zone,
>> so we'd al
On 2019-02-20 1:41 a.m., Thomas Hellstrom wrote:
> On Tue, 2019-02-19 at 17:06 +0000, Kuehling, Felix wrote:
>> On 2019-02-18 3:39 p.m., Thomas Hellstrom wrote:
>>> On Mon, 2019-02-18 at 18:07 +0100, Christian König wrote:
>>>> Am 18.02.19 um 10:47 schrieb Thomas He
> default,
>>>>> which
>>>>> means if we drop this check, other devices may stop functioning
>>>>> unexpectedly?
>>>>>
>>>>> However, in the end I'd expect the kernel page allocation
>>>>> system
>>>
This is an RFC. I'm not sure this is the right solution, but it
highlights the problem I'm trying to solve.
The dma32_zone limits the acc_size of all allocated BOs to 2GB. On a
64-bit system with hundreds of GB of system memory and GPU memory,
this can become a bottleneck. We're seeing TTM memory
Thank you, Nathan. I applied your patch to amd-staging-drm-next.
Sorry for the late response. I'm catching up with my email backlog after
a vacation.
Regards,
Felix
On 2019-01-21 6:52 p.m., Nathan Chancellor wrote:
> Clang warns:
>
> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_crat.c:866:5: war
On 2019-01-03 12:34 p.m., Gustavo A. R. Silva wrote:
> Fix boolean expressions by using logical AND operator '&&'
> instead of bitwise operator '&'.
>
> This issue was detected with the help of Coccinelle.
>
> Fixes: c8c5e569c5b0 ("drm/amdgpu: Consolidate visible vs. real vram check
> v2.")
Actual
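For context on the '&' versus '&&' distinction in the quoted patch description, a small C sketch follows. The function names and operand values are hypothetical and not taken from the amdgpu code. The two operators agree when both operands are 0 or 1, but diverge when the operands are non-zero values with no bits in common.

```c
#include <stdbool.h>

/* Hypothetical helpers for illustration only. */

/* Bitwise AND: true only if the operands share at least one set bit. */
static bool both_set_bitwise(int a, int b)
{
    return a & b;
}

/* Logical AND: true if both operands are non-zero, whatever their bits. */
static bool both_set_logical(int a, int b)
{
    return a && b;
}
```

With a = 2 and b = 1, both operands are "true" in the C sense, yet the bitwise form evaluates to 0 while the logical form evaluates to 1. The logical form also short-circuits, so the right-hand operand is not evaluated when the left one is zero.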
On 2018-12-05 6:04 p.m., Jerome Glisse wrote:
> On Wed, Dec 05, 2018 at 09:42:45PM +0000, Kuehling, Felix wrote:
>> The amdgpu part looks good to me.
>>
>> A minor nit-pick in mmu_notifier.c (inline).
>>
>> Either way, the series is Acked-by: Felix Kuehling
>
The amdgpu part looks good to me.
A minor nit-pick in mmu_notifier.c (inline).
Either way, the series is Acked-by: Felix Kuehling
On 2018-12-05 12:36 a.m., jgli...@redhat.com wrote:
> From: Jérôme Glisse
>
> To avoid having to change many callback definitions every time we want
> to add a parame
On 2018-11-28 4:14 a.m., Joonas Lahtinen wrote:
> Quoting Ho, Kenny (2018-11-27 17:41:17)
>> On Tue, Nov 27, 2018 at 4:46 AM Joonas Lahtinen
>> wrote:
>>> I think a more abstract property "% of GPU (processing power)" might
>>> be a more universal approach. One can then implement that through
>>
On 2018-10-22 1:23 p.m., Arun KS wrote:
> Remove managed_page_count_lock spinlock and instead use atomic
> variables.
>
> Suggested-by: Michal Hocko
> Suggested-by: Vlastimil Babka
> Signed-off-by: Arun KS
Acked-by: Felix Kuehling
Regards,
Felix
>
> ---
> As discussed here,
> https://patch
Apologies. We already have a fix for this on our internal amd-kfd-staging
branch, but it's missing from amd-staging-drm-next. I'll cherry-pick our fix to
amd-staging-drm-next and nominate it for drm-fixes.
Regards,
Felix
-----Original Message-----
From: amd-gfx On Behalf Of Joerg Roedel
Sent
On 2018-11-01 7:03 a.m., Dmitry V. Levin wrote:
> Consistently use types provided by via
> to fix struct kfd_ioctl_get_queue_wave_state_args userspace compilation
> errors.
>
> Fixes: 5df099e8bc83f ("drm/amdkfd: Add wavefront context save state retrieval
> ioctl")
> Signed-off-by: Dmitry V. Lev
The BIOS signature check does not guarantee integrity of the BIOS image
either way. As I understand it, the signature is just a magic number.
It's not a cryptographic signature. The check is just a sanity check.
Therefore this change doesn't add any meaningful protection against the
scenario you de
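A short sketch of why a magic-number check is only a sanity check. The assumption here is a two-byte 0x55 0xAA header of the kind used by PCI expansion ROMs; the function name has_rom_magic is hypothetical and this is not the driver's actual check. Any image that starts with the right two bytes passes, regardless of what follows.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sanity check: assumes a two-byte 0x55 0xAA header.
 * It looks at nothing beyond those two bytes, so it cannot detect
 * modification of the rest of the image. */
static bool has_rom_magic(const uint8_t *image, size_t len)
{
    return image != NULL && len >= 2 &&
           image[0] == 0x55 && image[1] == 0xAA;
}
```

Two images with completely different payloads both pass as long as the first two bytes match, which is the difference from a cryptographic signature computed over the whole image.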