Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Yang, Philip
Sorry, resending the patch; the one in the previous email missed a couple of lines due to copy/paste. On 2019-11-01 3:45 p.m., Yang, Philip wrote: > > > On 2019-11-01 1:42 p.m., Jason Gunthorpe wrote: >> On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote: >>>> This test

Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Yang, Philip
On 2019-11-01 1:42 p.m., Jason Gunthorpe wrote: > On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote: >>> This test for range_blockable should be before mutex_lock, I can move >>> it up >>> >> yes, thanks. > > Okay, I wrote it like this: &

Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Yang, Philip
On 2019-11-01 11:12 a.m., Jason Gunthorpe wrote: > On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote: >> >> >> On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote: >>> On Tue, Oct 29, 2019 at 07:22:37PM +, Yang, Philip wrote: >>>> Hi Jason, &g

Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Yang, Philip
On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote: > On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote: >> Hi Jason, >> >> I did a quick test after merging amd-staging-drm-next with the >> mmu_notifier branch, which includes this set of changes. The test result >

Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-10-29 Thread Yang, Philip
Hi Jason, I did a quick test after merging amd-staging-drm-next with the mmu_notifier branch, which includes this set of changes. The test run shows various failures: apps stuck intermittently, GUI with no display, etc. I am going through the changes and will try to figure out the cause. Regards, Phili

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
On 2019-10-22 3:36 p.m., Grodzovsky, Andrey wrote: > > On 10/22/19 3:19 PM, Yang, Philip wrote: >> >> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: >>> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: >>>> On 10/22/19 2:28 PM, Yang, Philip wrote: >

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
On 2019-10-22 2:44 p.m., Kuehling, Felix wrote: > On 2019-10-22 14:28, Yang, Philip wrote: >> If device reset/suspend/resume failed for some reason, dqm lock is >> hold forever and this causes deadlock. Below is a kernel backtrace when >> application open kfd after

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: > > On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: >> On 10/22/19 2:28 PM, Yang, Philip wrote: >>> If device reset/suspend/resume failed for some reason, dqm lock is >>> hold forever and this causes deadlock. Be

[PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
If device reset/suspend/resume fails for some reason, the dqm lock is held forever and this causes a deadlock. Below is a kernel backtrace from when an application opens kfd after suspend/resume failed. Instead of holding the dqm lock in pre_reset and releasing it in post_reset, add a dqm->device_stopped flag w
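The flag-based pattern described above can be sketched in plain C. This is an illustrative model only: pthread mutexes stand in for the kernel mutex, and the structures and function names (dqm_pre_reset, dqm_map_queue, etc.) are simplified stand-ins, not the real amdkfd code.

```c
/*
 * Illustrative sketch of the flag-based pattern, using pthread mutexes
 * in place of the kernel mutex; the structures and function names are
 * simplified stand-ins, not the real amdkfd code.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct dqm {
    pthread_mutex_t lock;
    bool device_stopped; /* set across reset/suspend instead of holding lock */
};

/* pre_reset only flips the flag under the lock and releases it again,
 * so a failed reset cannot leave dqm->lock held forever. */
static void dqm_pre_reset(struct dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->device_stopped = true;
    pthread_mutex_unlock(&dqm->lock);
}

static void dqm_post_reset(struct dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->device_stopped = false;
    pthread_mutex_unlock(&dqm->lock);
}

/* Later operations check the flag and fail fast instead of blocking
 * on a lock that would never be released. */
static int dqm_map_queue(struct dqm *dqm)
{
    int ret = 0;

    pthread_mutex_lock(&dqm->lock);
    if (dqm->device_stopped)
        ret = -11; /* -EAGAIN */
    pthread_mutex_unlock(&dqm->lock);
    return ret;
}
```

The key point is that the lock is only ever held for the duration of a single flag update, so a reset that never completes cannot wedge every later dqm operation.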

Re: [PATCH] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
On 2019-10-21 9:03 p.m., Kuehling, Felix wrote: > > On 2019-10-21 5:04 p.m., Yang, Philip wrote: >> If device reset/suspend/resume failed for some reason, dqm lock is >> hold forever and this causes deadlock. Below is a kernel backtrace when >> application open kfd aft

[PATCH] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-21 Thread Yang, Philip
If device reset/suspend/resume fails for some reason, the dqm lock is held forever and this causes a deadlock. Below is a kernel backtrace from when an application opens kfd after suspend/resume failed. Instead of holding the dqm lock in pre_reset and releasing it in post_reset, add a dqm->device_stopped flag w

[PATCH v2] drm/amdkfd: kfd open return failed if device is locked

2019-10-18 Thread Yang, Philip
If the device is locked for suspend and resume, kfd open should fail with -EAGAIN without creating a process; otherwise the application exit path, which releases the process, will hang waiting for resume to finish if suspend and resume are stuck somewhere. This is the backtrace: v2: fix processes that were create
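A minimal sketch of the open-path check in plain C. The structures and the kfd_open() signature are hypothetical stand-ins for the real amdkfd interfaces; the point is only the ordering — fail before any process state exists, so nothing is left to tear down.

```c
/*
 * Minimal model of the open-path check: fail with -EAGAIN before any
 * process state is created. Structures and kfd_open() are hypothetical
 * stand-ins for the real amdkfd interfaces.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct kfd_dev { bool locked_for_reset; };
struct kfd_process { int refcount; };

static struct kfd_process process_slot;

static struct kfd_process *create_process(void)
{
    process_slot.refcount = 1;
    return &process_slot;
}

static int kfd_open(struct kfd_dev *dev, struct kfd_process **out)
{
    *out = NULL;
    /* Device locked for suspend/resume: return -EAGAIN before creating
     * the process, so the exit path has nothing to wait on. */
    if (dev->locked_for_reset)
        return -11; /* -EAGAIN */
    *out = create_process();
    return 0;
}
```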

Re: [PATCH] drm/amdkfd: kfd open return failed if device is locked

2019-10-18 Thread Yang, Philip
On 2019-10-18 11:40 a.m., Kuehling, Felix wrote: > On 2019-10-18 10:27 a.m., Yang, Philip wrote: >> If the device is locked for suspend and resume, kfd open should return >> -EAGAIN without creating a process, otherwise the application exit >> to release the process will ha

[PATCH] drm/amdkfd: kfd open return failed if device is locked

2019-10-18 Thread Yang, Philip
If the device is locked for suspend and resume, kfd open should fail with -EAGAIN without creating a process; otherwise the application exit path, which releases the process, will hang waiting for resume to finish if suspend and resume are stuck somewhere. This is the backtrace: [Thu Oct 17 16:43:37 2019] INFO: t

Re: [PATCH] drm/amdgpu: fix compiler warnings for df perfmons

2019-10-17 Thread Yang, Philip
Reviewed-by: Philip Yang On 2019-10-17 1:56 p.m., Kim, Jonathan wrote: > fixing compiler warnings in df v3.6 for c-state toggle and pmc count. > > Change-Id: I74f8f1eafccf523a89d60d005e3549235f75c6b8 > Signed-off-by: Jonathan Kim > --- > drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 4 ++-- > 1 fil

Re: [PATCH] drm/amdgpu: disable c-states on xgmi perfmons

2019-10-17 Thread Yang, Philip
I got compiler warnings after updating this morning, because the variables are not initialized on the df_v3_6_set_df_cstate() failure return path. CC [M] drivers/gpu/drm/amd/amdgpu/gmc_v9_0.o CC [M] drivers/gpu/drm/amd/amdgpu/gfxhub_v1_1.o /home/yangp/git/compute_staging/kernel/drivers/gpu/drm/am

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-17 Thread Yang, Philip
On 2019-10-17 4:54 a.m., Christian König wrote: > Am 16.10.19 um 18:04 schrieb Jason Gunthorpe: >> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote: >>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe: From: Jason Gunthorpe 8 of the mmu_notifier using drivers (i915_gem

Re: [PATCH v3] drm/amdgpu: user pages array memory leak fix

2019-10-11 Thread Yang, Philip
On 2019-10-11 1:33 p.m., Kuehling, Felix wrote: > On 2019-10-11 10:36 a.m., Yang, Philip wrote: >> user_pages array should always be freed after validation regardless if >> user pages are changed after bo is created because with HMM change parse >> bo always allocate user pa

[PATCH v3] drm/amdgpu: user pages array memory leak fix

2019-10-11 Thread Yang, Philip
The user_pages array should always be freed after validation, regardless of whether the user pages changed after the bo was created, because with the HMM change, parsing the bo always allocates a user_pages array to get user pages for the userptr bo. v2: remove unused local variable and amend commit v3: add back get user pages in ge
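The leak fix described above can be sketched in plain C. This is a simplified model, not the real TTM code: malloc/free stand in for kvmalloc/kvfree, and the struct is a stand-in for the driver's ttm_tt state.

```c
/*
 * Sketch of the leak fix: the user_pages array allocated for the
 * userptr bo must be freed on every path after validation, not only
 * when the pages changed. malloc/free stand in for kvmalloc/kvfree,
 * and the struct is a simplified stand-in for the TTM one.
 */
#include <assert.h>
#include <stdlib.h>

struct ttm_tt { void **user_pages; };

static int get_user_pages(struct ttm_tt *tt, size_t npages)
{
    tt->user_pages = calloc(npages, sizeof(void *));
    return tt->user_pages ? 0 : -12; /* -ENOMEM */
}

/* Called unconditionally after validation: freeing regardless of
 * whether the pages were invalidated is what plugs the leak. */
static void free_user_pages(struct ttm_tt *tt)
{
    free(tt->user_pages);
    tt->user_pages = NULL;
}
```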

Re: [PATCH] drm/amdgpu: user pages array memory leak fix

2019-10-11 Thread Yang, Philip
On 2019-10-11 4:40 a.m., Christian König wrote: > Am 03.10.19 um 21:44 schrieb Yang, Philip: >> user_pages array should always be freed after validation regardless if >> user pages are changed after bo is created because with HMM change parse >> bo always allocate user pa

Re: [PATCH] drm/amdgpu: user pages array memory leak fix

2019-10-04 Thread Yang, Philip
invalidated when amdgpu_cs_submit. I didn't find any issue in an overnight test, but I'm not sure whether there is a potential side effect. Thanks, Philip On 2019-10-03 3:44 p.m., Yang, Philip wrote: > user_pages array should always be freed after validation regardless if > user pages are changed after bo

[PATCH] drm/amdgpu: user pages array memory leak fix

2019-10-03 Thread Yang, Philip
The user_pages array should always be freed after validation, regardless of whether the user pages changed after the bo was created, because with the HMM change, parsing the bo always allocates a user_pages array to get user pages for the userptr bo. There is no need to get user pages while creating the userptr bo because user pages will only b

[PATCH] drm/amdgpu: user pages array memory leak fix

2019-10-03 Thread Yang, Philip
The user_pages array should be freed regardless of whether the user pages were invalidated after the bo was created, because the HMM change always allocates a user_pages array to get user pages while parsing a user page bo. There is no need to get user pages while creating the bo because user pages will only be used after parsing us

Re: [PATCH 9/9] drm/amdgpu: add graceful VM fault handling v2

2019-09-09 Thread Yang, Philip
On 2019-09-09 8:03 a.m., Christian König wrote: > Am 04.09.19 um 22:12 schrieb Yang, Philip: >> This series looks nice and clear for me, two questions embedded below. >> >> Are we going to use dedicated sdma page queue for direct VM update path >> during a fault? >&

[PATCH] drm/amdgpu: check if nbio->ras_if exist

2019-09-06 Thread Yang, Philip
To avoid a NULL function pointer access. This happens on VG10: the reboot command hangs and we have to power off/on to reboot the machine. This is the serial console log: [ OK ] Reached target Unmount All Filesystems. [ OK ] Reached target Final Step. Starting Reboot... [ 305.696271] systemd-shut

Re: [PATCH 1/1] drm/amdgpu: Disable retry faults in VMID0

2019-09-05 Thread Yang, Philip
The VMID0 init path was missed when enabling the amdgpu_noretry option. Good catch and fix. Reviewed-by: Philip Yang On 2019-09-04 7:31 p.m., Kuehling, Felix wrote: > There is no point retrying page faults in VMID0. Those faults are > always fatal. > > Signed-off-by: Felix Kuehling > --- > drivers/

Re: [PATCH 9/9] drm/amdgpu: add graceful VM fault handling v2

2019-09-04 Thread Yang, Philip
This series looks nice and clear to me; two questions embedded below. Are we going to use a dedicated sdma page queue for the direct VM update path during a fault? Thanks, Philip On 2019-09-04 11:02 a.m., Christian König wrote: > Next step towards HMM support. For now just silence the retry fault an

Re: [PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-16 Thread Yang, Philip
On 2019-08-15 8:54 p.m., Jason Gunthorpe wrote: > On Thu, Aug 15, 2019 at 08:52:56PM +0000, Yang, Philip wrote: >> hmm_range_fault may return NULL pages because some of pfns are equal to >> HMM_PFN_NONE. This happens randomly under memory pressure. The reason is >> for swap

[PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-15 Thread Yang, Philip
hmm_range_fault may return NULL pages because some of the pfns are equal to HMM_PFN_NONE. This happens randomly under memory pressure. The reason is that on the swapped-out page pte path, hmm_vma_handle_pte doesn't update the fault variable from cpu_flags, so it fails to call hmm_vma_do_fault to swap the page in.
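The decision described above can be modeled with a small helper. This is a simplified reconstruction only — the real logic lives in mm/hmm.c, and hmm_pte_need_fault() here is a hypothetical function: for a swapped-out pte the cpu_flags lack a valid pfn, so the fault must be derived from cpu_flags rather than skipped.

```c
/*
 * Simplified reconstruction of the fault decision; the real logic
 * lives in mm/hmm.c and hmm_pte_need_fault() here is a hypothetical
 * helper. For a swapped-out pte the cpu_flags lack HMM_PFN_VALID, so
 * comparing against cpu_flags is what triggers the swap-in fault.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HMM_PFN_VALID (1u << 0)
#define HMM_PFN_WRITE (1u << 1)

/* req_flags: what the caller requires; cpu_flags: what the pte
 * currently provides. Returns true when the page must be faulted in. */
static bool hmm_pte_need_fault(uint32_t req_flags, uint32_t cpu_flags)
{
    /* swapped out or unmapped: no valid pfn, must fault to swap in */
    if ((req_flags & HMM_PFN_VALID) && !(cpu_flags & HMM_PFN_VALID))
        return true;
    /* write requested but the pte is read-only */
    if ((req_flags & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE))
        return true;
    return false;
}
```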

Re: [PATCH 1/1] drm/amdkfd: Consistently apply noretry setting

2019-07-04 Thread Yang, Philip
On 2019-07-04 12:02 p.m., Kuehling, Felix wrote: > On 2019-07-03 6:19 p.m., Yang, Philip wrote: >> amdgpu_noretry default value is 0, this will generate VM fault storm >> because the vm fault is not recovered. It may slow down the machine and >> need reboot after applic

Re: [PATCH 1/1] drm/amdkfd: Consistently apply noretry setting

2019-07-03 Thread Yang, Philip
The amdgpu_noretry default value is 0, which will generate a VM fault storm because the VM fault is not recovered. It may slow down the machine and require a reboot after an application VM fault. Maybe change the default value to 1? Other than that, this is reviewed by Philip Yang On 2019-07-02 3:05 p.m., Kuehli

[PATCH] drm/amdgpu: improve HMM error -ENOMEM and -EBUSY handling

2019-06-14 Thread Yang, Philip
Under memory pressure, hmm_range_fault may return error code -ENOMEM or -EBUSY; change pr_info to pr_debug to remove an unnecessary kernel log message, because we will retry the restore anyway. Calling get_user_pages_done when TTM get user pages failed would produce a WARN_ONCE kernel call-stack dump. Change-
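The retry behaviour can be sketched as follows. This is an illustrative model, not the driver code: the scripted `results` array stands in for successive hmm_range_fault calls, and the transient errors are simply retried (quietly, hence pr_debug rather than pr_info).

```c
/*
 * Sketch of the retry behaviour: -ENOMEM and -EBUSY are transient
 * under memory pressure and the restore is simply retried, which is
 * why they no longer deserve a pr_info message. The scripted `results`
 * array is a stand-in for successive hmm_range_fault calls.
 */
#include <assert.h>
#include <stddef.h>

static int fault_with_retry(const int *results, size_t n, int max_retries)
{
    int retries = 0;

    for (size_t i = 0; i < n; i++) {
        int r = results[i];

        if ((r == -12 /* -ENOMEM */ || r == -16 /* -EBUSY */) &&
            retries++ < max_retries)
            continue; /* transient: retry the restore quietly */
        return r;     /* success (0) or a hard/unretried error */
    }
    return -16; /* still busy when the script ran out */
}
```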

Re: [PATCH 1/4] Revert "drm/amdkfd: Fix sdma queue allocate race condition"

2019-06-14 Thread Yang, Philip
I just figured out that the previous patch has an issue. The new patch is simple and looks good to me. This series is Reviewed-by: Philip.Yang On 2019-06-14 9:27 p.m., Zeng, Oak wrote: > This reverts commit 0a7c7281bdaae8cf63d77be26a4b46128114bdec. > This fix is not proper. allocate_mqd can't be moved before

Re: [PATCH] drm/amdgpu: Need to set the baco cap before baco reset

2019-06-14 Thread Yang, Philip
Hi Emily, I am not familiar with the vbios and driver init part; just based on my experience, the patch doesn't modify amdgpu_get_bios but moves amdgpu_get_bios to amdgpu_device_ip_early_init from amdgpu_device_init, so amdgpu_get_bios is executed earlier. The kernel error message "BUG: kernel NULL p

Re: [PATCH] drm/amdgpu: only use kernel zone if need_dma32 is not required

2019-06-13 Thread Yang, Philip
On 2019-06-13 4:54 a.m., Koenig, Christian wrote: > Am 12.06.19 um 23:13 schrieb Yang, Philip: >> On 2019-06-12 3:28 p.m., Christian König wrote: >>> Am 12.06.19 um 17:13 schrieb Yang, Philip: >>>> TTM create two zones, kernel zone and dma32 zone for system memory.

Re: [PATCH v2 hmm 00/11] Various revisions from a locking/code review

2019-06-12 Thread Yang, Philip
Rebased to the https://github.com/jgunthorpe/linux.git hmm branch; some changes were needed because of the hmm_range_register interface change. Then I ran a quick amdgpu_test. The test finished and the result is ok, but there is the kernel BUG message below; it seems hmm_free_rcu calls down_write. [ 1171.919921] BUG: sleep

Re: [PATCH] drm/amdgpu: only use kernel zone if need_dma32 is not required

2019-06-12 Thread Yang, Philip
On 2019-06-12 3:28 p.m., Christian König wrote: > Am 12.06.19 um 17:13 schrieb Yang, Philip: >> TTM create two zones, kernel zone and dma32 zone for system memory. If >> system memory address allocated is below 4GB, this account to dma32 zone >> and will exhaust dma32 zone a

Re: [PATCH] drm/amdgpu: only use kernel zone if need_dma32 is not required

2019-06-12 Thread Yang, Philip
e_flags. Is that chain broken > somewhere? Overriding glob->mem_glob->num_zones from amdgpu seems to be > a bit of a hack. > > Regards, >   Felix > > On 2019-06-12 8:13, Yang, Philip wrote: >> TTM create two zones, kernel zone and dma32 zone for system memory. If &

[PATCH] drm/amdgpu: only use kernel zone if need_dma32 is not required

2019-06-12 Thread Yang, Philip
TTM creates two zones for system memory: a kernel zone and a dma32 zone. If the allocated system memory address is below 4GB, it is accounted to the dma32 zone, which will exhaust the dma32 zone and trigger unnecessary TTM eviction. The patch "drm/ttm: Account for kernel allocations in kernel zone only" only handles the alloc
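The accounting issue can be illustrated with a small model. The structure and function names here are simplified stand-ins for TTM's memory-global code, not the real API: the point is that sub-4GB allocations should only be charged to the dma32 zone when DMA32 is actually required.

```c
/*
 * Illustrative model of the zone accounting issue: structure names are
 * simplified stand-ins for TTM's memory-global code. Charging every
 * sub-4GB allocation to dma32 exhausts that zone; only allocations
 * that actually need DMA32 should be charged there.
 */
#include <assert.h>
#include <stdint.h>

enum { ZONE_KERNEL, ZONE_DMA32, NUM_ZONES };

struct zones { uint64_t used[NUM_ZONES]; };

static void account_alloc(struct zones *z, uint64_t phys_addr,
                          uint64_t size, int need_dma32)
{
    z->used[ZONE_KERNEL] += size;
    /* Charge dma32 only when the device requires it, so ordinary
     * sub-4GB pages no longer exhaust the zone and trigger eviction. */
    if (need_dma32 && phys_addr + size <= (1ull << 32))
        z->used[ZONE_DMA32] += size;
}
```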

[PATCH] drm/amdgpu: use new HMM APIs and helpers v4

2019-06-04 Thread Yang, Philip
HMM provides new APIs and helpers in kernel 5.2-rc1 to simplify the driver path. The old hmm APIs are deprecated and will be removed in the future. Below are the changes in the driver: 1. Change hmm_vma_fault to hmm_range_register and hmm_range_fault, which support a range with multiple vmas; remove the multiple vma

Re: [PATCH] drm/amdgpu: use new HMM APIs and helpers v3

2019-06-03 Thread Yang, Philip
On 2019-06-03 5:02 p.m., Kuehling, Felix wrote: > On 2019-06-03 2:44 p.m., Yang, Philip wrote: >> HMM provides new APIs and helps in kernel 5.2-rc1 to simplify driver >> path. The old hmm APIs are deprecated and will be removed in future. >> >> Below are changes

[PATCH] drm/amdgpu: use new HMM APIs and helpers v3

2019-06-03 Thread Yang, Philip
HMM provides new APIs and helpers in kernel 5.2-rc1 to simplify the driver path. The old hmm APIs are deprecated and will be removed in the future. Below are the changes in the driver: 1. Change hmm_vma_fault to hmm_range_register and hmm_range_fault, which support a range with multiple vmas; remove the multiple vma

Re: [PATCH] drm/amdgpu: use new HMM APIs and helpers

2019-06-03 Thread Yang, Philip
On 2019-06-03 7:23 a.m., Christian König wrote: > Am 03.06.19 um 12:17 schrieb Christian König: >> Am 01.06.19 um 00:01 schrieb Kuehling, Felix: >>> On 2019-05-31 5:32 p.m., Yang, Philip wrote: >>>> On 2019-05-31 3:42 p.m., Kuehling, Felix wrote: >>>>>

Re: [PATCH] drm/amdgpu: use new HMM APIs and helpers

2019-05-31 Thread Yang, Philip
On 2019-05-31 3:42 p.m., Kuehling, Felix wrote: > On 2019-05-31 1:28 p.m., Yang, Philip wrote: >> >> On 2019-05-30 6:36 p.m., Kuehling, Felix wrote: >>>> >>>> #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR) >>>> - if (gtt->ranges &&

[PATCH] drm/amdgpu: use new HMM APIs and helpers v2

2019-05-31 Thread Yang, Philip
HMM provides new APIs and helpers in kernel 5.2-rc1 to simplify the driver path. The old hmm APIs are deprecated and will be removed in the future. Below are the changes in the driver: 1. Change hmm_vma_fault to hmm_range_register and hmm_range_fault, which support a range with multiple vmas; remove the multiple vma

Re: [PATCH] drm/amdgpu: use new HMM APIs and helpers

2019-05-31 Thread Yang, Philip
On 2019-05-30 6:36 p.m., Kuehling, Felix wrote: >> >>#if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR) >> -if (gtt->ranges && >> -ttm->pages[0] == hmm_pfn_to_page(&gtt->ranges[0], >> - gtt->ranges[0].pfns[0])) >> +if (gtt->range && >> +

[PATCH] drm/amdgpu: use new HMM APIs and helpers

2019-05-30 Thread Yang, Philip
HMM provides new APIs and helpers in kernel 5.2-rc1 to simplify the driver path. The old hmm APIs are deprecated and will be removed in the future. Below are the changes in the driver: 1. Change hmm_vma_fault to hmm_range_register and hmm_range_fault, which support a range with multiple vmas; remove the multiple vma

Re: [PATCH 1/1] drm/amdgpu: Improve error handling for HMM

2019-05-09 Thread Yang, Philip
On 2019-05-07 5:52 p.m., Kuehling, Felix wrote: > Use unsigned long for number of pages. > > Check that pfns are valid after hmm_vma_fault. If they are not, > return an error instead of continuing with invalid page pointers and > PTEs. > > Signed-off-by: Felix Kuehling Reviewed-by: Philip Yang

[PATCH] drm: increase drm mmap_range size to 1TB

2019-04-17 Thread Yang, Philip
After the patch "drm: Use the same mmap-range offset and size for GEM and TTM", applications failed to create a bo of system memory because the drm mmap_range size decreased to 64GB from the original 1TB. This is not big enough for applications. Increase the drm mmap_range size to 1TB. Change-Id: Id482af261f56f32

Re: Userptr broken with the latest amdgpu driver

2019-04-08 Thread Yang, Philip
Hi Marek, I guess you are using an old kernel config with a 5.x kernel, and the kernel config option CONFIG_HMM is missing because its dependency option CONFIG_ZONE_DEVICE is missing in the old config file. Please update your kernel config file to enable the option CONFIG_ZONE_DEVICE. You should have this

Re: kernel errors with HMM enabled

2019-03-14 Thread Yang, Philip
Hi Tom, Yes, we are missing some HMM fixes/changes from 5.1, but the crash log seems unrelated to those fixes/changes in 5.1. I did see a similar crash log in the __mmu_notifier_release path that should be fixed by the patch "use reference counting for HMM struct", as Alex mentioned. Since you

Re: [PATCH 2/3] drm/amdgpu: support userptr cross VMAs case with HMM v2

2019-03-12 Thread Yang, Philip
Hi Felix, Submitted v3 to fix the potential problems with invalid userptr. Philip On 2019-03-12 3:30 p.m., Kuehling, Felix wrote: > See one comment inline. There are still some potential problems that > you're not catching. > > On 2019-03-06 9:42 p.m., Yang, Philip wrote: >

[PATCH 2/3] drm/amdgpu: support userptr cross VMAs case with HMM v3

2019-03-12 Thread Yang, Philip
A userptr may cross two VMAs if the forked child process (which does not call exec after fork) mallocs a buffer, frees it, and then mallocs a larger buffer: the kernel will create a new VMA adjacent to the old VMA that was cloned from the parent process, so some pages of the userptr are in the first VMA and the rest of the pages are in the
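The cross-VMA walk can be sketched as follows. This is an illustrative model only: the linked list mimics the kernel's sorted VMA list, and vmas_covering() is a hypothetical helper, not the real mm API — the point is that the range must be covered by walking every overlapping VMA rather than assuming one.

```c
/*
 * Sketch of the cross-VMA corner case: a userptr range may span two
 * adjacent VMAs, so the code must walk every VMA overlapping
 * [start, end) rather than assume a single one. The list below models
 * the kernel's sorted VMA list, not the real mm API.
 */
#include <assert.h>
#include <stddef.h>

struct vma {
    unsigned long start, end; /* [start, end) */
    struct vma *next;         /* sorted, non-overlapping */
};

/* Count the VMAs needed to cover [start, end); -1 means a hole, i.e.
 * the userptr range is not fully mapped. */
static int vmas_covering(const struct vma *head, unsigned long start,
                         unsigned long end)
{
    int n = 0;
    unsigned long pos = start;

    for (const struct vma *v = head; v && pos < end; v = v->next) {
        if (v->end <= pos)
            continue; /* entirely before the range */
        if (v->start > pos)
            return -1; /* hole before this VMA */
        pos = v->end;
        n++;
    }
    return pos >= end ? n : -1;
}
```

In the fork-then-malloc scenario above, the walk would find two adjacent VMAs covering the userptr range instead of failing on the single-VMA assumption.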

Re: [PATCH 3/6] drm/amdgpu: allocate VM PDs/PTs on demand

2019-03-12 Thread Yang, Philip
The VM fault happens in about 1 out of 10 runs of KFDCWSRTest.BasicTest for me. I am using SDMA for page table updates. I didn't try CPU page table updates. Philip On 2019-03-12 11:12 a.m., Russell, Kent wrote: > Peculiar, I hit it immediately when I ran it . Can you try use > --gtest_filter=KFDCWSRTest.BasicTest

[PATCH 2/3] drm/amdgpu: support userptr cross VMAs case with HMM v2

2019-03-06 Thread Yang, Philip
A userptr may cross two VMAs if the forked child process (which does not call exec after fork) mallocs a buffer, frees it, and then mallocs a larger buffer: the kernel will create a new VMA adjacent to the old VMA that was cloned from the parent process, so some pages of the userptr are in the first VMA and the rest of the pages are in the

[PATCH 1/3] drm/amdkfd: support concurrent userptr update for HMM v2

2019-03-06 Thread Yang, Philip
Userptr restore may race with a concurrent userptr invalidation after hmm_vma_fault adds the range to the hmm->ranges list; we need to call hmm_vma_range_done to remove the range from the hmm->ranges list first, then reschedule the restore worker. Otherwise hmm_vma_fault will add the same range to the list, and this will

Re: [PATCH 2/3] drm/amdgpu: support userptr cross VMAs case with HMM

2019-03-06 Thread Yang, Philip
I will submit v2 to fix those issues. Some comments inline... On 2019-03-06 3:11 p.m., Kuehling, Felix wrote: > Some comments inline ... > > On 3/5/2019 1:09 PM, Yang, Philip wrote: >> userptr may cross two VMAs if the forked child process (not call exec >> after fork) mal

Re: [PATCH 1/3] drm/amdkfd: support concurrent userptr update for HMM

2019-03-06 Thread Yang, Philip
at that point needs to be untracked. > > For now as a quick fix for an urgent bug, this change is Reviewed-by: > Felix Kuehling . But please revisit this and > check if there are similar corner cases as I explained above. > > Regards, >   Felix > > On 3/5/2019 1:09 PM, Yang,

[PATCH 3/3] drm/amdgpu: more descriptive message if HMM not enabled v3

2019-03-06 Thread Yang, Philip
If using an old kernel config file, CONFIG_ZONE_DEVICE is not selected, so CONFIG_HMM and CONFIG_HMM_MIRROR are not enabled, and the current driver error message "Failed to register MMU notifier" is not clear. Inform the user with a more descriptive message on how to fix the missing kernel config option. Bugzil

Re: [PATCH 3/3] drm/amdgpu: more descriptive message if HMM not enabled v2

2019-03-06 Thread Yang, Philip
On 2019-03-06 10:04 a.m., Christian König wrote: > Am 06.03.19 um 16:02 schrieb Yang, Philip: >> If using old kernel config file, CONFIG_ZONE_DEVICE is not selected, >> so CONFIG_HMM and CONFIG_HMM_MIRROR is not enabled, the current driver >> error message "Failed to

[PATCH 3/3] drm/amdgpu: more descriptive message if HMM not enabled v2

2019-03-06 Thread Yang, Philip
If using an old kernel config file, CONFIG_ZONE_DEVICE is not selected, so CONFIG_HMM and CONFIG_HMM_MIRROR are not enabled, and the current driver error message "Failed to register MMU notifier" is not clear. Inform the user with a more descriptive message on how to fix the missing kernel config option. Bugzil

Re: [PATCH 3/3] drm/amdgpu: more descriptive message if HMM not enabled

2019-03-06 Thread Yang, Philip
On 2019-03-06 4:05 a.m., Michel Dänzer wrote: > On 2019-03-05 7:09 p.m., Yang, Philip wrote: >> If using old kernel config file, CONFIG_ZONE_DEVICE is not selected, >> so CONFIG_HMM and CONFIG_HMM_MIRROR is not enabled, the current driver >> error message "Failed to reg

[PATCH 2/3] drm/amdgpu: support userptr cross VMAs case with HMM

2019-03-05 Thread Yang, Philip
A userptr may cross two VMAs if the forked child process (which does not call exec after fork) mallocs a buffer, frees it, and then mallocs a larger buffer: the kernel will create a new VMA adjacent to the old VMA that was cloned from the parent process, so some pages of the userptr are in the first VMA and the rest of the pages are in the

[PATCH 1/3] drm/amdkfd: support concurrent userptr update for HMM

2019-03-05 Thread Yang, Philip
Userptr restore may race with a concurrent userptr invalidation after hmm_vma_fault adds the range to the hmm->ranges list; we need to call hmm_vma_range_done to remove the range from the hmm->ranges list first, then reschedule the restore worker. Otherwise hmm_vma_fault will add the same range to the list, and this will

[PATCH 0/3] handle userptr corner cases with HMM path

2019-03-05 Thread Yang, Philip
Those corner cases are found by kfdtest.KFDIPCTest. Philip Yang (3): drm/amdkfd: support concurrent userptr update for HMM drm/amdgpu: support userptr cross VMAs case with HMM drm/amdgpu: more descriptive message if HMM not enabled .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 28 +++-

[PATCH 3/3] drm/amdgpu: more descriptive message if HMM not enabled

2019-03-05 Thread Yang, Philip
If using an old kernel config file, CONFIG_ZONE_DEVICE is not selected, so CONFIG_HMM and CONFIG_HMM_MIRROR are not enabled, and the current driver error message "Failed to register MMU notifier" is not clear. Inform the user with a more descriptive message on how to fix the missing kernel config option. Bugzil

Re: [PATCH] drm/amdgpu: handle userptr corner cases with HMM path

2019-03-04 Thread Yang, Philip
han one VMA, fail > 2. Loop over all the VMAs in the address range > > Thanks, >Felix > > -----Original Message- > From: amd-gfx On Behalf Of Yang, > Philip > Sent: Friday, March 01, 2019 12:30 PM > To: amd-gfx@lists.freedesktop.org > Cc: Yang, Ph

[PATCH] drm/amdgpu: handle userptr corner cases with HMM path

2019-03-01 Thread Yang, Philip
Those corner cases were found by kfdtest.KFDIPCTest. A userptr may cross two vmas if the forked child process (which does not call exec after fork) mallocs a buffer, frees it, and then mallocs a larger buffer: the kernel will create a new vma adjacent to the old vma that was cloned from the parent process, so some pages of the use

Re: KASAN caught amdgpu / HMM use-after-free

2019-02-28 Thread Yang, Philip
: > > [ Dropping Jérôme and the linux-mm list ] > > On 2019-02-27 7:48 p.m., Yang, Philip wrote: >> Hi Alex, >> >> Pushed, thanks. >> >> mm/hmm: use reference counting for HMM struct > > Thanks, but I'm not seeing it yet. Maybe it needs some sp

Re: KASAN caught amdgpu / HMM use-after-free

2019-02-27 Thread Yang, Philip
. > > Alex > > *From:* amd-gfx on behalf of > Yang, Philip > *Sent:* Wednesday, February 27, 2019 1:05 PM > *To:* Michel Dänzer; Jérôme Glisse > *Cc:* linux...@kvack.org; amd-gfx@lists.freedesktop.org > *Subject:* Re: KASAN caught amdgpu / HMM use-after-f

Re: KASAN caught amdgpu / HMM use-after-free

2019-02-27 Thread Yang, Philip
amd-staging-drm-next will rebase to kernel 5.1 to pick up this fix automatically. As a short-term workaround, please cherry-pick this fix into your local repository. Regards, Philip On 2019-02-27 12:33 p.m., Michel Dänzer wrote: > On 2019-02-27 6:14 p.m., Yang, Philip wrote: >>

Re: KASAN caught amdgpu / HMM use-after-free

2019-02-27 Thread Yang, Philip
Hi Michel, Yes, I found the same issue, and the bug has been fixed by Jerome: 876b462120aa mm/hmm: use reference counting for HMM struct The fix is on the hmm-for-5.1 branch; I cherry-picked it into my local branch to work around the issue. Regards, Philip On 2019-02-27 12:02 p.m., Michel Dänzer wrot

[PATCH] drm/amdgpu: fix HMM config dependency issue

2019-02-21 Thread Yang, Philip
Selecting only HMM_MIRROR produces kernel config dependency warnings if CONFIG_HMM is missing from the config; adding a dependency on HMM solves the issue. Add conditional compilation to fix compilation errors when HMM_MIRROR is not enabled because the HMM config is not enabled. Change-Id: I1b44a0b5285bbef5e98bfb045

Re: [PATCH] drm/amdgpu: select ARCH_HAS_HMM and ZONE_DEVICE option

2019-02-21 Thread Yang, Philip
Thanks Jerome for the correct HMM config option; selecting only HMM_MIRROR is not good enough because the CONFIG_HMM option may be missing. Adding a dependency on ARCH_HAS_HMM will solve the issue. I will submit a new patch to fix the compilation error when the HMM_MIRROR config is missing, and the HMM config depen

[PATCH] drm/amdgpu: select ARCH_HAS_HMM and ZONE_DEVICE option

2019-02-20 Thread Yang, Philip
Those options are needed to support HMM Change-Id: Ieb7bb3bcec07245d79a02793e6728228decc400a Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/Kconfig | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig index 960a

[PATCH 2/3] drm/amdkfd: avoid HMM change cause circular lock dependency v2

2019-02-06 Thread Yang, Philip
There is a circular lock dependency between the gfx and kfd paths with the HMM change: lock(dqm) -> bo::reserve -> amdgpu_mn_lock To avoid this, move init/unint_mqd() out of lock(dqm) to remove the nested locking between mmap_sem and bo::reserve. The locking order is: bo::reserve -> amdgpu_mn_lock(p->mn) Change-Id: I2ec0
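The lock-ordering fix can be modeled in plain C, with pthread mutexes standing in for the kernel locks. All names here are simplified stand-ins, not the real amdkfd code: the essential change is that the mqd initialization, which may take bo::reserve, runs before lock(dqm) is taken, so dqm is never held while bo::reserve is acquired.

```c
/*
 * Model of the deadlock avoidance, with pthread mutexes standing in
 * for the kernel locks: init_mqd (which may take mmap_sem and
 * bo::reserve) runs before lock(dqm) is taken, so dqm is never held
 * while bo::reserve is acquired. All names are simplified stand-ins.
 */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bo_reserve = PTHREAD_MUTEX_INITIALIZER;
static int mqd_initialized;

/* May sleep while taking mmap_sem / bo::reserve, so it must not run
 * under dqm_lock. */
static void init_mqd(void)
{
    pthread_mutex_lock(&bo_reserve);
    mqd_initialized = 1;
    pthread_mutex_unlock(&bo_reserve);
}

static int create_queue(void)
{
    int ret;

    init_mqd(); /* moved out of lock(dqm) to break the lock cycle */
    pthread_mutex_lock(&dqm_lock);
    ret = mqd_initialized ? 0 : -22; /* -EINVAL */
    pthread_mutex_unlock(&dqm_lock);
    return ret;
}
```

With this ordering the only remaining chain is bo::reserve -> amdgpu_mn_lock, so the dqm lock can no longer participate in the cycle.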

[PATCH 0/3] Use HMM to replace get_user_pages

2019-02-06 Thread Yang, Philip
Hi Christian, Resending patches 1/3 and 2/3 with Reviewed-by added in the comments. The change in patch 3/3: amdgpu_cs_submit and amdgpu_cs_ioctl return -EAGAIN to user space to retry cs_ioctl. Regards, Philip Philip Yang (3): drm/amdgpu: use HMM mirror callback to replace mmu notifier v7 drm/amdkfd: avoid HMM ch

[PATCH 1/3] drm/amdgpu: use HMM mirror callback to replace mmu notifier v7

2019-02-06 Thread Yang, Philip
Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in DRM_AMDGPU_USERPTR Kconfig. It supports both KFD userptr and gfx userptr paths. Change-Id: Ie62c3c5e3c5b8521ab3b438d1eff2aa2a003835e Signed-off-by: Philip Y

[PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v8

2019-02-06 Thread Yang, Philip
Use the HMM helper function hmm_vma_fault() to get the physical pages backing a userptr and start CPU page table update tracking of those pages. Then use hmm_vma_range_done() to check whether those pages were updated before amdgpu_cs_submit for gfx or before user queues are resumed for kfd. If userptr pages are upda

Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v7

2019-02-06 Thread Yang, Philip
ries might not be sufficient any more. > Yes, it looks better to handle retry from user space. The extra sys call overhead can be ignored because this does not happen all the time. I will submit new patch for review. Thanks, Philip On 2019-02-06 4:20 a.m., Christian König wrote: > Am 0

[PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v7

2019-02-05 Thread Yang, Philip
Use the HMM helper function hmm_vma_fault() to get the physical pages backing a userptr and start CPU page table update tracking of those pages. Then use hmm_vma_range_done() to check whether those pages were updated before amdgpu_cs_submit for gfx or before user queues are resumed for kfd. If userptr pages are upda

Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v6

2019-02-05 Thread Yang, Philip
Hi Christian, I will submit new patch for review, my comments embedded inline below. Thanks, Philip On 2019-02-05 1:09 p.m., Koenig, Christian wrote: > Am 05.02.19 um 18:25 schrieb Yang, Philip: >> [SNIP]+ >>>> +    if (r == -ERESTARTSYS) { >&

Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v6

2019-02-05 Thread Yang, Philip
Hi Christian, My comments are embedded below. I will submit another patch to address those. Thanks, Philip On 2019-02-05 6:52 a.m., Christian König wrote: > Am 04.02.19 um 19:23 schrieb Yang, Philip: >> Use HMM helper function hmm_vma_fault() to get physical pages backing >> us

[PATCH 1/3] drm/amdgpu: use HMM mirror callback to replace mmu notifier v7

2019-02-04 Thread Yang, Philip
Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in DRM_AMDGPU_USERPTR Kconfig. It supports both KFD userptr and gfx userptr paths. Change-Id: Ie62c3c5e3c5b8521ab3b438d1eff2aa2a003835e Signed-off-by: Philip Y

[PATCH 0/3] Use HMM to replace get_user_pages

2019-02-04 Thread Yang, Philip
Hi Christian, This patch is rebased to latest HMM. Please review the GEM and CS part changes in patch 3/3. Thanks, Philip Yang (3): drm/amdgpu: use HMM mirror callback to replace mmu notifier v7 drm/amdkfd: avoid HMM change cause circular lock dependency v2 drm/amdgpu: replace get_user_pa

[PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v6

2019-02-04 Thread Yang, Philip
Use HMM helper function hmm_vma_fault() to get physical pages backing userptr and start CPU page table update track of those pages. Then use hmm_vma_range_done() to check if those pages are updated before amdgpu_cs_submit for gfx or before user queues are resumed for kfd. If userptr pages are upda

[PATCH 2/3] drm/amdkfd: avoid HMM change cause circular lock dependency v2

2019-02-04 Thread Yang, Philip
There is circular lock between gfx and kfd path with HMM change: lock(dqm) -> bo::reserve -> amdgpu_mn_lock To avoid this, move init/unint_mqd() out of lock(dqm), to remove nested locking between mmap_sem and bo::reserve. The locking order is: bo::reserve -> amdgpu_mn_lock(p->mn) Change-Id: I2ec0
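The fix described above — performing the mqd init (which takes bo::reserve and can reach mmap_sem) before taking the dqm lock, so the two are never nested — can be modeled with a simple held-flag. This is a sketch under stated assumptions; `init_mqd_stub`/`create_queue_stub` are hypothetical names, and the assertion stands in for the lockdep rule that bo::reserve must never be acquired while dqm is held.

```c
#include <assert.h>

static int dqm_locked;   /* models whether the dqm lock is currently held */

/* Models init_mqd(): does the bo::reserve / allocation work. The assert
 * encodes the invariant the patch establishes: this must never run with
 * the dqm lock held, or the lock cycle dqm -> bo::reserve -> mn_lock
 * (vs. mn_lock -> ... -> dqm) becomes possible. */
static int init_mqd_stub(void)
{
	assert(!dqm_locked);
	/* ... bo::reserve, mqd allocation, mmap_sem via HMM ... */
	return 0;
}

/* Models create_queue(): heavy setup first, then a short dqm critical
 * section that only manipulates queue-manager lists. */
static int create_queue_stub(void)
{
	if (init_mqd_stub())
		return -1;
	dqm_locked = 1;          /* lock(dqm) */
	/* ... add queue to the device queue manager ... */
	dqm_locked = 0;          /* unlock(dqm) */
	return 0;
}
```

Shrinking the dqm critical section is the whole fix: once no sleeping lock is taken inside it, the ordering bo::reserve -> amdgpu_mn_lock can be kept globally consistent.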

Re: [PATCH 1/3] drm/amdgpu: use HMM mirror callback to replace mmu notifier v6

2019-02-04 Thread Yang, Philip
On 2019-02-04 10:18 a.m., Christian König wrote: > Am 04.02.19 um 16:06 schrieb Yang, Philip: >> Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables >> callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in >> DRM_AMDGPU_USERPTR Kconfig. >

[PATCH 2/3] drm/amdkfd: avoid HMM change cause circular lock dependency v2

2019-02-04 Thread Yang, Philip
There is circular lock between gfx and kfd path with HMM change: lock(dqm) -> bo::reserve -> amdgpu_mn_lock To avoid this, move init/unint_mqd() out of lock(dqm), to remove nested locking between mmap_sem and bo::reserve. The locking order is: bo::reserve -> amdgpu_mn_lock(p->mn) Change-Id: I2ec0

[PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v6

2019-02-04 Thread Yang, Philip
Use HMM helper function hmm_vma_fault() to get physical pages backing userptr and start CPU page table update track of those pages. Then use hmm_vma_range_done() to check if those pages are updated before amdgpu_cs_submit for gfx or before user queues are resumed for kfd. If userptr pages are upda

[PATCH 1/3] drm/amdgpu: use HMM mirror callback to replace mmu notifier v6

2019-02-04 Thread Yang, Philip
Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in DRM_AMDGPU_USERPTR Kconfig. It supports both KFD userptr and gfx userptr paths. The dependent HMM patchset from Jérôme Glisse are all merged into 4.20.0 kerne

[PATCH 0/3] Use HMM to replace get_user_pages

2019-02-04 Thread Yang, Philip
Hi Christian, This patch is rebased to latest HMM. Please review the GEM and CS part changes in patch 3/3. Regards, Philip Yang (3): drm/amdgpu: use HMM mirror callback to replace mmu notifier v6 drm/amdkfd: avoid HMM change cause circular lock dependency v2 drm/amdgpu: replace get_user_p

[PATCH] drm/amdgpu: use spin_lock_irqsave to protect vm_manager.pasid_idr

2019-01-31 Thread Yang, Philip
amdgpu_vm_get_task_info is called from interrupt handler and sched timeout workqueue, so it is needed to use irq version spin_lock to avoid deadlock. Change-Id: Ifedd4b97535bf0b5d3936edd2d9688957020efd4 --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 delet
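Why the irqsave variant matters here can be modeled in user space. This is a simplified model, not the kernel implementation: `fake_spin_lock_irqsave()` mimics how `spin_lock_irqsave()` disables local interrupts and returns the previous state in `flags`, so a lock also taken from an interrupt handler cannot be interrupted while held (which would self-deadlock when the handler tries to take it again).

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled = true;   /* models the local CPU interrupt state */
static bool lock_held;

/* Models spin_lock_irqsave(): save interrupt state, disable, take lock. */
static unsigned long fake_spin_lock_irqsave(void)
{
	unsigned long flags = irqs_enabled;
	irqs_enabled = false;   /* the local_irq_save() half */
	lock_held = true;
	return flags;
}

/* Models spin_unlock_irqrestore(): drop lock, restore the saved state
 * (restore, not unconditionally enable, so nesting works). */
static void fake_spin_unlock_irqrestore(unsigned long flags)
{
	lock_held = false;
	irqs_enabled = flags;
}

/* An amdgpu_vm_get_task_info-style lookup: inside the critical section
 * interrupts are off, so the interrupt handler that also takes this
 * lock can never preempt us here. */
static int lookup_pasid(void)
{
	unsigned long flags = fake_spin_lock_irqsave();
	int safe = (!irqs_enabled && lock_held);
	fake_spin_unlock_irqrestore(flags);
	return safe;
}
```

A plain spin_lock would leave interrupts on inside the critical section; an IRQ arriving then, whose handler takes the same lock, spins forever on the CPU that already holds it.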

Re: Yet another RX Vega hang with another kernel panic signature. WARNING: inconsistent lock state

2019-01-31 Thread Yang, Philip
I found same issue while debugging, I will submit patch to fix this shortly. Philip On 2019-01-30 10:35 p.m., Mikhail Gavrilov wrote: > Hi folks. > Yet another kernel panic happens while GPU again is hang: > > [ 1469.906798] > [ 1469.906799] WARNING: inconsistent

Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v6

2019-01-14 Thread Yang, Philip
Ping Christian, any comments for the GEM and CS part changes? Thanks, Philip On 2019-01-10 12:02 p.m., Yang, Philip wrote: > Use HMM helper function hmm_vma_fault() to get physical pages backing > userptr and start CPU page table update track of those pages. Then use > hmm_vma_range_

[PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v6

2019-01-10 Thread Yang, Philip
Use HMM helper function hmm_vma_fault() to get physical pages backing userptr and start CPU page table update track of those pages. Then use hmm_vma_range_done() to check if those pages are updated before amdgpu_cs_submit for gfx or before user queues are resumed for kfd. If userptr pages are upda

[PATCH 1/3] drm/amdgpu: use HMM mirror callback to replace mmu notifier v6

2019-01-10 Thread Yang, Philip
Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in DRM_AMDGPU_USERPTR Kconfig. It supports both KFD userptr and gfx userptr paths. The depdent HMM patchset from Jérôme Glisse are all merged into 4.20.0 kerne

[PATCH 2/3] drm/amdkfd: avoid HMM change cause circular lock dependency v2

2019-01-10 Thread Yang, Philip
There is circular lock between gfx and kfd path with HMM change: lock(dqm) -> bo::reserve -> amdgpu_mn_lock To avoid this, move init/unint_mqd() out of lock(dqm), to remove nested locking between mmap_sem and bo::reserve. The locking order is: bo::reserve -> amdgpu_mn_lock(p->mn) Change-Id: I2ec0

Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v5

2019-01-07 Thread Yang, Philip
On 2019-01-07 9:21 a.m., Christian König wrote: > Am 14.12.18 um 22:10 schrieb Yang, Philip: >> Use HMM helper function hmm_vma_fault() to get physical pages backing >> userptr and start CPU page table update track of those pages. Then use >> hmm_vma_range_done() to che

Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v5

2019-01-02 Thread Yang, Philip
ies is Reviewed-by: Felix > Kuehling > > Regards, >   Felix > > On 2018-12-14 4:10 p.m., Yang, Philip wrote: >> Use HMM helper function hmm_vma_fault() to get physical pages backing >> userptr and start CPU page table update track of those pages. Then use >>
