[AMD Official Use Only]

I just sent out patch below yesterday.  swapping unpopulated bo is useless 
indeed.

[RFC PATCH 2/2] drm/ttm: skip swapout when ttm has no backend page.

________________________________________
发件人: Christian König <ckoenig.leichtzumer...@gmail.com>
发送时间: 2021年5月20日 14:39
收件人: Pan, Xinhui; Kuehling, Felix; amd-gfx@lists.freedesktop.org
抄送: Deucher, Alexander; dan...@ffwll.ch; Koenig, Christian; 
dri-de...@lists.freedesktop.org
主题: Re: 回复: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to 
swapout and swapin

> swapout function create one swap storage which is filled with zero. And set 
> ttm->page_flags as TTM_PAGE_FLAG_SWAPPED.  Just because ttm has no backend 
> page this time, no real data is swapout to this swap storage.

That's the fundamental problem. A TT object which isn't populated
shouldn't be considered for swapout nor eviction in the first place.

I'm going to take a look later today.

Christian.

Am 20.05.21 um 04:55 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> swapout function create one swap storage which is filled with zero. And set 
> ttm->page_flags as TTM_PAGE_FLAG_SWAPPED.  Just because ttm has no backend 
> page this time, no real data is swapout to this swap storage.
>
> swapin function is called during populate as TTM_PAGE_FLAG_SWAPPED is set.
> Now here is the problem, we swapin data to ttm bakend memory from swap 
> storage. That just causes the memory been overwritten.
>
> ________________________________________
> 发件人: Christian König <ckoenig.leichtzumer...@gmail.com>
> 发送时间: 2021年5月19日 18:01
> 收件人: Pan, Xinhui; Kuehling, Felix; amd-gfx@lists.freedesktop.org
> 抄送: Deucher, Alexander; dan...@ffwll.ch; Koenig, Christian; 
> dri-de...@lists.freedesktop.org
> 主题: Re: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout 
> and swapin
>
> I'm scratching my head how that is even possible.
>
> See when a BO is created in the system domain it is just an empty hull,
> e.g. without backing store and allocated pages.
>
> So the swapout function will just ignore it.
>
> Christian.
>
> Am 19.05.21 um 07:07 schrieb Pan, Xinhui:
>> [AMD Official Use Only]
>>
>> I have reverted Chris'  patch, still hit this failure.
>> Just see two lines in Chris' patch. Any BO in cpu domian would be swapout 
>> first. That is why we hit this issue frequently now. But the bug is there 
>> long time ago.
>>
>> -       for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
>> -               list_for_each_entry(bo, &glob->swap_lru[i], swap) {
>> [snip]
>> +       for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
>> +               for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>>
>>
>> ________________________________________
>> 发件人: Pan, Xinhui <xinhui....@amd.com>
>> 发送时间: 2021年5月19日 12:09
>> 收件人: Kuehling, Felix; amd-gfx@lists.freedesktop.org
>> 抄送: Deucher, Alexander; Koenig, Christian; dri-de...@lists.freedesktop.org; 
>> dan...@ffwll.ch
>> 主题: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and 
>> swapin
>>
>> yes, we really dont swapout SG BOs.
>> The problems is that before we validate a userptr BO, we create this BO in 
>> CPU domain by default. So this BO has chance to swapout.
>>
>> we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too late.
>> I have not try to revert Chris' patch as I think it desnt help. Or I can 
>> have a try later.
>>
>> ________________________________________
>> 发件人: Kuehling, Felix <felix.kuehl...@amd.com>
>> 发送时间: 2021年5月19日 11:29
>> 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>> 抄送: Deucher, Alexander; Koenig, Christian; dri-de...@lists.freedesktop.org; 
>> dan...@ffwll.ch
>> 主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and 
>> swapin
>>
>> Swapping SG BOs makes no sense, because TTM doesn't own the pages of
>> this type of BO.
>>
>> Last I checked, userptr BOs (and other SG BOs) were protected from
>> swapout by the fact that they would not be added to the swap-LRU. But it
>> looks like Christian just removed the swap-LRU. I guess this broke that
>> protection:
>>
>> commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c
>> Author: Christian König <christian.koe...@amd.com>
>> Date:   Tue Oct 6 16:30:09 2020 +0200
>>
>>        drm/ttm: remove swap LRU v3
>>
>>        Instead evict round robin from each devices SYSTEM and TT domain.
>>
>>        v2: reorder num_pages access reported by Dan's script
>>        v3: fix rebase fallout, num_pages should be 32bit
>>
>>        Signed-off-by: Christian König <christian.koe...@amd.com>
>>        Tested-by: Nirmoy Das <nirmoy....@amd.com>
>>        Reviewed-by: Huang Rui <ray.hu...@amd.com>
>>        Reviewed-by: Matthew Auld <matthew.a...@intel.com>
>>        Link: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fpatch%2F424009%2F&amp;data=04%7C01%7CXinhui.Pan%40amd.com%7C2f5f749422eb44efa95408d91b5a029c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637570895679516712%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=I9K%2FsTdCESLlS9P5rLOGUqEGCcxi0Jc7aaEOltyBP4M%3D&amp;reserved=0
>>
>> Regards,
>>      Felix
>>
>>
>> On 2021-05-18 10:28 p.m., xinhui pan wrote:
>>> cpu 1                                           cpu 2
>>> kfd alloc BO A(userptr)                         alloc BO B(GTT)
>>>        ->init -> validate                               -> init -> validate 
>>> -> populate
>>>        init_user_pages                            -> swapout BO A //hit ttm 
>>> pages limit
>>>         -> get_user_pages (fill up ttm->pages)
>>>          -> validate -> populate
>>>              -> swapin BO A // Now hit the BUG
>>>
>>> We know that get_user_pages may race with swapout on same BO.
>>> Threre are some issues I have met.
>>> 1) memory corruption.
>>> This is because we do a swap before memory is setup. ttm_tt_swapout()
>>> just create a swap_storage with its content being 0x0. So when we setup
>>> memory after the swapout. The following swapin makes the memory
>>> corrupted.
>>>
>>> 2) panic
>>> When swapout happes with get_user_pages, they touch ttm->pages without
>>> anylock. It causes memory corruption too. But I hit page fault mostly.
>>>
>>> Signed-off-by: xinhui pan <xinhui....@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 +++++++++++++++-
>>>     1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 928e8d57cd08..42460e4480f8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem, 
>>> uint64_t user_addr)
>>>         struct amdkfd_process_info *process_info = mem->process_info;
>>>         struct amdgpu_bo *bo = mem->bo;
>>>         struct ttm_operation_ctx ctx = { true, false };
>>> +     struct page **pages;
>>>         int ret = 0;
>>>
>>>         mutex_lock(&process_info->lock);
>>> @@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem, 
>>> uint64_t user_addr)
>>>                 goto out;
>>>         }
>>>
>>> -     ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
>>> +     pages = kvmalloc_array(bo->tbo.ttm->num_pages,
>>> +                     sizeof(struct page *),
>>> +                     GFP_KERNEL | __GFP_ZERO);
>>> +     if (!pages)
>>> +             goto unregister_out;
>>> +
>>> +     ret = amdgpu_ttm_tt_get_user_pages(bo, pages);
>>>         if (ret) {
>>>                 pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
>>>                 goto unregister_out;
>>> @@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem, 
>>> uint64_t user_addr)
>>>                 pr_err("%s: Failed to reserve BO\n", __func__);
>>>                 goto release_out;
>>>         }
>>> +
>>> +     WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
>>> +
>>> +     memcpy(bo->tbo.ttm->pages,
>>> +                     pages,
>>> +                     sizeof(struct page*) * bo->tbo.ttm->num_pages);
>>>         amdgpu_bo_placement_from_domain(bo, mem->domain);
>>>         ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>>         if (ret)
>>> @@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem, 
>>> uint64_t user_addr)
>>>     release_out:
>>>         amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>>>     unregister_out:
>>> +     kvfree(pages);
>>>         if (ret)
>>>                 amdgpu_mn_unregister(bo);
>>>     out:
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7CXinhui.Pan%40amd.com%7C2f5f749422eb44efa95408d91b5a029c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637570895679516712%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=aZE0mxygik02mv9JUstV2BDKJl6zb1BNLggHq5WLhKY%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to