V5: Proposed IOCTL APIs for CRIU with consolidated feedback
CRIU is a user space tool which is very popular for container live
migration in datacentres. It can checkpoint a running application, save
its complete state, memory contents and all system resources to images
on disk which can be migrate
- Update debug config for Checkpoint-Restore (CR) support
- Also include necessary options for CR with docker containers.
Reviewed-by: Felix Kuehling
Signed-off-by: Rajneesh Bhardwaj
---
arch/x86/configs/rock-dbg_defconfig | 53 ++---
1 file changed, 34 insertions(+),
This IOCTL op is expected to be called as a precursor to the actual
Checkpoint operation. This does the basic discovery into the target
process seized by CRIU and relays the information to the userspace that
utilizes it to start the Checkpoint operation via another dedicated
IOCTL op.
The process_
This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.
This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore p
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can
snapshot a running process and later restore it on the same or a remote
machine but expects the processes that have a device file (e.g. GPU)
associated with them, provide necessary driver support to assist CRIU
and its extensible plugin
From: David Yat Sin
When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.
Signed-off-by: Rajneesh Bhardwaj
Signed-off-by: David Yat Sin
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev
This implements the KFD CRIU Restore ioctl that lays the basic
foundation for the CRIU restore operation. It provides support to
create the buffer objects corresponding to the checkpointed image.
This ioctl creates various types of buffer objects such as VRAM,
MMIO, Doorbell, GTT based on the date
From: David Yat Sin
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO
op the queues will stay in an evicted state. Once the plugin is done
draining BO contents, it is safe to perform an UNPAUSE op for the queues
to resume.
Signed-off-by: David Yat Sin
Signed-off-by: Ra
From: David Yat Sin
When re-creating queues during CRIU restore, restore the queue with the
same sdma id value used during CRIU dump.
Signed-off-by: David Yat Sin
Signed-off-by: Rajneesh Bhardwaj
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 48 ++-
.../drm/amd/amdkfd/kf
From: David Yat Sin
When re-creating queues during CRIU restore, restore the queue with the
same doorbell id value used during CRIU dump.
Signed-off-by: David Yat Sin
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 60 +--
1 file changed, 41 insertions(+), 19 deletions(-)
From: David Yat Sin
Add support to existing CRIU ioctl's to save number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.
Signed-off-by: David Yat Sin
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 110 ++
During checkpoint stage, save the shared virtual memory ranges and
attributes for the target process. A process may contain a number of svm
ranges and each range might contain a number of attributes. While not
all attributes may be applicable for a given prange but during
checkpoint we store all po
Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.
Signed-off-by: Rajneesh Bhardwaj
--
During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_ch
From: David Yat Sin
Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.
Signed-off-by: David Yat Sin
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
.../drm/a
- Change KFD minor version to 7 for CRIU
Proposed userspace changes:
https://github.com/RadeonOpenCompute/criu
Signed-off-by: Rajneesh Bhardwaj
---
include/uapi/linux/kfd_ioctl.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi
KFD buffer objects do not associate a GEM handle with them so cannot
directly be used with libdrm to initiate a system dma (sDMA) operation
to speed up the checkpoint and restore operation so export them as dmabuf
objects and use with libdrm helper (amdgpu_bo_import) to further process
the sdma comm
From: David Yat Sin
Add support to existing CRIU ioctl's to save and restore events during
criu checkpoint and restore.
Signed-off-by: David Yat Sin
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 70 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 272 +
Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore
support it's possible that the SVM ranges can be resumed on another node
where the actual_gpu_id may not be same as the original (user_gpu_id)
gpu id. So modify svm code to use user_gpu_id.
Signed-off-by: Rajneesh Bhardwaj
---
From: David Yat Sin
When doing a restore on a different node, the gpu_id's on the restore
node may be different. But the user space application will still use
the original gpu_id's in the ioctl calls. Adding code to create a
gpu id mapping so that kfd can determine actual gpu_id during the
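The gpu_id mapping described in this message can be pictured as a small lookup table. The following is only a userspace sketch under stated assumptions: the struct, the function name and the linear search are hypothetical and are not taken from the KFD patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair: the id the application saw at checkpoint time and
 * the id the same GPU slot has on the restore node. */
struct demo_gpu_id_map {
    uint32_t user_gpu_id;
    uint32_t actual_gpu_id;
};

/* Translate a user-visible gpu_id to the actual one.
 * Returns 0 and writes *actual_gpu_id on a hit, -1 if the id is unknown. */
int demo_lookup_actual_gpu_id(const struct demo_gpu_id_map *map, size_t count,
                              uint32_t user_gpu_id, uint32_t *actual_gpu_id)
{
    size_t i;

    assert(map != NULL && actual_gpu_id != NULL);
    for (i = 0; i < count; i++) {
        if (map[i].user_gpu_id == user_gpu_id) {
            *actual_gpu_id = map[i].actual_gpu_id;
            return 0;
        }
    }
    return -1;
}
```

With such a table, an ioctl handler would translate the id supplied by the application before looking up the device, so the application never needs to learn the new ids.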
A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by
From: David Yat Sin
Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.
Signed-off-by: David Yat Sin
Signed-off-by: Rajneesh Bhardwaj
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
..
In CRIU resume stage, resume all the shared virtual memory ranges from
the data stored inside the resuming kfd process during CRIU restore
phase. Also set up xnack mode and free up the resources.
KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr
interface but we must clear the
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct
from current but for a Checkpoint or Restore operation, the current->mm
will fetch the mm for the CRIU master process. So modify these helpers to
accept the task mm for a target kfd process to support Checkpoint
Restore.
Signed-o
On Thu, Feb 3, 2022 at 12:58 PM Jani Nikula wrote:
>
> On Mon, 27 Sep 2021, Fangzhi Zuo wrote:
> > +/* DSC Extended Capability Branch Total DSC Resources */
> > +#define DP_DSC_SUPPORT_AND_DSC_DECODER_COUNT 0x2260 /* 2.0 */
> > +# define DP_DSC_DECODER_COUNT_MASK (0b111
- move i915 buddy selftests into drm selftests folder
- add Makefile and Kconfig support
- add sanitycheck testcase
Prerequisites
- This series of selftest patches is created on top of
drm buddy series
- Enable kselftests for DRM as a module in .config
Signed-off-by: Arunpravin
---
drivers
add a test to check the maximum allocation limit
Signed-off-by: Arunpravin
---
.../gpu/drm/selftests/drm_buddy_selftests.h | 1 +
drivers/gpu/drm/selftests/test-drm_buddy.c| 60 +++
2 files changed, 61 insertions(+)
diff --git a/drivers/gpu/drm/selftests/drm_buddy_selftes
create a mm with one block of each order available, and
try to allocate them all.
Signed-off-by: Arunpravin
---
.../gpu/drm/selftests/drm_buddy_selftests.h | 1 +
drivers/gpu/drm/selftests/test-drm_buddy.c| 82 +++
2 files changed, 83 insertions(+)
diff --git a/drivers/gp
- add a test to check the range allocation
- export get_buddy() function in drm_buddy.c
- export drm_prandom_u32_max_state() in lib/drm_random.c
- include helper functions
- include prime number header file
Signed-off-by: Arunpravin
---
drivers/gpu/drm/drm_buddy.c | 20 +-
dri
create a pot-sized mm, then allocate one of each possible
order within. This should leave the mm with exactly one
page left.
Signed-off-by: Arunpravin
---
.../gpu/drm/selftests/drm_buddy_selftests.h | 1 +
drivers/gpu/drm/selftests/test-drm_buddy.c| 153 ++
2 files change
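The "exactly one page left" expectation in this test description follows from simple arithmetic: a pool of 2^n pages minus one block of each order 0..n-1 (which together hold 2^n - 1 pages) leaves a single page. The sketch below only checks that arithmetic; it does not use the drm_buddy API, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pages remaining after taking one block of every order 0..max_order-1
 * out of a power-of-two sized pool of 2^max_order pages. */
uint64_t demo_pages_left(unsigned int max_order)
{
    uint64_t total, used = 0;
    unsigned int order;

    assert(max_order < 64); /* keep the shifts within 64 bits */
    total = UINT64_C(1) << max_order;
    for (order = 0; order < max_order; order++)
        used += UINT64_C(1) << order; /* a block of this order holds 2^order pages */
    return total - used;
}
```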
- add a test to ascertain that the critical functionalities
of the program are working fine
- add a timeout helper function
Signed-off-by: Arunpravin
---
.../gpu/drm/selftests/drm_buddy_selftests.h | 1 +
drivers/gpu/drm/selftests/test-drm_buddy.c| 143 ++
2 files change
create a pot-sized mm, then allocate one of each possible
order within. This should leave the mm with exactly one
page left. Free the largest block, then whittle down again.
Eventually we will have a fully 50% fragmented mm.
Signed-off-by: Arunpravin
---
.../gpu/drm/selftests/drm_buddy_selftests
On Mon, 27 Sep 2021, Fangzhi Zuo wrote:
> +/* DSC Extended Capability Branch Total DSC Resources */
> +#define DP_DSC_SUPPORT_AND_DSC_DECODER_COUNT 0x2260 /* 2.0 */
> +# define DP_DSC_DECODER_COUNT_MASK (0b111 << 5)
> +# define DP_DSC_DECODER_COUNT_SHIFT
One nit pick.
Regards,
David
@@ -673,15 +693,19 @@ static int kfd_ioctl_dbg_address_watch(struct file *filep,
memset((void *) &aw_info, 0, sizeof(struct dbg_address_watch_info));
- dev = kfd_device_by_id(args->gpu_id);
- if (!dev)
+ mutex_lock(&p->mutex);
+ pdd
On 03.02.22 14:32, Arunpravin wrote:
- move i915 buddy selftests into drm selftests folder
- add Makefile and Kconfig support
- add sanitycheck testcase
Prerequisites
- This series of selftest patches is created on top of
drm buddy series
- Enable kselftests for DRM as a module in .con
On 2022-02-02 13:49, Fangzhi Zuo wrote:
> From: Wayne Lin
>
> [Why]
> commit "drm/amd/display: turn DPMS off on connector unplug" and
> commit "drm/amd/display: Clear dc remote sinks on MST disconnect"
> were trying to resolve the resource problem when we connectors get
> disconnected under MS
On 2022-02-03 00:04, Yang Li wrote:
Use resource_size function on resource object instead of explicit
computation.
Eliminate the following coccicheck warning:
./drivers/gpu/drm/amd/amdkfd/kfd_migrate.c:978:11-14: ERROR: Missing
resource_size with res
Reported-by: Abaci Robot
Signed-off-b
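For readers unfamiliar with the helper named in this message, the kernel's resource_size() returns end - start + 1 for an inclusive address range. The sketch below reproduces that computation in userspace; the struct and function names are stand-ins invented for illustration, and the NULL assert is this sketch's own addition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the two fields of the kernel's struct resource
 * that matter here. */
struct demo_resource {
    uint64_t start; /* first address of the range */
    uint64_t end;   /* last address of the range, inclusive */
};

/* Same arithmetic as the kernel helper: the range is inclusive, hence + 1. */
uint64_t demo_resource_size(const struct demo_resource *res)
{
    assert(res != NULL);
    return res->end - res->start + 1;
}
```

Using the helper instead of open-coding the subtraction avoids the easy off-by-one of forgetting the + 1.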
From: Wayne Lin
This patch lived in our internal branch since August
but somehow missed the merge to upstream.
Original Patch:
(dc: Handle removed connector in early_unregister)
Signed-off-by: Wayne Lin
Signed-off-by: Fangzhi Zuo
---
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 7
.
From: Wayne Lin
This patch lived in our internal branch since August
but somehow missed the merge to upstream.
Original patch description:
[Why]
commit "drm/amd/display: turn DPMS off on connector unplug" and
commit "drm/amd/display: Clear dc remote sinks on MST disconnect"
were trying to resol
On 2022-02-03 13:17, Fangzhi Zuo wrote:
> From: Wayne Lin
>
> This patch lived in our internal branch since August
> but somehow missed the merge to upstream.
>
> Original patch description:
>
> [Why]
> commit "drm/amd/display: turn DPMS off on connector unplug" and
> commit "drm/amd/display: C
Fixes hangs on driver load on DCN 2.0 parts.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215511
Fixes: ee2698cf79cc ("drm/amd/display: Changed pipe split policy to allow for
multi-display pipe split")
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c | 2
MIT.
Signed-off-by: Alex Deucher
---
.../gpu/drm/amd/include/asic_reg/dcn/dpcs_3_0_0_offset.h | 7 +++
.../gpu/drm/amd/include/asic_reg/dcn/dpcs_3_0_0_sh_mask.h | 7 +++
2 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/include/asic_reg/dcn/dpcs_3_0_0_offset.h
b/dri
To align with other headers.
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c | 4 ++--
.../amd/include/asic_reg/{dcn => dpcs}/dpcs_3_0_3_offset.h| 0
.../amd/include/asic_reg/{dcn => dpcs}/dpcs_3_0_3_sh_mask.h | 0
3 files changed, 2 insertions
To align with other headers.
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c | 4 ++--
drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c | 4 ++--
drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c | 4 ++--
drivers/gpu/drm/amd
These have been at production level for a while. Drop
the flag.
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
i
On 2022-02-03 14:09, Alex Deucher wrote:
These have been at production level for a while. Drop
the flag.
Signed-off-by: Alex Deucher
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/d
From: Zhan Liu
[ Upstream commit ac46d93235074a6c5d280d35771c23fd8620e7d9 ]
[Why]
DCN301 has seamless boot enabled. With MPC split enabled
at the same time, system will hang.
[How]
Revert MPC split policy back to "MPC_SPLIT_AVOID". Since we have
ODM combine enabled on DCN301, pipe split is not
From: Alex Deucher
[ Upstream commit dc919d670c6fd1ac81ebf31625cd19579f7b3d4c ]
Some architectures (e.g., ARM) have relatively low udelay limits.
On most architectures, anything longer than 2000us is not recommended.
Change the check to align with other similar checks in DC.
Reviewed-by: Harry
From: Alex Deucher
[ Upstream commit 98fdcacb45f7cd2092151d6af2e60152811eb79c ]
Some architectures (e.g., ARM) throw a compilation error if the
udelay is too long. In general udelays of longer than 2000us are
not recommended on any architecture. Switch to msleep in these
cases.
Reviewed-by:
From: Zhan Liu
[ Upstream commit ac46d93235074a6c5d280d35771c23fd8620e7d9 ]
[Why]
DCN301 has seamless boot enabled. With MPC split enabled
at the same time, system will hang.
[How]
Revert MPC split policy back to "MPC_SPLIT_AVOID". Since we have
ODM combine enabled on DCN301, pipe split is not
if (!(tmp & flag)) condition will always evaluate to true
when the flag is 0x0 (AMDGPU_RLCG_GC_WRITE). Instead check
that address bits are cleared to determine whether
the command is complete.
Signed-off-by: Victor Skvortsov
---
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +-
drivers/gpu/drm/am
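The problem described in this message can be reproduced in a few lines: masking any value with 0x0 yields 0, so the negated test is true no matter what was read back. In the sketch below only the 0x0 flag value comes from the message; the address-bit mask and all names are hypothetical placeholders, not the driver's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FLAG_WRITE 0x0u        /* the 0x0 flag value quoted in the message */
#define DEMO_ADDR_MASK  0x000FFFFFu /* hypothetical mask for the address bits */

/* Original style of check: always true when flag == 0. */
bool demo_done_by_flag(uint32_t tmp, uint32_t flag)
{
    return !(tmp & flag);
}

/* Alternative check: done only once the address bits read back as zero. */
bool demo_done_by_addr(uint32_t tmp)
{
    return !(tmp & DEMO_ADDR_MASK);
}

/* Small self-check showing the difference between the two tests. */
void demo_flag_selfcheck(void)
{
    assert(demo_done_by_flag(0x00012345u, DEMO_FLAG_WRITE)); /* "done" although bits are still set */
    assert(!demo_done_by_addr(0x00012345u));                 /* address bits still pending */
    assert(demo_done_by_addr(0x80000000u));                  /* address bits cleared */
}
```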
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g.,
[18028.206676] amdgpu :0b:00.0: amdgpu: df doesn't config ras function.
Make it dev_info_once(), as it isn't something correctible during boot, so
printing just once i
From: Roman Li
[Why]
pflip interrupt order is mapped 1 to 1 to otg id.
e.g. if irq_src=26 corresponds to otg0 then 27->otg1, 28->otg2...
Linux DM registers pflip interrupts per number of crtcs.
In fused pipe case crtc numbers can be less than otg id.
e.g. if one pipe out of 3(otg#0-2) is fused
[AMD Official Use Only]
We can probably just make these dev_dbg(). The vast majority of cards are
non-RAS. No need to print this at all in most cases.
Alex
From: Tuikov, Luben
Sent: Thursday, February 3, 2022 5:14 PM
To: amd-gfx@lists.freedesktop.org
Cc: Tui
On 2/3/2022 5:14 PM, roman...@amd.com wrote:
From: Roman Li
[Why]
pflip interrupt order is mapped 1 to 1 to otg id.
e.g. if irq_src=26 corresponds to otg0 then 27->otg1, 28->otg2...
Linux DM registers pflip interrupts per number of crtcs.
In fused pipe case crtc numbers can be less than otg i
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g.,
[18028.206676] amdgpu :0b:00.0: amdgpu: df doesn't config ras function.
Make it dev_dbg_once(), as it isn't something correctible during boot or
thereafter, so printin
On Thu, Feb 3, 2022 at 6:14 PM Luben Tuikov wrote:
>
> MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
> the dmesg log to fill up with the same message, e.g,
>
> [18028.206676] amdgpu :0b:00.0: amdgpu: df doesn't config ras function.
>
> Make it dev_dbg_once(), as it i
Prevent random memory access in the FRU EEPROM code by passing the size of
the destination buffer to the reading routine, and reading no more than the
size of the buffer.
Cc: Kent Russell
Cc: Alex Deucher
Signed-off-by: Luben Tuikov
---
.../gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c| 21 ++
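The idea in this message, passing the destination size to the reading routine and never copying more than that, can be sketched as follows. This is a generic userspace illustration, not the amdgpu FRU code: the function name and the use of memcpy in place of an I2C read are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a field of field_len bytes into dst, but never more than dst_size
 * bytes. Returns the number of bytes actually copied. */
size_t demo_read_field(unsigned char *dst, size_t dst_size,
                       const unsigned char *src, size_t field_len)
{
    size_t n = field_len < dst_size ? field_len : dst_size;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n); /* bounded by the smaller of the two sizes */
    return n;
}
```

A caller would pass sizeof() of its own buffer, so a length byte read from the device that is larger than expected cannot push the copy past the end of the destination.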
Read buffers no longer expose the I2C address, and so we don't need to
offset by two when we get the read data.
Cc: Alex Deucher
Cc: Kent Russell
Cc: Andrey Grodzovsky
Fixes: bd607166af7fe3 ("drm/amdgpu: Enable reading FRU chip via I2C v3")
Signed-off-by: Luben Tuikov
---
drivers/gpu/drm/amd/
Buffer is abbreviated "buf", not "buff", which
means something entirely different.
Cc: Kent Russell
Cc: Alex Deucher
Signed-off-by: Luben Tuikov
---
.../gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c| 22 +--
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/g
Noticed the below warning while running a pytorch workload on vega10
GPUs. Change to trylock to avoid conflicts with already held reservation
locks.
[ +0.03] WARNING: possible recursive locking detected
[ +0.03] 5.13.0-kfd-rajneesh #1030 Not tainted
[ +0.04]
The series is
Reviewed-by: Felix Kuehling
On 2022-02-03 04:08, Rajneesh Bhardwaj wrote:
V5: Proposed IOCTL APIs for CRIU with consolidated feedback
CRIU is a user space tool which is very popular for container live
migration in datacentres. It can checkpoint a running application, save
i
[AMD Official Use Only]
Thank you Felix for the review and your guidance.
-Original Message-
From: Kuehling, Felix
Sent: Thursday, February 3, 2022 10:22 PM
To: Bhardwaj, Rajneesh ;
amd-gfx@lists.freedesktop.org
Cc: Yat Sin, David ; Deucher, Alexander
; dri-de...@lists.freedesktop.org
Reordered the patches; fixed some bugs.
Luben Tuikov (3):
drm/amdgpu: Nerf "buff" to "buf"
drm/amdgpu: Don't offset by 2 in FRU EEPROM
drm/amdgpu: Prevent random memory access in FRU code
Cc: Alex Deucher
Cc: Kent Russell
Cc: Andrey Grodzovsky
.../gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er).
This is consistent with the rest of the kernel code.
Cc: Kent Russell
Cc: Alex Deucher
Signed-off-by: Luben Tuikov
---
.../gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c| 28 +--
1 file changed, 14 insertions(+), 14 delet
Prevent random memory access in the FRU EEPROM code by passing the size of
the destination buffer to the reading routine, and reading no more than the
size of the buffer.
Cc: Kent Russell
Cc: Alex Deucher
Signed-off-by: Luben Tuikov
---
.../gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c| 21 ++
Read buffers no longer expose the I2C address, and so we don't need to
offset by two when we get the read data.
Cc: Alex Deucher
Cc: Kent Russell
Cc: Andrey Grodzovsky
Fixes: bd607166af7fe3 ("drm/amdgpu: Enable reading FRU chip via I2C v3")
Signed-off-by: Luben Tuikov
---
drivers/gpu/drm/amd/
On 04.02.22 04:11, Rajneesh Bhardwaj wrote:
Noticed the below warning while running a pytorch workload on vega10
GPUs. Change to trylock to avoid conflicts with already held reservation
locks.
[ +0.03] WARNING: possible recursive locking detected
[ +0.03] 5.13.0-kfd-rajneesh #1030