[RFC 4/5] RDMA/mlx5: Add fallback for P2P DMA errors

2024-12-01 Thread Yonatan Maman
From: Yonatan Maman Handle P2P DMA mapping errors when the transaction requires traversing an inaccessible host bridge that is not in the allowlist: - In `populate_mtt`, if a P2P mapping fails, the `HMM_PFN_ALLOW_P2P` flag is cleared only for the PFNs that returned a mapping error. - In

[RFC 5/5] RDMA/mlx5: Enabling ATS for ODP memory

2024-12-01 Thread Yonatan Maman
From: Yonatan Maman ATS (Address Translation Services) is mainly utilized to optimize PCI Peer-to-Peer transfers and prevent bus failures. This change employs ATS for ODP memory to optimize P2P DMA for ODP memory (e.g., P2P DMA for private device pages - ODP memory). Signed-off-by: Yonatan

[RFC 2/5] nouveau/dmem: HMM P2P DMA for private dev pages

2024-12-01 Thread Yonatan Maman
From: Yonatan Maman Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications is crucial for minimizing data transfer overhead (e.g., for RDMA use case). This change aims to enable that capability for Nouveau over HMM device private pages. P2P DMA for private device pages allows

[RFC 3/5] IB/core: P2P DMA for device private pages

2024-12-01 Thread Yonatan Maman
From: Yonatan Maman Add Peer-to-Peer (P2P) DMA request for hmm_range_fault calling, utilizing capabilities introduced in mm/hmm. By setting range.default_flags to HMM_PFN_REQ_FAULT | HMM_PFN_REQ_TRY_P2P, HMM attempts to initiate P2P DMA connections for device private pages (instead of page fault

[RFC 0/5] GPU Direct RDMA (P2P DMA) for Device Private Pages

2024-12-01 Thread Yonatan Maman
From: Yonatan Maman Based on: Provide a new two step DMA mapping API patchset https://lore.kernel.org/kvm/20241114170247.ga5...@lst.de/T/#t This patch series aims to enable Peer-to-Peer (P2P) DMA access in GPU-centric applications that utilize RDMA and private device pages. This enhancement

[RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages

2024-12-01 Thread Yonatan Maman
From: Yonatan Maman hmm_range_fault() by default triggered a page fault on device private pages when the HMM_PFN_REQ_FAULT flag was set, migrating them to RAM. In some cases, such as with RDMA devices, the migration overhead between the device (e.g., GPU) and the CPU, and vice-versa, significantly

Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages

2024-10-20 Thread Yonatan Maman
On 18/10/2024 10:26, Zhu Yanjun wrote: On 2024/10/16 17:16, Yonatan Maman wrote: On 16/10/2024 7:23, Christoph Hellwig wrote: On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote: From: Yonatan Maman This patch series aims

Re: [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages

2024-10-16 Thread Yonatan Maman
On 16/10/2024 8:12, Alistair Popple wrote: Yonatan Maman writes: From: Yonatan Maman Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications is crucial for minimizing data transfer overhead (e.g., for RDMA use case). This change aims to enable that capability for

Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages

2024-10-16 Thread Yonatan Maman
On 16/10/2024 7:23, Christoph Hellwig wrote: On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote: From: Yonatan Maman This patch series aims to enable Peer-to-Peer (P2P) DMA access in GPU-centric applications that utilize RDMA and private device pages. This enhancement is crucial

Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages

2024-10-16 Thread Yonatan Maman
, 2024 at 06:23:45PM +0300, Yonatan Maman wrote: From: Yonatan Maman hmm_range_fault() natively triggers a page fault on device private pages, migrating them to RAM. That "natively" above doesn't make sense to me. What I meant to convey is that hmm_range_fault() by default triggered a

[PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages

2024-10-15 Thread Yonatan Maman
From: Yonatan Maman This patch series aims to enable Peer-to-Peer (P2P) DMA access in GPU-centric applications that utilize RDMA and private device pages. This enhancement is crucial for minimizing data transfer overhead by allowing the GPU to directly expose device private page data to devices

[PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages

2024-10-15 Thread Yonatan Maman
From: Yonatan Maman hmm_range_fault() natively triggers a page fault on device private pages, migrating them to RAM. In some cases, such as with RDMA devices, the migration overhead between the device (e.g., GPU) and the CPU, and vice-versa, significantly damages performance. Thus, enabling Peer

[PATCH v1 4/4] RDMA/mlx5: Enabling ATS for ODP memory

2024-10-15 Thread Yonatan Maman
From: Yonatan Maman ATS (Address Translation Services) is mainly utilized to optimize PCI Peer-to-Peer transfers and prevent bus failures. This change employs ATS for ODP memory to optimize P2P DMA for ODP memory (e.g., P2P DMA for private device pages - ODP memory). Signed-off-by: Yonatan

[PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages

2024-10-15 Thread Yonatan Maman
From: Yonatan Maman Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications is crucial for minimizing data transfer overhead (e.g., for RDMA use case). This change aims to enable that capability for Nouveau over HMM device private pages. P2P DMA for private device pages allows

[PATCH v1 3/4] IB/core: P2P DMA for device private pages

2024-10-15 Thread Yonatan Maman
From: Yonatan Maman Add Peer-to-Peer (P2P) DMA request for hmm_range_fault calling, utilizing capabilities introduced in mm/hmm. By setting range.default_flags to HMM_PFN_REQ_FAULT | HMM_PFN_REQ_TRY_P2P, HMM attempts to initiate P2P DMA connections for device private pages (instead of page fault

Re: [PATCH 1/2] nouveau/dmem: Fix privileged error in copy engine channel

2024-10-08 Thread Yonatan Maman
On 30/09/2024 14:09, Danilo Krummrich wrote: Hi Yonatan, On Mon, Sep 23, 2024 at 01:54:56PM +, Yonatan Maman wrote: When `nouveau_dmem_copy_one` is called, the following error occurs: [272146.675156] nouveau :06:00.0: fifo

[PATCH v2 2/2] nouveau/dmem: Fix memory leak in `migrate_to_ram` upon copy error

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman A copy push command might fail, causing `migrate_to_ram` to return a dirty HIGH_USER page to the user. This exposes a security vulnerability in the nouveau driver. To prevent memory leaks in `migrate_to_ram` upon a copy error, allocate a zero page for the destination page
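
The idea in the snippet above (hand back a zeroed destination page so that a failed copy cannot expose stale data) can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions, not the nouveau code: `device_copy`, `migrate_page`, and `SKETCH_PAGE_SIZE` are hypothetical names invented for illustration, and `calloc` stands in for whatever zeroed-page allocation the actual patch uses.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* hypothetical page size for this sketch */

/* Hypothetical stand-in for a device copy command that may fail.
 * On failure nothing is written to dst. */
static int device_copy(unsigned char *dst, const unsigned char *src,
                       int should_fail)
{
    if (should_fail)
        return -1;
    memcpy(dst, src, SKETCH_PAGE_SIZE);
    return 0;
}

/* Allocate the destination zero-filled before attempting the copy.
 * The caller then sees either the copied data or all zeros,
 * never leftover contents from a previous owner of the memory.
 * *err receives 0 on success and a negative value on failure. */
static unsigned char *migrate_page(const unsigned char *src,
                                   int should_fail, int *err)
{
    unsigned char *dst = calloc(1, SKETCH_PAGE_SIZE);

    if (!dst) {
        *err = -1;
        return NULL;
    }
    *err = device_copy(dst, src, should_fail);
    return dst;
}
```

With this pattern, a caller that ignores or mishandles the error still receives a page of zeros rather than another process's old data, which is the property the patch description is concerned with.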

[PATCH v3 0/2] drm/nouveau/dmem: Fix Vulnerability and Device Channels configuration

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman This patch series addresses two critical issues in the Nouveau driver related to device channels, error handling, and sensitive data leaks. - Vulnerability in migrate_to_ram: The migrate_to_ram function might return a dirty HIGH_USER page when a copy push command (FW

Re: [PATCH 2/2] nouveau/dmem: Fix memory leak in `migrate_to_ram` upon copy error

2024-10-08 Thread Yonatan Maman
On 30/09/2024 14:20, Danilo Krummrich wrote: On Mon, Sep 23, 2024 at 01:54:58PM +, Yonatan Maman wrote: A copy push command might fail, causing `migrate_to_ram` to return a dirty HIGH_USER page to the user. This exposes a

[PATCH v4 2/2] nouveau/dmem: Fix vulnerability in migrate_to_ram upon copy error

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman The `nouveau_dmem_copy_one` function ensures that the copy push command is sent to the device firmware but does not track whether it was executed successfully. In the case of a copy error (e.g., firmware or hardware failure), the copy push command will be sent via the

[PATCH v3 1/2] nouveau/dmem: Fix privileged error in copy engine channel

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman When `nouveau_dmem_copy_one` is called, the following error occurs: [272146.675156] nouveau :06:00.0: fifo: PBDMA9: 0004 [HCE_PRIV] ch 1 0300 3386 This indicates that a copy push command triggered a Host Copy Engine Privileged error on channel 1 (Copy Engine

[PATCH v4 1/2] nouveau/dmem: Fix privileged error in copy engine channel

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman When `nouveau_dmem_copy_one` is called, the following error occurs: [272146.675156] nouveau :06:00.0: fifo: PBDMA9: 0004 [HCE_PRIV] ch 1 0300 3386 This indicates that a copy push command triggered a Host Copy Engine Privileged error on channel 1 (Copy Engine

[PATCH v3 2/2] nouveau/dmem: Fix vulnerability in migrate_to_ram upon copy error

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman The `nouveau_dmem_copy_one` function ensures that the copy push command is sent to the device firmware but does not track whether it was executed successfully. In the case of a copy error (e.g., firmware or hardware failure), the copy push command will be sent via the

[PATCH v4 0/2] drm/nouveau/dmem: Fix Vulnerability and Device Channels configuration

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman This patch series addresses two critical issues in the Nouveau driver related to device channels, error handling, and sensitive data leaks. - Vulnerability in migrate_to_ram: The migrate_to_ram function might return a dirty HIGH_USER page when a copy push command (FW

[no subject]

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman Date: Mon, 7 Oct 2024 14:48:26 +0300 Subject: [PATCH v2 0/2] drm/nouveau/dmem: Fix Memory Leaking and Device Channels configuration This patch series addresses two critical issues in the Nouveau driver related to device channels, error handling, and memory leaks. - Memory

[PATCH v2 1/2] nouveau/dmem: Fix privileged error in copy engine channel

2024-10-08 Thread Yonatan Maman
From: Yonatan Maman When `nouveau_dmem_copy_one` is called, the following error occurs: [272146.675156] nouveau :06:00.0: fifo: PBDMA9: 0004 [HCE_PRIV] ch 1 0300 3386 This indicates that a copy push command triggered a Host Copy Engine Privileged error on channel 1 (Copy Engine