On Fri, Oct 22, 2021 at 5:30 PM Elena Agostini <eagost...@nvidia.com> wrote:
>
> On Tue, Oct 19, 2021 at 21:36 Jerin Jacob <jerinjac...@gmail.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 12:38 AM Thomas Monjalon <tho...@monjalon.net>
> > wrote:
> > >
> > > 19/10/2021 20:14, jer...@marvell.com:
> > > > Definition of Dataplane Workload Accelerator
> > > > --------------------------------------------
> > > > A Dataplane Workload Accelerator (DWA) typically contains a set of CPUs,
> > > > network controllers, and programmable data acceleration engines for
> > > > packet processing, cryptography, regex engines, baseband processing, etc.
> > > > This allows a DWA to offload compute/packet-processing/baseband/
> > > > cryptography-related workloads from the host CPU to save cost and power,
> > > > and enables scaling the workload by adding DWAs to the host CPU as needed.
> > > >
> > > > Unlike other devices in DPDK, the DWA device is not fixed-function,
> > > > due to the fact that it has CPUs and programmable HW accelerators.
> > > > This enables the DWA personality/workload to be completely programmable.
> > > > Typical examples of DWA offloads are flow/session management,
> > > > virtual switch, TLS offload, IPsec offload, l3fwd offload, etc.
> > >
> > > If I understand well, the idea is to abstract the offload
> > > of some stack layers in the hardware.
> >
> > Yes. It may not be just HW: expressing the complicated workloads
> > may need CPUs and/or other HW accelerators.
> >
> > > I am not sure we should give an API for such stack layers in DPDK.
> >
> > Why not?
> >
> > > It looks to be the role of the dataplane application to finely manage
> > > how to use the hardware for a specific dataplane.
> >
> > It is possible with this scheme.
> >
> > > I believe the API for such a layer would be either too big, or too limited,
> > > or not optimized for specific needs.
> >
> > It will be optimized for specific needs, as applications ask for what to do,
> > not how to do it.
> >
> > > If we really want to automate or abstract the HW/SW co-design,
> > > I think we should better look at compiler work like P4 or PANDA.
> >
> > The compiler stuff is very static in nature. It can address the
> > packet-transformation workloads, not ones like IPsec or baseband offload.
> > Another way to look at it: the GPU RFC started just because you are not able
> > to express all the workloads in P4.
>
> That’s not the purpose of the GPU RFC.
> The gpudev library's goal is to enhance the dialog between GPU, CPU and NIC,
> offering the possibility to:
>
> - Have DPDK aware of non-CPU memory like device memory (e.g. similarly to
>   what happened with MPI)
> - Hide some memory-management, GPU-library-specific implementation details
> - Reduce the gap between network activity and device activity (e.g.
>   receive/send packets directly using the device memory)
> - Reduce the gap between CPU activity and application-defined GPU workload
> - Open to the capability to interact with the GPU device, not managing it
Agree. I am not advocating P4 as a replacement for gpudev or DWA. If someone
thinks it is possible, it would be great to see how to express a complex
workload like TLS offload or an ORAN 7.2 split high-PHY baseband offload in it.

Could you give more details on "Open to the capability to interact with the
GPU device, not managing it"? What do you mean by managing it, and what is
this RFC doing to manage it?

> The gpudev library can be easily embedded in any GPU-specific application
> with relatively small effort. The application can allocate, communicate and
> manage the memory with the device transparently through DPDK.

See below.

> What you are providing here is different and out of the scope of the gpudev
> library: control and manage the workload submission of possibly any
> accelerator device, hiding a lot of implementation details within DPDK.

No. It has both a control plane and a user plane, which also allows an
implementation to allocate, communicate and manage the memory with the device
transparently through DPDK using user actions. TLV messages can be at any
level: we can define profiles at a low or a high level based on what feature
we need to offload, or chain multiple small profiles to create a complex
workload (see sketch 1 at the end of this mail).

> A wrapper for accelerator-device-specific libraries, and I think that it’s
> too far to be realistic. As a GPU user, I don’t want to delegate my tasks
> to DWA because it can’t be fully optimized, updated to the latest
> GPU-specific feature, etc.

DWA is the GPU. Tasks are expressed in a generic representation, so they can
be optimized for a GPU/DPU/IPU based on the accelerator specifics.

> Additionally, a generic DWA won't work for a GPU:
>
> - Memory copies from DWA to CPU / CPU to DWA are latency-expensive. Packets
>   can be received directly in device memory.

No copy is involved. The host port is just an abstract model; you can use
plain shared memory underneath (see sketch 2 at the end of this mail). Also,
if you look at the RFC, we can add new host ports that are specific to the
transport category (Ethernet, PCIe, shared memory).

> - When launching multiple processing blocks, efficiency may be compromised.

However you avoid that with gpudev, the same logic can be moved into the
driver implementation. Right?

> I don’t actually see a real comparison between gpudev and DWA.
> If in the future we’ll expose some GPU workload through the gpudev library,
> it will be for some network-specific and well-defined problems.

How do you want to represent the "network-specific" and "well-defined"
problems from the application's PoV? The problem I am trying to address is
this: if every vendor expresses the workload in an accelerator-specific
fashion, then we need N libraries and N application code paths to solve a
single problem. I have provided an example for L3FWD; it would be good to
know how it cannot map to a GPU. That level of in-depth discussion will give
more ideas than staying at an abstract level. Or you can take up a workload
that can NOT be expressed with the DWA RFC; that would help to understand
the gap.

I think the Technical Board / DPDK community needs to decide the direction
on the following questions:

1) Agree/disagree on the need for workload offload accelerators in DPDK.
2) Do we need to expose accelerator-specific workload libraries (i.e.
separate libraries for GPU, DPU, etc.) and let the _DPDK_ application deal
with accelerator-specific APIs for the workload? If the majority thinks yes,
we can have a dpudev library in addition to gpudev; basically, that removes
the profile concept from this RFC.
3) Allow both accelerator-specific libraries and the DWA kind of model, and
let the application pick the model it wants.
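
To make "what to do, not how to do" concrete, sketch 1 below shows one L3FWD
rule expressed as a profile-specific TLV. To be clear, every name in it
(dwa_tlv, DWA_TLV_L3FWD_ADD_ROUTE, dwa_ctrl_send(), ...) is a hypothetical
placeholder, not the RFC API; it only illustrates the shape of the control
path.

/* Sketch 1: all names here are hypothetical placeholders, not the RFC API. */
#include <stdint.h>

/* Generic TLV control message header; 'len' bytes of profile-specific
 * payload follow it on the control channel. */
struct dwa_tlv {
        uint32_t type;  /* e.g. DWA_TLV_L3FWD_ADD_ROUTE */
        uint32_t len;   /* payload length in bytes */
};

enum { DWA_TLV_L3FWD_ADD_ROUTE = 100 };

/* Payload of the hypothetical l3fwd profile's "add route" message. */
struct dwa_l3fwd_route {
        uint32_t prefix;     /* IPv4 prefix, network byte order */
        uint8_t  prefix_len; /* 0..32 */
        uint16_t out_port;   /* accelerator-side egress port */
};

/* Stand-in for the control-plane send; the transport underneath could be
 * shared memory, PCIe or Ethernet, the application does not care. */
int dwa_ctrl_send(void *dev, const struct dwa_tlv *hdr, const void *payload);

static int
l3fwd_offload_add_route(void *dev, uint32_t prefix, uint8_t plen,
                        uint16_t port)
{
        struct dwa_l3fwd_route r = {
                .prefix = prefix, .prefix_len = plen, .out_port = port,
        };
        struct dwa_tlv tlv = {
                .type = DWA_TLV_L3FWD_ADD_ROUTE, .len = sizeof(r),
        };

        /* The application states intent only. A GPU driver may turn this
         * into a lookup-table update for a persistent kernel; a DPU driver
         * may program a match-action table. The call site is identical. */
        return dwa_ctrl_send(dev, &tlv, &r);
}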
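
And sketch 2, for the host port point above. Again, the names are made up,
not the RFC API; the point is that enqueue/dequeue is a contract, and
whether a memory copy happens is a property of the transport backing the
port, not of the model.

/* Sketch 2: hypothetical host port model, not the RFC API. */
#include <stdint.h>

/* The transport categories mentioned above. */
enum host_port_kind {
        HOST_PORT_SHMEM,    /* shared memory: only descriptors move, the
                             * payload stays in place (it may even be
                             * device memory, as in the gpudev case) */
        HOST_PORT_PCIE,     /* PCIe DMA to a discrete DWA */
        HOST_PORT_ETHERNET, /* wire transport to a remote DWA */
};

/* One contract for all transports. */
struct host_port_ops {
        uint16_t (*enqueue)(void *port, void **bufs, uint16_t nb);
        uint16_t (*dequeue)(void *port, void **bufs, uint16_t nb);
};

/* The application polls the port the same way whatever the backend is. */
static inline uint16_t
host_port_rx_burst(const struct host_port_ops *ops, void *port,
                   void **bufs, uint16_t nb)
{
        return ops->dequeue(port, bufs, nb);
}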