> -----Original Message-----
> From: Jerin Jacob <jerinjac...@gmail.com>
> Sent: Friday, August 27, 2021 20:19
> To: Thomas Monjalon <tho...@monjalon.net>
> Cc: Jerin Jacob <jer...@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen Hemminger <step...@networkplumber.org>; David Marchand <david.march...@redhat.com>; Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>; Wang, Haiyue <haiyue.w...@intel.com>; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Yigit, Ferruh <ferruh.yi...@intel.com>; techbo...@dpdk.org; Elena Agostini <eagost...@nvidia.com>
> Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library
>
> On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon <tho...@monjalon.net> wrote:
> >
> > 31/07/2021 15:42, Jerin Jacob:
> > > On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon <tho...@monjalon.net> wrote:
> > > > 31/07/2021 09:06, Jerin Jacob:
> > > > > On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon <tho...@monjalon.net> wrote:
> > > > > > From: Elena Agostini <eagost...@nvidia.com>
> > > > > >
> > > > > > In a heterogeneous computing system, processing is not only in the CPU.
> > > > > > Some tasks can be delegated to devices working in parallel.
> > > > > >
> > > > > > The goal of this new library is to enhance the collaboration between
> > > > > > DPDK, which is primarily a CPU framework, and other types of devices
> > > > > > like GPUs.
> > > > > >
> > > > > > When mixing network activity with task processing on a non-CPU device,
> > > > > > there may be a need for communication between the CPU and the device
> > > > > > in order to manage memory, synchronize operations, exchange info, etc.
> > > > > >
> > > > > > This library provides a number of new features:
> > > > > > - Interoperability with device-specific libraries through generic handlers
> > > > > > - Possibility to allocate and free memory on the device
> > > > > > - Possibility to allocate and free memory on the CPU but visible from the device
> > > > > > - Communication functions to enhance the dialog between the CPU and the device
> > > > > >
> > > > > > The infrastructure is prepared to welcome drivers in drivers/hc/,
> > > > > > such as the upcoming NVIDIA one, implementing the hcdev API.
> > > > > >
> > > > > > Some parts are not complete:
> > > > > > - locks
> > > > > > - memory allocation table
> > > > > > - memory freeing
> > > > > > - guide documentation
> > > > > > - integration in devtools/check-doc-vs-code.sh
> > > > > > - unit tests
> > > > > > - integration in testpmd to enable Rx/Tx to/from GPU memory.
> > > > >
> > > > > Since the above line is the crux of the following text, I will start
> > > > > from this point.
> > > > >
> > > > > + Techboard
> > > > >
> > > > > I can give my honest feedback on this.
> > > > >
> > > > > I can map similar stuff in Marvell HW, where we do machine learning
> > > > > as a compute offload on a different class of CPU.
> > > > >
> > > > > In terms of RFC patch features:
> > > > >
> > > > > 1) memory API - use cases are aligned
> > > > > 2) communication flag and communication list -
> > > > > our structure is completely different; we use a HW-ring kind of
> > > > > interface to post the job to the compute interface, and
> > > > > the job completion result arrives through the event device.
> > > > > Kind of similar to the DMA API that has been discussed on the mailing list.
> > > >
> > > > Interesting.
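To make the contrast above concrete, here is an illustrative sketch of the ring + eventdev model Jerin describes, using existing DPDK rawdev/eventdev calls. The struct ml_job layout and the device/port IDs are made-up placeholders, not a real driver contract; only the rawdev/eventdev calls themselves are existing APIs.

/*
 * Illustrative sketch only: post device-specific jobs through a rawdev
 * ring and collect completions via the event device.
 */
#include <rte_rawdev.h>
#include <rte_eventdev.h>

struct ml_job {                 /* job metadata, specific to the device */
	void *input;
	void *output;
	uint32_t model_id;
};

static int
submit_and_wait(uint16_t raw_dev, uint8_t ev_dev, uint8_t ev_port,
		struct rte_rawdev_buf **jobs, int nb_jobs)
{
	struct rte_event ev;

	/* Post the jobs (each buffer wraps a struct ml_job) to the
	 * compute device through its HW ring. */
	if (rte_rawdev_enqueue_buffers(raw_dev, jobs, nb_jobs, NULL) < nb_jobs)
		return -1;

	/* The completion result arrives asynchronously on the event device. */
	while (rte_event_dequeue_burst(ev_dev, ev_port, &ev, 1, 0) == 0)
		; /* poll; a real application would do other work here */

	return 0;
}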
> > >
> > > It is hard to generalize the communication mechanism.
> > > Do other GPU vendors have a similar communication mechanism? AMD, Intel?
> >
> > I don't know who to ask in AMD & Intel. Any ideas?
>
> Good question.
>
> At least in Marvell HW, our structure for the communication flag and
> communication list is completely different: we use a HW-ring kind of
> interface to post the job to the compute interface, and the job
> completion result arrives through the event device.
> Kind of similar to the DMA API that has been discussed on the mailing list.
>
> > > > > Now the bigger question is why we need to Tx and then Rx something
> > > > > to the compute device.
> > > > > Isn't it offloading something? If so, why not add those offloads in
> > > > > the respective subsystems to improve the subsystem (ethdev,
> > > > > cryptodev, etc.) feature set to adopt new features, or introduce a
> > > > > new subsystem (like ML, inline baseband processing) so that it will
> > > > > be an opportunity to implement the same in HW or a compute device.
> > > > > For example, if we take this path, ML offloading will be
> > > > > application code like testpmd, which deals with "specific" device
> > > > > commands (aka a glorified rawdev) to deal with specific computing
> > > > > device offload "COMMANDS"
> > > > > (the commands will be specific to the offload device; the same code
> > > > > won't run on another compute device).
> > > >
> > > > Having specific feature APIs is convenient for compatibility
> > > > between devices, yes, for the set of defined features.
> > > > Our approach is to start with a flexible API that the application
> > > > can use to implement any processing, because with GPU programming
> > > > there is no restriction on what can be achieved.
> > > > This approach does not contradict what you propose;
> > > > it does not prevent extending existing classes.
> > >
> > > It does prevent extending the existing classes, as no one is going to
> > > extend them if there is a path that avoids doing so.
> >
> > I disagree. A specific API is more convenient for some tasks,
> > so there is an incentive to define or extend specific device class APIs.
> > But it should not forbid doing custom processing.
>
> This is the same as the raw device in DPDK, where the device
> personality is not defined.
>
> Even if we define another API, if the personality is not defined
> it ends up similar to the raw device, i.e. rawdev enqueue and dequeue.
>
> To summarize,
>
> 1) My _personal_ preference is to have specific subsystems
> to improve DPDK instead of the raw device kind of path.
Something like rte_memdev to focus on device (GPU) memory management (see the sketch at the end of this mail)? The new DPDK auxiliary bus may make life easier for solving the complex heterogeneous computing library problem. ;-)

> 2) If the device personality is not defined, use rawdev.
> 3) Not all computing devices use a "communication flag" and
> "communication list" kind of structure. If we are targeting a generic
> computing device, then that is not a portable scheme.
> If, for GPU abstraction, "communication flag" and "communication list"
> are the right kind of mechanism, then we can have a separate library
> specific to GPU <-> DPDK communication needs, explicitly for GPUs.
>
> I think generic DPDK applications like testpmd should not
> be polluted with device-specific functions, i.e. calling device-specific
> messages from the application, which makes the application run on only
> one device. I don't have a strong opinion (except on standardizing
> "communication flag" and "communication list" as a generic computing
> device communication mechanism) if others think it is OK to do it that
> way in DPDK.
>
> > >
> > > If an application can run only on a specific device, it is similar to
> > > a raw device, where the device semantics are not defined (i.e. job
> > > metadata is not defined and is specific to the device).
> > >
> > > > > Just my _personal_ preference is to have specific subsystems to
> > > > > improve DPDK instead of the raw device kind of path. If we decide
> > > > > another path as a community, it is _fine_ too (from a _project
> > > > > manager_ point of view it will be an easy path to dump SDK stuff
> > > > > into DPDK without taking the pain of defining a subsystem or
> > > > > improving DPDK).
> > > >
> > > > Adding a new class API is also improving DPDK.
> > >
> > > But this class is similar to the rawdev class. The reason I say so:
> > > job submission and response can be abstracted as enqueue/dequeue APIs,
> > > while task/job metadata is specific to compute devices (and cannot be
> > > generalized).
> > > If we can generalize it, it makes sense to have a new class that does
> > > a "specific function".
> >
> > Computing device programming is already generalized with languages
> > like OpenCL. We should not try to reinvent the same.
> > We are just trying to properly integrate the concept in DPDK
> > and allow building on top of it.
>
> See above.
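As promised above, a minimal sketch of what such an rte_memdev-style API might look like, covering the two allocation cases this thread agrees on. None of the rte_memdev_* functions below exist in DPDK today; the names and signatures are invented purely for illustration.

/*
 * Hypothetical sketch only: an "rte_memdev"-style memory API.
 * The prototypes are invented for illustration, not a real DPDK API.
 */
#include <stddef.h>
#include <stdint.h>

void *rte_memdev_alloc(uint16_t dev_id, size_t len);         /* hypothetical */
void *rte_memdev_alloc_visible(uint16_t dev_id, size_t len); /* hypothetical */
void  rte_memdev_free(uint16_t dev_id, void *ptr);           /* hypothetical */

static int
setup_device_buffers(uint16_t dev_id, size_t len)
{
	/* Case 1: memory resident on the device, for device kernels only. */
	void *dmem = rte_memdev_alloc(dev_id, len);
	if (dmem == NULL)
		return -1;

	/* Case 2: CPU memory pinned/mapped so the device can also see it,
	 * e.g. mbuf payloads to be read by device code. */
	void *cmem = rte_memdev_alloc_visible(dev_id, len);
	if (cmem == NULL) {
		rte_memdev_free(dev_id, dmem);
		return -1;
	}

	/* ... launch device work referencing dmem/cmem ... */

	rte_memdev_free(dev_id, cmem);
	rte_memdev_free(dev_id, dmem);
	return 0;
}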