On Wed, Jun 23, 2021 at 3:07 PM Bruce Richardson <bruce.richard...@intel.com> wrote:
>
> On Wed, Jun 23, 2021 at 12:51:07PM +0530, Jerin Jacob wrote:
> > On Wed, Jun 23, 2021 at 9:00 AM fengchengwen <fengcheng...@huawei.com>
> > wrote:
> > >
> > >
> > > Currently, it is hard to define a generic DMA descriptor; I think the
> > > well-defined APIs are feasible.
> >
> > I would like to understand why it is not feasible if we move the
> > preparation to the slow path.
> >
> > i.e.
> >
> > struct rte_dmadev_desc defines all the "attributes" of all DMA devices
> > available using capability. I believe with this scheme, we can scale and
> > incorporate all features of all DMA HW without any performance impact.
> >
> > something like:
> >
> > struct rte_dmadev_desc {
> >         /* Attributes of all DMA transfers available for all HW, under capability. */
> >         channel or port;
> >         ops; // copy, fill etc.
> >         /* Implementation-opaque memory as a zero-length array;
> >          * rte_dmadev_desc_prep() updates this memory with HW-specific
> >          * information.
> >          */
> >         uint8_t impl_opq[];
> > };
> >
> > // Allocate the memory for a DMA descriptor
> > struct rte_dmadev_desc *rte_dmadev_desc_alloc(devid);
> > // Convert DPDK-specific descriptors to HW-specific descriptors in slowpath
> > rte_dmadev_desc_prep(devid, struct rte_dmadev_desc *desc);
> > // Free DMA descriptor memory
> > rte_dmadev_desc_free(devid, struct rte_dmadev_desc *desc);
> >
> > The above calls are in the slow path.
> >
> > Only the call below is in the fastpath.
> > // Here desc can be NULL (in case you don't need any specific attribute
> > // attached to the transfer); if needed, it can be an object which has
> > // gone through rte_dmadev_desc_prep()
> > rte_dmadev_enq(devid, struct rte_dmadev_desc *desc, void *src,
> >                void *dest, unsigned int len, cookie);
> >
>
> The trouble here is the performance penalty due to building up and tearing
> down structures and passing those structures into functions via function
> pointer. With the APIs for enqueue/dequeue that have been discussed here,
> all parameters will be passed in registers, and then each driver can do a
> write of the actual hardware descriptor straight to cache/memory from
> registers.
> With the scheme you propose above, the register contains a
> pointer to the data which must then be loaded into the CPU before being
> written out again. This increases our offload cost.

See below.

>
> However, assuming that the desc_prep call is just for slowpath or
> initialization time, I'd be ok to have the functions take an extra
> hw-specific parameter for each call prepared with tx_prep. It would still
> allow all other parameters to be passed in registers. How much data are you
> looking to store in this desc struct? It can't all be represented as flags,
> for example?

There are around 128 bits of metadata for octeontx2. New HW may have
completely different metadata.

http://code.dpdk.org/dpdk/v21.05/source/drivers/raw/octeontx2_dma/otx2_dpi_rawdev.h#L149

I see the following issues with the flags scheme:
- The flags would need to be populated in the fastpath. Since they depend on
  capability, the application needs different versions of its fastpath code.
- It is not future proof: it is not easy to add more fields as needed when
  new HW comes with new transfer attributes.

>
> As for the individual APIs, we could do a generic "enqueue" API, which
> takes the op as a parameter, I prefer having each operation as a separate
> function, in order to increase the readability of the code and to reduce

The only issue I see is that every application needs two paths for doing the
same thing, one with _prep() and one with a separate function(), and drivers
need to support both.

> the number of parameters needed per function i.e. thereby saving registers
> needing to be used and potentially making the function calls and offload

My worry is that struct rte_dmadev can hold at most 8 fastpath function
pointers in a 64B cache line. When you say a new op, say fill, needs a new
function, what will change from the HW driver's point of view? Is it only
updating the HW descriptor with op as _fill_ vs _copy_, or something beyond
that? If it is only about the HW descriptor update, then _prep() can do all
the work, and the driver just needs to copy the desc to HW.
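A minimal C sketch of the prepare-once, copy-in-fastpath idea, assuming a made-up 128-bit metadata layout. Every `hyp_*` name, the ring structure, and the bit packing are hypothetical illustrations, not DPDK API and not the octeontx2 descriptor format. The slowpath `hyp_desc_prep()` translates generic attributes once; the fastpath `hyp_enq()` takes six arguments (mirroring the proposed `rte_dmadev_enq()`) and only copies the prepared metadata into a ring slot. A NULL desc selects default attributes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical op codes; not part of any DPDK API. */
enum hyp_dma_op { HYP_DMA_COPY = 1, HYP_DMA_FILL = 2 };

/* Generic, HW-independent attributes chosen by the application. */
struct hyp_dma_attr {
    uint16_t port;
    enum hyp_dma_op op;
};

/* Prepared descriptor: 128 bits of driver-specific metadata. */
struct hyp_dma_desc {
    uint64_t hw[2];
};

/* One entry of a hypothetical hardware ring. */
struct hyp_hw_entry {
    uint64_t meta[2];
    uint64_t src;
    uint64_t dst;
    uint64_t cookie;
    uint32_t len;
};

#define HYP_RING_SIZE 16u

struct hyp_ring {
    struct hyp_hw_entry slot[HYP_RING_SIZE];
    uint32_t count;
};

/* Slowpath: translate generic attributes into the (made-up) HW layout once. */
static void hyp_desc_prep(const struct hyp_dma_attr *attr,
                          struct hyp_dma_desc *desc)
{
    assert(attr != NULL && desc != NULL);
    desc->hw[0] = ((uint64_t)attr->op << 56) | attr->port;
    desc->hw[1] = 0;
}

/* Fastpath: six arguments, so all of them fit in registers on x86-64
 * (System V ABI) and arm64; the metadata is copied, not recomputed. */
static int hyp_enq(struct hyp_ring *r, const struct hyp_dma_desc *desc,
                   const void *src, void *dst, uint32_t len, uint64_t cookie)
{
    if (r->count == HYP_RING_SIZE)
        return -1; /* ring full */

    struct hyp_hw_entry *e = &r->slot[r->count++];
    if (desc != NULL)
        memcpy(e->meta, desc->hw, sizeof(e->meta)); /* 128-bit copy */
    else
        memset(e->meta, 0, sizeof(e->meta));        /* default attributes */
    e->src = (uint64_t)(uintptr_t)src;
    e->dst = (uint64_t)(uintptr_t)dst;
    e->len = len;
    e->cookie = cookie;
    return 0;
}
```

Whether the extra load through `desc` costs more than passing flags directly in registers, the concern Bruce raises, depends on the hardware and is not settled by this sketch.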
I believe up to 6 arguments are passed in registers on x86-64 (it is 8 on
arm64). If so, the desc pointer (already populated in HW descriptor format by
_prep()) is in a register, and enq() in the driver would be a simple
64-bit/128-bit copy from the desc pointer to HW memory. I don't see any
overhead in that. On the other hand, if we keep adding arguments, they will
spill over to the stack.

> cost cheaper. Perhaps we can have the "common" ops such as copy, fill, have
> their own functions, and have a generic "enqueue" function for the
> less-commonly used or supported ops?
>
> /Bruce
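A hypothetical sketch of the hybrid Bruce suggests: dedicated fastpath functions for common ops (copy, fill) plus one generic `enqueue` that takes the op as a parameter, laid out as a table of eight function pointers so that, on a target with 8-byte pointers, it fills exactly one 64B cache line (the limit Jerin raises). All `hyp_*`/`sw_*` names are invented, and the "driver" simply performs the operation in software instead of writing a HW descriptor.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical op codes accepted by the generic enqueue path. */
enum hyp_op { HYP_OP_COPY, HYP_OP_FILL, HYP_OP_XOR };

/* Hypothetical fastpath table: 8 pointer-sized slots (64B with 8B pointers). */
struct hyp_fp_ops {
    int (*copy)(void *dst, const void *src, uint32_t len);
    int (*fill)(void *dst, uint64_t pattern, uint32_t len);
    int (*enqueue)(enum hyp_op op, void *dst, const void *src,
                   uint64_t pattern, uint32_t len);
    void (*reserved[5])(void); /* room left for future fastpath calls */
};

/* Fails to compile if the table outgrows one 64-byte cache line. */
_Static_assert(sizeof(struct hyp_fp_ops) <= 64, "fastpath ops exceed 64B");

/* Dedicated copy: only 3 arguments. */
static int sw_copy(void *dst, const void *src, uint32_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Dedicated fill: repeats the 8-byte pattern, least significant byte first. */
static int sw_fill(void *dst, uint64_t pattern, uint32_t len)
{
    uint8_t *d = dst;
    for (uint32_t i = 0; i < len; i++)
        d[i] = (uint8_t)(pattern >> (8 * (i % 8)));
    return 0;
}

/* Generic path: needs the op plus the union of all per-op parameters. */
static int sw_enqueue(enum hyp_op op, void *dst, const void *src,
                      uint64_t pattern, uint32_t len)
{
    switch (op) {
    case HYP_OP_COPY:
        return sw_copy(dst, src, len);
    case HYP_OP_FILL:
        return sw_fill(dst, pattern, len);
    default:
        return -1; /* op not supported by this (software) device */
    }
}

static const struct hyp_fp_ops sw_ops = {
    .copy = sw_copy,
    .fill = sw_fill,
    .enqueue = sw_enqueue,
};
```

The trade-off the thread debates shows up in the signatures: the dedicated calls need three arguments each, while the generic one needs five and grows with every new op-specific parameter.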