On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote: > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng <fengcheng...@huawei.com> wrote: > > > > This patch introduces 'dmadevice' which is a generic type of DMA > > device. > > > > The APIs of dmadev library exposes some generic operations which can > > enable configuration and I/O with the DMA devices. > > > > Signed-off-by: Chengwen Feng <fengcheng...@huawei.com> > > Thanks for v1. > > I would suggest finalizing lib/dmadev/rte_dmadev.h before doing the > implementation so that you don't need > to waste time on rewoking the implementation. >
I actually like having the .c file available too. Before we lock down the .h file and the API, I want to verify the performance of our drivers with the implementation, and having a working .c file is obviously necessary for that. So I appreciate having it as part of the RFC. > Comments inline. > > > --- <snip> > > + * > > + * The DMA framework is built on the following abstraction model: > > + * > > + * ------------ ------------ > > + * |virt-queue| |virt-queue| > > + * ------------ ------------ > > + * \ / > > + * \ / > > + * \ / > > + * ------------ ------------ > > + * | HW-queue | | HW-queue | > > + * ------------ ------------ > > + * \ / > > + * \ / > > + * \ / > > + * ---------- > > + * | dmadev | > > + * ---------- > > Continuing the discussion with @Morten Brørup , I think, we need to > finalize the model. > +1 and the terminology with regards to queues and channels. With our ioat hardware, each HW queue was called a channel for instance. > > + * a) The DMA operation request must be submitted to the virt queue, virt > > + * queues must be created based on HW queues, the DMA device could > > have > > + * multiple HW queues. > > + * b) The virt queues on the same HW-queue could represent different > > contexts, > > + * e.g. user could create virt-queue-0 on HW-queue-0 for mem-to-mem > > + * transfer scenario, and create virt-queue-1 on the same HW-queue for > > + * mem-to-dev transfer scenario. > > + * NOTE: user could also create multiple virt queues for mem-to-mem > > transfer > > + * scenario as long as the corresponding driver supports. > > + * > > + * The control plane APIs include > > configure/queue_setup/queue_release/start/ > > + * stop/reset/close, in order to start device work, the call sequence must > > be > > + * as follows: > > + * - rte_dmadev_configure() > > + * - rte_dmadev_queue_setup() > > + * - rte_dmadev_start() > > Please add reconfigure behaviour etc, Please check the > lib/regexdev/rte_regexdev.h > introduction. I have added similar ones so you could reuse as much as > possible. > > > > + * The dataplane APIs include two parts: > > + * a) The first part is the submission of operation requests: > > + * - rte_dmadev_copy() > > + * - rte_dmadev_copy_sg() - scatter-gather form of copy > > + * - rte_dmadev_fill() > > + * - rte_dmadev_fill_sg() - scatter-gather form of fill > > + * - rte_dmadev_fence() - add a fence force ordering between > > operations > > + * - rte_dmadev_perform() - issue doorbell to hardware > > + * These APIs could work with different virt queues which have > > different > > + * contexts. > > + * The first four APIs are used to submit the operation request to > > the virt > > + * queue, if the submission is successful, a cookie (as type > > + * 'dma_cookie_t') is returned, otherwise a negative number is > > returned. > > + * b) The second part is to obtain the result of requests: > > + * - rte_dmadev_completed() > > + * - return the number of operation requests completed > > successfully. > > + * - rte_dmadev_completed_fails() > > + * - return the number of operation requests failed to complete. > > + * > > + * The misc APIs include info_get/queue_info_get/stats/xstats/selftest, > > provide > > + * information query and self-test capabilities. > > + * > > + * About the dataplane APIs MT-safe, there are two dimensions: > > + * a) For one virt queue, the submit/completion API could be MT-safe, > > + * e.g. one thread do submit operation, another thread do completion > > + * operation. > > + * If driver support it, then declare RTE_DMA_DEV_CAPA_MT_VQ. > > + * If driver don't support it, it's up to the application to guarantee > > + * MT-safe. > > + * b) For multiple virt queues on the same HW queue, e.g. one thread do > > + * operation on virt-queue-0, another thread do operation on > > virt-queue-1. > > + * If driver support it, then declare RTE_DMA_DEV_CAPA_MT_MVQ. > > + * If driver don't support it, it's up to the application to guarantee > > + * MT-safe. > > From an application PoV it may not be good to write portable > applications. Please check > latest thread with @Morten Brørup > > > + */ > > + > > +#ifdef __cplusplus > > +extern "C" { > > +#endif > > + > > +#include <rte_common.h> > > +#include <rte_memory.h> > > +#include <rte_errno.h> > > +#include <rte_compat.h> > > Sort in alphabetical order. > > > + > > +/** > > + * dma_cookie_t - an opaque DMA cookie > > Since we are defining the behaviour is not opaque any more. > I think, it is better to call ring_idx or so. > +1 for ring index. We don't need a separate type for it though, just document the index as an unsigned return value. > > > +#define RTE_DMA_DEV_CAPA_MT_MVQ (1ull << 11) /**< Support MT-safe of > > multiple virt queues */ > > Please lot of @see for all symbols where it is being used. So that one > can understand the full scope of > symbols. See below example. > > #define RTE_REGEXDEV_CAPA_RUNTIME_COMPILATION_F (1ULL << 0) > /**< RegEx device does support compiling the rules at runtime unlike > * loading only the pre-built rule database using > * struct rte_regexdev_config::rule_db in rte_regexdev_configure() > * > * @see struct rte_regexdev_config::rule_db, rte_regexdev_configure() > * @see struct rte_regexdev_info::regexdev_capa > */ > > > + * > > + * If dma_cookie_t is >=0 it's a DMA operation request cookie, <0 it's a > > error > > + * code. > > + * When using cookies, comply with the following rules: > > + * a) Cookies for each virtual queue are independent. > > + * b) For a virt queue, the cookie are monotonically incremented, when it > > reach > > + * the INT_MAX, it wraps back to zero. I disagree with the INT_MAX (or INT32_MAX) value here. If we use that value, it means that we cannot use implicit wrap-around inside the CPU and have to check for the INT_MAX value. Better to: 1. Specify that it wraps at UINT16_MAX which allows us to just use a uint16_t internally and wrap-around automatically, or: 2. Specify that it wraps at a power-of-2 value >= UINT16_MAX, giving drivers the flexibility at what value to wrap around. > > + * c) The initial cookie of a virt queue is zero, after the device is > > stopped or > > + * reset, the virt queue's cookie needs to be reset to zero. > > + * Example: > > + * step-1: start one dmadev > > + * step-2: enqueue a copy operation, the cookie return is 0 > > + * step-3: enqueue a copy operation again, the cookie return is 1 > > + * ... > > + * step-101: stop the dmadev > > + * step-102: start the dmadev > > + * step-103: enqueue a copy operation, the cookie return is 0 > > + * ... > > + */ > > Good explanation. > > > +typedef int32_t dma_cookie_t; > As I mentioned before, I'd just remove this, and use regular int types, with "ring_idx" as the name. > > > + > > +/** > > + * dma_scatterlist - can hold scatter DMA operation request > > + */ > > +struct dma_scatterlist { > > I prefer to change scatterlist -> sg > i.e rte_dma_sg > > > + void *src; > > + void *dst; > > + uint32_t length; > > +}; > > + > > > + > > +/** > > + * A structure used to retrieve the contextual information of > > + * an DMA device > > + */ > > +struct rte_dmadev_info { > > + /** > > + * Fields filled by framewok > > typo. > > > + */ > > + struct rte_device *device; /**< Generic Device information */ > > + const char *driver_name; /**< Device driver name */ > > + int socket_id; /**< Socket ID where memory is allocated */ > > + > > + /** > > + * Specification fields filled by driver > > + */ > > + uint64_t dev_capa; /**< Device capabilities (RTE_DMA_DEV_CAPA_) */ > > + uint16_t max_hw_queues; /**< Maximum number of HW queues. */ > > + uint16_t max_vqs_per_hw_queue; > > + /**< Maximum number of virt queues to allocate per HW queue */ > > + uint16_t max_desc; > > + /**< Maximum allowed number of virt queue descriptors */ > > + uint16_t min_desc; > > + /**< Minimum allowed number of virt queue descriptors */ > > Please add max_nb_segs. i.e maximum number of segments supported. > > > + > > + /** > > + * Status fields filled by driver > > + */ > > + uint16_t nb_hw_queues; /**< Number of HW queues configured */ > > + uint16_t nb_vqs; /**< Number of virt queues configured */ > > +}; > > + i > > + > > +/** > > + * dma_address_type > > + */ > > +enum dma_address_type { > > + DMA_ADDRESS_TYPE_IOVA, /**< Use IOVA as dma address */ > > + DMA_ADDRESS_TYPE_VA, /**< Use VA as dma address */ > > +}; > > + > > +/** > > + * A structure used to configure a DMA device. > > + */ > > +struct rte_dmadev_conf { > > + enum dma_address_type addr_type; /**< Address type to used */ > > I think, there are 3 kinds of limitations/capabilities. > > When the system is configured as IOVA as VA > 1) Device supports any VA address like memory from rte_malloc(), > rte_memzone(), malloc, stack memory > 2) Device support only VA address from rte_malloc(), rte_memzone() i.e > memory backed by hugepage and added to DMA map. > > When the system is configured as IOVA as PA > 1) Devices support only PA addresses . > > IMO, Above needs to be advertised as capability and application needs > to align with that > and I dont think application requests the driver to work in any of the modes. > > I don't think we need this level of detail for addressing capabilities. Unless I'm missing something, the hardware should behave exactly as other hardware does taking in iova's. If the user wants to check whether virtual addresses to pinned memory can be used directly, the user can call "rte_eal_iova_mode". We can't have a situation where some hardware uses one type of addresses and another hardware the other. Therefore, the only additional addressing capability we should need to report is that the hardware can use SVM/SVA and use virtual addresses not in hugepage memory. > > > + uint16_t nb_hw_queues; /**< Number of HW-queues enable to use */ > > + uint16_t max_vqs; /**< Maximum number of virt queues to use */ > > You need to what is max value allowed etc i.e it is based on > info_get() and mention the field > in info structure > > > > + > > +/** > > + * dma_transfer_direction > > + */ > > +enum dma_transfer_direction { > > rte_dma_transter_direction > > > + DMA_MEM_TO_MEM, > > + DMA_MEM_TO_DEV, > > + DMA_DEV_TO_MEM, > > + DMA_DEV_TO_DEV, > > +}; > > + > > +/** > > + * A structure used to configure a DMA virt queue. > > + */ > > +struct rte_dmadev_queue_conf { > > + enum dma_transfer_direction direction; > > > > + /**< Associated transfer direction */ > > + uint16_t hw_queue_id; /**< The HW queue on which to create virt > > queue */ > > + uint16_t nb_desc; /**< Number of descriptor for this virt queue */ > > + uint64_t dev_flags; /**< Device specific flags */ > > Use of this? Need more comments on this. > Since it is in slowpath, We can have non opaque names here based on > each driver capability. > > > > + void *dev_ctx; /**< Device specific context */ > > Use of this ? Need more comment ont this. > I think this should be dropped. We should not have any opaque device-specific info in these structs, rather if a particular device needs parameters we should call them out. Drivers for which it's not relevant can ignore them (and report same in capability if necessary). Since this is not a dataplane API, we aren't concerned too much about perf and can size the struct appropriately. > > Please add some good amount of reserved bits and have API to init this > structure for future ABI stability, say rte_dmadev_queue_config_init() > or so. > I don't think that is necessary. Since the config struct is used only as parameter to the config function, any changes to it can be managed by versioning that single function. Padding would only be necessary if we had an array of these config structs somewhere. > > > + > > +/** > > + * A structure used to retrieve information of a DMA virt queue. > > + */ > > +struct rte_dmadev_queue_info { > > + enum dma_transfer_direction direction; > > A queue may support all directions so I think it should be a bitfield. > > > + /**< Associated transfer direction */ > > + uint16_t hw_queue_id; /**< The HW queue on which to create virt > > queue */ > > + uint16_t nb_desc; /**< Number of descriptor for this virt queue */ > > + uint64_t dev_flags; /**< Device specific flags */ > > +}; > > + > > > +__rte_experimental > > +static inline dma_cookie_t > > +rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id, > > + const struct dma_scatterlist *sg, > > + uint32_t sg_len, uint64_t flags) > > I would like to change this as: > rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id, const struct > rte_dma_sg *src, uint32_t nb_src, > const struct rte_dma_sg *dst, uint32_t nb_dst) or so allow the use case like > src 30 MB copy can be splitted as written as 1 MB x 30 dst. > > > > > +{ > > + struct rte_dmadev *dev = &rte_dmadevices[dev_id]; > > + return (*dev->copy_sg)(dev, vq_id, sg, sg_len, flags); > > +} > > + > > +/** > > + * @warning > > + * @b EXPERIMENTAL: this API may change without prior notice. > > + * > > + * Enqueue a fill operation onto the DMA virt queue > > + * > > + * This queues up a fill operation to be performed by hardware, but does > > not > > + * trigger hardware to begin that operation. > > + * > > + * @param dev_id > > + * The identifier of the device. > > + * @param vq_id > > + * The identifier of virt queue. > > + * @param pattern > > + * The pattern to populate the destination buffer with. > > + * @param dst > > + * The address of the destination buffer. > > + * @param length > > + * The length of the destination buffer. > > + * @param flags > > + * An opaque flags for this operation. > > PLEASE REMOVE opaque stuff from fastpath it will be a pain for > application writers as > they need to write multiple combinations of fastpath. flags are OK, if > we have a valid > generic flag now to control the transfer behavior. > +1. Flags need to be explicitly listed. If we don't have any flags for now, we can specify that the value must be given as zero and it's for future use. > > > +/** > > + * @warning > > + * @b EXPERIMENTAL: this API may change without prior notice. > > + * > > + * Add a fence to force ordering between operations > > + * > > + * This adds a fence to a sequence of operations to enforce ordering, such > > that > > + * all operations enqueued before the fence must be completed before > > operations > > + * after the fence. > > + * NOTE: Since this fence may be added as a flag to the last operation > > enqueued, > > + * this API may not function correctly when called immediately after an > > + * "rte_dmadev_perform" call i.e. before any new operations are enqueued. > > + * > > + * @param dev_id > > + * The identifier of the device. > > + * @param vq_id > > + * The identifier of virt queue. > > + * > > + * @return > > + * - =0: Successful add fence. > > + * - <0: Failure to add fence. > > + * > > + * NOTE: The caller must ensure that the input parameter is valid and the > > + * corresponding device supports the operation. > > + */ > > +__rte_experimental > > +static inline int > > +rte_dmadev_fence(uint16_t dev_id, uint16_t vq_id) > > +{ > > + struct rte_dmadev *dev = &rte_dmadevices[dev_id]; > > + return (*dev->fence)(dev, vq_id); > > +} > > Since HW submission is in a queue(FIFO) the ordering is always > maintained. Right? > Could you share more details and use case of fence() from > driver/application PoV? > There are different kinds of ordering to consider, ordering of completions and the ordering of operations. While jobs are reported as completed to the user in order, for performance hardware, may overlap individual jobs within a burst (or even across bursts). Therefore, we need a fence operation to inform hardware that one job should not be started until the other has fully completed. > > > + > > +/** > > + * @warning > > + * @b EXPERIMENTAL: this API may change without prior notice. > > + * > > + * Trigger hardware to begin performing enqueued operations > > + * > > + * This API is used to write the "doorbell" to the hardware to trigger it > > + * to begin the operations previously enqueued by rte_dmadev_copy/fill() > > + * > > + * @param dev_id > > + * The identifier of the device. > > + * @param vq_id > > + * The identifier of virt queue. > > + * > > + * @return > > + * - =0: Successful trigger hardware. > > + * - <0: Failure to trigger hardware. > > + * > > + * NOTE: The caller must ensure that the input parameter is valid and the > > + * corresponding device supports the operation. > > + */ > > +__rte_experimental > > +static inline int > > +rte_dmadev_perform(uint16_t dev_id, uint16_t vq_id) > > +{ > > + struct rte_dmadev *dev = &rte_dmadevices[dev_id]; > > + return (*dev->perform)(dev, vq_id); > > +} > > Since we have additional function call overhead in all the > applications for this scheme, I would like to understand > the use of doing this way vs enq does the doorbell implicitly from > driver/application PoV? > In our benchmarks it's just faster. When we tested it, the overhead of the function calls was noticably less than the cost of building up the parameter array(s) for passing the jobs in as a burst. [We don't see this cost with things like NIC I/O since DPDK tends to already have the mbuf fully populated before the TX call anyway.] > > > + > > +/** > > + * @warning > > + * @b EXPERIMENTAL: this API may change without prior notice. > > + * > > + * Returns the number of operations that have been successful completed. > > + * > > + * @param dev_id > > + * The identifier of the device. > > + * @param vq_id > > + * The identifier of virt queue. > > + * @param nb_cpls > > + * The maximum number of completed operations that can be processed. > > + * @param[out] cookie > > + * The last completed operation's cookie. > > + * @param[out] has_error > > + * Indicates if there are transfer error. > > + * > > + * @return > > + * The number of operations that successful completed. > > successfully > > > + * > > + * NOTE: The caller must ensure that the input parameter is valid and the > > + * corresponding device supports the operation. > > + */ > > +__rte_experimental > > +static inline uint16_t > > +rte_dmadev_completed(uint16_t dev_id, uint16_t vq_id, const uint16_t > > nb_cpls, > > + dma_cookie_t *cookie, bool *has_error) > > +{ > > + struct rte_dmadev *dev = &rte_dmadevices[dev_id]; > > + has_error = false; > > + return (*dev->completed)(dev, vq_id, nb_cpls, cookie, has_error); > > It may be better to have cookie/ring_idx as third argument. > No strong opinions here, but having it as in the code above means all input parameters come before all output, which makes sense to me. > > +} > > + > > +/** > > + * @warning > > + * @b EXPERIMENTAL: this API may change without prior notice. > > + * > > + * Returns the number of operations that failed to complete. > > + * NOTE: This API was used when rte_dmadev_completed has_error was set. > > + * > > + * @param dev_id > > + * The identifier of the device. > > + * @param vq_id > > + * The identifier of virt queue. > (> + * @param nb_status > > + * Indicates the size of status array. > > + * @param[out] status > > + * The error code of operations that failed to complete. > > + * @param[out] cookie > > + * The last failed completed operation's cookie. > > + * > > + * @return > > + * The number of operations that failed to complete. > > + * > > + * NOTE: The caller must ensure that the input parameter is valid and the > > + * corresponding device supports the operation. > > + */ > > +__rte_experimental > > +static inline uint16_t > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id, > > + const uint16_t nb_status, uint32_t *status, > > + dma_cookie_t *cookie) > > IMO, it is better to move cookie/rind_idx at 3. > Why it would return any array of errors? since it called after > rte_dmadev_completed() has > has_error. Is it better to change > > rte_dmadev_error_status((uint16_t dev_id, uint16_t vq_id, dma_cookie_t > *cookie, uint32_t *status) > > I also think, we may need to set status as bitmask and enumerate all > the combination of error codes > of all the driver and return string from driver existing rte_flow_error > > See > struct rte_flow_error { > enum rte_flow_error_type type; /**< Cause field and error types. */ > const void *cause; /**< Object responsible for the error. */ > const char *message; /**< Human-readable error message. */ > }; > I think we need a multi-return value API here, as we may add operations in future which have non-error status values to return. The obvious case is DMA engines which support "compare" operations. In that case a successful compare (as in there were no DMA or HW errors) can return "equal" or "not-equal" as statuses. For general "copy" operations, the faster completion op can be used to just return successful values (and only call this status version on error), while apps using those compare ops or a mixture of copy and compare ops, would always use the slower one that returns status values for each and every op.. The ioat APIs used 32-bit integer values for this status array so as to allow e.g. 16-bits for error code and 16-bits for future status values. For most operations there should be a fairly small set of things that can go wrong, i.e. bad source address, bad destination address or invalid length. Within that we may have a couple of specifics for why an address is bad, but even so I don't think we need to start having multiple bit combinations. > > +{ > > + struct rte_dmadev *dev = &rte_dmadevices[dev_id]; > > + return (*dev->completed_fails)(dev, vq_id, nb_status, status, > > cookie); > > +} > > + > > +struct rte_dmadev_stats { > > + uint64_t enqueue_fail_count; > > + /**< Conut of all operations which failed enqueued */ > > + uint64_t enqueued_count; > > + /**< Count of all operations which successful enqueued */ > > + uint64_t completed_fail_count; > > + /**< Count of all operations which failed to complete */ > > + uint64_t completed_count; > > + /**< Count of all operations which successful complete */ > > +}; > > We need to have capability API to tell which items are > updated/supported by the driver. > I also would remove the enqueue fail counts, since they are better counted by the app. If a driver reports 20,000 failures we have no way of knowing if that is 20,000 unique operations which failed to enqueue or a single operation which failed to enqueue 20,000 times but succeeded on attempt 20,001. > > > diff --git a/lib/dmadev/rte_dmadev_core.h b/lib/dmadev/rte_dmadev_core.h > > new file mode 100644 > > index 0000000..a3afea2 > > --- /dev/null > > +++ b/lib/dmadev/rte_dmadev_core.h > > @@ -0,0 +1,98 @@ > > +/* SPDX-License-Identifier: BSD-3-Clause > > + * Copyright 2021 HiSilicon Limited. > > + */ > > + > > +#ifndef _RTE_DMADEV_CORE_H_ > > +#define _RTE_DMADEV_CORE_H_ > > + > > +/** > > + * @file > > + * > > + * RTE DMA Device internal header. > > + * > > + * This header contains internal data types. But they are still part of the > > + * public API because they are used by inline public functions. > > + */ > > + > > +struct rte_dmadev; > > + > > +typedef dma_cookie_t (*dmadev_copy_t)(struct rte_dmadev *dev, uint16_t > > vq_id, > > + void *src, void *dst, > > + uint32_t length, uint64_t flags); > > +/**< @internal Function used to enqueue a copy operation. */ > > To avoid namespace conflict(as it is public API) use rte_ > > > > + > > +/** > > + * The data structure associated with each DMA device. > > + */ > > +struct rte_dmadev { > > + /**< Enqueue a copy operation onto the DMA device. */ > > + dmadev_copy_t copy; > > + /**< Enqueue a scatter list copy operation onto the DMA device. */ > > + dmadev_copy_sg_t copy_sg; > > + /**< Enqueue a fill operation onto the DMA device. */ > > + dmadev_fill_t fill; > > + /**< Enqueue a scatter list fill operation onto the DMA device. */ > > + dmadev_fill_sg_t fill_sg; > > + /**< Add a fence to force ordering between operations. */ > > + dmadev_fence_t fence; > > + /**< Trigger hardware to begin performing enqueued operations. */ > > + dmadev_perform_t perform; > > + /**< Returns the number of operations that successful completed. */ > > + dmadev_completed_t completed; > > + /**< Returns the number of operations that failed to complete. */ > > + dmadev_completed_fails_t completed_fails; > > We need to limit fastpath items in 1 CL > I don't think that is going to be possible. I also would like to see numbers to check if we benefit much from having these fastpath ops separate from the regular ops. > > + > > + void *dev_private; /**< PMD-specific private data */ > > + const struct rte_dmadev_ops *dev_ops; /**< Functions exported by > > PMD */ > > + > > + uint16_t dev_id; /**< Device ID for this instance */ > > + int socket_id; /**< Socket ID where memory is allocated */ > > + struct rte_device *device; > > + /**< Device info. supplied during device initialization */ > > + const char *driver_name; /**< Driver info. supplied by probing */ > > + char name[RTE_DMADEV_NAME_MAX_LEN]; /**< Device name */ > > + > > + RTE_STD_C11 > > + uint8_t attached : 1; /**< Flag indicating the device is attached */ > > + uint8_t started : 1; /**< Device state: STARTED(1)/STOPPED(0) */ > > Add a couple of reserved fields for future ABI stability. > > > + > > +} __rte_cache_aligned; > > + > > +extern struct rte_dmadev rte_dmadevices[]; > > +