Hi Fiona,

Please refer to my comments below with my understanding of the two major points, OUT_OF_SPACE and Stateful Design. If you believe we still need a meeting to converge on these, please share the meeting details with me.
> -----Original Message----- > From: Trahe, Fiona [mailto:fiona.tr...@intel.com] > Sent: 15 December 2017 23:11 > To: Verma, Shally <shally.ve...@cavium.com>; dev@dpdk.org > Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; > Challa, Mahipal <mahipal.cha...@cavium.com>; De Lara Guarch, Pablo > <pablo.de.lara.gua...@intel.com>; Gupta, Ashish > <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com>; > Trahe, Fiona <fiona.tr...@intel.com>; Jain, Deepak K > <deepak.k.j...@intel.com> > Subject: RE: [RFC v1] doc compression API for DPDK > > Hi Shally, > > > > -----Original Message----- > > From: Verma, Shally [mailto:shally.ve...@cavium.com] > > Sent: Thursday, December 7, 2017 5:43 AM > > To: Trahe, Fiona <fiona.tr...@intel.com>; dev@dpdk.org > > Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; > Challa, Mahipal > > <mahipal.cha...@cavium.com>; De Lara Guarch, Pablo > <pablo.de.lara.gua...@intel.com>; Gupta, Ashish > > <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com> > > Subject: RE: [RFC v1] doc compression API for DPDK > > //snip.... > > > > > > > Please note any time output buffer ran out of space during write > then > > > > > operation will turn “Stateful”. See > > > > > > more on Stateful under respective section. > > > > > [Fiona] Let's come back to this later. An alternative is that > > > OUT_OF_SPACE is > > > > > returned and the application > > > > > must treat as a fail and resubmit the operation with a larger > destination > > > > > buffer. > > > > > > > > [Shally] Then I propose to add a feature flag > > > "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible > > > > PMD design. > > > > As there're devices which treat it as error on compression but not on > > > decompression. > > > > If it is not supported, then it should be treated as failure condition > > > > and > app > > > can resubmit operation. > > > > if supported, behaviour *To-be-Defined* under stateful. > > > [Fiona] Can you explain 'turn stateful' some more? > > > If compressor runs out of space during stateless operation, either comp > or > > > decomp, and turns stateful, how would the app know? And what would > be in > > > status, consumed and produced? > > > Could it return OUT_OF_SPACE, and if both consumed and produced == 0 > > > > [Shally] If consumed = produced == 0, then it's not OUT_OF_SPACE > condition. > > > > > then the whole op must be resubmitted with a bigger output buffer. But > if > > > consumed and produced > 0 then app could take the output and submit > next > > > op > > > continuing from consumed+1. > > > > > > > [Shally] consumed and produced will *always* be > 0 in case of > OUT_OF_SPACE. > > OUT_OF_SPACE means output buffer exhausted while writing data into it > and PMD may have more to > > write to it. So in such case, PMD should set > > Produced = complete length of output buffer > > Status = OUT_OF_SPACE > > consume, following possibilities here: > > 1. consumed = complete length of src mbuf means PMD has read full input, > OR > > 2. consumed = partial length of src mbuf means PMD has read partial input > > > > On seeing this status, app should consume output and re-enqueue same > op with empty output buffer and > > src = consumed+1. > [Fiona] As this was a stateless op, the PMD cannot be expected to have > stored the history and state and so > cannot be expected to continue from consumed+1. This would be stateful > behaviour. [Shally] Exactly. 
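To make that concrete, here is a minimal sketch of the application-side handling for the stateless case, i.e. treating OUT_OF_SPACE as a failure and resubmitting with a larger destination buffer. All names below (the op fields, the OUT_OF_SPACE status, the burst calls) are only the provisional names from this RFC discussion, so please read it as illustrative pseudocode rather than the final API:

    /* Illustrative sketch only: status/field/function names follow the RFC
     * discussion and are not final. Mempool setup and error paths are omitted.
     */
    static int
    do_stateless_op(uint8_t dev_id, uint16_t qp_id, struct rte_comp_op *op)
    {
        if (rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1) == 0)
            return -1;      /* qp full, caller retries later */

        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ;               /* poll until the op completes */

        if (op->status == RTE_COMP_OP_STATUS_OUT_OF_SPACE) {
            /* Stateless: PMD cannot continue from consumed+1, so the whole
             * op is resubmitted with a bigger destination buffer.
             */
            op->m_dst = realloc_bigger_dst(op->m_dst);  /* app-defined helper */
            return do_stateless_op(dev_id, qp_id, op);
        }

        return (op->status == RTE_COMP_OP_STATUS_SUCCESS) ? 0 : -1;
    }
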
> But it seems you are saying that even in this stateless case you'd like the PMDs who can store state
> to have the option of converting to stateful. So
> a PMD which can support this could return OUT_OF_SPACE with produced/consumed as you describe above.
> a PMD which can't support it should return an error.
> The appl can continue on from consumed+1 in the former case and resubmit the full request
> with a bigger buffer in the latter case.
> Is this the behaviour you're looking for?
> If so the error could be something like NEED_BIGGER_DST_BUF?
> However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same information on the API?
> It may correspond to an error on the underlying PMD, but would it be simpler on the compressdev API
>
> > Please note as per current proposal, app should call rte_compdev_enqueue_stream() version of API if it
> > doesn't know output size beforehand.
> [Fiona] True. But above is only trying to describe behaviour in the stateless error case.

[Shally] Ok. Now I see the point of confusion around the term 'turns stateful'. No, it is not a stateless-to-stateful conversion. A stateless operation stays stateless, and for stateless we don't expect an OUT_OF_SPACE error. So now I also understand what you are trying to imply with produced=consumed=0. Let me summarise the redefinition of OUT_OF_SPACE based on RFC v3:

Interpreting OUT_OF_SPACE condition:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A. Stateless Operations:
----------------------------------
A.1 If the operation is stateless, i.e. rte_comp_op.op_type == RTE_COMP_OP_STATELESS, and the PMD runs out of output buffer during compression or decompression, then it is an error condition for the PMD. The PMD will reset itself and return produced=consumed=0 with status OUT_OF_SPACE. On seeing this, the application should resubmit the full request with a bigger output buffer.

B. Stateful Operations:
-------------------------------
B.1 If the operation is stateful, i.e. rte_comp_op.op_type == RTE_COMP_OP_STATEFUL, and the PMD runs out of output buffer during compression or decompression, then the PMD will update produced and consumed (as described above) and the app should resubmit the op with input starting from consumed+1 and an output buffer with free space. Please note that for this case the application should allocate a stream via a call to rte_comp_stream_create(), attach it to the op, and pass it along every time the pending op is enqueued, until op processing completes with status set to SUCCESS/FAILURE.

> > //snip.....
> > > > > > D.2.1.2 Stateful operation state maintenance
> > > > > > -------------------------------------------------------------
> > > > > > This section starts with description of our understanding about compression API support for stateful.
> > > > > > Depending upon understanding build upon these concepts, we will identify required data structure/param
> > > > > > to maintain in-progress operation context by PMD.
> > > > > >
> > > > > > For stateful compression, batch of dependent packets starts at a packet having
> > > > > > RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and end at packet having RTE_FULL_FLUSH/FINAL_FLUSH.
> > > > > > i.e.
array of operations will carry structure like this: > > > > > > > > > > > > ----------------------------------------------------------------------------------- > - > > > > > > |op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush| > > > > > > ----------------------------------------------------------------------------------- > - > > > > > > > > > [Fiona] I think it needs to be more constrained than your examples > below. > > > Only 1 operation from a stream can be in a burst. As each operation > > > in a stateful stream must complete, as next operation needs state and > > > history > > > of previous operation to be complete before it can be processed. > > > And if one failed, e.g. due to OUT_OF_SPACE, this should affect > > > the following operation in the same stream. > > > Worst case this means bursts of 1. Burst can be >1 if there are multiple > > > independent streams with available data for processing. Or if there is > > > data available which can be statelessly processed. > > > > > > If there are multiple buffers available from a stream , then instead they > can > > > be linked together in an mbuf chain sent in a single operation. > > > > > > To handle the sequences below would mean the PMD > > > would need to store ops sending one at a time to be processed. > > > > > > As this is significantly different from what you describe below, I'll > > > wait for > > > further feedback > > > before continuing. > > > > [Shally] I concur with your thoughts. And these're are not significantly > different from the concept > > presented below. > > > > Yes as you mentioned, even for burst_size>1 PMD will have to serialize > each op internally i.e. > > It has to wait for previous to finish before putting next for processing > > which > is > > as good as application making serialised call passing one op at-a-time or if > > stream consists of multiple buffers, making their scatter-gather list and > > then enqueue it as one op at a time which is more efficient and ideal usage. > > However in order to allow extensibility, I didn't mention limitation on > burst_size. > > Because If PMD doesn't support burst_size > 1 it can always return > nb_enqueued = 1, in which case > > app can enqueue next however with condition it should wait for previous > to complete > > before making next enqueue call. > > > > So, if we take simple example to compress 2k of data with src mbuf size = > 1k. > > Then with burst_size=1, expected call flow would be(this is just one flow, > other variations are also possible > > suchas making chain of 1k buffers and pass whole data in one go): > > > > 1. fill 1st 1k chunk of data in op.msrc > > 2.enqueue_stream (..., |op.flush = no_flush|, 1, ptr_stream); > > 3.dequeue_burst(|op|,1); > > 4.refill next 1k chunk in op.msrc > > 5.enqueue_stream(...,|op.flush = full_flush|, 1 , ptr_stream); > > 6.dequeue_burst(|op|, 1); > > 7.end > > > > So, I don’t see much of a change in API call flow from here to design > presented below except nb_ops = 1 in > > each call. > > However I am assuming that op structure would still be same for stateful > processing i.e. it would start with > > op.flush value = NO/SYNC_FLUSH and end at op with flush value = FULL > FLUSH. > > Are we on same page here? > > > > Thanks > > Shally > > [Fiona] We still have a different understanding of the stateful flow needed > on the API. > I’ll try to clarify and maybe we can set up a meeting to discuss. 
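[Shally] To make sure we are reading the stateful flow the same way, below is a rough sketch of the 2k-in-1k-chunks sequence I described above. The names (rte_compdev_enqueue_stream(), op.flush values, ptr_stream, the helper to refill the src mbuf) are only placeholders from the current RFC discussion, not a final signature:

    /* Illustrative sketch only: enqueue_stream()/flush names follow the RFC
     * discussion and are not final. Mbuf refill and error handling are omitted.
     */
    static void
    compress_2k_stateful(uint8_t dev_id, uint16_t qp_id,
                         struct rte_comp_session *sess, struct rte_comp_op *op)
    {
        void *ptr_stream;

        rte_comp_stream_create(dev_id, sess, &ptr_stream); /* per-stream state/history */

        /* 1st 1k chunk: more data follows, so no flush */
        fill_src_mbuf(op, 0, 1024);                         /* app-defined helper */
        op->flush = RTE_COMP_NO_FLUSH;
        rte_compdev_enqueue_stream(dev_id, qp_id, &op, 1, ptr_stream);
        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ;                                               /* wait for completion */

        /* 2nd (last) 1k chunk: end of stream, so full/final flush */
        fill_src_mbuf(op, 1024, 1024);
        op->flush = RTE_COMP_FULL_FLUSH;
        rte_compdev_enqueue_stream(dev_id, qp_id, &op, 1, ptr_stream);
        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ;

        rte_comp_stream_free(dev_id, ptr_stream);
    }

This matches the numbered steps in the quoted flow: each op in the stream completes before the next one is enqueued, and only the flush value distinguishes the last chunk.
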
> My assumptions first: > • Order of ops on a qp must be maintained – ops should be dequeued > in same sequence they are enqueued. > • Ops from many streams can be enqueued on same qp. > • Ops from a qp may be fanned out to available hw or sw engines and > processed in parallel, so each op must be independent. > • Stateless and stateful ops can be enqueued on the same qp > > Submitting a burst of stateless ops to a qp is no problem. > Submitting more than 1 op at a time from the same stateful stream to a qp is > a problem. > Example: > Appl submits 2 ops in same stream in a burst, each has src and dest mbufs, > input length/offset and > requires checksum to be calculated. > The first op must be processed to completion before the second can be > started as it needs the history and the checksum so far. > If each dest mbuf is big enough so no overflow, each dest mbuf will be > partially filled. This is probably not > what’s desired, and will force an extra copy to make the output data > contiguous. > If the dest mbuf in the first op is too small, then does the PMD alloc more > memory in the dest mbuf? > Or alloc another mbuf? Or fail and the whole burst must be resubmitted? > Or store the 2nd op, wait, on seeing the OUT_OF_SPACE on the 1st op, > overwrite the src, dest, len etc of the 2nd op > to include the unprocessed part of the 1st op? > In the meantime, are all other ops on the qp blocked behind these? > For hw accelerators it’s worse, as PMD would normally return once ops are > offloaded and the dequeue would > pass processed ops straight back to the appl. Instead, the enqueue would > need to kick off a thread to > dequeue ops and filter to find the stateful one, storing the others til the > next > application dequeue is called. > > Above scenarios don’t lend themselves to accelerating a packet processing > workload. > It pushes a workload down to all PMDs which I believe belongs above this API > as > that work is not about offloading the compute intensive compression work > but > about the sequencing of data and so is better coded once, above the API in > an application layer > common to all PMDs. (See Note1 in http://dpdk.org/ml/archives/dev/2017- > October/078944.html ) > If an application has several packets with data from a stream that it needs to > (de)compress statefully, > what it probably wants is for the output data to fill each output buffer > completely before writing to the next buffer. > Chaining the src mbufs in these pkts into one chain and sending as one op > allows the output > data to be packed into a dest mbuf or mbuf chain. > I think what’s needed is a layer above the API to accumulate incoming > packets while waiting for the > previous set of packets to be compressed. Forwarding to the PMD to queue > there is not the right place > to buffer them as the queue should be per stream rather than on the > accelerator engine’s queue > which has lots of other independent packets. > [Shally] Ok. I believe I get it. In general I agree to this proposal. However have concern on 1 point here i.e. order maintenance. Please see further for more explanation. > > Proposal: > • Ops from a qp may be fanned out to available hw or sw engines and > processed in parallel, so each op must be independent. [Shally] Possible only if PMD support combination of SW and HW processing. Right? > • Order of ops on a qp must be maintained – ops should be dequeued in > same sequence they are enqueued. [Shally] If each op is independent then why do we need to maintain ordering. 
Since they're independent and can thus be processed in parallel, they may well complete out of order and be available for dequeue as soon as they are done. Serializing them will limit HW throughput capability. And I can envision that some apps may not care about ordering, only completion. So I would suggest that an application which needs ordering should tag each op with some id or serial number in the op user_data area to identify enqueue order, OR we may add a flag in the enqueue_burst() API to enforce serialized dequeuing, if that is a hard requirement for anyone.

> • Stateless and stateful ops can be enqueued on the same qp
> • Stateless and stateful ops can be enqueued in the same burst
> • Only 1 op at a time may be enqueued to the qp from any stateful stream.
> • A burst can have multiple stateful ops, but each must be from a different stream.
> • All ops will have a session attached – this will only contain immutable data which can be used by many ops, devices and or drivers at the same time.
> • All stateful ops will have a stream attached for maintaining state and history, this can only be used by one op at a time.

[Shally] So, you mean a single enqueue_burst() *can* carry multiple streams, i.e. this is allowed both within a burst and on a qp (say, when multiple threads call enqueue_burst() on the same qp):

---------------------------------------------------------------------------------------------------------
enqueue_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
---------------------------------------------------------------------------------------------------------

where op1, op2 ... op5 all belong to *different* streams. Op3 can be stateless/stateful depending upon its op_type value, and each op can have the *same or different* session. If I understand this right, then yes, it looks good to me. However, this also brings up one minor point for discussion, but I would wait to initiate that until we close on the current open points.

Thanks
Shally

> Code artefacts:
>
> enum rte_comp_op_type {
>     RTE_COMP_OP_STATELESS,
>     RTE_COMP_OP_STATEFUL
> }
>
> Add following to rte_comp_op:
> enum rte_comp_op_type op_type;
> void * stream_private;
> /* location where PMD maintains stream state – only required if op_type is STATEFUL, else set to NULL */
>
> As size of stream data will vary depending on PMD, each PMD or device should allocate & manage its own mempool. Associated APIs are:
> rte_comp_stream_create(uint8_t dev_id, rte_comp_session *sess, void ** stream);
> /* This should alloc a stream from the device’s mempool and initialise it. This handle will be passed to the PMD with every op in the stream. Q. Should qp_id also be added, with constraint that all ops in the same stream should be sent to the same qp? */
> rte_comp_stream_free(uint8_t dev_id, void * stream);
> /* This should clear the stream and return it to the device’s mempool */
>
> All ops are enqueued/dequeued to device & qp using same rte_compressdev_enqueue_burst()/…dequeue_burst;
>
> Re flush flags, stateful stream would start with op.flush = NONE or SYNC and end with FULL or FINAL
> STATELESS ops would just use either FULL or FINAL
>
> Let me know if you want to set up a meeting - it might be a more effective way to arrive at an API that works for all PMDs.
>
> I'll send out a v3 today with above plus updates based on all the other feedback.
>
> Regards,
> Fiona
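[Shally] One last thing, to confirm we read the proposed artefacts the same way, here is a rough usage sketch as I understand them. All names here (op_type, stream_private, the field carrying the session, the enqueue call) are provisional per this thread, not a final API:

    /* Illustrative sketch only: every name here is a provisional one from this
     * thread. Mempool/session setup, dequeue and error paths are omitted.
     */
    static void
    submit_op(uint8_t dev_id, uint16_t qp_id,
              struct rte_comp_session *sess, struct rte_comp_op *op, int stateful)
    {
        op->session = sess;   /* session attached to op (exact field name TBD);
                               * immutable, shareable across ops */
        op->op_type = stateful ? RTE_COMP_OP_STATEFUL : RTE_COMP_OP_STATELESS;

        if (stateful)
            /* one stream per flow: holds state/history, used by one op at a time */
            rte_comp_stream_create(dev_id, sess, &op->stream_private);
        else
            op->stream_private = NULL;

        /* stateless and stateful ops share the same enqueue/dequeue calls;
         * at most one op in flight per stateful stream at any time
         */
        rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1);
    }

On dequeue, OUT_OF_SPACE would then be handled per the A.1/B.1 summary earlier in this mail: for a stateless op (produced=consumed=0) the whole request is resubmitted with a bigger output buffer; for a stateful op the produced output is drained and the op is resubmitted from the consumed offset with the same stream_private attached.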