Hi Fiona,

Please refer to my comments below with my understanding of the two major points, OUT_OF_SPACE and Stateful Design. If you believe we still need a meeting to converge on these, please share the meeting details with me.
> -----Original Message----- > From: Trahe, Fiona [mailto:fiona.tr...@intel.com] > Sent: 15 December 2017 23:11 > To: Verma, Shally <shally.ve...@cavium.com>; dev@dpdk.org > Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; > Challa, Mahipal <mahipal.cha...@cavium.com>; De Lara Guarch, Pablo > <pablo.de.lara.gua...@intel.com>; Gupta, Ashish > <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com>; > Trahe, Fiona <fiona.tr...@intel.com>; Jain, Deepak K > <deepak.k.j...@intel.com> > Subject: RE: [RFC v1] doc compression API for DPDK > > Hi Shally, > > > > -----Original Message----- > > From: Verma, Shally [mailto:shally.ve...@cavium.com] > > Sent: Thursday, December 7, 2017 5:43 AM > > To: Trahe, Fiona <fiona.tr...@intel.com>; dev@dpdk.org > > Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; > Challa, Mahipal > > <mahipal.cha...@cavium.com>; De Lara Guarch, Pablo > <pablo.de.lara.gua...@intel.com>; Gupta, Ashish > > <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com> > > Subject: RE: [RFC v1] doc compression API for DPDK > > //snip.... > > > > > > > Please note any time output buffer ran out of space during write > then > > > > > operation will turn “Stateful”. See > > > > > > more on Stateful under respective section. > > > > > [Fiona] Let's come back to this later. An alternative is that > > > OUT_OF_SPACE is > > > > > returned and the application > > > > > must treat as a fail and resubmit the operation with a larger > destination > > > > > buffer. > > > > > > > > [Shally] Then I propose to add a feature flag > > > "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible > > > > PMD design. > > > > As there're devices which treat it as error on compression but not on > > > decompression. > > > > If it is not supported, then it should be treated as failure condition > > > > and > app > > > can resubmit operation. > > > > if supported, behaviour *To-be-Defined* under stateful. > > > [Fiona] Can you explain 'turn stateful' some more? > > > If compressor runs out of space during stateless operation, either comp > or > > > decomp, and turns stateful, how would the app know? And what would > be in > > > status, consumed and produced? > > > Could it return OUT_OF_SPACE, and if both consumed and produced == 0 > > > > [Shally] If consumed = produced == 0, then it's not OUT_OF_SPACE > condition. > > > > > then the whole op must be resubmitted with a bigger output buffer. But > if > > > consumed and produced > 0 then app could take the output and submit > next > > > op > > > continuing from consumed+1. > > > > > > > [Shally] consumed and produced will *always* be > 0 in case of > OUT_OF_SPACE. > > OUT_OF_SPACE means output buffer exhausted while writing data into it > and PMD may have more to > > write to it. So in such case, PMD should set > > Produced = complete length of output buffer > > Status = OUT_OF_SPACE > > consume, following possibilities here: > > 1. consumed = complete length of src mbuf means PMD has read full input, > OR > > 2. consumed = partial length of src mbuf means PMD has read partial input > > > > On seeing this status, app should consume output and re-enqueue same > op with empty output buffer and > > src = consumed+1. > [Fiona] As this was a stateless op, the PMD cannot be expected to have > stored the history and state and so > cannot be expected to continue from consumed+1. This would be stateful > behaviour. [Shally] Exactly. 
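To make that concrete, here is a minimal sketch of the application-side handling for the stateless case, i.e. treating OUT_OF_SPACE as a failure and resubmitting with a larger destination buffer. All names below (the op fields, the OUT_OF_SPACE status, the burst calls) are only the provisional names from this RFC discussion, so please read it as illustrative pseudocode rather than the final API:

    /* Illustrative sketch only: status/field/function names follow the RFC
     * discussion and are not final. Mempool setup and error paths are omitted.
     */
    static int
    do_stateless_op(uint8_t dev_id, uint16_t qp_id, struct rte_comp_op *op)
    {
        if (rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1) == 0)
            return -1;      /* qp full, caller retries later */

        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ;               /* poll until the op completes */

        if (op->status == RTE_COMP_OP_STATUS_OUT_OF_SPACE) {
            /* Stateless: PMD cannot continue from consumed+1, so the whole
             * op is resubmitted with a bigger destination buffer.
             */
            op->m_dst = realloc_bigger_dst(op->m_dst);  /* app-defined helper */
            return do_stateless_op(dev_id, qp_id, op);
        }

        return (op->status == RTE_COMP_OP_STATUS_SUCCESS) ? 0 : -1;
    }
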
> But it seems you are saying that even in this stateless case you'd like the PMDs who can store state
> to have the option of converting to stateful. So
> a PMD which can support this could return OUT_OF_SPACE with produced/consumed as you describe above.
> a PMD which can't support it should return an error.
> The appl can continue on from consumed+1 in the former case and resubmit the full request
> with a bigger buffer in the latter case.
> Is this the behaviour you're looking for?
> If so the error could be something like NEED_BIGGER_DST_BUF?
> However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same information on the API?
> It may correspond to an error on the underlying PMD, but would it be simpler on the compressdev API
>
> > Please note as per current proposal, app should call rte_compdev_enqueue_stream() version of API if it
> > doesn't know output size beforehand.
> [Fiona] True. But above is only trying to describe behaviour in the stateless error case.

[Shally] Ok. Now I see the point of confusion around the term 'turns stateful'. No, it is not a stateless-to-stateful conversion. A stateless operation stays stateless, and for stateless we don't expect an OUT_OF_SPACE error. So now I also understand what you are trying to imply with produced=consumed=0. Let me summarise the redefinition of OUT_OF_SPACE based on RFC v3:

Interpreting OUT_OF_SPACE condition:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A. Stateless Operations:
----------------------------------
A.1 If the operation is stateless, i.e. rte_comp_op.op_type == RTE_COMP_OP_STATELESS, and the PMD runs out of output buffer during compression or decompression, then it is an error condition for the PMD. The PMD will reset itself and return produced=consumed=0 with status OUT_OF_SPACE. On seeing this, the application should resubmit the full request with a bigger output buffer.

B. Stateful Operations:
-------------------------------
B.1 If the operation is stateful, i.e. rte_comp_op.op_type == RTE_COMP_OP_STATEFUL, and the PMD runs out of output buffer during compression or decompression, then the PMD will update produced and consumed (as described above) and the app should resubmit the op with input starting from consumed+1 and an output buffer with free space. Please note that for this case the application should allocate a stream via a call to rte_comp_stream_create(), attach it to the op, and pass it along every time the pending op is enqueued, until op processing completes with status set to SUCCESS/FAILURE.

> > //snip.....
> > > > > > D.2.1.2 Stateful operation state maintenance
> > > > > > -------------------------------------------------------------
> > > > > > This section starts with description of our understanding about compression API support for stateful.
> > > > > > Depending upon understanding build upon these concepts, we will identify required data structure/param
> > > > > > to maintain in-progress operation context by PMD.
> > > > > >
> > > > > > For stateful compression, batch of dependent packets starts at a packet having
> > > > > > RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and end at packet having RTE_FULL_FLUSH/FINAL_FLUSH.
> > > > > > i.e.
array of operations will carry structure like this: > > > > > > > > > > > > ----------------------------------------------------------------------------------- > - > > > > > > |op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush| > > > > > > ----------------------------------------------------------------------------------- > - > > > > > > > > > [Fiona] I think it needs to be more constrained than your examples > below. > > > Only 1 operation from a stream can be in a burst. As each operation > > > in a stateful stream must complete, as next operation needs state and > > > history > > > of previous operation to be complete before it can be processed. > > > And if one failed, e.g. due to OUT_OF_SPACE, this should affect > > > the following operation in the same stream. > > > Worst case this means bursts of 1. Burst can be >1 if there are multiple > > > independent streams with available data for processing. Or if there is > > > data available which can be statelessly processed. > > > > > > If there are multiple buffers available from a stream , then instead they > can > > > be linked together in an mbuf chain sent in a single operation. > > > > > > To handle the sequences below would mean the PMD > > > would need to store ops sending one at a time to be processed. > > > > > > As this is significantly different from what you describe below, I'll > > > wait for > > > further feedback > > > before continuing. > > > > [Shally] I concur with your thoughts. And these're are not significantly > different from the concept > > presented below. > > > > Yes as you mentioned, even for burst_size>1 PMD will have to serialize > each op internally i.e. > > It has to wait for previous to finish before putting next for processing > > which > is > > as good as application making serialised call passing one op at-a-time or if > > stream consists of multiple buffers, making their scatter-gather list and > > then enqueue it as one op at a time which is more efficient and ideal usage. > > However in order to allow extensibility, I didn't mention limitation on > burst_size. > > Because If PMD doesn't support burst_size > 1 it can always return > nb_enqueued = 1, in which case > > app can enqueue next however with condition it should wait for previous > to complete > > before making next enqueue call. > > > > So, if we take simple example to compress 2k of data with src mbuf size = > 1k. > > Then with burst_size=1, expected call flow would be(this is just one flow, > other variations are also possible > > suchas making chain of 1k buffers and pass whole data in one go): > > > > 1. fill 1st 1k chunk of data in op.msrc > > 2.enqueue_stream (..., |op.flush = no_flush|, 1, ptr_stream); > > 3.dequeue_burst(|op|,1); > > 4.refill next 1k chunk in op.msrc > > 5.enqueue_stream(...,|op.flush = full_flush|, 1 , ptr_stream); > > 6.dequeue_burst(|op|, 1); > > 7.end > > > > So, I don’t see much of a change in API call flow from here to design > presented below except nb_ops = 1 in > > each call. > > However I am assuming that op structure would still be same for stateful > processing i.e. it would start with > > op.flush value = NO/SYNC_FLUSH and end at op with flush value = FULL > FLUSH. > > Are we on same page here? > > > > Thanks > > Shally > > [Fiona] We still have a different understanding of the stateful flow needed > on the API. > I’ll try to clarify and maybe we can set up a meeting to discuss. 
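[Shally] To make sure we are reading the stateful flow the same way, below is a rough sketch of the 2k-in-1k-chunks sequence I described above. The names (rte_compdev_enqueue_stream(), op.flush values, ptr_stream, the helper to refill the src mbuf) are only placeholders from the current RFC discussion, not a final signature:

    /* Illustrative sketch only: enqueue_stream()/flush names follow the RFC
     * discussion and are not final. Mbuf refill and error handling are omitted.
     */
    static void
    compress_2k_stateful(uint8_t dev_id, uint16_t qp_id,
                         struct rte_comp_session *sess, struct rte_comp_op *op)
    {
        void *ptr_stream;

        rte_comp_stream_create(dev_id, sess, &ptr_stream); /* per-stream state/history */

        /* 1st 1k chunk: more data follows, so no flush */
        fill_src_mbuf(op, 0, 1024);                         /* app-defined helper */
        op->flush = RTE_COMP_NO_FLUSH;
        rte_compdev_enqueue_stream(dev_id, qp_id, &op, 1, ptr_stream);
        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ;                                               /* wait for completion */

        /* 2nd (last) 1k chunk: end of stream, so full/final flush */
        fill_src_mbuf(op, 1024, 1024);
        op->flush = RTE_COMP_FULL_FLUSH;
        rte_compdev_enqueue_stream(dev_id, qp_id, &op, 1, ptr_stream);
        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ;

        rte_comp_stream_free(dev_id, ptr_stream);
    }

This matches the numbered steps in the quoted flow: each op in the stream completes before the next one is enqueued, and only the flush value distinguishes the last chunk.
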
> My assumptions first: > • Order of ops on a qp must be maintained – ops should be dequeued > in same sequence they are enqueued. > • Ops from many streams can be enqueued on same qp. > • Ops from a qp may be fanned out to available hw or sw engines and > processed in parallel, so each op must be independent. > • Stateless and stateful ops can be enqueued on the same qp > > Submitting a burst of stateless ops to a qp is no problem. > Submitting more than 1 op at a time from the same stateful stream to a qp is > a problem. > Example: > Appl submits 2 ops in same stream in a burst, each has src and dest mbufs, > input length/offset and > requires checksum to be calculated. > The first op must be processed to completion before the second can be > started as it needs the history and the checksum so far. > If each dest mbuf is big enough so no overflow, each dest mbuf will be > partially filled. This is probably not > what’s desired, and will force an extra copy to make the output data > contiguous. > If the dest mbuf in the first op is too small, then does the PMD alloc more > memory in the dest mbuf? > Or alloc another mbuf? Or fail and the whole burst must be resubmitted? > Or store the 2nd op, wait, on seeing the OUT_OF_SPACE on the 1st op, > overwrite the src, dest, len etc of the 2nd op > to include the unprocessed part of the 1st op? > In the meantime, are all other ops on the qp blocked behind these? > For hw accelerators it’s worse, as PMD would normally return once ops are > offloaded and the dequeue would > pass processed ops straight back to the appl. Instead, the enqueue would > need to kick off a thread to > dequeue ops and filter to find the stateful one, storing the others til the > next > application dequeue is called. > > Above scenarios don’t lend themselves to accelerating a packet processing > workload. > It pushes a workload down to all PMDs which I believe belongs above this API > as > that work is not about offloading the compute intensive compression work > but > about the sequencing of data and so is better coded once, above the API in > an application layer > common to all PMDs. (See Note1 in http://dpdk.org/ml/archives/dev/2017- > October/078944.html ) > If an application has several packets with data from a stream that it needs to > (de)compress statefully, > what it probably wants is for the output data to fill each output buffer > completely before writing to the next buffer. > Chaining the src mbufs in these pkts into one chain and sending as one op > allows the output > data to be packed into a dest mbuf or mbuf chain. > I think what’s needed is a layer above the API to accumulate incoming > packets while waiting for the > previous set of packets to be compressed. Forwarding to the PMD to queue > there is not the right place > to buffer them as the queue should be per stream rather than on the > accelerator engine’s queue > which has lots of other independent packets. > [Shally] Ok. I believe I get it. In general I agree to this proposal. However have concern on 1 point here i.e. order maintenance. Please see further for more explanation. > > Proposal: > • Ops from a qp may be fanned out to available hw or sw engines and > processed in parallel, so each op must be independent. [Shally] Possible only if PMD support combination of SW and HW processing. Right? > • Order of ops on a qp must be maintained – ops should be dequeued in > same sequence they are enqueued. [Shally] If each op is independent then why do we need to maintain ordering. 
Since they're independent and can thus be processed in parallel, they may well complete out of order and be available for dequeue as soon as they are done. Serializing them will limit HW throughput capability. And I can envision that some apps may not care about ordering, only completion. So I would suggest that an application which needs ordering should tag each op with some id or serial number in the op user_data area to identify enqueue order, OR we may add a flag in the enqueue_burst() API to enforce serialized dequeuing, if that is a hard requirement for anyone.

> • Stateless and stateful ops can be enqueued on the same qp
> • Stateless and stateful ops can be enqueued in the same burst
> • Only 1 op at a time may be enqueued to the qp from any stateful stream.
> • A burst can have multiple stateful ops, but each must be from a different stream.
> • All ops will have a session attached – this will only contain immutable data which can be used by many ops, devices and or drivers at the same time.
> • All stateful ops will have a stream attached for maintaining state and history, this can only be used by one op at a time.

[Shally] So, you mean a single enqueue_burst() *can* carry multiple streams, i.e. this is allowed both within a burst and on a qp (say, when multiple threads call enqueue_burst() on the same qp):

---------------------------------------------------------------------------------------------------------
enqueue_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
---------------------------------------------------------------------------------------------------------

where op1, op2 ... op5 all belong to *different* streams. Op3 can be stateless/stateful depending upon its op_type value, and each op can have the *same or different* session. If I understand this right, then yes, it looks good to me. However, this also brings up one minor point for discussion, but I would wait to initiate that until we close on the current open points.

Thanks
Shally

> Code artefacts:
>
> enum rte_comp_op_type {
>     RTE_COMP_OP_STATELESS,
>     RTE_COMP_OP_STATEFUL
> }
>
> Add following to rte_comp_op:
> enum rte_comp_op_type op_type;
> void * stream_private;
> /* location where PMD maintains stream state – only required if op_type is STATEFUL, else set to NULL */
>
> As size of stream data will vary depending on PMD, each PMD or device should allocate & manage its own mempool. Associated APIs are:
> rte_comp_stream_create(uint8_t dev_id, rte_comp_session *sess, void ** stream);
> /* This should alloc a stream from the device’s mempool and initialise it. This handle will be passed to the PMD with every op in the stream. Q. Should qp_id also be added, with constraint that all ops in the same stream should be sent to the same qp? */
> rte_comp_stream_free(uint8_t dev_id, void * stream);
> /* This should clear the stream and return it to the device’s mempool */
>
> All ops are enqueued/dequeued to device & qp using same rte_compressdev_enqueue_burst()/…dequeue_burst;
>
> Re flush flags, stateful stream would start with op.flush = NONE or SYNC and end with FULL or FINAL
> STATELESS ops would just use either FULL or FINAL
>
> Let me know if you want to set up a meeting - it might be a more effective way to arrive at an API that works for all PMDs.
>
> I'll send out a v3 today with above plus updates based on all the other feedback.
>
> Regards,
> Fiona
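[Shally] One last thing, to confirm we read the proposed artefacts the same way, here is a rough usage sketch as I understand them. All names here (op_type, stream_private, the field carrying the session, the enqueue call) are provisional per this thread, not a final API:

    /* Illustrative sketch only: every name here is a provisional one from this
     * thread. Mempool/session setup, dequeue and error paths are omitted.
     */
    static void
    submit_op(uint8_t dev_id, uint16_t qp_id,
              struct rte_comp_session *sess, struct rte_comp_op *op, int stateful)
    {
        op->session = sess;   /* session attached to op (exact field name TBD);
                               * immutable, shareable across ops */
        op->op_type = stateful ? RTE_COMP_OP_STATEFUL : RTE_COMP_OP_STATELESS;

        if (stateful)
            /* one stream per flow: holds state/history, used by one op at a time */
            rte_comp_stream_create(dev_id, sess, &op->stream_private);
        else
            op->stream_private = NULL;

        /* stateless and stateful ops share the same enqueue/dequeue calls;
         * at most one op in flight per stateful stream at any time
         */
        rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1);
    }

On dequeue, OUT_OF_SPACE would then be handled per the A.1/B.1 summary earlier in this mail: for a stateless op (produced=consumed=0) the whole request is resubmitted with a bigger output buffer; for a stateful op the produced output is drained and the op is resubmitted from the consumed offset with the same stream_private attached.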