>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mans...@nxp.com]
>Sent: 22 February 2018 01:06
>To: Trahe, Fiona <fiona.tr...@intel.com>; Verma, Shally <shally.ve...@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; Gupta, Ashish <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.gua...@intel.com>; Challa, Mahipal <mahipal.cha...@cavium.com>; Jain, Deepak K <deepak.k.j...@intel.com>; Hemant Agrawal <hemant.agra...@nxp.com>; Roy Pledge <roy.ple...@nxp.com>; Youri Querry <youri.querr...@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>On 2/21/2018 9:35 AM, Trahe, Fiona wrote:
>> Hi Ahmed, Shally,
>>
>>> -----Original Message-----
>>> From: Ahmed Mansour [mailto:ahmed.mans...@nxp.com]
>>> Sent: Tuesday, February 20, 2018 7:56 PM
>>> To: Verma, Shally <shally.ve...@cavium.com>; Trahe, Fiona <fiona.tr...@intel.com>; dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; Gupta, Ashish <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.gua...@intel.com>; Challa, Mahipal <mahipal.cha...@cavium.com>; Jain, Deepak K <deepak.k.j...@intel.com>; Hemant Agrawal <hemant.agra...@nxp.com>; Roy Pledge <roy.ple...@nxp.com>; Youri Querry <youri.querr...@nxp.com>
>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>
>>> /// snip ///
>>>>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should parse through all related chunks of source data, making its mbuf-chain, and enqueue it for stateless processing.
>>>>>>>>>>>>>>>>>> However, if it needs to break the data into several enqueue_burst() calls, then an expected call flow would be something like:
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in the context of stateful op processing, to reflect that if a stream is broken into chunks, then each chunk should be submitted as one op at-a-time with type = STATEFUL, and needs to be dequeued first before the next chunk is enqueued.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish the response in exception cases?
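For reference, the enqueue->dequeue->enqueue sequence quoted above would look roughly like this in application code. This is a minimal sketch only: struct comp_op, FLUSH_NONE/FLUSH_FINAL, enqueue_burst() and dequeue_burst() are placeholder names standing in for whatever the final API defines, not confirmed symbols.

```c
#include <stdint.h>

/* Placeholder definitions for this sketch; the real op structure,
 * flush flags and burst calls are whatever the final API specifies. */
enum flush_flag { FLUSH_NONE, FLUSH_FINAL };

struct comp_op {
	enum flush_flag flush;
	int status;	/* e.g. SUCCESS, OUT_OF_SPACE_RECOVERABLE, ... */
	/* src/dst mbufs, offsets, stream handle etc. omitted */
};

uint16_t enqueue_burst(void *qp, struct comp_op **ops, uint16_t nb);
uint16_t dequeue_burst(void *qp, struct comp_op **ops, uint16_t nb);

/* One stream submitted as a sequence of stateful chunks: exactly one
 * op in flight per stream, each dequeued before the next is enqueued. */
static void
compress_stream(void *qp, struct comp_op **chunks, int nb_chunks)
{
	int i;

	for (i = 0; i < nb_chunks; i++) {
		struct comp_op *op = chunks[i];
		struct comp_op *done;

		/* Only the last chunk of the stream carries the final flush. */
		op->flush = (i == nb_chunks - 1) ? FLUSH_FINAL : FLUSH_NONE;

		while (enqueue_burst(qp, &op, 1) == 0)
			; /* queue full: retry */

		/* Stateful rule under discussion: dequeue before next enqueue. */
		while (dequeue_burst(qp, &done, 1) == 0)
			; /* poll until this chunk completes */

		/* check done->status here before submitting the next chunk */
	}
}
```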
>>>>>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a case is independent of the others, i.e. they belong to different streams altogether. Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single burst by passing them as an ops array, but later found that not-so-useful for PMD handling for various reasons. You may please refer to the RFC v1 doc review comments for the same.
>>>>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, since each op needs the state of the previous one, only one op can be processed at a time; to allow more than 1 op to be in flight at a time would force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE conditions you mention.
>>>>>>>>>>>>> [Ahmed] But we are putting the ops on qps, which would make them sequential. Handling OUT_OF_SPACE conditions would be a little bit more complex, but doable.
>>>>>>>>>>>> [Fiona] In my opinion this is not doable, and could be very inefficient. There may be many streams. The PMD would have to have an internal queue per stream so it could adjust the next src offset and length in the OUT_OF_SPACE case. And this may ripple back through all subsequent ops in the stream, as each source len is increased and its dst buffer is not big enough.
>>>>>>>>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling: the caller would still need to adjust the src length/output buffer as you say. The PMD cannot handle OUT_OF_SPACE internally. After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until it gets explicit confirmation from the caller to continue working on this stream. Any ops received by the PMD should be returned to the caller with status STREAM_PAUSED, since the caller did not explicitly acknowledge that it has solved the OUT_OF_SPACE issue. These semantics can be enabled by adding a new function to the API, perhaps stream_resume(). This allows the caller to indicate that it acknowledges that it has seen the issue, and this op should be used to resolve the issue. Implementations that do not support this mode of use can push back immediately after one op is in flight. Implementations that support this use mode can allow many ops from the same session.
>>>>>>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still say it would add an overhead to PMDs, especially if it is expected to work closer to HW (which I think is the case with a DPDK PMD). Though your approach is doable, why can't this all be in a layer above the PMD? i.e.
>>>>>>>>>> a layer above the PMD can either pass one op at a time with burst size = 1, OR can make a chained mbuf of input and output and pass that as one op. Is it just to ease the application's chained-mbuf burden, or do you see any performance/use-case impacting aspect also?
>>>>>>>>>>
>>>>>>>>>> If it is in a context where each op belongs to a different stream in a burst, then why do we need stream_pause and resume? It is an expectation from the app to pass a larger output buffer (consumed + 1) from the next call onwards, as it has already seen OUT_OF_SPACE.
>>>>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD implementation rejects all ops that belong to a stream that has entered "RECOVERABLE" state for one reason or another. The caller must acknowledge explicitly that it has received news of the problem before the PMD allows this stream to exit "RECOVERABLE" state. I agree with you that implementing this functionality in the software layer above the PMD is a bad idea, since the latency reductions are lost.
>>>>>>> [Shally] Just reiterating, I rather meant it the other way around, i.e. I see it as easier to put all such complexity in a layer above the PMD.
>>>>>>>> This setup is useful in latency-sensitive applications where the latency of buffering multiple ops into one op is significant. We found latency makes a significant difference in search applications, where the PMD competes with software decompression.
>>>>>> [Fiona] I see, so when all goes well you get best-case latency, but when out-of-space occurs latency will probably be worse.
>>>>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a rare occurrence. Recovering from it should take similar time to synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in both sync and async use. The caller can fix up the op and send it back to the PMD to continue work, just as would be done in sync. Nonetheless, the added complexity is not justifiable if out-of-space is very common, since the recoverable state will be the limiting factor that forces synchronicity.
>>>>>>>>> [Fiona] I still have concerns with this and would not want to support it in our PMD. To make sure I understand: you want to send a burst of ops, with several from the same stream. If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any subsequent ops in that stream. Should it return them in a dequeue_burst() with status still NOT_PROCESSED? Or somehow drop them? How? While still processing ops from other streams.
>>>>>>>> [Ahmed] This is exactly correct. It should return them with NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>>>>> As we want to offload each op to hardware with as little CPU processing as possible, we would not want to open up each op to see which stream it's attached to and make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
>>>>>>>> [Ahmed] I think I might have missed your point here, but I will try to answer. There is no need to "cushion" ops in DPDK.
>>>>>>>> DPDK should send ops to the PMD, and the PMD should reject them until stream_continue() is called. The next op to be sent by the user will have a special marker in it to inform the PMD to continue working on this stream. Alternatively the DPDK layer can be made "smarter" to fail during the enqueue by checking the stream and its state, but like you say this adds additional CPU overhead during the enqueue.
>>>>>>>> I am curious. In a simple synchronous use case, how do we prevent users from putting multiple ops in flight that belong to a single stream? Do we just currently say it is undefined behavior? Otherwise we would have to check the stream and incur the CPU overhead.
>>>>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on the data path in the DPDK model we expect good behaviour and don't have to error check for things like this.
>>>>> [Ahmed] This makes sense. We also assume good behavior.
>>>>>> In our PMD, if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then build and send those messages. If we found an op from a stream which already had one in flight, we'd have to hold that back, store it in a sw stream-specific holding queue, and only send 19 to hw. We cannot send multiple ops from the same stream to the hw, as it fans them out and does them in parallel. Once the enqueue_burst() returns, there is no processing context which would spot that the first has completed and send the next op to the hw. On a dequeue_burst() we would spot this, and in that context could process the next op in the stream. On out of space, instead of processing the next op we would have to transfer all unprocessed ops from the stream to the dequeue result. Some parts of this are doable, but it seems likely to add a lot more latency; we'd need to add extra threads and timers to move ops from the sw queue to the hw q to get any benefit, and these constructs would add context switching and CPU cycles. So we prefer to push this responsibility to above the API, and it can achieve similar.
>>>>> [Ahmed] I see what you mean. Our workflow is almost exactly the same with our hardware, but the fanning out is done by the hardware based on the stream, and ops that belong to the same stream are never allowed to go out of order. Otherwise the data would be corrupted. Likewise the hardware is responsible for checking the state of the stream and returning frames as NOT_PROCESSED to the software.
>>>>>>>>> Maybe we could add a capability if this behaviour is important for you? e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS? Our PMD would set this to 0, and expect no more than one op from a stateful stream to be in flight at any time.
>>>>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do not have to add extra checking for unsupported cases.
>>>>>>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
>>>>>>> - a burst *carries only one stream*, and all ops are then assumed to belong to that stream? (please note, here the burst is not carrying more than one stream)
>>>>> [Ahmed] No. In this use case the caller sets up an op and enqueues a single op.
>>>>> Then, before the response comes back from the PMD, the caller enqueues a second op on the same stream.
>>>>>>> - PMD will submit one op at a time to HW?
>>>>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the HW. I used DPDK to mean the software implementation that talks to the hardware. The software will submit all ops immediately. The hardware has to figure out what to do with the ops depending on what stream they belong to.
>>>>>>> - if processed successfully, push it back to the completion queue with status = SUCCESS. If it failed or ran into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/OUT_OF_SPACE_RECOVERABLE, and the rest with status = NOT_PROCESSED, and return with enqueue count = total # of ops originally submitted with the burst?
>>>>> [Ahmed] This is exactly what I had in mind. All ops will be submitted to the HW. The HW will put all of them on the completion queue with the correct status, exactly as you say.
>>>>>>> - app assumes all have been enqueued, so it goes and dequeues all ops
>>>>>>> - on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops with a call to the stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others returned as NOT_PROCESSED, with updated input and output buffers?
>>>>> [Ahmed] Correct, this is what we do today in our proprietary API.
>>>>>>> - repeat until *all* are dequeued with status = SUCCESS, or *any* with status = FAILURE? If at any time a failure is seen, does the app start the whole processing all over again, or just drop this burst?!
>>>>> [Ahmed] The app has the choice on how to proceed. If the issue is recoverable, then the application can continue this stream from where it stopped. If the failure is unrecoverable, then the application should first fix the problem and start from the beginning of the stream.
>>>>>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream(), which will be functional under feature flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or a better name is SUPPORT_ENQUEUE_SINGLE_STREAM?!
>>>>> [Ahmed] The main advantage in async use is lost if we force all related ops to be in the same burst. If we do that, then we might as well merge all the ops into one op. That would reduce the overhead. The use mode I am proposing is only useful in cases where the data becomes available after the first enqueue occurred. I want to allow the caller to enqueue the second set of data as soon as it is available, regardless of whether or not the HW has already started working on the first op in flight.
>>>> [Shally] @ahmed, OK, seems I missed a point here. So, confirm the following for me:
>>>>
>>>> As per the current description in the doc, the expected stateful usage is:
>>>> enqueue (op1) --> dequeue(op1) --> enqueue(op2)
>>>>
>>>> but you're suggesting to allow an option to change it to
>>>>
>>>> enqueue(op1) --> enqueue(op2)
>>>>
>>>> i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls, without waiting to dequeue the previous ones, where the PMD supports it. So, no change to the current definition of a burst. It will still carry multiple streams, with each op belonging to a different stream?!
>>> [Ahmed] Correct. I guess a user could put two ops in the same burst that belong to the same stream.
>>> In that case it would be more efficient to merge the ops using scatter-gather. Nonetheless, I would not add checks in my implementation to limit that use. The hardware does not perceive a difference between ops that came in one burst and ops that came in two different bursts. To the hardware they are all ops. What matters is which stream each op belongs to.
>>>> If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to support it in the DPDK PMD layer; but our HW doesn't by default, and needs SW to back it. Given that, I also suggest enabling it under some feature flag.
>>>> However, it looks like an add-on, and if it doesn't change the current definition of a burst and the minimum expectation set on stateful processing described in this document, then IMO you can propose this feature as an incremental patch on the baseline version; in the absence of which, the application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
>>> [Ahmed] Makes sense. I was worried that there might be fundamental limitations to this mode of use in the API design. That is why I wanted to share this use mode with you guys and see if it can be accommodated using an incremental patch in the future.
>>>>>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one stream, or how this makes a difference, as there can be many enqueue_burst() calls done before a dequeue_burst(). Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops had been processed? This would turn it into a synchronous call, which isn't the intent.
>>>>> [Ahmed] Agreed, a blocking or even a buffering software layer that babysits the hardware does not fundamentally change the parameters of the system as a whole. It just moves workflow management complexity down into the DPDK software layer. Rather, there are real latency and throughput advantages (because of caching) that I want to expose.
>>
>> [Fiona] OK, so I think we've agreed that this can be an option, as long as it is not required of PMDs and is enabled under an explicit capability, named something like ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
>> @Ahmed, we'll leave it up to you to define the details. What's necessary is API text to describe the expected behaviour on any error conditions, the pause/resume API, whether an API is expected to clean up if resume doesn't happen and if there's any time limit on this, etc.
>> But I wouldn't expect any changes to the existing burst APIs, and all PMDs and applications must be able to handle the default behaviour, i.e. with this capability disabled. Specifically, even if a PMD has this capability, if an application ignores it and only sends one op at a time, and the PMD returns OUT_OF_SPACE_RECOVERABLE, the stream should not be in a paused state and the PMD should not wait for a resume() to handle the next op sent for that stream.
>> Does that make sense?
>[Ahmed] That makes sense. When this mode is enabled, then additional functions must be called to resume the work, even if only one op was in flight. When this mode is not enabled, then the PMD assumes that the caller will never enqueue a stateful op before receiving a response to the one that precedes it in a stream.
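Put concretely, the two modes agreed above might look roughly like this in application code. This is a sketch only: ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, stream_resume(), the status names and the fix_up_dst_buffer() helper are this thread's working vocabulary plus hypothetical glue, not a committed API.

```c
#include <stdint.h>

/* Placeholder declarations, as in the earlier sketch; the capability
 * flag and stream_resume() below are likewise proposals from this
 * thread, not final API. */
struct comp_op {
	int status;
	/* dst mbuf, src offset, stream handle etc. omitted */
};
enum {
	STATUS_SUCCESS,
	STATUS_NOT_PROCESSED,
	STATUS_OUT_OF_SPACE_RECOVERABLE,
};
uint16_t enqueue_burst(void *qp, struct comp_op **ops, uint16_t nb);
uint16_t dequeue_burst(void *qp, struct comp_op **ops, uint16_t nb);
void stream_resume(void *stream);		/* proposed pause/resume API */
void fix_up_dst_buffer(struct comp_op *op);	/* hypothetical helper */

#define ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS	(1ULL << 0)	/* hypothetical */

void
submit_two_chunks(void *qp, uint64_t feature_flags, void *stream,
		  struct comp_op *op1, struct comp_op *op2)
{
	if (feature_flags & ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS) {
		/* Optional mode: two ops of one stream in flight at once. */
		enqueue_burst(qp, &op1, 1);
		enqueue_burst(qp, &op2, 1);	/* no dequeue in between */

		while (dequeue_burst(qp, &op1, 1) == 0)
			;
		if (op1->status == STATUS_OUT_OF_SPACE_RECOVERABLE) {
			/* Stream is paused: op2 comes back NOT_PROCESSED and
			 * further ops are rejected until the application
			 * explicitly acknowledges the error. */
			while (dequeue_burst(qp, &op2, 1) == 0)
				;
			fix_up_dst_buffer(op1);
			stream_resume(stream);
			enqueue_burst(qp, &op1, 1);
			enqueue_burst(qp, &op2, 1);
		}
	} else {
		/* Default behaviour every PMD must support: strictly
		 * enqueue -> dequeue -> enqueue, one op in flight per
		 * stream, and no resume() is ever required. */
		enqueue_burst(qp, &op1, 1);
		while (dequeue_burst(qp, &op1, 1) == 0)
			;
		enqueue_burst(qp, &op2, 1);
		while (dequeue_burst(qp, &op2, 1) == 0)
			;
	}
}
```

Either way, an application that never enqueues a second stateful op before dequeuing the first should see identical behaviour on both kinds of PMD.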
[Shally] @ahmed, just to confirm on this:

>When this mode is not enabled then the PMD assumes that the caller will never enqueue a stateful op ...

I think what we want to ensure is the reverse of it, i.e. "if the mode is *enabled*, then the PMD should also assume that the caller can use the enqueue->dequeue->enqueue sequence for stateful processing, and that if, on dequeue, it discovers OUT_OF_SPACE_RECOVERABLE and calls enqueue() again to handle it, that should also be supported by the PMD". In a sense, an application written for one PMD which doesn't have this capability should also work with a PMD which has this capability.

>>>>> /// snip ///
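In code form, the portable recovery sequence Shally describes, which should behave the same whether or not the PMD advertises the capability, might look roughly as below. Same placeholder names as in the earlier sketches; grow_dst_and_advance_src() is a hypothetical helper, not final API.

```c
#include <stdint.h>

/* Placeholders again: op layout, status names, burst calls and the
 * buffer-resizing helper are illustrative, not final API. */
struct comp_op {
	int status;
	/* dst mbuf, consumed/produced counts, stream handle etc. omitted */
};
enum {
	STATUS_SUCCESS,
	STATUS_OUT_OF_SPACE_RECOVERABLE,
};
uint16_t enqueue_burst(void *qp, struct comp_op **ops, uint16_t nb);
uint16_t dequeue_burst(void *qp, struct comp_op **ops, uint16_t nb);
void grow_dst_and_advance_src(struct comp_op *op);	/* hypothetical */

/* Portable recovery: enqueue -> dequeue -> enqueue again on
 * OUT_OF_SPACE_RECOVERABLE, with no resume() call required. */
int
process_op(void *qp, struct comp_op *op)
{
	do {
		while (enqueue_burst(qp, &op, 1) == 0)
			;	/* queue full: retry */
		while (dequeue_burst(qp, &op, 1) == 0)
			;	/* poll for completion */

		if (op->status == STATUS_OUT_OF_SPACE_RECOVERABLE)
			/* App supplies a bigger dst buffer and moves the
			 * src offset past the bytes already consumed. */
			grow_dst_and_advance_src(op);
	} while (op->status == STATUS_OUT_OF_SPACE_RECOVERABLE);

	return op->status == STATUS_SUCCESS ? 0 : -1;
}
```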