>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mans...@nxp.com]
>Sent: 22 February 2018 01:06
>To: Trahe, Fiona <fiona.tr...@intel.com>; Verma, Shally <shally.ve...@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; Gupta, Ashish <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.gua...@intel.com>; Challa, Mahipal <mahipal.cha...@cavium.com>; Jain, Deepak K <deepak.k.j...@intel.com>; Hemant Agrawal <hemant.agra...@nxp.com>; Roy Pledge <roy.ple...@nxp.com>; Youri Querry <youri.querr...@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>On 2/21/2018 9:35 AM, Trahe, Fiona wrote:
>> Hi Ahmed, Shally,
>>
>>> -----Original Message-----
>>> From: Ahmed Mansour [mailto:ahmed.mans...@nxp.com]
>>> Sent: Tuesday, February 20, 2018 7:56 PM
>>> To: Verma, Shally <shally.ve...@cavium.com>; Trahe, Fiona <fiona.tr...@intel.com>; dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <narayanaprasad.athr...@cavium.com>; Gupta, Ashish <ashish.gu...@cavium.com>; Sahu, Sunila <sunila.s...@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.gua...@intel.com>; Challa, Mahipal <mahipal.cha...@cavium.com>; Jain, Deepak K <deepak.k.j...@intel.com>; Hemant Agrawal <hemant.agra...@nxp.com>; Roy Pledge <roy.ple...@nxp.com>; Youri Querry <youri.querr...@nxp.com>
>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>
>>> /// snip ///
>>>>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should parse through all related chunks of source data, making its mbuf-chain, and enqueue it for stateless processing.
>>>>>>>>>>>>>>>>>> However, if it needs to break the data into several enqueue_burst() calls, then an expected call flow would be something like:
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in the context of stateful op processing, to reflect that if a stream is broken into chunks, then each chunk should be submitted as one op at-a-time with type = STATEFUL, and needs to be dequeued first before the next chunk is enqueued.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish the response in exception cases?
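For reference, the enqueue->dequeue->enqueue sequence quoted above would look roughly like this in application code. This is a minimal sketch only: struct comp_op, FLUSH_NONE/FLUSH_FINAL, enqueue_burst() and dequeue_burst() are placeholder names standing in for whatever the final API defines, not confirmed symbols.

```c
#include <stdint.h>

/* Placeholder definitions for this sketch; the real op structure,
 * flush flags and burst calls are whatever the final API specifies. */
enum flush_flag { FLUSH_NONE, FLUSH_FINAL };

struct comp_op {
	enum flush_flag flush;
	int status;	/* e.g. SUCCESS, OUT_OF_SPACE_RECOVERABLE, ... */
	/* src/dst mbufs, offsets, stream handle etc. omitted */
};

uint16_t enqueue_burst(void *qp, struct comp_op **ops, uint16_t nb);
uint16_t dequeue_burst(void *qp, struct comp_op **ops, uint16_t nb);

/* One stream submitted as a sequence of stateful chunks: exactly one
 * op in flight per stream, each dequeued before the next is enqueued. */
static void
compress_stream(void *qp, struct comp_op **chunks, int nb_chunks)
{
	int i;

	for (i = 0; i < nb_chunks; i++) {
		struct comp_op *op = chunks[i];
		struct comp_op *done;

		/* Only the last chunk of the stream carries the final flush. */
		op->flush = (i == nb_chunks - 1) ? FLUSH_FINAL : FLUSH_NONE;

		while (enqueue_burst(qp, &op, 1) == 0)
			; /* queue full: retry */

		/* Stateful rule under discussion: dequeue before next enqueue. */
		while (dequeue_burst(qp, &done, 1) == 0)
			; /* poll until this chunk completes */

		/* check done->status here before submitting the next chunk */
	}
}
```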
>>>>>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a case is independent of the others, i.e. they belong to different streams altogether. Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single burst by passing them as an ops array, but later found that not-so-useful for PMD handling for various reasons. You may please refer to the RFC v1 doc review comments for the same.
>>>>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, since each op needs the state of the previous one, only one op can be processed at a time; to allow more than 1 op to be in flight at a time would force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE conditions you mention.
>>>>>>>>>>>>> [Ahmed] But we are putting the ops on qps, which would make them sequential. Handling OUT_OF_SPACE conditions would be a little bit more complex, but doable.
>>>>>>>>>>>> [Fiona] In my opinion this is not doable, and could be very inefficient. There may be many streams. The PMD would have to have an internal queue per stream so it could adjust the next src offset and length in the OUT_OF_SPACE case. And this may ripple back through all subsequent ops in the stream, as each source len is increased and its dst buffer is not big enough.
>>>>>>>>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling: the caller would still need to adjust the src length/output buffer as you say. The PMD cannot handle OUT_OF_SPACE internally. After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until it gets explicit confirmation from the caller to continue working on this stream. Any ops received by the PMD should be returned to the caller with status STREAM_PAUSED, since the caller did not explicitly acknowledge that it has solved the OUT_OF_SPACE issue. These semantics can be enabled by adding a new function to the API, perhaps stream_resume(). This allows the caller to indicate that it acknowledges that it has seen the issue, and this op should be used to resolve the issue. Implementations that do not support this mode of use can push back immediately after one op is in flight. Implementations that support this use mode can allow many ops from the same session.
>>>>>>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still say it would add an overhead to PMDs, especially if it is expected to work closer to HW (which I think is the case with a DPDK PMD). Though your approach is doable, why can't this all be in a layer above the PMD? i.e.
>>>>>>>>>> a layer above the PMD can either pass one op at a time with burst size = 1, OR can make a chained mbuf of input and output and pass that as one op. Is it just to ease the application's chained-mbuf burden, or do you see any performance/use-case impacting aspect also?
>>>>>>>>>>
>>>>>>>>>> If it is in a context where each op belongs to a different stream in a burst, then why do we need stream_pause and resume? It is an expectation from the app to pass a larger output buffer (consumed + 1) from the next call onwards, as it has already seen OUT_OF_SPACE.
>>>>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD implementation rejects all ops that belong to a stream that has entered "RECOVERABLE" state for one reason or another. The caller must acknowledge explicitly that it has received news of the problem before the PMD allows this stream to exit "RECOVERABLE" state. I agree with you that implementing this functionality in the software layer above the PMD is a bad idea, since the latency reductions are lost.
>>>>>>> [Shally] Just reiterating, I rather meant it the other way around, i.e. I see it as easier to put all such complexity in a layer above the PMD.
>>>>>>>> This setup is useful in latency-sensitive applications where the latency of buffering multiple ops into one op is significant. We found latency makes a significant difference in search applications, where the PMD competes with software decompression.
>>>>>> [Fiona] I see, so when all goes well you get best-case latency, but when out-of-space occurs latency will probably be worse.
>>>>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a rare occurrence. Recovering from it should take similar time to synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in both sync and async use. The caller can fix up the op and send it back to the PMD to continue work, just as would be done in sync. Nonetheless, the added complexity is not justifiable if out-of-space is very common, since the recoverable state will be the limiting factor that forces synchronicity.
>>>>>>>>> [Fiona] I still have concerns with this and would not want to support it in our PMD. To make sure I understand: you want to send a burst of ops, with several from the same stream. If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any subsequent ops in that stream. Should it return them in a dequeue_burst() with status still NOT_PROCESSED? Or somehow drop them? How? While still processing ops from other streams.
>>>>>>>> [Ahmed] This is exactly correct. It should return them with NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>>>>> As we want to offload each op to hardware with as little CPU processing as possible, we would not want to open up each op to see which stream it's attached to and make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
>>>>>>>> [Ahmed] I think I might have missed your point here, but I will try to answer. There is no need to "cushion" ops in DPDK.
>>>>>>>> DPDK should send ops to the PMD, and the PMD should reject them until stream_continue() is called. The next op to be sent by the user will have a special marker in it to inform the PMD to continue working on this stream. Alternatively the DPDK layer can be made "smarter" to fail during the enqueue by checking the stream and its state, but like you say this adds additional CPU overhead during the enqueue.
>>>>>>>> I am curious. In a simple synchronous use case, how do we prevent users from putting multiple ops in flight that belong to a single stream? Do we just currently say it is undefined behavior? Otherwise we would have to check the stream and incur the CPU overhead.
>>>>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on the data path in the DPDK model we expect good behaviour and don't have to error check for things like this.
>>>>> [Ahmed] This makes sense. We also assume good behavior.
>>>>>> In our PMD, if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then build and send those messages. If we found an op from a stream which already had one in flight, we'd have to hold that back, store it in a sw stream-specific holding queue, and only send 19 to hw. We cannot send multiple ops from the same stream to the hw, as it fans them out and does them in parallel. Once the enqueue_burst() returns, there is no processing context which would spot that the first has completed and send the next op to the hw. On a dequeue_burst() we would spot this, and in that context could process the next op in the stream. On out of space, instead of processing the next op we would have to transfer all unprocessed ops from the stream to the dequeue result. Some parts of this are doable, but it seems likely to add a lot more latency; we'd need to add extra threads and timers to move ops from the sw queue to the hw q to get any benefit, and these constructs would add context switching and CPU cycles. So we prefer to push this responsibility to above the API, and it can achieve similar.
>>>>> [Ahmed] I see what you mean. Our workflow is almost exactly the same with our hardware, but the fanning out is done by the hardware based on the stream, and ops that belong to the same stream are never allowed to go out of order. Otherwise the data would be corrupted. Likewise the hardware is responsible for checking the state of the stream and returning frames as NOT_PROCESSED to the software.
>>>>>>>>> Maybe we could add a capability if this behaviour is important for you? e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS? Our PMD would set this to 0, and expect no more than one op from a stateful stream to be in flight at any time.
>>>>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do not have to add extra checking for unsupported cases.
>>>>>>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
>>>>>>> - a burst *carries only one stream*, and all ops are then assumed to belong to that stream? (please note, here the burst is not carrying more than one stream)
>>>>> [Ahmed] No. In this use case the caller sets up an op and enqueues a single op.
>>>>> Then, before the response comes back from the PMD, the caller enqueues a second op on the same stream.
>>>>>>> - PMD will submit one op at a time to HW?
>>>>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the HW. I used DPDK to mean the software implementation that talks to the hardware. The software will submit all ops immediately. The hardware has to figure out what to do with the ops depending on what stream they belong to.
>>>>>>> - if processed successfully, push it back to the completion queue with status = SUCCESS. If it failed or ran into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/OUT_OF_SPACE_RECOVERABLE, and the rest with status = NOT_PROCESSED, and return with enqueue count = total # of ops originally submitted with the burst?
>>>>> [Ahmed] This is exactly what I had in mind. All ops will be submitted to the HW. The HW will put all of them on the completion queue with the correct status, exactly as you say.
>>>>>>> - app assumes all have been enqueued, so it goes and dequeues all ops
>>>>>>> - on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops with a call to the stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others returned as NOT_PROCESSED, with updated input and output buffers?
>>>>> [Ahmed] Correct, this is what we do today in our proprietary API.
>>>>>>> - repeat until *all* are dequeued with status = SUCCESS, or *any* with status = FAILURE? If at any time a failure is seen, does the app start the whole processing all over again, or just drop this burst?!
>>>>> [Ahmed] The app has the choice on how to proceed. If the issue is recoverable, then the application can continue this stream from where it stopped. If the failure is unrecoverable, then the application should first fix the problem and start from the beginning of the stream.
>>>>>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream(), which will be functional under feature flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or a better name is SUPPORT_ENQUEUE_SINGLE_STREAM?!
>>>>> [Ahmed] The main advantage in async use is lost if we force all related ops to be in the same burst. If we do that, then we might as well merge all the ops into one op. That would reduce the overhead. The use mode I am proposing is only useful in cases where the data becomes available after the first enqueue occurred. I want to allow the caller to enqueue the second set of data as soon as it is available, regardless of whether or not the HW has already started working on the first op in flight.
>>>> [Shally] @ahmed, OK, seems I missed a point here. So, confirm the following for me:
>>>>
>>>> As per the current description in the doc, the expected stateful usage is:
>>>> enqueue (op1) --> dequeue(op1) --> enqueue(op2)
>>>>
>>>> but you're suggesting to allow an option to change it to
>>>>
>>>> enqueue(op1) --> enqueue(op2)
>>>>
>>>> i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls, without waiting to dequeue the previous ones, where the PMD supports it. So, no change to the current definition of a burst. It will still carry multiple streams, with each op belonging to a different stream?!
>>> [Ahmed] Correct. I guess a user could put two ops in the same burst that belong to the same stream.
>>> In that case it would be more efficient to merge the ops using scatter-gather. Nonetheless, I would not add checks in my implementation to limit that use. The hardware does not perceive a difference between ops that came in one burst and ops that came in two different bursts. To the hardware they are all ops. What matters is which stream each op belongs to.
>>>> If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to support it in the DPDK PMD layer; but our HW doesn't by default, and needs SW to back it. Given that, I also suggest enabling it under some feature flag.
>>>> However, it looks like an add-on, and if it doesn't change the current definition of a burst and the minimum expectation set on stateful processing described in this document, then IMO you can propose this feature as an incremental patch on the baseline version; in the absence of which, the application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
>>> [Ahmed] Makes sense. I was worried that there might be fundamental limitations to this mode of use in the API design. That is why I wanted to share this use mode with you guys and see if it can be accommodated using an incremental patch in the future.
>>>>>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one stream, or how this makes a difference, as there can be many enqueue_burst() calls done before a dequeue_burst(). Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops had been processed? This would turn it into a synchronous call, which isn't the intent.
>>>>> [Ahmed] Agreed, a blocking or even a buffering software layer that babysits the hardware does not fundamentally change the parameters of the system as a whole. It just moves workflow management complexity down into the DPDK software layer. Rather, there are real latency and throughput advantages (because of caching) that I want to expose.
>>
>> [Fiona] OK, so I think we've agreed that this can be an option, as long as it is not required of PMDs and is enabled under an explicit capability, named something like ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
>> @Ahmed, we'll leave it up to you to define the details. What's necessary is API text to describe the expected behaviour on any error conditions, the pause/resume API, whether an API is expected to clean up if resume doesn't happen and if there's any time limit on this, etc.
>> But I wouldn't expect any changes to the existing burst APIs, and all PMDs and applications must be able to handle the default behaviour, i.e. with this capability disabled. Specifically, even if a PMD has this capability, if an application ignores it and only sends one op at a time, and the PMD returns OUT_OF_SPACE_RECOVERABLE, the stream should not be in a paused state and the PMD should not wait for a resume() to handle the next op sent for that stream.
>> Does that make sense?
>[Ahmed] That makes sense. When this mode is enabled, then additional functions must be called to resume the work, even if only one op was in flight. When this mode is not enabled, then the PMD assumes that the caller will never enqueue a stateful op before receiving a response to the one that precedes it in a stream.
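Put concretely, the two modes agreed above might look roughly like this in application code. This is a sketch only: ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, stream_resume(), the status names and the fix_up_dst_buffer() helper are this thread's working vocabulary plus hypothetical glue, not a committed API.

```c
#include <stdint.h>

/* Placeholder declarations, as in the earlier sketch; the capability
 * flag and stream_resume() below are likewise proposals from this
 * thread, not final API. */
struct comp_op {
	int status;
	/* dst mbuf, src offset, stream handle etc. omitted */
};
enum {
	STATUS_SUCCESS,
	STATUS_NOT_PROCESSED,
	STATUS_OUT_OF_SPACE_RECOVERABLE,
};
uint16_t enqueue_burst(void *qp, struct comp_op **ops, uint16_t nb);
uint16_t dequeue_burst(void *qp, struct comp_op **ops, uint16_t nb);
void stream_resume(void *stream);		/* proposed pause/resume API */
void fix_up_dst_buffer(struct comp_op *op);	/* hypothetical helper */

#define ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS	(1ULL << 0)	/* hypothetical */

void
submit_two_chunks(void *qp, uint64_t feature_flags, void *stream,
		  struct comp_op *op1, struct comp_op *op2)
{
	if (feature_flags & ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS) {
		/* Optional mode: two ops of one stream in flight at once. */
		enqueue_burst(qp, &op1, 1);
		enqueue_burst(qp, &op2, 1);	/* no dequeue in between */

		while (dequeue_burst(qp, &op1, 1) == 0)
			;
		if (op1->status == STATUS_OUT_OF_SPACE_RECOVERABLE) {
			/* Stream is paused: op2 comes back NOT_PROCESSED and
			 * further ops are rejected until the application
			 * explicitly acknowledges the error. */
			while (dequeue_burst(qp, &op2, 1) == 0)
				;
			fix_up_dst_buffer(op1);
			stream_resume(stream);
			enqueue_burst(qp, &op1, 1);
			enqueue_burst(qp, &op2, 1);
		}
	} else {
		/* Default behaviour every PMD must support: strictly
		 * enqueue -> dequeue -> enqueue, one op in flight per
		 * stream, and no resume() is ever required. */
		enqueue_burst(qp, &op1, 1);
		while (dequeue_burst(qp, &op1, 1) == 0)
			;
		enqueue_burst(qp, &op2, 1);
		while (dequeue_burst(qp, &op2, 1) == 0)
			;
	}
}
```

Either way, an application that never enqueues a second stateful op before dequeuing the first should see identical behaviour on both kinds of PMD.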
[Shally] @ahmed, just to confirm on this:

>When this mode is not enabled then the PMD assumes that the caller will never enqueue a stateful op ...

I think what we want to ensure is the reverse of it, i.e. "if the mode is *enabled*, then the PMD should also assume that the caller can use the enqueue->dequeue->enqueue sequence for stateful processing, and that if, on dequeue, it discovers OUT_OF_SPACE_RECOVERABLE and calls enqueue() again to handle it, that should also be supported by the PMD". In a sense, an application written for one PMD which doesn't have this capability should also work with a PMD which has this capability.

>>>>> /// snip ///
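In code form, the portable recovery sequence Shally describes, which should behave the same whether or not the PMD advertises the capability, might look roughly as below. Same placeholder names as in the earlier sketches; grow_dst_and_advance_src() is a hypothetical helper, not final API.

```c
#include <stdint.h>

/* Placeholders again: op layout, status names, burst calls and the
 * buffer-resizing helper are illustrative, not final API. */
struct comp_op {
	int status;
	/* dst mbuf, consumed/produced counts, stream handle etc. omitted */
};
enum {
	STATUS_SUCCESS,
	STATUS_OUT_OF_SPACE_RECOVERABLE,
};
uint16_t enqueue_burst(void *qp, struct comp_op **ops, uint16_t nb);
uint16_t dequeue_burst(void *qp, struct comp_op **ops, uint16_t nb);
void grow_dst_and_advance_src(struct comp_op *op);	/* hypothetical */

/* Portable recovery: enqueue -> dequeue -> enqueue again on
 * OUT_OF_SPACE_RECOVERABLE, with no resume() call required. */
int
process_op(void *qp, struct comp_op *op)
{
	do {
		while (enqueue_burst(qp, &op, 1) == 0)
			;	/* queue full: retry */
		while (dequeue_burst(qp, &op, 1) == 0)
			;	/* poll for completion */

		if (op->status == STATUS_OUT_OF_SPACE_RECOVERABLE)
			/* App supplies a bigger dst buffer and moves the
			 * src offset past the bytes already consumed. */
			grow_dst_and_advance_src(op);
	} while (op->status == STATUS_OUT_OF_SPACE_RECOVERABLE);

	return op->status == STATUS_SUCCESS ? 0 : -1;
}
```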