On Thu, Feb 13, 2020 at 5:01 PM Doherty, Declan <declan.dohe...@intel.com> wrote:
>
> On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> > On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.co...@intel.com> wrote:
> >>
> >> Hi Jerin,
> >
> > Hi David,
> >
> >> Thanks for the comments. Please see replies below.
> >>
> >> Kind Regards,
> >> David
> >>
> >>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.co...@intel.com> wrote:
> >>>>
> >>>> Introduction
> >>>> ============
> >>>>
> >>>> This RFC introduces a new DPDK library, rte_accelerator.
> >>>>
> >>>> The main aim of this library is to provide a flexible and extensible
> >>>> way of combining one or more packet-processing functions into a single
> >>>> operation, thereby allowing these to be performed in parallel in
> >>>> optimized software libraries or in a hardware accelerator. These
> >>>> functions can include cryptography, compression and CRC/checksum
> >>>> calculation, while others can potentially be added in the future.
> >>>> Performing these functions in parallel as a single operation can
> >>>> enable a significant performance improvement.
> >>>>
> >>>>
> >>>> Background
> >>>> ==========
> >>>>
> >>>> There are a number of byte-wise operations which are present and
> >>>> common across many access network data-plane pipelines, such as
> >>>> Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other
> >>>> checksums etc. Some prototyping has been done at Intel in relation to
> >>>> the 01.org access-network-dataplanes project to prove that a
> >>>> significant performance improvement is possible when such byte-wise
> >>>> operations are combined into a single pass of packet data processing.
> >>>> This performance boost has been prototyped for both XGS-PON MAC
> >>>> data-plane and DOCSIS MAC data-plane pipelines.
> >>>
> >>>
> >>> Could you share the relative performance numbers to show the gain?
> >>
> >> [DC] As mentioned above, the main performance gains are when the packet
> >> processing operations can be combined into a single pass of the packet.
> >> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have
> >> been implemented in the AESNI MB library as single-pass operation chains.
> >>
> >> We have modified the dpdk-crypto-perf-tester as part of our prototyping to
> >> test the cases where:
> >> 1) each packet processing function is done as an independent stage (e.g.
> >>    calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher,
> >>    and a C function to calculate the BIP)
> >> 2) all packet processing functions are done as a single-pass operation in
> >>    AESNI MB through rte_cryptodev
> >>
> >> We see the following results for 1024 byte input frames from
> >> dpdk-crypto-perf-tester:
> >> - XGS-PON MAC (Crypto-CRC-BIP):
> >>   - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> >>   - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> >>   37% cycle reduction
> >>
> >> - DOCSIS MAC (Crypto-CRC):
> >>   - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> >>   - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> >>   20% cycle reduction
> >>
> >> Adding the accelerator API will allow vendors to gain the benefits of
> >> these cycle savings
> >
> > Numbers make sense. I have seen a similar performance improvement
> > when doing this in one pass with CPU instructions.
> >
> >
> >>>> - XGS-PON MAC: Crypto-CRC-BIP
> >>>>   - Order:
> >>>>     - Downstream: CRC, Encrypt, BIP
> >>>
> >>> I understand that if the chain has two operations then it may be
> >>> possible to have handcrafted SW code do both operations in one pass.
> >>> I understand the spec is agnostic about the number of passes required to
> >>> enable the xform, but to understand the SW/HW capability: in the above
> >>> case, "CRC, Encrypt, BIP", is it done in one pass in SW, three passes
> >>> in SW, or one pass using HW?
> >>
> >> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> >> library SW.
> >> However, this could also be performed as a single pass in a HW accelerator
> >
> > As a specification, cascading the xform chains makes sense.
> > Do we have any HW that supports chaining more than "two" xforms in
> > one pass?
> > i.e. a real chaining function where two blocks of HW work hand in hand
> > for chaining.
> > If none, it may be better to abstract it as a synchronous API (no
> > dequeue, no enqueue) for the CPU use case.
> >
>
> Were you thinking along the lines of a synchronous API option like that
> just introduced to cryptodev? i.e. something like
>
> uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
>                                  struct rte_accelerator_op ops[],
>                                  uint16_t nb_ops);
Yes. Maybe with a capability or preference so that the application can
denote its preferred usage model.
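
To make the capability/preference idea concrete, here is a minimal sketch of how it could sit next to a synchronous process call shaped like the rte_accelerator_process() prototype quoted above. Everything in it is an assumption for illustration only: the acc_* names, the two flag values and the struct layouts are hypothetical and not part of the RFC or of any DPDK release, and the byte-wise XOR is just a stand-in for a real single-pass Crypto-CRC-BIP chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical usage-model flags (illustrative only, not a DPDK API). */
#define ACC_CAP_ASYNC_ENQ_DEQ (1u << 0) /* enqueue/dequeue model, e.g. HW offload */
#define ACC_CAP_SYNC_PROCESS  (1u << 1) /* synchronous model, e.g. CPU-based PMD */

/* Hypothetical operation: one buffer in, one result and status out. */
struct acc_op {
    const uint8_t *data;
    uint32_t len;
    uint32_t result; /* placeholder for the output of the chained xforms */
    int status;      /* 0 on success */
};

/* Hypothetical context: what the device offers and what was selected. */
struct acc_ctx {
    uint32_t capa; /* usage models advertised by the device */
    uint32_t mode; /* usage model chosen for this context */
};

/*
 * Honour the application's preferred model if the device advertises it,
 * otherwise fall back to whichever model is available. Returns 0 if the
 * device advertises no usable model.
 */
static uint32_t acc_select_mode(uint32_t capa, uint32_t preferred)
{
    if (capa & preferred)
        return preferred;
    if (capa & ACC_CAP_SYNC_PROCESS)
        return ACC_CAP_SYNC_PROCESS;
    if (capa & ACC_CAP_ASYNC_ENQ_DEQ)
        return ACC_CAP_ASYNC_ENQ_DEQ;
    return 0;
}

/*
 * Synchronous processing: every op is complete when the call returns, so
 * there is no enqueue and no dequeue. Returns the number of ops processed,
 * or 0 if the context was not set up for the synchronous model.
 */
static uint16_t acc_process(struct acc_ctx *ctx, struct acc_op ops[],
                            uint16_t nb_ops)
{
    uint16_t i;

    assert(ctx != NULL);
    if (ctx->mode != ACC_CAP_SYNC_PROCESS)
        return 0;

    for (i = 0; i < nb_ops; i++) {
        uint8_t parity = 0;
        uint32_t j;

        /* Stand-in for the real single-pass chain: XOR of all bytes. */
        for (j = 0; j < ops[i].len; j++)
            parity ^= ops[i].data[j];

        ops[i].result = parity;
        ops[i].status = 0;
    }
    return nb_ops;
}
```

With this shape, an application that prefers enqueue/dequeue but runs on a device advertising only ACC_CAP_SYNC_PROCESS would get the synchronous model back from acc_select_mode() and could then call acc_process() directly. Whether the preference belongs in the context, the session or the device configuration is left open here.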