Hi Jerin, see reply below

> On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.co...@intel.com> wrote:
> >
> > Hi Jerin,
>
> Hi David,
>
> > Thanks for the comments. Please see replies below.
> >
> > Kind Regards,
> > David
> >
> > > On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.co...@intel.com> wrote:
> > > >
> > > > Introduction
> > > > ============
> > > >
> > > > This RFC introduces a new DPDK library, rte_accelerator.
> > > >
> > > > The main aim of this library is to provide a flexible and
> > > > extensible way of combining one or more packet-processing
> > > > functions into a single operation, thereby allowing these to be
> > > > performed in parallel in optimized software libraries or in a
> > > > hardware accelerator. These functions can include cryptography,
> > > > compression and CRC/checksum calculation, while others can
> > > > potentially be added in the future. Performing these functions in
> > > > parallel as a single operation can enable a significant
> > > > performance improvement.
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > There are a number of byte-wise operations which are present and
> > > > common across many access network data-plane pipelines, such as
> > > > Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other
> > > > checksums etc. Some prototyping has been done at Intel in relation
> > > > to the 01.org access-network-dataplanes project to prove that a
> > > > significant performance improvement is possible when such
> > > > byte-wise operations are combined into a single pass of packet
> > > > data processing. This performance boost has been prototyped for
> > > > both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> > >
> > > Could you share the relative performance numbers to show the gain?
> >
> > [DC] As mentioned above, the main performance gains are when the
> > packet processing operations can be combined into a single pass of the
> > packet.
> > Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS
> > MAC) have been implemented in the AESNI MB library as single pass
> > operation chains.
> >
> > We have modified the dpdk-crypto-perf-tester as part of our prototyping
> > to test the cases where:
> > 1) each packet processing function is done as an independent stage
> > (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for
> > cipher, and a C function to calculate the BIP)
> > 2) all packet processing functions done as a single-pass operation in
> > AESNI MB through rte_cryptodev
> >
> > We see the following results for 1024 byte input frames from
> > dpdk-crypto-perf-tester:
> > - XGS-PON MAC (Crypto-CRC-BIP):
> >   - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> >   - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> >   37% cycle reduction
> >
> > - DOCSIS MAC (Crypto-CRC):
> >   - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> >   - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> >   20% cycle reduction
> >
> > Adding the accelerator API will allow vendors gain the benefits of
> > these cycle savings
>
> Numbers make sense. I have seen a similar performance improvement doing
> in one pass with CPU instructions.
>
> > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > >   - Order:
> > > >     - Downstream: CRC, Encrypt, BIP
> > >
> > > I understand if the chain has two operations then it may possible to
> > > have handcrafted SW code to do both operations in one pass.
> > > I understand the spec is agnostic on a number of passes it does
> > > require to enable the xfrom but To understand the SW/HW capability,
> > > In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> > > or three passes in SW or one pass using HW?
> >
> > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> > library SW.
> > However, this could also be performed as a single pass in a HW
> > accelerator
>
> As a specification, cascading the xform chains make sense.
> Do we have any HW that does support chaining the xforms more than "two"
> in one pass?
> i.e real chaining function where two blocks of HWs work hand in hand for
> chaining.
> If none, it may be better to abstract as synonymous API(No dequeue, no
> enqueue) for the CPU use case.

[DC] I'm not aware of any HW that supports this at the moment, but that's
not to say it couldn't in the future. If anyone else has any examples
though, please feel free to share.

Regardless, I don't see why we would introduce a different API for SW
devices and HW devices. It would be up to each underlying PMD to decide
if/how it supports a particular accelerator xform chain, but from an
application's point of view, the accelerator API is always the same.
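
To make the "same API, PMD decides how" point concrete, here is a minimal sketch in plain C. It is not the rte_accelerator API from the RFC and uses no DPDK calls: every name in it (`xform`, `accel_op`, `accel_dev`, `accel_process`, the two example devices) is hypothetical, and the checksum and cipher are toy stand-ins (byte sum and XOR with a one-byte key) rather than a real CRC or AES. It only shows one application entry point in front of two backends, one that fuses CRC, Encrypt, BIP into a single walk over the buffer and one that runs each xform as a separate stage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; none come from the RFC or from DPDK. */
enum xform { XF_CRC, XF_CIPHER, XF_BIP };

struct accel_op {
    uint8_t *buf;     /* packet data, transformed in place */
    size_t   len;
    uint8_t  key;     /* toy cipher key (XOR), stand-in for a real cipher */
    uint32_t csum;    /* out: toy byte-sum over the plaintext, stand-in for CRC */
    uint8_t  bip;     /* out: XOR parity over the ciphertext */
    int      passes;  /* out: how many times the backend walked the buffer */
};

struct accel_dev {
    const char *name;
    /* Returns 0 on success, -1 if this backend does not support the chain. */
    int (*process)(const enum xform *chain, size_t n, struct accel_op *op);
};

static int is_crc_cipher_bip(const enum xform *c, size_t n)
{
    return n == 3 && c[0] == XF_CRC && c[1] == XF_CIPHER && c[2] == XF_BIP;
}

/* Backend A: supports only CRC, Encrypt, BIP, fused into one walk. */
static int single_pass_process(const enum xform *chain, size_t n,
                               struct accel_op *op)
{
    if (!is_crc_cipher_bip(chain, n))
        return -1;
    op->csum = 0;
    op->bip = 0;
    for (size_t i = 0; i < op->len; i++) {
        op->csum += op->buf[i];   /* checksum sees plaintext */
        op->buf[i] ^= op->key;    /* encrypt in place */
        op->bip ^= op->buf[i];    /* parity sees ciphertext */
    }
    op->passes = 1;
    return 0;
}

/* Backend B: accepts any chain, one walk over the buffer per xform. */
static int staged_process(const enum xform *chain, size_t n,
                          struct accel_op *op)
{
    op->csum = 0;
    op->bip = 0;
    op->passes = 0;
    for (size_t s = 0; s < n; s++) {
        for (size_t i = 0; i < op->len; i++) {
            switch (chain[s]) {
            case XF_CRC:    op->csum += op->buf[i]; break;
            case XF_CIPHER: op->buf[i] ^= op->key;  break;
            case XF_BIP:    op->bip ^= op->buf[i];  break;
            }
        }
        op->passes++;
    }
    return 0;
}

static const struct accel_dev single_pass_dev = { "single_pass", single_pass_process };
static const struct accel_dev staged_dev = { "staged", staged_process };

/* The one entry point the application calls, whatever sits underneath. */
static int accel_process(const struct accel_dev *dev, const enum xform *chain,
                         size_t n, struct accel_op *op)
{
    assert(dev != NULL && dev->process != NULL);
    return dev->process(chain, n, op);
}
```

An application would build the chain once and call `accel_process()` with either device. Both backends produce the same checksum, ciphertext and parity, and only the `passes` count (and, on a real device, the cycle cost) differs. A backend that cannot handle a given chain simply rejects it, which is the per-PMD capability decision described above.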