> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of harish.patil at qlogic.com
> Sent: Sunday, November 08, 2015 7:40 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
>
> From: Harish Patil <harish.patil at qlogic.com>
>
> l3fwd app expects PMDs to return packets whose L2 header is
> 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
> intrinsics in the app. However, most of the protocol stacks expects
> packets such that its IP/L3 header be aligned on a 16-byte boundary.
>
> Based on the recommendations received on dpdk-dev, we are changing
> the l3fwd app to use _mm_loadu_si128()/_mm_storeu_si128() so that the
> address need not be 16-byte aligned and thereby preventing crash.
> We have tested that there is no performance impact due to this
> change.
>
> Signed-off-by: Harish Patil <harish.patil at qlogic.com>
> ---
Acked-by: Konstantin Ananyev <konstantin.ananyev at intel.com>

As a side note:
With a gcc build I do see a slight regression: ~1% for the 4 ports over 1 core test case.
However, I think the problem is not in the patch itself.
For some reason unknown to me, gcc treats aligned and unaligned load/store intrinsics
differently (at least in this particular case).
With aligned load/store it generates code that is pretty close to the source:
4 loads first, then 4 BLENDs, then 4 stores (with some scalar instructions in between, of course).
With the unaligned ones, gcc starts to interleave the load and the blend for the same register, so now it is:
load x0; blend x0; load x1; blend x1; ...
As if the source code were:

te[0] = _mm_loadu_si128(p[0]);
te[0] = _mm_blend_epi16(te[0], ve[0], MASK_ETH);
te[1] = _mm_loadu_si128(p[1]);
te[1] = _mm_blend_epi16(te[1], ve[1], MASK_ETH);
...

So the load latency is no longer hidden.
I tried it with different versions of gcc - same story for all of them.
Clang doesn't have this issue and generates similar code for both aligned and unaligned intrinsics.
The only fix I can think of is to put rte_compiler_barrier() just before the first blend intrinsic.
That helped: now there are no noticeable differences in the generated code or in the results before and after the patch.
So I suppose I'll have to submit a patch after yours to fix that problem.
Konstantin
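For illustration, here is a minimal standalone sketch of the barrier placement described above. It is not the l3fwd code and not the follow-up patch. The function name rewrite_eth_x4(), the BURST constant and the MASK_ETH value (0x3f, i.e. the low six 16-bit words) are assumptions made for this example, and compiler_barrier() is a local stand-in written to behave like rte_compiler_barrier() so that the snippet builds without DPDK headers. Whether the barrier actually changes the instruction schedule depends on the compiler version and flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <smmintrin.h>	/* SSE4.1: _mm_blend_epi16() */

#define BURST    4
#define MASK_ETH 0x3f	/* assumed value: take the low six 16-bit words from ve[] */

/* Local stand-in for rte_compiler_barrier(): stops the compiler from
 * moving memory accesses across this point; emits no instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical helper: overwrite the first 12 bytes of four packets with
 * the corresponding bytes of ve[], keeping bytes 12..15 of each packet. */
__attribute__((target("sse4.1")))
static void
rewrite_eth_x4(uint8_t *pkt[BURST], const __m128i ve[BURST])
{
	__m128i te[BURST];

	assert(pkt != NULL && ve != NULL);

	/* Issue all four unaligned loads first. */
	te[0] = _mm_loadu_si128((const __m128i *)pkt[0]);
	te[1] = _mm_loadu_si128((const __m128i *)pkt[1]);
	te[2] = _mm_loadu_si128((const __m128i *)pkt[2]);
	te[3] = _mm_loadu_si128((const __m128i *)pkt[3]);

	/* Keep the loads above from being moved down next to the blends. */
	compiler_barrier();

	te[0] = _mm_blend_epi16(te[0], ve[0], MASK_ETH);
	te[1] = _mm_blend_epi16(te[1], ve[1], MASK_ETH);
	te[2] = _mm_blend_epi16(te[2], ve[2], MASK_ETH);
	te[3] = _mm_blend_epi16(te[3], ve[3], MASK_ETH);

	/* Unaligned stores, so pkt[i] need not be 16-byte aligned. */
	_mm_storeu_si128((__m128i *)pkt[0], te[0]);
	_mm_storeu_si128((__m128i *)pkt[1], te[1]);
	_mm_storeu_si128((__m128i *)pkt[2], te[2]);
	_mm_storeu_si128((__m128i *)pkt[3], te[3]);
}
```

The barrier does not change the result, only what the compiler is allowed to reorder. To check its effect, compare the generated assembly with and without it (for example from gcc -O3 -S).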