> >> > > >> > 9) Is anyone else facing this problem?
> >Any data on x86?
> >
> [Wang, Yipeng]
> I tried Jerin's tests on x86. By default, l3fwd on x86 uses lookup_bulk
> and SIMD instructions, so there is no obvious throughput drop in either the
> hit or the miss case (for the hit case, there is about a 2.5% drop though).

Do you mean that if the test case has 'hit only' lookups, there is a 2.5% drop?
>
> I manually changed l3fwd to do single-packet lookup instead of bulk. For the
> hit case there is no throughput drop.
> For the miss case, there is a 10% throughput drop.
>
> I dug into it. As expected, the atomic load indeed translates to a regular
> mov on x86.
> But because of the reordering of the instructions, the compiler (gcc 5.4)
> cannot unroll the for loop into switch-case-like assembly as before.
> So I believe the performance drop on x86 is because the compiler
> cannot optimize the code as well as it did previously.

Thank you. This makes sense.

> I guess this is a totally different reason from why your performance drops
> on non-TSO machines. On a non-TSO machine, the excessive number of
> atomic loads probably causes a lot of overhead.
>
> A quick fix I found useful on x86 is to read all the indexes together. I am
> no expert on the use of atomic intrinsics, but I assume that adding a fence
> should still maintain the correct ordering?
> -	uint32_t key_idx;
> +	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
>  	void *pdata;
>  	struct rte_hash_key *k, *keys = h->key_store;
>
> +	memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
> +	__atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
>  	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> -		key_idx = __atomic_load_n(&bkt->key_idx[i],
> -					  __ATOMIC_ACQUIRE);
> -		if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
> +		if (bkt->sig_current[i] == sig && key_idx[i] !=
> +		    EMPTY_SLOT){

Thank you for your suggestion. I tried it on Arm platforms; unfortunately, it
did not help. However, the idea of reducing the number of memory orderings
addresses the problem. I have worked on a hacked patch for the last couple of
days. I have tested it with the L3FWD data set, and it provides good benefits.
I have sent it to you and Jerin. Any feedback will be helpful.

>
> Yipeng
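
For reference, the "read all indexes, then one fence" idea from the quoted patch can be sketched as a standalone function. This is only an illustration under stated assumptions, not the rte_hash code: `struct bucket`, `BUCKET_ENTRIES` and `bucket_lookup()` are hypothetical, simplified stand-ins, and the key comparison that the real lookup performs after a signature match is omitted. One deliberate difference from the patch: the indexes are read with relaxed atomic loads rather than `memcpy()`, because under the C11 model an acquire fence only orders against atomic loads that precede it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKET_ENTRIES 8 /* hypothetical; stands in for RTE_HASH_BUCKET_ENTRIES */
#define EMPTY_SLOT     0

/* Simplified bucket for illustration only, not the real rte_hash layout. */
struct bucket {
	uint16_t sig_current[BUCKET_ENTRIES];
	uint32_t key_idx[BUCKET_ENTRIES];
};

/*
 * Return the key index of the first entry whose signature matches,
 * or EMPTY_SLOT if there is none.
 */
static uint32_t
bucket_lookup(const struct bucket *bkt, uint16_t sig)
{
	uint32_t key_idx[BUCKET_ENTRIES];
	unsigned int i;

	assert(bkt != NULL);

	/* Snapshot every index with a relaxed atomic load. */
	for (i = 0; i < BUCKET_ENTRIES; i++)
		key_idx[i] = __atomic_load_n(&bkt->key_idx[i],
					     __ATOMIC_RELAXED);

	/* One acquire fence orders all of the loads above at once. */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);

	for (i = 0; i < BUCKET_ENTRIES; i++) {
		if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT)
			return key_idx[i];
	}

	return EMPTY_SLOT;
}
```

Compared with one `__ATOMIC_ACQUIRE` load per entry, this issues a single ordering operation per bucket. As noted above, that alone did not help on Arm, so treat it as a description of the idea rather than a recommended fix.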