Hi Bruce,

I noticed that librte_distributor has a quite severe LLC miss problem when running on 16 cores, while on 8 cores there is no such problem. The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32 cores on 2 sockets.

The test case is distributor_perf_autotest, i.e. app/test/test_distributor_perf.c. The results were collected with:

    perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test -cff -n2 --no-huge

Note that the LLC miss rate stays the same with or without hugepages, so I will only show the --no-huge config.

With 8 cores, the LLC miss rate is OK:

    LLC-load-misses         26750
    LLC-loads            93979233
    LLC-store-misses       432263
    LLC-stores           69954746

That is a 0.028% load miss rate and a 0.62% store miss rate.

With 16 cores, the LLC miss rate is very high:

    LLC-load-misses      70263520
    LLC-loads           143807657
    LLC-store-misses     23115990
    LLC-stores           63692854

That is a 48.9% load miss rate and a 36.3% store miss rate.

Most of the load misses happen at the first line of rte_distributor_poll_pkt. I don't know where most of the store misses happen, because perf record on LLC-store-misses brings down my machine.

It is not obvious to me how this could happen: 8 cores are fine, but 16 cores are very bad. My guess is that 16 cores bring in more QPI transactions between sockets. Or do 16 cores produce a different LLC access pattern?

So I tried to reduce the padding inside union rte_distributor_buffer from 3 cache lines to 1 cache line:

    -	char pad[CACHE_LINE_SIZE*3];
    +	char pad[CACHE_LINE_SIZE];

And it does have an obvious effect:

    LLC-load-misses      53159968
    LLC-loads           167756282
    LLC-store-misses     29012799
    LLC-stores           63352541

Now it is a 31.69% load miss rate and a 45.80% store miss rate. The change lowers the load miss rate but raises the store miss rate. Both numbers are still very high, sadly. On the bright side, it decreases the time per burst and the time per packet.

The original version has:

    === Performance test of distributor ===
    Time per burst:  8013
    Time per packet: 250

And the patched version has:

    === Performance test of distributor ===
    Time per burst:  6834
    Time per packet: 213

I also tried a couple of other tricks.
For example, I added more idle loops in rte_distributor_get_pkt and made the rte_distributor_buffer thread-local to each worker core, but neither trick had any noticeable effect. These failures make me tend to believe that the high LLC miss rate is related to QPI or NUMA. However, my machine is not able to run perf on uncore QPI events, so this cannot be confirmed.

I cannot draw any conclusion or identify the root cause yet, but I suggest further study of this performance bottleneck in order to find a good solution.

thx & rgds,
-qinglai