Hello!

Hardware:
4-core Intel NUC, Core i7-8705G, Kaby Lake (32K L1, 8M L3)
05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection - 1 GbE

I am currently experimenting with a plug-in that processes the payload of a
packet to do some crypto (ECDSA verification). I have 2 nodes in my plug-in:
1) A hand-off node (pinned to core 3) that enqueues entire frames of packets
to worker nodes on 2 other worker cores (1 & 2) using a frame queue of size
64 (64 frames x 256 packets = 16,384 buffers); see the sketch after this
list.
2) A worker node that receives a frame from the hand-off node, loops through
all packets (no pre-fetching), and starts a batch verification of the data.
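
Roughly, the hand-off node boils down to the following (a sketch only: the
narf_* names are placeholders, and I am calling the stock
vlib_buffer_enqueue_to_thread() helper with the signature it has in the
19.x tree):

#include <vlib/vlib.h>

/* set up once at init time, e.g.
 *   narf_fq_index = vlib_frame_queue_main_init (narf_worker_node.index, 64);
 */
static u32 narf_fq_index;

#define NARF_FIRST_WORKER 1     /* worker threads on lcores 1 & 2 */
#define NARF_N_WORKERS    2

static uword
narf_handoff_fn (vlib_main_t * vm, vlib_node_runtime_t * node,
                 vlib_frame_t * frame)
{
  u32 *from = vlib_frame_vector_args (frame);
  u32 n_pkts = frame->n_vectors;
  u16 thread_indices[VLIB_FRAME_SIZE];
  u32 i, n_enq;

  /* spread the vector round-robin over the two worker threads */
  for (i = 0; i < n_pkts; i++)
    thread_indices[i] = NARF_FIRST_WORKER + (i % NARF_N_WORKERS);

  /* returns the number of buffers actually queued; the rest are
   * dropped when the frame queue is congested */
  n_enq = vlib_buffer_enqueue_to_thread (vm, narf_fq_index, from,
                                         thread_indices, n_pkts,
                                         1 /* drop on congestion */ );
  (void) n_enq;                 /* feeds the counters shown further down */
  return n_pkts;
}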

I have a benchmark where I feed packets from a PCAP file into the graph from
a remote NIC through DPDK (vfio-pci).
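
The NIC is bound to VPP through the usual startup.conf stanzas, roughly
(PCI address as in the hardware listing above):

cpu {
  main-core 0
  corelist-workers 1-3
}
dpdk {
  uio-driver vfio-pci
  dev 0000:05:00.0
}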

I am trying to understand the performance statistics from 'vppctl show run' for 
my worker nodes. Here's what it looks like when I start a benchmark:

-----------------------------------------------------------------
Thread 1 vpp_wk_0 (lcore 1)
Time *150.4*, average vectors/node 3.04, last 128 main loops 8.00 per node 256.00
vector rates in 0.0000e0, out 2.2369e4, drop 0.0000e0, punt 0.0000e0
Name             State        Calls     Vectors  Suspends    Clocks  Vectors/Call
.....
narf-worker      active     3219952     3364368         0  *1.33e5*        *1.04*
.....
-----------------------------------------------------------------

It reports an average of *133,000 clocks/packet*, and each vector contains
only 1 packet on average. I also have a memif interface as my next node in
the graph that timestamps verified packets on the other end. It averages
*~45,000 packets/second*.
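
As a sanity check (assuming the i7-8705G's ~3.1 GHz base clock): 2.2369e4
packets/s out per worker x 1.33e5 clocks/packet ≈ 2.97e9 clocks/s, i.e.
each worker core appears fully busy, so the low throughput really is
per-packet overhead rather than an idle core.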

If I stop the benchmark and restart it, without restarting VPP, I start to
see VPP batching more packets per vector (256, precisely), and the amortized
cost per packet drops significantly, to *65,900 clocks/packet*. This benches
*~90,000 packets/second*, as expected (packet processing is expected to be
2x once the batch size exceeds 64).

-----------------------------------------------------------------
Thread 1 vpp_wk_0 (lcore 1)
Time *401.1*, average vectors/node 256.00, last 128 main loops 0.00 per node 0.00
vector rates in 0.0000e0, out 4.6392e4, drop 0.0000e0, punt 0.0000e0
Name             State        Calls     Vectors  Suspends    Clocks  Vectors/Call
.....
narf-worker      active       72679    18605824         0  *6.59e4*      *256.00*
.....
-----------------------------------------------------------------
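
The restarted numbers are self-consistent too: 4.6392e4 packets/s x 6.59e4
clocks/packet ≈ 3.06e9 clocks/s per worker, again a saturated ~3 GHz core,
so the 2x gain comes from amortizing per-call overhead over 256-packet
vectors.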

I also have counters in my nodes that report the number of packets read off
the NIC vs. the number of packets actually processed (the difference being
the packets that could not be queued onto the hand-off frame queue).
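
They are plain per-node error counters, registered with the usual vlib
error-strings pattern and bumped with vlib_node_increment_counter() (again
a sketch, narf_* names are placeholders):

#define foreach_narf_handoff_error              \
_(RECEIVED, "packets received")                 \
_(ENQUEUED, "packets enqueued")

typedef enum
{
#define _(sym, str) NARF_HANDOFF_ERROR_##sym,
  foreach_narf_handoff_error
#undef _
    NARF_HANDOFF_N_ERROR,
} narf_handoff_error_t;

static char *narf_handoff_error_strings[] = {
#define _(sym, str) str,
  foreach_narf_handoff_error
#undef _
};

VLIB_REGISTER_NODE (narf_handoff_node) = {
  .function = narf_handoff_fn,
  .name = "narf-handoff",
  .vector_size = sizeof (u32),
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = NARF_HANDOFF_N_ERROR,
  .error_strings = narf_handoff_error_strings,
};

/* in the node function, after the enqueue:
 *   vlib_node_increment_counter (vm, node->node_index,
 *                                NARF_HANDOFF_ERROR_RECEIVED, n_pkts);
 *   vlib_node_increment_counter (vm, node->node_index,
 *                                NARF_HANDOFF_ERROR_ENQUEUED, n_enq);
 */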

From the initial bench, I see a large difference between the packets
enqueued and the packets received (because VPP is scheduling only 1 packet
per frame).
-----------------------------------------------------------------
   410616               narf-worker             NARF Transactions EC-verified
   410013               narf-worker             NARF Transactions EC-verified
   820697              narf-handoff             packets enqueued
  2383254              narf-handoff             packets received
-----------------------------------------------------------------

With a restarted benchmark, the two counters are almost the same.
-----------------------------------------------------------------
 77669136               narf-worker             NARF Transactions EC-verified
 77670480               narf-worker             NARF Transactions EC-verified
155340896              narf-handoff             packets enqueued
167930211              narf-handoff             packets received
-----------------------------------------------------------------
Considering that the frames received by the hand-off node are passed through
to the worker nodes as-is, it seems the thread polling the NIC through DPDK
is batching only 1 packet per frame initially. It then switches to 256
packets/frame when I restart the workload.

Is there something wrong with my setup?

-- Alok