>>> [Robert:]
>>> 1. The 82599 device supports up to 128 queues. Why do we see trouble
>>> with as few as 5 queues? What could limit the system (and one port
>>> controlled by 5+ cores) from receiving at line-rate without loss?
>>>
>>> 2. As far as we can tell, the RX path only touches the device
>>> registers when it updates a Receive Descriptor Tail register (RDT[n]),
>>> roughly every rx_free_thresh packets. Is there a big difference
>>> between one core doing this and N cores doing it 1/N as often?
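
For context, the receive path we have in mind looks roughly like the
sketch below. This is a simplified illustration, not the actual ixgbe
PMD code; the structure and field names (rx_queue, rx_ring, rdt_reg,
rx_free_thresh, IXGBE_RXD_STAT_DD) are modelled loosely on the driver
but are our own assumptions for the example.

/* Simplified sketch of a per-queue RX poll, modelled loosely on the
 * ixgbe PMD. Names and layouts are illustrative, not the real driver
 * structures. */

#include <stdint.h>

#define IXGBE_RXD_STAT_DD  0x01   /* "descriptor done", set by the NIC */

struct rx_desc {
    uint64_t addr;
    uint64_t status;              /* DD bit lives here after write-back */
};

struct rx_queue {
    volatile struct rx_desc *rx_ring;  /* descriptor ring in host memory */
    volatile uint32_t *rdt_reg;        /* MMIO pointer to RDT[n] */
    uint16_t ring_size;
    uint16_t next_to_check;
    uint16_t nb_freed;                 /* slots refilled since last RDT write */
    uint16_t rx_free_thresh;
};

static uint16_t rx_poll(struct rx_queue *rxq, uint16_t burst)
{
    uint16_t nb_rx = 0;

    while (nb_rx < burst) {
        volatile struct rx_desc *rxd = &rxq->rx_ring[rxq->next_to_check];

        /* Read of host memory (usually a cache hit when the queue is
         * empty); no PCIe transaction is generated by this check. */
        if (!(rxd->status & IXGBE_RXD_STAT_DD))
            break;

        /* ... hand the packet to the application, refill the slot ... */
        rxq->next_to_check = (rxq->next_to_check + 1) % rxq->ring_size;
        rxq->nb_freed++;
        nb_rx++;
    }

    /* The only device-register (MMIO) write on this path: bump the tail,
     * and only after rx_free_thresh descriptors have been refilled. */
    if (rxq->nb_freed >= rxq->rx_free_thresh) {
        *rxq->rdt_reg = rxq->next_to_check;
        rxq->nb_freed = 0;
    }

    return nb_rx;
}

As far as we can tell, the RDT update at the bottom is the only device
register this path ever touches.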
>> [Stephen:]
>> As you add cores, there is more traffic on the PCI bus from each core
>> polling. There is a fixed number of PCI bus transactions per second
>> possible, and each core is increasing the number of useless (empty)
>> transactions.

> [Bruce:]
> The polling for packets by the core should not be using PCI bandwidth
> directly, as the ixgbe driver (and other drivers) checks for the DD bit
> being set on the descriptor in memory/cache.

I was preparing to reply with the same point.

>> [Stephen:]
>> Why do you think adding more cores will help?

We're using run-to-completion and sometimes spend too many cycles per
packet. We realize that we need to move to an io+workers model, but we
wanted a better understanding of the dynamics involved here.

> [Bruce:]
> However, using an increased number of queues can use PCI bandwidth in
> other ways. For instance, with more queues you reduce the amount of
> descriptor coalescing that the NIC can do, so that instead of a single
> transaction writing 4 descriptors to one queue, the NIC may have to do
> 4 transactions, each writing 1 descriptor to a different queue. This is
> possibly why sending all traffic to a single queue works OK - the
> polling on the other queues is still being done, but has little effect.

Brilliant! This idea did not occur to me.

--
Thanks guys,
Robert
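
P.S. To make the coalescing point concrete for myself, here is a rough
back-of-envelope sketch. The sizes (16 bytes per descriptor write-back,
~24 bytes of PCIe transaction overhead per write) are assumptions for
illustration only, not measured or datasheet values.

/* Back-of-envelope: bus bytes for descriptor write-back with and
 * without coalescing. Sizes are illustrative assumptions. */
#include <stdio.h>

#define DESC_SIZE     16   /* assumed bytes per RX descriptor write-back */
#define TLP_OVERHEAD  24   /* assumed PCIe header/framing bytes per write */

int main(void)
{
    int descs = 4;

    /* All 4 descriptors coalesced into one write to a single queue. */
    int coalesced = TLP_OVERHEAD + descs * DESC_SIZE;

    /* Same 4 descriptors spread over 4 queues: one write-back each. */
    int spread = descs * (TLP_OVERHEAD + DESC_SIZE);

    printf("coalesced: %d bytes in 1 transaction\n", coalesced);      /* 88 */
    printf("spread:    %d bytes in %d transactions\n", spread, descs); /* 160 */
    return 0;
}

If those assumptions are anywhere near right, spreading the same packets
over 4 queues roughly doubles the descriptor write-back bytes and
quadruples the transaction count for the same amount of packet data.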