On Tue, Oct 13, 2015 at 01:24:22PM -0700, Alexander Duyck wrote:
> On 10/13/2015 07:47 AM, Sanford, Robert wrote:
> >>>> [Robert:]
> >>>> 1. The 82599 device supports up to 128 queues. Why do we see trouble
> >>>> with as few as 5 queues? What could limit the system (and one port
> >>>> controlled by 5+ cores) from receiving at line-rate without loss?
> >>>>
> >>>> 2. As far as we can tell, the RX path only touches the device
> >>>> registers when it updates a Receive Descriptor Tail register
> >>>> (RDT[n]), roughly every rx_free_thresh packets. Is there a big
> >>>> difference between one core doing this and N cores doing it 1/N as
> >>>> often?
> >>>
> >>> [Stephen:]
> >>> As you add cores, there is more traffic on the PCI bus from each core
> >>> polling. There is a fixed number of PCI bus transactions per second
> >>> possible. Each core is increasing the number of useless (empty)
> >>> transactions.
> >>
> >> [Bruce:]
> >> The polling for packets by the core should not be using PCI bandwidth
> >> directly, as the ixgbe driver (and other drivers) check for the DD bit
> >> being set on the descriptor in memory/cache.
> >
> > I was preparing to reply with the same point.
> >
> >>> [Stephen:] Why do you think adding more cores will help?
> >
> > We're using run-to-completion and sometimes spend too many cycles per
> > pkt. We realize that we need to move to an io+workers model, but wanted
> > a better understanding of the dynamics involved here.
> >
> >> [Bruce:]
> >> However, using an increased number of queues can use PCI bandwidth in
> >> other ways; for instance, with more queues you reduce the amount of
> >> descriptor coalescing that can be done by the NICs, so that instead of
> >> having a single transaction of 4 descriptors to one queue, the NIC may
> >> instead have to do 4 transactions, each writing 1 descriptor to 4
> >> different queues. This is possibly why sending all traffic to a single
> >> queue works ok - the polling on the other queues is still being done,
> >> but has little effect.
> >
> > Brilliant! This idea did not occur to me.
>
> You can actually make the throughput regression disappear by altering
> the traffic pattern you are testing with. In the past I have found that
> sending traffic in bursts, where 4 frames belong to the same queue
> before moving to the next one, essentially eliminated the dropped
> packets due to PCIe bandwidth limitations. The trick is that you need to
> have the Rx descriptor processing work in batches, so that you can get
> multiple descriptors processed for each PCIe read/write.

Yep, that's one test we used to prove the effect on descriptor
coalescing, and it does work a treat! Unfortunately, I think controlling
real-world input traffic that way could be ... em ... challenging? :-)
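To make the mechanism concrete, here is a minimal sketch of the two points
above: the core polls the DD bit in a descriptor that lives in host memory,
so an empty poll costs no PCIe transaction, and the tail register (RDT) is
written only once per rx_free_thresh freed descriptors. This is illustrative
pseudo-driver code, not the actual ixgbe PMD source (see ixgbe_rxtx.c for
the real thing); the struct names, field layout, and RX_FREE_THRESH value
are made-up stand-ins.

#include <stdint.h>

#define RX_DD          (1u << 0)   /* "descriptor done" bit, DMA-written by the NIC */
#define RX_RING_SIZE   512
#define RX_FREE_THRESH 32          /* batch size for tail-register updates */

struct rx_desc {                   /* descriptor ring entry in host memory */
    uint64_t addr;
    volatile uint32_t status;      /* NIC sets RX_DD here on completion */
    uint32_t length;
};

struct rx_queue {
    struct rx_desc ring[RX_RING_SIZE];
    uint16_t next_to_check;        /* software head */
    uint16_t nb_freed;             /* freed descs not yet reported to the NIC */
    volatile uint32_t *rdt_reg;    /* mapped RDT[n] register (MMIO) */
};

static uint16_t
rx_poll_burst(struct rx_queue *q, uint16_t max_pkts)
{
    uint16_t done = 0;

    while (done < max_pkts) {
        struct rx_desc *d = &q->ring[q->next_to_check];

        /* This read hits host memory/cache, not the device, so an
         * empty poll (DD clear) generates no PCIe traffic. */
        if (!(d->status & RX_DD))
            break;

        /* ... hand the packet off, refill the descriptor ... */
        d->status = 0;
        q->next_to_check = (q->next_to_check + 1) % RX_RING_SIZE;
        q->nb_freed++;
        done++;
    }

    /* The only device access on this path: one MMIO write per
     * RX_FREE_THRESH descriptors, not one per packet or per poll. */
    if (q->nb_freed >= RX_FREE_THRESH) {
        uint16_t tail = (q->next_to_check + RX_RING_SIZE - 1) % RX_RING_SIZE;
        *q->rdt_reg = tail;
        q->nb_freed = 0;
    }
    return done;
}

And, under the same disclaimer, a sketch of the generator-side pattern
Alexander describes - 4 frames to the same queue before moving on, so the
NIC can write back several descriptors to one queue in a single PCIe
transaction. flow_for_queue() and send_frame() are hypothetical test-rig
helpers, not DPDK APIs.

struct frame;
struct frame *flow_for_queue(int q);  /* pre-built pkt that RSS-hashes to queue q */
void send_frame(struct frame *f);

static void
generate_coalescing_friendly_load(int nb_queues)
{
    for (;;)
        for (int q = 0; q < nb_queues; q++)
            for (int i = 0; i < 4; i++)   /* burst of 4 per queue */
                send_frame(flow_for_queue(q));
}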
/Bruce