--- On Tue, 12/4/12, Bruce Evans <b...@optusnet.com.au> wrote:

> From: Bruce Evans <b...@optusnet.com.au>
> Subject: Re: Latency issues with buf_ring
> To: "Andre Oppermann" <opperm...@networx.ch>
> Cc: "Adrian Chadd" <adr...@freebsd.org>, "Barney Cordoba"
>     <barney_cord...@yahoo.com>, "John Baldwin" <j...@freebsd.org>,
>     freebsd-net@FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
>
> On Tue, 4 Dec 2012, Andre Oppermann wrote:
>
> > For most if not all ethernet drivers from 100Mbit/s the TX DMA rings
> > are so large that buffering at the IFQ level doesn't make sense anymore
> > and only adds latency.
>
> I found sort of the opposite for bge at 1Gbps.  Most or all bge NICs
> have a tx ring size of 512.  The ifq length is the tx ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4, 10000) to
> maximize pps.  This does bad things to latency and worse things to
> caching (512 buffers might fit in the L2 cache, but 10000 buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
>
> > So it could simply directly put everything into
> > the TX DMA and not even try to soft-queue.  If the TX DMA ring is full
> > ENOBUFS is returned instead of filling yet another queue.
>
> That could work, but upper layers currently don't understand ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries is not many,
> so even if upper layers understood ENOBUFS it is not easy for them to
> _always_ respond fast enough to keep the tx active, unless there are
> upstream buffers with many more than 512 entries.  There needs to be
> enough buffering somewhere so that the tx ring can be replenished
> almost instantly from the buffer, to handle the worst-case latency
> for the threads generating new (unbuffered) packets.  At the line rate
> of ~1.5 Mpps for 1 Gbps, the maximum latency that can be covered by
> 512 entries is only 340 usec.
>
> > However there
> > are ALTQ interactions and other mechanisms which have to be considered
> > too making it a bit more involved.
>
> I didn't try to handle ALTQ or even optimize for TCP.
>
> More details: to maximize pps, the main detail is to ensure that the tx
> ring never becomes empty.  The tx then transmits as fast as possible.
> This requires some watermark processing, but FreeBSD has almost none
> for tx rings.  The following normally happens for packet generators
> like ttcp and netsend:
>
> - loop calling send() or sendto() until the tx ring (and also any
>   upstream buffers) fill up.  Then ENOBUFS is returned.
>
> - watermark processing is broken in the user API at this point.  There
>   is no way for the application to wait for the ENOBUFS condition to
>   go away (select() and poll() don't work).  Applications use poor
>   workarounds:
>
> - old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.  This
>   was barely good enough for 1 Mbps ethernet (line rate ~1500 pps is 27
>   per 18 msec, so IFQ_MAXLEN = 50 combined with just a 1-entry tx ring
>   provides a safety factor of about 2).  Expansion of the tx ring size to
>   512 makes this work with 10 Mbps ethernet too.  Expansion of the ifq
>   to 511 gives another factor of 2.  After losing the safety factor of 2,
>   we can now handle 40 Mbps ethernet, and are only a factor of 25 short
>   for 1 Gbps.  My hardware can't do line rate for small packets -- it
>   can only do 640 kpps.  Thus ttcp is only a factor of 11 short of
>   supporting the hardware at 1 Gbps.
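
Interjecting for readers who haven't looked at ttcp: the workaround Bruce
describes amounts to roughly the loop below.  This is only a sketch, not
the actual ttcp source; the UDP socket setup, payload size, port and
address are made-up test values.

/*
 * Rough sketch of the old-ttcp workaround: blast UDP packets until
 * sendto() returns ENOBUFS, then sleep ~18 msec and retry.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in dst;
    char buf[18];                       /* minimal payload */
    long npkts = 1000000;               /* packets to send */
    int s;

    if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
        return (1);
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);                     /* assumed test port */
    dst.sin_addr.s_addr = inet_addr("192.0.2.1");   /* TEST-NET address */
    memset(buf, 0, sizeof(buf));

    while (npkts > 0) {
        if (sendto(s, buf, sizeof(buf), 0,
            (struct sockaddr *)&dst, sizeof(dst)) == -1) {
            if (errno != ENOBUFS)
                break;                  /* a real error */
            /*
             * There is nothing to select()/poll() on for the
             * ENOBUFS condition going away, so just sleep.
             */
            usleep(18 * 1000);          /* the historical 18 msec */
            continue;
        }
        npkts--;
    }
    close(s);
    return (0);
}

How well that usleep() actually behaves is exactly the granularity
problem the next quoted paragraph gets into.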
>   This assumes that sleeps of 18 msec are actually possible, which
>   they aren't with HZ = 100 giving a granularity of 10 msec so that
>   sleep(18 msec) actually sleeps for an average of 23 msec.  -current
>   uses the bad default of HZ = 1000.  With that sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep for more like 1
>   msec if that is possible.  Then the average sleep is 1.5 msec.  ttcp
>   can keep up with the hardware with that, and is only slightly behind
>   the hardware with the worst-case sleep of 2 msec (512+511 packets
>   generated every 2 msec is 511.5 kpps).
>
>   I normally use old ttcp, except I modify it to sleep for 1 msec instead
>   of 18 in one version, and in another version I remove the sleep so that
>   it busy-waits in a loop that calls send() which almost always returns
>   ENOBUFS.  The latter wastes a lot of CPU, but is almost good enough
>   for throughput testing.
>
> - newer ttcp tries to program the sleep time in microseconds.  This
>   doesn't really work, since the sleep granularity is normally at least
>   a millisecond, and even if it could be the 340 microseconds needed by
>   bge with no ifq (see above, and better divide the 340 by 2), then this
>   is quite short and would take almost as much CPU as busy-waiting.  I
>   consider HZ = 1000 to be another form of polling/busy-waiting and
>   don't use it except for testing.
>
> - netrate/netsend also uses a programmed sleep time.  This doesn't
>   really work, as above.  netsend also tries to limit its rate based on
>   sleeping.  This is further from working, since even finer-grained
>   sleeps are needed to limit the rate accurately than to keep up with
>   the maximum rate.
>
> Watermark processing at the kernel level is not quite as broken.  It
> is mostly non-existent, but partly works, sort of accidentally.  The
> difference is now that there is a tx "eof" or "completion" interrupt
> which indicates the condition corresponding to the ENOBUFS condition
> going away, so that the kernel doesn't have to poll for this.  This
> is not really an "eof" interrupt (unless bge is programmed insanely,
> to interrupt only after the tx ring is completely empty).  It acts as
> primitive watermarking.  bge can be programmed to interrupt after
> having sent every N packets (strictly, after every N buffer descriptors,
> but for small packets these are the same).  When there are more than
> N packets to start, say M, this acts as a watermark at M-N packets.
> bge is normally misprogrammed with N = 10.  At the line rate of 1.5 Mpps,
> this asks for an interrupt rate of 150 kHz, which is far too high and
> is usually unreachable, so reaching the line rate is impossible due to
> the CPU load from the interrupts.  I use N = 384 or 256 so that the
> interrupt rate is not the dominant limit.  However, N = 10 is better
> for latency and works under light loads.  It also reduces the amount
> of buffering needed.
>
> The ifq works more as part of the accidental watermarking than as a
> buffer.  It is the same size as the tx ring (actually 1 smaller for
> bogus reasons), so it is not really useful as a buffer.  However, with
> no explicit watermarking, any separate buffer like the ifq provides a
> sort of watermark at the boundary between the buffers.  The usefulness
> of this would be most obvious if the tx "eof" interrupt were actually
> for eof (perhaps that is what it was originally).  Then on the eof
> interrupt, there is no time at all to generate new packets, and the
> time when the tx is idle can be minimized by keeping pre-generated
> packets handy where they can be copied to the tx ring at tx "eof"
> interrupt time.  A buffer of about the same size as the tx ring (or
> maybe 1/4 the size) is enough for this.
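
To make the watermark mechanics concrete, in driver terms it is roughly
the sketch below.  This is hypothetical code, not taken from bge or any
real driver; the softc fields, the refill helper and the constants are
invented for illustration.

#define XX_TX_RING_CNT  512
#define XX_TX_LO_WATER  (XX_TX_RING_CNT / 2)    /* refill from about the middle */

struct xx_softc {
    int xx_tx_cons;     /* first descriptor not yet reclaimed */
    int xx_tx_cnt;      /* descriptors currently owned by the NIC */
    /* ... tx ring, staging buffer of pre-built packets, ... */
};

/* Copy up to 'space' pre-built packets into free descriptors (omitted). */
static void
xx_tx_refill(struct xx_softc *sc, int space)
{
    (void)sc;
    (void)space;
}

static void
xx_txeof(struct xx_softc *sc, int hw_cons)
{
    /* Reclaim everything the hardware reports as already sent. */
    while (sc->xx_tx_cons != hw_cons) {
        sc->xx_tx_cons = (sc->xx_tx_cons + 1) % XX_TX_RING_CNT;
        sc->xx_tx_cnt--;
    }

    /*
     * Low watermark: refill only once a decent amount of space has
     * opened up, so the interrupt rate stays low, but well before
     * the ring can run dry (~340 usec from full to empty at 1 Gbps
     * line rate; a watermark at half the ring leaves ~170 usec).
     */
    if (sc->xx_tx_cnt <= XX_TX_LO_WATER)
        xx_tx_refill(sc, XX_TX_RING_CNT - sc->xx_tx_cnt);
}

With the watermark in the middle of a 512-entry ring, the refill deadline
is about half of the ~340 usec full-to-empty time, which is what makes
that position forgiving; that is roughly where Bruce ends up below.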
> OTOH, with bge misprogrammed to interrupt after every 10 tx packets, the
> ifq is useless for its watermark purposes.  The watermark is effectively
> in the tx ring, and very strangely placed there at 10 below the top
> (ring full).  Normally tx watermarks are placed near the bottom (ring
> empty).  They must not be placed too near the bottom, else there would
> not be enough time to replenish the ring between the time when the "eof"
> (really, the "watermark") interrupt is received and when the tx runs
> dry.  They should not be placed too near the top like they are in
> -current's bge, else the point of having a large tx ring is defeated
> and there are too many interrupts.  However, when they are placed near
> the top, latency requirements are reduced.
>
> I recently worked on buffering for sio and noticed similar related
> problems for tx watermarks.  Don't laugh -- serial i/o 1 character at
> a time at 3.686400 Mbps has much the same timing requirements as
> ethernet i/o 1 packet at a time at 1 Gbps.  Each serial character
> takes ~2.7 usec and each minimal ethernet packet takes ~0.67 usec.
> With tx "ring" sizes of 128 and 512 respectively, the ring times from
> full to empty are 347 usec for serial i/o and 341 usec for ethernet i/o.
> Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx.  It consists of
>   just keeping at least 1 entry in the tx ring at all times.  Latency
>   must be kept below ~340 usec to have any chance of this.  This is not
>   so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the packets, so you
>   don't have to worry about latency affecting the generators.
> - the need for watermark processing is better known for rx, since it
>   obviously doesn't work to generate the rx "eof" interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even the
> 2 GHz CPU couldn't keep up with line rate (so from full to empty takes
> 800 usec).
>
> It turned out that the best position for the tx low watermark is about
> 1/4 or 1/2 from the bottom for both sio and bge.  It must be fairly
> high, else the latency requirements are not met.  In the middle is a
> good general position.  Although it apparently "wastes" half of the ring
> to make the latency requirements easier to meet (without very
> system-dependent tuning), the efficiency lost from this is reasonably
> small.
>
> Bruce

I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load-tested doesn't really
illustrate anything.  All of his drivers are functional but optimized for
nothing.

BC
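
P.S.  For concreteness, the timing arithmetic in the quoted message works
out as follows; the per-entry times and ring sizes are Bruce's numbers,
and the refill budgets are just the entries below the watermark times the
per-entry time:

    1 Gbps, minimal packets:              ~1.5 Mpps, i.e. ~0.67 usec/packet
    512-entry tx ring, full to empty:     512 * 0.67 usec  ~ 341 usec
    at his hardware's 640 kpps:           512 / 640000 pps ~ 800 usec
    3.6864 Mbps serial, 128-char "ring":  128 * 2.7 usec   ~ 347 usec
    refill budget, watermark at 1/2:      256 * 0.67 usec  ~ 171 usec
    refill budget, watermark at 1/4:      128 * 0.67 usec  ~  86 usec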