On Thu, 3 Jul 2008, Paul wrote:
Bruce Evans wrote:
No polling:
            input          (em0)           output
   packets  errs      bytes    packets  errs      bytes  colls
    843762 25337   52313248          1     0        178      0
    763555     0   47340414          1     0        178      0
    830189     0   51471722          1     0        178      0
    838724     0   52000892          1     0        178      0
    813594   939   50442832          1     0        178      0
    807303   763   50052790          1     0        178      0
    791024     0   49043492          1     0        178      0
    768316  1106   47635596          1     0        178      0
Machine is maxed out and unresponsive.
That's the most interesting one. Even 1% packet loss would probably
destroy performance, so the benchmarks that give 10-50% packet loss
are uninteresting.
But you realize that it's outputting all of these packets on em3; I'm
watching them come out, and they are consistent with the packets
received on em0 that netstat shows as 'good' packets.
Well, output is easier. I don't remember seeing the load on a taskq for
em3. If there is a memory bottleneck, it might or might not be more related
to running only 1 taskq per interrupt, depending on how independent the
memory system is for different CPUs. I think Opterons have more independence
here than most x86's.
I'm using a server Opteron, which supposedly has the best memory performance
of any CPU right now.
Plus Opterons have the biggest L1 cache, but a small L2 cache. Do you think
the larger L2 cache on the Xeon (6 MB for 2 cores) would be better?
I have a 2222 Opteron coming which is 1 GHz faster, so we will see what happens.
I suspect lower-latency memory would help more. Big memory systems
have inherently higher latency. My little old A64 workstation and
laptop have main memory latencies 3 times smaller than freebsd.org's
new Core2 servers according to lmbench2 (42 nsec for the overclocked
DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
If there are a lot of cache misses, then the extra 100 nsec can be
important. Profiling of sendto() using hwpmc or perfmon shows a
significant number of cache misses per packet (2 or 10?).
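To put that in perspective, here is a back-of-envelope sketch in C of
what those misses cost, taking the 2-10 misses/packet figure above and
assuming ~100 nsec of extra latency per miss; the 500 kpps rate is an
illustrative assumption, not one of the measurements:
%%%
#include <stdio.h>

int
main(void)
{
	double pps = 500e3;	/* assumed packet rate, for illustration */
	double extra_ns = 100;	/* assumed extra latency per miss */

	/* Fraction of one CPU lost to memory stalls at each miss count. */
	for (int misses = 2; misses <= 10; misses += 8) {
		double frac = pps * misses * extra_ns * 1e-9;
		printf("%2d misses/packet -> %.0f%% of one CPU\n",
		    misses, frac * 100);
	}
	return (0);
}
%%%
With the assumed numbers, 2 misses/packet burns about 10% of a CPU and
10 misses/packet burns about 50%, so the extra 100 nsec is far from noise.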
Polling ON:
            input          (em0)           output
   packets  errs      bytes    packets  errs      bytes  colls
    784138 179079  48616564          1     0        226      0
    788815 129608  48906530          2     0        356      0
Machine is responsive and has 40% idle CPU. Why ALWAYS 40%? I'm really
mystified by this.
Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
to explain (perhaps incorrectly). Polling can then read at most 256
descriptors every 1/2000 second, giving a max throughput of 512 kpps.
Packets < descriptors in general, but they might be equal here (for small
packets). You seem to actually get 784 kpps, which is too high even
in descriptors, but it roughly matches if the errors are counted
twice (784 - 179 = 605 ~= 512). CPU is getting short too, but 40%
still happens to be left over after giving up at 512 kpps. Most of
the errors are probably handled by the hardware, at low cost in CPU, by
dropping packets. There are other types of errors, but none except
dropped packets is likely.
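The arithmetic behind that ceiling, as a minimal C sketch (hz = 2000
and the 256-descriptor burst are the settings asked about above; the
same bound applies to transmit and receive):
%%%
#include <stdio.h>

int
main(void)
{
	int hz = 2000;		/* polling passes per second */
	int burst = 256;	/* descriptors handled per poll */
	long ceiling = (long)hz * burst;

	/* 2000 polls/sec * 256 descriptors/poll = 512000 pps max. */
	printf("polling ceiling: %ld pps (%ld kpps)\n",
	    ceiling, ceiling / 1000);
	return (0);
}
%%%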
Read the above: it's actually transmitting 770 kpps out of em3, so it can't
just be 512 kpps.
Transmitting is easier, but with polling it's even harder to exceed
hz * queue_length when sending than when receiving. This is without
polling in idle.
I was thinking of trying FreeBSD 4 or 5... but how would that work with
this new hardware?
Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally
has lower overheads and latency, but is missing important improvements
(mainly TCP optimizations in upper layers, better DMA and/or mbuf
handling, and support for newer NICs). FreeBSD-5 is also missing the
overhead+latency advantage.
Here are some benchmarks. (ttcp mainly tests sendto(); 4.10 em needed a
2-line change to support a not-so-new PCI em NIC.) Summary:
- my bge NIC can handle about 600 kpps on my faster machine, but only
  achieves 300 kpps in 4.10 unpatched.
- my em NIC can handle about 400 kpps on my slower machine, except in
  later versions it can receive at about 600 kpps.
- only 6.x and later can achieve near wire throughput for 1500-MTU
  packets (81 kpps vs 76 kpps). This depends on better DMA or mbuf
  handling... I now remember the details -- it is mainly better mbuf
  handling: old versions split 1500-MTU packets into 2 mbufs, and
  this requires 2 descriptors per packet, which causes extra software
  overheads and even larger overheads for the hardware (see the sketch
  below).
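To illustrate that last point, a hedged sketch (a simplified stand-in
for struct mbuf, not the real one from <sys/mbuf.h>, and not actual
em(4) driver code): a typical transmit path consumes one DMA descriptor
per mbuf in the chain, so a packet split across 2 mbufs costs 2
descriptors.
%%%
/* Simplified stand-in; the real struct mbuf carries much more state. */
struct mbuf {
	struct mbuf	*m_next;	/* next buffer in this packet's chain */
	int		 m_len;		/* valid data bytes in this buffer */
};

/*
 * Count the TX descriptors a packet consumes: one per non-empty mbuf,
 * which is why splitting a 1500-MTU packet into 2 mbufs doubles the
 * per-packet descriptor traffic for both software and hardware.
 */
static int
count_tx_descriptors(struct mbuf *m)
{
	int ndesc = 0;

	for (; m != NULL; m = m->m_next)
		if (m->m_len > 0)
			ndesc++;
	return (ndesc);
}
%%%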
%%%
Results of benchmarks run on 23 Feb 2007:

my~5.2 bge --> ~4.10 em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       639     98  1660     398*     77    8k
ttcp -l5 -t          6.0    100  3960      6.0      6  5900
ttcp -l1472 -u -t     76     27   395       76     40    8k
ttcp -l1472 -t        51     40   11k       51     26    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.

my~5.2 bge --> 4.11 em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       635     98  1650     399*     74    8k
ttcp -l5 -t          5.8    100  3900      5.8      6  5800
ttcp -l1472 -u -t     76     27   395       76     32    8k
ttcp -l1472 -t        51     40   11k       51     25    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.

my~5.2 bge --> my~5.2 em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       638     98  1660     394*   100-    8k
ttcp -l5 -t          5.8    100  3900      5.8      9  6000
ttcp -l1472 -u -t     76     27   395       76     46    8k
ttcp -l1472 -t        51     40   11k       51     35    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.  With the em rate
    limit on ips changed from 8k to 80k, about 95% are delivered up.

my~5.2 bge --> 6.2 em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       637     98  1660      637   100-   15k
ttcp -l5 -t          5.8    100  3900      5.8      8   12k
ttcp -l1472 -u -t     76     27   395       76     36   16k
ttcp -l1472 -t        51     40   11k       51     37   16k

my~5.2 bge --> ~current em-fastintr
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       641     98  1670      641     99    8k
ttcp -l5 -t          5.9    100  2670      5.9      7    6k
ttcp -l1472 -u -t     76     27   395       76     35    8k
ttcp -l1472 -t        52     43   11k       52     30    8k

~6.2 bge --> ~current em-fastintr
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       309     62  1600      309     64    8k
ttcp -l5 -t          4.9    100  3000      4.9      6    7k
ttcp -l1472 -u -t     76     27   395       76     34    8k
ttcp -l1472 -t        54     28  6800       54     30    8k

~current bge --> ~current em-fastintr
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       602    100  1570      602     99    8k
ttcp -l5 -t          5.3    100  2660      5.3      5  5300
ttcp -l1472 -u -t     81#    19   212      81#     38    8k
ttcp -l1472 -t        53     34   11k       53     30    8k
(#) Wire speed to within 0.5%.  This is the only kpps in this set of
    benchmarks that is close to wire speed.  Older kernels apparently
    lose relative to -current because mbufs for mtu-sized packets are
    not contiguous in older kernels.

Old results:

~4.10 bge --> my~5.2 em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       n/a    n/a   n/a      346     79    8k
ttcp -l5 -t          n/a    n/a   n/a      5.4     10  6800
ttcp -l1472 -u -t    n/a    n/a   n/a       67     40    8k
ttcp -l1472 -t       n/a    n/a   n/a       51     36    8k

~4.10 kernel, =4 bge --> ~current em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       n/a    n/a   n/a      347     96   14k
ttcp -l5 -t          n/a    n/a   n/a      5.8     10   14k
ttcp -l1472 -u -t    n/a    n/a   n/a       67     62   14k
ttcp -l1472 -t       n/a    n/a   n/a       52     40   16k

~4.10 kernel, =4+ bge --> ~current em
                           tx                    rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       n/a    n/a   n/a      627    100    9k
ttcp -l5 -t          n/a    n/a   n/a      5.6      9   13k
ttcp -l1472 -u -t    n/a    n/a   n/a       68     63   14k
ttcp -l1472 -t       n/a    n/a   n/a       54     44   16k
%%%
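For context on what the "ttcp -l5 -u -t" rows exercise, the core of the
transmitter is just a sendto() loop like the minimal sketch below; the
receiver address and packet count are made-up placeholders, and real
ttcp wraps this loop in timing and byte counting:
%%%
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[5] = "ttcp";		/* -l5: 5-byte writes */
	struct sockaddr_in sin;
	int s = socket(AF_INET, SOCK_DGRAM, 0);	/* -u: UDP */

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5001);	/* ttcp's usual default port */
	inet_pton(AF_INET, "10.0.0.2", &sin.sin_addr);	/* placeholder */

	for (long i = 0; i < 1000000; i++)	/* arbitrary count */
		(void)sendto(s, buf, sizeof(buf), 0,
		    (struct sockaddr *)&sin, sizeof(sin));
	close(s);
	return (0);
}
%%%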
%%%
Results of benchmarks run on 28 Dec 2007:

~5.2 epsplex (em) ttcp:
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:         825k    3  206k   229  412k   52.1  45.1   2.8
local with sink:       659k    3  263k   231  131k   66.5  27.3   6.2
tx remote no sink:      35k    3  273k  8237  266k   42.0  52.1   2.3   3.6
tx remote with sink:    26k    3  394k  8224   100   60.0  5.41   3.4  11.2
rx remote no sink:      25k    4    26  8237  373k   20.6  79.4   0.0   0.0
rx remote with sink:    30k    3  203k  8237  398k   36.5  60.7   2.8   0.0

6.3-PR besplex (em) ttcp:
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:         417k    1  208k  418k     2   49.5  48.5   2.0
local with sink:       420k    1  276k  145k     2   70.0  23.6   6.4
tx remote no sink:      19k    2  250k  8144     2   58.5  38.7   2.8   0.0
tx remote with sink:    16k    2  361k  8336     2   72.9  24.0   3.1   4.4
rx remote no sink:      429    3    49   888     2    0.3  99.33  0.0   0.4
rx remote with sink:    13k    2  316k  5385     2   31.7  63.8   3.6   0.8

8.0-C epsplex (em-fast) ttcp:
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:         442k    3  221k   230  442k   47.2  49.6   2.7
local with sink:       394k    3  262k   228  131k   72.1  22.6   5.3
tx remote no sink:      17k    3  226k  7832   100   94.1   0.2   3.0   0.0
tx remote with sink:    17k    3  360k  7962   100   91.7   0.2   3.7   4.4
rx remote no sink:     saturated -- cannot update systat display
rx remote with sink:    15k    6  358k  8224   100   97.0   0.0   2.5   0.5

~4.10 besplex (bge) ttcp:
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:           15    0  425k   228    11   96.3   0.0   3.7
local with sink:         **    0  622k   229    **   94.7   0.3   5.0
tx remote no sink:       29    1  490k  7024    11   47.9  29.8   4.4  17.9
tx remote with sink:     26    1  635k  1883    11   65.7  11.4   5.6  17.3
rx remote no sink:        5    1    68  7025     1    0.0  47.3   0.0  52.7
rx remote with sink:   6679    2  365k  6899    12   19.7  29.2   2.5  48.7

~5.2-C besplex (bge) ttcp:
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:           1M    3  271k   229  543k   50.7  46.8   2.5
local with sink:         1M    3  406k   229  203k   67.4  28.2   4.4
tx remote no sink:      49k    3  474k   11k  167k   52.3  42.7   5.0   0.0
tx remote with sink:   6371    3  641k  1900   100   76.0  16.8   6.2   0.9
rx remote no sink:      34k    3    25   11k  270k    0.8  65.4   0.0  33.8
rx remote with sink:    41k    3  365k   10k  370k   31.5  47.1   2.3  19.0

6.3-PR besplex (bge) ttcp (hz = 1000, else stathz is broken):
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:         540k    0  270k  540k     0   50.5  46.0   3.5
local with sink:       628k    0  417k  210k     0   68.8  27.9   3.3
tx remote no sink:      15k    1  222k  7190     1   28.4  29.3   1.7  40.6
tx remote with sink:   5947    1  315k  2825     1   39.9  14.7   2.6  42.8
rx remote no sink:      13k    1    23  6943     0    0.3  49.5   0.2  50.0
rx remote with sink:    20k    1  371k  6819     0   29.5  30.1   3.9  36.5

8.0-C besplex (bge) ttcp:
                        Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:         649k    3  324k   100  649k   53.9  42.9   3.2
local with sink:       649k    3  433k   100  216k   75.2  18.8   6.0
tx remote no sink:      24k    3  432k   10k   100   49.7  41.3   2.4   6.6
tx remote with sink:   3199    3  568k  1580   100   64.3  19.6   4.0  12.2
rx remote no sink:      20k    3    27   10k   100    0.0  46.1   0.0  53.9
rx remote with sink:    31k    3  370k   10k   100   30.7  30.9   4.8  33.5
%%%
Bruce