[I sent this to the e1000-devel folks, and they suggested netdev might have opinions too. the below text has changed a little bit to reflect feedback from Auke Kok]
attached is a small patch for e1000 that dynamically changes the Interrupt Throttle Rate (ITR) for best performance - both latency and bandwidth. it makes e1000 look really good on netpipe, with ~28 us latency and 890 Mbit/s bandwidth.

the basic idea is that a high InterruptThrottleRate (~200k) is best for small messages, whilst a low ITR (~15k) is best for large messages. leaving the ITR high for large messages burns outrageous amounts of cpu, and anything less than ~15k ITR is bad for bandwidth.

so this patch creates a new "performance dynamic" mode, InterruptThrottleRate=2 (2,2 for dual NICs), which changes the ITR on the fly. the patch is based on the existing "dynamic" mode (ITR=1), which seems to be optimised for low cpu usage with little concern for performance.

hopefully the thresholds chosen for ITR changeovers will be ok on other people's hardware too, but I really have no idea how universal they'll be. we've been running it for a few months on our cluster and it appears stable. the 10M, 20M and 100M thresholds for changing between the 200k, 90k, 30k and 15k ITRs were set pretty much by eye - by doing a bunch of netpipe runs and trying to minimise cpu usage (ITR) for a target latency/bandwidth.

I've done an analysis of performance on this page:
  http://www.cita.utoronto.ca/mediawiki/index.php/E1000_performance_patch
our hardware details are there too. there's also a link to another analysis of how the patch affects routing performance and cpu usage (surprisingly better).

despite the netpipe improvements, I haven't seen much in the way of real world code differences (either +ve or -ve) from a regular 15k ITR. I've seen an improvement in one code, and a slight degradation (~1%) in HPL (the top500.org benchmark). it should probably make the most difference for codes that consistently send small (< 1k) messages.

one possible improvement would be if the watchdog routine were called more than once every 2 seconds - that would allow the ITR to adapt more often. ideally (I think) for traffic with mixed packet sizes the ITR would be adapted 100s of times a second, but I'm not sure how practical that is.

cheers,
robin
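
for anyone who wants to play with the changeover rule outside the driver, here is a minimal standalone sketch of it. it is an illustration only, not driver code: the function names perf_dynamic_itr and itr_to_reg are made up for this example, and bytes_tx / bytes_rx stand in for adapter->gotcl / adapter->gorcl, which I'm assuming hold the byte counts seen since the previous watchdog run. the register conversion is the same expression the driver passes to E1000_WRITE_REG.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the driver): pick an interrupt rate
 * from the busier direction's byte count, mirroring the mode-2 rule.
 * bytes_tx / bytes_rx stand in for adapter->gotcl / adapter->gorcl. */
static uint32_t perf_dynamic_itr(uint32_t bytes_tx, uint32_t bytes_rx)
{
	uint32_t busiest = bytes_tx > bytes_rx ? bytes_tx : bytes_rx;
	uint32_t mb = busiest / 1000000;	/* integer division truncates */

	if (mb > 100)
		return 15000;	/* bulk transfers: lowest interrupt rate */
	if (mb > 20)
		return 30000;
	if (mb > 10)
		return 90000;
	return 200000;		/* light or small-message traffic */
}

/* Hypothetical helper: convert interrupts/second into the value the
 * patch writes to the ITR register, using the driver's own expression. */
static uint32_t itr_to_reg(uint32_t itr)
{
	return 1000000000u / (itr * 256u);
}
```

because of the truncating division and the strict ">" comparisons, the changeovers actually happen at 11M, 21M and 101M bytes rather than exactly 10M, 20M and 100M; a count of 10,999,999 still gets the 200k rate. with the expression above, the four rates map to register values of 19, 43, 130 and 260.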

diff -ru e1000-7.0.33/src/e1000_main.c e1000-7.0.33-rjh-performance/src/e1000_main.c
--- e1000-7.0.33/src/e1000_main.c	2006-02-03 16:53:41.000000000 -0500
+++ e1000-7.0.33-rjh-performance/src/e1000_main.c	2006-04-01 21:44:21.000000000 -0500
@@ -1732,7 +1732,7 @@
 
 	if (hw->mac_type >= e1000_82540) {
 		E1000_WRITE_REG(hw, RADV, adapter->rx_abs_int_delay);
-		if (adapter->itr > 1)
+		if (adapter->itr > 2)
 			E1000_WRITE_REG(hw, ITR,
 				1000000000 / (adapter->itr * 256));
 	}
@@ -2394,17 +2394,30 @@
 		}
 	}
 
-	/* Dynamic mode for Interrupt Throttle Rate (ITR) */
-	if (adapter->hw.mac_type >= e1000_82540 && adapter->itr == 1) {
-		/* Symmetric Tx/Rx gets a reduced ITR=2000; Total
-		 * asymmetrical Tx or Rx gets ITR=8000; everyone
-		 * else is between 2000-8000. */
-		uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000;
-		uint32_t dif = (adapter->gotcl > adapter->gorcl ?
-			adapter->gotcl - adapter->gorcl :
-			adapter->gorcl - adapter->gotcl) / 10000;
-		uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000;
-		E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+	/* Dynamic modes for Interrupt Throttle Rate (ITR) */
+	if (adapter->hw.mac_type >= e1000_82540) {
+		if (adapter->itr == 1) {
+			/* Symmetric Tx/Rx gets a reduced ITR=2000; Total
+			 * asymmetrical Tx or Rx gets ITR=8000; everyone
+			 * else is between 2000-8000. */
+			uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000;
+			uint32_t dif = (adapter->gotcl > adapter->gorcl ?
+				adapter->gotcl - adapter->gorcl :
+				adapter->gorcl - adapter->gotcl) / 10000;
+			uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000;
+			E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+		}
+		else if (adapter->itr == 2) { /* low latency, high bandwidth, moderate cpu usage */
+			/* range from high itr at low cl, to low itr at high cl
+			 *   < 10M => large itr
+			 *   10M to 20M => 90k itr
+			 *   20M to 100M => 30k itr
+			 *   > 100M => 15k itr */
+			uint32_t goc = max(adapter->gotcl, adapter->gorcl) / 1000000;
+			uint32_t itr = goc > 10 ? (goc > 20 ? (goc > 100 ? 15000: 30000): 90000): 200000;
+			/* DPRINTK(PROBE, INFO, "e1000 ITR %d - [tr]cl min/ave/max %dm / %dm/ %dm\n", itr, min(adapter->gotcl, adapter->gorcl) / 1000000, (adapter->gotcl + adapter->gorcl) / 2000000, max(adapter->gotcl, adapter->gorcl) / 1000000 ); */
+			E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+		}
 	}
 
 	/* Cause software interrupt to ensure rx ring is cleaned */
diff -ru e1000-7.0.33/src/e1000_param.c e1000-7.0.33-rjh-performance/src/e1000_param.c
--- e1000-7.0.33/src/e1000_param.c	2006-02-03 16:53:41.000000000 -0500
+++ e1000-7.0.33-rjh-performance/src/e1000_param.c	2006-03-29 21:42:00.000000000 -0500
@@ -538,6 +538,10 @@
 				DPRINTK(PROBE, INFO, "%s set to dynamic mode\n",
 					opt.name);
 				break;
+			case 2:
+				DPRINTK(PROBE, INFO, "%s set to performance dynamic mode\n",
+					opt.name);
+				break;
 			default:
 				e1000_validate_option(&adapter->itr, &opt,
 					adapter);