-----Original Message-----
From: James Carlson [mailto:carls...@workingcode.com]
> It has a lot of tunables. First, there are the ones documented in
> /kernel/drv/igb.conf. Of those, perhaps the most interesting is
> flow_control. It looks like the hardware acceleration ones here are the
> queuing-related ones, but they appear to be disabled by default.
> Then there are ones that can be controlled via igb.conf but that aren't
> advertised in the default file that ships. These include:

Looks like I was duped by Sun's documentation:

http://docs.oracle.com/cd/E19082-01/819-2724/giozx/index.html

It only outlines mr_enable and intr_force.

> Have you tried isolating other components? Is the behavior the same on
> all switch ports? Does it differ if you're connected to a different
> brand of switch?

I have four servers performing this particular role. Three of them slowly
and quietly march to their death. One marches to its death, but
periodically gets a 'reset' where HTTP response time suddenly drops and
the "clock" on the march to death starts over. Yesterday, the 'resets'
started happening on a second box.

My current switch is a Juniper EX4200 configured as a virtual chassis,
with a reasonably sized stack of chassis configured as line cards. It
seems to work just dandy for the other systems connected to it. I haven't
tried another switch yet, but there are no link-related issues being
reported on either side. I have used this configuration before with
e1000-based NICs on OSOL 2009.06 and OI 148.

> Is there anything special going on here? For example, do you have
> standard IEEE link negotiation turned off -- forcing duplex or link speed?

Nothing particularly special. The GigE spec says to leave those in
autoneg, and they are in autoneg. Autoneg appears to work as advertised.
The switch ports are configured for jumbo frames (9014 bytes), but the
web servers are not; they use standard 1500-byte MTUs. I have tested
jumbo and standard frames, and it makes no difference to the problem. The
servers are connected via LACP with 2 members in the bundle. I have
tested with LACP and with straight-up Ethernet switching; it makes no
difference.

Here is the switch config; it is pretty plain vanilla:

root@brdr0.sf0# show ge-0/0/2
description "www004 igb1";
ether-options {
    auto-negotiation;
    802.3ad ae4;
}

root@brdr0.sf0# show ge-1/0/2
description "www004 igb0";
ether-options {
    auto-negotiation;
    802.3ad ae4;
}

root@brdr0.sf0# show ae4
description "www004 aggr0";
mtu 9014;
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        vlan {
            members 300;
        }
    }
}

> Those symptoms seem to point in the direction of something much broader;
> a system-wide issue with interrupts, perhaps.

While I haven't instrumented interrupts in depth, the normal range is
between 16k and 30k per second, which doesn't strike me as unusual. It was
actually the very first thing I looked at before I engaged the list.
(There is a quick sketch of the measurement further down.)

> Have you seen the postings about APIC issues? Could those apply here?
>
> set apix:apic_timer_preferred_mode = 0x0

I haven't seen the APIC postings, but I have been turning the APIC knobs.
I gave apic_timer_preferred_mode a whirl; it made no impact. I also tried
setprop acpi-user-options=0x8 and setprop acpi-user-options=0x2
individually. These also brought me no joy.

I gave disabling hardware acceleration a whirl:

tx_hcksum_enable = 0;
rx_hcksum_enable = 0;
lso_enable = 0;

Disabling the hardware acceleration earned me about a 40% jump in CPU
utilization but no relief on the original problem. I also gave adding
software ring buffers a go. That had little or no effect as well.
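Coming back to the interrupt-rate question for a second: for anyone who
wants to reproduce that 16k-30k/second figure, the stock observability
tools are enough. A rough sketch (the 5-second interval is arbitrary):

# Per-device, per-CPU interrupt rates, sampled every 5 seconds
intrstat 5

# Per-CPU totals; the intr and ithr columns give the system-wide picture
mpstat 5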
And of course, I gave hardware ring buffers a go as well:

name = "pci8086,a03c" parent = "/pci@0,0/pci8086,a03c@3" unit-address = "0,1"
    rx_ring_size = 4096;

name = "pci8086,a03c" parent = "/pci@0,0/pci8086,a03c@3" unit-address = "0,1"
    tx_ring_size = 4096;

name = "pci8086,a03c" parent = "/pci@0,0/pci8086,a03c@3" unit-address = "0,1"
    mr_enable = 1;

name = "pci8086,a03c" parent = "/pci@0,0/pci8086,a03c@3" unit-address = "0,1"
    rx_group_number = 4;

Last night, I disabled flow control on one box. That had no effect. I
haven't tried connecting them to another switch, but I see no link-related
issues: no discards, errors, etc.

I did notice one interesting item. The server that has had the 'resets',
where performance returns to normal but then the march of death begins
again, has a link state of 'unknown' on its two vnics. All the other
zones, which live on other physical systems, have a state of 'up'. What
this might mean, and how or whether it relates, I have no idea, but it is
the only outlier I can find in the batch. I would normally categorize the
zones with the 'unknown' link state as somehow broken, but they are the
best-performing zones in the bunch.

Nothing I have tried so far makes any difference, except that if I offload
enough traffic from the system, the problem does appear to dissipate.

j.
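P.S. For anyone who wants to compare notes on the link-state oddity: the
'unknown' I'm referring to is the STATE column that dladm reports for the
zones' vnics. A minimal check, run from the global zone (link names will
obviously differ per box):

# physical links, the aggr, and the vnics, with their reported STATE
dladm show-link

# vnic-specific view (parent link, MAC address, VID)
dladm show-vnic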