On 06/18/2012 12:05 AM, Richard Elling wrote:
> You might try some of the troubleshooting techniques described in Chapter 5
> of the DTrace book by Brendan Gregg and Jim Mauro. It is not clear from your
> description that you are seeing the same symptoms, but the technique should
> apply.
>  -- richard
Thanks for the advice, I'll try it. In the meantime, I'm beginning to suspect I'm hitting some PCI-e issue on the Dell R715 machine. Looking at:

# mdb -k ::interrupts
IRQ  Vect IPL Bus  Trg Type  CPU Share APIC/INT#  ISR(s)
[snip]
91   0x82 7   PCI  Edg MSI   5   1     -          pcieb_intr_handler
[snip]

In mpstat I can see that during normal operation, CPU 5 is nearly floored:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  5    0   0    0   512    0 1054    0    0  870    0     0    0  93   0   7

Then, whenever anything disturbs the PCI-e bus (e.g. a txg flush or an xcall storm), that CPU goes to 100% utilization and my network throughput drops accordingly. The issue can be mitigated by lowering the input bandwidth from ~46 MB/s to below 20 MB/s; at that point the core in question sits at only about 10% utilization, and no xcall storm or txg flush can affect my network (the CPU still climbs to about 70% busy during those events, but that leaves enough headroom to avoid packet loss).

So it seems I'm hitting some hardware design limit, or something like it. Tomorrow I'll try moving my network card to the second PCI-e I/O bridge (which appears to be bound to CPU 6). Any other ideas on what I might try to get the PCI-e I/O bridge bandwidth back, or on how to keep other system activity (xcalls and/or txg flushes) from starving that CPU? I already tried putting the CPUs in question into an otherwise empty processor set, but that doesn't seem to be enough.
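In case it's useful to anyone following along, the next things I plan to run are the standard DTrace one-liner for finding where the cross-calls originate, plus intrstat to see how much interrupt time each CPU is actually burning per device. This is just a rough sketch of the checks, not results:

# aggregate the kernel stacks that are generating cross-calls
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'

# per-device, per-CPU interrupt activity, sampled once a second
intrstat 1

--
Saso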