On 06/18/2012 12:05 AM, Richard Elling wrote:
> You might try some of the troubleshooting techniques described in Chapter 5
> of the DTrace book by Brendan Gregg and Jim Mauro. It is not clear from your
> description that you are seeing the same symptoms, but the technique should
> apply.
>  -- richard

Thanks for the advice, I'll try it. In the meantime, I'm beginning to
suspect I'm hitting some PCI-e issue on the Dell R715 machine. Looking at:

# mdb -k
> ::interrupts
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)
[snip]
91   0x82 7   PCI    Edg MSI    5   1     -         pcieb_intr_handler
[snip]
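In case it helps to confirm which driver's handler is actually pinning CPU 5 (rather than just the nexus), mdb can annotate the vector list with device names, and intrstat(1M) breaks interrupt time down per device and per CPU. A sketch, assuming a current Solaris/illumos build:

```shell
# Annotate each vector with the owning driver instance:
echo "::interrupts -d" | mdb -k

# Per-device, per-CPU interrupt rates and %time, sampled every 5 seconds:
intrstat 5
```

If intrstat shows the NIC's interrupt time on CPU 5 tracking the throughput drops, that narrows it to the interrupt path rather than the bridge itself.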

In mpstat I can see that during normal operation, CPU 5 is nearly floored:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  5    0   0    0   512    0 1054    0    0  870    0     0    0  93   0   7

Then, whenever anything disturbs the PCI-e bus (e.g. a txg flush or the
xcall storm), the CPU goes to 100% utilization and my network throughput
drops accordingly. The issue can be softened by lowering the input
bandwidth from ~46 MB/s to below 20 MB/s; at that point the core in
question sits at only about 10% utilization, and no xcall storm or txg
flush can influence my network (the CPU still gets about 70% busy during
the event, but that leaves enough headroom to avoid packet loss).
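To see what CPU 5 is actually burning those cycles on while the event is in progress, a profile-provider one-liner (a standard DTrace idiom, nothing site-specific) aggregates kernel stacks sampled on that CPU:

```shell
# Sample kernel stacks ~997 times/sec on CPU 5 for 10 seconds,
# then print the 20 hottest stacks.
dtrace -n 'profile-997hz /cpu == 5/ { @[stack()] = count(); }
           tick-10s { trunc(@, 20); exit(0); }'
```

Whether the top stacks are in the NIC driver, pcieb, or the xcall path should tell you whether this is interrupt load or cross-call interference.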

So it seems I'm hitting some hardware design issue, or something...
I'll try moving my network card to the second PCI-e I/O bridge tomorrow
(which appears to be bound to CPU 6).
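Before physically reseating the card, it may be worth trying to rebind the vector in software. On platforms that ship pcitool(1M), the shape of the command is roughly as below; treat the nexus path, ino, and syntax as assumptions to be checked against the man page on your build, and note that intrd may migrate the vector back unless it is disabled:

```shell
# Show current interrupt-to-CPU routing under the root complex
# (the /pci@0,0 path is an example; substitute your own nexus path):
pcitool /pci@0,0 -i

# Rebind a given ino to CPU 6 (example/hypothetical values for this box):
pcitool /pci@0,0 -i ino=0x18 -w cpu=0x6
```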

Any other ideas on what I might try to get the PCI-e I/O bridge
bandwidth back, or how to keep other activity in the system (xcalls
and/or txg flushes) from starving that CPU? I already tried putting the
CPUs in question into an empty processor set, but that isn't enough, it
seems.
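For what it's worth, a processor set won't fence off cross-calls, since they are delivered regardless of set membership. To at least attribute the xcall storms, the sysinfo provider fires a probe per cross-call, so aggregating on stack shows who is generating them (again a standard DTrace one-liner, not specific to this setup):

```shell
# Count cross-calls by originating kernel stack; Ctrl-C to print.
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'
```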

--
Saso
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
