On Fri, Jul 5, 2013 at 8:00 PM, Saso Kiselkov <[email protected]> wrote:
> On 05/07/2013 17:08, [email protected] wrote:
>> Good morning,
>>
>> I have a weird problem with two of the 15+ OpenSolaris storage servers in
>> our environment. All the Nearline servers are essentially the same:
>> Supermicro X9DR3-F based servers, dual E5-2609s, 64GB memory, dual 10Gb
>> SFP+ NICs, an LSI 9200-8e HBA, Supermicro CSE-826E26-R1200LPB storage
>> arrays and Seagate enterprise 2TB SATA or SAS drives (not mixed within a
>> server). Root, L2ARC and ZIL are all on Intel SSDs (SLC series 313 for
>> ZIL, MLC 520 for L2ARC and MLC 330 for boot).
>>
>> The volumes are built out of 9-drive RAIDZ1 groups; ashift is set to 9
>> (which is supposed to be appropriate for the enterprise Seagates). The
>> pools are large (120-130TB) but are only between 27 and 32% full. Each
>> server serves an iSCSI (COMSTAR) and a CIFS (in-kernel server) volume from
>> the same pool. I realize this is not optimal from a
>> recovery/resilver/rebuild standpoint, but the servers are replicated and
>> the data is easily rebuildable.
>>
>> Initially these servers did great for several months; while certainly no
>> speed demons, 300+ MB/sec for sequential reads/writes was not a problem.
>> Several weeks ago, literally overnight, replication times went through the
>> roof for one server. Simple testing showed that reading from the pool
>> would no longer go over 25MB/s. Even a scrub that used to run at 400+
>> MB/sec is now crawling along at below 40MB/s.
>>
>> Sometime yesterday the second server started to exhibit the exact same
>> behaviour. This one is used even less (it's our D2D2T server); data is
>> written to it at night and read back during the day to be written to tape.
>>
>> I've exhausted all I know and I'm at a loss. Does anyone have any ideas of
>> what to look at, or do any obvious reasons for this behaviour jump out
>> from the configuration above?
>
> Isn't iostat -Exn reporting some transport errors? Smells like a drive
> gone bad and forcing retries, which would cause about a 10x decrease in
> performance. Just a guess, though.
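For reference, a rough sketch of that check (the pool name "tank" and the
5-second interval below are only placeholders; what to look for is non-zero
error counters and a single disk whose asvc_t sits far above its siblings):

    iostat -Exn               # cumulative soft/hard/transport error counters per device
    iostat -xn 5              # live service times; one disk with asvc_t far above
                              # the rest of its vdev usually means retries
    zpool iostat -v tank 5    # per-vdev operations/bandwidth breakdown
    fmdump -e                 # FMA error telemetry; repeated ereports against one disk
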
Why should a retry require a 10x decrease in performance? A proper design
would surely do retries in parallel with other operations (Reiser4 and btrfs
do it), up to a certain number of failures in flight.

Irek
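Whether one device's retries really are what is serializing I/O here can be
checked from the latency side as well. A DTrace sketch along these lines (io
provider; per-device I/O latency distributions, quantized in microseconds)
would show whether a single drive's times have a tail orders of magnitude
longer than its siblings':

    dtrace -n '
    io:::start { ts[arg0] = timestamp; }
    io:::done /ts[arg0]/ {
        @[args[1]->dev_statname] = quantize((timestamp - ts[arg0]) / 1000);
        ts[arg0] = 0;
    }'

If one drive's histogram shows outliers in the tens of seconds while the rest
stay in the millisecond range, that drive is the one holding the pool back,
whatever the retry policy above it looks like.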
