Thanks Neil, we always appreciate your comments on the ZIL implementation. A couple of additional comments below...
On Oct 4, 2012, at 8:31 AM, Neil Perrin <neil.per...@oracle.com> wrote:

> On 10/04/12 05:30, Schweiss, Chip wrote:
>>
>> Thanks for all the input. It seems information on the performance of the
>> ZIL is sparse and scattered. I've spent significant time researching this
>> the past day. I'll summarize what I've found. Please correct me if I'm
>> wrong.
>>
>> The ZIL can have any number of SSDs attached, either mirrored or
>> individually. ZFS will stripe across these in a raid0 or raid10 fashion
>> depending on how you configure them.
>
> The ZIL code chains blocks together and these are allocated round robin
> among slogs or, if they don't exist, the main pool devices.
>
>> To determine the true maximum streaming performance of the ZIL, setting
>> sync=disabled will only use the in-RAM ZIL. This gives up power protection
>> for synchronous writes.
>
> There is no RAM ZIL. If sync=disabled then all writes are asynchronous and
> are written as part of the periodic ZFS transaction group (txg) commit that
> occurs every 5 seconds.
>
>> Many SSDs do not help protect against power failure because they have
>> their own RAM cache for writes. This effectively makes the SSD useless for
>> this purpose and potentially introduces a false sense of security. (These
>> SSDs are fine for L2ARC.)
>
> The ZIL code issues a write cache flush to all devices it has written
> before returning from the system call. I've heard that not all devices obey
> the flush, but we consider those broken hardware. I don't have a list to
> avoid.
>
>> Mirroring SSDs is only helpful if one SSD fails at the time of a power
>> failure. This leaves several unanswered questions. How good is ZFS at
>> detecting that an SSD is no longer a reliable write target? The chance of
>> silent data corruption is well documented for spinning disks. What chance
>> of data corruption does this introduce with up to 10 seconds of data
>> written on the SSD? Does ZFS read the ZIL during a scrub to determine if
>> our SSD is returning what we write to it?
>
> If the ZIL code gets a block write failure it will force the txg to commit
> before returning. It will depend on the drivers and IO subsystem as to how
> hard it tries to write the block.
>
>> Zpool versions 19 and higher should be able to survive a ZIL failure, only
>> losing the uncommitted data. However, I haven't seen good enough
>> information that I would necessarily trust this yet.
>
> This has been available for quite a while and I haven't heard of any bugs
> in this area.
>
>> Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs.
>> I'm not sure if that is current, but I can't find any reports of better
>> performance. I would suspect that a DDRdrive or ZeusRAM as ZIL would push
>> past this.
>
> 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device.
For example, if you have a device that can achieve 700 MB/sec, but lots of
clients accessing the server via 10GbE can generate 1 GB/sec of writes, then
it should be immediately obvious that the slog needs to be striped.
Empirically, this is also easy to measure.
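To stripe the slog you simply add more than one log device (or more than one
mirrored pair); as Neil describes above, ZIL blocks are then allocated round
robin across them. A minimal sketch, assuming a hypothetical pool named
"tank" and made-up device names:

    # add two mirrored slog pairs; ZIL writes will round-robin across both pairs
    zpool add tank log mirror c4t0d0 c4t1d0 mirror c4t2d0 c4t3d0

    # confirm the layout, then watch per-log-device bandwidth while the
    # synchronous workload is running
    zpool status tank
    zpool iostat -v tank 1

If each log device sits near its individual limit while the clients still
want more, add another pair; a script such as zilstat can also show how much
data is actually being pushed through the ZIL.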
 -- richard

>
>> Anyone care to post their performance numbers on current hardware with E5
>> processors, and RAM-based ZIL solutions?
>>
>> Thanks to everyone who has responded and contacted me directly on this
>> issue.
>>
>> -Chip
>>
>> On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel
>> <andrew.gabr...@cucumber.demon.co.uk> wrote:
>> Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Schweiss, Chip
>>
>> How can I determine for sure that my ZIL is my bottleneck? If it is the
>> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the
>> ZIL to make it faster? Or should I be looking for a DDRdrive, ZeusRAM,
>> etc.?
>>
>> Temporarily set sync=disabled
>> Or, depending on your application, leave it that way permanently. I know,
>> for the work I do, most systems I support at most locations have
>> sync=disabled. It all depends on the workload.
>>
>> Noting of course that this means that in the case of an unexpected system
>> outage or loss of connectivity to the disks, synchronous writes since the
>> last txg commit will be lost, even though the applications will believe
>> they are secured to disk. (The ZFS filesystem won't be corrupted, but it
>> will look like it's been wound back by up to 30 seconds when you reboot.)
>>
>> This is fine for some workloads, such as those where you would start again
>> with fresh data and those which can look closely at the data to see how
>> far they got before being rudely interrupted, but not for those which rely
>> on the POSIX semantics of synchronous writes/syncs meaning data is secured
>> on non-volatile storage when the function returns.
>>
>> --
>> Andrew
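One way to answer "is the ZIL my bottleneck?" without permanently giving up
the guarantees Andrew describes is to flip sync=disabled on only the dataset
under test, measure, and put it back. A rough sketch, assuming a hypothetical
dataset tank/nfs:

    # check the current setting first
    zfs get sync tank/nfs

    # disable synchronous semantics for the measurement window only
    zfs set sync=disabled tank/nfs

    # (run the client workload here and record the throughput)

    # restore the inherited (default: standard) behavior afterwards
    zfs inherit sync tank/nfs

If throughput jumps with sync=disabled, the slog (or the lack of one) is the
limit and faster or striped log devices will help; if it barely moves, look
at the network, the pool vdevs, or the clients before spending money on
ZeusRAM-class devices.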
--
richard.ell...@richardelling.com
+1-760-896-4422

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss