Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-09-03 Thread Darren J Moffat
On 26/08/2010 15:42, David Magda wrote: Does a scrub go through the slog and/or L2ARC devices, or only the "primary" storage components? A scrub traverses datasets including the ZIL, thus the scrub will read (and, if needed, resilver) on a slog device too. http://src.opensolaris.org/source/xref

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-27 Thread George Wilson
Bob Friesenhahn wrote: On Thu, 26 Aug 2010, George Wilson wrote: What gets "scrubbed" in the slog? The slog contains transient data which exists for only seconds at a time. The slog is quite likely to be empty at any given point in time. Bob Yes, the typical ZIL block never lives long e

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-27 Thread Bob Friesenhahn
On Thu, 26 Aug 2010, George Wilson wrote: David Magda wrote: On Wed, August 25, 2010 23:00, Neil Perrin wrote: Does a scrub go through the slog and/or L2ARC devices, or only the "primary" storage components? A scrub will go through slogs and primary storage devices. The L2ARC device is cons

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson
Edward Ned Harvey wrote: Add to that: During scrubs, perform some reads on log devices (even if there's nothing to read). We do read from log device if there is data stored on them. In fact, during scrubs, perform some reads on every device (even if it's actually empty.) Reading from the d

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson
David Magda wrote: On Wed, August 25, 2010 23:00, Neil Perrin wrote: Does a scrub go through the slog and/or L2ARC devices, or only the "primary" storage components? A scrub will go through slogs and primary storage devices. The L2ARC device is considered volatile and data loss is not possibl

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov
I see, thank you for the clarification. So it is possible to have something equivalent to main storage self-healing on ZIL, with ZIL-scrub to activate it. Or is that already implemented also? (Sorry for asking these obvious questions, but I'm not famil

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread David Magda
On Wed, August 25, 2010 23:00, Neil Perrin wrote: > On 08/25/10 20:33, Edward Ned Harvey wrote: > >> It's commonly stated, that even with log device removal supported, the >> most common failure mode for an SSD is to blindly write without reporting >> any errors, and only detect that the device is

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Darren J Moffat
On 26/08/2010 15:08, Saso Kiselkov wrote: If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum It is NOT circular since that implies limited number of entries that get overwritten. is detected, it is ta

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Actually, I can't read ZFS code, so the following assumptions are more or less based on brainware; excuse me in advance :) How does ZFS detect "up to date" ZILs? With the txg check of the uberblock, right? In our corruption case, we had 2 valid uberblocks at the end and ZFS used those t

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov
If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum is detected, it is taken to be the end of the log, but this kind of defeats the checksum's original purpose, w

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Markus Keil
Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corrupted area is lost, because the checksum of the first corrupted block doesn't match? Regards, Markus Neil Perrin wrote on 23 August 2010 at 19:44: > This is a consequ

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock
On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote: > > 1) ZIL needs to report truncated transactions on ZIL corruption As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the "last" ZIL block without incurring additional writes for ev

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock
On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote: > * After introduction of ldr, before this bug fix is available, it is > pointless to mirror log devices. That's a bit of an overstatement. Mirrored logs protect against a wide variety of failure modes. Neil just isn't sure if it does the r

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of StorageConcepts > > So I would say there are 2 bugs / missing features in this: > > 1) ZIL needs to report truncated transactions on ZIL corruption > 2) ZIL should need mirrored counterpart to re

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
> From: Neil Perrin [mailto:neil.per...@oracle.com] > > Hmm, I need to check, but if we get a checksum mismatch then I don't > think we try other > mirror(s). This is automatic for the 'main pool', but of course the ZIL > code is different > by necessity. This problem can of course be fixed. (It w

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread StorageConcepts
Hello, actually this is bad news. I always assumed that the mirror redundancy of the ZIL can also be used to handle bad blocks on the ZIL device (just as the main pool self-healing does for data blocks). I actually don't know how SSDs "die"; because of the "wear out" characteristics I can think

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Neil Perrin
On 08/25/10 20:33, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-25 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Neil Perrin > > This is a consequence of the design for performance of the ZIL code. > Intent log blocks are dynamically allocated and chained together. > When reading the intent log we read ea

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin
On 08/23/10 13:12, Markus Keil wrote: Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corrupted area is lost, because the checksum of the first corrupted block doesn't match? - Yes, but you wouldn't want to replay the followin

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-23 Thread Neil Perrin
This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then t
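
The behaviour described in this thread (chained blocks, each verified against a checksum embedded in the same block, with a mismatch taken as the end of the log) can be illustrated with a small sketch. This is not the real ZIL on-disk format or code: the block layout, the use of SHA-256 as a stand-in checksum, and the names make_block, build_device, corrupt and read_chain are all assumptions made for illustration.

```python
import hashlib
import struct

# Hypothetical block layout, for illustration only (NOT the real ZIL format):
#   32-byte SHA-256 of the body | body = 8-byte next-block index + payload
CHECKSUM_LEN = 32
NEXT_LEN = 8


def make_block(payload: bytes, next_index: int) -> bytes:
    """Build a block whose checksum is embedded in the block itself."""
    body = struct.pack(">Q", next_index) + payload
    return hashlib.sha256(body).digest() + body


def build_device(payloads):
    """Chain the payloads as blocks 0, 1, 2, ... on a dict-backed 'device'.

    The block after the last one is allocated but never written, so it holds
    stale bytes (zeros here) that will not match any embedded checksum.
    """
    device = {i: make_block(p, i + 1) for i, p in enumerate(payloads)}
    device[len(payloads)] = bytes(CHECKSUM_LEN + NEXT_LEN + 8)
    return device


def corrupt(device, index):
    """Flip bits in the last byte of one block to simulate silent corruption."""
    block = bytearray(device[index])
    block[-1] ^= 0xFF
    device[index] = bytes(block)


def read_chain(device, start=0):
    """Walk the chain and return the payloads read before the chain 'ends'.

    A missing block or a checksum mismatch is treated as the end of the
    chain. The reader cannot tell a never-written block from a damaged one.
    """
    records = []
    index = start
    while True:
        block = device.get(index)
        if block is None or len(block) < CHECKSUM_LEN + NEXT_LEN:
            break
        checksum, body = block[:CHECKSUM_LEN], block[CHECKSUM_LEN:]
        if hashlib.sha256(body).digest() != checksum:
            break  # indistinguishable from the normal end of the log
        (index,) = struct.unpack(">Q", body[:NEXT_LEN])
        records.append(body[NEXT_LEN:])
    return records


if __name__ == "__main__":
    dev = build_device([b"write A", b"write B", b"write C"])
    print(read_chain(dev))  # all three records are returned
    corrupt(dev, 1)
    print(read_chain(dev))  # only the first record; B and C are dropped silently
```

In this sketch, the reader has no separate marker for the last valid block, so a damaged block in the middle of the chain looks exactly like the unwritten block at the end. That matches the trade-off discussed above: reporting the truncation would require extra information to be written somewhere.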