Rustam wrote:
> Today my production server crashed  4 times. THIS IS NIGHTMARE! 
> Self-healing file system?! For me ZFS is SELF-KILLING filesystem.
> 
> I cannot fsck it, there's no such tool. I cannot scrub it, it crashes
> 30-40 minutes after scrub starts. I cannot use it, it crashes a
> number of times every day! And with every crash number of checksum
> failures is growing:
> 
> NAME        STATE     READ WRITE CKSUM
> box5        ONLINE       0     0     0
> ...after a few hours...
> box5        ONLINE       0     0     4
> ...after a few hours...
> box5        ONLINE       0     0    62
> ...after another few hours...
> box5        ONLINE       0     0   120
> ...crash! and we start again...
> box5        ONLINE       0     0     0
> ...etc...
> 
> actually 120 is record, sometimes it crashed as soon as it boots.
> 
> and always there's a permanent error: errors: Permanent errors have
> been detected in the following files: box5:<0x0>
> 
> and very wise self-healing advice: http://www.sun.com/msg/ZFS-8000-8A
>  Restore the file in question if possible.  Otherwise restore the
> entire pool from backup.
> 
> Thanks, but if I restore it from backup it won't be ZFS anymore,
> that's for sure.

        That's a bit harsh.  ZFS is telling you that you have corrupted data 
based on the checksums.  Other types of filesystems would likely simply 
pass the corrupted data on silently.

> It's not I/O problem. AFAIK, default ZFS I/O error behavior is "wait"
> to repair (i've 10U4, non-configurable). Then why it panics?

        Do you have the panic messages?  ZFS won't panic because of bad 
checksums.  By default it will panic if it can't write data out to any 
device, if it completely loses access to non-redundant devices, or if it 
loses both redundant devices at the same time.

> Recently there were discussions on failure of OpenSolaris community.
> Now it's been more than half a month since I reported such an error.
> Nobody even posted something like "RTFM". Come on guys, I know you
> are there and busy with enterprise customers... but at least give me
> some troubleshooting ideas. i'm totally lost.
> 
> just to remind, it's heavily loaded fs with 3-4 million files and
> folders.
> 
> Link to original post: 
> http://www.opensolaris.org/jive/thread.jspa?threadID=57425

        This seems to show the same number of checksum errors across 2 
different channels and 4 different drives.  Given that, I'd assume that 
this is likely a dual-channel HBA of some sort.  It would appear that 
you either have bad hardware or some sort of driver issue.
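
        If you want to check for that pattern mechanically, here is a minimal 
Python sketch.  It assumes the usual NAME/STATE/READ/WRITE/CKSUM row layout 
of `zpool status` and Solaris-style cNtNdN device names, and it treats the 
controller number as the "channel", which is my assumption.  The sample 
text, device names, counts and function names are all hypothetical, not 
taken from your pool.

```python
import re

# Hypothetical sample in the layout of `zpool status` config rows; the device
# names and counts are made up for illustration, not taken from the real pool.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0   120
          mirror    ONLINE       0     0    60
            c1t0d0  ONLINE       0     0    60
            c2t0d0  ONLINE       0     0    60
          mirror    ONLINE       0     0    60
            c1t1d0  ONLINE       0     0    60
            c2t1d0  ONLINE       0     0    60
"""

# Matches rows whose name looks like a Solaris disk (cNtNdN, optional slice).
# Group 2 captures the controller number, used here as a stand-in for "channel".
ROW = re.compile(
    r"^\s*(c(\d+)t\d+d\d+(?:s\d+)?)\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def leaf_cksums(status_text):
    """Return {device: (controller, cksum)} for rows that look like disks."""
    result = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            dev, ctrl, _read, _write, cksum = m.groups()
            result[dev] = (int(ctrl), int(cksum))
    return result


def looks_like_shared_fault(status_text):
    """True when disks on two or more controllers all report the same
    nonzero CKSUM count, which hints at a shared component (HBA, driver)
    rather than several disks failing independently."""
    leaves = leaf_cksums(status_text)
    counts = {cksum for _ctrl, cksum in leaves.values()}
    controllers = {ctrl for ctrl, _cksum in leaves.values()}
    return (
        len(leaves) >= 2
        and len(controllers) >= 2
        and len(counts) == 1
        and counts != {0}
    )


if __name__ == "__main__":
    print(leaf_cksums(SAMPLE))
    print(looks_like_shared_fault(SAMPLE))
```

Identical counts on every disk are only a hint, not proof, so treat a True 
result as a reason to look at the HBA, cabling and driver first.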

Regards,
Phil

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
