Rustam wrote:
> Today my production server crashed 4 times. THIS IS NIGHTMARE!
> Self-healing file system?! For me ZFS is SELF-KILLING filesystem.
>
> I cannot fsck it, there's no such tool. I cannot scrub it, it crashes
> 30-40 minutes after scrub starts. I cannot use it, it crashes a
> number of times every day! And with every crash number of checksum
> failures is growing:
>
>         NAME        STATE     READ WRITE CKSUM
>         box5        ONLINE       0     0     0
> ...after a few hours...
>         box5        ONLINE       0     0     4
> ...after a few hours...
>         box5        ONLINE       0     0    62
> ...after another few hours...
>         box5        ONLINE       0     0   120
> ...crash! and we start again...
>         box5        ONLINE       0     0     0
> ...etc...
>
> actually 120 is record, sometimes it crashed as soon as it boots.
>
> and always there's a permanent error:
> errors: Permanent errors have been detected in the following files:
>         box5:<0x0>
>
> and very wise self-healing advice:
> http://www.sun.com/msg/ZFS-8000-8A
> Restore the file in question if possible. Otherwise restore the
> entire pool from backup.
>
> Thanks, but if I restore it from backup it won't be ZFS anymore,
> that's for sure.

That's a bit harsh. ZFS is telling you, based on the checksums, that
you have corrupted data. Other types of filesystems would likely pass
the corrupted data on silently.

> It's not I/O problem. AFAIK, default ZFS I/O error behavior is "wait"
> to repair (i've 10U4, non-configurable). Then why it panics?

Do you have the panic messages? ZFS won't panic because of bad
checksums. By default it will panic if it can't write data out to any
device, if it completely loses access to non-redundant devices, or if
it loses both redundant devices at the same time.

> Recently there were discussions on failure of OpenSolaris community.
> Now it's been more than half a month since I reported such an error.
> Nobody even posted something like "RTFM". Come on guys, I know you
> are there and busy with enterprise customers... but at least give me
> some troubleshooting ideas. i'm totally lost.
>
> just to remind, it's heavily loaded fs with 3-4 million files and
> folders.
>
> Link to original post:
> http://www.opensolaris.org/jive/thread.jspa?threadID=57425

That post seems to show the same number of checksum errors across 2
different channels and 4 different drives. Given that, I'd assume this
is likely a dual-channel HBA of some sort. It would appear that you
have either bad hardware or some sort of driver issue.

Regards,
Phil
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss