On Tue, May 31, 2011 at 11:25:56AM +0200, Olaf Seibert wrote:
> On Mon 30 May 2011 at 12:19:10 -0500, Dan Nelson wrote:
> > The ZFS compression code will panic if it can't allocate the buffer needed
> > to store the compressed data, so that's unlikely to be your problem.  The
> > only time I have seen an "illegal byte sequence" error was when trying to
> > copy raw disk images containing ZFS pools to different disks, and the
> > destination disk was a different size than the original.  I wasn't even
> > able to import the pool in that case, though.
>
> Yet somehow some incorrect data got written, it seems.  That never
> happened before, fortunately, even though we had crashes before that
> seemed to be related to ZFS running out of memory.
>
> > The zfs IO code overloads the EILSEQ error code and uses it as a "checksum
> > error" code.  Returning that error for the same block on all disks is
> > definitely weird.  Could you have run a partitioning tool, or some other
> > program that would have done direct writes to all of your component disks?
>
> I hope I would remember doing that if I did!
>
> > Your scrub is also a bit worrying - 24k checksum errors definitely
> > shouldn't occur during normal usage.
>
> It turns out that the errors are easy to provoke: they happen every time
> I do an ls of the affected directories.  There were processes running
> that were likely to be trying to write to the same directories (the file
> system is exported over NFS), so in that case it is easy to imagine that
> the numbers rack up quickly.
>
> I moved those directories to the side, for the moment, but I haven't
> been able to delete them yet.  The data is a bit bigger than we're able
> to back up, so "just restoring a backup" isn't an easy thing to do.
> Possibly I could make a new filesystem in the same pool, if that would
> do the trick; it isn't more than 50% full, but the affected one is the
> biggest filesystem in it.
>
> The end result of the scrub is as follows:
>
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed after 12h56m with 3 errors on Mon May 30 23:56:47 2011
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0 6.38K
>           raidz2    ONLINE       0     0 25.4K
>             da0     ONLINE       0     0     0
>             da1     ONLINE       0     0     0
>             da2     ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         tank/vol-fourquid-1:<0x0>
>         tank/vol-fourquid-1@saturday:<0x0>
>         /tank/vol-fourquid-1/.zfs/snapshot/saturday/backups/dumps/dump_usr_friday.dump
>         /tank/vol-fourquid-1/.zfs/snapshot/saturday/sverberne/CLEF-IP11/parts_abs+desc
>         /tank/vol-fourquid-1/.zfs/snapshot/sunday/sverberne/CLEF-IP11/parts_abs+desc
>         /tank/vol-fourquid-1/.zfs/snapshot/monday/sverberne/CLEF-IP11/parts_abs+desc
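For what it's worth, once those directories (and the snapshots holding
them) are out of the way, the cleanup sequence I'd try is roughly the
below.  This is a sketch only, using the dataset/snapshot names from your
scrub output above; the <0x0> entries are dataset-level metadata objects
rather than regular files, so there's no guarantee this makes them go
away:

  # re-check the per-file error list and the counters
  zpool status -v tank

  # drop the snapshots that reference the corrupted files
  # (assuming nothing you still need exists only in them)
  zfs destroy tank/vol-fourquid-1@saturday
  zfs destroy tank/vol-fourquid-1@sunday
  zfs destroy tank/vol-fourquid-1@monday

  # reset the error counters, then verify with a fresh scrub
  zpool clear tank
  zpool scrub tank

If tank/vol-fourquid-1:<0x0> still shows up after another clean scrub,
then recreating that filesystem in the same pool, as you suggested, is
probably the only clean way out.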
Mickael Maillot responded to this thread, pointing out that situations
like this could be caused by bad RAM.  I admit that's a possibility; with
ZFS in use, the most likely memory-utilising piece (meaning volume-wise)
of the system would be the ZFS ARC.  I don't know if you'd necessarily
see things like sig11's on random daemons, etc. (it often depends on
where in the address range the bad DRAM chip is mapped).

Can you rule out bad RAM by letting something like memtest86+ run for
12-24 hours?  It's not a 100% infallible utility, but for simple problems
it will usually detect/report errors within the first 15-30 minutes.

Please keep in mind that even if you have ECC RAM, testing with
memtest86+ is still worthwhile.  Single-bit errors are correctable by
ECC, while multi-bit errors aren't (but are detectable).  "ChipKill" (see
Wikipedia) might work around this problem, but I've never personally used
it (I've never seen it on any Intel systems I've used, only AMD systems).

Finally, depending on what CPU model you have, northbridge problems
(older systems) or on-die MCH problems (newer CPUs, e.g. Core iX and
recent Xeons) could manifest themselves like this.  However, in those
situations I'd imagine you'd be seeing a lot of other oddities on the
system, not limited to just ZFS.  Newer systems which support MCA (again
see Wikipedia: Machine Check Architecture) would/should throw MCEs, which
FreeBSD 8.x should absolutely notice/report (you'd see a lot of
nastygrams on the console; a quick way to check is in the P.S. below).

I think that about does it for my ideas/blabbing on that topic.

-- 
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP 4BD6C0CB |
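P.S. A quick way to see whether the box has already logged any machine
checks under FreeBSD is something along these lines (the sysctl names are
from the 8.x MCA code and may vary slightly by release):

  # any machine-check messages since boot?
  dmesg | grep -Ei 'mca|machine check'

  # how many MCA records the kernel has collected, and whether
  # machine-check handling is enabled at all
  sysctl hw.mca.count hw.mca.enabled

A clean result there doesn't prove the RAM is good, of course;
memtest86+ is still the better test.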