Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-28 Thread Frank Middleton
Thanks to everyone who made suggestions! This machine has run memtest for a week and VTS for several days with no errors. It does seem that the problem is probably in the CPU cache.
On 03/24/10 10:07 AM, Damon Atkins wrote:
> You could try copying the file to /tmp (ie swap/ram) and do a continuous

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Richard Elling
On Mar 23, 2010, at 11:21 PM, Daniel Carosone wrote: > On Tue, Mar 23, 2010 at 07:22:59PM -0400, Frank Middleton wrote: >> On 03/22/10 11:50 PM, Richard Elling wrote: >> >>> Look again, the checksums are different. >> >> Whoops, you are correct, as usual. Just 6 bits out of 256 different... >>

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Damon Atkins
You could also use psradm to take a CPU off-line. At boot I would ??assume?? the system boots the same way every time unless something changes, so you could be hitting the same CPU core every time, or the same bit of RAM, until booted fully. Or even run SunVTS "Validation Test Suite" which I beliv

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Saso Kiselkov
How about running memtest86+ (http://www.memtest.org/) on the machine for a while? It doesn't test the arithmetics on the CPU very much, but it stresses data paths quite a lot. Just a quick suggestion... -- Saso Damon Atkins wrote: > You could try

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Damon Atkins
You could try copying the file to /tmp (i.e. swap/RAM) and running a continuous loop of checksums, e.g. while [ ! -f libdlpi.so.1.x ] ; do sleep 1; cp libdlpi.so.1 libdlpi.so.1.x ; A="`sha512sum -b libdlpi.so.1.x`" ; [ "$A" = "what it should be libdlpi.so.1.x" ] && rm libdlpi.so.1.x ; done ; date Ass
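For readers who want to try the loop above, here is a minimal sketch of the same idea as a reusable shell function. The function name `check_copies`, its argument order, and the fixed iteration count are assumptions made for illustration and are not from the original message; it also assumes a `sha512sum` command is on the PATH, as the one-liner does.

```shell
#!/bin/sh
# Sketch only: check_copies and its arguments are hypothetical names.
# Usage: check_copies FILE EXPECTED_SHA512_HEX ITERATIONS [WORKDIR]
# Copies FILE into WORKDIR (default /tmp) repeatedly and hashes each copy.
# Returns 0 if every copy matched, 1 on the first mismatch (the bad copy is
# left in place for inspection), 2 if the copy itself failed.
check_copies() {
    src=$1
    expected=$2
    iterations=$3
    workdir=${4:-/tmp}
    copy="$workdir/$(basename "$src").x"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        cp "$src" "$copy" || return 2
        # sha512sum prints "<hex> *<name>"; keep only the hex digest
        actual=$(sha512sum -b "$copy" | cut -d' ' -f1)
        if [ "$actual" != "$expected" ]; then
            echo "mismatch on pass $i: $actual" >&2
            return 1
        fi
        rm -f "$copy"
        i=$((i + 1))
    done
    return 0
}
```

Something like `check_copies libdlpi.so.1 "<known-good digest>" 1000; date` would then stop at the first bad copy and print when it happened, which is what the one-liner is after.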

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-23 Thread Daniel Carosone
On Tue, Mar 23, 2010 at 07:22:59PM -0400, Frank Middleton wrote: > On 03/22/10 11:50 PM, Richard Elling wrote: > >> Look again, the checksums are different. > > Whoops, you are correct, as usual. Just 6 bits out of 256 different... > > Look which bits are different - digits 24, 53-56 in both cas

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-23 Thread Frank Middleton
On 03/22/10 11:50 PM, Richard Elling wrote:
> Look again, the checksums are different.
Whoops, you are correct, as usual. Just 6 bits out of 256 different...
Last year
expected 4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a
actual 4a027c11b3ba4cec bf274567d5615b7b 3ef5
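Counting how many bits differ between an expected and an actual checksum can be done by XOR-ing the two hex strings digit by digit. The sketch below is one way to do that in plain shell; the function name `count_diff_bits` is an assumption for illustration. Only the first two 64-bit words of the actual checksum survive in this snippet, so the usage note compares just the second word rather than the full 256 bits.

```shell
#!/bin/sh
# Sketch only: count_diff_bits is a hypothetical helper name.
# Usage: count_diff_bits HEX_A HEX_B
# Prints the number of bit positions in which two equal-length hex strings
# differ. Returns 1 if the lengths do not match.
count_diff_bits() {
    a=$1
    b=$2
    [ "${#a}" -eq "${#b}" ] || return 1
    total=0
    while [ -n "$a" ]; do
        da=${a%"${a#?}"}    # first hex digit of a
        db=${b%"${b#?}"}    # first hex digit of b
        a=${a#?}            # drop the digit just read
        b=${b#?}
        x=$(( 0x$da ^ 0x$db ))
        # count the set bits in this 4-bit difference
        while [ "$x" -gt 0 ]; do
            total=$(( total + (x & 1) ))
            x=$(( x >> 1 ))
        done
    done
    echo "$total"
}
```

For the second word shown above, `count_diff_bits bf274565d5615b7b bf274567d5615b7b` prints 1: the only differing digit is 5 versus 7, which is a single flipped bit.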

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-22 Thread Richard Elling
On Mar 22, 2010, at 4:21 PM, Frank Middleton wrote: > On 03/21/10 03:24 PM, Richard Elling wrote: > >> I feel confident we are not seeing a b0rken drive here. But something is >> clearly amiss and we cannot rule out the processor, memory, or controller. > > Absolutely no question of that, other

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-22 Thread Frank Middleton
On 03/21/10 03:24 PM, Richard Elling wrote:
> I feel confident we are not seeing a b0rken drive here. But something is clearly amiss and we cannot rule out the processor, memory, or controller.
Absolutely no question of that, otherwise this list would be flooded :-). However, the purpose of th

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-21 Thread Richard Elling
On Mar 21, 2010, at 11:03 AM, Frank Middleton wrote: > On 03/15/10 01:01 PM, David Dyer-Bennet wrote: > >> This sounds really bizarre. > > Yes, it is. But CR 6880994 is bizarre too. Rolling back to a conversation with Frank last fall, here is the output of fmdump which shows the single bit flip.

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-21 Thread Frank Middleton
On 03/15/10 01:01 PM, David Dyer-Bennet wrote: This sounds really bizarre. Yes, it is. But CR 6880994 is bizarre too. One detail suggestion on checking what's going on (since I don't have a clue towards a real root-cause determination): Get an md5sum on a clean copy of the file, say from a n

Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-15 Thread David Dyer-Bennet
On Sun, March 14, 2010 13:54, Frank Middleton wrote: > > How can it even be remotely possible to get a checksum failure on mirrored > drives > with copies=2? That means all four copies were corrupted? Admittedly this > is > on a grotty PC with no ECC and flaky bus parity, but how come the same >

[zfs-discuss] CR 6880994 and pkg fix

2010-03-14 Thread Frank Middleton
Can anyone say what the status of CR 6880994 (kernel/zfs Checksum failures on mirrored drives) might be? Setting copies=2 has mitigated the problem, which manifests itself consistently at boot by flagging libdlpi.so.1, but two recent power cycles in a row with no normal shutdown have resulted in