We discussed a course of action this morning and decided that we'd start by migrating the files off the OST. Testing suggests that files which cannot be completely read will be left behind on OST0002.
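
Roughly, the migration would be something like the following, assuming a filesystem named "lustrefs" mounted at /lustre (placeholder names; adjust to the actual setup):

# on the MDS: stop new object allocations on the affected OST
lctl set_param osp.lustrefs-OST0002-osc-MDT0000.max_create_count=0

# on a client: drain files with objects on OST0002 onto the remaining OSTs
lfs find /lustre --obd lustrefs-OST0002_UUID | lfs_migrate -y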

Due to the nature of the corruption - a faulty hardware RAID controller - it seems unlikely we'll be able to meaningfully recover any of the files that were corrupted. We may evaluate this more closely once the lfs_migrate is complete and we have our file list.
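
Once the lfs_migrate pass is done, a sketch for producing that file list would be to record whatever is still stored on OST0002, e.g. (same placeholder names as above):

lfs find /lustre --obd lustrefs-OST0002_UUID > /some/where/ost0002_remaining_files.txt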

We'll then share the list of corrupted files with our users and assess the cost of the lost data. If the data is reasonably reproducible, we'll reinitialize the RAID array and reformat the vdev.

Thanks for your help, Tom!

Best,
Jesse Stroik



On 12/12/2016 03:51 PM, Crowe, Tom wrote:
Hi Jesse,

Regarding seeing 370 objects with errors from ‘zpool status’ but having over 400 files with “access issues”: I would suggest running a ‘zpool scrub’ to identify all the ZFS objects in the pool that are reporting permanent errors.
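
For example, something like the following, assuming the pool is named pool-01 as in the example further down:

zpool scrub pool-01
zpool status -v pool-01    # after the scrub completes, lists every file/object with permanent errors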

It would be very important to have a complete list of the files with issues before replicating the VDEV(s) in question.

You may also want to dump the zdb information for the source VDEV(s) with the 
following:

zdb -dddddd source_pool/source_vdev > /some/where/with/room

For example, if the zpool was named pool-01, and the VDEV was named lustre-0001 
and you had free space in a filesystem named /home:

zdb -dddddd pool-01/lustre-0001 > /home/zdb_pool-01_0001_20161212.out

There is a great wealth of data zdb can share about your files. Having the 
output may prove helpful down the road.
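
As a further sketch, zdb can also be pointed at a single object ID reported by ‘zpool status -v’ to dump just that object (the output should include the file's path when it can be resolved). For example, with a hypothetical object ID of 12345:

zdb -dddddd pool-01/lustre-0001 12345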

Thanks,
Tom

On Dec 12, 2016, at 4:39 PM, Jesse Stroik <jesse.str...@ssec.wisc.edu> wrote:

Thanks for taking the time to respond, Tom,


For clarification, it sounds like you are using hardware-based RAID-6 and not ZFS RAID. Is this correct? Or was the faulty card simply an HBA?


You are correct. This particular file system is still using hardware RAID6.


At the bottom of the ‘zpool status -v pool_name’ output, you may see paths and/or ZFS object IDs of the damaged/impacted files. This would be good to take note of.


Yes, I saved this output to files at a few different times, and we've seen no change since replacing the RAID controller, which makes me feel reasonably comfortable leaving the file system in production.

There are 370 objects listed by zpool status -v, but I am unable to access at least 400 files. Almost all of our files are single-stripe.


Running a ‘zpool scrub’ is a good idea. If the zpool is protected with ZFS RAID, the scrub may be able to repair some of the damage. If the zpool is not protected with ZFS RAID, the scrub will identify any other errors but likely NOT repair any of the damage.


We're not protected with ZFS RAID, just hardware RAID6. I could run a patrol on the hardware controller and then a ZFS scrub, if that makes the most sense at this point. This file system is scheduled to run a scrub the third week of every month, so it would otherwise run one this weekend.



If you have enough disk space on hardware that is behaving properly (and free space in the source zpool), you may want to replicate the VDEV(s) (OSTs) that are reporting errors. Having a replicated VDEV gives you the ability to examine the data without fear of further damage. You may also want to use the replicated VDEV(s) to extract files that are producing IO errors on the source VDEV.

Something like this for replication should work:

zfs snap source_pool/source_ost@timestamp_label
zfs send -Rv source_pool/source_ost@timestamp_label | zfs receive destination_pool/source_ost_replicated

You will need to set zfs_send_corrupt_data to 1 in /sys/module/zfs/parameters, or the ‘zfs send’ will error and fail when sending a VDEV with read and/or checksum errors.
Enabling zfs_send_corrupt_data allows the zfs send operation to complete. Any blocks that are damaged on the source side will be filled with the pattern 0x2f5baddb10c on the destination side. This can be helpful in determining whether an entire file is corrupt or only parts of it.
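
(For reference, a minimal sketch of setting that tunable at runtime would be:

echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data

and it can be set back to 0 the same way once the send has finished.)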

After the replication, you should set the replicated VDEV to read only with 
‘zfs set readonly=on destination_pool/source_ost_replicated’


Thank you for this suggestion. We'll most likely do that.

Best,
Jesse Stroik


