One of our Lustre file systems, still running Lustre 2.5.3 and ZFS 0.6.3, experienced corruption due to a bad RAID controller. The OST in question is a RAID6 volume, which we've marked inactive. Most of our Lustre clients are running 2.8.0.

zpool status reports corruption and checksum errors. I have not run a scrub since the corruption was detected, but we did replace the bad RAID controller, and subsequent write tests to that OST have been fine. The error count has not changed since the new RAID controller was installed.
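
For reference, here's a minimal sketch of the sort of check involved (the pool name "ost2pool" is a placeholder; zpool status -v lists the files/objects with permanent errors):

============
#!/usr/bin/env python3
# Sketch: dump the checksum/permanent-error state of the pool backing the OST.
# "ost2pool" is a placeholder pool name -- substitute the real one.
import subprocess

POOL = "ost2pool"

def zpool_status(pool):
    """Return the output of `zpool status -v <pool>`."""
    return subprocess.run(["zpool", "status", "-v", pool],
                          capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(zpool_status(POOL))
    # A scrub could later be kicked off with:
    #   subprocess.run(["zpool", "scrub", POOL], check=True)
============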

We're observing two types of errors. The first occurs when we attempt a long listing of a file to read its metadata: the client returns "cannot allocate memory". On the OSS in question, it's logged as:

============
LustreError: 10394:0:(ldlm_resource.c:1188:ldlm_resource_get()) odyssey-OST0002: lvbo_init failed for resource 0x8ccfa8:0x0: rc = -5
LustreError: 8855:0:(osd_object.c:409:osd_object_init()) odyssey-OST0002: lookup [0x100000000:0x8ccf64:0x0]/0x78ed06 failed: rc = -5
============

As far as we can tell, this primarily affects recently written files. We're presently using Robinhood to generate a listing of the files on OST0002 so that we can check every file for this particular error.
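
Outside Robinhood, a rough sketch of the same check might look like this (the mount point /mnt/odyssey is a placeholder; it assumes lfs find --obd to select files with objects on OST0002, and flags any file whose metadata can't be stat'd):

============
#!/usr/bin/env python3
# Sketch: walk the files striped onto the affected OST and report any whose
# metadata cannot be read (the "cannot allocate memory" case above).
# The mount point is a placeholder; the OST UUID comes from the log messages.
import os
import subprocess

MOUNT = "/mnt/odyssey"              # placeholder client mount point
OST_UUID = "odyssey-OST0002_UUID"

def files_on_ost(mount, ost_uuid):
    """Yield regular files with at least one object on the given OST (lfs find --obd)."""
    proc = subprocess.Popen(["lfs", "find", mount, "--obd", ost_uuid, "--type", "f"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        yield line.rstrip("\n")

if __name__ == "__main__":
    for path in files_on_ost(MOUNT, OST_UUID):
        try:
            os.stat(path)
        except OSError as err:
            print(f"{path}: {err.strerror}")
============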

The second error: attempts to read a few of our larger files on that OST result in I/O errors after a partial read. I'm not sure why the bad RAID controller would have caused this, since the two files we know of weren't being written to at the time.
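
To narrow down where those reads fail, something like the following sketch (path and chunk size are placeholders) reads a file in fixed-size chunks and reports the first offset that returns an I/O error, which should point at a specific stripe object:

============
#!/usr/bin/env python3
# Sketch: read a file in fixed-size chunks and report the first offset that
# fails with EIO, to help map the failure back to a stripe object on OST0002.
import errno
import os
import sys

CHUNK = 1 << 20  # 1 MiB; placeholder chunk size

def first_bad_offset(path):
    """Return the byte offset of the first failing read, or None if the file reads cleanly."""
    fd = os.open(path, os.O_RDONLY)
    offset = 0
    try:
        while True:
            try:
                data = os.pread(fd, CHUNK, offset)
            except OSError as err:
                if err.errno == errno.EIO:
                    return offset
                raise
            if not data:          # EOF without error
                return None
            offset += len(data)
    finally:
        os.close(fd)

if __name__ == "__main__":
    bad = first_bad_offset(sys.argv[1])
    print("read completed cleanly" if bad is None else f"first I/O error at byte offset {bad}")
============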

I'd like to learn a bit more about these particular Lustre errors and return codes, and what our most likely recovery options are.

Best,
Jesse
