zpool status reports corruption and checksum errors. I have not run a scrub since the corruption was detected, but we did replace the bad RAID controller, and subsequent write tests to that OST have been fine. We haven't seen any change in the error counts since the new RAID controller went in.
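For reference, these are the relevant zpool commands as I understand them; the pool name here is just a placeholder for whatever actually backs that OST:

zpool status -v ost0002pool    # what is reporting the corruption and per-vdev checksum counters
zpool clear ost0002pool        # would reset those counters now that the controller has been replaced
zpool scrub ost0002pool        # the scrub we have not kicked off yet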
We're observing two types of errors. The first: when we attempt a long listing of a file to get its metadata, the client returns "cannot allocate memory". On the OSS in question, it's logged as:
============
LustreError: 10394:0:(ldlm_resource.c:1188:ldlm_resource_get()) odyssey-OST0002: lvbo_init failed for resource 0x8ccfa8:0x0: rc = -5
LustreError: 8855:0:(osd_object.c:409:osd_object_init()) odyssey-OST0002: lookup [0x100000000:0x8ccf64:0x0]/0x78ed06 failed: rc = -5
============
As far as we can tell, this primarily affects recently written files, and we're currently using robinhood to generate a listing of the files with objects on OST0002 so we can check all of them for this particular error.
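The verification pass we have in mind once we have that listing is only a sketch along these lines; the mount point is a placeholder, lfs find is just shown as an alternative way to build the same list, and the exact --ost argument syntax (index vs. UUID) varies a bit between Lustre versions:

lfs find /mnt/odyssey --ost 2 -type f > ost0002_files.txt    # files with at least one object on OST0002
while read -r f; do
    ls -ld -- "$f" >/dev/null 2>&1 || echo "SUSPECT: $f"     # a long listing is what triggers the "cannot allocate memory" / lvbo_init error
done < ost0002_files.txt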
The second error: attempts to read a few of our larger files on that OST result in I/O errors after a partial read. I'm not sure why the bad RAID controller would have caused this, since the two files we're aware of weren't being written to at the time.
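To make that second failure mode concrete, the kind of triage we have in mind for one of the affected files looks roughly like this; the path is a placeholder and the final dd is only what we're considering, not something we've run:

lfs getstripe /mnt/odyssey/path/to/bigfile                                  # shows which OST object backs each stripe
dd if=/mnt/odyssey/path/to/bigfile of=/dev/null bs=1M 2>&1 | tail -n 3      # read until the I/O error to find the failing offset
dd if=/mnt/odyssey/path/to/bigfile of=bigfile.partial bs=1M conv=noerror,sync   # salvage what is readable, padding unreadable blocks with zeros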
I'm interested to learn a bit more about these particular Lustre errors and the rc = -5 (EIO) return code, and what our most likely recovery options are.
Best,
Jesse