Hi Mark,

I doubt read-only mode would help here.
Log replay is required to build a consistent store state, and it can't be bypassed. And it looks like your drive/controller is still detecting some errors while reading.
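Before going further it might be worth checking how bad the damage actually is. Something along these lines should work with the ceph-bluestore-tool that ships with Luminous (paths and flags below are from memory - double-check them against the tool's usage output):

  # stop the OSD first, then check the store without modifying it
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-74

  # deep mode also verifies object data checksums (much slower)
  ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-74

  # for the OSDs hitting the csum error, bluefs itself may still mount,
  # so the RocksDB files can be dumped for offline inspection
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-74 \
      --out-dir /tmp/osd-74-bluefs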
For the second issue, this PR might help once it's merged (you'll be able to disable csum verification, and hopefully the OSD will start): https://github.com/ceph/ceph/pull/26247. For now, IMO the only way to go for you is a custom build with this patch applied manually.
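A rough sketch of what that custom build might look like - note the PR targets master, so it may not apply cleanly to your Luminous branch and you may have to resolve conflicts by hand:

  # grab the sources matching your running release
  git clone https://github.com/ceph/ceph.git
  cd ceph
  git checkout v12.2.12        # substitute your exact Luminous version
  git submodule update --init --recursive

  # apply the PR on top (github serves a patch file for every PR)
  curl -L https://github.com/ceph/ceph/pull/26247.patch | git am -3

  # build just the OSD binary
  ./install-deps.sh
  ./do_cmake.sh
  cd build && make -j$(nproc) ceph-osd

  # then enable the new option for the affected OSDs in ceph.conf;
  # the option name below is my guess - verify against the final patch:
  #   [osd]
  #   bluestore ignore data csum = true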
Thanks,
Igor

On 7/9/2019 5:02 PM, Mark Lehrer wrote:
My main question is this - is there a way to stop any replay or journaling during OSD startup and bring up the pool/fs in read-only mode?

Here is a description of what I'm seeing. I have a Luminous cluster with CephFS and 16x 8TB SSDs, using size=3. I had a problem with one of my SAS controllers, and now I have at least 3 OSDs that refuse to start. The hardware appears to be fine now. I have my essential data backed up, but there are a few files that I wouldn't mind saving, so I want to use this as disaster recovery practice.

The two problems I am seeing are:

1) On two of the OSDs, there is a startup replay error after successfully replaying quite a few blocks:

2019-07-06 16:08:05.281063 7f6baec66e40 10 bluefs _replay 0x1543000: stop: uuid c366a2d6-e221-98b3-59fe-0f324c9dac8e != super.uuid 263428d5-8963-4339-8815-92ab6067e7a4
2019-07-06 16:08:05.281064 7f6baec66e40 10 bluefs _replay log file size was 0x1543000
2019-07-06 16:08:05.281085 7f6baec66e40 -1 bluefs _replay file with link count 0: file(ino 1485 size 0x15f4c43 mtime 2019-07-04 20:39:39.387601 bdev 1 allocated 1600000 extents [1:0x35771500000+100000,1:0x35771600000+100000,1:0x35771700000+100000,1:0x35771c00000+100000,1:0x35771d00000+100000,1:0x35772200000+100000,1:0x35772300000+100000,1:0x35772800000+100000,1:0x35772900000+100000,1:0x35772a00000+100000,1:0x35772b00000+100000,1:0x35772c00000+100000,1:0x35772d00000+100000,1:0x35772e00000+100000,1:0x35773300000+100000,1:0x35773400000+100000,1:0x35773500000+100000,1:0x35773600000+100000,1:0x35773700000+100000,1:0x35773800000+100000,1:0x35773900000+100000,1:0x35773a00000+100000])
2019-07-06 16:08:05.281093 7f6baec66e40 -1 bluefs mount failed to replay log: (5) Input/output error

2) The following error happens on at least two OSDs:

2019-07-06 15:58:46.621008 7fdcee030e40 -1 bluestore(/var/lib/ceph/osd/ceph-74) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x147db0c5, expected 0x8f052c9, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#

The system was archiving some unimportant files when it went down, so I really don't care about any of the recent writes. What are my recovery options here? I was thinking that turning off replaying and running in read-only mode would be feasible, but maybe there are better options?

Thanks,
Mark
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com