Hi guys,

Our cluster keeps having OSDs go down due to medium errors on their disks. Our current action plan is simply to replace the defective disk drive, but I was wondering whether Ceph is being too sensitive in taking the OSD down, or whether our action plan is too simple and crude. Any advice on this issue would be appreciated.
The medium error from dmesg:

[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Sense Key : Medium Error [current]
[Sun Nov 20 15:52:10 2016] Info fld=0x235f23e0
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Add. Sense: Unrecovered read error
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm] CDB:
[Sun Nov 20 15:52:10 2016] Read(10): 28 00 23 5f 23 60 00 02 30 00
[Sun Nov 20 15:52:10 2016] end_request: critical medium error, dev sdm, sector 593437664

The OSD log always shows the OSD catching a read error right after a deep-scrub starts:

 -3> 2016-11-20 16:54:39.740795 7f71f7e75700  0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub starts
 -2> 2016-11-20 16:54:41.958706 7f71f7e75700  0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub ok
 -1> 2016-11-20 16:54:48.740180 7f71f7e75700  0 log_channel(cluster) log [INF] : 13.5c9 deep-scrub starts
  0> 2016-11-20 16:55:00.704106 7f71f7e75700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f71f7e75700 time 2016-11-20 16:55:00.699763
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f7228bad78b]
 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc58) [0x7f722898b718]
 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0x7f7228a17279]
 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x7f72289510a8]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f7228869eea]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x480) [0x7f7228870100]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f72288717ee]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x7f7228756069]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x7f7228b9e376]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7f7228b9f420]
 11: (()+0x8182) [0x7f72279ab182]
 12: (clone()+0x6d) [0x7f7225f1647d]

megacli shows a non-zero media error count for that drive:

Enclosure Device ID: 32
Slot Number: 15
Device Id: 15
Sequence Number: 2
Media Error Count: 9
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.090 TB [0x8bba0cb0 Sectors]
Non Coerced Size: 1.090 TB [0x8baa0cb0 Sectors]
Coerced Size: 1.090 TB [0x8ba80000 Sectors]
Firmware state: JBOD
SAS Address(0): 0x5000c50084f2971d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
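For context, by "replace the defective disk drive" I mean roughly the standard out / remove / re-create sequence below. This is only a sketch of what we have in mind, not a procedure we have finalized: osd.15 is just an assumed ID for the OSD backed by /dev/sdm on this host, and since we are on Hammer (0.94.5) we would use ceph-disk to rebuild it.

# double-check the drive itself before pulling it (SMART data for the raw device;
# behind the MegaRAID controller this may need -d megaraid,<N> instead)
smartctl -a /dev/sdm

# take the failing OSD out so its PGs backfill elsewhere
# (osd.15 assumed to be the OSD on /dev/sdm, i.e. slot 15 above)
ceph osd out osd.15

# once backfill finishes, stop the daemon and remove it from CRUSH/auth/osdmap
stop ceph-osd id=15            # or: systemctl stop ceph-osd@15
ceph osd crush remove osd.15
ceph auth del osd.15
ceph osd rm osd.15

# physically swap the disk, then re-create the OSD on the new drive
ceph-disk prepare /dev/sdm
ceph-disk activate /dev/sdm1   # data partition created by prepare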