Hi guys,
Our cluster keeps having OSDs go down due to medium errors on the underlying disks. Our current action plan is simply to replace the defective drive, but I am wondering whether Ceph is being too sensitive in taking the OSD down, or whether our plan is too simple and crude. Any advice on this issue would be appreciated.
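
To make it concrete, our replacement procedure is roughly the following sketch (osd.15 and /dev/sdm are just placeholders matching the drive in the logs below, and the exact service command depends on the distro/init system):

  # mark the failing OSD out so its PGs backfill to other OSDs
  ceph osd out osd.15
  # stop the OSD daemon (service name/syntax depends on the init system)
  systemctl stop ceph-osd@15
  # remove it from CRUSH and the cluster before pulling the drive
  ceph osd crush remove osd.15
  ceph auth del osd.15
  ceph osd rm osd.15
  # after the physical swap, prepare and activate a new OSD on the new disk
  ceph-disk prepare /dev/sdm
  ceph-disk activate /dev/sdm1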


The medium error as seen in dmesg:
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Sense Key : Medium Error [current]
[Sun Nov 20 15:52:10 2016] Info fld=0x235f23e0
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm]
[Sun Nov 20 15:52:10 2016] Add. Sense: Unrecovered read error
[Sun Nov 20 15:52:10 2016] sd 0:0:15:0: [sdm] CDB:
[Sun Nov 20 15:52:10 2016] Read(10): 28 00 23 5f 23 60 00 02 30 00
[Sun Nov 20 15:52:10 2016] end_request: critical medium error, dev sdm, sector 593437664
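
Before pulling the drive we usually double-check it with smartctl as a sanity check (assuming smartmontools is installed; behind a MegaRAID controller the pass-through form with the device ID may be needed):

  smartctl -H -a /dev/sdm
  # pass-through form for drives behind a MegaRAID HBA (15 = Device Id from megacli)
  smartctl -H -a -d megaraid,15 /dev/sdm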




The OSD log always shows the same pattern: during a deep-scrub the OSD hits the read error and asserts.

   -3> 2016-11-20 16:54:39.740795 7f71f7e75700  0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub starts
   -2> 2016-11-20 16:54:41.958706 7f71f7e75700  0 log_channel(cluster) log [INF] : 13.7e9 deep-scrub ok
   -1> 2016-11-20 16:54:48.740180 7f71f7e75700  0 log_channel(cluster) log [INF] : 13.5c9 deep-scrub starts
    0> 2016-11-20 16:55:00.704106 7f71f7e75700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f71f7e75700 time 2016-11-20 16:55:00.699763
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)


 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f7228bad78b]
 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc58) [0x7f722898b718]
 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0x7f7228a17279]
 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x7f72289510a8]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f7228869eea]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x480) [0x7f7228870100]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f72288717ee]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x7f7228756069]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x7f7228b9e376]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7f7228b9f420]
 11: (()+0x8182) [0x7f72279ab182]
 12: (clone()+0x6d) [0x7f7225f1647d]
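
As far as I can tell the abort is deliberate: the assert it hits is gated on filestore_fail_eio (visible in the assert condition above), so with the default setting FileStore intentionally kills the OSD as soon as a read returns EIO rather than serving possibly bad data. The current value can be checked on the affected OSD via the admin socket, e.g. (OSD id is just an example):

  ceph daemon osd.15 config get filestore_fail_eio

After replacing the drive and rebuilding the OSD, re-running a deep scrub on the affected PG should confirm the error is gone:

  ceph pg deep-scrub 13.5c9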




megacli shows a non-zero media error count for the same drive:
Enclosure Device ID: 32
Slot Number: 15
Device Id: 15
Sequence Number: 2
Media Error Count: 9
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.090 TB [0x8bba0cb0 Sectors]
Non Coerced Size: 1.090 TB [0x8baa0cb0 Sectors]
Coerced Size: 1.090 TB [0x8ba80000 Sectors]
Firmware state: JBOD
SAS Address(0): 0x5000c50084f2971d
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
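
To see whether other drives on the same controller are also accumulating media errors, something like this is usually enough (the binary may be MegaCli/MegaCli64 depending on the install):

  megacli -PDList -aALL | egrep -i 'slot number|media error|predictive failure'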