[ceph-users] Re: monitoring drives

Marc Fri, 14 Oct 2022 06:40:24 -0700

> smartctl can very much read sas drives so I would look into that chain
> first.


I have smartd running and it does recognize the SAS drives. However, I also 
have collectd grabbing SMART data, and I am getting nothing from it for the 
SAS drives. This is everything I get from a SATA drive:

# SELECT * FROM "smart_value" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_value
time                           host instance type              value
----                           ---- -------- ----              -----
2022-10-14T13:24:04.029043881Z c01  sdb      smart_poweron     118652400
2022-10-14T13:24:04.043975567Z c01  sdb      smart_powercycles 8
2022-10-14T13:24:04.05828545Z  c01  sdb      smart_badsectors  0
2022-10-14T13:24:04.07207858Z  c01  sdb      smart_temperature 30
> SELECT * FROM "smart_pretty" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_pretty
time                           host instance type            type_instance            value
----                           ---- -------- ----            -------------            -----
2022-10-14T13:24:04.072900793Z c01  sdb      smart_attribute raw-read-error-rate      0
2022-10-14T13:24:04.073731474Z c01  sdb      smart_attribute spin-up-time             5383
2022-10-14T13:24:04.074562994Z c01  sdb      smart_attribute start-stop-count         8
2022-10-14T13:24:04.075397312Z c01  sdb      smart_attribute reallocated-sector-count 0
2022-10-14T13:24:04.07624241Z  c01  sdb      smart_attribute seek-error-rate          0
2022-10-14T13:24:04.077058461Z c01  sdb      smart_attribute power-on-hours           118652400000
2022-10-14T13:24:04.077886085Z c01  sdb      smart_attribute spin-retry-count         0
2022-10-14T13:24:04.078708091Z c01  sdb      smart_attribute calibration-retry-count  0
2022-10-14T13:24:04.079542614Z c01  sdb      smart_attribute power-cycle-count        8
2022-10-14T13:24:04.080374422Z c01  sdb      smart_attribute power-off-retract-count  6
2022-10-14T13:24:04.0812049Z   c01  sdb      smart_attribute load-cycle-count         74
2022-10-14T13:24:04.082027399Z c01  sdb      smart_attribute temperature-celsius-2    303150
2022-10-14T13:24:04.082879593Z c01  sdb      smart_attribute reallocated-event-count  0
2022-10-14T13:24:04.083707815Z c01  sdb      smart_attribute current-pending-sector   0
2022-10-14T13:24:04.084536779Z c01  sdb      smart_attribute offline-uncorrectable    0
2022-10-14T13:24:04.085365242Z c01  sdb      smart_attribute udma-crc-error-count     0
2022-10-14T13:24:04.086191201Z c01  sdb      smart_attribute multi-zone-error-rate    0
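
A note on reading the smart_pretty values: comparing the two tables, they 
appear to be scaled to base units. power-on-hours (118652400000) is exactly 
1000 times smart_poweron (118652400), which suggests milliseconds versus 
seconds, and temperature-celsius-2 (303150) matches smart_temperature (30) if 
it is read as millikelvin. A minimal Python sketch of the conversion, assuming 
that scaling is right (it is inferred from the numbers above, and the function 
names are only for illustration):

```python
def millikelvin_to_celsius(mk: float) -> float:
    """Convert a temperature in millikelvin to degrees Celsius."""
    return mk / 1000.0 - 273.15


def milliseconds_to_hours(ms: float) -> float:
    """Convert a duration in milliseconds to hours."""
    return ms / 1000.0 / 3600.0


if __name__ == "__main__":
    # Values taken from the smart_pretty output for sdb.
    print(round(millikelvin_to_celsius(303150), 2))  # temperature-celsius-2
    print(milliseconds_to_hours(118652400000))       # power-on-hours
```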

>   Are they behind a raid controller that is masking the smart
> commands?

No

> As for monitoring, we run the smartd service to keep an eye on drives.
> More often than not I notice weird things with ceph long before smart
> throws an actual error.  Bouncing drives, oddly high latency on our "Max
> OSD Apply Latency" graph. 

Do you only grab one metric in the query, or do you also 'calculate' whether 
the disk is currently in use and compensate for that in the reported latency? 
(Or does this metric not depend on current use?)

What values should I look for, how many hundreds of ms?

I have 106 metrics listed in ceph_latency. These are the ones that start with 
Osd; which one would be the apply latency?

Osd.opBeforeDequeueOpLat
Osd.opBeforeQueueOpLat
Osd.opLatency
Osd.opPrepareLatency
Osd.opProcessLatency
Osd.opRLatency
Osd.opRPrepareLatency
Osd.opRProcessLatency
Osd.opRwLatency
Osd.opRwPrepareLatency
Osd.opRwProcessLatency
Osd.opWLatency
Osd.opWPrepareLatency
Osd.opWProcessLatency
Osd.subopLatency
Osd.subopWLatency

>  Every few months I throw a smart long test
> at the whole cluster and a few days later go back and rake the results.
> Anything that has a failure gets immediately removed from ceph by me
> regardless if smart says it's fine or not.   At least 90% of the drives
> we RMA have smart passed but failures in the read test.  Never had
> pushback from WDC or Seagate on it.
> 
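
The "long test, then rake the results" routine described above could be 
scripted roughly like this. It is a sketch under stated assumptions, not the 
poster's actual tooling: it relies on `smartctl -t long` to start the test and 
`smartctl -l selftest` to print the log, and it only understands the ATA text 
layout of that log (SAS drives print a different layout that this does not 
handle). The function names and the device path are hypothetical.

```python
import re
import subprocess

# ATA self-test log entries start with "# <num>", for example:
# "# 1  Extended offline    Completed: read failure       90%     20123         1234567"
ENTRY_RE = re.compile(r"^#\s*\d+\s")


def suspect_selftests(log_text):
    """Return log entries that neither completed cleanly nor are still running.

    Aborted or interrupted tests are returned too, so review the list by hand.
    """
    suspect = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not ENTRY_RE.match(line):
            continue  # header or other non-entry line
        lowered = line.lower()
        if "completed without error" in lowered or "in progress" in lowered:
            continue
        suspect.append(line)
    return suspect


def start_long_test(device):
    """Kick off an extended self-test; it runs in the drive's background."""
    # No check=True here: smartctl's exit status is a bitmask of conditions.
    return subprocess.run(["smartctl", "-t", "long", device])


def read_selftest_log(device):
    """Return the text of the drive's self-test log."""
    result = subprocess.run(
        ["smartctl", "-l", "selftest", device],
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    device = "/dev/sdb"  # hypothetical device; run as root, days after the test
    for entry in suspect_selftests(read_selftest_log(device)):
        print(device, entry)
```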
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
