Hey Ben and Chris. Thanks for your replies!
On Fri, 2024-07-19 at 09:17 +0200, Ben Kochie wrote:
> This is one of those tricky situations where there's not a strict
> correct answer.

Indeed.

> For power-on-hours I would probably go with a gauge.
> * You don't really have a "perfect" monotonic counter here.

Why not? I mean, there's what Chris said about some drives that may overflow the number, which in principle sounds unlikely, though I must admit that especially with NVMe SMART data I have seen unreasonably low numbers, too (but in those cases I've never seen high numbers for these drives, so it may also just be some other issue).

> * I would also include the serial number label as well, just for
> uniqueness identification sake.

If one adds a label to a metric, and that label then stays mostly constant, does this add any considerable amount of storage space? But more on that below.

> * Power-on-hours doesn't really have a lot of use as a counter. Do
> actually want to display a counter like `rate(power_on_hours[1h])`?

No, not particularly. It's just a number that should, at least in theory, only increase, and I wanted to do it right.

Perhaps I should describe things a bit more, because I actually have some more cases where it's not clear to me how to map them properly into metrics. I had already asked at
https://discuss.prometheus.io/t/how-to-design-metrics-labels/2337
and in one further thread there (which was, however, swallowed by the anti-spam and would need some admin to approve it).

My exporter parses the RAID CLI tools output, which results in a structure like this (as JSON):

{
  "controllers": {
    "0": {
      "properties": {
        "slot": "0",
        "serial_number": "000A",
        "controller_status": "OK",
        "hardware_revision": "A",
        "firmware_version": "6.52",
        "rebuild_priority": "High",
        "cache_status": "OK",
        "battery_capacitor_status": "OK",
        "controller_temperature_celsius": 55.0,
        "model": "HPE Smart Array P816i-a SR Gen10"
      },
      "sensors": {
        "0": {
          "properties": {
            "location": "Inlet Ambient",
            "temperature_celsius": 43.0
          }
        },
        "1": {
          "properties": {
            "location": "ASIC",
            "temperature_celsius": 55.0
          }
        },
        "2": {
          "properties": {
            "location": "Top",
            "temperature_celsius": 41.0
          }
        }
      },
      "arrays": {
        "A": {
          "properties": {
            "unused_space_bytes": 0.0,
            "used_space_bytes": 960139939020.8,
            "status": "OK",
            "multidomain_status": "OK"
          },
          "logical_drives": {
            "1": {
              "properties": {
                "size_bytes": 480069969510.4,
                "raid_level": "1",
                "chunk_size_bytes": 262144,
                "data_stripe_size_bytes": 262144,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": false,
                "device": "/dev/sda",
                "logical_drive_label": "system"
              }
            }
          },
          "physical_drives": {
            "3I:1:15": {
              "properties": {
                "port": "3I",
                "box": "1",
                "bay": "15",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "Solid State SATA",
                "size_bytes": 480000000000,
                "firmware_version": "HPG1",
                "serial_number": "000B",
                "wwn": "00001",
                "model": "ATA VK000480GXAWK",
                "temperature_celsius": 35.0,
                "usage_remaining_percent": 99.6,
                "power_on_hours": 25391.0,
                "life_remaining_based_on_workload_to_date_days": 263431.0,
                "shingled_magnetic_recording_support": "None"
              }
            },
            "3I:1:16": {
              "properties": {
                "port": "3I",
                "box": "1",
                "bay": "16",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "Solid State SATA",
                "size_bytes": 480000000000,
                "firmware_version": "HPG1",
                "serial_number": "000C",
                "wwn": "00002",
                "model": "ATA VK000480GXAWK",
                "temperature_celsius": 32.0,
                "usage_remaining_percent": 99.6,
                "power_on_hours": 25391.0,
                "life_remaining_based_on_workload_to_date_days": 263431.0,
                "shingled_magnetic_recording_support": "None"
              }
            }
          }
        },
        "B": {
          "properties": {
            "unused_space_bytes": 0.0,
            "used_space_bytes": 224014499043082.25,
            "status": "OK",
            "multidomain_status": "OK"
          },
          "logical_drives": {
            "2": {
              "properties": {
                "size_bytes": 17592186044416.0,
                "raid_level": "6",
                "chunk_size_bytes": 524288,
                "data_stripe_size_bytes": 6291456,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": true,
                "parity_initialization_status": "Initialization Completed",
                "device": "/dev/sde",
                "logical_drive_label": "data-a-0"
              }
            },
            "3": {
              "properties": {
                "size_bytes": 17592186044416.0,
                "raid_level": "6",
                "chunk_size_bytes": 524288,
                "data_stripe_size_bytes": 6291456,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": true,
                "parity_initialization_status": "Initialization Completed",
                "device": "/dev/sdb",
                "logical_drive_label": "data-a-1"
              }
            }
          },
          "physical_drives": {
            "1I:1:1": {
              "properties": {
                "port": "1I",
                "box": "1",
                "bay": "1",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SAS",
                "size_bytes": 16000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPD4",
                "serial_number": "000D",
                "wwn": "00003",
                "model": "HPE MB016000JWZHE",
                "temperature_celsius": 30.0,
                "shingled_magnetic_recording_support": "None"
              }
            },
            "1I:1:2": {
              "properties": {
                "port": "1I",
                "box": "1",
                "bay": "2",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SAS",
                "size_bytes": 16000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPD4",
                "serial_number": "000E",
                "wwn": "00004",
                "model": "HPE MB016000JWZHE",
                "temperature_celsius": 33.0,
                "shingled_magnetic_recording_support": "None"
              }
            },
            "3I:1:14": {
              "properties": {
                "port": "3I",
                "box": "1",
                "bay": "14",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SAS",
                "size_bytes": 16000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPD4",
                "serial_number": "000F",
                "wwn": "00005",
                "model": "HPE MB016000JWZHE",
                "temperature_celsius": 37.0,
                "shingled_magnetic_recording_support": "None"
              }
            }
          }
        }
      },
      "unassigned_physical_drives": {}
    }
  }
}

(The above output is a bit shortened, simply for readability)

In short:
- there can be multiple controllers
- controllers have sensors and arrays
- arrays have logical drives
- physical drives are assigned to arrays, too, but may also be shared spares (which cannot be properly deduced from the RAID tool) or unassigned drives (and thus not belong to an array)

Right now, I'd map that as follows:

# HELP smartraid_controller_cache_module_temperature_celsius Temperature of the cache module of a SmartRAID controller in celsius.
# TYPE smartraid_controller_cache_module_temperature_celsius gauge
# HELP smartraid_controller_capacitor_temperature_celsius Temperature of the capacitor of a SmartRAID controller in celsius.
# TYPE smartraid_controller_capacitor_temperature_celsius gauge
# HELP smartraid_controller_info Information about a SmartRAID controller.
# TYPE smartraid_controller_info gauge
smartraid_controller_info{battery_capacitor_status="OK",cache_status="OK",controller_name="0",controller_status="OK",firmware_version="6.52",hardware_revision="A",model="HPE Smart Array P816i-a SR Gen10",rebuild_priority="High",serial_number="000A"} 1.0
# HELP smartraid_controller_temperature_celsius Temperature of a SmartRAID controller in celsius.
# TYPE smartraid_controller_temperature_celsius gauge
smartraid_controller_temperature_celsius{controller_name="0"} 55.0
# HELP smartraid_controller_sensor_info Information about a SmartRAID controller sensor.
# TYPE smartraid_controller_sensor_info gauge
smartraid_controller_sensor_info{controller_name="0",location="Inlet Ambient",sensor_name="0"} 1.0
smartraid_controller_sensor_info{controller_name="0",location="ASIC",sensor_name="1"} 1.0
smartraid_controller_sensor_info{controller_name="0",location="Top",sensor_name="2"} 1.0
# HELP smartraid_controller_sensor_temperature_celsius Temperature of a SmartRAID controller sensor in celsius.
# TYPE smartraid_controller_sensor_temperature_celsius gauge
smartraid_controller_sensor_temperature_celsius{controller_name="0",sensor_name="0"} 43.0
smartraid_controller_sensor_temperature_celsius{controller_name="0",sensor_name="1"} 55.0
smartraid_controller_sensor_temperature_celsius{controller_name="0",sensor_name="2"} 41.0
# HELP smartraid_array_info Information about a SmartRAID array.
# TYPE smartraid_array_info gauge
smartraid_array_info{array_name="A",controller_name="0",multidomain_status="OK",status="OK"} 1.0
smartraid_array_info{array_name="B",controller_name="0",multidomain_status="OK",status="OK"} 1.0
# HELP smartraid_array_unused_space_bytes Unused space of a SmartRAID array in bytes.
# TYPE smartraid_array_unused_space_bytes gauge
smartraid_array_unused_space_bytes{array_name="A",controller_name="0"} 0.0
smartraid_array_unused_space_bytes{array_name="B",controller_name="0"} 0.0
# HELP smartraid_array_used_space_bytes Used space of a SmartRAID array in bytes.
# TYPE smartraid_array_used_space_bytes gauge
smartraid_array_used_space_bytes{array_name="A",controller_name="0"} 9.601399390208e+011
smartraid_array_used_space_bytes{array_name="B",controller_name="0"} 2.2401449904308225e+014
# HELP smartraid_logical_drive_chunk_size_bytes Chunk size of a SmartRAID logical drive in bytes. A chunk is a set of consecutive bytes per physical drive.
# TYPE smartraid_logical_drive_chunk_size_bytes gauge
smartraid_logical_drive_chunk_size_bytes{array_name="A",controller_name="0",logical_drive_name="1"} 262144.0
smartraid_logical_drive_chunk_size_bytes{array_name="B",controller_name="0",logical_drive_name="2"} 524288.0
smartraid_logical_drive_chunk_size_bytes{array_name="B",controller_name="0",logical_drive_name="3"} 524288.0
# HELP smartraid_logical_drive_data_stripe_size_bytes Data stripe size of a SmartRAID logical drive in bytes. A data stripe is a row of data (but not parity) chunks over the physical drives.
# TYPE smartraid_logical_drive_data_stripe_size_bytes gauge
smartraid_logical_drive_data_stripe_size_bytes{array_name="A",controller_name="0",logical_drive_name="1"} 262144.0
smartraid_logical_drive_data_stripe_size_bytes{array_name="B",controller_name="0",logical_drive_name="2"} 6.291456e+06
smartraid_logical_drive_data_stripe_size_bytes{array_name="B",controller_name="0",logical_drive_name="3"} 6.291456e+06
# HELP smartraid_logical_drive_info Information about a SmartRAID logical drive.
# TYPE smartraid_logical_drive_info gauge
smartraid_logical_drive_info{array_name="A",caching="0",controller_name="0",device="/dev/sda",logical_drive_label="system",logical_drive_name="1",multidomain_status="OK",parity_initialization_status="",raid_level="1",status="OK",unrecoverable_media_errors="None"} 1.0
smartraid_logical_drive_info{array_name="B",caching="1",controller_name="0",device="/dev/sde",logical_drive_label="data-a-0",logical_drive_name="2",multidomain_status="OK",parity_initialization_status="Initialization Completed",raid_level="6",status="OK",unrecoverable_media_errors="None"} 1.0
smartraid_logical_drive_info{array_name="B",caching="1",controller_name="0",device="/dev/sdb",logical_drive_label="data-a-1",logical_drive_name="3",multidomain_status="OK",parity_initialization_status="Initialization Completed",raid_level="6",status="OK",unrecoverable_media_errors="None"} 1.0
# HELP smartraid_logical_drive_size_bytes Size of a SmartRAID logical drive in bytes.
# TYPE smartraid_logical_drive_size_bytes gauge
smartraid_logical_drive_size_bytes{array_name="A",controller_name="0",logical_drive_name="1"} 4.800699695104e+011
smartraid_logical_drive_size_bytes{array_name="B",controller_name="0",logical_drive_name="2"} 1.7592186044416e+013
smartraid_logical_drive_size_bytes{array_name="B",controller_name="0",logical_drive_name="3"} 1.7592186044416e+013
# HELP smartraid_physical_drive_info Information about a SmartRAID physical drive.
# TYPE smartraid_physical_drive_info gauge
smartraid_physical_drive_info{array_name="A",bay="15",box="1",controller_name="0",drive_role="Data",firmware_version="HPG1",interface_type="Solid State SATA",model="ATA VK000480GXAWK",multi_actuator_drive="",physical_drive_name="3I:1:15",port="3I",serial_number="000B",shingled_magnetic_recording_support="None",status="OK",wwn="00001"} 1.0
smartraid_physical_drive_info{array_name="A",bay="16",box="1",controller_name="0",drive_role="Data",firmware_version="HPG1",interface_type="Solid State SATA",model="ATA VK000480GXAWK",multi_actuator_drive="",physical_drive_name="3I:1:16",port="3I",serial_number="000C",shingled_magnetic_recording_support="None",status="OK",wwn="00002"} 1.0
smartraid_physical_drive_info{array_name="B",bay="1",box="1",controller_name="0",drive_role="Data",firmware_version="HPD4",interface_type="SAS",model="HPE MB016000JWZHE",multi_actuator_drive="",physical_drive_name="1I:1:1",port="1I",serial_number="000D",shingled_magnetic_recording_support="None",status="OK",wwn="00003"} 1.0
smartraid_physical_drive_info{array_name="B",bay="2",box="1",controller_name="0",drive_role="Data",firmware_version="HPD4",interface_type="SAS",model="HPE MB016000JWZHE",multi_actuator_drive="",physical_drive_name="1I:1:2",port="1I",serial_number="000E",shingled_magnetic_recording_support="None",status="OK",wwn="00004"} 1.0
smartraid_physical_drive_info{array_name="B",bay="14",box="1",controller_name="0",drive_role="Data",firmware_version="HPD4",interface_type="SAS",model="HPE MB016000JWZHE",multi_actuator_drive="",physical_drive_name="3I:1:14",port="3I",serial_number="000F",shingled_magnetic_recording_support="None",status="OK",wwn="00005"} 1.0
# HELP smartraid_physical_drive_life_remaining_based_on_workload_to_date_days Remaining lifetime (estimated based on the workload to date) of a SmartRAID physical drive in days.
# TYPE smartraid_physical_drive_life_remaining_based_on_workload_to_date_days gauge
smartraid_physical_drive_life_remaining_based_on_workload_to_date_days{controller_name="0",physical_drive_name="3I:1:15"} 263431.0
smartraid_physical_drive_life_remaining_based_on_workload_to_date_days{controller_name="0",physical_drive_name="3I:1:16"} 263431.0
# HELP smartraid_physical_drive_power_on_hours_total Power-on time of a SmartRAID physical drive in hours.
# TYPE smartraid_physical_drive_power_on_hours_total counter
smartraid_physical_drive_power_on_hours_total{controller_name="0",physical_drive_name="3I:1:15"} 25391.0
smartraid_physical_drive_power_on_hours_total{controller_name="0",physical_drive_name="3I:1:16"} 25391.0
# HELP smartraid_physical_drive_rotational_speed_rpm Rotational speed of a SmartRAID physical drive in revolutions per minute.
# TYPE smartraid_physical_drive_rotational_speed_rpm gauge
smartraid_physical_drive_rotational_speed_rpm{controller_name="0",physical_drive_name="1I:1:1"} 7200.0
smartraid_physical_drive_rotational_speed_rpm{controller_name="0",physical_drive_name="1I:1:2"} 7200.0
smartraid_physical_drive_rotational_speed_rpm{controller_name="0",physical_drive_name="3I:1:14"} 7200.0
# HELP smartraid_physical_drive_size_bytes Size of a SmartRAID physical drive in bytes.
# TYPE smartraid_physical_drive_size_bytes gauge
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="3I:1:15"} 4.8e+011
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="3I:1:16"} 4.8e+011
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="1I:1:1"} 1.6e+013
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="1I:1:2"} 1.6e+013
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="3I:1:14"} 1.6e+013
# HELP smartraid_physical_drive_temperature_celsius Temperature of a SmartRAID physical drive in celsius.
# TYPE smartraid_physical_drive_temperature_celsius gauge
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="3I:1:15"} 35.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="3I:1:16"} 32.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="1I:1:1"} 30.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="1I:1:2"} 33.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="3I:1:14"} 37.0
# HELP smartraid_physical_drive_usage_remaining_ratio Remaining usage (in terms of durability) of a SmartRAID physical drive as ratio.
# TYPE smartraid_physical_drive_usage_remaining_ratio gauge
smartraid_physical_drive_usage_remaining_ratio{controller_name="0",physical_drive_name="3I:1:15"} 0.996
smartraid_physical_drive_usage_remaining_ratio{controller_name="0",physical_drive_name="3I:1:16"} 0.996

What I did is the following:
- Any property from the JSON which is a true number (like temperatures, byte sizes, etc.) got its own metric.
- Booleans became labels (of the respective _info metric). Storing them as a metric value would of course be an alternative (see the question below).
- Anything that's a string became a label of an _info metric.
  I do know about enums[0], but the main problem with enum candidates like `status` is that I don't know all possible values (and that the closed upstream tool may add or change them at any time).
- There are labels that act like primary keys:
  for controllers: `controller_name`
  for sensors: `controller_name`, `sensor_name`
  for arrays: `controller_name`, `array_name`
  for LDs: `controller_name`, `array_name`, `logical_drive_name`
  for PDs: `controller_name`, `physical_drive_name`
  For PDs, the _info metric also contains `array_name`, but there it's not like a primary key, as it may e.g. be empty for unassigned drives.
  The idea is that with operators like `… and on(…) …` one should e.g. be able to use the _info metric to select all drives that have, say, status==OK and intersect that with e.g. the temperatures.
  And it should be possible to make proper alerts like
    count(smartraid_physical_drive_info{status!="OK"}) != 0
  to get any drives which are not OK.
- If some property doesn't appear (e.g. HDDs have no smartraid_physical_drive_usage_remaining_ratio), the metric doesn't appear for the respective labels.
  Similarly, if a property that is mapped to a label doesn't appear, the label is left empty.

Now there are quite a few things that could have been done differently from how I did them in the first draft:

1) The "primary key" labels don't truly uniquely identify the respective object over all time, but only at a given time.
   This is basically the point we've discussed before, with the PDs and the serial number.
   There may be two different controllers with name `0`, not at the same time, but when one breaks and is replaced by another. Array A may be deleted and replaced with another one of the same name; same for LDs. And most obviously, of course, for PDs.

   I could of course use further labels like WWN or serial numbers, but there wouldn't be such means for arrays and LDs.
   Also, while in the case of PDs it may make sense to include the serial number, so that one can more easily tell that e.g. some power-on-hours are those of another drive, it makes IMO less sense for e.g. the controller when one relates that to a PD. If the controller changes (but still has name `0`) and the PDs all stay, one would not want the PDs, LDs or arrays to be considered different ones from those with the old controller.

   So the question is:
   *If* one actually includes things like the serial number: where?
   *Or* should one leave that exercise to the query (because the info about which serial number a drive name had at a given time is in principle there, namely in the _info metric)? But that could make queries quite complex.
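
   To illustrate the "leave it to the query" variant: in PromQL this would presumably be a join along the lines of
     smartraid_physical_drive_power_on_hours_total * on(controller_name, physical_drive_name) group_left(serial_number) smartraid_physical_drive_info
   Below is a minimal Python sketch of what such a join does, using hypothetical in-memory samples shaped like the output above. The function name `attach_labels` and the data layout are made up for illustration only; they are not part of the exporter or of any Prometheus library.

```python
# Hypothetical in-memory samples, shaped like the exposition output above.
info_samples = [
    {"controller_name": "0", "physical_drive_name": "3I:1:15",
     "serial_number": "000B", "status": "OK"},
    {"controller_name": "0", "physical_drive_name": "3I:1:16",
     "serial_number": "000C", "status": "OK"},
]
power_on_hours = [
    ({"controller_name": "0", "physical_drive_name": "3I:1:15"}, 25391.0),
    ({"controller_name": "0", "physical_drive_name": "3I:1:16"}, 25391.0),
]

# The "primary key" labels for physical drives.
KEY = ("controller_name", "physical_drive_name")


def attach_labels(values, infos, key=KEY, extra=("serial_number",)):
    """Copy the `extra` labels from the matching info sample onto each value sample."""
    # Index the info samples by their key labels for a quick lookup.
    index = {tuple(info[k] for k in key): info for info in infos}
    joined = []
    for labels, value in values:
        info = index.get(tuple(labels[k] for k in key))
        if info is None:
            # No matching info sample: drop the value, similar to how an
            # unmatched series disappears from a PromQL vector match.
            continue
        merged = dict(labels)
        merged.update({name: info[name] for name in extra})
        joined.append((merged, value))
    return joined


joined = attach_labels(power_on_hours, info_samples)
for labels, value in joined:
    print(labels, value)
```

   With this, the serial number is only attached when it's needed in a query, and the value metrics keep their short label sets, at the cost of a longer query.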
   I should further add that there is no heavenly law that e.g. serial numbers are actually different. In fact, I have seen cases where devices effectively get the same serial number; the kernel does not enforce them to be different in any way.

2) Metrics like:
   - smartraid_physical_drive_size_bytes
   - smartraid_physical_drive_rotational_speed_rpm
   can in principle not change (again: unless the PD is replaced with another one of the same name).
   So should they rather be labels, despite being numbers?

   OTOH, metrics like:
   - smartraid_logical_drive_size_bytes
   - smartraid_logical_drive_chunk_size_bytes
   - smartraid_logical_drive_data_stripe_size_bytes
   *can* in principle change (namely if the RAID is converted).

3) Is there a better way to map the various `status` properties?
   I made them labels for now, as described above. And remember that I don't know all possible values such enums may take.

4) The way I mapped the temperatures...
   I described the alternative I see in
   https://discuss.prometheus.io/t/how-to-design-metrics-labels/2337
   For now I went with the approach of having a dedicated metric for those where there's a dedicated property in the RAID tool output, like:
   - smartraid_controller_temperature_celsius
   - smartraid_controller_cache_module_temperature_celsius
   - smartraid_controller_capacitor_temperature_celsius
   - smartraid_physical_drive_temperature_celsius
   and a single metric with a "type" label (named `location`) for those where the output is variable:
   - smartraid_controller_sensor_info{controller_name="0",location="Inlet Ambient",sensor_name="0"} 1.0
   - smartraid_controller_sensor_info{controller_name="0",location="ASIC",sensor_name="1"} 1.0
   - smartraid_controller_sensor_info{controller_name="0",location="Top",sensor_name="2"} 1.0
   (with the corresponding smartraid_controller_sensor_temperature_celsius)

5) Booleans
   There are properties like:
   - multi_actuator_drive (PDs), which will never change (for a given PD)
   - shingled_magnetic_recording_support (PDs), which might theoretically change (if it actually tells whether the drive does SMR, not whether it can do it, and if it can be disabled, which I think is possible with zoned drives)
   - caching (LDs), which may actually change at any time (e.g. when the battery fails)

   For now I mapped them to labels, rather than metrics.
   This question is similar to (2) above: should they be labels or metrics?
   For multi_actuator_drive I'd say clearly a label. For `caching` it might even be better to have a metric. But I'm not sure.

6) *If* one maps booleans to labels (like I did), are there any recommended values? I simply took 0/1, because they're the shortest. But of course there are true/false, yes/no, etc.

7) I've also played with the thought of splitting up my current *_info metrics into:
   - one which contains those labels that are truly static per object, like serial number and hardware revision
   - one like *_status_info that contains those which may change, like `status`, `firmware_version`, etc.
   But I didn't do so, as most can actually change (at least in principle).

Thanks,
Chris.

[0] https://prometheus.github.io/client_python/instrumenting/enum/