Hey Ben and Chris.

Thanks for your replies!

On Fri, 2024-07-19 at 09:17 +0200, Ben Kochie wrote:
> This is one of those tricky situations where there's not a strict
> correct answer.

Indeed.


> For power-on-hours I would probably go with a gauge.
> * You don't really have a "perfect" monotonic counter here.

Why not?
I mean there's what Chris said about some drives that may overflow the
number - which in principle sounds unlikely though I must admit that
especially with NVMe SMART data I have seen unreasonably low numbers,
too (but in those cases, I've never seen high numbers for these drives,
so it may also just be some other issue).


> * I would also include the serial number label as well, just for
> uniqueness identification sake.

If one adds a label to a metric, which then stays mostly constant, does
this add any considerable amount of space needed for storing it?

But more on that below.


> * Power-on-hours doesn't really have a lot of use as a counter. Do
> actually want to display a counter like `rate(power_on_hours[1h])`?

No, not particularly. It's just a number that should at least in theory
only increase, and I wanted to do it right.
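
To make the overflow concern a bit more concrete, here is a minimal sketch of how a reset-aware increase can be computed from raw counter samples: any drop in the value is treated as a reset. It is a simplification (the real PromQL `increase()` additionally extrapolates to the window boundaries), and the function name and the sample readings are made up for illustration.

```python
def increase_with_resets(samples):
    """Sum the increase over a list of counter samples.

    Any drop in value is treated as a counter reset, i.e. the counter is
    assumed to have restarted from zero, so the value after the drop counts
    in full.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Decrease: assume a reset (or overflow) happened in between.
            total += cur
    return total


# Hypothetical power-on-hours readings with a wrap-around in the middle.
readings = [65530.0, 65534.0, 3.0, 7.0]
print(increase_with_resets(readings))
```

So with a counter, a single wrap-around would only lose the hours between the last sample before and the first sample after the wrap, while a gauge would simply show the drop.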


Perhaps I should describe things a bit more, because I actually have
some more cases where it's not clear to me how to map them properly
into metrics.

I had already asked at
https://discuss.prometheus.io/t/how-to-design-metrics-labels/2337
and one further thread there (which was however swallowed by the anti-
spam, and would need some admin to approve it).


My exporter parses the RAID CLI tool's output, which results in a structure like
this (as JSON):
{
  "controllers": {
    "0": {
      "properties": {
        "slot": "0",
        "serial_number": "000A",
        "controller_status": "OK",
        "hardware_revision": "A",
        "firmware_version": "6.52",
        "rebuild_priority": "High",
        "cache_status": "OK",
        "battery_capacitor_status": "OK",
        "controller_temperature_celsius": 55.0,
        "model": "HPE Smart Array P816i-a SR Gen10"
      },
      "sensors": {
        "0": {
          "properties": {
            "location": "Inlet Ambient",
            "temperature_celsius": 43.0
          }
        },
        "1": {
          "properties": {
            "location": "ASIC",
            "temperature_celsius": 55.0
          }
        },
        "2": {
          "properties": {
            "location": "Top",
            "temperature_celsius": 41.0
          }
        }
      },
      "arrays": {
        "A": {
          "properties": {
            "unused_space_bytes": 0.0,
            "used_space_bytes": 960139939020.8,
            "status": "OK",
            "multidomain_status": "OK"
          },
          "logical_drives": {
            "1": {
              "properties": {
                "size_bytes": 480069969510.4,
                "raid_level": "1",
                "chunk_size_bytes": 262144,
                "data_stripe_size_bytes": 262144,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": false,
                "device": "/dev/sda",
                "logical_drive_label": "system"
              }
            }
          },
          "physical_drives": {
            "3I:1:15": {
              "properties": {
                "port": "3I",
                "box": "1",
                "bay": "15",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "Solid State SATA",
                "size_bytes": 480000000000,
                "firmware_version": "HPG1",
                "serial_number": "000B",
                "wwn": "00001",
                "model": "ATA VK000480GXAWK",
                "temperature_celsius": 35.0,
                "usage_remaining_percent": 99.6,
                "power_on_hours": 25391.0,
                "life_remaining_based_on_workload_to_date_days": 263431.0,
                "shingled_magnetic_recording_support": "None"
              }
            },
            "3I:1:16": {
              "properties": {
                "port": "3I",
                "box": "1",
                "bay": "16",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "Solid State SATA",
                "size_bytes": 480000000000,
                "firmware_version": "HPG1",
                "serial_number": "000C",
                "wwn": "00002",
                "model": "ATA VK000480GXAWK",
                "temperature_celsius": 32.0,
                "usage_remaining_percent": 99.6,
                "power_on_hours": 25391.0,
                "life_remaining_based_on_workload_to_date_days": 263431.0,
                "shingled_magnetic_recording_support": "None"
              }
            }
          }
        },
        "B": {
          "properties": {
            "unused_space_bytes": 0.0,
            "used_space_bytes": 224014499043082.25,
            "status": "OK",
            "multidomain_status": "OK"
          },
          "logical_drives": {
            "2": {
              "properties": {
                "size_bytes": 17592186044416.0,
                "raid_level": "6",
                "chunk_size_bytes": 524288,
                "data_stripe_size_bytes": 6291456,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": true,
                "parity_initialization_status": "Initialization Completed",
                "device": "/dev/sde",
                "logical_drive_label": "data-a-0"
              }
            },
            "3": {
              "properties": {
                "size_bytes": 17592186044416.0,
                "raid_level": "6",
                "chunk_size_bytes": 524288,
                "data_stripe_size_bytes": 6291456,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": true,
                "parity_initialization_status": "Initialization Completed",
                "device": "/dev/sdb",
                "logical_drive_label": "data-a-1"
              }
            }
          },
          "physical_drives": {
            "1I:1:1": {
              "properties": {
                "port": "1I",
                "box": "1",
                "bay": "1",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SAS",
                "size_bytes": 16000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPD4",
                "serial_number": "000D",
                "wwn": "00003",
                "model": "HPE MB016000JWZHE",
                "temperature_celsius": 30.0,
                "shingled_magnetic_recording_support": "None"
              }
            },
            "1I:1:2": {
              "properties": {
                "port": "1I",
                "box": "1",
                "bay": "2",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SAS",
                "size_bytes": 16000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPD4",
                "serial_number": "000E",
                "wwn": "00004",
                "model": "HPE MB016000JWZHE",
                "temperature_celsius": 33.0,
                "shingled_magnetic_recording_support": "None"
              }
            },
            "3I:1:14": {
              "properties": {
                "port": "3I",
                "box": "1",
                "bay": "14",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SAS",
                "size_bytes": 16000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPD4",
                "serial_number": "000F",
                "wwn": "00005",
                "model": "HPE MB016000JWZHE",
                "temperature_celsius": 37.0,
                "shingled_magnetic_recording_support": "None"
              }
            }
          }
        }
      },
      "unassigned_physical_drives": {}
    }
  }
}

(The above output is a bit shortened, simply for readability)

In short:
- there can be multiple controllers
- controllers have sensors and arrays
- arrays have logical drives
- physical drives are assigned to arrays, too,
  but may also be shared spares (which cannot be properly deduced from
  the RAID tool) or unassigned drives (and thus do not belong to an array)
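
The hierarchy above can be walked roughly like this. It's only a sketch: the function name is made up, and I assume here that entries under `unassigned_physical_drives` have the same shape as those under an array's `physical_drives` (the sample above only shows an empty object there).

```python
def iter_physical_drives(data):
    """Yield (labels, properties) for every physical drive in the parsed output."""
    for ctrl_name, ctrl in data.get("controllers", {}).items():
        # Drives that belong to an array.
        for array_name, array in ctrl.get("arrays", {}).items():
            for pd_name, pd in array.get("physical_drives", {}).items():
                labels = {
                    "controller_name": ctrl_name,
                    "array_name": array_name,
                    "physical_drive_name": pd_name,
                }
                yield labels, pd.get("properties", {})
        # Unassigned drives have no array, so array_name stays empty.
        # Assumption: same structure as drives inside an array.
        for pd_name, pd in ctrl.get("unassigned_physical_drives", {}).items():
            labels = {
                "controller_name": ctrl_name,
                "array_name": "",
                "physical_drive_name": pd_name,
            }
            yield labels, pd.get("properties", {})


# Tiny example; the unassigned drive "1I:1:3" is hypothetical.
sample = {
    "controllers": {
        "0": {
            "arrays": {
                "A": {"physical_drives": {"3I:1:15": {"properties": {"status": "OK"}}}}
            },
            "unassigned_physical_drives": {"1I:1:3": {"properties": {"status": "OK"}}},
        }
    }
}
for labels, props in iter_physical_drives(sample):
    print(labels, props)
```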


Right now, I'd map that as follows:
# HELP smartraid_controller_cache_module_temperature_celsius Temperature of the cache module of a SmartRAID controller in celsius.
# TYPE smartraid_controller_cache_module_temperature_celsius gauge
# HELP smartraid_controller_capacitor_temperature_celsius Temperature of the capacitor of a SmartRAID controller in celsius.
# TYPE smartraid_controller_capacitor_temperature_celsius gauge
# HELP smartraid_controller_info Information about a SmartRAID controller.
# TYPE smartraid_controller_info gauge
smartraid_controller_info{battery_capacitor_status="OK",cache_status="OK",controller_name="0",controller_status="OK",firmware_version="6.52",hardware_revision="A",model="HPE Smart Array P816i-a SR Gen10",rebuild_priority="High",serial_number="000A"} 1.0
# HELP smartraid_controller_temperature_celsius Temperature of a SmartRAID controller in celsius.
# TYPE smartraid_controller_temperature_celsius gauge
smartraid_controller_temperature_celsius{controller_name="0"} 55.0
# HELP smartraid_controller_sensor_info Information about a SmartRAID controller sensor.
# TYPE smartraid_controller_sensor_info gauge
smartraid_controller_sensor_info{controller_name="0",location="Inlet Ambient",sensor_name="0"} 1.0
smartraid_controller_sensor_info{controller_name="0",location="ASIC",sensor_name="1"} 1.0
smartraid_controller_sensor_info{controller_name="0",location="Top",sensor_name="2"} 1.0
# HELP smartraid_controller_sensor_temperature_celsius Temperature of a SmartRAID controller sensor in celsius.
# TYPE smartraid_controller_sensor_temperature_celsius gauge
smartraid_controller_sensor_temperature_celsius{controller_name="0",sensor_name="0"} 43.0
smartraid_controller_sensor_temperature_celsius{controller_name="0",sensor_name="1"} 55.0
smartraid_controller_sensor_temperature_celsius{controller_name="0",sensor_name="2"} 41.0
# HELP smartraid_array_info Information about a SmartRAID array.
# TYPE smartraid_array_info gauge
smartraid_array_info{array_name="A",controller_name="0",multidomain_status="OK",status="OK"} 1.0
smartraid_array_info{array_name="B",controller_name="0",multidomain_status="OK",status="OK"} 1.0
# HELP smartraid_array_unused_space_bytes Unused space of a SmartRAID array in bytes.
# TYPE smartraid_array_unused_space_bytes gauge
smartraid_array_unused_space_bytes{array_name="A",controller_name="0"} 0.0
smartraid_array_unused_space_bytes{array_name="B",controller_name="0"} 0.0
# HELP smartraid_array_used_space_bytes Used space of a SmartRAID array in bytes.
# TYPE smartraid_array_used_space_bytes gauge
smartraid_array_used_space_bytes{array_name="A",controller_name="0"} 9.601399390208e+011
smartraid_array_used_space_bytes{array_name="B",controller_name="0"} 2.2401449904308225e+014
# HELP smartraid_logical_drive_chunk_size_bytes Chunk size of a SmartRAID logical drive in bytes. A chunk is a set of consecutive bytes per physical drive.
# TYPE smartraid_logical_drive_chunk_size_bytes gauge
smartraid_logical_drive_chunk_size_bytes{array_name="A",controller_name="0",logical_drive_name="1"} 262144.0
smartraid_logical_drive_chunk_size_bytes{array_name="B",controller_name="0",logical_drive_name="2"} 524288.0
smartraid_logical_drive_chunk_size_bytes{array_name="B",controller_name="0",logical_drive_name="3"} 524288.0
# HELP smartraid_logical_drive_data_stripe_size_bytes Data stripe size of a SmartRAID logical drive in bytes. A data stripe is a row of data (but not parity) chunks over the physical drives.
# TYPE smartraid_logical_drive_data_stripe_size_bytes gauge
smartraid_logical_drive_data_stripe_size_bytes{array_name="A",controller_name="0",logical_drive_name="1"} 262144.0
smartraid_logical_drive_data_stripe_size_bytes{array_name="B",controller_name="0",logical_drive_name="2"} 6.291456e+06
smartraid_logical_drive_data_stripe_size_bytes{array_name="B",controller_name="0",logical_drive_name="3"} 6.291456e+06
# HELP smartraid_logical_drive_info Information about a SmartRAID logical drive.
# TYPE smartraid_logical_drive_info gauge
smartraid_logical_drive_info{array_name="A",caching="0",controller_name="0",device="/dev/sda",logical_drive_label="system",logical_drive_name="1",multidomain_status="OK",parity_initialization_status="",raid_level="1",status="OK",unrecoverable_media_errors="None"} 1.0
smartraid_logical_drive_info{array_name="B",caching="1",controller_name="0",device="/dev/sde",logical_drive_label="data-a-0",logical_drive_name="2",multidomain_status="OK",parity_initialization_status="Initialization Completed",raid_level="6",status="OK",unrecoverable_media_errors="None"} 1.0
smartraid_logical_drive_info{array_name="B",caching="1",controller_name="0",device="/dev/sdb",logical_drive_label="data-a-1",logical_drive_name="3",multidomain_status="OK",parity_initialization_status="Initialization Completed",raid_level="6",status="OK",unrecoverable_media_errors="None"} 1.0
# HELP smartraid_logical_drive_size_bytes Size of a SmartRAID logical drive in bytes.
# TYPE smartraid_logical_drive_size_bytes gauge
smartraid_logical_drive_size_bytes{array_name="A",controller_name="0",logical_drive_name="1"} 4.800699695104e+011
smartraid_logical_drive_size_bytes{array_name="B",controller_name="0",logical_drive_name="2"} 1.7592186044416e+013
smartraid_logical_drive_size_bytes{array_name="B",controller_name="0",logical_drive_name="3"} 1.7592186044416e+013
# HELP smartraid_physical_drive_info Information about a SmartRAID physical drive.
# TYPE smartraid_physical_drive_info gauge
smartraid_physical_drive_info{array_name="A",bay="15",box="1",controller_name="0",drive_role="Data",firmware_version="HPG1",interface_type="Solid State SATA",model="ATA VK000480GXAWK",multi_actuator_drive="",physical_drive_name="3I:1:15",port="3I",serial_number="000B",shingled_magnetic_recording_support="None",status="OK",wwn="00001"} 1.0
smartraid_physical_drive_info{array_name="A",bay="16",box="1",controller_name="0",drive_role="Data",firmware_version="HPG1",interface_type="Solid State SATA",model="ATA VK000480GXAWK",multi_actuator_drive="",physical_drive_name="3I:1:16",port="3I",serial_number="000C",shingled_magnetic_recording_support="None",status="OK",wwn="00002"} 1.0
smartraid_physical_drive_info{array_name="B",bay="1",box="1",controller_name="0",drive_role="Data",firmware_version="HPD4",interface_type="SAS",model="HPE MB016000JWZHE",multi_actuator_drive="",physical_drive_name="1I:1:1",port="1I",serial_number="000D",shingled_magnetic_recording_support="None",status="OK",wwn="00003"} 1.0
smartraid_physical_drive_info{array_name="B",bay="2",box="1",controller_name="0",drive_role="Data",firmware_version="HPD4",interface_type="SAS",model="HPE MB016000JWZHE",multi_actuator_drive="",physical_drive_name="1I:1:2",port="1I",serial_number="000E",shingled_magnetic_recording_support="None",status="OK",wwn="00004"} 1.0
smartraid_physical_drive_info{array_name="B",bay="14",box="1",controller_name="0",drive_role="Data",firmware_version="HPD4",interface_type="SAS",model="HPE MB016000JWZHE",multi_actuator_drive="",physical_drive_name="3I:1:14",port="3I",serial_number="000F",shingled_magnetic_recording_support="None",status="OK",wwn="00005"} 1.0
# HELP smartraid_physical_drive_life_remaining_based_on_workload_to_date_days Remaining lifetime (estimated based on the workload to date) of a SmartRAID physical drive in days.
# TYPE smartraid_physical_drive_life_remaining_based_on_workload_to_date_days gauge
smartraid_physical_drive_life_remaining_based_on_workload_to_date_days{controller_name="0",physical_drive_name="3I:1:15"} 263431.0
smartraid_physical_drive_life_remaining_based_on_workload_to_date_days{controller_name="0",physical_drive_name="3I:1:16"} 263431.0
# HELP smartraid_physical_drive_power_on_hours_total Power-on time of a SmartRAID physical drive in hours.
# TYPE smartraid_physical_drive_power_on_hours_total counter
smartraid_physical_drive_power_on_hours_total{controller_name="0",physical_drive_name="3I:1:15"} 25391.0
smartraid_physical_drive_power_on_hours_total{controller_name="0",physical_drive_name="3I:1:16"} 25391.0
# HELP smartraid_physical_drive_rotational_speed_rpm Rotational speed of a SmartRAID physical drive in revolutions per minute.
# TYPE smartraid_physical_drive_rotational_speed_rpm gauge
smartraid_physical_drive_rotational_speed_rpm{controller_name="0",physical_drive_name="1I:1:1"} 7200.0
smartraid_physical_drive_rotational_speed_rpm{controller_name="0",physical_drive_name="1I:1:2"} 7200.0
smartraid_physical_drive_rotational_speed_rpm{controller_name="0",physical_drive_name="3I:1:14"} 7200.0
# HELP smartraid_physical_drive_size_bytes Size of a SmartRAID physical drive in bytes.
# TYPE smartraid_physical_drive_size_bytes gauge
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="3I:1:15"} 4.8e+011
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="3I:1:16"} 4.8e+011
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="1I:1:1"} 1.6e+013
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="1I:1:2"} 1.6e+013
smartraid_physical_drive_size_bytes{controller_name="0",physical_drive_name="3I:1:14"} 1.6e+013
# HELP smartraid_physical_drive_temperature_celsius Temperature of a SmartRAID physical drive in celsius.
# TYPE smartraid_physical_drive_temperature_celsius gauge
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="3I:1:15"} 35.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="3I:1:16"} 32.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="1I:1:1"} 30.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="1I:1:2"} 33.0
smartraid_physical_drive_temperature_celsius{controller_name="0",physical_drive_name="3I:1:14"} 37.0
# HELP smartraid_physical_drive_usage_remaining_ratio Remaining usage (in terms of durability) of a SmartRAID physical drive as ratio.
# TYPE smartraid_physical_drive_usage_remaining_ratio gauge
smartraid_physical_drive_usage_remaining_ratio{controller_name="0",physical_drive_name="3I:1:15"} 0.996
smartraid_physical_drive_usage_remaining_ratio{controller_name="0",physical_drive_name="3I:1:16"} 0.996


What I did is the following:
- Any property from the JSON which is a true number (like temperatures,
  byte sizes, etc.) got its own metric.

- Booleans became labels of the _info metric. Storing them as a metric
  value would of course be an alternative (see question 5 below).

- Anything that's a string became a label of an _info metric.
  I do know about enums[0], but the main problem with enum candidates
  like `status` is that I don't know all possible values (and that the
  closed upstream tool may add or change them at any time).

- There are labels that act like primary keys:
  for controllers: `controller_name`
  for sensors: `controller_name`, `sensor_name` 
  for arrays: `controller_name`, `array_name`
  for LDs: `controller_name`, `array_name`, `logical_drive_name`
  for PDs: `controller_name`, `physical_drive_name`
  
  For PDs, the _info metric does also contain `array_name` but
  there it's not like a primary key as it may e.g. be empty for
  unassigned drives.

  The idea is that with operators like `… and on(…) …` one should e.g.
  be able to use the _info metric to select all drives that have say
  status==OK ... and intersect that with e.g. the temperatures.

  And it should be possible to make proper alerts like
    count(smartraid_physical_drive_info{status!="OK"}) != 0
  to get any drives which are not OK.

- If some property doesn't appear (e.g. HDDs have no
  smartraid_physical_drive_usage_remaining_ratio), the metric doesn't
  appear for the respective labels. Similarly, if a property that is
  mapped to a label doesn't appear, the label is left empty.
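
The mapping rules above boil down to something like the following. This is a simplified sketch, not the exporter's actual code; the function name is made up, and unit conversions (like percent to ratio) are left out.

```python
def split_properties(properties):
    """Split an object's properties into numeric samples and _info labels."""
    samples = {}
    info_labels = {}
    for key, value in properties.items():
        # bool is a subclass of int in Python, so it must be checked first.
        if isinstance(value, bool):
            info_labels[key] = "1" if value else "0"
        elif isinstance(value, (int, float)):
            samples[key] = float(value)
        else:
            info_labels[key] = str(value)
    return samples, info_labels


# A few properties of logical drive 2 from the JSON above.
props = {"size_bytes": 17592186044416.0, "status": "OK", "caching": True}
samples, info_labels = split_properties(props)
print(samples)
print(info_labels)
```

A property that is missing from the input simply doesn't show up in either dictionary, which matches the "metric doesn't appear" behaviour; filling absent labels with an empty string would be a separate step.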


Now there are quite a few things that I could have done differently from
how I did them in the first draft:

1) The "primary key" labels don't truly uniquely identify the
   respective object over all time, but only at a given time.

   This is basically the point we've discussed before, with the PDs and
   the serial number.

   There may be two different controllers with name `0`, not at the
   same time, but when one breaks and is replaced by another.
   Array A may be deleted and replaced with another one of the same
   name, same for LDs.
   And most obvious of course for PDs.

   I could of course use further labels like WWN or serial numbers, but
   there is no such means for arrays and LDs.

   Also, while in the case of PDs it may make sense to include the
   serial number, so that one can more easily tell that e.g. some
   power-on-hours are those of another drive... it makes IMO less sense
   for e.g. the controller, when one relates that to a PD.
   If the controller changes (but still has name `0`) and the PDs all
   stay... one would not want the PDs, or LDs or arrays to be
   considered different ones from those with the old controller.

   So the question is:
   *If* one actually includes things like the serial number: where?
   *Or* should one leave that exercise to the query (because the info
   which serial number a drive name had at a given time is in principle
   there, namely in the _info metric)? That could make queries quite
   complex, though.

   I should further add that there is no heavenly law that e.g. serial
   numbers are actually different. In fact I have seen cases where
   devices effectively get the same serial number - the kernel does not
   enforce them to be different in any way.
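
To illustrate what including the serial number would change: a time series is identified by its metric name plus its full label set, so any label that differs gives a different series. A small sketch (the helper name and the replacement serial `000X` are made up):

```python
def series_key(name, labels):
    """Return a hashable identity for a time series: name plus sorted labels."""
    return (name, tuple(sorted(labels.items())))


metric = "smartraid_physical_drive_power_on_hours_total"
old = {"controller_name": "0", "physical_drive_name": "3I:1:15"}
new = dict(old)  # a replacement drive in the same bay gets the same name

# Without a serial label, the replaced drive continues the old series.
same_without_serial = series_key(metric, old) == series_key(metric, new)

# With a serial label, the replacement shows up as a different series.
old_s = dict(old, serial_number="000B")
new_s = dict(new, serial_number="000X")  # hypothetical replacement drive
same_with_serial = series_key(metric, old_s) == series_key(metric, new_s)

print(same_without_serial, same_with_serial)
```

The same applies to the controller: if its serial number were a label on the PD metrics, swapping the controller would start new series for all PDs, which is exactly what I'd like to avoid.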


2) Metrics like:
   - smartraid_physical_drive_size_bytes 
   - smartraid_physical_drive_rotational_speed_rpm
   can in principle not change (again: unless the PD is replaced with
   another one of the same name).

   So should they rather be labels, despite being numbers?

   OTOH, metrics like:
   - smartraid_logical_drive_size_bytes
   - smartraid_logical_drive_chunk_size_bytes
   - smartraid_logical_drive_data_stripe_size_bytes
   *can* in principle change (namely if the RAID is converted).


3) Is there a better way to map the various `status` properties?
   I made them labels for now, as described above.
   And remember that I don't know all possible values such enums may
   take.
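
One alternative I could imagine is an enum-style mapping with a catch-all for values I don't know about. A rough sketch, where the function name and every state other than "OK" are pure assumptions on my side (only "OK" appears in the output above):

```python
def status_samples(status, known_states=("OK", "Failed", "Rebuilding")):
    """Return {state: 1.0 or 0.0} for the known states plus a catch-all.

    The default known_states are hypothetical; the real tool may report
    other values.
    """
    samples = {state: 1.0 if status == state else 0.0 for state in known_states}
    # Any value that is not in the known list ends up here, so an
    # unexpected status is still visible instead of being silently dropped.
    samples["other"] = 0.0 if status in known_states else 1.0
    return samples


print(status_samples("OK"))
print(status_samples("Some New Status"))
```

The downside is that the actual unknown value is lost in the metric, so it would probably still have to go into a label of the _info metric as well.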


4) The way I mapped the temperatures...
   I described the alternative I see in
   https://discuss.prometheus.io/t/how-to-design-metrics-labels/2337
   For now I went with the approach of having a dedicated metric for those
   where there's a dedicated property in the RAID tool output, like:
   - smartraid_controller_temperature_celsius
   - smartraid_controller_cache_module_temperature_celsius
   - smartraid_controller_capacitor_temperature_celsius
   - smartraid_physical_drive_temperature_celsius
   and a single metric with "type" label (named `location`) for those
   where the output is variable:
   - smartraid_controller_sensor_info{controller_name="0",location="Inlet Ambient",sensor_name="0"} 1.0
   - smartraid_controller_sensor_info{controller_name="0",location="ASIC",sensor_name="1"} 1.0
   - smartraid_controller_sensor_info{controller_name="0",location="Top",sensor_name="2"} 1.0
   (with the corresponding
   smartraid_controller_sensor_temperature_celsius)


5) Booleans
   There are properties like:
   - multi_actuator_drive (PDs)
     which will never change (for a given PD),
   - shingled_magnetic_recording_support (PDs)
     which might theoretically change (if it actually tells whether the
     drive does SMR, not whether it can do it, and if it can be disabled,
     which I think is possible with zoned drives), and
   - caching (LDs)
     which may actually change at any time (e.g. when the battery
     fails).

   For now I mapped them to labels, rather than metrics.
   This question is similar to (2) above: should they be labels or
   metrics?
   For multi_actuator_drive I'd say clearly label.
   For `caching` it might even be better to have a metric.

   But not sure.
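
For comparison, this is roughly what `caching` would look like as its own metric instead of a label. Just a sketch: the function and the metric name `smartraid_logical_drive_caching_enabled` are made up, and label values are not escaped here.

```python
def bool_sample_line(name, labels, value):
    """Format a boolean as a gauge sample line (1 for true, 0 for false).

    Note: label values are inserted as-is, without escaping.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {1 if value else 0}"


labels = {"controller_name": "0", "array_name": "B", "logical_drive_name": "2"}
# Hypothetical metric name, not part of the current exporter output.
print(bool_sample_line("smartraid_logical_drive_caching_enabled", labels, True))
```

With that, a change of the caching state would just be a change of the sample value, rather than the _info series ending and a new one (with a different label value) starting.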


6) *If* one maps booleans to labels (like I did), are there any
   recommended values?
   I simply took 0/1, because they're the shortest. But of course there
   are also true/false, yes/no, etc.


7) I've also played with the thought of splitting up my current
   *_info metrics into:
   - one which contains those labels that are truly static (per object),
     like serial number and hardware revision
   - one like *_status_info that contains those which may change, like
     `status`, `firmware_version`, etc.
   But I didn't do so, as most can actually change (at least in
   principle).
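
Such a split could look roughly like this. Again only a sketch: the function name is made up, and which labels count as "static" is an assumption.

```python
# Assumption: these labels never change for a given object.
STATIC_LABELS = frozenset({"serial_number", "hardware_revision", "model"})


def split_info_labels(key_labels, info_labels, static=STATIC_LABELS):
    """Split _info labels into a static set and a status set.

    Both results carry the "primary key" labels, so the two metrics can
    still be joined on them.
    """
    static_part = dict(key_labels)
    status_part = dict(key_labels)
    for name, value in info_labels.items():
        target = static_part if name in static else status_part
        target[name] = value
    return static_part, status_part


key = {"controller_name": "0"}
info = {
    "serial_number": "000A",
    "hardware_revision": "A",
    "firmware_version": "6.52",
    "controller_status": "OK",
}
static_part, status_part = split_info_labels(key, info)
print(static_part)
print(status_part)
```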


Thanks,
Chris.

[0] https://prometheus.github.io/client_python/instrumenting/enum/
