[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311024#comment-17311024 ]
Jan Høydahl commented on SOLR-15300:
------------------------------------

Agree. Last week I was attempting to create a simple, generic Prometheus alert rule that triggers whenever a collection has a shard whose intended replicationFactor is not satisfied. Something like:

* Green - all OK: all replicas in all shards have state==active (and are represented in live_nodes).
* Yellow - still operational, but replicationFactor is not satisfied at the moment. (Would trigger a non-critical alert: "Shard N for collection C has a lower replicationFactor (A) than configured (B).")
* Red - no replicas for a shard are active; they may be in any other state. (Would trigger a critical alert: "Collection C is down. Shard N has no live replicas. Recovery is in progress.")

Currently I cannot find a single metric that can figure this out. I have tried compiling various JQ logic on the CLUSTERSTATE data, but it's quite hard to combine the configured replicationFactor with the actual one in a generic way for all replicas in all shards of a collection and fold it into something alertable.

So very much +1 to improving this situation. Perhaps this collides a bit with the PRS effort, which aims to not touch state.json for state changes in replicas... So I don't know.

> Shard "state" flag is confusing and of limited value to outside consumers
> -------------------------------------------------------------------------
>
>                 Key: SOLR-15300
>                 URL: https://issues.apache.org/jira/browse/SOLR-15300
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> The Solr API (and consequently the metric reporters, which are often used for
> Solr monitoring) reports the shard as being in ACTIVE state even when in
> reality its functionality is severely compromised (e.g. no replicas, all
> replicas down, or no leader).
> This reported state is technically correct because it is used only for
> tracking of the SPLITSHARD operations, as defined in {{Slice.State}}.
> However, this may be misleading and more often unhelpful than not - for
> constant monitoring a flag that actually reports impaired functionality of a
> shard would be more useful than a flag that reports a relatively uncommon
> SPLITSHARD operation.
> We could either redefine the meaning of the existing flag (and change its
> state according to some of the criteria I listed above), or add another flag
> to represent the "health" status of a shard. The value of this flag would
> then provide an easy way to monitor and to alert external systems of
> dangerous function impairment, without monitoring the state of all replicas
> of a collection.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
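For illustration only, the green/yellow/red scheme from the comment above could be folded out of cluster state roughly as follows. This is a minimal sketch, not anything Solr ships: it assumes the input is a dict shaped like the Collections API CLUSTERSTATUS response (`cluster.live_nodes`, `cluster.collections.<name>.replicationFactor`, and per-replica `state` and `node_name`), the function names `shard_health` and `cluster_health` are hypothetical, and falling back to a replicationFactor of 1 when the field is missing is an arbitrary choice.

```python
from typing import Any, Dict, List, Set, Tuple


def shard_health(shard: Dict[str, Any], replication_factor: int,
                 live_nodes: Set[str]) -> Tuple[str, int]:
    """Classify one shard; returns (color, number of usable replicas)."""
    # A replica counts only if it is active AND hosted on a live node.
    active = sum(
        1
        for replica in shard.get("replicas", {}).values()
        if replica.get("state") == "active"
        and replica.get("node_name") in live_nodes
    )
    if active == 0:
        return "red", active      # no usable replica for this shard
    if active < replication_factor:
        return "yellow", active   # serving, but below the configured factor
    return "green", active


def cluster_health(status: Dict[str, Any]) -> List[Tuple[str, str, str, int, int]]:
    """Return (collection, shard, color, active, configured) for every shard."""
    cluster = status["cluster"]
    live_nodes = set(cluster.get("live_nodes", []))
    rows = []
    for coll_name, coll in cluster.get("collections", {}).items():
        # Assumption: replicationFactor may arrive as a string or be missing;
        # the fallback of 1 is a placeholder, not Solr behaviour.
        configured = int(coll.get("replicationFactor") or 1)
        for shard_name, shard in coll.get("shards", {}).items():
            color, active = shard_health(shard, configured, live_nodes)
            rows.append((coll_name, shard_name, color, active, configured))
    return rows
```

Note that this compares the active replica count against the configured factor, so a shard with more replicas than configured can lose one and still come out green, which is slightly looser than "all replicas active". Each row could then be exported as a per-shard gauge for an alert rule to threshold on.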