[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers

Andrzej Bialecki (Jira) Wed, 07 Apr 2021 07:12:21 -0700


    [ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316366#comment-17316366 ]


Andrzej Bialecki commented on SOLR-15300:
-----------------------------------------

The {{replicationFactor}} is ill-defined, at least the way it's used. It 
reflects only the initial setup: you are free to add / remove replicas 
afterwards, at which point the value no longer holds true. It doesn't reflect 
per-shard replication either.

I would go even further - we should remove it from collection state because 
it's misleading.

Another question is "what is the intended replication factor, and how do we 
measure it?" This is not obvious either, because it may depend on circumstances 
(e.g. adding replicas during search traffic spikes and removing them 
afterwards). This may be a task for some external agent to figure out.

I think it's much easier for this issue to focus on clearly reporting the most 
common abnormal states, e.g. shard has replicas down/recovering, shard has no 
replicas, shard has no leader.

Also, at the Java level you can already get all this information, so I think 
the scope of this issue is only what to do about external reporting / 
monitoring, either via metrics or via ClusterState / Slice. As such, I don't 
think we have to explicitly store this state anywhere; we can construct it 
on the fly for the purpose of reporting.
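
To illustrate what "constructing it on the fly" could look like, here is a 
minimal sketch that derives a shard health value from the replica states at 
reporting time. Everything in it is an assumption made for illustration: 
{{ReplicaInfo}}, {{ShardHealthSketch}}, the {{Health}} values and their 
severity ordering are hypothetical stand-ins, not Solr's {{Slice}} / 
{{Replica}} classes and not a proposed API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the per-replica information; NOT Solr's Replica class.
final class ReplicaInfo {
    enum State { ACTIVE, RECOVERING, DOWN }

    final State state;
    final boolean leader;

    ReplicaInfo(State state, boolean leader) {
        this.state = state;
        this.leader = leader;
    }
}

// Hypothetical helper: computes a health value on demand, stores nothing.
final class ShardHealthSketch {
    // Assumed health values, most severe last; the names are illustrative only.
    enum Health { OK, DEGRADED, NO_LEADER, NO_REPLICAS }

    static Health classify(List<ReplicaInfo> replicas) {
        if (replicas.isEmpty()) {
            return Health.NO_REPLICAS;          // shard has no replicas
        }
        boolean hasActiveLeader = false;
        boolean anyImpaired = false;
        for (ReplicaInfo r : replicas) {
            if (r.state != ReplicaInfo.State.ACTIVE) {
                anyImpaired = true;             // replica is down or recovering
            } else if (r.leader) {
                hasActiveLeader = true;         // only an ACTIVE leader counts
            }
        }
        if (!hasActiveLeader) {
            return Health.NO_LEADER;            // also covers "all replicas down"
        }
        return anyImpaired ? Health.DEGRADED : Health.OK;
    }

    public static void main(String[] args) {
        List<ReplicaInfo> shard = Arrays.asList(
                new ReplicaInfo(ReplicaInfo.State.ACTIVE, true),
                new ReplicaInfo(ReplicaInfo.State.RECOVERING, false));
        System.out.println(classify(shard));
    }
}
```

Because the value is a pure function of the current replica states, a metrics 
reporter or a status handler could call something like this each time it 
reports, without any new state being persisted.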

> Shard "state" flag is confusing and of limited value to outside consumers
> -------------------------------------------------------------------------
>
>                 Key: SOLR-15300
>                 URL: https://issues.apache.org/jira/browse/SOLR-15300
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> The Solr API (and consequently the metric reporters, which are often used 
> for Solr monitoring) reports the shard as being in ACTIVE state even when in 
> reality its functionality is severely compromised (e.g. no replicas, all 
> replicas down, or no leader).
> This reported state is technically correct because it is used only for 
> tracking of the SPLITSHARD operations, as defined in {{Slice.State}}. 
> However, this may be misleading and more often unhelpful than not - for 
> constant monitoring a flag that actually reports impaired functionality of a 
> shard would be more useful than a flag that reports a relatively uncommon 
> SPLITSHARD operation.
> We could either redefine the meaning of the existing flag (and change its 
> state according to some of the criteria I listed above), or add another flag 
> to represent the "health" status of a shard. The value of this flag would 
> then provide an easy way to monitor and to alert external systems of 
> dangerous function impairment, without monitoring the state of all replicas 
> of a collection.



