[ 
https://issues.apache.org/jira/browse/SOLR-15056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332818#comment-17332818
 ] 

Walter Underwood edited comment on SOLR-15056 at 4/27/21, 12:34 AM:
--------------------------------------------------------------------

[~ctargett] I'm glad to fix this to follow a standard style.

The current documentation does not say which JMX metrics are used by each 
circuit breaker. I added that documentation.

Some changes document what I found in the code. For example, the current 
documentation says:
{quote}If circuit breakers are enabled, requests may be rejected under the 
condition of high node duress with an appropriate HTTP error code (typically 
503).

It is up to the client to handle this error and potentially build a retrial 
logic as this should ideally be a transient situation.
{quote}
The updated doc says:
{quote}Circuit breakers only interrupt search requests (`SearchHandler`). They 
are not checked for update requests, admin requests, etc. They are checked for 
distributed search requests, so they may result in partial failures for 
multi-shard requests. Circuit breakers are checked once, early in the request 
evaluation, before significant work is done. Long-running requests will not be 
interrupted.

Servers rejecting traffic with a 503 code may mislead a load balancer into 
thinking that they are broken, when they are actually intelligently handling 
overload. This could cause Solr hosts to be dropped from a load balancer, 
causing a cascading overload on the remaining hosts. Make sure the load 
balancer is configured to allow the servers to shed excess load with 503 
responses.
{quote}
There are a number of minor changes, mostly simplifying sentence structure or 
vocabulary. I've learned to do this so documents are more accessible to 
non-native English speakers. For example, "condition of high node duress" is 
replaced with "high node load".

The Wikipedia link was a broken URL; that is also fixed.
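
The older documentation quoted above leaves retry handling to the client. As 
a rough illustration only, here is a minimal sketch of a client retrying on 
503 with capped exponential backoff. The class name, method names, and 
parameters are hypothetical; this is not SolrJ API and not part of the patch. 
The request is passed in as an IntSupplier returning an HTTP status code so 
the sketch stays independent of any HTTP library.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch: retry a request while the server answers 503. */
class RetryOn503Sketch {
    static final int SERVICE_UNAVAILABLE = 503;

    /** Delay before retry number 'attempt' (0-based): base doubled per attempt, capped. */
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis;
        // Stop doubling once the cap is reached, which also avoids overflow
        // for reasonable cap values.
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    /**
     * Calls the request up to maxAttempts times, sleeping between attempts
     * only while the status is 503. Returns the last status seen.
     */
    static int sendWithRetry(IntSupplier request, int maxAttempts,
                             long baseMillis, long capMillis) {
        int status = request.getAsInt();
        for (int attempt = 0;
             status == SERVICE_UNAVAILABLE && attempt < maxAttempts - 1;
             attempt++) {
            try {
                Thread.sleep(backoffMillis(attempt, baseMillis, capMillis));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

Backing off between attempts matters here: retrying immediately would add 
load to a node that is already shedding it.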



> CPU circuit breaker needs to use CPU utilization, not Unix load average
> -----------------------------------------------------------------------
>
>                 Key: SOLR-15056
>                 URL: https://issues.apache.org/jira/browse/SOLR-15056
>             Project: Solr
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 8.7
>            Reporter: Walter Underwood
>            Assignee: Atri Sharma
>            Priority: Major
>              Labels: Metrics
>         Attachments: 
> 0001-SOLR-15056-Circuit-Breakers-use-CPU-utilization-inst.patch, 
> 0002-SOLR-15056-clean-up-linkage-to-SolrCore-add-back-loa.patch, 
> SOLR-15056.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The config range, 50% to 95%, assumes that the circuit breaker is triggered 
> by a CPU utilization metric that goes from 0% to 100%. But the code uses the 
> metric OperatingSystemMXBean.getSystemLoadAverage(). That is an average of 
> the count of processes waiting to run. It is effectively unbounded. I've seen 
> it as high as 50 to 100. It is not bounded by 1.0 (100%).
> A good limit for load average would need to be aware of the number of CPUs 
> available to the JVM. A load average of 8 is no problem for a 32 CPU host. It 
> is a critical situation for a 2 CPU host.
> Also, load average is a Unix OS metric. I don't know if it is even available 
> on Windows.
> Instead, use a CPU utilization metric that goes from 0.0 to 1.0. A good 
> choice is OperatingSystemMXBean.getSystemCpuLoad(). This name also uses 
> "load", but it is a usage metric.
> From the Javadoc:
> > Returns the "recent cpu usage" for the whole system. This value is a double 
> > in the [0.0,1.0] interval. A value of 0.0 means that all CPUs were idle 
> > during the recent period of time observed, while a value of 1.0 means that 
> > all CPUs were actively running 100% of the time during the recent period 
> > being observed. All values betweens 0.0 and 1.0 are possible depending of 
> > the activities going on in the system. If the system recent cpu usage is not 
> > available, the method returns a negative value.
> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad()
> Also update the documentation to explain which JMX metrics are used for the 
> memory and CPU circuit breakers.
>  
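
To make the difference between the two metrics in the description concrete, 
here is a minimal sketch that reads both through JMX. It is an illustration 
under stated assumptions, not the code from the attached patches: the class 
name, the helper methods, and the 0.95 threshold are hypothetical. It assumes 
a JVM that provides com.sun.management.OperatingSystemMXBean, which is where 
getSystemCpuLoad() is declared.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical sketch comparing load average with CPU utilization. */
class CpuMetricsSketch {

    /** Load average divided by CPU count; negative if either input is unusable. */
    static double loadAveragePerCpu(double loadAverage, int cpus) {
        if (loadAverage < 0 || cpus <= 0) {
            return -1.0;
        }
        return loadAverage / cpus;
    }

    /** System CPU utilization in [0.0, 1.0], or a negative value if unavailable. */
    static double systemCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // getSystemCpuLoad() lives on the com.sun.management extension interface.
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getSystemCpuLoad();
        }
        return -1.0;
    }

    /** Hypothetical trip check: true only for a valid reading above the threshold. */
    static boolean shouldTrip(double cpuLoad, double threshold) {
        return cpuLoad >= 0 && cpuLoad > threshold;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double loadAvg = os.getSystemLoadAverage(); // unbounded; negative if unavailable
        int cpus = os.getAvailableProcessors();
        double cpuLoad = systemCpuLoad();

        System.out.println("load average:         " + loadAvg);
        System.out.println("load average per CPU: " + loadAveragePerCpu(loadAvg, cpus));
        System.out.println("system CPU load:      " + cpuLoad);
        System.out.println("trip at 0.95?         " + shouldTrip(cpuLoad, 0.95));
    }
}
```

With the figures from the description, loadAveragePerCpu(8.0, 32) gives 0.25 
and loadAveragePerCpu(8.0, 2) gives 4.0, which is why a raw load average 
cannot be compared against a fixed 50% to 95% range.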



