[ https://issues.apache.org/jira/browse/SOLR-15056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307552#comment-17307552 ]
Walter Underwood commented on SOLR-15056: ----------------------------------------- 1. The bug is that the CPU utilization circuit breaker is not based on CPU utilization. It is based on load average. I've changed that to use CPU utilization and added a new new on that is based on load average. 2. There are now three circuit breakers, all have unit tests. The load average test was moved to the new load average circuit breaker and a new CPU usage test was created for the existing CPU usage breaker. 3. The documentation changes can't really be separated because they match the three circuit breakers in the code. Part of the bug was that the code did not match the documentation. The underlying JMX metrics are now named and linked. Other behavior is called out, for example that the circuit breaker only stops search requests, not update or admin requests. The other changes are simplifying grammar and wording, things like using "usage" instead of "utilization" consistently. > CPU circuit breaker needs to use CPU utilization, not Unix load average > ----------------------------------------------------------------------- > > Key: SOLR-15056 > URL: https://issues.apache.org/jira/browse/SOLR-15056 > Project: Solr > Issue Type: Bug > Components: metrics > Affects Versions: 8.7 > Reporter: Walter Underwood > Assignee: Atri Sharma > Priority: Major > Labels: Metrics > Attachments: > 0001-SOLR-15056-Circuit-Breakers-use-CPU-utilization-inst.patch, > 0002-SOLR-15056-clean-up-linkage-to-SolrCore-add-back-loa.patch, > SOLR-15056.patch > > > The config range, 50% to 95%, assumes that the circuit breaker is triggered > by a CPU utilization metric that goes from 0% to 100%. But the code uses the > metric OperatingSystemMXBean.getSystemLoadAverage(). That is an average of > the count of processes waiting to run. It is effectively unbounded. I've seen > it as high as 50 to 100. It is not bound by 1.0 (100%). > A good limit for load average would need to be aware of the number of CPUs > available to the JVM. A load average of 8 is no problem for a 32 CPU host. It > is a critical situation for a 2 CPU host. > Also, load average is a Unix OS metric. I don't know if it is even available > on Windows. > Instead, use a CPU utilization metric that goes from 0.0 to 1.0. A good > choice is OperatingSystemMXBean.getSystemCPULoad(). This name also uses > "load", but it is a usage metric. > From the Javadoc: > > Returns the "recent cpu usage" for the whole system. This value is a double > >in the [0.0,1.0] interval. A value of 0.0 means that all CPUs were idle > >during the recent period of time observed, while a value of 1.0 means that > >all CPUs were actively running 100% of the time during the recent period > >being observed. All values betweens 0.0 and 1.0 are possible depending of > >the activities going on in the system. If the system recent cpu usage is not > >available, the method returns a negative value. > https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad() > Also update the documentation to explain which JMX metrics are used for the > memory and CPU circuit breakers. > -- This message was sent by Atlassian Jira (v8.3.4#803005)