Andrzej Bialecki created SOLR-17138:
---------------------------------------

             Summary: Support other QueryTimeout criteria
                 Key: SOLR-17138
                 URL: https://issues.apache.org/jira/browse/SOLR-17138
             Project: Solr
          Issue Type: New Feature
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Query Budget
            Reporter: Andrzej Bialecki


Complex Solr queries can consume significant memory and CPU while being processed. When OOM or CPU saturation is reached, Solr becomes unresponsive, which further compounds the problem. Often such “killer queries” are not written to logs, which makes them difficult to diagnose. This happens even with best practices in place.

It should be possible to set limits in Solr that cannot be exceeded by individual queries. This mechanism would monitor an accumulating “cost” of a query while it’s being executed and compare it to the configured maximum cost (budget), expressed in terms of CPU and/or memory usage that can be attributed to this query. Should these limits be exceeded, the individual query execution should be terminated, without affecting other concurrently executing queries.

The CircuitBreakers functionality doesn't distinguish the source of the load and can't protect other query executions from a particular runaway query. We need a more fine-grained mechanism.

The existing `QueryTimeout` API enables such termination of individual queries. However, the existing implementation (`SolrQueryTimeoutImpl`, used with the `timeAllowed` query param) only uses elapsed wall-clock time as the termination criterion. This is insufficient - under resource contention, wall-clock time does not correctly represent the actual CPU cost of executing a particular query. A query may produce results after a long time not because of its complexity or bad behavior but because of the general resource contention caused by other concurrently executing queries. On the other hand, a single runaway query may consume all resources and cause all other valid queries to fail if they exceed the wall-clock `timeAllowed`.
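
To illustrate the gap between wall-clock time and CPU cost described above, here is a minimal, self-contained sketch. It is not Solr code: the class and method names are hypothetical, and `Thread.sleep` merely stands in for a query thread that is waiting because of contention. It only relies on the standard `java.lang.management.ThreadMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration, not part of Solr.
public class WallClockVsCpuTime {

    /**
     * Sleeps for sleepMillis on the calling thread and returns
     * {elapsed wall-clock nanos, CPU nanos consumed by this thread}.
     */
    public static long[] measureSleep(long sleepMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Some JVMs may not support per-thread CPU time; report 0 in that case.
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        try {
            // Stand-in for a thread that waits (e.g. is starved by other queries).
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long wallNanos = System.nanoTime() - wallStart;
        long cpuNanos = cpuSupported ? threads.getCurrentThreadCpuTime() - cpuStart : 0L;
        return new long[] {wallNanos, cpuNanos};
    }

    public static void main(String[] args) {
        long[] r = measureSleep(200);
        System.out.printf("wall-clock: %d ms, thread CPU: %d ms%n",
                r[0] / 1_000_000, r[1] / 1_000_000);
    }
}
```

A wall-clock limit would charge this thread for the full waiting period, although it consumed almost no CPU while waiting.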

I propose adding two additional criteria for limiting the maximum "query budget":

* per-thread CPU time: using `getThreadCpuTime` to periodically check (in `QueryTimeout.shouldExit()`) the CPU time consumed since the start of the query execution.
* per-thread memory allocation: using `getThreadAllocatedBytes`.

I ran some JMH microbenchmarks to ensure that these two methods are available on modern OS/JVM combinations and that their cost is negligible (less than 0.5 us/call). This means that the initial implementation may call these methods directly for every `shouldExit()` call without undue burden. If we decide that this still adds too much overhead, we can change this to periodic updates in a background thread.

These two "query budget" constraints can be implemented as subclasses of `QueryTimeout`. Initially we can use a configuration mechanism similar to the one for `timeAllowed`, i.e. pass the max value as a query param, or add it to the search handler's invariants.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
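
The two criteria proposed in this issue could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `ExitCriterion` is a hypothetical stand-in for Lucene's `QueryTimeout` (so the sketch compiles without Lucene), the class names `CpuTimeBudget` and `AllocationBudget` and the helper `spinUntilExit` are invented, and each budget is assumed to be created on the thread that executes the query. `getThreadAllocatedBytes` is taken from `com.sun.management.ThreadMXBean`, a HotSpot-specific extension that may be missing on other JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch, not Solr code.
public class QueryBudgetSketch {

    /** Hypothetical stand-in for Lucene's QueryTimeout. */
    public interface ExitCriterion {
        boolean shouldExit();
    }

    /** Trips once the creating thread has used more than maxCpuNanos of CPU time. */
    public static final class CpuTimeBudget implements ExitCriterion {
        private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
        private final long threadId = Thread.currentThread().getId();
        private final long startCpuNanos;
        private final long maxCpuNanos;

        public CpuTimeBudget(long maxCpuNanos) {
            this.maxCpuNanos = maxCpuNanos;
            this.startCpuNanos = THREADS.getThreadCpuTime(threadId);
        }

        @Override
        public boolean shouldExit() {
            long now = THREADS.getThreadCpuTime(threadId);
            // -1 means CPU time measurement is disabled or the thread is gone.
            if (now < 0 || startCpuNanos < 0) return false;
            return now - startCpuNanos > maxCpuNanos;
        }
    }

    /** Trips once the creating thread has allocated more than maxBytes. */
    public static final class AllocationBudget implements ExitCriterion {
        private final com.sun.management.ThreadMXBean bean; // null if unavailable
        private final long threadId = Thread.currentThread().getId();
        private final long startBytes;
        private final long maxBytes;

        public AllocationBudget(long maxBytes) {
            ThreadMXBean b = ManagementFactory.getThreadMXBean();
            this.bean = (b instanceof com.sun.management.ThreadMXBean)
                    ? (com.sun.management.ThreadMXBean) b : null;
            this.maxBytes = maxBytes;
            this.startBytes = (bean == null) ? -1L : bean.getThreadAllocatedBytes(threadId);
        }

        @Override
        public boolean shouldExit() {
            if (bean == null || startBytes < 0) return false;
            long now = bean.getThreadAllocatedBytes(threadId);
            return now >= 0 && now - startBytes > maxBytes;
        }
    }

    /** Busy-polls the criterion until it trips or a wall-clock guard expires. */
    public static boolean spinUntilExit(ExitCriterion criterion, long guardMillis) {
        long deadline = System.nanoTime() + guardMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (criterion.shouldExit()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 5 ms CPU budget, with a 5 s wall-clock guard so the demo always ends.
        boolean tripped = spinUntilExit(new CpuTimeBudget(5_000_000L), 5_000);
        System.out.println("CPU budget tripped: " + tripped);
    }
}
```

Both classes call the JMX method directly on every `shouldExit()` invocation, matching the initial approach described in the issue; the background-thread variant is not shown.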