Andrzej Bialecki created SOLR-17138:
---------------------------------------

             Summary: Support other QueryTimeout criteria
                 Key: SOLR-17138
                 URL: https://issues.apache.org/jira/browse/SOLR-17138
             Project: Solr
          Issue Type: New Feature
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Query Budget
            Reporter: Andrzej Bialecki


Complex Solr queries can consume significant memory and CPU while being 
processed. When OOM or CPU saturation is reached, Solr becomes unresponsive, 
which further compounds the problem. Often such “killer queries” are not 
written to logs, which makes them difficult to diagnose. This happens even with 
best practices in place.

It should be possible to set limits in Solr that cannot be exceeded by 
individual queries. This mechanism would monitor an accumulating “cost” of a 
query while it’s being executed and compare it to the configured maximum cost 
(budget), expressed in terms of CPU and/or memory usage that can be attributed 
to this query. Should these limits be exceeded, the individual query execution 
should be terminated without affecting other concurrently executing queries.

The CircuitBreakers functionality doesn't distinguish the source of the load 
and can't protect other query executions from a particular runaway query. We 
need a more fine-grained mechanism.

The existing `QueryTimeout` API enables such termination of individual queries. 
However, the existing implementation (`SolrQueryTimeoutImpl` used with 
`timeAllowed` query param) only uses elapsed wall-clock time as the termination 
criterion. This is insufficient: under resource contention, wall-clock time 
does not correctly represent the actual CPU cost of executing a particular 
query. A query may produce results after a long time not because of its 
complexity or bad behavior but because of the general resource contention 
caused by other concurrently executing queries. Conversely, a single runaway 
query may consume all resources and cause all other valid queries to fail if 
they exceed the wall-clock `timeAllowed`.

I propose adding two additional criteria for limiting the maximum "query 
budget":
 * per-thread CPU time: using `getThreadCpuTime` to periodically check 
(`QueryTimeout.shouldExit()`) the current CPU consumption since the start of 
the query execution.
 * per-thread memory allocation: using `getThreadAllocatedBytes`.
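
The two criteria above could look roughly like the following. This is only a 
minimal sketch, not the proposed implementation: `ExitCheck` is a local 
stand-in for the `shouldExit()` part of `QueryTimeout` so that the example 
compiles on its own, and the class names `CpuTimeBudget` and `AllocationBudget` 
are hypothetical. It assumes a HotSpot-style JVM where the platform 
`ThreadMXBean` can be cast to `com.sun.management.ThreadMXBean`, and that each 
budget object is created on the thread that executes the query.

```java
import java.lang.management.ManagementFactory;

/** Local stand-in for the shouldExit() contract (assumption, not the Lucene type). */
interface ExitCheck {
  boolean shouldExit();
}

/** Trips once the creating thread has used more than maxCpuNanos of CPU time. */
final class CpuTimeBudget implements ExitCheck {
  private static final java.lang.management.ThreadMXBean THREADS =
      ManagementFactory.getThreadMXBean();

  private final long threadId = Thread.currentThread().getId();
  private final long startNanos = THREADS.getThreadCpuTime(threadId);
  private final long maxCpuNanos;

  CpuTimeBudget(long maxCpuNanos) {
    this.maxCpuNanos = maxCpuNanos;
  }

  @Override
  public boolean shouldExit() {
    // getThreadCpuTime returns -1 when CPU time measurement is disabled;
    // in that case the difference is 0 and the budget never trips.
    return THREADS.getThreadCpuTime(threadId) - startNanos > maxCpuNanos;
  }
}

/** Trips once the creating thread has allocated more than maxBytes. */
final class AllocationBudget implements ExitCheck {
  // Assumption: the platform bean implements the com.sun.management extension.
  private static final com.sun.management.ThreadMXBean THREADS =
      (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

  private final long threadId = Thread.currentThread().getId();
  private final long startBytes = THREADS.getThreadAllocatedBytes(threadId);
  private final long maxBytes;

  AllocationBudget(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  @Override
  public boolean shouldExit() {
    // Counts bytes allocated by the thread, not bytes still live on the heap.
    return THREADS.getThreadAllocatedBytes(threadId) - startBytes > maxBytes;
  }
}
```

Note that the allocation counter is cumulative: it measures how much the 
thread has allocated since the budget was created, regardless of how much of 
that has already been garbage collected.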

I ran some JMH microbenchmarks to ensure that these two methods are available 
on modern OS/JVM combinations and their cost is negligible (less than 0.5 
us/call). This means that the initial implementation may call these methods 
directly for every `shouldExit()` call without undue burden. If we decide that 
this still adds too much overhead we can change this to periodic updates in a 
background thread.
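
For illustration, the background-thread fallback mentioned above might look 
like this. It is a hedged sketch under assumptions, not a design decision: the 
class name `SampledCpuBudget`, the shared `ScheduledExecutorService` and the 
sampling period are all hypothetical. The sampler reads the query thread's CPU 
time via `getThreadCpuTime(threadId)` and publishes the result through a 
volatile flag, so `shouldExit()` itself becomes a plain field read.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** CPU budget whose measurement runs on a scheduler thread, not the query thread. */
final class SampledCpuBudget implements AutoCloseable {
  private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

  private volatile boolean exceeded;
  private final ScheduledFuture<?> sampler;

  /** Must be constructed on the thread that will execute the query. */
  SampledCpuBudget(ScheduledExecutorService scheduler, long maxCpuNanos, long periodMillis) {
    final long threadId = Thread.currentThread().getId();
    final long startNanos = THREADS.getThreadCpuTime(threadId);
    this.sampler = scheduler.scheduleAtFixedRate(
        () -> {
          // Returns -1 if the thread has died or measurement is disabled,
          // which leaves the flag unset.
          long usedNanos = THREADS.getThreadCpuTime(threadId) - startNanos;
          if (usedNanos > maxCpuNanos) {
            exceeded = true;
          }
        },
        periodMillis, periodMillis, TimeUnit.MILLISECONDS);
  }

  /** Cheap check for the query thread: a single volatile read. */
  boolean shouldExit() {
    return exceeded;
  }

  /** Stops sampling once the query has finished. */
  @Override
  public void close() {
    sampler.cancel(false);
  }
}
```

The trade-off is that the budget can be overshot by up to one sampling period 
before the query notices, in exchange for removing the measurement cost from 
the query thread.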

These two "query budget" constraints can be implemented as implementations of 
the `QueryTimeout` interface. Initially we can use a configuration mechanism 
similar to that of `timeAllowed`, i.e. pass the max value as a query param, or 
add it to the search handler's invariants.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
