[ 
https://issues.apache.org/jira/browse/SOLR-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki updated SOLR-17138:
------------------------------------
    Description: 
Complex Solr queries can consume significant memory and CPU while being 
processed. When OOM or CPU saturation is reached, Solr becomes unresponsive, 
which further compounds the problem. Often such “killer queries” are not 
written to logs, which makes them difficult to diagnose. This happens even 
with best practices in place.

It should be possible to set limits in Solr that cannot be exceeded by 
individual queries. This mechanism would monitor the accumulating “cost” of a 
query while it’s being executed and compare it to the configured maximum cost 
(budget), expressed in terms of the CPU and/or memory usage that can be 
attributed to this query. If these limits are exceeded, the individual query 
execution should be terminated without affecting other concurrently executing 
queries.

The CircuitBreakers functionality doesn't distinguish the source of the load 
and can't protect other query executions from a particular runaway query. We 
need a more fine-grained mechanism.

The existing {{QueryTimeout}} API enables such termination of individual 
queries. However, the existing implementation ({{SolrQueryTimeoutImpl}} used 
with the {{timeAllowed}} query param) only uses elapsed wall-clock time as the 
termination criterion. This is insufficient: under resource contention, 
wall-clock time does not correctly represent the actual CPU cost of executing 
a particular query. A query may produce results after a long time not because 
of its complexity or bad behavior but because of the general resource 
contention caused by other concurrently executing queries. On the other hand, 
a single runaway query may consume all resources and cause all other valid 
queries to fail if they exceed the wall-clock {{timeAllowed}}.
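
The gap between the two clocks is easy to observe. The following is only a 
minimal sketch, not Solr code: {{ClockGap}} and {{measureWhileWaiting}} are 
hypothetical names, and {{Thread.sleep}} stands in for the time a query thread 
spends blocked or descheduled under contention.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical demo: wall-clock time keeps running while a thread waits; CPU time barely moves. */
final class ClockGap {
  /** Returns {wallNanos, cpuNanos} accumulated by the calling thread while it waits. */
  static long[] measureWhileWaiting(long waitMillis) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    long wallStart = System.nanoTime();
    long cpuStart = threads.getCurrentThreadCpuTime();

    try {
      // Stand-in for a query thread that is waiting on contended resources.
      Thread.sleep(waitMillis);
    } catch (InterruptedException e) {
      // Preserve the interrupt status and report whatever was measured so far.
      Thread.currentThread().interrupt();
    }

    long wallNanos = System.nanoTime() - wallStart;
    long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
    return new long[] {wallNanos, cpuNanos};
  }
}
```

A wall-clock limit would charge the whole wait to this thread, while a 
CPU-time limit would charge it only for the small amount of work it actually 
did.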

I propose adding two additional criteria for limiting the maximum "query 
budget":
 * per-thread CPU time: using {{ThreadMXBean.getThreadCpuTime}} to check 
periodically, in {{QueryTimeout.shouldExit()}}, the CPU time consumed since 
the start of the query execution.
 * per-thread memory allocation: using {{getThreadAllocatedBytes}} (from 
{{com.sun.management.ThreadMXBean}}) to check the bytes allocated since the 
start of the query execution.
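
The two criteria could look roughly like this. This is a sketch under stated 
assumptions, not the proposed implementation: {{ExitCheck}} is a local 
stand-in for the {{shouldExit()}} shape of {{QueryTimeout}} so that the 
example compiles without Lucene, and {{CpuTimeLimit}} and {{AllocationLimit}} 
are hypothetical names. It also assumes a JVM whose thread bean is a 
{{com.sun.management.ThreadMXBean}} (as on HotSpot) and that the query runs on 
the single thread that constructs the limit.

```java
import java.lang.management.ManagementFactory;

/** Local stand-in for the shape of QueryTimeout.shouldExit(); not the Lucene type. */
interface ExitCheck {
  boolean shouldExit();
}

/** Hypothetical CPU-time budget for the thread that constructs it. */
final class CpuTimeLimit implements ExitCheck {
  private static final java.lang.management.ThreadMXBean THREADS =
      ManagementFactory.getThreadMXBean();
  private final long limitNanos;
  private final long startNanos;

  CpuTimeLimit(long limitNanos) {
    this.limitNanos = limitNanos;
    this.startNanos = THREADS.getCurrentThreadCpuTime();
  }

  @Override
  public boolean shouldExit() {
    // The counter is per-thread, so this must run on the constructing thread.
    return THREADS.getCurrentThreadCpuTime() - startNanos > limitNanos;
  }
}

/** Hypothetical allocation budget for the thread that constructs it. */
final class AllocationLimit implements ExitCheck {
  // Assumption: the platform bean implements the com.sun.management extension.
  private static final com.sun.management.ThreadMXBean THREADS =
      (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
  private final long threadId = Thread.currentThread().getId();
  private final long limitBytes;
  private final long startBytes;

  AllocationLimit(long limitBytes) {
    this.limitBytes = limitBytes;
    this.startBytes = THREADS.getThreadAllocatedBytes(threadId);
  }

  @Override
  public boolean shouldExit() {
    // Counts bytes allocated by the thread, not bytes still live on the heap.
    return THREADS.getThreadAllocatedBytes(threadId) - startBytes > limitBytes;
  }
}
```

Both checks measure only one thread, so work handed off to other threads 
would not be counted by this sketch.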

I ran some JMH microbenchmarks to ensure that these two methods are available 
on modern OS/JVM combinations and that their cost is negligible (less than 
0.5 µs/call). This means that the initial implementation may call these 
methods directly for every {{shouldExit()}} call without undue burden. If we 
decide that this still adds too much overhead, we can change this to periodic 
updates in a background thread.
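
For illustration, the background-thread fallback might be sketched as 
follows. Everything here is an assumption rather than part of the proposal: 
{{SampledCpuTimeLimit}} is a hypothetical name, the sampling period is 
arbitrary, and a real implementation would more likely share one sampler 
across queries than start one per query.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical CPU budget whose counter is sampled off the query thread. */
final class SampledCpuTimeLimit implements AutoCloseable {
  private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
  private final ScheduledExecutorService sampler;
  private volatile boolean exceeded;

  SampledCpuTimeLimit(long limitNanos, long periodMillis) {
    // Track the thread that creates the limit, i.e. the query thread.
    final long threadId = Thread.currentThread().getId();
    final long startNanos = THREADS.getThreadCpuTime(threadId);
    sampler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "cpu-budget-sampler");
      t.setDaemon(true);
      return t;
    });
    sampler.scheduleAtFixedRate(() -> {
      // getThreadCpuTime returns -1 if the thread has died or measurement is disabled.
      long now = THREADS.getThreadCpuTime(threadId);
      if (now >= 0 && now - startNanos > limitNanos) {
        exceeded = true;
      }
    }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
  }

  /** Cheap check for the hot path: a single volatile read. */
  boolean shouldExit() {
    return exceeded;
  }

  @Override
  public void close() {
    sampler.shutdownNow();
  }
}
```

The trade-off is that a query can overshoot its budget by up to one sampling 
period before {{shouldExit()}} reports it.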

These two "query budget" constraints can be implemented as implementations of 
the {{QueryTimeout}} interface. Initially we can use a configuration mechanism 
similar to that of {{timeAllowed}}, i.e. pass the maximum value as a query 
param, or add it to the search handler's invariants.
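
As a rough sketch of the configuration side, the limits could be read from 
request parameters like this. The parameter names {{cpuAllowed}} and 
{{memAllowed}} are hypothetical placeholders, the plain {{Map}} stands in for 
the already-resolved request params, and treating non-positive values as "no 
limit" is an assumption.

```java
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical helper that reads budget limits from resolved request parameters. */
final class BudgetParams {
  // Placeholder parameter names; the issue does not define them.
  static final String CPU_ALLOWED = "cpuAllowed";
  static final String MEM_ALLOWED = "memAllowed";

  /** Returns the configured limit, or empty if the parameter is absent or non-positive. */
  static OptionalLong limit(Map<String, String> params, String name) {
    String raw = params.get(name);
    if (raw == null || raw.trim().isEmpty()) {
      return OptionalLong.empty();
    }
    // Throws NumberFormatException on malformed input so bad requests fail loudly.
    long value = Long.parseLong(raw.trim());
    // Assumption: a non-positive value means "no limit".
    return value > 0 ? OptionalLong.of(value) : OptionalLong.empty();
  }
}
```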


> Support other QueryTimeout criteria
> -----------------------------------
>
>                 Key: SOLR-17138
>                 URL: https://issues.apache.org/jira/browse/SOLR-17138
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Query Budget
>            Reporter: Andrzej Bialecki
>            Priority: Major



