[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891263#comment-17891263 ]

Gus Heck edited comment on SOLR-17158 at 10/20/24 3:25 PM:
-----------------------------------------------------------

Also, adding metadata about what fraction of shards completed seems like a
reasonable follow-on feature, with full info about which ones in the debug
case... but one of the things I think is difficult about failover-type
behavior here is that there are several types of failures:
 # The limit was just too small; even a healthy server can't answer in the
allotted time/space (this is a 4xx type of case if an error is to be thrown;
see the client-side sketch after this list)
 # The query is unreasonable, and even a healthy server can't answer it in the
allotted time/space (this is a 4xx type of case if an error is to be thrown)
 # The query and limit are reasonable, but the system is overloaded (these are
5xx-like cases if an error is to be thrown)
 ## The cluster is under extreme load, and thus all shards are going to be
unable to answer
 ## This individual node is under extreme load, and an alternative node might
answer.
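
To make case 1 concrete, here is roughly what a client already observes when a
limit trips on an otherwise healthy query (a quick SolrJ sketch, not tested;
the URL and collection name are just placeholders): the request still comes
back 200/OK, and the only signal is the partialResults flag in the response
header.

{code:java}
// Sketch only: deliberately tiny timeAllowed so the limit trips, then check
// the partialResults flag. URL and collection name are placeholders.
import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartialResultsCheck {
  public static void main(String[] args) throws SolrServerException, IOException {
    try (Http2SolrClient client =
        new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setTimeAllowed(1); // deliberately too small

      QueryResponse rsp = client.query("techproducts", q);

      // No exception, HTTP 200 -- the limit shows up only as response metadata.
      Object partial = rsp.getResponseHeader().get("partialResults");
      System.out.println("partialResults=" + partial
          + ", numFound=" + rsp.getResults().getNumFound());
    }
  }
}
{code}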

In every case except 3.2, repeating the request is harmful. The code already
detects and retries if the HTTP communication fails, but adding a limit
parameter means that we can effectively avoid that retry code. If 3.1 is the
usual problem, that's a good thing; if 3.2 is most common, that's not so good.
In the case where zero shards responded, 3.1 seems much more likely. So after
pondering all this for a long time, I've come to the thought that throwing
exceptions or otherwise using the response to gauge server health is a poor
substitute for system monitoring. So certainly the metadata you suggest might
be nice to see for troubleshooting, but I'm leery of the notion that it might
be used for automated fail-over / fall-back.
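
If it helps to see what I mean by monitoring rather than response-driven
fail-over, something along these lines on the client side is the sort of thing
I have in mind (purely illustrative, not tested; the counter just stands in
for whatever metrics plumbing you already have):

{code:java}
// Illustrative only: record the partial-result signal for monitoring/alerting
// instead of using it to drive automatic retry or fail-over.
import java.util.concurrent.atomic.AtomicLong;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartialResultsMonitor {
  // Stand-in counter; in practice this would be a real metrics registry.
  static final AtomicLong partialResponses = new AtomicLong();

  static QueryResponse search(Http2SolrClient client, String collection, SolrQuery q)
      throws Exception {
    QueryResponse rsp = client.query(collection, q);
    Object partial = rsp.getResponseHeader().get("partialResults");
    if (partial != null && Boolean.parseBoolean(partial.toString())) {
      // Count it and move on -- no automatic retry, since the response alone
      // can't distinguish case 3.1 (whole cluster overloaded) from 3.2
      // (just this node).
      partialResponses.incrementAndGet();
    }
    return rsp;
  }
}
{code}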

Also, as you can see, if we start throwing errors we have no way to decide
which error to throw... 4xx says "user, you must change your request" and 5xx
says "come back later with that request, we've got problems"... So this is
another part of why I settled on 200 OK, YOU ASKED FOR IT ;)



> Terminate distributed processing quickly when query limit is reached
> --------------------------------------------------------------------
>
>                 Key: SOLR-17158
>                 URL: https://issues.apache.org/jira/browse/SOLR-17158
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Query Limits
>            Reporter: Andrzej Bialecki
>            Assignee: Gus Heck
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: main (10.0), 9.8
>
>          Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Solr should make sure that when query limits are reached and partial results 
> are not needed (and not wanted) then both the processing in shards and in the 
> query coordinator should be terminated as quickly as possible, and Solr 
> should minimize wasted resources spent on e.g. returning data from the 
> remaining shards, merging responses in the coordinator, or returning any data 
> back to the user.


