[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891263#comment-17891263 ]
Gus Heck edited comment on SOLR-17158 at 10/20/24 3:25 PM:
-----------------------------------------------------------
Also, adding metadata about what fraction of shards completed seems like a reasonable follow-on feature, plus full info about which shards in the debug case... but one of the things I think is difficult about failover-type behavior here is that there are several types of failures:
# The Limit was just too small; even a healthy server can't answer in the allotted time/space (this is a 4xx type of case if an error is to be thrown).
# The query is unreasonable, and even a healthy server can't answer it in the allotted time/space (this is a 4xx type of case if an error is to be thrown).
# The query and Limit are reasonable, but the request still can't be answered (these are 5xx-like cases if an error is to be thrown):
## The cluster is under extreme load, and thus all shards are going to be unable to answer.
## This individual node is under extreme load, and an alternative node might answer.

In every case except 3.2, repeating the request is harmful. The code already detects and retries if the HTTP communication fails, but adding a limit parameter means that we can effectively avoid that retry code. If 3.1 is the usual problem, that would be a good thing; if 3.2 is most common, that's not so good. In the case where zero shards responded, 3.1 seems much more likely.

So after pondering all this for a long time, I've come to the thought that throwing exceptions or otherwise using the response to gauge server health is a poor substitute for system monitoring. So certainly the metadata you suggest might be nice to see for troubleshooting, but I'm leery of the notion that it might be used for automated fail-over / fall-back.

Also, as you can see, if we start throwing errors, we have no way to decide what error to throw... 4xx says "user, you must change your request" and 5xx says "come back later with that request, we've got problems"... So this is another part of why I settled on 200 OK, YOU ASKED FOR IT ;)
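To make that "200 OK, you asked for it" contract concrete, here is a minimal client-side sketch. It assumes SolrJ 9.x, a placeholder collection name "techproducts", and the partialResults flag that shows up in the responseHeader when a limit such as timeAllowed truncates a result; it is only an illustration of the idea discussed above, not code from the patch.

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartialResultsCheck {
  public static void main(String[] args) throws Exception {
    // Base URL and collection name are placeholders for this sketch.
    try (Http2SolrClient client =
             new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {

      SolrQuery q = new SolrQuery("*:*");
      q.setTimeAllowed(50); // limit in milliseconds; deliberately small here to show the effect

      QueryResponse rsp = client.query("techproducts", q);

      // The request still comes back 200 OK; whether the limit tripped is reported
      // as metadata in the responseHeader rather than as a 4xx/5xx error.
      Object partial = rsp.getHeader().get("partialResults");
      if (Boolean.TRUE.equals(partial) || "true".equals(String.valueOf(partial))) {
        System.out.println("Limit reached, results are partial: "
            + rsp.getResults().size() + " docs returned");
      } else {
        System.out.println("Complete results: "
            + rsp.getResults().getNumFound() + " matches");
      }
    }
  }
}
{code}

The decision about what to do with a truncated result stays with the caller; consistent with the comment above, the flag is troubleshooting metadata, not a health signal to drive automated fail-over.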
> Terminate distributed processing quickly when query limit is reached
> ---------------------------------------------------------------------
>
>                 Key: SOLR-17158
>                 URL: https://issues.apache.org/jira/browse/SOLR-17158
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Query Limits
>            Reporter: Andrzej Bialecki
>            Assignee: Gus Heck
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: main (10.0), 9.8
>
>          Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Solr should make sure that when query limits are reached and partial results
> are not needed (and not wanted), then both the processing in the shards and in
> the query coordinator should be terminated as quickly as possible, and Solr
> should minimize wasted resources spent on e.g. returning data from the
> remaining shards, merging responses in the coordinator, or returning any data
> back to the user.
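As a rough sketch of what "partial results are not wanted" might look like in a request: the parameter name partialResults below is assumed from this issue's work (targeted at 9.8 / main (10.0)) and should be checked against the Ref Guide for the version in use; the collection name and query are placeholders.

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class NoPartialResults {
  public static void main(String[] args) throws Exception {
    try (Http2SolrClient client =
             new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {

      SolrQuery q = new SolrQuery("inStock:true"); // placeholder query
      q.setTimeAllowed(100);                       // a query limit, in milliseconds
      // Assumed parameter name from this issue's work: tells Solr the caller has no
      // use for a truncated result, so shards and the coordinator can stop early
      // instead of assembling and returning partial data.
      q.set("partialResults", false);

      QueryResponse rsp = client.query("techproducts", q); // collection name is a placeholder
      System.out.println("status=" + rsp.getStatus()
          + " header=" + rsp.getHeader());
    }
  }
}
{code}

Combined with a limit like timeAllowed, this is the case the description targets: shards and the coordinator stop early rather than spend time assembling a result the client will discard.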