[ 
https://issues.apache.org/jira/browse/SOLR-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Houston Putman updated SOLR-17819:
----------------------------------
    Description: 
When beasting {{DistributedDebugComponentTest.testTolerantSearch}}, there is a 
really weird error around cancelling requests. The test does a non-tolerant 
search and then a tolerant search. The second part of the test, the tolerant 
search, fails very occasionally, but only when the non-tolerant search is done 
first. When the non-tolerant search is commented out, the tolerant search does 
not fail.

The tolerant search fails (occasionally) because all three shard requests fail 
instead of just one (the one sent to a non-existent endpoint). The bad shard 
has the failure that the test expects, but the good shards both fail with 
{{java.io.IOException: cancel_stream_error/unexpected_data_frame}}, meaning 
that the requests were cancelled, even though the request is "tolerant". I did 
a lot of debugging here, and found that Solr is behaving correctly: we are not 
cancelling shard requests for tolerant Solr requests. The fact that the 
failures stop when the non-tolerant search request right before the tolerant 
search request is commented out tells us that the cancellations from the 
non-tolerant request are bleeding into the tolerant request. This is bad. I 
also confirmed this by commenting out the line that actually cancels the HTTP 
requests: 
[https://github.com/apache/solr/blob/branch_9_9/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L570-L574]
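
To make the expected behavior concrete, here is a minimal sketch of the invariant that is being violated. It is not Solr's {{HttpShardHandler}} or {{Http2SolrClient}} code: the class {{ShardFanOutSketch}}, its methods, and the use of {{CompletableFuture}} to stand in for shard requests are all assumptions made for illustration. Each distributed request tracks only its own shard requests, so a failure in a non-tolerant request can cancel its own siblings but can never touch a later tolerant request.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical model of one distributed request fanning out to shards.
 * This is NOT Solr's HttpShardHandler or Http2SolrClient code.
 */
class ShardFanOutSketch {

  /** Counts of shard outcomes for a single distributed request. */
  static final class Outcome {
    final int succeeded;
    final int failed;
    final int cancelled;

    Outcome(int succeeded, int failed, int cancelled) {
      this.succeeded = succeeded;
      this.failed = failed;
      this.cancelled = cancelled;
    }

    @Override
    public String toString() {
      return "succeeded=" + succeeded + " failed=" + failed + " cancelled=" + cancelled;
    }
  }

  private final boolean tolerant;
  // Only the shard requests that belong to this distributed request.
  private final List<CompletableFuture<String>> pending = new ArrayList<>();

  ShardFanOutSketch(boolean tolerant) {
    this.tolerant = tolerant;
  }

  /** Registers a new in-flight shard request for this distributed request. */
  CompletableFuture<String> submit() {
    CompletableFuture<String> shardRequest = new CompletableFuture<>();
    pending.add(shardRequest);
    return shardRequest;
  }

  /** Records a shard failure; a non-tolerant request then cancels its own siblings. */
  void onShardFailure(CompletableFuture<String> shardRequest, Throwable cause) {
    shardRequest.completeExceptionally(cause);
    if (!tolerant) {
      // cancel() is a no-op for futures that have already completed.
      for (CompletableFuture<String> other : pending) {
        other.cancel(true);
      }
    }
  }

  Outcome outcome() {
    int succeeded = 0;
    int failed = 0;
    int cancelled = 0;
    for (CompletableFuture<String> f : pending) {
      if (f.isCancelled()) {
        cancelled++; // checked first: a cancelled future also counts as exceptional
      } else if (f.isCompletedExceptionally()) {
        failed++;
      } else if (f.isDone()) {
        succeeded++;
      }
    }
    return new Outcome(succeeded, failed, cancelled);
  }

  /** Simulates three shards where the shard at badShardIndex is unreachable. */
  static Outcome simulate(boolean tolerant, int badShardIndex) {
    ShardFanOutSketch request = new ShardFanOutSketch(tolerant);
    List<CompletableFuture<String>> shards = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      shards.add(request.submit());
    }
    request.onShardFailure(shards.get(badShardIndex), new IOException("connection refused"));
    for (CompletableFuture<String> shard : shards) {
      shard.complete("ok"); // no-op if the shard already failed or was cancelled
    }
    return request.outcome();
  }

  public static void main(String[] args) {
    System.out.println("non-tolerant: " + simulate(false, 0));
    System.out.println("tolerant:     " + simulate(true, 0));
  }
}
```

In this model the non-tolerant request ends with one failed and two cancelled shard requests, and the tolerant request that follows ends with one failed and two successful ones. The failing test runs correspond to the tolerant request reporting cancellations as well, which this per-request scoping rules out.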

This only happens on branch_9x (and presumably branch_9_9), not on main. So I 
believe it's a bug in Jetty 10, which Jetty 12 has solved. So we are probably 
fine just fixing this part on branch_9x and branch_9_9, and leaving the request 
cancellation enabled on main (10.x).

Amazingly, when beasting, it makes a big difference whether the non-existent 
endpoint is put first or last in the list of shards. The failure rate is much 
higher when the bad shard is listed first rather than last.

  was:
However, after fixing that and beasting the tests, there is a really weird 
error around cancelling requests. The test does a non-tolerant search and then 
a tolerant search. The error I described above was breaking the non-tolerant 
search. That is easily fixable. The second part of the test, the tolerant 
search, fails very occasionally, but only when the non-tolerant search is done 
first. When the non-tolerant search is commented out, the tolerant search does 
not fail.

When beasting {{DistributedDebugComponentTest.testTolerantSearch}}, and adding 
a loop to do the requests 1,000 times, the tolerant search fails because all 
three shard requests fail instead of just one (the one sent to a non-existent 
endpoint). The bad shard has the failure that the test expects, but the good 
shards both fail with {{java.io.IOException: 
cancel_stream_error/unexpected_data_frame}}, meaning that the requests were 
cancelled, even though the request is "tolerant". I did a lot of debugging 
here, and found that Solr is behaving correctly: we are not cancelling shard 
requests for tolerant Solr requests. The fact that the failures stop when the 
non-tolerant search request right before the tolerant search request is 
commented out tells us that the cancellations from the non-tolerant request 
are bleeding into the tolerant request. This is bad. I also confirmed this by 
commenting out the line that actually cancels the HTTP requests (when 
commented out, the test succeeds): 
[https://github.com/apache/solr/blob/branch_9_9/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L570-L574]

This only happens on branch_9x (and presumably branch_9_9), not on main. So I 
believe it's a bug in Jetty 10, which Jetty 12 has solved. So we are probably 
fine just fixing this part on branch_9x and branch_9_9, and leaving the request 
cancellation enabled on main (10.x).

Amazingly, when beasting, it makes a big difference whether the non-existent 
endpoint is put first or last in the list of shards. The failure rate is much 
higher when the bad shard is listed first rather than last.


> HttpShardHandler non-tolerant request cancellation bleeds across requests
> -------------------------------------------------------------------------
>
>                 Key: SOLR-17819
>                 URL: https://issues.apache.org/jira/browse/SOLR-17819
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Houston Putman
>            Assignee: Houston Putman
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 9.9
>
>          Time Spent: 20m
>  Remaining Estimate: 0h


