[jira] [Updated] (SOLR-17143) Streaming with multiple shards can trigger unexpected IdleTimeout

Patson Luk (Jira) Tue, 30 Jan 2024 14:21:08 -0800


     [ 
https://issues.apache.org/jira/browse/SOLR-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Patson Luk updated SOLR-17143:
------------------------------
    Summary: Streaming with multiple shards can trigger unexpected IdleTimeout  
(was: Streaming with multiple shards can triggered unexpected IdleTimeout)

> Streaming with multiple shards can trigger unexpected IdleTimeout
> -----------------------------------------------------------------
>
>                 Key: SOLR-17143
>                 URL: https://issues.apache.org/jira/browse/SOLR-17143
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 9.4.1
>            Reporter: Patson Luk
>            Priority: Critical
>
> With the new [test case 
> submitted|https://github.com/cowpaths/fullstory-solr/commit/383134928e372f19d96b1b16459a3566169d3ff4],
> we reproduced a streaming issue from our production cloud 
> environment. 
> The test case uses a collection of 2 shards into which 20k docs are 
> indexed: 10k docs have ids with routing prefix `a`, and the other 10k have 
> prefix `c`. The two prefixes hash to different shards, producing 2 shards 
> of 10k docs each.
> Now, if we stream sorted by id, both shards send back some data initially. 
> However, only one shard (the one hosting prefix `a`) has continued traffic, 
> because the sorted iteration consumes its documents first. The other shard 
> eventually throws {{IdleTimeout}} because its stream was pending without 
> network activity.
> If we change `SHARD_COUNT` in the test case from 2 to 1, the test case 
> runs fine. 
> In our environment, the Jetty HTTP connector timeout is set to 120 secs, 
> yet we still run into this occasionally. The client does consume the data 
> at a reasonable rate; however, with up to 1024 shards per collection, it is 
> quite easy for some shards to have no data streamed within 120 secs, which 
> triggers the timeout.
> We assume this kind of streaming issue is not uncommon in distributed 
> systems, and are wondering what could be done to fix or mitigate it. 
> Several ideas that we have:
> 1. If possible, we might want to stream per shard instead of per collection. 
> However, there are cases where we do want to stream the whole collection 
> in sorted order.
> 2. Is there any low-level "keep-alive" already built in? I couldn't 
> find any so far :)
> 3. Keep the stream alive by pushing a small amount of dummy data from the 
> aggregator (the Solr node that distributes the stream request as /export to 
> other nodes). My attempt got very hacky and is still not working. I didn't 
> dig too deep, as I wish to surface this issue to the Solr community and 
> gather some thoughts first!
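
The idle pattern described above can be illustrated with a small simulation. This is only a sketch and not Solr code: it models the aggregator as a plain k-way merge over sorted per-shard lists, assumes one document is pulled from a shard per read (ignoring batching and socket buffering), and advances a simulated clock by a fixed amount per consumed document. The function name, the 20 ms per-document cost, and the id format are hypothetical.

```python
import heapq


def merge_with_idle_tracking(shards, consume_ms, idle_timeout_ms):
    """Merge sorted shard streams and record each shard's longest read gap.

    shards: dict of shard name -> list of ids, already sorted
    consume_ms: simulated time the client spends per merged document
    idle_timeout_ms: gap after which a shard connection is assumed to time out
    """
    iters = {name: iter(docs) for name, docs in shards.items()}
    last_read = {name: 0 for name in shards}
    max_gap = {name: 0 for name in shards}
    heap = []
    clock = 0

    def pull(name):
        # Reading from a shard counts as activity on its connection.
        max_gap[name] = max(max_gap[name], clock - last_read[name])
        last_read[name] = clock
        doc = next(iters[name], None)
        if doc is not None:
            heapq.heappush(heap, (doc, name))

    # Every shard sends its first document up front.
    for name in shards:
        pull(name)

    merged = []
    while heap:
        doc, name = heapq.heappop(heap)
        merged.append(doc)
        clock += consume_ms
        # Only the shard that supplied the smallest id is read again.
        pull(name)

    timed_out = sorted(n for n, gap in max_gap.items() if gap > idle_timeout_ms)
    return merged, max_gap, timed_out


if __name__ == "__main__":
    a_docs = [f"a!{i:05d}" for i in range(10000)]
    c_docs = [f"c!{i:05d}" for i in range(10000)]
    _, gaps, timed_out = merge_with_idle_tracking(
        {"a": a_docs, "c": c_docs}, consume_ms=20, idle_timeout_ms=120000
    )
    print(gaps, timed_out)
```

In this model the shard holding prefix `c` is read once at the start and not again until every `a` document has been consumed, so its gap grows with the size of the other shard rather than with the client's read rate. With a single shard, every read hits the same connection and no long gap appears.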



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
