[ https://issues.apache.org/jira/browse/SOLR-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patson Luk updated SOLR-17143:
------------------------------
    Summary: Streaming with multiple shards can trigger unexpected IdleTimeout  (was: Streaming with multiple shards can triggered unexpected IdleTimeout)

> Streaming with multiple shards can trigger unexpected IdleTimeout
> -----------------------------------------------------------------
>
>                 Key: SOLR-17143
>                 URL: https://issues.apache.org/jira/browse/SOLR-17143
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 9.4.1
>            Reporter: Patson Luk
>            Priority: Critical
>
> With the new [test case
> submitted|https://github.com/cowpaths/fullstory-solr/commit/383134928e372f19d96b1b16459a3566169d3ff4],
> we reproduced an issue with streaming that occurs in our production cloud
> environment.
>
> The test case uses a collection of 2 shards into which 20k docs are
> indexed. 10k docs have ids with the routing prefix `a`, and the other 10k
> have the prefix `c`. The two prefixes hash to different shards, producing
> 2 shards of 10k docs each.
>
> Now, if we stream sorted by id, both shards send back some data initially.
> However, only one shard (the one that hosts prefix `a`) has continued
> traffic because of the sorted iteration; the other shard eventually throws
> {{IdleTimeout}} because its stream is pending without network activity.
>
> If we change `SHARD_COUNT` in the test case from 2 to 1, the test runs
> fine.
>
> In our environment, the Jetty HTTP connector timeout is 120 secs, yet we
> still run into this occasionally. The client does consume the data at a
> reasonable rate, but with up to 1024 shards per collection, it is quite
> easy for some shards to have no data streamed within 120 secs, which
> triggers the timeout.
>
> We assume this kind of issue with streaming is not uncommon in distributed
> systems, and we are wondering what could be done to fix or mitigate it.
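The read pattern described above can be illustrated with a small simulation. This is only a sketch and not Solr code: it models the aggregator as a plain sorted merge (Python's `heapq.merge`) and ignores the batching and buffering that a real /export stream does. The function name `simulate_idle_gaps`, the `!` separator in the ids, and the 20 ms per tuple consumption rate are assumptions made for illustration.

```python
import heapq


def simulate_idle_gaps(shards, ms_per_tuple):
    """Merge sorted shard streams and return, per shard, the longest
    simulated time (in ms) between two consecutive reads from it."""
    clock = [0]  # simulated time in ms, advanced by the consumer
    last_read = {}
    longest_gap = {name: 0 for name in shards}

    def tracked(name, ids):
        for doc_id in ids:
            # This runs each time the merge pulls a tuple from the shard.
            if name in last_read:
                gap = clock[0] - last_read[name]
                longest_gap[name] = max(longest_gap[name], gap)
            last_read[name] = clock[0]
            yield doc_id

    merged = heapq.merge(*(tracked(n, ids) for n, ids in shards.items()))
    for _ in merged:
        clock[0] += ms_per_tuple  # assumed client cost per tuple
    return longest_gap


# Two shards of 10k docs each, ids prefixed `a` and `c` (assumed id format).
shards = {
    "shard_a": [f"a!{i:05d}" for i in range(10_000)],
    "shard_c": [f"c!{i:05d}" for i in range(10_000)],
}
gaps = simulate_idle_gaps(shards, ms_per_tuple=20)
print(gaps)
```

Under these assumptions, the shard holding prefix `c` is read once at the start and then not again until every `a` doc has been consumed, so its longest gap comes out at roughly 200 simulated seconds, well past a 120 sec idle timeout, while the `a` shard never waits more than 20 ms.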
> Several ideas that we have:
> 1. If possible, we might want to stream per shard instead of per
> collection. However, there are cases where we do want to stream the whole
> collection in sorted order.
> 2. Is there any low-level "keep-alive" already built in? I couldn't find
> any so far :)
> 3. Keep the stream alive by pushing a small amount of dummy data from the
> aggregator (the Solr node that distributes the stream request as /export
> to other nodes), but that got very hacky and is still not working. I
> didn't dig too deep, as I wish to surface this issue to the Solr community
> and gather some thoughts first!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
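Idea 1 above can be sketched roughly as follows. This is an illustration under assumptions, not a SolrJ API: `open_shard_stream` is a hypothetical callable that returns the id-sorted tuples of a single shard, and `stream_per_shard_then_merge` is a made-up name. Each shard is drained on its own, so its connection is read continuously, and the sorted order is restored locally afterwards.

```python
import heapq


def stream_per_shard_then_merge(open_shard_stream, shard_names):
    """Drain each shard's sorted stream fully, one shard at a time,
    then merge the buffered results into one sorted iterator."""
    buffered = []
    for name in shard_names:
        # open_shard_stream is hypothetical: it stands in for whatever
        # issues the per-shard request and yields its sorted tuples.
        buffered.append(list(open_shard_stream(name)))
    # All inputs are already sorted, so a k-way merge is enough.
    return heapq.merge(*buffered)
```

The trade-off is that the whole result set is held in memory before the first merged tuple is returned. With up to 1024 shards and large exports that may not be acceptable, and the buffers would have to be spilled to disk or the approach limited to cases where global ordering is not required.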