[ https://issues.apache.org/jira/browse/SOLR-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patson Luk updated SOLR-17143:
------------------------------
    Description: 
With the new [test case submitted|https://github.com/cowpaths/fullstory-solr/commit/383134928e372f19d96b1b16459a3566169d3ff4], we reproduced an issue with streaming that we see in our production cloud environment.

The test case uses a collection of 2 shards into which 20k docs are indexed. 10k docs have ids with routing prefix `a`, and the other 10k have prefix `c`. Each prefix hashes to a different shard, producing 2 shards of 10k docs each.

Now, if we stream sorted by id, both shards send back some data initially. However, only one shard (the one that hosts prefix `a`) has continued traffic because of the sorted iteration; the other shard eventually throws {{IdleTimeout}} because its stream is pending without network activity.

If we change `SHARD_COUNT` in the test case from 2 to 1, the test runs fine.

In our environment the Jetty HTTP connector timeout is set to 120 secs, yet we still run into this occasionally. The client does consume the data at a reasonable rate, but with up to 1024 shards per collection it is quite easy for some shards to have no data streamed within 120 secs, which triggers the timeout.

We assume this issue with streaming is not uncommon for distributed systems, and are wondering what could be done to fix or mitigate it. Several ideas that we have:

1. If possible, we might want to stream per shard instead of per collection. However, there are cases where we do want to stream the whole collection in sorted order.
2. Is there any low-level "keep-alive" that is already built in? I couldn't find any so far :)
3. Keep the stream alive by pushing a small amount of dummy data from the aggregator (the Solr node that distributes the stream request as /export to other nodes), but it got very hacky and is still not working.

Didn't dig too deep, as I wish to surface this issue to the Solr community and gather some thoughts first!
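To illustrate why the second shard goes quiet, here is a minimal, self-contained sketch of a sorted merge over two shard streams. It is not Solr code and does not use the /export handler; the class name `SortedMergeIdleDemo`, the `ShardCursor` helper, the `a!00000`-style ids, and the notion of a "tick" (one document handed to the client) are all assumptions made for illustration. It only counts how many documents the merger emits between two consecutive reads from the same shard.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedMergeIdleDemo {

    /** One shard's sorted stream plus bookkeeping about when it was last read. */
    static final class ShardCursor {
        final String name;
        final Iterator<String> docs;
        String head;          // next unread id, or null when the shard is exhausted
        long lastReadTick;    // tick at which the merger last pulled from this shard
        long maxIdleTicks;    // longest stretch between two pulls

        ShardCursor(String name, List<String> sortedIds) {
            this.name = name;
            this.docs = sortedIds.iterator();
        }

        /** Pulls the next id and records how long the shard sat unread. */
        void advance(long tick) {
            maxIdleTicks = Math.max(maxIdleTicks, tick - lastReadTick);
            lastReadTick = tick;
            head = docs.hasNext() ? docs.next() : null;
        }
    }

    /** Builds sorted ids such as a!00000, a!00001, ... (format is an assumption). */
    static List<String> idsWithPrefix(String prefix, int count) {
        List<String> ids = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            ids.add(String.format("%s!%05d", prefix, i));
        }
        return ids;
    }

    /** Merges all shards in id order; returns the longest idle stretch per shard. */
    static long[] mergeAndMeasure(List<ShardCursor> shards) {
        PriorityQueue<ShardCursor> queue =
            new PriorityQueue<>((x, y) -> x.head.compareTo(y.head));
        for (ShardCursor s : shards) {
            s.advance(0);                 // initial read: every shard sends some data
            if (s.head != null) {
                queue.add(s);
            }
        }
        long tick = 0;
        while (!queue.isEmpty()) {
            ShardCursor s = queue.poll(); // shard holding the smallest id wins
            tick++;                       // one tick per document handed to the client
            s.advance(tick);              // only the winning shard is read again
            if (s.head != null) {
                queue.add(s);
            }
        }
        long[] result = new long[shards.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = shards.get(i).maxIdleTicks;
        }
        return result;
    }

    public static void main(String[] args) {
        List<ShardCursor> shards = List.of(
            new ShardCursor("shard-a", idsWithPrefix("a", 10_000)),
            new ShardCursor("shard-c", idsWithPrefix("c", 10_000)));
        long[] idle = mergeAndMeasure(shards);
        for (int i = 0; i < idle.length; i++) {
            System.out.println(shards.get(i).name + " max idle ticks: " + idle[i]);
        }
    }
}
```

In this model the shard with prefix `a` is read on every tick, while the shard with prefix `c` is not read again until all 10k `a` docs have been emitted. If the client needs longer than the connector idle timeout to consume that stretch, the second connection would sit without traffic for the whole period, which matches the behaviour described above. With a single shard the gap between reads never exceeds one tick.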
> Streaming with multiple shards can trigger unexpected IdleTimeout
> -----------------------------------------------------------------
>
>                 Key: SOLR-17143
>                 URL: https://issues.apache.org/jira/browse/SOLR-17143
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 9.4.1
>            Reporter: Patson Luk
>            Priority: Critical