[ https://issues.apache.org/jira/browse/SOLR-16348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585153#comment-17585153 ]
David Smiley commented on SOLR-16348:
-------------------------------------

bq. I wonder about the bulk data-loading scenario.... my first reaction is that bulk loading of actual text documents (vs small iot or low text analysis documents) can often be waiting on solr. I know of systems that peg solr for most of a week when they want to re-index.

Ideally you have a reasonable guess on the number of shards needed when creating the collection but alas, that's not necessarily easy. Note that splits are async by nature; they won't block the client. Nonetheless they are "heavy" operations.

bq. Having it split a shard in the background on a separate thread could easily have that thread get starved, and maybe even have multiple split routines backlogged?

What is the thread starvation scenario you speak of? If they can't complete because the split quota is truly reached then it's fair -- by-design. The number of such threads would be no more than the quota. That said, I could imagine a no-thread approach as well involving ZK watches on the lock. I need to think more.

bq. the sending system needs to be able to handle a pause in accepting documents...

It'd be a nice option for another JIRA issue irrespective of how a split is invoked. Today we buffer at the parent shard but a pause option would be nice.

bq. I feel like this should be turned off in the bulk/re-index use case. In the full index bulk case you typically have an idea of how much data will be loaded, and can prepare the index, and when re-indexing you have the prior index as an example, so the shards should just be set correctly to begin with.

It _would_ waste time calculating the index size per batch if the collection has already been sized correctly. I agree it should be easily toggled on/off, like with an update request param. Maybe this should be a general feature of Solr that specific URPs need not write support for, like how RequestHandlers can be disabled. There is an existing "enable" (or enabled?)
attribute in PluginInfo which is generic but few plugin types support this.

bq. Finally, there's the question of what to do with the old shard

That's an existing issue with Splits; SolrCloud ought to clean this up automatically.

BTW remember that this URP is very much opt-in like nearly all the other URPs. Don't use it if it's not useful to your XYZ company :-). Solr 8x had the autoscaling framework that could poll all cores at some interval and split the big ones. Solr 9 doesn't have that but someone could write something similar.

> New SplitShard UpdateRequestProcessor
> -------------------------------------
>
>                 Key: SOLR-16348
>                 URL: https://issues.apache.org/jira/browse/SOLR-16348
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: UpdateRequestProcessors
>            Reporter: David Smiley
>            Priority: Major
>
> The
> [SplitShard|https://solr.apache.org/guide/solr/latest/deployment-guide/shard-management.html#splitshard]
> command is used to split a shard into smaller shards to get better query
> scalability, especially across multiple machines. The most practical way to
> use it is to split shards larger than a configured size. Of course shards
> don't just grow by themselves; they grow when data is added. Here I propose
> a new UpdateRequestProcessor that splits based on the shard size.
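
To make the size check and the split quota discussed in this thread concrete, here is a minimal sketch. It is not Solr code and not the proposed implementation: the class `SplitDecision`, its methods `tryBeginSplit` and `endSplit`, and the `maxShardBytes` and `maxConcurrentSplits` parameters are all hypothetical names chosen for illustration. It only shows the idea that a split is requested when a shard exceeds a configured size and a quota permit is free, so that the number of in-flight splits never exceeds the quota.

```java
import java.util.concurrent.Semaphore;

/**
 * Hypothetical illustration only; none of these names exist in Solr.
 * Decides whether a shard should be split, bounded by a quota.
 */
final class SplitDecision {
    private final long maxShardBytes;   // assumed configured size threshold
    private final Semaphore splitQuota; // assumed cap on concurrent splits

    SplitDecision(long maxShardBytes, int maxConcurrentSplits) {
        this.maxShardBytes = maxShardBytes;
        this.splitQuota = new Semaphore(maxConcurrentSplits);
    }

    /**
     * Returns true if the caller should start a split for a shard of the
     * given size. A permit is taken only when true is returned; the call
     * never blocks, so an indexing request would not wait on the quota.
     */
    boolean tryBeginSplit(long shardBytes) {
        if (shardBytes <= maxShardBytes) {
            return false; // shard is within the configured size
        }
        return splitQuota.tryAcquire(); // false when the quota is used up
    }

    /** Returns the permit once the (asynchronous) split has finished. */
    void endSplit() {
        splitQuota.release();
    }
}
```

In this sketch a caller that gets `true` would submit the actual split asynchronously and call `endSplit()` when it completes; a caller that gets `false` simply continues indexing and the check is repeated on a later batch.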