[ 
https://issues.apache.org/jira/browse/SOLR-16348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585153#comment-17585153
 ] 

David Smiley commented on SOLR-16348:
-------------------------------------

bq. I wonder about the bulk data-loading scenario.... my first reaction is that 
bulk loading of actual text documents (vs small iot or low text analysis 
documents) can often be waiting on solr. I know of systems that peg solr for 
most of a week when they want to re-index.

Ideally you have a reasonable guess on the number of shards needed when 
creating the collection but alas, that's not necessarily easy.  Note that 
splits are async by nature; they won't block the client.  Nonetheless they are 
"heavy" operations.

bq. Having it split a shard in the background on a separate thread could easily 
have that thread get starved, and maybe even have multiple split routines 
backlogged?

What thread starvation scenario do you have in mind?  If the splits can't 
complete because the split quota has truly been reached, then that's fair; it's 
by design.  The number of such threads would be no more than the quota.  That 
said, I could imagine a no-thread approach as well, involving ZK watches on the 
lock.  I need to think more.
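
To make the quota point concrete, here is a rough sketch of how background 
split threads could be capped at the quota.  It is not the proposed patch: the 
class name, the method names, and the use of an in-JVM Semaphore (rather than a 
ZK-based lock) are all assumptions made for illustration.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: caps concurrent background splits at a fixed quota. */
final class SplitQuotaSketch {
  // Stand-in for the real split lock/quota; an in-JVM semaphore is an assumption.
  private final Semaphore permits;

  SplitQuotaSketch(int quota) {
    this.permits = new Semaphore(quota);
  }

  /**
   * Starts the split on a background thread if a permit is free.
   * Returns the started thread, or null when the quota is already reached,
   * in which case no thread is created at all.
   */
  Thread trySplit(Runnable split) {
    if (!permits.tryAcquire()) {
      return null; // quota reached: skipping here is by design
    }
    Thread t = new Thread(() -> {
      try {
        split.run();
      } finally {
        permits.release(); // free the slot even if the split fails
      }
    }, "split-sketch");
    t.setDaemon(true);
    t.start();
    return t;
  }

  /** Number of split slots currently free. */
  int available() {
    return permits.availablePermits();
  }
}
```

Because a thread is only created after a permit has been acquired, the number of 
split threads can never exceed the quota, and nothing queues up behind it.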

bq. the sending system needs to be able to handle a pause in accepting 
documents...

It'd be a nice option for another JIRA issue irrespective of how a split is 
invoked.  Today we buffer at the parent shard but a pause option would be nice.

bq. I feel like this should be turned off in the bulk/re-index use case. In the 
full index bulk case you typically have an idea of how much data will be 
loaded, and can prepare the index, and when re-indexing you have the prior 
index as an example, so the shards should just be set correctly to begin with.

It _would_ waste time calculating the index size per batch if the collection 
has already been sized correctly.  I agree it should be easily toggled on/off, 
for example with an update request param.  Maybe this should be a general 
feature of Solr that specific URPs need not write support for, like how 
RequestHandlers can be disabled.  There is an existing "enable" (or enabled?) 
attribute in PluginInfo which is generic, but few plugin types support it.
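
As a rough sketch of the toggle idea (not the actual URP): the size lookup is 
passed in lazily, so a disabled request never pays for it.  The param name 
"splitShard.enabled", the default of "true", and the class and method names 
are hypothetical; none of them exist in Solr today.

```java
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a per-request on/off switch for size-based splitting. */
final class SplitDecisionSketch {
  // Assumed param name, for illustration only.
  static final String ENABLE_PARAM = "splitShard.enabled";

  /**
   * Returns true when the request has not disabled the check and the shard
   * is larger than the configured maximum. The size supplier is only invoked
   * when the check is enabled, so bulk loads can skip the cost entirely.
   */
  static boolean shouldSplit(
      Map<String, String> requestParams, LongSupplier shardSizeBytes, long maxShardSizeBytes) {
    boolean enabled =
        Boolean.parseBoolean(requestParams.getOrDefault(ENABLE_PARAM, "true"));
    if (!enabled) {
      return false; // toggled off: do not compute the index size
    }
    return shardSizeBytes.getAsLong() > maxShardSizeBytes;
  }
}
```

A bulk re-index client would then send the param set to false on its update 
requests and leave everything else unchanged.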

bq. Finally, there's the question of what to do with the old shard

That's an existing issue with splits; SolrCloud ought to clean this up 
automatically.

BTW remember that this URP is very much opt-in, like nearly all the other URPs.  
Don't use it if it's not useful to your XYZ company :-). Solr 8.x had the 
autoscaling framework that could poll all cores at some interval and split the 
big ones.  Solr 9 doesn't have that but someone could write something similar.

> New SplitShard UpdateRequestProcessor
> -------------------------------------
>
>                 Key: SOLR-16348
>                 URL: https://issues.apache.org/jira/browse/SOLR-16348
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors
>            Reporter: David Smiley
>            Priority: Major
>
> The 
> [SplitShard|https://solr.apache.org/guide/solr/latest/deployment-guide/shard-management.html#splitshard]
>  command is used to split a shard into smaller shards to get better query 
> scalability, especially across multiple machines.  The most practical way to 
> use it is to split shards larger than a configured size.  Of course shards 
> don't just grow by themselves; they grow when data is added.  Here I propose 
> a new UpdateRequestProcessor that splits based on the shard size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
