[ 
https://issues.apache.org/jira/browse/SOLR-16348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582343#comment-17582343
 ] 

Gus Heck commented on SOLR-16348:
---------------------------------

This sounds like a really cool feature. One of the key pain points for several 
customers I've had is the difficulty in predicting the size of a tenant's 
index, and managing clients that grow beyond expectations.

I think this feature will shine best in a system receiving a steady flow of 
data.

I wonder about the bulk data-loading scenario. My first reaction is that 
bulk loading of actual text documents (vs. small IoT or low-text-analysis 
documents) is often bottlenecked on Solr. I know of systems that peg Solr for 
most of a week when they want to re-index.

Having it split a shard in the background on a separate thread could easily 
leave that thread starved, and maybe even leave multiple split routines 
backlogged. Alternatively, if it's synchronized with the accepting of 
documents, we have to defend ourselves from OOM by not accepting documents, 
and the sending system needs to be able to handle a pause in accepting 
documents.

I feel like this should be turned off in the bulk/re-index use case. In the 
full index bulk case you typically have an idea of how much data will be 
loaded, and can prepare the index, and when re-indexing you have the prior 
index as an example, so the shards should just be set correctly to begin with.

"Daily bulk" cases that peg solr for 1-2h at night will be a harder case. Maybe 
that case should suspend/resume splitting based on a (configurable) period of 
silence, or load reduction (not that I think that's easy).
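
To make the "period of silence" idea a bit more concrete, here is a rough 
sketch of a gate that only permits a split once no update has been seen for a 
configurable quiet period. Everything in it (the class name QuietPeriodGate 
and its methods) is hypothetical and not part of Solr; the clock is passed in 
explicitly just to keep the sketch easy to test.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not a Solr class): permits a split only after a quiet period. */
final class QuietPeriodGate {
    private final long quietNanos;
    private final AtomicLong lastUpdateNanos;

    QuietPeriodGate(Duration quietPeriod, long nowNanos) {
        this.quietNanos = quietPeriod.toNanos();
        this.lastUpdateNanos = new AtomicLong(nowNanos);
    }

    /** Call whenever an update is accepted; restarts the quiet period. */
    void recordUpdate(long nowNanos) {
        lastUpdateNanos.set(nowNanos);
    }

    /** True once at least the configured quiet period has passed since the last update. */
    boolean splitAllowed(long nowNanos) {
        return nowNanos - lastUpdateNanos.get() >= quietNanos;
    }
}
```

In real use the caller would presumably pass System.nanoTime(). This only 
covers the "silence" variant; reacting to a load reduction would need some 
load signal that this sketch does not attempt.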

Another alternative to disabling is setting up a /reindex or /bulk handler that 
lacks your URP. I'd rather not have it be a system property because one may 
want to turn it on/off simultaneously across the cluster and maybe only for one 
collection at a time. A collection property in ZK sounds better.
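
As a minimal sketch of the collection-property idea: assuming the processor 
is handed the collection's properties as a plain map (how that map is read 
from ZK is left out), the check itself could be as small as the following. 
The property name "splitOnUpdate.enabled" and the class SplitToggle are made 
up for illustration, and defaulting to enabled is also just an assumption.

```java
import java.util.Map;

/** Hypothetical helper (not a Solr class): per-collection on/off switch for splitting. */
final class SplitToggle {
    // Made-up property name for illustration; not an existing Solr setting.
    static final String PROP = "splitOnUpdate.enabled";

    /** Splitting stays enabled unless the property is present and not "true". */
    static boolean splitEnabled(Map<String, String> collectionProps) {
        return Boolean.parseBoolean(collectionProps.getOrDefault(PROP, "true"));
    }
}
```

Because the flag lives with the collection rather than with a node or a 
request, every path that sends updates sees the same setting.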

Another even harder case to think about is periods of unplanned load, possibly 
ramping up gradually and then tapering off (e.g. something on social media 
going viral). That almost requires the decision to split or not to be 
load-based, which is a sticky problem.

One could have a request parameter, but that then relies on nobody else sending 
an update you don't know about, so I don't like that option. Many mature 
organizations have multiple paths for data to enter the index and coordination 
like that is infeasible.

The non-cloud case is harder because there is no ZK to coordinate things, so 
maybe the ability to turn this on/off dynamically is a cloud-only feature? 
Though if it were reacting to load, that might be shard level and not require 
coordination. 

This also has an interaction with systems routing by tenant id or other 
business id that rely on co-location for graph/join operations. 

Finally, there's the question of what to do with the old shard. The present 
split command is documented to leave the old shard in place, so a split leaves 
you with three shards, two of which are in use. This is also another case in 
which users need to be careful to have enough free disk. This operation 
filling up the disk could then cause issues writing new docs.
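
One way to guard against that, sketched very roughly: refuse to start a split 
unless the usable disk space is at least some multiple of the current index 
size. The class SplitDiskCheck and the headroom factor are assumptions for 
illustration; what factor is actually safe depends on how the split is done 
and on how long the old shard is kept around.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-split guard (not a Solr class). */
final class SplitDiskCheck {

    /** Pure check: is usable space at least factor times the index size? */
    static boolean hasHeadroom(long indexBytes, long usableBytes, double factor) {
        return usableBytes >= (long) Math.ceil(indexBytes * factor);
    }

    /** Same check, reading usable space from the file store that holds indexDir. */
    static boolean hasHeadroom(Path indexDir, long indexBytes, double factor)
            throws IOException {
        long usable = Files.getFileStore(indexDir).getUsableSpace();
        return hasHeadroom(indexBytes, usable, factor);
    }
}
```

For example, hasHeadroom(100, 199, 2.0) is false while hasHeadroom(100, 200, 
2.0) is true, so with a factor of 2 a shard would need free space of at least 
twice its index size before a split is attempted.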

Kind of a thought dump there, but hope it helps.

> New SplitShard UpdateRequestProcessor
> -------------------------------------
>
>                 Key: SOLR-16348
>                 URL: https://issues.apache.org/jira/browse/SOLR-16348
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors
>            Reporter: David Smiley
>            Priority: Major
>
> The 
> [SplitShard|https://solr.apache.org/guide/solr/latest/deployment-guide/shard-management.html#splitshard]
>  command is used to split a shard into smaller shards to get better query 
> scalability, especially across multiple machines.  The most practical way to 
> use it is to split shards larger than a configured size.  Of course shards 
> don't just grow by themselves; they grow when data is added.  Here I propose 
> a new UpdateRequestProcessor that splits based on the shard size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

