[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704
 ] 

Rahul Goswami commented on SOLR-17725:
--------------------------------------

[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen through Solr APIs, so Solr will need to be running. In 
> our custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that it starts on server start with minimal operational overhead. 
> As a generic solution I am thinking we can expose it as a core admin action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. That way users can hook the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here. 
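As a concrete sketch, the proposed action could be invoked as below. Note that the UPGRADEINDEXES action name and its parameters are this proposal's assumptions, not an existing Solr API; the status call assumes the async flow would plug into the existing CoreAdmin REQUESTSTATUS convention:

```shell
# Hypothetical invocation of the proposed core admin action; the action name
# and parameters below do not exist in Solr today -- they are the proposal's.
curl "http://localhost:8983/solr/admin/cores?action=UPGRADEINDEXES&core=mycore&async=upgrade-1001"

# Track progress, assuming the async option reuses the existing
# CoreAdmin REQUESTSTATUS convention for asynchronous operations.
curl "http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=upgrade-1001"
```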
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here, as is removing 
> the operational overhead of an index upgrade during a Solr upgrade where possible. 
 
3) Requiring a Lucene API change is a potential blocker; I'd not be surprised 
if the Lucene project rejects making the "created-version" property writable, 
so such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected, though, since they can 
> implement guardrails around changing the "created-version" property for added 
> safety. In my implementation I added a check in CommitInfos that inspects 
> all the segments in a commit and ensures every one of them is fully on the 
> new version before setting the created-version property. This already happens 
> in a synchronized block, so in my (limited) opinion it should be safe. The 
> API they give us can do all required internal validations and fail gracefully 
> without any harm to the index. I can get a discussion started with the Lucene 
> folks once we agree on the basics of this implementation. Or do you suggest I 
> do that right away?
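To make the guardrail idea concrete, here is a toy, self-contained model of the check described above. All class and method names are illustrative, not actual Lucene API; the real change would live inside Lucene's commit machinery rather than a standalone class:

```java
import java.util.List;

// Toy model of the proposed guardrail: "created-version" may only be advanced
// once every live segment in the commit is already on the new major version.
// Names here are illustrative only, not actual Lucene API.
public class CreatedVersionGuardrail {
    private int createdVersionMajor;

    public CreatedVersionGuardrail(int createdVersionMajor) {
        this.createdVersionMajor = createdVersionMajor;
    }

    /** Advances created-version and returns true only if every segment is on newMajor. */
    public synchronized boolean tryAdvance(int newMajor, List<Integer> segmentMajors) {
        if (newMajor <= createdVersionMajor) {
            return false; // nothing to do, or an attempted downgrade
        }
        for (int major : segmentMajors) {
            if (major < newMajor) {
                return false; // an old segment remains: refuse, index left untouched
            }
        }
        createdVersionMajor = newMajor;
        return true;
    }

    public synchronized int getCreatedVersionMajor() {
        return createdVersionMajor;
    }
}
```

The refusal path fails gracefully: the property is only ever written after the all-segments check passes inside the synchronized block, mirroring the validation-then-write order described above.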
 
4) Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
> SolrCloud challenges are not factored into the current implementation. But 
> given that the process works at the core level and is agnostic of the mode, 
> I am optimistic we can adapt the solution for SolrCloud through PR 
> discussions. We might have to block certain operations like splitshard while 
> this process is underway on a collection. 
 
5) A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
> Agreed
 
6) Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address
> I would refer back to my answer to your second question in defense of the 
> "in-place" implementation. Segment-level processing lets us limit pollution 
> of the index by merges as we reindex, and also gives us restartability. 
> Agreed, this is not a substitute for when a field's data type changes. It is 
> intended to be a substitute for an index upgrade when you upgrade Solr, so as 
> to overcome the X --> X+1 --> X+2 version upgrade path limitation which 
> exists today even when there are no schema changes. Of course, users are free 
> to add new fields and should still be able to use this utility.

> Automatically upgrade Solr indexes without needing to reindex from source
> -------------------------------------------------------------------------
>
>                 Key: SOLR-17725
>                 URL: https://issues.apache.org/jira/browse/SOLR-17725
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Rahul Goswami
>            Priority: Major
>         Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This stems from Lucene's constraint, which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> the case of deployments which live in customer environments and are not 
> completely under the vendor's control, the proposition of having to 
> completely reindex the data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way to achieve this 
> reindexing in-place on the same index. The process also automatically keeps 
> "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can of course be stored=false; it is only the 
> source fields (or more precisely, the source fields you care about 
> preserving) that need to be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> headaches, especially in environments with hundreds/thousands of very large 
> indexes.
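Limitation (i) above can be illustrated with a small schema.xml fragment; the field names and types here are examples only, not from the actual proposal:

```xml
<!-- Every *source* field must be recoverable: stored="true" or docValues="true". -->
<field name="id"    type="string" indexed="true" stored="true"/>
<field name="price" type="plong"  indexed="true" stored="false" docValues="true"/>

<!-- A copyField destination may be stored="false" with no docValues;
     it can be rebuilt from its source field during the in-place reindex. -->
<field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="id" dest="text_all"/>
```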



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
