[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ]
Rahul Goswami commented on SOLR-17725:
--------------------------------------

[~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below:

1) Do you intend for this to be a new Solr API, and if so, what is the proposed API? Or a CLI utility tool to run on a cold index folder?

> The implementation needs to run on a hot index for it to be lossless. Indexing calls happen through Solr APIs, so Solr will need to be running. In our custom implementation I have hooked the process into SolrDispatchFilter load() so that the process can start upon server start with the least operational overhead. As a generic solution, I am thinking we can expose it as an action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for trackability. This way users can hook the command into their shell/cmd scripts after Solr starts. Open to suggestions here.

2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work at the segment level and do merges?

> Reducing infrastructure costs is a major design goal here, as is removing the operational overhead of an index upgrade during a Solr upgrade when possible.

3) Requiring a Lucene API change is a potential blocker. I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early.

> I agree. I am hopeful(!!) this will not be rejected though, since they can implement guardrails around changing the "created-version" property for added safety. In my implementation I added the change in CommitInfos to check all the segments in a commit and ensure they are the new version in every aspect before setting the created-version property. This already happens in a synchronized block, so in my (limited) opinion it should be safe. The API they give us can do all required internal validations and fail gracefully without any harm to the index.
> I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away?

4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?

> SolrCloud challenges are not factored into the current implementation. But given that the process works at the core level and is agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations, like shard split, while this process is underway on a collection.

5) A reindex-collection API is probably wanted; however, it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it.

> Agreed.

6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address.

> I would refer back to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to limit pollution of the index by merges as we reindex, and also restartability. Agreed, this is not a substitute for when a field's data type changes. It is intended as a substitute for an index upgrade when you upgrade Solr, to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today even when there are no schema changes. Of course, users are free to add new fields and should still be able to use this utility.
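To make the guardrail idea in answer 3 concrete, here is a minimal, purely illustrative Python sketch. This is not Lucene code; the real check would live inside Lucene's commit path (in the synchronized block mentioned above), and the class and field names below are invented for illustration. The point it demonstrates: the "created-version" property may only be advanced once every live segment has been rewritten with the target major version, so a half-reindexed commit can never be mislabeled.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    """Hypothetical stand-in for a per-segment record; not a Lucene class."""
    name: str
    lucene_version: tuple  # (major, minor, bugfix) the segment was written with


def can_advance_created_version(segments, target_major):
    """Guardrail sketch: allow flipping the index-level created-version
    property only if *every* segment in the commit is already at (or past)
    the target major version. Otherwise the API should fail gracefully,
    leaving the index untouched."""
    return all(seg.lucene_version[0] >= target_major for seg in segments)


# Example: one segment still written with the old major version blocks the flip.
segs = [SegmentInfo("_0", (10, 0, 0)), SegmentInfo("_1", (9, 11, 1))]
print(can_advance_created_version(segs, 10))  # False until _1 is reindexed

segs[1] = SegmentInfo("_1", (10, 0, 0))       # after _1 is rewritten in-place
print(can_advance_created_version(segs, 10))  # True
```

Under this scheme the worst case of a validation failure is a no-op, which matches the "fail gracefully without any harm to the index" requirement.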
> Automatically upgrade Solr indexes without needing to reindex from source
> -------------------------------------------------------------------------
>
>                 Key: SOLR-17725
>                 URL: https://issues.apache.org/jira/browse/SOLR-17725
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Rahul Goswami
>            Priority: Major
>         Attachments: High Level Design.png
>
>
> Today, upgrading from Solr version X to X+2 requires complete reingestion of data from the source. This comes from Lucene's constraint of only guaranteeing index compatibility between the version the index was created in and the immediately next version.
> This reindexing usually comes with added downtime and/or cost. Especially for deployments that sit in customer environments and are not completely in the vendor's control, having to completely reindex the data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way to achieve this reindexing in-place on the same index. The process also automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can of course be stored=false; just the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true.
> ii) The data type of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine.
> For indexes where these limitations are not a problem (they weren't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds or thousands of very large indexes.
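Limitation (i) above can be pre-checked before attempting an in-place reindex. The sketch below is a hypothetical illustration in Python (the field dicts and the function are invented, not Solr's schema API): a core qualifies only if every field whose value must be preserved is recoverable from the index itself, i.e. stored=true or docValues=true, with copyField-destination-only fields exempt.

```python
def inplace_reindex_eligibility(fields):
    """Sketch of limitation (i): every source field must be recoverable from
    the index (stored=true or docValues=true). Fields that exist only as
    copyField destinations are exempt, since they can be regenerated from
    their sources. Hypothetical field dicts, not Solr's actual schema API."""
    blockers = [
        f["name"]
        for f in fields
        if not f.get("copyFieldDestOnly", False)
        and not (f.get("stored", False) or f.get("docValues", False))
    ]
    return len(blockers) == 0, blockers


# Example: "title" blocks in-place reindexing because its original value
# cannot be recovered from the index; the copyField destination is exempt.
schema_fields = [
    {"name": "id", "stored": True},
    {"name": "price", "docValues": True},
    {"name": "title", "stored": False},
    {"name": "text_all", "stored": False, "copyFieldDestOnly": True},
]
ok, blockers = inplace_reindex_eligibility(schema_fields)
print(ok, blockers)  # False ['title']
```

A check like this would let the proposed API reject an ineligible core up front instead of producing a lossy reindex.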
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org