[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ]
Rahul Goswami edited comment on SOLR-17725 at 4/7/25 8:11 PM:
--------------------------------------------------------------

[~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below:

1) Do you intend for this to be a new Solr API, and if so, what is the proposed API? Or a CLI utility tool to run on a cold index folder?

> The implementation needs to run on a hot index for it to be lossless. Indexing calls happen using Solr APIs, so Solr will need to be running. In our custom implementation I have hooked the process into SolrDispatchFilter load() so that the process can start upon server start for the least operational overhead. As a generic solution, I am thinking we can expose it as an action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for trackability. This way users can hook the command into their shell/cmd scripts after Solr starts. Open to suggestions here.

2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges?

> Reducing infrastructure costs is a major design goal here. Also removing the operational overhead of index upgrade during a Solr upgrade when possible.

3) Requiring a Lucene API change is a potential blocker. I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them should come early.

> I agree. I am hopeful(!!) this will not be rejected though, since they can implement guardrails around changing the "created-version" property for added security. In my implementation I added the change in CommitInfos to check all the segments in a commit and ensure they are the new version in every aspect before setting the created-version property. This already happens in a synchronized block upon commit, so in my (limited) opinion, it should be safe.
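The commit-time guardrail described above (only rewrite the created-version once every segment in the commit is already on the new version) could be sketched roughly as follows. Note this is an illustrative stand-in with simplified types, not the actual Lucene SegmentInfos/CommitInfos API:

```java
import java.util.List;

public class CreatedVersionGuard {
    /**
     * Simplified sketch of the proposed check: allow the index
     * "created-version" property to be bumped only when every segment in
     * the commit already carries the target (current) major version. In
     * real Lucene code this would walk the segments inside the existing
     * synchronized commit path and fail gracefully, leaving the property
     * untouched, if any older segment remains.
     */
    public static boolean canBumpCreatedVersion(List<Integer> segmentMajorVersions,
                                                int targetMajor) {
        return !segmentMajorVersions.isEmpty()
                && segmentMajorVersions.stream().allMatch(v -> v == targetMajor);
    }
}
```

The point of the guard is that a half-reindexed commit (some segments still on the old version) can never flip the property, so the API could validate and reject without any harm to the index.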
> The API they give us can do all the required internal validations and fail gracefully without any harm to the index. I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away?

4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?

> SolrCloud challenges are not factored into the current implementation. But given that the process works at the Core level and is agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations like splitshard while this process is underway on a collection.

5) A reindex-collection API is probably wanted; however, it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it.

> Agreed.

6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address.

> I would refer to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to restrict pollution of the index due to merges as we reindex, and also restartability. Agreed, this is not a substitute for when a field data type changes. This is intended to be a substitute for index upgrade when you upgrade Solr, so as to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today despite no schema changes. Of course, users are free to add new fields and should still be able to use this utility.


> Automatically upgrade Solr indexes without needing to reindex from source
> -------------------------------------------------------------------------
>
>                 Key: SOLR-17725
>                 URL: https://issues.apache.org/jira/browse/SOLR-17725
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Rahul Goswami
>            Priority: Major
>         Attachments: High Level Design.png
>
>
> Today, upgrading from Solr version X to X+2 requires complete reingestion of data from the source. This comes from Lucene's constraint, which only guarantees index compatibility between the version the index was created in and the immediate next version.
> This reindexing usually comes with added downtime and/or cost. Especially in the case of deployments which are in customer environments and not completely in the control of the vendor, the proposition of having to completely reindex the data can become a hard sell.
>
> I, on behalf of my employer, Commvault, have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.
>
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course; just the source fields (or more precisely, the source fields you care about preserving) should be either stored or docValues true.
> ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine.
>
> For indexes where these limitations are not a problem (they weren't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
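For a concrete sense of the script-friendly invocation proposed in the comment, a request to the suggested core-admin action could be built like this. Both the UPGRADEINDEXES action and its "async" parameter are hypothetical at this point — nothing below exists in Solr today, and the host, core name, and request id are placeholders:

```java
import java.net.URI;

public class UpgradeIndexesRequest {
    /**
     * Builds the hypothetical core-admin call from the proposal:
     * /solr/admin/cores?action=UPGRADEINDEXES with an async id for
     * trackability. A startup script would issue a GET to this URI once
     * Solr is up, then poll for completion using the async id.
     */
    public static URI build(String solrBase, String coreName, String asyncId) {
        return URI.create(solrBase + "/admin/cores"
                + "?action=UPGRADEINDEXES"
                + "&core=" + coreName
                + "&async=" + asyncId);
    }
}
```

If the action followed the existing async core-admin conventions, the same script could presumably track progress via a status request using the async id, which is what makes the "hook it into shell/cmd scripts after Solr starts" workflow practical.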