[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396
 ] 

Rahul Goswami commented on SOLR-17725:
--------------------------------------

Will do, [~dsmiley]. Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down or was killed. A file, autoupgrade_status.csv, is 
maintained; it is written each time a core is picked up for processing and a 
status is set for that core. Each time the process resumes, it picks up the 
core with status "REINDEXING_ACTIVE", if any. For SolrCloud, this file can be 
housed in ZooKeeper. This is an implementation detail I am happy to discuss 
further, but in our (Commvault's) implementation we recognize the following 
statuses:

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION
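
To make the core-level tracking concrete, here is a minimal sketch, not the actual implementation. It assumes one "coreName,STATUS" line per core in autoupgrade_status.csv; that layout, the CoreStatusFile class and its method names are hypothetical. Only the file name and the status values come from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** The statuses listed above. */
enum CoreStatus { DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION }

/** Hypothetical tracker; the "coreName,STATUS" line layout is an assumption. */
class CoreStatusFile {
    private final Path file;

    CoreStatusFile(Path file) { this.file = file; }

    /** Reads every "coreName,STATUS" line into an insertion-ordered map. */
    Map<String, CoreStatus> load() {
        Map<String, CoreStatus> statuses = new LinkedHashMap<>();
        if (!Files.exists(file)) return statuses;
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (line.isBlank()) continue;
                String[] parts = line.split(",", 2);
                statuses.put(parts[0].trim(), CoreStatus.valueOf(parts[1].trim()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return statuses;
    }

    /** Sets one core's status and rewrites the whole file. */
    void setStatus(String core, CoreStatus status) {
        Map<String, CoreStatus> statuses = load();
        statuses.put(core, status);
        List<String> lines = new ArrayList<>();
        statuses.forEach((name, s) -> lines.add(name + "," + s));
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** On resume: the core that was in flight when the process stopped, if any. */
    Optional<String> coreToResume() {
        return load().entrySet().stream()
                .filter(e -> e.getValue() == CoreStatus.REINDEXING_ACTIVE)
                .map(Map.Entry::getKey)
                .findFirst();
    }
}
```

In this sketch a core would be set to REINDEXING_ACTIVE when it is picked up and to PROCESSED when it finishes, so coreToResume() returns a value only after an interrupted run. The sketch rewrites the file in place; a sturdier version would probably write to a temporary file and rename it.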

 

2) At the segment level -> This is where we piggyback on Lucene's design, and 
it's beautiful! As we iterate over each segment, we read the live docs out of 
the segment, create a SolrInputDocument out of each one, and reindex it using 
Solr's API. This helps achieve two things: 

i) A reindexed doc marks the existing (old) doc as deleted (when 
auto-commit kicks in). This way, if the service goes down, we don't need to 
reprocess the docs that were already processed and committed. And if the 
service goes down before a commit could be processed, the small penalty is 
reprocessing the docs of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it, 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we never touch Lucene's index structure directly; we 
interact with it only by means of APIs.
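
The resume behaviour in (i) and (ii) can be illustrated with a toy in-memory model. This is a sketch under loud assumptions: ToySegment and ToyIndex are made-up stand-ins, not Lucene or Solr classes, and "commit" here simply means "apply the pending deletes and drop any segment with no live docs". It only shows why a crash costs at most the uncommitted docs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy stand-in for an index segment: doc ids plus a live-docs bitmap. */
class ToySegment {
    final boolean oldFormat;
    final List<String> docIds = new ArrayList<>();
    final BitSet live = new BitSet();

    ToySegment(boolean oldFormat) { this.oldFormat = oldFormat; }

    void add(String id) { live.set(docIds.size()); docIds.add(id); }
}

/** Toy stand-in for an index; not a model of Lucene's real internals. */
class ToyIndex {
    final List<ToySegment> segments = new ArrayList<>();
    private final List<String> pending = new ArrayList<>(); // reindexed, not yet committed

    /** Queues a re-add of the document; nothing changes until commit(). */
    void reindex(String id) { pending.add(id); }

    /** Applies pending updates: old copies become deleted, empty segments are dropped. */
    void commit() {
        if (pending.isEmpty()) return;
        ToySegment fresh = new ToySegment(false);
        for (String id : pending) {
            for (ToySegment seg : segments) {
                int pos = seg.docIds.indexOf(id);
                if (pos >= 0) seg.live.clear(pos);
            }
            fresh.add(id);
        }
        pending.clear();
        segments.add(fresh);
        segments.removeIf(seg -> seg.live.isEmpty());
    }

    /** Simulates the service going down: uncommitted work is lost. */
    void crash() { pending.clear(); }

    /**
     * Reindexes live docs of old-format segments, committing every commitEvery docs.
     * Returns early, without a final commit, once maxDocs docs have been handled,
     * so a caller can simulate an interruption. Returns the number of docs reindexed.
     */
    int upgradePass(int commitEvery, int maxDocs) {
        int done = 0;
        for (ToySegment seg : new ArrayList<>(segments)) {
            if (!seg.oldFormat) continue;
            for (int i = seg.live.nextSetBit(0); i >= 0; i = seg.live.nextSetBit(i + 1)) {
                if (done == maxDocs) return done;
                reindex(seg.docIds.get(i));
                if (++done % commitEvery == 0) commit();
            }
        }
        commit();
        return done;
    }

    int liveDocCount() {
        return segments.stream().mapToInt(seg -> seg.live.cardinality()).sum();
    }

    long oldSegmentCount() {
        return segments.stream().filter(seg -> seg.oldFormat).count();
    }
}
```

In this model, interrupting a pass and calling crash() leaves the committed docs deleted in their old segment, so the next upgradePass() only visits docs that are still live there, and old-format segments disappear once their last live doc has been reindexed and committed.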

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.

 

 

> Automatically upgrade Solr indexes without needing to reindex from source
> -------------------------------------------------------------------------
>
>                 Key: SOLR-17725
>                 URL: https://issues.apache.org/jira/browse/SOLR-17725
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Rahul Goswami
>            Priority: Major
>         Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> headaches, especially in environments with hundreds/thousands of very large 
> indexes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
