[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451382#comment-16451382
 ] 

Robert Muir commented on LUCENE-8264:
-------------------------------------

It's not possible to warn-only: the encoding of things changed completely. I 
think the key issue here is that Lucene is an *index*, not a *database*. 
Because it is a lossy *index* and does not retain all of the user's data, it's 
not possible to safely migrate some things automagically. In the norms case, 
IndexWriter would need to re-analyze the text ("re-index") and compute stats to 
get back the value so it can be re-encoded. The function is {{y = f(x)}}, and 
if {{x}} is not available it's not possible, so Lucene can't do it.
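
To make the {{y = f(x)}} point concrete, here is a minimal sketch. Both 
encodings below are hypothetical and chosen only for illustration; they are 
not Lucene's actual norm encodings, and the class and method names are made 
up. The point is just that two documents that are indistinguishable under the 
old lossy value may need different values under the new one, so the original 
input is required.

```java
public class LossyNormDemo {
    // Hypothetical "old" encoding: keep only the bit length of the token count.
    // Every count in [64, 127] collapses to the same value, 7.
    public static int encodeOld(int tokenCount) {
        return 32 - Integer.numberOfLeadingZeros(tokenCount);
    }

    // Hypothetical "new" encoding: bucket the token count in steps of 16.
    public static int encodeNew(int tokenCount) {
        return tokenCount / 16;
    }

    public static void main(String[] args) {
        int docA = 100; // token count of one document
        int docB = 120; // token count of another document

        // Same stored value under the old encoding...
        System.out.println(encodeOld(docA) == encodeOld(docB)); // true
        // ...but different values under the new encoding, so the new value
        // cannot be derived from the old stored value alone.
        System.out.println(encodeNew(docA) == encodeNew(docB)); // false
    }
}
```

Given only the old stored value, there is no way to tell which new value is 
correct without going back to the text and counting again.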

Also related to this change, in some cases it's necessary for the user to 
migrate away from index-time boosts. The removal of these is what opened the 
door to Adrien's more efficient encoding here. So the user has to decide 
whether to put such feature values into a NumericDocValuesField and use 
expressions/function queries to combine them with the document's score, or to 
use the new FeatureField (which can be much more efficient), or whatever. This 
case is interesting because it emphasizes that there are other things besides 
just the original document's text that need to be dealt with on upgrades.

I don't agree with the idea that Lucene should be forced to drag along all 
kinds of nonsense data and slowly corrupt itself over time, or that some 
improvements aren't possible because the format can't be changed. Instead, I 
think projects like Solr that advertise themselves as a *database* need to add 
the ability to regenerate a new Lucene index efficiently (e.g. minimizing 
network traffic across distributed nodes, etc). They need to use the additional 
stuff they have (e.g. the user's original data, abstractions of some sort over 
Lucene stuff like scoring features) to make this easier. Lucene is just the 
indexing/search library.

> Allow an option to rewrite all segments
> ---------------------------------------
>
>                 Key: LUCENE-8264
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8264
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>
> For the background, see SOLR-12259.
> There are several use-cases that would be much easier, especially during 
> upgrades, if we could specify that all segments get rewritten. 
> One example: Upgrading 5x->6x->7x. When segments are merged, they're 
> rewritten into the current format. However, there's no guarantee that a 
> particular segment _ever_ gets merged so the 6x-7x upgrade won't necessarily 
> be successful.
> How many merge policies support this is an open question. I propose to start 
> with TMP and raise other JIRAs as necessary for other merge policies.
> So far the usual response has been "re-index from scratch", but that's 
> increasingly difficult as systems get larger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
