[
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451382#comment-16451382
]
Robert Muir commented on LUCENE-8264:
-------------------------------------
It's not possible to warn-only: the encoding changed completely. I think the
key issue here is that Lucene is an *index*, not a *database*. Because it is a
lossy *index* and does not retain all of the user's data, it's not possible to
safely migrate some things automagically. In the norms case, IndexWriter would
need to re-analyze the text ("re-index") and recompute the stats to get back
the original value, so that it can be re-encoded. The function is {{y = f(x)}},
and if {{x}} is not available, {{y}} cannot be recomputed, so Lucene can't do it.
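
To make the {{y = f(x)}} point concrete, here is a minimal sketch of a lossy
encoding. It is purely illustrative and is not Lucene's actual norm encoding:
the class name, the bucket width, and the encode/decode helpers are all
hypothetical. It only shows that once several inputs collapse to the same
stored value, the original input cannot be recovered from the stored value
alone.

```java
// Illustrative only: a made-up quantization, NOT Lucene's norm encoding.
final class LossyNormSketch {
    // Hypothetical bucket width: every 8 consecutive lengths share one code.
    private static final int BUCKET = 8;

    /** f(x): quantize a field length into a single byte (lossy). */
    public static byte encode(int fieldLength) {
        return (byte) Math.min(fieldLength / BUCKET, Byte.MAX_VALUE);
    }

    /** Best-effort inverse: can only return the lower bound of the bucket. */
    public static int decode(byte norm) {
        return norm * BUCKET;
    }

    public static void main(String[] args) {
        byte a = encode(40);
        byte b = encode(47);
        // Two different field lengths produce the same stored value.
        System.out.println(a == b);     // prints "true"
        // Decoding yields 40 for both; the original 47 is gone.
        System.out.println(decode(b));  // prints "40"
    }
}
```

Re-encoding such a value under a different scheme would need the original
length, which in this sketch (as with norms) only exists if the text is
analyzed again.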
Also related to this change, in some cases it's necessary for the user to
migrate away from index-time boosts. The removal of these is what opened the
door to Adrien's more efficient encoding here. So the user has to decide
whether to put such feature values into a NumericDocValuesField and use
expressions/function queries to combine them with the document's score, or to
use the new FeatureField (which can be much more efficient), or whatever. This
case is interesting because it emphasizes that there are other things besides
just the original document's text that need to be dealt with on upgrades.
I don't agree with the idea that Lucene should be forced to drag along all
kinds of nonsense data and slowly corrupt itself over time, or that some
improvements aren't possible because the format can't be changed. Instead, I
think projects like Solr that advertise themselves as a *database* need to add
the ability to regenerate a new Lucene index efficiently (e.g. minimizing
network traffic across distributed nodes, etc). They need to use the additional
stuff they have (e.g. the original user's data, abstractions of some sort over
Lucene stuff like scoring features) to make this easier. Lucene is just the
indexing/search library.
> Allow an option to rewrite all segments
> ---------------------------------------
>
> Key: LUCENE-8264
> URL: https://issues.apache.org/jira/browse/LUCENE-8264
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
>
> For the background, see SOLR-12259.
> There are several use-cases that would be much easier, especially during
> upgrades, if we could specify that all segments get rewritten.
> One example: Upgrading 5x->6x->7x. When segments are merged, they're
> rewritten into the current format. However, there's no guarantee that a
> particular segment _ever_ gets merged, so the 6x->7x upgrade won't necessarily
> be successful.
> How many merge policies support this is an open question. I propose to start
> with TMP and raise other JIRAs as necessary for other merge policies.
> So far the usual response has been "re-index from scratch", but that's
> increasingly difficult as systems get larger.