[
https://issues.apache.org/jira/browse/LUCENE-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erick Erickson updated LUCENE-8004:
-----------------------------------
Environment:
Marking as blocker since I think there are some observations here that should
be discussed and resolved before releasing 7.4. Related to LUCENE-7976.
1> By default it uses TMP and merges down to one segment. It's ironic that we
say "don't optimize", then provide a tool that does exactly that.
2> if we are respecting max segment size (the new default), TMP will not
attempt to merge segments that are >= the max segment size _and_ have no
deleted docs.
3> <1> and <2> mean that the bit at the end of
{{UpgradeIndexMergePolicy.findForcedMerges}} where it collects all the old
segments not returned by {{in.findForcedMerges}} into one segment has the
potential to create a very large segments, which is trappy.
Straw-man proposal:
I really dislike constructs like
{{ if (instanceof TieredMergePolicy) { do one thing} else { do something else}
}}
Seems like that knowledge should be built into the classes themselves. WDYT
about a new method or two in MergePolicy? They'd be no-ops for everything
except TMP at this point. We need a couple of things:
1> hard-coding {{w.forceMerge(1);}} in IndexUpgrader is evil so I'm thinking of
something like
{{w.forceMerge(iwc.getMergePolicy().getDefaultForceMergeSegmentCount());}}
instead. It would return 1 by default. Return default in MergePolicy and
override only in TMP? Make it abstract and override everywhere?
2a> We need a way to tell TMP.findForcedMerges to return segments even if
they're large and have no deletes in this case. Another (expert) method on
MergePolicy set/getUsingForUpgrade? I don't particularly like that at all
2b.1> change UpgradeIndexMergePolicy.findForcedMerges to singleton-merge all
leftover segments rather than merge them all into a single segment. I'd argue
that this is "more correct"; Anything findForcedMerges leaves un-merged was
determined by the merge policy NOT to be "cheap" so we should respect that.
2.b.2> change UpgradeIndexMergePolicy to "somehow" respect max segment size
when it gathers the segments into at the end of findForcedMerges(). This
pollutes MergePolicy some more.
2.b.3> Add a new method to MergePolicy {{findRewriteAllSegments}} or something.
It'd look just like {{findForcedMerges}} except it would "understand" that
every segment would need to be rewritten, even large ones with no deleted
documents. This would make <1> unnecessary. For everything except TMP it would
be a passthrough to {{findForcedMerges}} at this point.
I don't much like changing MergePolicy, it's nice and clean and adding methods
to get specifics from each subclass pollutes it a bit. So <2.b.2> at least
doesn't require that the caller understand anything about the guts of the class.
Also keep in mind that distributed systems have a hard time using the
IndexUpgrader on every node in the system, so we need to have something that is
easy to invoke in that environment, see the discussion at LUCENE-8264.
I'm totally open to any better ideas, haven't started work on this yet.
Priority: Blocker (was: Major)
> IndexUpgraderTool should rewrite segments rather than forceMerge
> ----------------------------------------------------------------
>
> Key: LUCENE-8004
> URL: https://issues.apache.org/jira/browse/LUCENE-8004
> Project: Lucene - Core
> Issue Type: Improvement
> Environment: Marking as blocker since I think there are some
> observations here that should be discussed and resolved before releasing 7.4.
> Related to LUCENE-7976.
> 1> By default it uses TMP and merges down to one segment. It's ironic that we
> say "don't optimize", then provide a tool that does exactly that.
> 2> if we are respecting max segment size (the new default), TMP will not
> attempt to merge segments that are >= the max segment size _and_ have no
> deleted docs.
> 3> <1> and <2> mean that the bit at the end of
> {{UpgradeIndexMergePolicy.findForcedMerges}} where it collects all the old
> segments not returned by {{in.findForcedMerges}} into one segment has the
> potential to create a very large segments, which is trappy.
> Straw-man proposal:
> I really dislike constructs like
> {{ if (instanceof TieredMergePolicy) { do one thing} else { do something
> else} }}
> Seems like that knowledge should be built into the classes themselves. WDYT
> about a new method or two in MergePolicy? They'd be no-ops for everything
> except TMP at this point. We need a couple of things:
> 1> hard-coding {{w.forceMerge(1);}} in IndexUpgrader is evil so I'm thinking
> of something like
> {{w.forceMerge(iwc.getMergePolicy().getDefaultForceMergeSegmentCount());}}
> instead. It would return 1 by default. Return default in MergePolicy and
> override only in TMP? Make it abstract and override everywhere?
> 2a> We need a way to tell TMP.findForcedMerges to return segments even if
> they're large and have no deletes in this case. Another (expert) method on
> MergePolicy set/getUsingForUpgrade? I don't particularly like that at all
> 2b.1> change UpgradeIndexMergePolicy.findForcedMerges to singleton-merge all
> leftover segments rather than merge them all into a single segment. I'd argue
> that this is "more correct"; Anything findForcedMerges leaves un-merged was
> determined by the merge policy NOT to be "cheap" so we should respect that.
> 2.b.2> change UpgradeIndexMergePolicy to "somehow" respect max segment size
> when it gathers the segments into at the end of findForcedMerges(). This
> pollutes MergePolicy some more.
> 2.b.3> Add a new method to MergePolicy {{findRewriteAllSegments}} or
> something. It'd look just like {{findForcedMerges}} except it would
> "understand" that every segment would need to be rewritten, even large ones
> with no deleted documents. This would make <1> unnecessary. For everything
> except TMP it would be a passthrough to {{findForcedMerges}} at this point.
> I don't much like changing MergePolicy, it's nice and clean and adding
> methods to get specifics from each subclass pollutes it a bit. So <2.b.2> at
> least doesn't require that the caller understand anything about the guts of
> the class.
> Also keep in mind that distributed systems have a hard time using the
> IndexUpgrader on every node in the system, so we need to have something that
> is easy to invoke in that environment, see the discussion at LUCENE-8264.
> I'm totally open to any better ideas, haven't started work on this yet.
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Blocker
>
> Spinoff from LUCENE-7976. We help users get themselves into a corner by using
> forceMerge on an index to rewrite all segments in the current Lucene format.
> We should rewrite each individual segment instead. This would also help with
> upgrading X-2->X-1, then X-1->X.
> Of course the preferred method is to re-index from scratch.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]