jpountz commented on code in PR #13430:
URL: https://github.com/apache/lucene/pull/13430#discussion_r1619456981
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -93,6 +93,7 @@ public class TieredMergePolicy extends MergePolicy {
private double segsPerTier = 10.0;
private double forceMergeDeletesPctAllowed = 10.0;
private double deletesPctAllowed = 20.0;
+ private int targetSearchConcurrency = -1;
Review Comment:
Should it default to 1? I expect it to yield the same behavior as today?
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -581,11 +621,14 @@ private MergeSpecification doFindMerges(
// whose length is less than the merge factor, it means we are reaching
// the tail of the list of segments and will only find smaller merges.
// Stop here.
- if (bestScore != null && hitTooLarge == false && candidate.size() <
mergeFactor) {
+ if (bestScore != null
+ && hitTooLarge == false
+ && hitMaxDocs == false
+ && candidate.size() < mergeFactor) {
break;
}
- final MergeScore score = score(candidate, hitTooLarge, segInfosSizes);
+ final MergeScore score = score(candidate, hitTooLarge || hitMaxDocs,
segInfosSizes);
Review Comment:
I don't think that we want to do this. `hitTooLarge` is considered a good
thing by this merge policy. It means that we found a merge that reached the
maximum merged segment size. It's great because if we perform this merge, then
this segment will not be eligible again for merging (unless it gets many
deletes), so we'll essentially be done with it. So these merges are given a
very good score by assuming that they have a perfect skew.
`hitMaxDocs` is different, we don't want to prioritize unbalanced merges
that produce segments of `hitMaxDocs` documents? It's better to keep
prioritizing balanced small segment merges as usual?
My expectation for handling `allowedMaxDoc` is that we would just never
score any merge that has more than `allowedMaxDoc` documents and force the
merge policy to select one of the candidate merges that produce fewer documents
than that.
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -257,6 +258,25 @@ public double getSegmentsPerTier() {
return segsPerTier;
}
+ /**
+ * Sets the target search concurrency. This allows merging to ensure that
there are at least
+ * targetSearchConcurrency segments on the top tier. This setting can be
overriden by force
Review Comment:
Let's rather say that this prevents creating segments that are bigger than
maxDoc/targetSearchConcurrency, which in-turn makes the work parallelizable
into `targetSearchConcurrency` slices of similar doc counts?
I think it's also worth clarifying that this makes merging less aggressive,
ie. high values of `targetSearchConcurrency` result in indexes that do less
merging and have more segments.
##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -431,6 +447,128 @@ public void testForcedMergesRespectSegSize() throws
Exception {
dir.close();
}
+ public void testForcedMergesRespectsTargetSearchConcurrency() throws
Exception {
Review Comment:
You may want to check out `BaseMergePolicyTestCase#testSimulateAppendOnly`.
It simulates merging without actually indexing docs, which allows simulating
many many rounds of merging in very little time, while comparing how some
parameters (like `targetSearchConcurrency`) affect merging decisions and
metrics such as write amplification through merges.
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -409,6 +433,8 @@ public MergeSpecification findMerges(
// allowedSegCount may occasionally be less than segsPerTier
// if segment sizes are below the floor size
allowedSegCount = Math.max(allowedSegCount, segsPerTier);
+ // Override with the targetSearchConcurrency if it is greater
+ allowedSegCount = Math.max(allowedSegCount, targetSearchConcurrency);
Review Comment:
I would also do a change like this one a few lines above:
```patch
diff --git
a/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
b/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
index 208aa297287..4126601e51e 100644
--- a/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
+++ b/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
@@ -392,13 +392,14 @@ public class TieredMergePolicy extends MergePolicy {
allowedDelCount = Math.max(0, allowedDelCount);
final int mergeFactor = (int) Math.min(maxMergeAtOnce, segsPerTier);
+ final int maxNumSegmentsOnHighestTier = Math.max(segsPerTier,
targetSearchConcurrency);
// Compute max allowed segments in the index
long levelSize = Math.max(minSegmentBytes, floorSegmentBytes);
long bytesLeft = totIndexBytes;
double allowedSegCount = 0;
while (true) {
final double segCountLevel = bytesLeft / (double) levelSize;
- if (segCountLevel < segsPerTier || levelSize ==
maxMergedSegmentBytes) {
+ if (segCountLevel <= maxNumSegmentsOnHighestTier || levelSize ==
maxMergedSegmentBytes) {
allowedSegCount += Math.ceil(segCountLevel);
break;
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]