jpountz commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1698394566
##########
lucene/core/src/test/org/apache/lucene/search/TestIndexSearcher.java:
##########
@@ -293,4 +297,33 @@ public void testNullExecutorNonNullTaskExecutor() {
IndexSearcher indexSearcher = new IndexSearcher(reader);
assertNotNull(indexSearcher.getTaskExecutor());
}
+
+  public void testSegmentPartitionsSameSlice() {
+    IndexSearcher indexSearcher =
+        new IndexSearcher(reader, Runnable::run) {
+          @Override
+          protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
+            List<LeafSlice> slices = new ArrayList<>();
+            for (LeafReaderContext ctx : leaves) {
+              slices.add(
+                  new LeafSlice(
+                      new ArrayList<>(
+                          List.of(
+                              LeafReaderContextPartition.createFromAndTo(ctx, 0, 1),
+                              LeafReaderContextPartition.createFrom(ctx, 1)))));
+            }
+            return slices.toArray(new LeafSlice[0]);
+          }
+        };
+    try {
+      indexSearcher.getSlices();
+      fail("should throw exception");
+    } catch (IllegalStateException e) {
+      assertEquals(
+          "The same slice targets multiple partitions of the same leaf reader. "
+              + "A segment should rather get partitioned to be searched concurrently from as many slices as the "
+              + "number of partitions it is split into.",
+          e.getMessage());
+    }
+  }
Review Comment:
Use `expectThrows`?
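
A standalone sketch of the pattern being suggested. The real helper is `LuceneTestCase.expectThrows`; the stand-in below only illustrates how it replaces the try/fail/catch idiom by returning the caught exception so the test can assert on its message.

```java
// Minimal, self-contained illustration of the expectThrows pattern. The
// ThrowingRunnable interface and expectThrows helper here are simplified
// stand-ins for the ones provided by LuceneTestCase.
public class ExpectThrowsDemo {

  @FunctionalInterface
  interface ThrowingRunnable {
    void run() throws Throwable;
  }

  // Runs the block, fails if it does not throw the expected type, and
  // otherwise returns the caught exception for further assertions.
  static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable runnable) {
    try {
      runnable.run();
    } catch (Throwable t) {
      if (expectedType.isInstance(t)) {
        return expectedType.cast(t);
      }
      throw new AssertionError("unexpected exception type: " + t.getClass().getName(), t);
    }
    throw new AssertionError("expected " + expectedType.getName() + " to be thrown");
  }

  public static void main(String[] args) {
    // The test above would collapse to roughly:
    //   IllegalStateException e =
    //       expectThrows(IllegalStateException.class, indexSearcher::getSlices);
    //   assertEquals("The same slice targets multiple partitions ...", e.getMessage());
    IllegalStateException e =
        expectThrows(
            IllegalStateException.class,
            () -> {
              throw new IllegalStateException("boom");
            });
    System.out.println(e.getMessage());
  }
}
```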
##########
lucene/core/src/test/org/apache/lucene/index/TestForTooMuchCloning.java:
##########
@@ -80,7 +80,7 @@ public void test() throws Exception {
// System.out.println("query clone count=" + queryCloneCount);
     assertTrue(
         "too many calls to IndexInput.clone during TermRangeQuery: " + queryCloneCount,
-        queryCloneCount < 50);
+        queryCloneCount <= Math.max(s.getLeafContexts().size(), s.getSlices().length) * 4);
Review Comment:
Hmm, shouldn't the multiplicative factor be the number of leaf context
partitions?
##########
lucene/core/src/test/org/apache/lucene/index/TestLazyProxSkipping.java:
##########
@@ -104,8 +104,8 @@ public TokenStreamComponents createComponents(String fieldName) {
writer.close();
LeafReader reader = getOnlyLeafReader(DirectoryReader.open(directory));
-
- this.searcher = newSearcher(reader);
+    // disable concurrency because it has a single segment and makes assumptions based on that.
+    this.searcher = newSearcher(reader, random().nextBoolean(), random().nextBoolean(), false);
Review Comment:
FYI I'm removing this test in another PR, so all good. :)
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] leaves;
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+      leafReaderContextPartitions.sort(Comparator.comparingInt(l -> l.ctx.docBase));
+      // TODO should we sort by minDocId too?
+      this.leaves = leafReaderContextPartitions.toArray(new LeafReaderContextPartition[0]);
+    }
+
+    /**
+     * Returns the total number of docs that a slice targets, by summing the number of docs that
+     * each of its leaf context partitions targets.
+     */
+    public int getNumDocs() {
+      return Arrays.stream(leaves)
+          .map(LeafReaderContextPartition::getNumDocs)
+          .reduce(Integer::sum)
+          .get();
+    }
+  }
+
+  /**
+   * Holds information about a specific leaf context and the corresponding range of doc ids to
+   * search within.
+   *
+   * @lucene.experimental
+   */
+  public static final class LeafReaderContextPartition {
+    private final int minDocId;
+    private final int maxDocId;
+    private final int numDocs;
+    public final LeafReaderContext ctx;
+
+    private LeafReaderContextPartition(
+        LeafReaderContext leafReaderContext, int minDocId, int maxDocId, int numDocs) {
+      this.ctx = leafReaderContext;
+      this.minDocId = minDocId;
+      this.maxDocId = maxDocId;
+      this.numDocs = numDocs;
Review Comment:
Can you add validation, e.g. `minDocId >= 0`, `maxDocId > minDocId`, `minDocId <
context.reader().maxDoc()`, etc.? Also maybe we don't need to take a `numDocs`
and could compute it dynamically as `min(maxDocId, context.reader().maxDoc()) -
minDocId`?
##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +31,77 @@
*/
public class TotalHitCountCollectorManager
implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  /**
+   * Internal state shared across the different collectors that this collector manager creates. It
+   * tracks leaves seen as an argument of {@link Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early terminates, the count
+   * has already been retrieved for the entire segment, hence subsequent partitions of the same
+   * segment should also early terminate. If the first partition of a segment computes hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent their counts from
+   * being retrieved from {@link LRUQueryCache} (which returns counts for the entire segment).
+   */
+  private final Map<LeafReaderContext, Boolean> seenContexts = new HashMap<>();
+
@Override
public TotalHitCountCollector newCollector() throws IOException {
- return new TotalHitCountCollector();
+ return new LeafPartitionAwareTotalHitCountCollector(seenContexts);
}
@Override
   public Integer reduce(Collection<TotalHitCountCollector> collectors) throws IOException {
+    // TODO this makes the collector manager instance reusable across multiple searches. It isn't a
+    // strict requirement
+    // but it is currently supported as collector managers normally don't hold state, while
+    // collectors do.
+    // This currently works but would not allow to perform incremental reductions in the future. It
+    // would be easy enough
+    // to expose an additional method to the CollectorManager interface to explicitly clear state,
+    // which index searcher
+    // calls before starting a new search. That feels like overkill at the moment because there is
+    // no real usecase for it.
+    // An alternative is to document the requirement to not reuse collector managers across
+    // searches, that would be a
+    // breaking change. Perhaps not needed for now.
+    seenContexts.clear();
int totalHits = 0;
for (TotalHitCountCollector collector : collectors) {
totalHits += collector.getTotalHits();
}
return totalHits;
}
+
+  private static class LeafPartitionAwareTotalHitCountCollector extends TotalHitCountCollector {
+    private final Map<LeafReaderContext, Boolean> seenContexts;
+
+    LeafPartitionAwareTotalHitCountCollector(Map<LeafReaderContext, Boolean> seenContexts) {
+      this.seenContexts = seenContexts;
+    }
+
+    @Override
+    public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+      // TODO we could synchronize on context instead, but then the set would need to be made
+      // thread-safe as well, probably overkill?
+      synchronized (seenContexts) {
+        Boolean earlyTerminated = seenContexts.get(context);
+        // first time we see this leaf
+        if (earlyTerminated == null) {
+          try {
+            LeafCollector leafCollector = super.getLeafCollector(context);
+            seenContexts.put(context, false);
+            return leafCollector;
+          } catch (CollectionTerminatedException e) {
+            seenContexts.put(context, true);
+            throw e;
+          }
+        }
+        if (earlyTerminated) {
+          // previous partition of the same leaf early terminated
+          throw new CollectionTerminatedException();
+        }
Review Comment:
This if statement could be put outside of the synchronized block?
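
The suggestion can be sketched in isolation: do the map lookup and update under the lock, but make the early-termination decision outside it from a local copy. The class below is a standalone, illustrative analogue; a plain `Map<String, Boolean>` stands in for `seenContexts`, and a boolean return stands in for throwing `CollectionTerminatedException`.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of narrowing the synchronized block: only map reads and
// writes are guarded; the check on the already-fetched value happens outside.
public class LockScopeDemo {

  private final Map<String, Boolean> seenContexts = new HashMap<>();

  // Returns true if the caller should early-terminate for this leaf key. The
  // first call for a key records its own outcome; later calls just follow it.
  boolean shouldEarlyTerminate(String leafKey, boolean firstCallTerminates) {
    Boolean earlyTerminated;
    synchronized (seenContexts) {
      earlyTerminated = seenContexts.get(leafKey);
      if (earlyTerminated == null) {
        // first time we see this leaf: record the outcome while holding the lock
        seenContexts.put(leafKey, firstCallTerminates);
        return firstCallTerminates;
      }
    }
    // decision for subsequent partitions made on the local copy, outside the lock
    return earlyTerminated;
  }

  public static void main(String[] args) {
    LockScopeDemo demo = new LockScopeDemo();
    System.out.println(demo.shouldEarlyTerminate("segment-0", true));  // first partition
    System.out.println(demo.shouldEarlyTerminate("segment-0", false)); // follows the first
  }
}
```

This is safe because the `Boolean` fetched under the lock is effectively immutable once read into the local variable, so no shared state is touched after the block exits.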
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] leaves;
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+      leafReaderContextPartitions.sort(Comparator.comparingInt(l -> l.ctx.docBase));
+      // TODO should we sort by minDocId too?
Review Comment:
I think so.
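
A two-level sort with `minDocId` as the tie-breaker would be a one-line change via `Comparator.thenComparingInt`. The sketch below is standalone and illustrative; `Part` is a hypothetical stand-in for `LeafReaderContextPartition`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Standalone sketch of sorting partitions by the leaf's docBase first and by
// minDocId within the same leaf. "Part" is a stand-in record, not the real class.
public class PartitionSortDemo {

  record Part(int docBase, int minDocId) {}

  public static void main(String[] args) {
    List<Part> parts =
        new ArrayList<>(
            List.of(new Part(10, 5), new Part(0, 3), new Part(10, 0), new Part(0, 0)));

    // primary key: docBase; tie-breaker: minDocId
    parts.sort(Comparator.comparingInt(Part::docBase).thenComparingInt(Part::minDocId));

    System.out.println(parts);
  }
}
```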
##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +31,77 @@
*/
public class TotalHitCountCollectorManager
implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  /**
+   * Internal state shared across the different collectors that this collector manager creates. It
+   * tracks leaves seen as an argument of {@link Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early terminates, the count
+   * has already been retrieved for the entire segment, hence subsequent partitions of the same
+   * segment should also early terminate. If the first partition of a segment computes hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent their counts from
+   * being retrieved from {@link LRUQueryCache} (which returns counts for the entire segment).
+   */
+  private final Map<LeafReaderContext, Boolean> seenContexts = new HashMap<>();
+
@Override
public TotalHitCountCollector newCollector() throws IOException {
- return new TotalHitCountCollector();
+ return new LeafPartitionAwareTotalHitCountCollector(seenContexts);
}
@Override
   public Integer reduce(Collection<TotalHitCountCollector> collectors) throws IOException {
+    // TODO this makes the collector manager instance reusable across multiple searches. It isn't a
+    // strict requirement
Review Comment:
It looks like your IDE and gradlew tidy disagree about the line length so
every second line is wrapped at different lengths?
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
- public final LeafReaderContext[] leaves;
+ public final LeafReaderContextPartition[] leaves;
Review Comment:
Should we rename `leaves` to `partitions`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]