Re: [PR] Add support for intra-segment search concurrency [lucene]

via GitHub Mon, 26 Aug 2024 04:43:05 -0700


original-brownbear commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1731119841



##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +31,77 @@
  */
 public class TotalHitCountCollectorManager
     implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  /**
+   * Internal state shared across the different collectors that this collector 
manager creates. It
+   * tracks leaves seen as an argument of {@link 
Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early 
terminates, count has
+   * been already retrieved for the entire segment hence subsequent partitions 
of the same segment
+   * should also early terminate. If the first partition of a segment computes 
hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent 
their counts from
+   * being retrieve from {@link LRUQueryCache} (which returns counts for the 
entire segment)
+   */
+  private final Map<LeafReaderContext, Boolean> seenContexts = new HashMap<>();
+
   @Override
   public TotalHitCountCollector newCollector() throws IOException {
-    return new TotalHitCountCollector();
+    return new LeafPartitionAwareTotalHitCountCollector(seenContexts);
   }
 
   @Override
   public Integer reduce(Collection<TotalHitCountCollector> collectors) throws 
IOException {
+    // TODO this makes the collector manager instance reusable across multiple 
searches. It isn't a
+    // strict requirement
+    // but it is currently supported as collector managers normally don't hold 
state, while
+    // collectors do.
+    // This currently works but would not allow to perform incremental 
reductions in the future. It
+    // would be easy enough
+    // to expose an additional method to the CollectorManager interface to 
explicitly clear state,
+    // which index searcher
+    // calls before starting a new search. That feels like overkill at the 
moment because there is
+    // no real usecase for it.
+    // An alternative is to document the requirement to not reuse collector 
managers across
+    // searches, that would be a
+    // breaking change. Perhaps not needed for now.
+    seenContexts.clear();
     int totalHits = 0;
     for (TotalHitCountCollector collector : collectors) {
       totalHits += collector.getTotalHits();
     }
     return totalHits;
   }
+
+  private static class LeafPartitionAwareTotalHitCountCollector extends 
TotalHitCountCollector {
+    private final Map<LeafReaderContext, Boolean> seenContexts;
+
+    LeafPartitionAwareTotalHitCountCollector(Map<LeafReaderContext, Boolean> 
seenContexts) {
+      this.seenContexts = seenContexts;
+    }
+
+    @Override
+    public LeafCollector getLeafCollector(LeafReaderContext context) throws 
IOException {
+      // TODO we could synchronize on context instead, but then the set would 
need to be made
+      // thread-safe as well, probably overkill?
+      synchronized (seenContexts) {
+        Boolean earlyTerminated = seenContexts.get(context);
+        // first time we see this leaf
+        if (earlyTerminated == null) {
+          try {
+            LeafCollector leafCollector = super.getLeafCollector(context);
+            seenContexts.put(context, false);
+            return leafCollector;
+          } catch (CollectionTerminatedException e) {
+            seenContexts.put(context, true);
+            throw e;
+          }
+        }
+        if (earlyTerminated) {
+          // previous partition of the same leaf early terminated
+          throw new CollectionTerminatedException();
+        }

Review Comment:
   I would expect the compiler to do that for us already but agreed, that would 
look better :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add support for intra-segment search concurrency [lucene]

Reply via email to