jpountz commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1698394566
##########
lucene/core/src/test/org/apache/lucene/search/TestIndexSearcher.java:
##########
@@ -293,4 +297,33 @@ public void testNullExecutorNonNullTaskExecutor() {
IndexSearcher indexSearcher = new IndexSearcher(reader);
assertNotNull(indexSearcher.getTaskExecutor());
}
+
+  public void testSegmentPartitionsSameSlice() {
+    IndexSearcher indexSearcher =
+        new IndexSearcher(reader, Runnable::run) {
+          @Override
+          protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
+            List<LeafSlice> slices = new ArrayList<>();
+            for (LeafReaderContext ctx : leaves) {
+              slices.add(
+                  new LeafSlice(
+                      new ArrayList<>(
+                          List.of(
+                              LeafReaderContextPartition.createFromAndTo(ctx, 0, 1),
+                              LeafReaderContextPartition.createFrom(ctx, 1)))));
+            }
+            return slices.toArray(new LeafSlice[0]);
+          }
+        };
+    try {
+      indexSearcher.getSlices();
+      fail("should throw exception");
+    } catch (IllegalStateException e) {
+      assertEquals(
+          "The same slice targets multiple partitions of the same leaf reader. "
+              + "A segment should rather get partitioned to be searched concurrently from as many slices as the "
+              + "number of partitions it is split into.",
+          e.getMessage());
+    }
+  }
Review Comment:
Use `expectThrows`?
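
A standalone sketch of the pattern being suggested. The real helper is `LuceneTestCase.expectThrows`; the stand-in below only illustrates how it replaces the try/fail/catch idiom by returning the caught exception so the test can assert on its message.

```java
// Minimal, self-contained illustration of the expectThrows pattern. The
// ThrowingRunnable interface and expectThrows helper here are simplified
// stand-ins for the ones provided by LuceneTestCase.
public class ExpectThrowsDemo {

  @FunctionalInterface
  interface ThrowingRunnable {
    void run() throws Throwable;
  }

  // Runs the block, fails if it does not throw the expected type, and
  // otherwise returns the caught exception for further assertions.
  static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable runnable) {
    try {
      runnable.run();
    } catch (Throwable t) {
      if (expectedType.isInstance(t)) {
        return expectedType.cast(t);
      }
      throw new AssertionError("unexpected exception type: " + t.getClass().getName(), t);
    }
    throw new AssertionError("expected " + expectedType.getName() + " to be thrown");
  }

  public static void main(String[] args) {
    // The test above would collapse to roughly:
    //   IllegalStateException e =
    //       expectThrows(IllegalStateException.class, indexSearcher::getSlices);
    //   assertEquals("The same slice targets multiple partitions ...", e.getMessage());
    IllegalStateException e =
        expectThrows(
            IllegalStateException.class,
            () -> {
              throw new IllegalStateException("boom");
            });
    System.out.println(e.getMessage());
  }
}
```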
##########
lucene/core/src/test/org/apache/lucene/index/TestForTooMuchCloning.java:
##########
@@ -80,7 +80,7 @@ public void test() throws Exception {
// System.out.println("query clone count=" + queryCloneCount);
     assertTrue(
         "too many calls to IndexInput.clone during TermRangeQuery: " + queryCloneCount,
-        queryCloneCount < 50);
+        queryCloneCount <= Math.max(s.getLeafContexts().size(), s.getSlices().length) * 4);
Review Comment:
Hmm, shouldn't the multiplicative factor be the number of leaf context
partitions?
##########
lucene/core/src/test/org/apache/lucene/index/TestLazyProxSkipping.java:
##########
@@ -104,8 +104,8 @@ public TokenStreamComponents createComponents(String fieldName) {
writer.close();
LeafReader reader = getOnlyLeafReader(DirectoryReader.open(directory));
-
- this.searcher = newSearcher(reader);
+    // disable concurrency because it has a single segment and makes assumptions based on that.
+    this.searcher = newSearcher(reader, random().nextBoolean(), random().nextBoolean(), false);
Review Comment:
FYI I'm removing this test in another PR, so all good. :)
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] leaves;
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+      leafReaderContextPartitions.sort(Comparator.comparingInt(l -> l.ctx.docBase));
+      // TODO should we sort by minDocId too?
+      this.leaves = leafReaderContextPartitions.toArray(new LeafReaderContextPartition[0]);
+    }
+
+    /**
+     * Returns the total number of docs that a slice targets, by summing the number of docs that
+     * each of its leaf context partitions targets.
+     */
+    public int getNumDocs() {
+      return Arrays.stream(leaves)
+          .map(LeafReaderContextPartition::getNumDocs)
+          .reduce(Integer::sum)
+          .get();
+    }
+  }
+
+  /**
+   * Holds information about a specific leaf context and the corresponding range of doc ids to
+   * search within.
+   *
+   * @lucene.experimental
+   */
+  public static final class LeafReaderContextPartition {
+    private final int minDocId;
+    private final int maxDocId;
+    private final int numDocs;
+    public final LeafReaderContext ctx;
+
+    private LeafReaderContextPartition(
+        LeafReaderContext leafReaderContext, int minDocId, int maxDocId, int numDocs) {
+      this.ctx = leafReaderContext;
+      this.minDocId = minDocId;
+      this.maxDocId = maxDocId;
+      this.numDocs = numDocs;
Review Comment:
Can you add validation, e.g. `minDocId >= 0`, `maxDocId > minDocId`, `minDocId <
context.reader().maxDoc()`, etc.? Also maybe we don't need to take a `numDocs`
and could compute it dynamically as `min(maxDocId, context.reader().maxDoc()) -
minDocId`?
##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +31,77 @@
*/
public class TotalHitCountCollectorManager
implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  /**
+   * Internal state shared across the different collectors that this collector manager creates. It
+   * tracks leaves seen as an argument of {@link Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early terminates, the count
+   * has already been retrieved for the entire segment, hence subsequent partitions of the same
+   * segment should also early terminate. If the first partition of a segment computes hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent their counts from
+   * being retrieved from {@link LRUQueryCache} (which returns counts for the entire segment).
+   */
+  private final Map<LeafReaderContext, Boolean> seenContexts = new HashMap<>();
+
@Override
public TotalHitCountCollector newCollector() throws IOException {
- return new TotalHitCountCollector();
+ return new LeafPartitionAwareTotalHitCountCollector(seenContexts);
}
@Override
   public Integer reduce(Collection<TotalHitCountCollector> collectors) throws IOException {
+    // TODO this makes the collector manager instance reusable across multiple searches. It isn't a
+    // strict requirement
+    // but it is currently supported as collector managers normally don't hold state, while
+    // collectors do.
+    // This currently works but would not allow to perform incremental reductions in the future. It
+    // would be easy enough
+    // to expose an additional method to the CollectorManager interface to explicitly clear state,
+    // which index searcher
+    // calls before starting a new search. That feels like overkill at the moment because there is
+    // no real usecase for it.
+    // An alternative is to document the requirement to not reuse collector managers across
+    // searches, that would be a
+    // breaking change. Perhaps not needed for now.
+    seenContexts.clear();
int totalHits = 0;
for (TotalHitCountCollector collector : collectors) {
totalHits += collector.getTotalHits();
}
return totalHits;
}
+
+  private static class LeafPartitionAwareTotalHitCountCollector extends TotalHitCountCollector {
+    private final Map<LeafReaderContext, Boolean> seenContexts;
+
+    LeafPartitionAwareTotalHitCountCollector(Map<LeafReaderContext, Boolean> seenContexts) {
+      this.seenContexts = seenContexts;
+    }
+
+    @Override
+    public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+      // TODO we could synchronize on context instead, but then the set would need to be made
+      // thread-safe as well, probably overkill?
+      synchronized (seenContexts) {
+        Boolean earlyTerminated = seenContexts.get(context);
+        // first time we see this leaf
+        if (earlyTerminated == null) {
+          try {
+            LeafCollector leafCollector = super.getLeafCollector(context);
+            seenContexts.put(context, false);
+            return leafCollector;
+          } catch (CollectionTerminatedException e) {
+            seenContexts.put(context, true);
+            throw e;
+          }
+        }
+        if (earlyTerminated) {
+          // previous partition of the same leaf early terminated
+          throw new CollectionTerminatedException();
+        }
Review Comment:
This if statement could be put outside of the synchronized block?
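
The suggestion can be sketched in isolation: do the map lookup and update under the lock, but make the early-termination decision outside it from a local copy. The class below is a standalone, illustrative analogue; a plain `Map<String, Boolean>` stands in for `seenContexts`, and a boolean return stands in for throwing `CollectionTerminatedException`.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of narrowing the synchronized block: only map reads and
// writes are guarded; the check on the already-fetched value happens outside.
public class LockScopeDemo {

  private final Map<String, Boolean> seenContexts = new HashMap<>();

  // Returns true if the caller should early-terminate for this leaf key. The
  // first call for a key records its own outcome; later calls just follow it.
  boolean shouldEarlyTerminate(String leafKey, boolean firstCallTerminates) {
    Boolean earlyTerminated;
    synchronized (seenContexts) {
      earlyTerminated = seenContexts.get(leafKey);
      if (earlyTerminated == null) {
        // first time we see this leaf: record the outcome while holding the lock
        seenContexts.put(leafKey, firstCallTerminates);
        return firstCallTerminates;
      }
    }
    // decision for subsequent partitions made on the local copy, outside the lock
    return earlyTerminated;
  }

  public static void main(String[] args) {
    LockScopeDemo demo = new LockScopeDemo();
    System.out.println(demo.shouldEarlyTerminate("segment-0", true));  // first partition
    System.out.println(demo.shouldEarlyTerminate("segment-0", false)); // follows the first
  }
}
```

This is safe because the `Boolean` fetched under the lock is effectively immutable once read into the local variable, so no shared state is touched after the block exits.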
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] leaves;
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+      leafReaderContextPartitions.sort(Comparator.comparingInt(l -> l.ctx.docBase));
+      // TODO should we sort by minDocId too?
Review Comment:
I think so.
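
A two-level sort with `minDocId` as the tie-breaker would be a one-line change via `Comparator.thenComparingInt`. The sketch below is standalone and illustrative; `Part` is a hypothetical stand-in for `LeafReaderContextPartition`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Standalone sketch of sorting partitions by the leaf's docBase first and by
// minDocId within the same leaf. "Part" is a stand-in record, not the real class.
public class PartitionSortDemo {

  record Part(int docBase, int minDocId) {}

  public static void main(String[] args) {
    List<Part> parts =
        new ArrayList<>(
            List.of(new Part(10, 5), new Part(0, 3), new Part(10, 0), new Part(0, 0)));

    // primary key: docBase; tie-breaker: minDocId
    parts.sort(Comparator.comparingInt(Part::docBase).thenComparingInt(Part::minDocId));

    System.out.println(parts);
  }
}
```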
##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +31,77 @@
*/
public class TotalHitCountCollectorManager
implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  /**
+   * Internal state shared across the different collectors that this collector manager creates. It
+   * tracks leaves seen as an argument of {@link Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early terminates, the count
+   * has already been retrieved for the entire segment, hence subsequent partitions of the same
+   * segment should also early terminate. If the first partition of a segment computes hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent their counts from
+   * being retrieved from {@link LRUQueryCache} (which returns counts for the entire segment).
+   */
+  private final Map<LeafReaderContext, Boolean> seenContexts = new HashMap<>();
+
@Override
public TotalHitCountCollector newCollector() throws IOException {
- return new TotalHitCountCollector();
+ return new LeafPartitionAwareTotalHitCountCollector(seenContexts);
}
@Override
   public Integer reduce(Collection<TotalHitCountCollector> collectors) throws IOException {
+    // TODO this makes the collector manager instance reusable across multiple searches. It isn't a
+    // strict requirement
Review Comment:
It looks like your IDE and gradlew tidy disagree about the line length so
every second line is wrapped at different lengths?
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
- public final LeafReaderContext[] leaves;
+ public final LeafReaderContextPartition[] leaves;
Review Comment:
Should we rename `leaves` to `partitions`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]