Re: [PR] Add TaxonomyReader#getBulkOrdinals method (#12180) [lucene]

via GitHub Mon, 06 Nov 2023 09:40:00 -0800


epotyom commented on code in PR #12769:
URL: https://github.com/apache/lucene/pull/12769#discussion_r1383719431



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java:
##########
@@ -312,6 +316,111 @@ public int getOrdinal(FacetLabel cp) throws IOException {
     return ret;
   }
 
+  @Override
+  public int[] getBulkOrdinals(FacetLabel... categoryPaths) throws IOException {
+    ensureOpen();
+    if (categoryPaths.length == 0) {
+      return new int[0];
+    }
+    if (categoryPaths.length == 1) {
+      return new int[] {getOrdinal(categoryPaths[0])};
+    }
+    // First try to find results in the cache:
+    int[] result = new int[categoryPaths.length];
+    int[] indexesMissingFromCache = new int[10]; // initial size, will grow when required
+    int numberOfMissingFromCache = 0;
+    FacetLabel cp;
+    Integer res;
+    for (int i = 0; i < categoryPaths.length; i++) {
+      cp = categoryPaths[i];
+      synchronized (ordinalCache) {
+        res = ordinalCache.get(cp);
+      }
+      if (res != null) {
+        if (res < indexReader.maxDoc()) {
+          // Since the cache is shared with DTR instances allocated from
+          // doOpenIfChanged, we need to ensure that the ordinal is one that
+          // this DTR instance recognizes.
+          result[i] = res;
+        } else {
+          // if we get here, it means that the category was found in the cache,
+          // but is not recognized by this TR instance. Therefore, there's no
+          // need to continue search for the path on disk, because we won't find
+          // it there too.
+          result[i] = TaxonomyReader.INVALID_ORDINAL;
+        }
+      } else {
+        indexesMissingFromCache =
+            ArrayUtil.grow(indexesMissingFromCache, numberOfMissingFromCache + 1);
+        indexesMissingFromCache[numberOfMissingFromCache++] = i;
+      }
+    }
+    // all ordinals found in cache
+    if (indexesMissingFromCache.length == 0) {
+      return result;
+    }
+
+    // If we're still here, we have at least one cache miss. We need to fetch the
+    // value from disk, and then also put results in the cache
+
+    // Create array of missing terms, and sort them so that later we scan terms dictionary
+    // forward-only.
+    // Note: similar functionality exists within BytesRefHash and BytesRefArray, but they don't
+    // reuse BytesRefs and assign their own ords. It is cheaper to have custom implementation here.
+    BytesRef[] termsToGet = new BytesRef[numberOfMissingFromCache];
+    for (int i = 0; i < termsToGet.length; i++) {
+      cp = categoryPaths[indexesMissingFromCache[i]];
+      termsToGet[i] = new BytesRef(FacetsConfig.pathToString(cp.components, cp.length));
+    }
+    // sort both terms and their indexes in the input parameter
+    int[] finalMissingFromCache = indexesMissingFromCache;
+
+    new StringSorter(BytesRefComparator.NATURAL) {
+
+      @Override
+      protected void swap(int i, int j) {
+        int tmp = finalMissingFromCache[i];
+        finalMissingFromCache[i] = finalMissingFromCache[j];
+        finalMissingFromCache[j] = tmp;
+        BytesRef tmpBytes = termsToGet[i];
+        termsToGet[i] = termsToGet[j];
+        termsToGet[j] = tmpBytes;
+      }
+
+      @Override
+      protected void get(BytesRefBuilder builder, BytesRef result, int i) {
+        BytesRef ref = termsToGet[i];
+        result.offset = ref.offset;
+        result.length = ref.length;
+        result.bytes = ref.bytes;
+      }
+    }.sort(0, numberOfMissingFromCache);
+
+    TermsEnum te = MultiTerms.getTerms(indexReader, Consts.FULL).iterator();
+    PostingsEnum postings = null;
+    int ord;
+    int resIndex;
+    for (int i = 0; i < numberOfMissingFromCache; i++) {
+      resIndex = indexesMissingFromCache[i];
+      if (te.seekExact(termsToGet[i])) {
+        postings = te.postings(postings, 0);
+        if (postings != null && postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+          ord = postings.docID();
+          result[resIndex] = ord;
+          // populate cache
+          synchronized (ordinalCache) {

Review Comment:
   Good idea! Will update.
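
   For readers skimming the hunk, here is a minimal sketch of the
   sort-then-scatter pattern it relies on: order the pending lookups by term
   so the dictionary is visited in ascending order, but write each answer back
   to the caller's original position. `BulkLookupSketch`, `bulkLookup`, and
   `INVALID` are hypothetical names, and a `TreeMap` stands in for the terms
   dictionary, so this illustrates the idea rather than reproducing the PR's
   code. It also compares `String`s by UTF-16 code unit, whereas the hunk
   compares UTF-8 bytes via `BytesRefComparator.NATURAL`.

```java
import java.util.Arrays;
import java.util.TreeMap;

/** Hypothetical sketch; not part of Lucene. */
final class BulkLookupSketch {
  // Stand-in for TaxonomyReader.INVALID_ORDINAL (assumed value for this sketch).
  static final int INVALID = -1;

  /**
   * Looks up every path in a sorted dictionary. Lookups are issued in ascending
   * key order, while results are written back in the caller's original order.
   */
  static int[] bulkLookup(TreeMap<String, Integer> sortedDict, String... paths) {
    int[] result = new int[paths.length];

    // Positions into `paths`; sorting these instead of `paths` itself keeps
    // the link between each term and its slot in `result`.
    Integer[] order = new Integer[paths.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> paths[a].compareTo(paths[b]));

    // Visit terms in ascending order, scattering answers to original slots.
    for (int pos : order) {
      Integer ord = sortedDict.get(paths[pos]);
      result[pos] = (ord != null) ? ord : INVALID;
    }
    return result;
  }
}
```

   The hunk achieves the same effect without boxing by swapping two parallel
   arrays (`termsToGet` and the index array) inside a custom `StringSorter`.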



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

