Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]

via GitHub Thu, 02 May 2024 08:18:15 -0700


msokolov commented on code in PR #13181:
URL: https://github.com/apache/lucene/pull/13181#discussion_r1587804525



##########
lucene/core/src/java/org/apache/lucene/util/quantization/QuantizedByteVectorValues.java:
##########
@@ -18,13 +18,40 @@
 
 import java.io.IOException;
 import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
 
 /**
  * A version of {@link ByteVectorValues}, but additionally retrieving score 
correction offset for
  * Scalar quantization scores.
  *
  * @lucene.experimental
  */
-public abstract class QuantizedByteVectorValues extends ByteVectorValues {
+public abstract class QuantizedByteVectorValues extends DocIdSetIterator {

Review Comment:
   Sorry I haven't kept up with all the exciting developments -- how does one 
specify quantization -- it's part of `FieldInfo` right? In that case does this 
class/interface need to be public? wouldn't VectorValues just "do the right 
thing" as regards quantization if it's specified for a field? Or ... are we 
trying to allow two-pass scoring where we coarsely use quantized score and then 
later refine with full-precision scoring?



##########
lucene/core/src/java/org/apache/lucene/util/quantization/QuantizedByteVectorValues.java:
##########
@@ -18,13 +18,40 @@
 
 import java.io.IOException;
 import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
 
 /**
  * A version of {@link ByteVectorValues}, but additionally retrieving score 
correction offset for
  * Scalar quantization scores.
  *
  * @lucene.experimental
  */
-public abstract class QuantizedByteVectorValues extends ByteVectorValues {
+public abstract class QuantizedByteVectorValues extends DocIdSetIterator {

Review Comment:
   Overall the idea of returning a `VectorScorer` from a `VectorValues` makes 
sense to me. One use case I'm not sure we support (and we should) is a vector 
field with no associated HNSW graph where we want to rank documents by their 
vector score "brute force". We would still want access to all the other goodies 
like quantization, predefined scorers, etc. Can we even create a vector field 
with no graph? If we do that would there be a `VectorSimilarity` associated 
with the field?



##########
lucene/core/src/java/org/apache/lucene/search/VectorScorer.java:
##########
@@ -18,64 +18,39 @@
 
 import java.io.IOException;
 import org.apache.lucene.index.ByteVectorValues;
-import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.FloatVectorValues;
-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.VectorSimilarityFunction;
 
 /**
  * Computes the similarity score between a given query vector and different 
document vectors. This
- * is primarily used by {@link KnnFloatVectorQuery} to run an exact, 
exhaustive search over the
- * vectors.
+ * is used for exact searching and scoring
+ *
+ * @lucene.experimental
  */
-abstract class VectorScorer {
-  protected final VectorSimilarityFunction similarity;
+public interface VectorScorer {
 
   /**
-   * Create a new vector scorer instance.
+   * Compute the score for the current document ID.
    *
-   * @param context the reader context
-   * @param fi the FieldInfo for the field containing document vectors
-   * @param query the query vector to compute the similarity for
+   * @return the score for the current document ID
+   * @throws IOException if an exception occurs during score computation
    */
-  static FloatVectorScorer create(LeafReaderContext context, FieldInfo fi, 
float[] query)
-      throws IOException {
-    FloatVectorValues values = context.reader().getFloatVectorValues(fi.name);
-    if (values == null) {
-      FloatVectorValues.checkField(context.reader(), fi.name);
-      return null;
-    }
-    final VectorSimilarityFunction similarity = 
fi.getVectorSimilarityFunction();
-    return new FloatVectorScorer(values, query, similarity);
-  }
-
-  static ByteVectorScorer create(LeafReaderContext context, FieldInfo fi, 
byte[] query)
-      throws IOException {
-    ByteVectorValues values = context.reader().getByteVectorValues(fi.name);
-    if (values == null) {
-      ByteVectorValues.checkField(context.reader(), fi.name);
-      return null;
-    }
-    VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction();
-    return new ByteVectorScorer(values, query, similarity);
-  }
-
-  VectorScorer(VectorSimilarityFunction similarity) {
-    this.similarity = similarity;
-  }
+  float score() throws IOException;
 
-  /** Compute the similarity score for the current document. */
-  abstract float score() throws IOException;
-
-  abstract boolean advanceExact(int doc) throws IOException;
+  /**
+   * @return a {@link DocIdSetIterator} over the documents.
+   */
+  DocIdSetIterator iterator();

Review Comment:
   Your message, @benwtrent, said in this scheme iterators create scorers, but 
here it looks like the "normal" way?



##########
lucene/core/src/java/org/apache/lucene/search/VectorScorer.java:
##########
@@ -18,64 +18,39 @@
 
 import java.io.IOException;
 import org.apache.lucene.index.ByteVectorValues;
-import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.FloatVectorValues;
-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.VectorSimilarityFunction;
 
 /**
  * Computes the similarity score between a given query vector and different 
document vectors. This
- * is primarily used by {@link KnnFloatVectorQuery} to run an exact, 
exhaustive search over the
- * vectors.
+ * is used for exact searching and scoring
+ *
+ * @lucene.experimental
  */
-abstract class VectorScorer {
-  protected final VectorSimilarityFunction similarity;
+public interface VectorScorer {
 
   /**
-   * Create a new vector scorer instance.
+   * Compute the score for the current document ID.
    *
-   * @param context the reader context
-   * @param fi the FieldInfo for the field containing document vectors
-   * @param query the query vector to compute the similarity for
+   * @return the score for the current document ID
+   * @throws IOException if an exception occurs during score computation
    */
-  static FloatVectorScorer create(LeafReaderContext context, FieldInfo fi, 
float[] query)
-      throws IOException {
-    FloatVectorValues values = context.reader().getFloatVectorValues(fi.name);
-    if (values == null) {
-      FloatVectorValues.checkField(context.reader(), fi.name);
-      return null;
-    }
-    final VectorSimilarityFunction similarity = 
fi.getVectorSimilarityFunction();
-    return new FloatVectorScorer(values, query, similarity);
-  }
-
-  static ByteVectorScorer create(LeafReaderContext context, FieldInfo fi, 
byte[] query)
-      throws IOException {
-    ByteVectorValues values = context.reader().getByteVectorValues(fi.name);
-    if (values == null) {
-      ByteVectorValues.checkField(context.reader(), fi.name);
-      return null;
-    }
-    VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction();
-    return new ByteVectorScorer(values, query, similarity);
-  }
-
-  VectorScorer(VectorSimilarityFunction similarity) {
-    this.similarity = similarity;
-  }
+  float score() throws IOException;
 
-  /** Compute the similarity score for the current document. */
-  abstract float score() throws IOException;
-
-  abstract boolean advanceExact(int doc) throws IOException;
+  /**
+   * @return a {@link DocIdSetIterator} over the documents.
+   */
+  DocIdSetIterator iterator();

Review Comment:
   But I guess the `VectorValues` will return ` VectorScorer`? In that case do 
we need this `iterator()` method since presumably the caller already has an 
iterator over docs with vectors in this field?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]

Reply via email to