msokolov commented on code in PR #13181:
URL: https://github.com/apache/lucene/pull/13181#discussion_r1587804525
##########
lucene/core/src/java/org/apache/lucene/util/quantization/QuantizedByteVectorValues.java:
##########
@@ -18,13 +18,40 @@
import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
/**
* A version of {@link ByteVectorValues}, but additionally retrieving score
correction offset for
* Scalar quantization scores.
*
* @lucene.experimental
*/
-public abstract class QuantizedByteVectorValues extends ByteVectorValues {
+public abstract class QuantizedByteVectorValues extends DocIdSetIterator {
Review Comment:
Sorry I haven't kept up with all the exciting developments -- how does one
specify quantization -- it's part of `FieldInfo` right? In that case does this
class/interface need to be public? wouldn't VectorValues just "do the right
thing" as regards quantization if it's specified for a field? Or ... are we
trying to allow two-pass scoring where we coarsely use quantized score and then
later refine with full-precision scoring?
##########
lucene/core/src/java/org/apache/lucene/util/quantization/QuantizedByteVectorValues.java:
##########
@@ -18,13 +18,40 @@
import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
/**
* A version of {@link ByteVectorValues}, but additionally retrieving score
correction offset for
* Scalar quantization scores.
*
* @lucene.experimental
*/
-public abstract class QuantizedByteVectorValues extends ByteVectorValues {
+public abstract class QuantizedByteVectorValues extends DocIdSetIterator {
Review Comment:
Overall the idea of returning a `VectorScorer` from a `VectorValues` makes
sense to me. One use case I'm not sure we support (and we should) is a vector
field with no associated HNSW graph where we want to rank documents by their
vector score "brute force". We would still want access to all the other goodies
like quantization, predefined scorers, etc. Can we even create a vector field
with no graph? If we do that would there be a `VectorSimilarity` associated
with the field?
##########
lucene/core/src/java/org/apache/lucene/search/VectorScorer.java:
##########
@@ -18,64 +18,39 @@
import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
-import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FloatVectorValues;
-import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.VectorSimilarityFunction;
/**
* Computes the similarity score between a given query vector and different
document vectors. This
- * is primarily used by {@link KnnFloatVectorQuery} to run an exact,
exhaustive search over the
- * vectors.
+ * is used for exact searching and scoring
+ *
+ * @lucene.experimental
*/
-abstract class VectorScorer {
- protected final VectorSimilarityFunction similarity;
+public interface VectorScorer {
/**
- * Create a new vector scorer instance.
+ * Compute the score for the current document ID.
*
- * @param context the reader context
- * @param fi the FieldInfo for the field containing document vectors
- * @param query the query vector to compute the similarity for
+ * @return the score for the current document ID
+ * @throws IOException if an exception occurs during score computation
*/
- static FloatVectorScorer create(LeafReaderContext context, FieldInfo fi,
float[] query)
- throws IOException {
- FloatVectorValues values = context.reader().getFloatVectorValues(fi.name);
- if (values == null) {
- FloatVectorValues.checkField(context.reader(), fi.name);
- return null;
- }
- final VectorSimilarityFunction similarity =
fi.getVectorSimilarityFunction();
- return new FloatVectorScorer(values, query, similarity);
- }
-
- static ByteVectorScorer create(LeafReaderContext context, FieldInfo fi,
byte[] query)
- throws IOException {
- ByteVectorValues values = context.reader().getByteVectorValues(fi.name);
- if (values == null) {
- ByteVectorValues.checkField(context.reader(), fi.name);
- return null;
- }
- VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction();
- return new ByteVectorScorer(values, query, similarity);
- }
-
- VectorScorer(VectorSimilarityFunction similarity) {
- this.similarity = similarity;
- }
+ float score() throws IOException;
- /** Compute the similarity score for the current document. */
- abstract float score() throws IOException;
-
- abstract boolean advanceExact(int doc) throws IOException;
+ /**
+ * @return a {@link DocIdSetIterator} over the documents.
+ */
+ DocIdSetIterator iterator();
Review Comment:
Your message, @benwtrent, said in this scheme iterators create scorers, but
here it looks like the "normal" way?
##########
lucene/core/src/java/org/apache/lucene/search/VectorScorer.java:
##########
@@ -18,64 +18,39 @@
import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
-import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FloatVectorValues;
-import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.VectorSimilarityFunction;
/**
* Computes the similarity score between a given query vector and different
document vectors. This
- * is primarily used by {@link KnnFloatVectorQuery} to run an exact,
exhaustive search over the
- * vectors.
+ * is used for exact searching and scoring
+ *
+ * @lucene.experimental
*/
-abstract class VectorScorer {
- protected final VectorSimilarityFunction similarity;
+public interface VectorScorer {
/**
- * Create a new vector scorer instance.
+ * Compute the score for the current document ID.
*
- * @param context the reader context
- * @param fi the FieldInfo for the field containing document vectors
- * @param query the query vector to compute the similarity for
+ * @return the score for the current document ID
+ * @throws IOException if an exception occurs during score computation
*/
- static FloatVectorScorer create(LeafReaderContext context, FieldInfo fi,
float[] query)
- throws IOException {
- FloatVectorValues values = context.reader().getFloatVectorValues(fi.name);
- if (values == null) {
- FloatVectorValues.checkField(context.reader(), fi.name);
- return null;
- }
- final VectorSimilarityFunction similarity =
fi.getVectorSimilarityFunction();
- return new FloatVectorScorer(values, query, similarity);
- }
-
- static ByteVectorScorer create(LeafReaderContext context, FieldInfo fi,
byte[] query)
- throws IOException {
- ByteVectorValues values = context.reader().getByteVectorValues(fi.name);
- if (values == null) {
- ByteVectorValues.checkField(context.reader(), fi.name);
- return null;
- }
- VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction();
- return new ByteVectorScorer(values, query, similarity);
- }
-
- VectorScorer(VectorSimilarityFunction similarity) {
- this.similarity = similarity;
- }
+ float score() throws IOException;
- /** Compute the similarity score for the current document. */
- abstract float score() throws IOException;
-
- abstract boolean advanceExact(int doc) throws IOException;
+ /**
+ * @return a {@link DocIdSetIterator} over the documents.
+ */
+ DocIdSetIterator iterator();
Review Comment:
But I guess the `VectorValues` will return ` VectorScorer`? In that case do
we need this `iterator()` method since presumably the caller already has an
iterator over docs with vectors in this field?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]