benwtrent commented on code in PR #13308:
URL: https://github.com/apache/lucene/pull/13308#discussion_r1575385344
##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestInt8HnswBackwardsCompatibility.java:
##########
@@ -50,8 +51,14 @@ public class TestInt8HnswBackwardsCompatibility extends
BackwardsCompatibilityTe
private static final String KNN_VECTOR_FIELD = "knn_field";
private static final int DOC_COUNT = 30;
private static final FieldType KNN_VECTOR_FIELD_TYPE =
- KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
+ KnnFloatVectorField.createFieldType(3,
VectorSimilarityFunction.DOT_PRODUCT);
private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f};
+ private static final float[] NORMALIZED_KNN_VECTOR = new
float[KNN_VECTOR.length];
Review Comment:
Same sort of comment as above on versioning and that we index fields into an
old index.
##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBasicBackwardsCompatibility.java:
##########
@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends
BackwardsCompatibilityTestB
private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION =
LUCENE_9_0_0.major;
private static final String KNN_VECTOR_FIELD = "knn_field";
private static final FieldType KNN_VECTOR_FIELD_TYPE =
- KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
+ KnnFloatVectorField.createFieldType(3,
VectorSimilarityFunction.DOT_PRODUCT);
Review Comment:
I am not 100% sure this will work as simply as this. We will index NEW
fields into an index created with the OLD code (see where we call `addDoc`),
meaning you will be trying to index a `dot_product` knn field into an index
with a `cosine` field.
You will have to handle the index versions from before a certain version
(thus on cosine) to ones created after (and thus dot-product).
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
}
},
- /**
- * Cosine similarity. NOTE: the preferred way to perform cosine similarity
is to normalize all
- * vectors to unit length, and instead use {@link
VectorSimilarityFunction#DOT_PRODUCT}. You
- * should only use this function if you need to preserve the original
vectors and cannot normalize
- * them in advance. The similarity score is normalised to assure it is
positive.
- */
- COSINE {
- @Override
- public float compare(float[] v1, float[] v2) {
- return Math.max((1 + cosine(v1, v2)) / 2, 0);
- }
-
- @Override
- public float compare(byte[] v1, byte[] v2) {
- return (1 + cosine(v1, v2)) / 2;
- }
- },
-
Review Comment:
I think this will get tricky. If you remove this ordinal (I think it would
be wise to deprecate first, work through the warnings, etc. get everything
passing, etc.), field info & knn vector format readers will need to somehow
distinguish between "Max inner product" (whose ordinal decreases to cosine's)
and "cosine but in an old index version".
Maybe on write, we adjust the ordinal dynamically, knowing what they are
(thus correcting MIP's ordinal on write, allowing for readers to distinguish).
##########
lucene/core/src/java/org/apache/lucene/util/quantization/ScalarQuantizedRandomVectorScorer.java:
##########
@@ -30,23 +28,6 @@
public class ScalarQuantizedRandomVectorScorer
extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> {
- public static float quantizeQuery(
Review Comment:
Even in Lucene 10, we will need to support quantizing a query sent to a
Lucene index that uses Cosine.
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
}
},
- /**
- * Cosine similarity. NOTE: the preferred way to perform cosine similarity
is to normalize all
Review Comment:
I think it would be good to deprecate this first, get all the internal paths
updated to not use it.
Then the deprecation can be merged and backported. Then work on removing it
from main.
Mostly because the deprecation path and internal code usage will all have to
occur anyways :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]