Re: [PR] Deprecate COSINE VectorSimilarity function [lucene]

via GitHub Mon, 22 Apr 2024 15:21:38 -0700


benwtrent commented on code in PR #13308:
URL: https://github.com/apache/lucene/pull/13308#discussion_r1575385344



##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestInt8HnswBackwardsCompatibility.java:
##########
@@ -50,8 +51,14 @@ public class TestInt8HnswBackwardsCompatibility extends 
BackwardsCompatibilityTe
   private static final String KNN_VECTOR_FIELD = "knn_field";
   private static final int DOC_COUNT = 30;
   private static final FieldType KNN_VECTOR_FIELD_TYPE =
-      KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
+      KnnFloatVectorField.createFieldType(3, 
VectorSimilarityFunction.DOT_PRODUCT);
   private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f};
+  private static final float[] NORMALIZED_KNN_VECTOR = new 
float[KNN_VECTOR.length];

Review Comment:
   Same sort of comment as above on versioning and that we index fields into an 
old index.



##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBasicBackwardsCompatibility.java:
##########
@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends 
BackwardsCompatibilityTestB
   private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = 
LUCENE_9_0_0.major;
   private static final String KNN_VECTOR_FIELD = "knn_field";
   private static final FieldType KNN_VECTOR_FIELD_TYPE =
-      KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
+      KnnFloatVectorField.createFieldType(3, 
VectorSimilarityFunction.DOT_PRODUCT);

Review Comment:
   I am not 100% sure this will work as simply as this. We will index NEW 
fields into an index created with the OLD code (see where we call `addDoc`), 
meaning you will be trying to index a `dot_product` knn field into an index 
with a `cosine` field.
   
   You will have to handle the index versions from before a certain version 
(thus on cosine) to ones created after (and thus dot-product).



##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
     }
   },
 
-  /**
-   * Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
-   * vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
-   * should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance. The similarity score is normalised to assure it is 
positive.
-   */
-  COSINE {
-    @Override
-    public float compare(float[] v1, float[] v2) {
-      return Math.max((1 + cosine(v1, v2)) / 2, 0);
-    }
-
-    @Override
-    public float compare(byte[] v1, byte[] v2) {
-      return (1 + cosine(v1, v2)) / 2;
-    }
-  },
-

Review Comment:
   I think this will get tricky. If you remove this ordinal (I think it would 
be wise to deprecate first, work through the warnings, etc. get everything 
passing, etc.), field info & knn vector format readers will need to somehow 
distinguish between "Max inner product" (whose ordinal decreases to cosine's) 
and "cosine but in an old index version".
   
   Maybe on write, we adjust the ordinal dynamically, knowing what they are 
(thus correcting MIP's ordinal on write, allowing for readers to distinguish). 



##########
lucene/core/src/java/org/apache/lucene/util/quantization/ScalarQuantizedRandomVectorScorer.java:
##########
@@ -30,23 +28,6 @@
 public class ScalarQuantizedRandomVectorScorer
     extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> {
 
-  public static float quantizeQuery(

Review Comment:
   Even in Lucene 10, we will need to support quantizing a query sent to a 
Lucene index that uses Cosine.



##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
     }
   },
 
-  /**
-   * Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all

Review Comment:
   I think it would be good to deprecate this first, get all the internal paths 
updated to not use it. 
   
   Then the deprecation can be merged and backported. Then work on removing it 
from main.
   
   Mostly because the deprecation path and internal code usage will all have to 
occur anyways :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Deprecate COSINE VectorSimilarity function [lucene]

Reply via email to