benwtrent commented on PR #13200:
URL: https://github.com/apache/lucene/pull/13200#issuecomment-2040503529
@jimczi I match your wall of text with my own :).
> I am a bit concerned about the generalization here. The whole similarity
is currently modeled around the usage of the HNSW codec that assumes that
vectors are compared randomly. This makes the interface heavily geared towards
this use case.
Yeah, I hear ya. That is the core difficulty with any pluggable
infrastructure: finding a general enough API.
> It also assumes that we always have access to the raw vectors to perform
the similarity which is a false premise if we think about product quantization,
LSH or any transformational codec.
++
> I wonder if we took the similarity too far in that aspect. In my opinion
the similarity should be set at the knn format level and the options could
depend on what the format can provide. For HNSW and flat codec, they could
continue to share the simple enum we have today with an option to override the
implementation in a custom codec. Users that want to implement funky
similarities on top could do it in a custom codec that overrides the base ones.
We can make this customization more easily accessible in our base codecs if
needed.
While writing this pluggable interface and looking at all the assumptions it
has to make, I kept coming back to “Wait, don’t we already have a pluggable
thing for customizing fields? Yeah, it’s our codecs…”
And many hyper-optimized implementations of various vector similarities
would have to know how the vectors are laid out in memory. That logic and work
should be coupled to the codec.
> The point of the base codecs is to provide good out of the box
functionalities that work in all cases. Blindly accepting any type of
similarities generally is a recipe for failure, we should consider adding a new
similarity as something very expert that requires dealing with a new format
entirely.
++
> I am also keen to reopen the discussion around simplifying the similarity
we currently support.
Agreed, I am going to open a PR soon to deprecate cosine. It’s probably the
most useless one we have now.
> I personally like the fact that it’s a simple enum with very few values.
The issue in how it is exposed today is because each value is linked with an
implementation. I think it would be valuable to make the implementation of
similarities a detail that each knn format needs to provide. Defining similarity
independently of the format complicates matters without much benefit. The only
perceived advantage currently is ensuring consistent scoring when querying a
field with different knn formats within a single index. However, I question the
practicality and necessity of this capability.
The scoring consistency is a nice benefit. But tying implementation and
scoring to the codec does give us a natural way to deprecate similarities and
evolve which ones we support.
For example, I could see `dot_product` and `maximum_inner_product` being
merged into one similarity in the future. The main sticking point is the danged
scoring, as the score scaling for the `dot_product` similarity is so unique and
different.
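To make the mismatch concrete, here is a small sketch of the two score
scalings as I understand them from Lucene's `VectorSimilarityFunction` docs
(treat the exact formulas as an assumption, not a spec):

```python
def dot_product_score(dot: float) -> float:
    # DOT_PRODUCT: assumes unit-length vectors, so dot is in [-1, 1];
    # linearly rescaled into [0, 1]
    return (1.0 + dot) / 2.0

def max_inner_product_score(dot: float) -> float:
    # MAXIMUM_INNER_PRODUCT: the raw inner product is unbounded, so the
    # scaling is piecewise -- it just has to keep scores positive
    return dot + 1.0 if dot >= 0 else 1.0 / (1.0 - dot)

# Same raw value, very different score shapes:
print(dot_product_score(0.5))         # 0.75
print(max_inner_product_score(0.5))   # 1.5
print(max_inner_product_score(-3.0))  # 0.25
```

The two curves agree on nothing except monotonicity, which is exactly why
merging the similarities would break score compatibility.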
> If we were to start again I’d argue that just supporting dot-product would
be enough and cosine would be left out. I think we can still do that in Lucene
10 and provide the option to normalize the vectors during indexing/querying.
I agree. As for providing an option for normalizing, Lucene already has an
optimized VectorUtil function for this.
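To illustrate why normalize-at-index-time makes cosine redundant, here is a
minimal sketch; `l2_normalize` below is a plain-Python stand-in for Lucene's
optimized routine, not its actual implementation:

```python
import math

def l2_normalize(v):
    # scale v to unit length (stand-in for Lucene's VectorUtil helper)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# cosine on raw vectors == dot product on the normalized vectors,
# so normalizing once at index/query time lets dot_product cover cosine
assert math.isclose(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```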
> My main question here is whether the similarity should be abstracted at
the knn format level.
> In my opinion, framing similarity as a universal interface for all knn
formats is misleading and could hinder the implementation of other valid knn
formats.
It is true that some vector storage mechanisms would disallow certain
similarities while others would allow more particular ones (for example,
`hamming` only makes sense for binary values stored as bytes and would fail
for floating-point vectors).
I have to think about this more. I agree; stepping back and looking at the
code required to support pluggable similarities, and the various
backwards-compatibility (and capability in general) woes it could incur, is
frustrating. It may not be worth it at all given the cost.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]