benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2045604262
I do think things like `ColBERT` would benefit from having multiple vectors
for a single document field.
One crazy idea I had (others have probably already thought of this and
found it wanting...): since HNSW already supports non-Euclidean spaces, what
if HNSW graph nodes simply represented more than one vector?
Then the flat storage system and underlying scorer could handle the distance
computations and HNSW itself doesn't actually have to change.
I could see this hurting recall, but I wonder how badly it would actually
hurt in practice.
The idea would be:
- A new FlatVectorFormat type that allows more than one vector (or possibly
extending the existing ones)
- That type would provide a scorer to HNSW that resolves the multi-vector
scores by applying a particular aggregation to the scores of the individual
vectors. This could be "max", "min", "avg", "sum", or something similar.
- Then we need to test the graph's recall for individual vectors, since a
query could be one vector (regular passage search) or multiple (ColBERT).
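To make the aggregation step concrete, here is a rough sketch of how a
single query vector might be scored against a node that holds several
vectors. Everything in it is hypothetical: the class name, the `Aggregation`
enum, and the choice of dot product as the similarity are assumptions for
illustration, not existing Lucene APIs.

```java
/** Illustrative only: none of these names exist in Lucene. */
final class MultiVectorAggregationSketch {

  /** The aggregations named in the list above. */
  enum Aggregation { MAX, MIN, AVG, SUM }

  /** Plain dot product, assumed here as the per-vector similarity. */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Scores one query vector against all vectors of one document/node. */
  static float score(float[] query, float[][] docVectors, Aggregation agg) {
    if (docVectors.length == 0) {
      throw new IllegalArgumentException("node has no vectors");
    }
    float max = Float.NEGATIVE_INFINITY;
    float min = Float.POSITIVE_INFINITY;
    float sum = 0f;
    for (float[] v : docVectors) {
      float s = dot(query, v);
      max = Math.max(max, s);
      min = Math.min(min, s);
      sum += s;
    }
    switch (agg) {
      case MAX:
        return max;
      case MIN:
        return min;
      case AVG:
        return sum / docVectors.length;
      default:
        return sum; // SUM
    }
  }
}
```

Which aggregation keeps recall acceptable is exactly the open question
here; the sketch only shows that the choice can live entirely in the scorer.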
HNSW doesn't actually look at the vectors at all; it simply provides an
ordinal and requests a score, so I think the code change wouldn't be too
bad.
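As a rough illustration of that ordinal-in, score-out contract, the sketch
below uses a hypothetical `OrdinalScorer` interface (a stand-in, not
Lucene's actual scorer API) backed by flat storage in which each ordinal
maps to several vectors. The `bestOrdinal` loop is a deliberately naive
placeholder for graph traversal, and "max" over dot products is just one
assumed aggregation.

```java
/** Illustrative only: none of these names exist in Lucene. */
final class OrdinalScorerSketch {

  /** Hypothetical stand-in for the scorer the graph would call. */
  interface OrdinalScorer {
    float score(int ordinal);
  }

  /**
   * Flat storage: vectors[ordinal] holds every vector of that graph node.
   * The returned scorer hides the multi-vector detail behind the ordinal.
   */
  static OrdinalScorer maxScorer(float[][][] vectors, float[] query) {
    return ordinal -> {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] v : vectors[ordinal]) {
        float s = 0f;
        for (int i = 0; i < query.length; i++) {
          s += query[i] * v[i];
        }
        best = Math.max(best, s);
      }
      return best;
    };
  }

  /** Placeholder for graph search: it only ever sees ordinals and scores. */
  static int bestOrdinal(int[] candidates, OrdinalScorer scorer) {
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int ord : candidates) {
      float s = scorer.score(ord);
      if (s > bestScore) {
        bestScore = s;
        best = ord;
      }
    }
    return best;
  }
}
```

Nothing in `bestOrdinal` knows how many vectors sit behind an ordinal, which
is the property that would let the graph code stay unchanged.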
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]