Re: Custom Query Implementation

2024-12-02 Thread Mikhail Khludnev
Morning. I noticed a condition that chooses between the sparse and dense formats under the hood (https://github.com/apache/lucene/blob/6053e1e31378378f6d310a05ea6d7dcdfc45f48b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java#L108); perhaps it can meet your performance requirements.

Re: Custom Query Implementation

2024-12-02 Thread Viacheslav Dobrynin
Hi, thanks for the answer! I think this is similar to my initial implementation, where I built the query as follows (PyLucene):

    def build_query(query):
        builder = BooleanQuery.Builder()
        for term in torch.nonzero(query):
            field_name = to_field_name(term.item())
            value = query[

Re: Custom Query Implementation

2024-12-02 Thread Michael Sokolov
Another way is using postings - you can represent each dimension as a term (`dim0`, `dim1`, etc) and index those that occur in a document. To encode a value for a dimension you can either provide a custom term frequency, or index the term multiple times. Then when searching you can form a BooleanQu
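The postings idea above can be illustrated without Lucene. The following is a minimal pure-Python sketch, not Lucene's API: `build_postings` and `search` are hypothetical helper names, and plain dot-product scoring is an assumption, since the preview is cut off before it says how scores are combined. Integer weights stand in for custom term frequencies, which in Lucene are positive integers, so real-valued weights would presumably need quantizing first.

```python
from collections import defaultdict


def build_postings(docs):
    """Build an inverted index from {doc_id: {dimension: integer_weight}}.

    Each non-zero dimension becomes a term such as "dim3"; the weight plays
    the role of a custom term frequency stored in the posting.
    """
    postings = defaultdict(list)
    for doc_id, vector in docs.items():
        for dim, weight in vector.items():
            if weight:  # only dimensions that occur in the document are indexed
                postings[f"dim{dim}"].append((doc_id, weight))
    return postings


def search(postings, query, top_k=10):
    """Score documents by the dot product of query and document weights.

    Only the posting lists of the query's non-zero dimensions are visited,
    like an OR over one clause per dimension.
    """
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in postings.get(f"dim{dim}", ()):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties are broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


docs = {0: {0: 2, 3: 1}, 1: {3: 4}, 2: {1: 5}}
index = build_postings(docs)
print(search(index, {3: 1.0, 0: 0.5}))
```

Looking up a term's posting list is a single dictionary access, so the cost of a query depends on the lengths of the posting lists it touches rather than on the total number of dimensions.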

Re: Custom Query Implementation

2024-12-02 Thread Viacheslav Dobrynin
Hi, thanks for the reply. I haven't tried that. However, I do not fully understand how, in this case, an inverted index would be constructed for efficient search by terms (O(1) lookup for each term as a key)? On Mon, Dec 2, 2024 at 21:55, Patrick Zhai wrote: > Hi, have you tried to encode the sparse v

Re: Custom Query Implementation

2024-12-02 Thread Patrick Zhai
Hi, have you tried to encode the sparse vector yourself using a BinaryDocValuesField? One way I can think of is to encode it as (size, index_array, value_array) per doc. Intuitively, I feel this should be more efficient than one dimension per field if your dimensionality is high enough. Patrick On
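As an illustration of the (size, index_array, value_array) layout, here is a minimal Python sketch of packing a sparse vector into one byte string, the kind of payload a binary doc-values field could hold. The exact byte layout (little-endian 32-bit ints followed by 32-bit floats) and the helper names `encode_sparse`, `decode_sparse`, and `dot` are assumptions made for this example; the message does not specify them.

```python
import struct


def encode_sparse(indices, values):
    """Pack a sparse vector as: size (int32), indices (int32[]), values (float32[])."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    n = len(indices)
    return struct.pack(f"<i{n}i{n}f", n, *indices, *values)


def decode_sparse(blob):
    """Inverse of encode_sparse: returns (indices, values) as lists."""
    (n,) = struct.unpack_from("<i", blob, 0)
    indices = struct.unpack_from(f"<{n}i", blob, 4)
    values = struct.unpack_from(f"<{n}f", blob, 4 + 4 * n)
    return list(indices), list(values)


def dot(blob, query):
    """Dot product between an encoded document vector and a {dim: weight} query."""
    indices, values = decode_sparse(blob)
    return sum(v * query.get(i, 0.0) for i, v in zip(indices, values))


blob = encode_sparse([3, 17, 42], [0.5, 1.25, 2.0])
print(len(blob), decode_sparse(blob), dot(blob, {17: 2.0, 42: 1.0}))
```

With this layout each document costs 4 + 8n bytes for n non-zero entries, but scoring has to decode every candidate document's blob, which is the trade-off against a term-based inverted index.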

Re: Custom Query Implementation

2024-12-02 Thread Viacheslav Dobrynin
Hi! I need to index sparse vectors, whereas, as I understand it, KnnFloatVectorField is designed for dense vectors. Therefore, it seems that this approach will not work. On Sun, Dec 1, 2024 at 18:36, Mikhail Khludnev wrote: > Hi, > May it look like KnnFloatVectorField(... DOT_PRODUCT) > and KnnFloatVect

Re: HNSW graph `connectComponents()` method takes a very long time on random vectors

2024-12-02 Thread Viliam Ďurina
With random vectors, `HnswUtil.components` returns ~76k components on level 0 (in `HnswGraphBuilder`, line 445, in Lucene 10.0). With the first 100k vectors of SIFT1M, it finds 5 components. I don't know why that happens; I don't understand the algorithm well enough. I might look into that later, but fo