kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3304639252
@mikemccand I was able to hack luceneutil to perform the following benchmark:
- Take an additional input `filterFactor`, where documents with
`ID % filterFactor == 0` are considered "live" (so `filterFactor` = 2 means
50% of docs are "live", `filterFactor` = 5 means 20% are "live", and so on)
- Take another input `filterStrategy`, with possible values:
  - `index-time`: create a separate vector field (and corresponding HNSW
    graph) with _just_ the filtered documents
  - `query-time`: perform a pre-filtered search on the large graph
- The changes I used are in
https://github.com/kaivalnp/luceneutil/tree/index-time-filtering (warning:
rough changes!)
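As a rough illustration of the `filterFactor` rule above (this is not the actual luceneutil change; `live_mask` and `live_fraction` are hypothetical names, and 0-based doc IDs are an assumption), the live set could be sketched in Python as:

```python
def live_mask(num_docs: int, filter_factor: int) -> list[bool]:
    """Mark doc IDs 0..num_docs-1 as live when ID % filter_factor == 0.

    Assumption: IDs are 0-based and contiguous; the real benchmark may differ.
    """
    if filter_factor < 1:
        raise ValueError("filter_factor must be >= 1")
    return [doc_id % filter_factor == 0 for doc_id in range(num_docs)]


def live_fraction(num_docs: int, filter_factor: int) -> float:
    """Fraction of documents that pass the filter."""
    mask = live_mask(num_docs, filter_factor)
    return sum(mask) / num_docs


if __name__ == "__main__":
    # Same filterFactor values and nDoc as in the benchmark runs.
    for ff in (2, 5, 10, 20):
        print(f"filterFactor={ff}: {live_fraction(200_000, ff):.0%} of docs live")
```

With `nDoc` = 200000 this gives live fractions of 1/2, 1/5, 1/10 and 1/20 for the four settings.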
Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT
`query-time` filtering:
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterFactor  indexType
 0.882        3.959   3.957        1.000  200000   100      50       32        200         no     7759     10.95      18268.18           13.12             1          600.70      query-time             2       HNSW
 0.888        3.932   3.931        1.000  200000   100      50       32        200         no     6856     10.59      18878.61           13.08             1          600.85      query-time             5       HNSW
 0.877        3.106   3.105        1.000  200000   100      50       32        200         no     4780     10.76      18592.54           12.89             1          600.73      query-time            10       HNSW
 0.830        2.313   2.312        0.999  200000   100      50       32        200         no     2814     10.92      18316.70           12.88             1          600.72      query-time            20       HNSW
```
`index-time` filtering:
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterFactor  indexType
 0.929        1.199   1.198        0.999  200000   100      50       32        200         no     4349     15.05      13291.69           18.42             1          901.09      index-time             2       HNSW
 0.952        1.040   1.039        0.999  200000   100      50       32        200         no     4084     13.87      14422.73           17.97             1          720.66      index-time             5       HNSW
 0.967        0.862   0.861        0.999  200000   100      50       32        200         no     3763     12.90      15501.47           16.01             1          660.61      index-time            10       HNSW
 0.980        0.670   0.669        0.998  200000   100      50       32        200         no     3287     12.61      15861.69           15.93             1          630.74      index-time            20       HNSW
```
...and the gains are apparent (~70% lower latency for filtered search, along with higher recall)!
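To make the ~70% figure concrete, here is a small Python sketch that recomputes the latency reduction from the `latency(ms)` column of the two tables above. The dictionary and function names are hypothetical; only the numbers come from the benchmark output.

```python
# latency(ms) copied from the two tables above, keyed by filterFactor.
QUERY_TIME_MS = {2: 3.959, 5: 3.932, 10: 3.106, 20: 2.313}
INDEX_TIME_MS = {2: 1.199, 5: 1.040, 10: 0.862, 20: 0.670}


def latency_reduction(filter_factor: int) -> float:
    """Fraction by which index-time filtering lowers latency vs. query-time."""
    before = QUERY_TIME_MS[filter_factor]
    after = INDEX_TIME_MS[filter_factor]
    return 1.0 - after / before


for ff in sorted(QUERY_TIME_MS):
    print(f"filterFactor={ff}: {latency_reduction(ff):.1%} lower latency")
```

Every setting lands close to a 70% reduction, which corresponds to `query-time` latency being roughly 3.3x to 3.8x the `index-time` latency.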
A couple of things to note:
- The `index_size(MB)` is larger for `index-time` filters, because we're
creating a new field without de-duping vectors -- this will be improved once we
de-dup vectors
- The graph search time for `index-time` filtered search will be higher in
reality, as we'll lose some data locality benefits when the user adds
additional vector fields
- This benchmark just measures graph search time, and not the overhead to
create and maintain a pre-filter `BitSet`, so the true gains with `index-time`
filtering are probably higher
- This speedup is not free -- the user pays by moving cost up-front to
indexing, see `index(s)` and `force_merge(s)`
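The up-front cost mentioned in the notes above can be estimated from the same tables. The following Python sketch (hypothetical names, numbers copied from the `index(s)`, `force_merge(s)` and `index_size(MB)` columns) computes the relative overhead of `index-time` over `query-time` for each `filterFactor`:

```python
# (index(s), force_merge(s), index_size(MB)) per filterFactor, from the tables.
QUERY_TIME = {
    2: (10.95, 13.12, 600.70),
    5: (10.59, 13.08, 600.85),
    10: (10.76, 12.89, 600.73),
    20: (10.92, 12.88, 600.72),
}
INDEX_TIME = {
    2: (15.05, 18.42, 901.09),
    5: (13.87, 17.97, 720.66),
    10: (12.90, 16.01, 660.61),
    20: (12.61, 15.93, 630.74),
}


def overhead(filter_factor: int) -> tuple[float, float, float]:
    """Relative increase (index time, force-merge time, index size)."""
    base = QUERY_TIME[filter_factor]
    extra = INDEX_TIME[filter_factor]
    index_s, merge_s, size_mb = (e / b - 1.0 for e, b in zip(extra, base))
    return index_s, merge_s, size_mb


for ff in sorted(QUERY_TIME):
    index_s, merge_s, size_mb = overhead(ff)
    print(f"filterFactor={ff}: index +{index_s:.0%}, "
          f"force_merge +{merge_s:.0%}, size +{size_mb:.0%}")
```

In these runs the size overhead tracks 1/`filterFactor` closely (about +50% at `filterFactor` = 2 and about +5% at 20), which is consistent with the extra field storing a second copy of the live vectors.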