kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3304639252

   @mikemccand I was able to hack luceneutil to perform the following benchmark:
   - Take an additional input `filterFactor`, where documents with `ID % 
filterFactor == 0` are considered "live" (so `filterFactor` = 2 means 50% of 
docs are "live", `filterFactor` = 5 means 20% of docs are "live", and so on)
   - Take another input `filterStrategy` with possible values of:
        - `index-time`: create a separate vector field (and corresponding HNSW 
graph), with _just_ the filtered documents
        - `query-time`: perform a pre-filtered search on the large graph
   - Changes I used are in 
https://github.com/kaivalnp/luceneutil/tree/index-time-filtering (warning: 
rough changes!)
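
To make the two strategies concrete, here is a toy sketch in Python. It is not the luceneutil change and it does not use HNSW: it replaces graph search with an exact brute-force scan, and the names `is_live`, `build_index_time_field`, `query_time_search` and `index_time_search` are made up for illustration. With an exact scan both strategies return the same hits, so any recall or latency difference in a real benchmark comes from the approximate graph search, not from the filter semantics.

```python
import random


def is_live(doc_id: int, filter_factor: int) -> bool:
    """A doc is "live" when its ID is a multiple of filter_factor."""
    return doc_id % filter_factor == 0


def dot(a, b):
    """Inner product, standing in for MAXIMUM_INNER_PRODUCT scoring."""
    return sum(x * y for x, y in zip(a, b))


def top_k(candidates, query, k):
    """Exact top-k by inner product over (doc_id, vector) pairs."""
    scored = sorted(candidates, key=lambda dv: dot(dv[1], query), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


def query_time_search(docs, query, k, filter_factor):
    """query-time: search the full collection, skipping non-live docs."""
    live = ((d, v) for d, v in docs.items() if is_live(d, filter_factor))
    return top_k(live, query, k)


def build_index_time_field(docs, filter_factor):
    """index-time: materialize a second field holding only live docs."""
    return {d: v for d, v in docs.items() if is_live(d, filter_factor)}


def index_time_search(filtered_field, query, k):
    """index-time: search the small field with no filter at all."""
    return top_k(filtered_field.items(), query, k)


if __name__ == "__main__":
    rng = random.Random(42)
    docs = {i: [rng.uniform(-1, 1) for _ in range(8)] for i in range(1000)}
    query = [rng.uniform(-1, 1) for _ in range(8)]
    field = build_index_time_field(docs, filter_factor=5)
    print(len(field) / len(docs))  # fraction of live docs
    print(query_time_search(docs, query, 10, 5) == index_time_search(field, query, 10))
```

The second field duplicates the live vectors, which is the storage cost that shows up as a larger `index_size(MB)` in the `index-time` runs.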
   
   Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT
   
   `query-time` filtering:
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterFactor  indexType
    0.882        3.959   3.957        1.000  200000   100      50       32        200         no     7759     10.95      18268.18           13.12             1          600.70      query-time             2       HNSW
    0.888        3.932   3.931        1.000  200000   100      50       32        200         no     6856     10.59      18878.61           13.08             1          600.85      query-time             5       HNSW
    0.877        3.106   3.105        1.000  200000   100      50       32        200         no     4780     10.76      18592.54           12.89             1          600.73      query-time            10       HNSW
    0.830        2.313   2.312        0.999  200000   100      50       32        200         no     2814     10.92      18316.70           12.88             1          600.72      query-time            20       HNSW
   ```
   
   `index-time` filtering:
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterFactor  indexType
    0.929        1.199   1.198        0.999  200000   100      50       32        200         no     4349     15.05      13291.69           18.42             1          901.09      index-time             2       HNSW
    0.952        1.040   1.039        0.999  200000   100      50       32        200         no     4084     13.87      14422.73           17.97             1          720.66      index-time             5       HNSW
    0.967        0.862   0.861        0.999  200000   100      50       32        200         no     3763     12.90      15501.47           16.01             1          660.61      index-time            10       HNSW
    0.980        0.670   0.669        0.998  200000   100      50       32        200         no     3287     12.61      15861.69           15.93             1          630.74      index-time            20       HNSW
   ```
   
   ..and the gains are apparent: `index-time` filtering cuts filtered search 
latency by \~70% and also improves recall at every `filterFactor`!
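
As a quick check of the \~70% figure, the following Python sketch recomputes the latency reduction per `filterFactor` from the `latency(ms)` column of the two tables above. The dictionary and function names are illustrative only; the numbers are copied from the tables.

```python
# latency(ms) copied from the two tables, keyed by filterFactor
QUERY_TIME_MS = {2: 3.959, 5: 3.932, 10: 3.106, 20: 2.313}
INDEX_TIME_MS = {2: 1.199, 5: 1.040, 10: 0.862, 20: 0.670}


def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage by which latency drops going from before_ms to after_ms."""
    return 100.0 * (before_ms - after_ms) / before_ms


def summarize():
    """Return (filterFactor, reduction %, speedup factor) for each run."""
    rows = []
    for ff in sorted(QUERY_TIME_MS):
        q, i = QUERY_TIME_MS[ff], INDEX_TIME_MS[ff]
        rows.append((ff, round(latency_reduction_pct(q, i), 1), round(q / i, 2)))
    return rows


if __name__ == "__main__":
    for ff, pct, factor in summarize():
        print(f"filterFactor={ff:>2}  reduction={pct}%  speedup={factor}x")
```

In these four runs the reduction stays between roughly 70% and 74%, which corresponds to `index-time` search being a bit over 3x faster.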
   
   A couple of things to note:
   - The `index_size(MB)` is larger for `index-time` filters, because we're 
creating a new field without de-duping vectors -- this will be improved once we 
de-dup vectors
   - The graph search time for `index-time` filtered search will be higher in 
reality, as we'll lose some data locality benefits when the user adds 
additional vector fields
   - This benchmark just measures graph search time, and not the overhead to 
create and maintain a pre-filter `BitSet`, so the true gains with `index-time` 
filtering are probably higher
   - This speedup is not free -- the user pays by moving cost up-front to 
indexing, see `index(s)` and `force_merge(s)`
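
To put numbers on the up-front cost, this Python sketch computes the relative overhead of `index-time` over `query-time` for `index(s)`, `force_merge(s)` and `index_size(MB)`, using values copied from the tables above. The names are illustrative. One observation from these four runs only: the size overhead tracks `100 / filterFactor` percent closely, which is consistent with the note that the new field stores the live vectors a second time.

```python
# (query-time value, index-time value) per filterFactor, copied from the tables
INDEX_S = {2: (10.95, 15.05), 5: (10.59, 13.87), 10: (10.76, 12.90), 20: (10.92, 12.61)}
FORCE_MERGE_S = {2: (13.12, 18.42), 5: (13.08, 17.97), 10: (12.89, 16.01), 20: (12.88, 15.93)}
INDEX_SIZE_MB = {2: (600.70, 901.09), 5: (600.85, 720.66), 10: (600.73, 660.61), 20: (600.72, 630.74)}


def overhead_pct(query_time: float, index_time: float) -> float:
    """Extra cost of index-time filtering relative to query-time, in percent."""
    return 100.0 * (index_time - query_time) / query_time


def overheads(metric):
    """Map filterFactor -> overhead % for one metric dictionary."""
    return {ff: overhead_pct(q, i) for ff, (q, i) in metric.items()}


if __name__ == "__main__":
    for name, metric in [("index(s)", INDEX_S),
                         ("force_merge(s)", FORCE_MERGE_S),
                         ("index_size(MB)", INDEX_SIZE_MB)]:
        formatted = {ff: round(p, 1) for ff, p in overheads(metric).items()}
        print(name, formatted)
```

Indexing and force-merge time are higher in every `index-time` run, and the extra cost shrinks as `filterFactor` grows, since fewer documents go into the second field.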
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

