Hi Kent,

That's very interesting. We have been thinking about reducing (down-scaling) our dense vectors from 512 to 64 dimensions, perhaps using PCA. We have about 2.5 million documents, and in some testing (with Apache JMeter) we start to have performance problems after about 10 concurrent requests (SOLR seems to stall until we reduce the load for a while), so reduced embedding sizes may really help with this.
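As an aside, the PCA down-scaling mentioned above (512 to 64 dimensions) can be sketched roughly like this. This is a minimal illustration using numpy's SVD, not a tested pipeline; the sample sizes and names are made up:

```python
import numpy as np

def fit_pca(vectors: np.ndarray, out_dim: int = 64):
    """Fit a PCA projection on a sample of embeddings.

    vectors: (n_samples, 512) array.
    Returns (mean, components) where components is (512, out_dim).
    """
    mean = vectors.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal directions,
    # ordered by decreasing explained variance
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:out_dim].T

def project(vectors: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Down-scale 512-dim vectors to out_dim dims."""
    return (vectors - mean) @ components

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 512)).astype(np.float32)  # stand-in embeddings
mean, comps = fit_pca(train, out_dim=64)
reduced = project(train, mean, comps)
print(reduced.shape)  # (1000, 64)
```

One caveat: if the index uses cosine similarity, the projected vectors should probably be re-normalised before indexing, since PCA does not preserve vector norms.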
Just out of curiosity: when you were testing with up to 160M documents with 512-element embeddings, were you using a single massive computer? I've found that performance is OK/usable with 64 GB of RAM, where SOLR has 30 GB and the O/S has the remainder, with the SOLR collection/core being around 20 GB, so within the amount the O/S can cache for disk I/O.

Derek

On Mon, Feb 27, 2023 at 5:16 AM Kent Fitch <kent.fi...@gmail.com> wrote:
> Hi Derek,
>
> I have been trying a few settings with HNSW in Lucene/SOLR, and whilst my experiences may not be directly relevant to you, they may provide some background.
>
> My tests have been with an index of up to 160M records containing a 512-element byte embedding. The original embeddings were of text articles (average length about 450 words) generated by OpenAI's ada-002 as 1536 floats, then encoded as 512 bytes by quantising each group of 3 floats to 1 byte using PQ encoding, following the method described here:
> https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
>
> The motivation for PQ encoding is basically to reduce index size. A first attempt at encoding the floats as bytes worked well (I tried to minimise error by analysing the distribution of float values across the 1536 dimensions, and noticed that all but 5 of the dimensions had a very narrow range for most embeddings; using k-means clustering to find 256 values for those dimensions, and another 256 values for the 5 "outlier" dimensions, yielded good results). However, each vector still occupied 1536 bytes, and HNSW really needs these to be in memory, as otherwise the I/Os, even to RAID 10 NVMe devices connected to their own PCIe 3 lanes, will cause slow query rates. So quantising 3 floats into 1 byte was attractive. Again, I used k-means on each of the 512 groups of 3 floats to get 256 "centroids" per group to minimise error.
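The PQ scheme Kent describes (one 256-entry k-means codebook per group of 3 floats, so 1536 floats become 512 bytes) could be sketched like this. This is a simplified illustration with a toy k-means, not Kent's actual code; the demo uses 12-dim vectors (4 groups) so it runs quickly, where the real case would be 1536 dims and 512 groups:

```python
import numpy as np

def kmeans(points: np.ndarray, k: int = 256, iters: int = 10, seed: int = 0):
    """Tiny k-means (Lloyd's algorithm): returns (k, dim) centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def train_pq(vectors: np.ndarray, group: int = 3):
    """One 256-entry codebook per group of `group` dims.
    For (n, 1536) input this yields 512 codebooks, each (256, 3)."""
    _, dim = vectors.shape
    return [kmeans(vectors[:, g:g + group]) for g in range(0, dim, group)]

def encode(vec: np.ndarray, codebooks) -> bytes:
    """Quantise one vector to one byte per group (nearest centroid's index)."""
    codes = []
    for i, cb in enumerate(codebooks):
        seg = vec[i * 3:(i + 1) * 3]
        codes.append(int(((cb - seg) ** 2).sum(axis=1).argmin()))
    return bytes(codes)

rng = np.random.default_rng(1)
train = rng.normal(size=(600, 12)).astype(np.float32)  # real use: (n, 1536)
books = train_pq(train)      # 4 codebooks here; 512 for 1536-dim embeddings
code = encode(train[0], books)
print(len(code))  # 4 bytes here; would be 512 for a 1536-dim embedding
```

In practice one would train the codebooks on a representative sample and use a proper k-means implementation; the structure (per-group codebooks, one byte per group) is the point.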
> The downside of this approach is the need to define a custom similarity that reads, at initialisation, the 512 centroid tables (each with 256 mappings to expand a byte code to 3 floating-point numbers representing a "centroid" point).
>
> Anyway, the loss caused by this mapping is real but not particularly consequential: some result lists are slightly degraded/reordered, but HNSW is an "approximate nearest neighbour" search anyway.
>
> How sure are you that the unexpected search results you are reporting are caused by the HNSW ANN rather than by the encoding? For example, if you run an exhaustive search on your 2M records to find the "real" nearest neighbours to some point representing a base document, how do the results differ from your HNSW search with various search beamwidths (provided as the "k" parameter on the KnnByteVectorQuery constructor)?
>
> Although not directly relevant to your use-case, here are results I'm seeing on an index of 160M documents with an ada-002 embedding quantised to 512 bytes, using a recent (11 Feb 23) Lucene build with an "M" of 64, a construction "beamwidth" of 120, and a custom similarity:
>
> with a search "k" of 1, the "real" closest match is returned 56% of the time and requires 18K similarity comparisons
> with a search "k" of 2, the "real" closest match is returned as the top match 61% of the time and requires 22K comparisons
> with a "k" of 3: 64%, 24K comparisons
> with a "k" of 5: 70%, 29K
> with a "k" of 10: 78%, 37K
> with a "k" of 20: 87%, 48K
> with a "k" of 50: 94%, 63K
> with a "k" of 120: 97%, 121K
>
> The nature of the embeddings I loaded is that many are very similar (basically, randomish variations on a much smaller set of "base" articles, as we couldn't afford to get embeddings for 160M articles for this test; we are just trying to test whether Lucene's HNSW is feasible for our use-case), so in the overwhelming majority of "misses", the top article is indeed very similar to the article sought.
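Kent's suggestion to separate encoding loss from ANN loss can be illustrated with a small experiment: run an exhaustive (exact) search against the original vectors and against the lossily-encoded vectors, and see how often the top-1 result agrees. Any disagreement here is due purely to the encoding, not to HNSW. The "encoding" below is a crude stand-in (rounding), just to show the measurement:

```python
import numpy as np

rng = np.random.default_rng(3)
docs = rng.normal(size=(2000, 64)).astype(np.float32)
# crude stand-in for a lossy encoding such as PQ: round to 1 decimal place
quantised = np.round(docs, 1)

def top1(query: np.ndarray, vectors: np.ndarray) -> int:
    """Exhaustive (exact) nearest neighbour by dot-product similarity."""
    return int((vectors @ query).argmax())

queries = rng.normal(size=(100, 64)).astype(np.float32)
agree = sum(top1(q, docs) == top1(q, quantised) for q in queries)
print(f"encoding-only recall@1: {agree / len(queries):.2f}")
```

Running the same queries through HNSW and comparing against the exact results on the *encoded* vectors would then isolate the ANN's contribution, which is what Kent's k-vs-recall table above measures.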
> That is, for our use case, the results are satisfactory, even with the "down-scaling" of the embedding to 512 bytes.
>
> best regards
>
> Kent Fitch
>
> On Mon, Feb 27, 2023 at 5:02 AM Derek C <de...@hssl.ie> wrote:
> > Hi all,
> >
> > I'm a bit uncertain how KNN with HNSW works in SOLR with dense vector fields and searching.
> >
> > Recently I've been doing tests loading dense vectors after inferencing [images] and then checking the closest matches by eye, and the results look funny (very similar images not being the nearest results, as I'd normally expect).
> >
> > I'm unclear about HNSW in general (for example, what are the best policies, or a good guide or starting point, for choosing hnswMaxConnections and hnswBeamWidth values when you know the dense vector length (512) and you know you have 2 million+ documents).
> >
> > But one thing I'm wondering right now: with a dataset where documents have been added and removed over time, can this affect the KNN search (i.e. would it be better if all documents, or at least the dense vector field, had been indexed fresh)?
> >
> > BTW, I haven't yet moved from SOLR 9.0 to 9.1, but I do read that the HNSW codec has changed in some way so a reindex is required - I should probably try 9.1 (I would prioritise this if anyone says 9.1 gives better quality or better performance for KNN searches!).
> >
> > Thanks for any info!
> >
> > Derek
> >
> > --
> > Derek Conniffe
> > Harvey Software Systems Ltd T/A HSSL
> > Telephone (IRL): 086 856 3823
> > Telephone (US): (650) 449 6044
> > Skype: dconnrt
> > Email: de...@hssl.ie
> >
> > *Disclaimer:* This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed.
> > If you have received this email in error please delete it (if you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited).
> > *Warning*: Although HSSL have taken reasonable precautions to ensure no viruses are present in this email, HSSL cannot accept responsibility for any loss or damage arising from the use of this email or attachments.
> > P For the Environment, please only print this email if necessary.

--
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 449 6044
Skype: dconnrt
Email: de...@hssl.ie
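[Editor's note] For reference, the hnswMaxConnections and hnswBeamWidth parameters asked about in the quoted question are set on the field type in the Solr 9 schema. A sketch, with the documented default values (16 and 100) as starting points; the field and type names here are made up:

```xml
<!-- managed-schema: dense vector field for 512-dim embeddings -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="512"
           similarityFunction="cosine"
           knnAlgorithm="hnsw"
           hnswMaxConnections="16"
           hnswBeamWidth="100"/>
<field name="image_vector" type="knn_vector" indexed="true" stored="true"/>
```

Queries then go through the knn query parser, e.g. `q={!knn f=image_vector topK=10}[0.12, -0.03, ...]`. Higher hnswMaxConnections (Lucene's "M") and hnswBeamWidth improve graph quality and recall at the cost of indexing time and memory, which is consistent with the k/recall trade-off Kent reports above.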