Hi all again,

Just following up on this myself, and to see whether anyone else out there who is also using KNN (nearest-neighbour) vector searches in SOLR has been thinking about this.
There is an interesting document on Pinecone's website on HNSW, specifically about databases that "bolt on" vector indexes. (Of course Pinecone will be a little biased when talking about other databases with vector indexes :-) but it does seem to be talking about exactly what I'm trying to understand.)

They say:

"HNSW was designed for relatively static datasets, though, and production applications are anything but static. As your data changes, so do the vector embeddings that represent your data, and, consequently, so must the vector index" ... "Periodically rebuild the index, and either tolerate downtime during the rebuild process or manage a blue-green deployment each time".

So when I have a SOLR collection primarily used for vector searches, and I'm creating and deleting documents throughout the day, I'm wondering whether the vector index is becoming out of date, maybe with poorer and poorer results, because I'm never rebuilding the SOLR index from scratch (as their second quote above recommends).

Maybe I'll have to rebuild the SOLR index periodically? (The "blue-green" method is probably the way to go.) Unless SOLR does somehow rebuild the HNSW vector index in the background? But I haven't seen any documentation suggesting this.

If anyone does have any knowledge on this, or results from their own experiments, I'm very interested.

Derek

p.s. I'm very interested in sticking with SOLR for the vector queries because I want to introduce vector searches WITH other, traditional, queries, and I think the later versions of SOLR are better at this (and not just as a filter query on the knn results). I haven't had a chance to try this out more recently, though; I only experimented with mixing vector and traditional options in an earlier version of SOLR.
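One way to put a number on "poorer and poorer results" would be to compare what a knn query returns against an exact brute-force ranking over the same vectors, and to track that recall figure over time (and before and after a full re-index). The following is only a minimal sketch in Python, not a tested tool: the function names and the field name `vector` are hypothetical, the vectors are assumed to be non-zero and to have been exported from the collection by some other means, and fetching the approximate ids from SOLR is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, docs, k):
    """Brute-force ground truth: ids of the k documents most similar to query.

    docs maps a document id to its embedding (a list of floats).
    """
    ranked = sorted(
        docs,
        key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


def solr_knn_query(field, vector, top_k):
    """Build a q parameter for the knn query parser (field name is an assumption)."""
    values = ", ".join(str(v) for v in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, values)
```

The idea would be to run the string from `solr_knn_query` as the `q` parameter for a fixed set of sample embeddings, pass the returned ids to `recall_at_k` together with the output of `exact_top_k`, and see whether the average drops as the nightly add and remove batches accumulate.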
On Fri, Nov 1, 2024 at 2:20 PM Derek C <de...@hssl.ie> wrote:

> Hi all,
>
> This is something I'm unsure about:
>
> We have a SOLR collection of documents with knn_vector_cosine embedding
> fields which we use to run nearest neighbor searches. We have replicas but
> no shards so every node has the entire core/collection of documents.
>
> We are adding and removing documents all the time in nightly add and
> remove batches.
>
> What I'm wondering is: -
>
> Because HNSW relates vertices to each other I don't know if, or how, SOLR
> "re-indexes" data as new documents are added and removed. So :
> 1) Does the accuracy (of nearest neighbours to a supplied embedding) get
> worse over time?
> 2) If I deleted all documents in the collection and re-loaded them (so
> re-indexed them) would I get different (better?) results to a
> nearest-neighbour knn query ?
>
> thank you for any information on this
>
> Derek
>
>
> --
> Derek Conniffe
> Skype: dconnrt
> Email: de...@hssl.ie
>
>
> *Disclaimer:* This email and any files transmitted with it are
> confidential and intended solely for the use of the individual or entity to
> whom they are addressed. If you have received this email in error please
> delete it (if you are not the intended recipient you are notified that
> disclosing, copying, distributing or taking any action in reliance on the
> contents of this information is strictly prohibited).
> *Warning*: Although HSSL have taken reasonable precautions to ensure no
> viruses are present in this email, HSSL cannot accept responsibility for
> any loss or damage arising from the use of this email or attachments. For
> the Environment, please only print this email if necessary.
--
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 449 6044
Skype: dconnrt
Email: de...@hssl.ie