[ https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513596#comment-17513596 ]
Adrien Grand commented on SOLR-14397: ------------------------------------- Looks like this issue has been superseded by SOLR-15880? > Vector Search in Solr > --------------------- > > Key: SOLR-14397 > URL: https://issues.apache.org/jira/browse/SOLR-14397 > Project: Solr > Issue Type: Improvement > Reporter: Trey Grainger > Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > Search engines have traditionally relied upon token-based matching (typically > keywords) on an inverted index, plus relevance ranking based upon keyword > occurrence statistics. This can be viewed as a "sparse vector” match (where > each term is a one-hot encoded dimension in the vector), since only a few > keywords out of all possible keywords are considered in each query. With the > introduction of deep-learning-based transformers over the last few years, > however, the state of the art in relevance has moved to ranking models based > upon dense vectors that encode a latent, semantic understanding of both > language constructs and the underlying domain upon which the model was > trained. These dense vectors are also referred to as “embeddings”. An example > of this kind of embedding would be taking the phrase “chief executive officer > of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3] > . Other similar phrases should encode to vectors with very similar numbers, > so we may expect a query like “CEO of a technology org” to generate a vector > like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity > calculation between these vectors, we would expect a number closer to 1.0, > whereas a very unrelated text blurb would generate a much smaller cosine > similarity. > This is a proposal for how we should implement these vector search > capabilities in Solr. > h1. Search Process Overview: > In order to implement dense vector search, the following process is typically > followed: > h2. Offline: > An encoder is built. An encoder can take in text (a query, a sentence, a > paragraph, a document, etc.) and return a dense vector representing that > document in a rich semantic space. The semantic space is learned from > training on textual data (usually, though other sources work, too), typically > from the domain of the search engine. > h2. Document Ingestion: > When documents are processed, they are passed to the encoder, and the dense > vector(s) returned are stored as fields on the document. There could be one > or more vectors per-document, as the granularity of the vectors could be > per-document, per field, per paragraph, per-sentence, or even per phrase or > per term. > h2. Query Time: > *Encoding:* The query is translated to a dense vector by passing it to the > encoder > Quantization: The query is quantized. Quantization is the process of taking > a vector with many values and turning it into “terms” in a vector space that > approximates the full vector space of the dense vectors. > *ANN Matching:* A query on the quantized vector tokens is executed as an ANN > (approximate nearest neighbor) search. This allows finding most of the best > matching documents (typically up to 95%) with a traditional and efficient > lookup against the inverted index. > _(optional)_ *ANN Ranking*: ranking may be performed based upon the matched > quantized tokens to get a rough, initial ranking of documents based upon the > similarity of the query and document vectors. This allows the next step > (re-ranking) to be performed on a smaller subset of documents. > *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is > performed, a similarity calculation (cosine, dot-product, or any number of > other calculations) is typically performed between the full (non-quantized) > dense vectors for the query and those in the document. This re-ranking will > typically be on the top-N results for performance reasons. > *Return Results:* As with any search, the final step is typically to return > the results in relevance-ranked order. In this case, that would be sorted by > the re-ranking similarity score (i.e. “cosine descending”). > ------ > *Variant:* For small document sets, it may be preferable to rank all > documents and skip steps steps 2, 3, and 4. This is because ANN Matching > typically reduces recall (current state of the art is around 95% recall), so > it can be beneficial to rank all documents if performance is not a concern. > In this case, step 5 is performed on the full doc set and would obviously > just be considered “ranking” instead of “re-”ranking. > h1. Proposed Implementation in Solr: > h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity > * > h3. Dense Vector Field: > We will add a new dense vector field type in Solr. This field type would be a > compressed encoding of a dense vector into a BinaryDocValues Field. There are > other ways to do it, but this is almost certain to be the most efficient. > Ideally this field is multi-valued. If it is single-valued then we are > either limited to only document-level vectors, or otherwise we have to create > many vector fields (i.e. per paragraph or term) and search across them, which > will never be practical or scale well. BinaryDocValues does not natively > support multiple values, but since the Solr field type will need to be > responsible for encoding the numeric vectors as a byte[] it should also be > able to encode multiple vectors at the same time. > * > h3. Vector Similarity Value Sources: > Implement function queries (value sources) that take in a query vector, a > (multi-valued) vector field name, and a multi-value selector ("max", "min" > ,"avg", "first"), and return the computed calculation. This function could > then be used to score documents via the “func” query parser. ANN queries > could be implemented using separate “fq” params for filtering, or as part of > the “q” query parameter to support optional ANN Ranking, and then the > existing Solr re-ranking functionality could be used for re-ranking the top N > results using the “func” query parser and these new cosine or dot product > value sources as the re-rank query. Existing Solr logic for performing > two-phase iteration will ensure that these vector functions will not be > excessively computed for documents that do not already match the “cheaper” > ANN queries. > *Example*: > {noformat} > q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, first){noformat} > ^ simple, no-ANN. Third param is optional and is the function > (min|max|avg|first|last). I think this should default to "max" (or whatever > "most related" is for each similarity function. It's "max" for cosine and dot > product). > *Example:* > {noformat} > q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, > first)&fq=[ANN_QUERY]{noformat} > ^ ensures function is not computed unless ANN query matches due to two-phase > iterator > *Example:* > {noformat} > q=[ANN_QUERY]&rq={!rerank reRankQuery=$rqq reRankDocs=10000 > reRankWeight=1000000}&rqq={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", > vectors_field){noformat} > ^Supports an initial ranking based upon the ANN QUERY, followed by a > re-ranking based upon the cosine function. > The key implementation challenges are expected to be: > * ensuring decent query (speed) performance > * Determining an efficient vector/byte encoding that has decent > disk-space/query time decoding trade offs and makes it possible to decode > only the ‘first’ vector per doc in cases where that is all that’s requested > by the distance function > h2. Phase 2: Pluggable ANN Search > * Implement pluggable quantization/ANN search. There are multiple different > algorithms to do quantization for efficient ANN search. The simplest is > something like LSH, with more advanced implementations being worked on in > Lucene right now in LUCENE-9004 (HNSW) and LUCENE-9136 (IVFFLAT), and with a > prototyped LSH implementation (Solr Vector Search plugin) and a random > projections approach (Hangry) linked to in SOLR-12890. > * Given that the transformers/encoders are a rapidly evolving area of data > science and NLP research, it feels unreasonable to assume that Solr will be > able to keep up to date on the latest model integrations. Given that, if we > build support into Solr for encoder models, we need to make it very pluggable. > * Phase 1 fully enables quantization to be done OUTSIDE of Solr, and thus > for external systems to accomplish quantization by creating quantized terms > externally, indexing them to Solr, and then sending in queries to Solr with > the quantized terms. I’m not convinced that keeping quantization outside of > Lucene/Solr isn’t a more manageable model long term, but if certain models do > get integrated natively into Lucene (as is currently being worked on) then we > should obviously look to leverage them. > h2. Important Considerations: > * Multi-valued dense vectors are a key differentiating feature here. I’m > considering this a key requirement and designing accordingly. This will allow > for multiple embeddings per document (i.e. “word embeddings, sentence > embeddings, paragraph embeddings, etc.) to be modeled. I like the way > [~ehatcher] designed the the Payload Function Query and think we should model > the multi-valued handling similarly. > * As best as possible, it would be nice to support multiple (pluggable) > fields/value sources that can be converted to dense vectors and compared. At > a minimum we need an efficient dense vector field, but it may be beneficial > to support a sparse vector field where only meaningful values need to be > encoded, in order to save space. Additionally, there may be some utility in > other field types or arbitrary value sources that can product vector-like > output to be able to be leveraged in the function queries being proposed. > h1. Note on Prior Approaches > * There are a handful of ways to perform vector search today with Solr. One > is through Streaming Expressions, and alternatively there are a few plugins > which have implemented this functionality for Solr. One of these was > contributed in SOLR-12890, and I've outlined examples of most of these > approaches in the comments on SOLR-12890. > * The Streaming Expressions approach works well out of the box, but doesn't > integrate nicely with traditional keyword search use cases, so we need a > solution that integrates with real-time queries and enables the full suite of > Solr's query-time features. > * The most advanced Solr plugin out there seems to be this one: > https://github.com/moshebla/solr-vector-scoring (forked the version in > SOLR-12890 and then added LSH functionality for ANN). > * I encountered several challenges when working with the implementation in > SOLR-12890 which informed the above proposed design: > 1. The plugin encodes dense vectors into payloads attached to terms > representing vector positions. This is pretty inefficient and is way too slow > at significant scale. I believe we instead need the underlying field types to > be binary doc values field with efficient compression and decoding for > optimal performance when scanning through all the matching documents and > dimensions. > 2. The plugin only supports one vector per-field, making supporting > word/phrase vectors, sentence vectors, paragraph vectors, etc. challenging. > It's not obvious how to modify the current approach to support multiple > vectors per field. > 3. The plugin uses a query parser instead of just a function query, which > prevents re-use of the vector similarity function. Using a function query > instead will enable the vector similarity scores to be easily leveraged for > relevance scoring, sorting, returned as fields, or used within other function > queries. Given that there are two goals of 1) filtering on ANN, and 2) > Scoring on vector similarity (usually via re-ranking), using a function query > for the scoring piece will be more flexible across use scoring use cases > (combining with other functions, returning, sorting, etc.) > 4. The LSH quantization/ANN implementation is a cool idea, but it is > hard-coded in the implementation. I recommend making it a separate filter and > making the implementation more pluggable (though for now this can be > externalized and passed in to Solr). We may be able to pull the LSH work in > during phase 2. > For cleaner discussion and tracking, I'm moving this revised proposal to this > new JIRA, and we can use SOLR-12890 for continue discussion (if any) of the > previously contributed plugin implementation. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org