alessandrobenedetti commented on a change in pull request #476: URL: https://github.com/apache/solr/pull/476#discussion_r787660168
########## File path: solr/solr-ref-guide/src/neural-search.adoc ########## @@ -0,0 +1,324 @@ += Neural Search +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +Search comprises four primary steps: + +* generate a representation of the query that specifies the information need +* generate a representation of the document that captures the information it contains +* match the query and the document representations from the corpus of information +* assign a score to each matched document in order to establish a meaningful document ranking by relevance in the results + +The Apache Solr *Neural Search* module adds support for neural network-based techniques that can improve various aspects of search. + +These techniques can be differentiated based on whether they affect the query representation, the document representation, or the estimation of the relevance score. + +Neural Search is an industry derivation of the academic field of https://www.microsoft.com/en-us/research/uploads/prod/2017/06/fntir2018-neuralir-mitra.pdf[Neural Information Retrieval]. 
+ +== Neural Search Concepts + +=== Deep Learning + +More and more frequently, we hear about how Artificial Intelligence (AI) permeates every aspect of our lives. + +When we talk about AI, we are referring to a superset of techniques that enable machines to learn and show intelligence like humans. + +Since computing power has advanced strongly and steadily in the recent past, AI has seen a resurgence and is now used in many domains, including software engineering and Information Retrieval (the science behind search engines and similar systems). + +In particular, the advent of https://en.wikipedia.org/wiki/Deep_learning[Deep Learning] introduced the use of deep neural networks to solve complex problems that could not be solved by simple algorithms. + +Deep Learning can be used to produce a vector representation of both the query and the documents in a corpus of information. + +=== Dense Vector Representation +A dense vector describes information as an array of elements, each of them explicitly defined. + +Various Deep Learning models such as https://en.wikipedia.org/wiki/BERT_(language_model)[BERT] are able to encode textual information as dense vectors, to be used for Dense Retrieval strategies. + +For additional information you can refer to this https://sease.io/2021/12/using-bert-to-improve-search-relevance.html[blog post]. + +=== Dense Retrieval +Given a dense vector `v` that models the information need, the simplest approach for providing dense vector retrieval would be to calculate the distance (Euclidean, dot product, etc.) between `v` and each vector `d` that represents a document in the corpus of information. + +This approach is quite expensive, so many approximate strategies are currently under active research. + +The strategy implemented in Apache Lucene and used by Apache Solr is based on Hierarchical Navigable Small World (HNSW) graphs. + +It provides efficient approximate nearest neighbor search for high-dimensional vectors. 
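The cost of the exact approach can be seen in a short sketch (illustrative code, not part of Solr or Lucene; class and method names are ours): scoring the query vector against every document vector requires a full pass over the corpus, which is exactly what the approximate HNSW strategy avoids.

```java
// Illustrative brute-force dense retrieval: O(corpus size * dimension).
public class BruteForceSearch {

    // Cosine similarity between two equal-length float vectors.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the document vector most similar to the query,
    // scanning the whole corpus (the expensive part HNSW approximates away).
    static int mostSimilar(float[] query, float[][] corpus) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int d = 0; d < corpus.length; d++) {
            double score = cosine(query, corpus[d]);
            if (score > bestScore) {
                bestScore = score;
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] corpus = {
            {1.0f, 2.5f, 3.7f, 4.1f},
            {1.5f, 5.5f, 6.7f, 65.1f}
        };
        float[] query = {1.0f, 2.5f, 3.7f, 4.1f};
        // A vector is maximally similar to itself under cosine similarity.
        System.out.println(mostSimilar(query, corpus)); // prints 0
    }
}
```

Instead of this linear scan, the HNSW-based strategy walks a layered proximity graph, trading a small amount of recall for a large reduction in the number of vectors compared.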
+ +See https://doi.org/10.1016/j.is.2013.10.006[Approximate nearest neighbor algorithm based on navigable small world graphs [2014]] and https://arxiv.org/abs/1603.09320[this paper [2018]] for details. + + +== Index Time +This is the list of Apache Solr field types designed to support Neural Search: + +=== DenseVectorField +The dense vector field allows you to index and search dense vectors of float elements. + +For example: + +`[1.0, 2.5, 3.7, 4.1]` (an array of float elements) + +Here's how `DenseVectorField` should be configured in the schema: + +[source,xml] +<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/> +<field name="vector" type="knn_vector" indexed="true" stored="true"/> + +`vectorDimension`:: ++ +[%autowidth,frame=none] +|=== +|Mandatory +|=== ++ +The dimension of the dense vector to pass in. ++ +Accepted values: +Integer <= 1024. + +`similarityFunction`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: `euclidean` +|=== ++ +Vector similarity function; used in search to return the top K most similar vectors to a target vector. ++ +Accepted values: `euclidean`, `dot_product` or `cosine`. + +* `euclidean`: https://en.wikipedia.org/wiki/Euclidean_distance[Euclidean distance] +* `dot_product`: https://en.wikipedia.org/wiki/Dot_product[Dot product]. *NOTE*: this similarity is intended as an optimized way to perform cosine similarity. In order to use it, all vectors must be of unit length, including both document and query vectors. Using dot product with vectors that are not unit length can result in errors or poor search results. +* `cosine`: https://en.wikipedia.org/wiki/Cosine_similarity[Cosine similarity]. *NOTE*: the preferred way to perform cosine similarity is to normalize all vectors to unit length and use `dot_product` instead. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance. 
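Since `dot_product` assumes unit-length vectors, normalization has to happen on the application side before documents are indexed or queries are issued. A minimal sketch (an illustrative helper of ours, not a Solr API) of scaling a vector to unit Euclidean length:

```java
// Illustrative helper: scale a vector to unit Euclidean length so that
// dot product between two such vectors equals their cosine similarity.
public class VectorNormalizer {

    // Returns a copy of v divided by its Euclidean norm.
    static float[] normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
        return out;
    }

    // Euclidean length of a vector, used here to verify normalization.
    static double length(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        return Math.sqrt(norm);
    }

    public static void main(String[] args) {
        float[] raw = {1.0f, 2.5f, 3.7f, 4.1f};
        float[] unit = normalize(raw);
        // The normalized vector has (approximately) unit length.
        System.out.println(length(unit));
    }
}
```

Normalizing once at indexing time (and once per query) is cheaper than computing the norms inside every similarity comparison, which is why `dot_product` on pre-normalized vectors is the optimized path.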
+ +*N.B.* To use the following advanced parameters, which customise the codec format and the hyper-parameters of the HNSW algorithm, make sure you set this configuration in `solrconfig.xml`: +[source,xml] +<config> +<codecFactory class="solr.SchemaCodecFactory"/> +... + +Here's how `DenseVectorField` can be configured with the advanced codec hyper-parameters: + +[source,xml] +<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/> +<field name="vector" type="knn_vector" indexed="true" stored="true"/> + +`codecFormat`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: `Lucene90HnswVectorsFormat` +|=== ++ +(advanced) Specifies the knn codec implementation to use. ++ +Accepted values: `Lucene90HnswVectorsFormat`. + +Please note that the `codecFormat` accepted values may change in future releases. + +[NOTE] +Lucene index back-compatibility is only supported for the default codec. +If you choose to customize the `codecFormat` in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading. + +`hnswMaxConnections`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: 16 +|=== ++ +(advanced) This parameter is specific to the `Lucene90HnswVectorsFormat` codec format: ++ +Controls how many of the nearest neighbor candidates are connected to the new node. ++ +See https://doi.org/10.1016/j.is.2013.10.006[Approximate nearest neighbor algorithm based on navigable small world graphs [2014]] and https://arxiv.org/abs/1603.09320[this paper [2018]] for details. ++ +It has the same meaning as `M` from the latter paper. ++ +Accepted values: +Integer. 
+ +`hnswBeamWidth`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: 100 +|=== ++ +(advanced) This parameter is specific to the `Lucene90HnswVectorsFormat` codec format: ++ +It is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node. ++ +See https://doi.org/10.1016/j.is.2013.10.006[Approximate nearest neighbor algorithm based on navigable small world graphs [2014]] and https://arxiv.org/abs/1603.09320[this paper [2018]] for details. ++ +It has the same meaning as `efConstruction` from the latter paper. ++ +Accepted values: +Integer. + +`DenseVectorField` supports the attributes: `indexed`, `stored`. + +*N.B.* Multi-valued fields are currently not supported. + +Here's how a `DenseVectorField` should be indexed: + +[.dynamic-tabs] +-- +[example.tab-pane#json] +==== +[.tab-label]*JSON* +[source,json] +---- +[{ "id": "1", +"vector": [1.0, 2.5, 3.7, 4.1] +}, +{ "id": "2", +"vector": [1.5, 5.5, 6.7, 65.1] +} +] +---- +==== + +[example.tab-pane#xml] +==== +[.tab-label]*XML* +[source,xml] +---- +<add> +<doc> +<field name="id">1</field> +<field name="vector">1.0</field> +<field name="vector">2.5</field> +<field name="vector">3.7</field> +<field name="vector">4.1</field> +</doc> +<doc> +<field name="id">2</field> +<field name="vector">1.5</field> +<field name="vector">5.5</field> +<field name="vector">6.7</field> +<field name="vector">65.1</field> +</doc> +</add> +---- +==== + +[example.tab-pane#solrj] +==== +[.tab-label]*SolrJ* +[source,java,indent=0] +---- +final SolrClient client = getSolrClient(); + +final SolrInputDocument d1 = new SolrInputDocument(); +d1.setField("id", "1"); +d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f)); + +final SolrInputDocument d2 = new SolrInputDocument(); +d2.setField("id", "2"); +d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f)); + +client.add(Arrays.asList(d1, d2)); +---- +==== +-- + +== Query Time +This is the list of Apache Solr query approaches designed to support 
Neural Search:

Review comment: I removed the List reference, but I would prefer to keep it as a subsection of "Query Time" to make a clear distinction. What do you think?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org