alessandrobenedetti commented on a change in pull request #476: URL: https://github.com/apache/solr/pull/476#discussion_r787660168
########## File path: solr/solr-ref-guide/src/neural-search.adoc ########## @@ -0,0 +1,324 @@ += Neural Search +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +Search comprises four primary steps: + +* generate a representation of the query that specifies the information need +* generate a representation of the document that captures the information it contains +* match the query and the document representations from the corpus of information +* assign a score to each matched document in order to establish a meaningful document ranking by relevance in the results + +The Apache Solr *Neural Search* module adds support for neural network-based techniques that can improve various aspects of search. + +These techniques can be differentiated based on whether they affect the query representation, the document representation, or the estimation of the relevance score. + +Neural Search is an industry derivation of the academic field of https://www.microsoft.com/en-us/research/uploads/prod/2017/06/fntir2018-neuralir-mitra.pdf[Neural Information Retrieval]. 
+ +== Neural Search Concepts + +=== Deep Learning + +More and more frequently, we hear about how Artificial Intelligence (AI) permeates every aspect of our lives. + +When we talk about AI, we are referring to a superset of techniques that enable machines to learn and show intelligence like humans. + +Since computing power has advanced strongly and steadily in the recent past, AI has seen a resurgence and is now used in many domains, including software engineering and Information Retrieval (the science behind search engines and similar systems). + +In particular, the advent of https://en.wikipedia.org/wiki/Deep_learning[Deep Learning] introduced the use of deep neural networks to solve complex problems that could not be solved by simple algorithms. + +Deep Learning can be used to produce a vector representation of both the query and the documents in a corpus of information. + +=== Dense Vector Representation +A dense vector describes information as an array of elements, each of them explicitly defined. + +Various Deep Learning models such as https://en.wikipedia.org/wiki/BERT_(language_model)[BERT] are able to encode textual information as dense vectors, to be used for Dense Retrieval strategies. + +For additional information you can refer to this https://sease.io/2021/12/using-bert-to-improve-search-relevance.html[blog post]. + +=== Dense Retrieval +Given a dense vector `v` that models the information need, the simplest approach for providing dense vector retrieval would be to calculate the distance (Euclidean, dot product, etc.) between `v` and each vector `d` that represents a document in the corpus of information. + +This approach is quite expensive, so many approximate strategies are currently under active research. + +The strategy implemented in Apache Lucene and used by Apache Solr is based on Hierarchical Navigable Small World (HNSW) graphs. + +It provides efficient approximate nearest neighbor search for high-dimensional vectors. 
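The cost of the exact approach can be seen in a short sketch (illustrative code, not part of Solr or Lucene; class and method names are ours): scoring the query vector against every document vector requires a full pass over the corpus, which is exactly what the approximate HNSW strategy avoids.

```java
// Illustrative brute-force dense retrieval: O(corpus size * dimension).
public class BruteForceSearch {

    // Cosine similarity between two equal-length float vectors.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the document vector most similar to the query,
    // scanning the whole corpus (the expensive part HNSW approximates away).
    static int mostSimilar(float[] query, float[][] corpus) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int d = 0; d < corpus.length; d++) {
            double score = cosine(query, corpus[d]);
            if (score > bestScore) {
                bestScore = score;
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] corpus = {
            {1.0f, 2.5f, 3.7f, 4.1f},
            {1.5f, 5.5f, 6.7f, 65.1f}
        };
        float[] query = {1.0f, 2.5f, 3.7f, 4.1f};
        // A vector is maximally similar to itself under cosine similarity.
        System.out.println(mostSimilar(query, corpus)); // prints 0
    }
}
```

Instead of this linear scan, the HNSW-based strategy walks a layered proximity graph, trading a small amount of recall for a large reduction in the number of vectors compared.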
+ +See https://doi.org/10.1016/j.is.2013.10.006[Approximate nearest neighbor algorithm based on navigable small world graphs [2014]] and https://arxiv.org/abs/1603.09320[this paper [2018]] for details. + + +== Index Time +This is the list of Apache Solr field types designed to support Neural Search: + +=== DenseVectorField +The dense vector field allows you to index and search dense vectors of float elements. + +For example: + +`[1.0, 2.5, 3.7, 4.1]` (an array of float elements) + +Here's how `DenseVectorField` should be configured in the schema: + +[source,xml] +<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/> +<field name="vector" type="knn_vector" indexed="true" stored="true"/> + +`vectorDimension`:: ++ +[%autowidth,frame=none] +|=== +|Mandatory +|=== ++ +The dimension of the dense vector to pass in. ++ +Accepted values: +Integer <= 1024. + +`similarityFunction`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: `euclidean` +|=== ++ +Vector similarity function; used in search to return the top K most similar vectors to a target vector. ++ +Accepted values: `euclidean`, `dot_product` or `cosine`. + +* `euclidean`: https://en.wikipedia.org/wiki/Euclidean_distance[Euclidean distance] +* `dot_product`: https://en.wikipedia.org/wiki/Dot_product[Dot product]. *NOTE*: this similarity is intended as an optimized way to perform cosine similarity. In order to use it, all vectors must be of unit length, including both document and query vectors. Using dot product with vectors that are not unit length can result in errors or poor search results. +* `cosine`: https://en.wikipedia.org/wiki/Cosine_similarity[Cosine similarity]. *NOTE*: the preferred way to perform cosine similarity is to normalize all vectors to unit length and use `dot_product` instead. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance. 
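Since `dot_product` assumes unit-length vectors, normalization has to happen on the application side before documents are indexed or queries are issued. A minimal sketch (an illustrative helper of ours, not a Solr API) of scaling a vector to unit Euclidean length:

```java
// Illustrative helper: scale a vector to unit Euclidean length so that
// dot product between two such vectors equals their cosine similarity.
public class VectorNormalizer {

    // Returns a copy of v divided by its Euclidean norm.
    static float[] normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
        return out;
    }

    // Euclidean length of a vector, used here to verify normalization.
    static double length(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        return Math.sqrt(norm);
    }

    public static void main(String[] args) {
        float[] raw = {1.0f, 2.5f, 3.7f, 4.1f};
        float[] unit = normalize(raw);
        // The normalized vector has (approximately) unit length.
        System.out.println(length(unit));
    }
}
```

Normalizing once at indexing time (and once per query) is cheaper than computing the norms inside every similarity comparison, which is why `dot_product` on pre-normalized vectors is the optimized path.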
+ +*N.B.* To use the following advanced parameters, which customise the codec format and the hyper-parameters of the HNSW algorithm, make sure you set this configuration in `solrconfig.xml`: +[source,xml] +<config> +<codecFactory class="solr.SchemaCodecFactory"/> +... + +Here's how `DenseVectorField` can be configured with the advanced codec hyper-parameters: + +[source,xml] +<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/> +<field name="vector" type="knn_vector" indexed="true" stored="true"/> + +`codecFormat`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: `Lucene90HnswVectorsFormat` +|=== ++ +(advanced) Specifies the knn codec implementation to use. ++ +Accepted values: `Lucene90HnswVectorsFormat`. + +Please note that the `codecFormat` accepted values may change in future releases. + +[NOTE] +Lucene index back-compatibility is only supported for the default codec. +If you choose to customize the `codecFormat` in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading. + +`hnswMaxConnections`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: 16 +|=== ++ +(advanced) This parameter is specific to the `Lucene90HnswVectorsFormat` codec format: ++ +Controls how many of the nearest neighbor candidates are connected to the new node. ++ +See https://doi.org/10.1016/j.is.2013.10.006[Approximate nearest neighbor algorithm based on navigable small world graphs [2014]] and https://arxiv.org/abs/1603.09320[this paper [2018]] for details. ++ +It has the same meaning as `M` from the latter paper. ++ +Accepted values: +Integer. 
+ +`hnswBeamWidth`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: 100 +|=== ++ +(advanced) This parameter is specific to the `Lucene90HnswVectorsFormat` codec format: ++ +It is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node. ++ +See https://doi.org/10.1016/j.is.2013.10.006[Approximate nearest neighbor algorithm based on navigable small world graphs [2014]] and https://arxiv.org/abs/1603.09320[this paper [2018]] for details. ++ +It has the same meaning as `efConstruction` from the latter paper. ++ +Accepted values: +Integer. + +`DenseVectorField` supports the attributes: `indexed`, `stored`. + +*N.B.* Multi-valued fields are currently not supported. + +Here's how a `DenseVectorField` should be indexed: + +[.dynamic-tabs] +-- +[example.tab-pane#json] +==== +[.tab-label]*JSON* +[source,json] +---- +[{ "id": "1", +"vector": [1.0, 2.5, 3.7, 4.1] +}, +{ "id": "2", +"vector": [1.5, 5.5, 6.7, 65.1] +} +] +---- +==== + +[example.tab-pane#xml] +==== +[.tab-label]*XML* +[source,xml] +---- +<add> +<doc> +<field name="id">1</field> +<field name="vector">1.0</field> +<field name="vector">2.5</field> +<field name="vector">3.7</field> +<field name="vector">4.1</field> +</doc> +<doc> +<field name="id">2</field> +<field name="vector">1.5</field> +<field name="vector">5.5</field> +<field name="vector">6.7</field> +<field name="vector">65.1</field> +</doc> +</add> +---- +==== + +[example.tab-pane#solrj] +==== +[.tab-label]*SolrJ* +[source,java,indent=0] +---- +final SolrClient client = getSolrClient(); + +final SolrInputDocument d1 = new SolrInputDocument(); +d1.setField("id", "1"); +d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f)); + +final SolrInputDocument d2 = new SolrInputDocument(); +d2.setField("id", "2"); +d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f)); + +client.add(Arrays.asList(d1, d2)); +---- +==== +-- + +== Query Time +This is the list of Apache Solr query approaches designed to support 
Neural Search:

Review comment: I removed the List reference, but I would prefer to keep it as a subsection of "Query Time" to make a clear distinction. What do you think?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org