ctargett commented on a change in pull request #12:
URL: https://github.com/apache/solr/pull/12#discussion_r601788329
##########
File path: solr/solr-ref-guide/src/morelikethis.adoc
##########
@@ -16,97 +16,629 @@
 // specific language governing permissions and limitations
 // under the License.

-The `MoreLikeThis` search component enables users to query for documents similar to a document in their result list.
+MoreLikeThis enables queries for documents similar to a document in their result list.
 It does this by using terms from the original document to find similar documents in the index.

-There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link).
+There are several ways to use MoreLikeThis.
+The first, and most common, is to use it as a request handler.
+In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link).

-The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.
+The second is to use it as a search component.
+This is less desirable since it performs the MoreLikeThis analysis on every document that matches a user query. This may slow search results.

-The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
+Another approach is to use it as a request handler but with externally supplied text.
+This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
+
+Finally, the MLT query parser can be used.
+This operates in much the same way as the request handler but since it is a query parser it can be used in filter queries, boost queries, etc., and results can be paginated or highlighted as needed.

 == How MoreLikeThis Works

-`MoreLikeThis` constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( see the `mlt.fl` parameter, below). For best results, the fields should have stored term vectors in `schema.xml`. For example:
+`MoreLikeThis` constructs a Lucene query based on terms in a document.
+It does this by pulling terms from the list of fields provided with the request.

-[source,xml]
-----
-<field name="cat" ... termVectors="true" />
-----
+For best results, the fields should have stored term vectors (`termVectors=true`), which can be <<defining-fields.adoc#,configured in the schema>>.
+If term vectors are not stored, MoreLikeThis can generate terms from stored fields.
+The field used for the `uniqueKey` must also be stored in order for MoreLikeThis to work properly.
+
+Terms from the original document are filtered using thresholds defined with the MoreLikeThis parameters.
+Once the terms have been selected, a query is run with any other query parameters as appropriate and a new document set is returned.

-If term vectors are not stored, `MoreLikeThis` will generate terms from stored fields. A `uniqueKey` must also be stored in order for MoreLikeThis to work properly.
+== MoreLikeThis Handler and Component

-The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the `mlt.qf` parameter, below) and a new document set is returned.
+The MoreLikeThis request handler and search component share several parameters, but also have some key differences in response and operation, as described below.

-== Common Parameters for MoreLikeThis
+=== Common Handler and Component Parameters

-The table below summarizes the `MoreLikeThis` parameters supported by Lucene/Solr. These parameters can be used with any of the three possible MoreLikeThis approaches.
+The list below summarizes the `MoreLikeThis` parameters supported by Solr.
+These parameters can be used with the MoreLikeThis search component or request handler.

 `mlt.fl`::
-Specifies the fields to use for similarity. If possible, these should have stored `termVectors`.
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+Specifies the fields to use for similarity.
+A list of fields can be provided separated by commas.
+If possible, the fields should have stored `termVectors`.

 `mlt.mintf`::
-Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `2`
+|===
++
+Specifies the minimum frequency below which terms will be ignored in the source document.

 `mlt.mindf`::
-Specifies the Minimum Document Frequency, the frequency at which words will be ignored which do not occur in at least this many documents.
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `5`
+|===
++
+Specifies the minimum frequency below which terms will be ignored which do not occur in at least this many documents.

 `mlt.maxdf`::
-Specifies the Maximum Document Frequency, the frequency at which words will be ignored which occur in more than this many documents.
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Specifies the maximum frequency above which terms will be ignored which occur in more than this many documents.

 `mlt.maxdfpct`::
-Specifies the Maximum Document Frequency using a relative ratio to the number of documents in the index. The argument must be an integer between 0 and 100. For example 75 means the word will be ignored if it occurs in more than 75 percent of the documents in the index.
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Specifies the maximum document frequency using a ratio relative to the number of documents in the index.
+The value provided must be an integer between `0` and `100`.
+For example, `mlt.maxdfpct=75` means the word will be ignored if it occurs in more than 75 percent of the documents in the index.

 `mlt.minwl`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
 Sets the minimum word length below which words will be ignored.

 `mlt.maxwl`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
 Sets the maximum word length above which words will be ignored.

 `mlt.maxqt`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `25`
+|===
++
 Sets the maximum number of query terms that will be included in any generated query.

 `mlt.maxntp`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `5000`
+|===
++
 Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.

 `mlt.boost`::
-Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or "false".
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`
+|===
++
+Specifies if the query will be boosted by the interesting term relevance.
+Possible values are `true` or `false`.

 `mlt.qf`::
-Query fields and their boosts using the same format as that used by the <<the-dismax-query-parser.adoc#,DisMax Query Parser>>. These fields must also be specified in `mlt.fl`.
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Query fields and their boosts using the same format used by the <<the-dismax-query-parser.adoc#,DisMax Query Parser>>.
+These fields must also be specified in `mlt.fl`.
+
+`mlt.interestingTerms`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `none`
+|===
++
+Adds a section in the response that shows the top terms (based on TF/IDF) used for the MoreLikeThis query.
+It supports three possible values:
++
+* `list` lists the terms.
+* `none` lists no terms (the default).
+* `details` lists the terms along with the boost value used for each term.
+Unless `mlt.boost=true`, all terms will have `boost=1.0`.
+
++
+To use this parameter with the <<MoreLikeThis Search Component,search component>>, the query cannot be distributed.
+In order to get interesting terms, the query must be sent to a single shard and limited to that shard only (with the <<distributed-requests.adoc#limiting-which-shards-are-queried,`shards`>> parameter).
+Multi-shard support is, however, available with the MoreLikeThis request handler.
+
+=== MoreLikeThis Request Handler
+
+==== Request Handler Configuration
+
+The MoreLikeThis request handler is not configured by default and needs to be set up before using it.
+You can do this by manually editing `solrconfig.xml` or with the Config API:
+
+[.dynamic-tabs]
+--
+[example.tab-pane#manualconfig]
+====
+[.tab-label]*Manual Configuration*
+
+[source,xml]
+----
+<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
+  <str name="mlt.fl">body</str>
+</requestHandler>
+----
+====
+
+[example.tab-pane#configapi]
+====
+[.tab-label]*Config API*
+
+[source,bash]
+----
+curl -X POST -H 'Content-type:application/json' -d '{
+  "add-requesthandler": {
+    "name": "/mlt",
+    "class": "solr.MoreLikeThisHandler",
+    "defaults": {"mlt.fl": "body"}
+  }
+}' http://localhost:8983/solr/<collection>/config
+----
+====
+--
+
+Both of the above examples set the `mlt.fl` parameter to "body" for the request handler.
+This means that all requests to the handler will use that value for the parameter unless specifically overridden in an individual request.
+
+For more about request handler configuration in general, see the section <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#default-components,RequestHandlers and SearchComponents in Solrconfig>>.
+
+==== Request Handler Parameters
+
+The MoreLikeThis request handler supports the following parameters in addition to the <<Common Handler and Component Parameters,common parameters>> above.
+It supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers.
+
+`mlt.match.include`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`

Review comment:
Ah, yes, I see that now. I'll fix it - thanks!